Learning Algorithms
Abstract
Crime rates have been increasing in many cities over the past few years; therefore,
analyzing hotspots and proactively preventing them is critical. Crime and incident data are being
collected by the Police Data Initiative and are made available to the public to encourage joint
problem-solving.
This project focuses on analyzing and predicting hotspots for three major cities (Orlando,
Fort Lauderdale and Gainesville) in Florida using classification. The data were cleansed and
mapped based on the National Incident-Based Reporting System (NIBRS) crime types. Tableau
Public software was used to create a visual report with the ability to analyze data along various
dimensions for all the cities. Classification was done using four different algorithms for Fort
Lauderdale and Gainesville in the open source data mining tool Weka 3.8.
The ability to visualize crimes by NIBRS standard crime type at various times allows the
user to analyze multiple scenarios and identify hotspots quickly. Crime spot prediction was done
successfully by all four algorithms and the results were compared to each other to identify the
best one. Classification using k-nearest neighbor was the most precise (60.39%) for Fort
Lauderdale whereas ada boosting was the most precise (84.19%) for Gainesville.
Introduction
Recent statistics show that the crime rate is increasing in some cities and is affecting the
quality of life in those communities [3]. In the past few years, some cities have been
emphasizing locating crime “hotspots” so that law enforcement can be deployed effectively in
those areas. In 2015, the White House announced the “Smart Cities” initiative to help
communities tackle local challenges and improve city services [5]. As part of this, the Police Data
Initiative was identified as a key initiative that can provide local authorities with data to improve
community policing [6]. An immense amount of crime data is being collected as part of this
initiative [9]. Predicting crime from this historical data can help police departments
proactively monitor high-crime areas. This research focuses on predicting crime spots using
visualization and machine learning algorithms for three major cities in Florida for which the data
Incident datasets from Orlando, Fort Lauderdale and Gainesville are used for prediction.
Data are visualized using Tableau Public software and analyzed to identify crime “hotspots” by
crime type for all three cities. Four different classification algorithms are used to
classify each dataset on a binary class, Crime Status. The machine learning algorithms considered are
Decision Tree, Naïve Bayes, K-Nearest Neighbor and Boosting. In the experiments, the prediction
results of all four algorithms for the Fort Lauderdale and Gainesville incident data were recorded
and analyzed to identify the best algorithm to use. The Orlando data are not usable by these algorithms
for prediction because the incident data include only crimes and thus do not provide sufficient data
• Weka: Weka is a suite of machine learning software written in Java, developed at the
• Tableau Public: Tableau Public is a free service tool which allows the user to load and
analyze the data visually. It also publishes the visualizations on the web
• Data Sets
o For visualization and prediction, city incident data with the Case ID, longitude
and latitude of the incident, street name and NIBRS Case type are downloaded
Approach
Data Collection
The data were collected from the Police Data Initiative (PDI) open data site. PDI is a law
enforcement community that promotes the use of open data to improve public safety through
collaboration between law enforcement agencies, technologists, and researchers. Currently,
around 130 law enforcement agencies have released more than 200 data sets, and the
inventory is growing. Many kinds of data sets are available on the site, such as accident crash
data, incident data, complaints, and officer-involved shootings. Incident data include
information about incidents where the police department responds to an offense and a report of a
crime is generated. In Florida, Gainesville, Orlando, and Fort Lauderdale are the cities that have
Data Cleansing
In data mining, data cleansing is an essential step that prepares the data by removing
incomplete or inconsistent records. In our datasets, crime type is not consistent across the cities. For
example, Fort Lauderdale has 938 different crime type categories whereas Orlando has only 24
crime type categories. This is because each law enforcement agency classifies incidents
according to its own offense definitions. To analyze and visualize crimes across multiple cities,
it is important that crime types are standardized across them. NIBRS provides an offense lookup
table that lists various types of crime and the NIBRS crime category covering each offense. Crime types
in all the data sets are mapped to NIBRS types by defining a mapping table for them.
Geographical longitude and latitude are critical for this study, and they were not provided as
separate attributes in all the data sets. For example, in the Gainesville set, street names and geo
coordinates are listed in one field. This field was parsed and the appropriate substrings were
extracted for analysis.
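The two cleansing steps described above can be sketched in Python as follows. This is only an illustration: the mapping entries, the assumed field layout "<street> (<latitude>, <longitude>)", and the function names are hypothetical, since the actual mapping table and the exact Gainesville field format are not reproduced in this paper.

```python
import re

# Hypothetical excerpt of a mapping table: agency crime type -> NIBRS category.
NIBRS_MAP = {
    "ARMED ROBBERY": "Robbery",
    "STRONG ARM ROBBERY": "Robbery",
    "POSSESSION OF CANNABIS": "Drug/Narcotic Offenses",
}

# Assumed layout of a combined field: "<street> (<latitude>, <longitude>)".
LOCATION_RE = re.compile(
    r"^(?P<street>.*?)\s*\(\s*(?P<lat>-?\d+(?:\.\d+)?)\s*,"
    r"\s*(?P<lon>-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def to_nibrs(crime_type):
    """Return the NIBRS category, or None when the type is not in the table."""
    return NIBRS_MAP.get(crime_type.strip().upper())


def split_location(field):
    """Split a combined location field into (street, latitude, longitude).

    Returns None when the field does not follow the assumed layout.
    """
    match = LOCATION_RE.match(field)
    if match is None:
        return None
    return match["street"], float(match["lat"]), float(match["lon"])
```

Records for which either function returns None would need manual review or removal before visualization and classification.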
Data Exploration
Data exploration is an initial data analysis step to better understand the data and its
specific characteristics. Visual exploration is the best way for this as it allows the analyst to
quickly absorb large amounts of visual information. All the three data sets are explored using
Classification is a supervised learning method used to predict a certain outcome based on a given
input. In general, there are two steps in data classification. The first step is to build a classifier
model based on data with a predetermined set of classes. This is a “learning” process in which the
model is trained from known data. In this step, each instance of the data must be
labeled with an appropriate class label. This dataset is called the training set. Because the class label
of each instance in the dataset is provided to the model, this is known as “supervised” learning. In
the second step, the model is used to predict the class labels for the test set. The test set has
all the attributes except for the class label, and the classification model predicts
the class label for each instance of the test set. There are many data mining algorithms
that can be used. Various algorithms were analyzed and prioritized based on its limitations, and
crime hotspots
1. Decision Tree
2. Naïve Bayes
3. K-Nearest Neighbor
4. Ada Boosting
Decision Trees
The decision tree is a widely used algorithm in data mining. This algorithm uses a flowchart-like
structure to predict the class labels based on several input attributes [6]. Decision trees have three
kinds of nodes:
1. Root Node – Topmost node in the tree, with no incoming edges and zero or more
outgoing edges
an input attribute to separate records that have different characteristics and each branch
represents an outcome of those tests. Each leaf node of the tree represents the actual class label
of the instance. The classification problem is resolved based on series of questions (Root or
Internal node) until a conclusion about the class label is reached (Leaf node). The diagram above
classification. During training, the tree is built recursively to optimize the model.
For the first (root) split, all the attributes are considered and the training data are divided into groups
based on each candidate split. In the example, the training data are split based on body temperature. The cost of
the split is calculated and the step is repeated with the other attribute, gives birth. The
attribute with the lowest cost is chosen as the root node. Many measures can
be used to calculate the cost of a split; classification error (number of incorrectly classified
instances / total number of instances) is one of them. The process continues recursively with the other
features until a stopping condition is met. One stopping condition is a minimum number
of training instances in each leaf. Another is the maximum depth of the tree, where
depth is the length of the longest path from the root to a leaf. Once the model is built, the
classification of each instance in the test set starts at the root node. Test conditions are applied to the
record and the appropriate branch is followed until a leaf node with a class label is reached.
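As a rough sketch of the root-split selection described above, the Python snippet below computes the classification error of a split on each attribute and picks the attribute with the lowest error. The toy records, the class labels, and the function names are made up for illustration; only the attribute names (body temperature, gives birth) come from the example in the text.

```python
from collections import Counter, defaultdict


def split_error(rows, labels, attribute):
    """Classification error when rows are grouped by `attribute` and every
    group predicts its own majority class."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    misclassified = sum(
        len(group) - Counter(group).most_common(1)[0][1]
        for group in groups.values()
    )
    return misclassified / len(labels)


def best_root_attribute(rows, labels):
    """Return the attribute whose split has the lowest classification error."""
    return min(rows[0].keys(), key=lambda a: split_error(rows, labels, a))


# Toy training records (hypothetical values).
rows = [
    {"body_temperature": "warm", "gives_birth": "yes"},
    {"body_temperature": "warm", "gives_birth": "no"},
    {"body_temperature": "cold", "gives_birth": "no"},
    {"body_temperature": "cold", "gives_birth": "no"},
    {"body_temperature": "warm", "gives_birth": "yes"},
    {"body_temperature": "warm", "gives_birth": "no"},
]
labels = ["mammal", "non-mammal", "non-mammal", "non-mammal", "mammal", "non-mammal"]

root = best_root_attribute(rows, labels)
```

On this toy data, splitting on gives_birth separates the classes perfectly while splitting on body_temperature misclassifies two of six records, so gives_birth would become the root. A full tree builder would apply the same selection recursively to each group until a stopping condition is met.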
theorem. Let X denote the attribute set of the input instances and Y denote the class variable. By
Bayes’ Theorem: $P(Y \mid X) = \dfrac{P(X \mid Y)\,P(Y)}{P(X)}$
• P(Y|X) is the posterior probability which provides the probability of hypothesis Y given
the data X
• P(X|Y) is the class conditional probability, which is the probability of data X given that
the hypothesis Y is true
• P(Y) is the prior probability which provides probability of hypothesis Y being true
P(X) is constant when comparing posterior probabilities for different values of Y,
and thus it can be ignored. The prior probability can be calculated from the training set by dividing
the number of training records that belong to a class by the total number of instances. For
example, if the dataset has the same number of instances in each class, the prior probability is the
same for every class. The class conditional probability is the probability of each input value given each
class value. It is calculated by dividing the frequency of each attribute value for a given class
value by the frequency of instances with that class value. The naïve Bayes classifier is used to
estimate the class conditional probability P(X|Y). Rather than attempting to calculate the
joint probability of the attribute values, P(x1, x2, x3|Y) [7], it assumes that
the attributes are conditionally independent given the class and calculates P(x1|Y), P(x2|Y), etc. It can
be represented as,
$$P(X \mid Y = y) = \prod_{i=1}^{d} P(X_i \mid Y = y)$$
Where each attribute set X = {X1, X2, …Xd} consists of d attributes. To classify a test record, the
naïve Bayes classifier computes the posterior probability for each class Y:
$$P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)}$$
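To make the computation concrete, here is a minimal Python sketch of a naïve Bayes classifier for categorical attributes. It estimates P(Y) and P(X_i|Y) by counting, as described above, and compares P(Y) multiplied by the product of P(X_i|Y) across classes, ignoring the constant P(X). The attribute values, the toy records, and the function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Estimate priors P(Y) and the counts needed for P(X_i | Y)."""
    class_counts = Counter(labels)
    priors = {y: n / len(labels) for y, n in class_counts.items()}
    # value_counts[(i, y)][v] = number of class-y rows whose attribute i equals v
    value_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, y)][value] += 1
    return priors, value_counts, class_counts


def predict(model, row):
    """Return (best class, scores), where score = P(Y) * prod_i P(X_i | Y)."""
    priors, value_counts, class_counts = model
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            score *= value_counts[(i, y)][value] / class_counts[y]
        scores[y] = score
    return max(scores, key=scores.get), scores


# Toy records: (time of day, area), all values hypothetical.
rows = [
    ("night", "downtown"), ("night", "suburb"), ("day", "downtown"),
    ("day", "suburb"), ("night", "downtown"), ("day", "suburb"),
]
labels = ["Crime", "Crime", "Non-Crime", "Non-Crime", "Crime", "Non-Crime"]

model = train_naive_bayes(rows, labels)
label, scores = predict(model, ("night", "downtown"))
```

For the query ("night", "downtown"), the Crime score is 0.5 × (3/3) × (2/3) = 1/3 and the Non-Crime score is 0.5 × (0/3) × (1/3) = 0, so the record is labeled Crime.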
Ada Boosting
aggregating the predictions of multiple classifiers. Multiple training sets are created by
resampling the original data and classifier is built from each training set using any standard
classification algorithms (for example, decision trees) [12]. The Ada boosting algorithm attempts to
boost the accuracy of any underlying classification algorithm by assigning a weight to each
training sample and adaptively changing the weights at the end of each round. Following are the
1. Apply same weight for all the instances in the training set
4. Reset the weights for all samples. Assign higher weight to incorrectly classified instances so
5. Repeat from step 3 until the set number of rounds is completed or a sufficiently high accuracy is reached
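The weighting scheme can be illustrated with a small Python sketch. It follows the standard AdaBoost formulation with one-dimensional decision stumps as the base classifier; the steps that are not fully spelled out in the list above (training the base classifier on the weighted data and computing its vote from the weighted error) are filled in from that standard formulation and are an assumption here, not necessarily what Weka does internally. The data and function names are hypothetical.

```python
import math


def stump_predict(threshold, polarity, x):
    """A decision stump: predict `polarity` if x <= threshold, else -polarity."""
    return polarity if x <= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(threshold, polarity, x) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` stumps; labels in ys must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with the same weight for every instance
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = max(error, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this stump
        ensemble.append((alpha, threshold, polarity))
        # Raise the weights of misclassified instances, lower the others.
        weights = [
            w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Combine the stumps by a weighted vote."""
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
```

On this toy data a single stump misclassifies three points, while the weighted vote of three stumps classifies all ten training points correctly.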
points. The first trained classifier labeled one ‘+’ and two ‘-’ instances incorrectly. In the next round,
those data points are assigned higher weights (shown larger than the rest of the data points).
The second classifier will focus on predicting them correctly due to their higher weights. This
continues in subsequent rounds until the final classifier is obtained by combining the learnings from
The nearest neighbor classifier is an instance-based learner that makes predictions
using the specific training instances “closest” to the test instance. Such classifiers are also called “lazy
learners” and do not require model building. However, the classification process is quite
expensive because each test instance is classified by computing its proximity to the training
examples. The most common class label among its K nearest neighbors is chosen.
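A minimal Python sketch of this procedure is shown below. It uses Euclidean distance on (latitude, longitude) pairs, which is only a rough proximity measure for geographic coordinates; the coordinates, the labels, and the function name are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Label `query` with the most common class among its k nearest
    training points (Euclidean distance)."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy (latitude, longitude) points with made-up labels.
points = [
    (26.12, -80.14), (26.13, -80.15), (26.12, -80.16),
    (26.19, -80.10), (26.20, -80.11),
]
labels = ["Crime", "Crime", "Non-Crime", "Non-Crime", "Non-Crime"]

near_first_cluster = knn_predict(points, labels, (26.125, -80.145), k=3)
near_second_cluster = knn_predict(points, labels, (26.195, -80.105), k=3)
```

Note that no model is built in advance: every prediction sorts all training points by distance, which is why classification becomes expensive as the training set grows.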
Model Evaluation
Once the model is built by a classifier, its performance can be evaluated using several
metrics that are based on counts of instances correctly and incorrectly labeled by the model.
The confusion matrix is a table that provides a visual representation of the performance of a
model. In the notation below, the rare class is denoted as the positive class and the majority class is
                      Predicted Class
                      +        -
Actual Case    +      TP       FN
               -      FP       TN
• True positive (TP): Number of positive examples that are correctly predicted by the
model
• False negative (FN): Number of positive examples that are wrongly predicted as negative
by the model
• False positive (FP): Number of negative examples that are wrongly predicted as positive
by the model
• True negative (TN): Number of negative examples that are correctly predicted by the
model
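The four counts above are sufficient to compute accuracy, error rate, recall, and precision. The Python sketch below uses the standard textbook definitions of these metrics; the function name and the example counts are hypothetical and are not results from this study.

```python
def evaluate(tp, fn, fp, tn):
    """Compute standard metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        # Fraction of all instances that were labeled correctly.
        "accuracy": (tp + tn) / total,
        # Fraction of all instances that were labeled incorrectly.
        "error_rate": (fp + fn) / total,
        # Fraction of actual positives that were predicted positive.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of predicted positives that are actually positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical counts for illustration only.
metrics = evaluate(tp=40, fn=10, fp=20, tn=30)
```

With these counts the accuracy is 0.70, the error rate is 0.30, the recall is 0.80, and the precision is about 0.67. Accuracy and error rate always sum to 1.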
In this paper, the following metrics are used to evaluate and compare the classification models [7].
Accuracy
Error rate
Error rate provides a summarized number that represents a proportionate number of times
correctly predicted by the classifier. This is also called the true positive rate. Large recall value
Precision
positive in the group classifier predicted as positive. Many false positive errors predicted by the
Using these measures, the performance of a model needs to be evaluated on a test set
for which the labels are already known. When no separate test set is
available, alternative methods can be used [10]. The first is the holdout method,
in which the original data set is split into two disjoint sets: one is used as the training set and the other
as the test set. The proportion of the split between training and test sets is decided by the analyst
based on the problem at hand; it can be 1:1, 2:1, 3:1 or any other ratio. This method
has a few limitations. First, fewer instances are available for training the model because some
portion of the data is held out for testing. Conversely, if the training set is made too large, the test
set becomes small and the estimated accuracy will be less reliable. Moreover, the training and test sets
are subsets of the same original data, which means that a class overrepresented in one will be
underrepresented in the other. A widely used alternative is k-fold cross validation. In this
approach, every instance is used for both training and testing. The data are divided into k subsets;
k-1 subsets are used as the training set and the remaining subset is used as the test set. The
procedure is repeated k times so that each subset is used exactly once for testing. The overall
estimate is calculated by taking the mean of the results from the k runs.
10-fold cross validation is the most commonly used setting to evaluate or compare classifiers.
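The k-fold procedure can be sketched in Python as follows. The fold assignment, the function names, and the placeholder majority-class classifier are assumptions made for illustration; Weka's own implementation may differ in details such as stratification.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and partition them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, train_fn, predict_fn, seed=0):
    """Return the mean accuracy over k runs; each fold is the test set once."""
    accuracies = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, xs[i]) == ys[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Placeholder classifier: always predict the majority class of the training set.
def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Toy data: 16 "Non-Crime" and 4 "Crime" instances with dummy attributes.
xs = list(range(20))
ys = ["Non-Crime"] * 16 + ["Crime"] * 4
mean_accuracy = cross_validate(xs, ys, 10, train_majority, predict_majority)
```

Any pair of training and prediction functions with the same signatures can be substituted for the placeholder classifier, which makes it possible to compare several algorithms on identical folds.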
Results
For all the cities, data are mapped using the longitude and latitude of the incident location.
These points are superimposed on a US geographical map to visualize the hotspots by NIBRS crime
type. Crime spots are weighted by the number of incidents that occurred in a specific
geographical location. This gives a visual view of “hotspots” in the cities. Selections can be made
by year of the crime, month of the crime, or NIBRS crime type for further analysis.
Visualization provides parameters to allow users to select a subset of data for further analysis
based on a scenario. For example, if the police department would like to focus on drug-related
crime, the chart allows selection of that specific crime and helps to identify the hotspots for
Visualization
In the city of Fort Lauderdale, Robbery is the top crime and following three areas are
“hotspots” for this crime: West Broward Blvd/ NW 25th Ave, E Oakland Park Blvd/ N Federal
Orlando international airport followed by Orlando International Premium Outlets and Conroy
In Gainesville, the top crime is robbery, and the visualization shows that the “hotspots” for
robbery are at NE 12th Ave/ NE 19th Ter, followed by NW 23rd/ NW 34th and SW Archer Rd/ SW
34th St.
Prediction
The table below provides basic statistics of the datasets that were used in this study.
Dataset           Total # of Incidents   # of Crimes   % of Crimes   # of Non-Crimes   % of Non-Crimes
Fort Lauderdale   163,765                73,953        45.16         89,812            54.84
Gainesville       49,516                 8,888         17.95         40,628            82.05
Orlando           158,459                158,459       100           0                 0
In this study, 10-fold cross validation is used to obtain the results. The tool randomly
divides the data into 10 separate subsets. Nine of them are used for training and the remaining one is used
for testing. The process is repeated 10 times so that each subset is used as a test set exactly
once, and the average of the results is calculated. Table below provides the results of the
58.1% and lower Error rate (41.0%). Better Precision is also achieved by KNN with 60.39%.
Recall value of Naïve Bayes (69.84%) is the best among the four classifiers that were compared.
For Gainesville, the decision tree provided the best accuracy (82.05%) and the lowest error rate
(17.95%). The best precision was achieved by boosting, with 84.19%. The recall of the decision
tree (100%) is the best among the four classifiers that were compared.
Prediction results are mapped for both Fort Lauderdale and Gainesville, based on the
algorithm results with best precision. The pictures below highlight the predicted hot spots.
Conclusion
The aim of this project is to create an application to predict crime hotspots using
visualization and classification for three Florida cities. To achieve that, incident data for the
cities are obtained, mapped using visualization software and classified using data mining
algorithms. Visualization was done successfully and the application allows hot spot analysis
using multiple dimensions like time and crime type. In addition, four different algorithms are
used to build classification models that can predict hotspots and the models are compared using
evaluation techniques. Based on Accuracy, k-nearest neighbor was the most effective algorithm
for Fort Lauderdale and decision tree was the most effective algorithm for Gainesville. When
Precision is considered, boosting is a better algorithm for prediction in Gainesville and k-nearest
The current research provides a baseline for how machine learning algorithms can be
used for proactive policing. This study can be further developed by adding additional Florida
cities as the incident data becomes available for them in Police Data Initiative repository. It can
potentially be a single interface for the users to quickly access crime spots for various cities. The
predictions can be enhanced by adding more features to the dataset. Adding the actual weather
during the incident occurrence time may provide more insights. For example, there were more
crimes in certain areas when there was a hurricane. Adding the weather information can help the
model to identify these patterns. Similarly, adding census data for the incident location may also
be beneficial. In particular, augmenting the dataset with median income level, ethnic
population distribution, and education level could improve the prediction of hotspots.
Limitations
The tool used for prediction (Weka 3.8) did not have a one-class classifier.
Orlando’s data did not include all incidents; it contained only crime data. A one-class
classifier could have been used to predict the hotspots for Orlando. Some of the data points in the
dataset are “outliers” relative to most of the other data points for the city. For example,
some incident data recorded for Gainesville had geo-coordinates in Utah. These outliers tend to
Applications
This application is a useful tool not only for police officers but also for the
public. It provides visualization of crime spots, which can be further analyzed using multiple
parameters such as time, season, and crime type. For example, police officers can filter the data for
certain months when the “snowbird” population is higher in retirement community areas and see whether
there is a different crime pattern. They can also use the prediction models to proactively predict
the hotspots and patrol those areas more actively. Members of the public can use it to understand
crime spots and apply that knowledge when they are considering an area to buy a property or to be
References
https://www.nij.gov/topics/technology/maps/documents/gps-bulletin-v2i3.pdf
2. Brownlee, J. (2016, April 11). Naive Bayes for machine-learning [Online forum post].
3. Castillo, M. (2015, June 4). Is a new crime wave on the horizon? CNN. Retrieved from
http://www.cnn.com/2015/06/02/us/crime-in-america/
Communities Tackle Local Challenges and Improve City Services [Press release]. (2015,
office/2015/09/14/fact-sheet-administration-announces-new-smart-cities-initiative-help
6. FACT SHEET: Announcing Over $80 million in New Federal Investment and a Doubling
of Participating Communities in the White House Smart Cities Initiative [Press release].
(n.d.). Retrieved from https://obamawhitehouse.archives.gov/the-press-
office/2016/09/26/fact-sheet-announcing-over-80-million-new-federal-investment-and
from https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
8. Marsh, B. (2016, September). Multivariate Analysis of the Vector Boson Fusion Higgs
https://www.researchgate.net/profile/Brendan_Marsh3/publication/306054843_Multivari
ate_Analysis_of_the_Vector_Boson_Fusion_Higgs_Boson/links/57ac9d6508ae7a6420c2
ffa8/Multivariate-Analysis-of-the-Vector-Boson-Fusion-Higgs-Boson.pdf
10. Schneider, J. (1997, February 7). Cross Validation. Retrieved January 7, 2018, from
https://www.cs.cmu.edu/~schneide/tut5/node42.html
11. Shojaee, S., Mustapha, A., Sidi, F., & Jabar, M. Z. (2013, May). A Study on
https://www.researchgate.net/profile/Somayeh_Shojaee/publication/266971832_A_Study
_on_Classification_Learning_Algorithms_to_Predict_Crime_Status/links/54436e830cf2e
6f0c0f94761/A-Study-on-Classification-Learning-Algorithms-to-Predict-Crime-
Status.pdf
12. Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston, MA:
13. Wu, X., Kumar, V., Quinlan, R. J., Ghosh, J., Yang, Q., Motoda, H., . . . Steinberg, D.
https://atasehir.bel.tr/Content/Yuklemeler/Dokuman/Dokuman3_4.pdf