2023
SPATIO-TEMPORAL CRIME HOTSPOT
DETECTION USING HYBRID MACHINE
LEARNING ALGORITHM TO IMPROVE
PREDICTION ACCURACY
by
August 2023
ACKNOWLEDGEMENT
academic journey. Dr. Sukumar's expertise and dedication have been instrumental in
mentorship and constructive feedback on the building the codes and report evaluation.
to the success of this thesis. Their patient guidance and willingness to share knowledge
have been pivotal in enhancing my research skills and critical thinking abilities. I
would also like to extend my appreciation to all the faculty members of the department
learning. Lastly, I wish to acknowledge my family and friends for their constant
beloved late sister, Puntalir Anbalagu, for motivating and supporting me to enrol in this
course after completing my degree. Their love and support have been the driving force
behind my achievements. Thank you to all who have contributed to this endeavour in
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
2.4.1 Bidirectional-LSTM (Bi-LSTM)
4.3 Results
4.3.5 MSE
REFERENCES
APPENDICES
LIST OF TABLES
LIST OF FIGURES
Figure 4.12: Average Time Taken for Model Training Comparison
LIST OF ABBREVIATIONS
LIST OF APPENDICES
SPATIO-TEMPORAL CRIME HOTSPOT DETECTION USING
HYBRID MACHINE LEARNING ALGORITHM TO IMPROVE
PREDICTION ACCURACY
ABSTRACT
Crime hotspot detection and prediction are crucial for effective law
enforcement and proactive crime prevention methods. As a result, the goal of this
crime hotspots. Based on previous research, numerous machine learning methods such
as Decision Tree, Random Forest, Naïve Bayes, Linear Regression, ARIMA, Kernel
Density Estimation, Gradient Boosting, and LSTM were explored. This paper further
spatiotemporal crime hotspots. As a result, using the dataset acquired from the Boston
performing model. Bi-LSTM-CNN performed the best compared to the other models
by achieving the highest accuracy, highest R² score, and lowest RMSE, MAPE,
training time and MSE. Overall, law enforcement agencies can use the Bi-LSTM-CNN
CHAPTER 1
INTRODUCTION
The primary objective of a smart city is to improve the quality of life of its
residents by making better use of the city's resources. The dramatic alteration of urban
areas has a huge influence on cities' socioeconomic growth. Smart cities infrastructure
on the quality of citizen life, better management of urban population concerns, and
sustainability in all aspects of their lives. Smart cities have enriched human life by
devices used by a large population in a city, which include sensors, cameras and tracking
devices.
threats to the welfare of society. Crimes can have a tremendous influence on
a country's economic growth. As a result, countries spend a large amount of their GDP
developers, research teams, legal authorities, industrial community, and residents are
critical for presenting and developing ideas to address smart city difficulties and attain
smart city goals. Cities are becoming overcrowded, pushing governments to launch
secure workplace, it could be challenging for the government officials. For successful
Intelligent technologies can forecast future crimes and patterns by examining
prior crimes. Researchers can now gather and analyse massive volumes of data thanks
generates patterns from existing data gathered by law enforcement and criminals to
avoid possible human error during classification and identification. The analysis and
prediction of crimes may be a quick and efficient procedure. Many existing studies
make use of artificial intelligence and machine learning to extract criminal trends and
detect crimes. Even though the data processing and classification time is rapidly
sector as businesses and organizations seek to better their operations by collecting and
analysing huge volumes of data. This research examines crimes based on their category,
time, and location of occurrence. The targeted categories of crime
are rape, murder, robbery, and physical assault, which usually happen in public areas.
These crimes can be associated directly with time and location of occurrence without
deviation.
1.1 Motivation
traditional crime analysis methods and explore the potential of spatio-temporal crime
prediction models. By considering the spatial and temporal aspects of crime data, we
aim to develop more accurate and efficient methods for crime hotspot predictions.
relationships and interactions between various factors contributing to crime hotspots.
Hybrid algorithms have the potential to leverage the strengths of different models and
identifying crime hotspots with higher precision. Law enforcement agencies can
machine learning algorithms for crime hotspot detection is crucial for their practical
the most effective approach in terms of both prediction accuracy and efficiency in
prevention?
• How can the existing models be enhanced to improve the crime hotspot
safety. However, traditional methods of crime analysis often rely on manual inspection
of crime data, which can be time-consuming and prone to errors. To address this issue,
there is a need for an automated approach to crime hotspot detection that can analyze
spatio-temporal crime data and predict the likelihood of crime in certain locations and
times. A crime prediction algorithm for spatio-temporal crime hotspot detection using
results in an increase in the failure rate for authorities to detect and stop the crime.
1.4 Objective
algorithms.
3. To compare the performance of existing machine learning algorithms with the new
algorithm.
results in high accuracy for the Boston Crime dataset for crime hotspot
prediction.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Crime is a major concern in every society and identifying crime hotspots can
help prevent and reduce criminal activities. A crucial step in crime analysis and
prevention is the recognition of spatiotemporal crime hotspots. It can be helpful for law
tactics if they are able to pinpoint areas and times when criminal activity is most
prevalent. In recent years, hotspots have been identified and crime data has been
analysed using machine learning techniques. The section below summarizes the
crime hotspots based on the prediction accuracy, dataset applied and limitations of each
model.
as part of data preparation. This step is essential because missing values, outliers, or
unrelated properties in the raw data may have a major impact on the model's accuracy.
For spatio-temporal data such as the Boston Crime Dataset ((BPD), 2018), specific
data preprocessing steps are required to ensure optimal model training. Finding and resolving
missing or erroneous data is a common step in the data preprocessing process known as
"data cleaning". This process can be applied to eliminate inaccurate or unneeded data
as well as impute missing values with the aid of techniques like mean, median, or mode.
For the Boston Crime Dataset, it is only necessary to drop unused columns
(Salam, 2022).
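The cleaning steps described above (dropping unused columns and imputing missing values with the mean, median, or mode) can be illustrated with a minimal Python sketch. The records and column names below are hypothetical and are not the schema of the Boston Crime Dataset; the helper names `drop_columns` and `impute` are likewise assumptions made for illustration.

```python
from statistics import mean, median, mode

def drop_columns(rows, unused):
    """Return copies of the rows without the unused columns."""
    return [{k: v for k, v in row.items() if k not in unused} for row in rows]

def impute(rows, column, strategy):
    """Fill missing (None) entries of one column with its mean, median, or mode."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]

# Hypothetical records; the column names are illustrative only.
records = [
    {"offense": "Robbery", "hour": 22, "district": "B2", "note": "x"},
    {"offense": "Assault", "hour": None, "district": "B2", "note": "y"},
    {"offense": "Robbery", "hour": 20, "district": None, "note": "z"},
]

cleaned = drop_columns(records, unused={"note"})
cleaned = impute(cleaned, "hour", "median")    # numeric column
cleaned = impute(cleaned, "district", "mode")  # categorical column
print(cleaned)
```

The median is a common choice for skewed numeric columns, while the mode is the only one of the three that applies to categorical columns.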
2.3 Single Machine Learning Algorithms
tree-like model with decision nodes and leaf nodes. The leaf node here represents a
choice, whereas each decision node splits into two or more branches.
A decision tree can handle both categorical and continuous data. This algorithm is a
simple and useful decision-making diagram. Using the trees is simple and practical
outcomes of the algorithms. A decision tree's key benefit is that it can swiftly adapt to
the dataset.
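To make the idea of a decision node concrete, the sketch below shows how a single split on a continuous feature could be chosen. It assumes the Gini impurity as the split criterion (one common choice, not necessarily the one used in the studies reviewed here), and the toy data and function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value would leave one side empty
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: hour of day and whether the cell was a hotspot.
hours = [1, 2, 3, 20, 22, 23]
is_hotspot = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(hours, is_hotspot)
print(threshold, impurity)
```

A full tree repeats this search recursively on each side of the chosen threshold until a stopping rule, such as a maximum depth, is reached.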
decision trees and then averages their outputs. This algorithm is widely used for
used for both classification and regression problems. A random forest has almost the same
hyperparameters as a decision tree. Its ensemble of decision trees is built on randomly
split data. The whole group may be compared to a forest in which each tree grows from an
independent random sample. When there are too many trees, the random forest technique
may become too slow and inefficient for real-time
prediction. The random forest approach, in contrast, creates the findings based on
Based on the study conducted (Yin, Michael, & Afa, 2020), the random forest
algorithm, but not a big difference based on the accuracy improvement. Even though
the comparison between DT and RF is done, it is compatible with the very large dataset
since the study has reached the bottleneck of the algorithm used. Additionally,
according to another research, decision trees outperform other algorithms for the Boston
crime dataset in terms of precision, recall, and F1-score, resulting in a robust tree that
Crime dataset using Principal Component Analysis (PCA), where the method is used
for dimensionality reduction. It figures out how to project the initial data into fewer
dimensions (Sharma, Choudhury, & Kandwal, 2021). Apart from this, another study
(Jogendra, Sravani, Akhil, Sureshkumar, & Yasaswi, 2022) mentioned that Decision
Tree algorithm has a very good performance evaluation based on the evaluation metrics
applications like text categorization is the Naive Bayes model. It models the input
distribution for one class or category and belongs to the family of generative learning
algorithms. This method is predicated on the idea that given the class, the properties of
the input data are conditionally independent, enabling the algorithm to predict outcomes
Naive Bayes classifiers are among the most basic Bayesian network models, yet
when used in conjunction with kernel density estimation, they may attain excellent
accuracy levels. With the aid of this method, the classifier may perform better in
probability density function of the input data using a kernel function. As a result, the
naive Bayes classifier is an effective machine learning tool, especially for sentiment
$$p(h \mid x) = \frac{p(x \mid h)\, p(h)}{p(x)}$$
The Naïve Bayes model also has multiple variants, such as Gaussian Naïve Bayes
the 3 techniques, Multinomial NB and Gaussian NB have the highest accuracy, and
their training time is quite low, which makes them well suited for real-time predictions as well.
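As a worked illustration of Bayes' rule under the conditional-independence assumption, the following sketch computes class posteriors from simple frequency counts. It is a categorical variant without smoothing, written only for illustration; the crime records, feature values, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (features_tuple, label). Returns the counts Bayes' rule needs."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (feature position, label) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def posterior(features, class_counts, feature_counts):
    """p(h | x) for every class h, assuming features are independent given h."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior p(h)
        for i, value in enumerate(features):
            score *= feature_counts[(i, label)][value] / count  # p(x_i | h)
        scores[label] = score
    evidence = sum(scores.values())  # p(x); zero if x was never seen for any class
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical records: (time of day, area) -> crime type.
samples = [
    (("night", "downtown"), "theft"),
    (("night", "downtown"), "theft"),
    (("day", "downtown"), "theft"),
    (("day", "suburb"), "vandalism"),
    (("night", "suburb"), "vandalism"),
    (("day", "downtown"), "vandalism"),
]
class_counts, feature_counts = train(samples)
post = posterior(("night", "downtown"), class_counts, feature_counts)
print(post)
```

Because no smoothing is applied, a feature value that never occurs with a class drives that class's score to zero; practical implementations usually add Laplace smoothing to avoid this.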
event's result using the data for the independent variables, relationships that could be
predict the outcome. The technique frequently resembles a straight line that as closely
as possible correlates to the various data points. A continuous form or a number is the
outcome. The output might include things like financial earnings or sales, the quantity
of goods sold, etc. In the previous scenario, there might be one or more independent
y = β0 + β1x + ε
Equation 2.2: Linear Regression Equation
y = dependent variable
x = independent variable
β0 = intercept of the line
β1 = linear regression coefficient (slope of the line)
ε = random error
The linear relationship between a dependent variable (y) and one or more independent
variables (x) is shown by the Linear Regression procedure. In other words, it establishes
the way a change in the independent variable's value affects the value of the dependent
variable. Independent and dependent variables are related in a straight line with a slope.
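For a single independent variable, the intercept β0 and slope β1 of Equation 2.2 can be estimated by ordinary least squares. The sketch below is a minimal illustration on made-up data; the function name `fit_line` is an assumption.

```python
def fit_line(x, y):
    """Ordinary least squares estimates of beta0 (intercept) and beta1 (slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Made-up data lying exactly on y = 1 + 2x, so the random error term is zero.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
beta0, beta1 = fit_line(x, y)
print(beta0, beta1)
```

With noisy data the same formulas give the line that minimises the sum of squared residuals ε.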
(Decision Tree, Random Forest, Linear Regression & Neural Network) with the highest
accuracy of 88% with MAE score of 0.02538 (Mittal, Goyal, & Sethi, 2018).
it is a regression of the variable against itself. We utilize lagged values of the target
variable as our input variables to predict values for the future. A model of order p
In the equation above, m's present value is a linear function of its prior p values.
The regression coefficients are [0, p] and are determined after training. One of the
methods that is often used to determine the optimal values of p is by looking at plots of
the autocorrelation and partial autocorrelation functions. Any differencing that must be
used to make the data stationary is represented by the integrated (I) component. The data may be tested for
stationarity using the Dickey-Fuller test, and after that, various differencing factors can
be tried out. A first difference, m_t - m_(t-1), is indicated by the differencing factor d = 1. Instead of
moving average methods forecast future values. A moving average model can be
The moving average component of the regression model is denoted by the letter
"q," and the random residual deviations between the model and the target variable are
denoted by the error term "e" in the equation above. Since it can only be determined
after the model has been fitted and because it is a parameter as well, "e" is an
implicitly capture that information. The addition of Seasonality adds robustness to the
SARIMA has a better accuracy using the Seasonal component and setting the
parameters for the algorithm using grid search technique (Noor, et al., 2022).
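The autoregressive and integrated components discussed above can be sketched in a few lines of Python. The example differences a toy series once (d = 1), fits an order-1 autoregression to the differenced values by least squares, and undoes the differencing to forecast the next point. The series, the restriction to p = 1, and the function names are assumptions for illustration; the moving average and seasonal components are not modelled here.

```python
def difference(series, d=1):
    """Apply first differencing d times: m_t - m_(t-1)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def fit_ar1(series):
    """Least-squares fit of m_t = c + phi * m_(t-1) on lagged values."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    phi = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
           / sum((a - mean_x) ** 2 for a in x))
    return mean_y - phi * mean_x, phi

# Toy series (not real crime counts) whose first differences follow an AR(1) process.
counts = [10, 14, 17, 19.5, 21.75, 23.875]
diffs = difference(counts, d=1)
c, phi = fit_ar1(diffs)
next_diff = c + phi * diffs[-1]        # forecast of the next difference
next_count = counts[-1] + next_diff    # undo the differencing
print(c, phi, next_count)
```

In practice the orders p, d, and q would be chosen from the autocorrelation plots and stationarity tests mentioned above rather than fixed in advance.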
The data smoothing problem is often used in signal processing and data science because
makes it possible to produce a smooth curve out of a batch of random data. However,
the estimate may also be used to generate points that appear to have come from just a
certain sample set. This function is particularly useful for modelling items and project
simulation.
By visualizing the data, Kernel Density Estimation starts to shape the
distribution's curve. The curve's shape at a particular place is determined by weighting
the distance from that place to each data point. The estimate is larger where
more points are clustered nearby, since there is a greater chance of seeing a point there.
The specific function used to weight the points throughout the data set is called the
kernel function. The kernel's form changes depending on its bandwidth. A smaller
bandwidth restricts the kernel's reach and makes the estimated curve appear
rough and jagged. The size and form of the estimate may be altered by adjusting the
For crimes in Bangalore, a study was conducted with the KDE algorithm, in which the
kernel density function k of the features f is evaluated at each spatial point for every
distance between events (Boppuru,
2023).
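The role of the kernel function and its bandwidth can be illustrated with a one-dimensional sketch. It assumes a Gaussian kernel, which is one common choice and not necessarily the kernel used in the study cited above; the event positions are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, points, bandwidth):
    """Density estimate f(x) = 1 / (n * h) * sum of K((x - x_i) / h)."""
    total = sum(gaussian_kernel((x - p) / bandwidth) for p in points)
    return total / (len(points) * bandwidth)

# Hypothetical event positions along a street (arbitrary units).
points = [1.0, 1.1, 1.2, 5.0]

dense = kde(1.1, points, bandwidth=0.5)   # inside the cluster of three events
sparse = kde(3.0, points, bandwidth=0.5)  # between the cluster and the lone event
print(dense, sparse)
```

Evaluating `kde` with a much smaller bandwidth makes the estimate drop almost to zero between events, which is the rough, jagged behaviour described above; a larger bandwidth smooths the curve.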
decision trees.
weak models and optimize them to minimize the errors of the previous models. The
ensemble is built by adding models sequentially, each one attempting to correct the
Figure 2.1: Example of Gradient Boosting Model
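The sequential error-correcting idea behind gradient boosting can be sketched for squared-error regression, where each new weak model is fitted to the residuals of the current ensemble. The weak learner below is a one-split decision stump; the toy data, learning rate, number of rounds, and function names are assumptions chosen only for illustration.

```python
def fit_stump(x, residuals):
    """Best single-split regressor (decision stump) under squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= t else right_mean

def gradient_boost(x, y, rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    stumps, predictions = [], [base] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mean_squared_error(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy data (not real crime counts).
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 3, 8, 9, 10]
baseline = lambda xi: sum(y) / len(y)  # predicts the mean everywhere
boosted = gradient_boost(x, y)
print(mean_squared_error(baseline, x, y), mean_squared_error(boosted, x, y))
```

For squared error the residuals are the negative gradient of the loss, which is why fitting each stump to them amounts to a gradient step.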
sequential input or time series data. Well-known programmes like Siri, voice search,
and Google Translate use these deep learning techniques. They are commonly used for
language processing. Recurrent neural networks (RNNs) and feedforward and
Because RNNs use previous data, their memory influences the current input and
output. Unlike traditional neural networks, which assume that inputs
and outputs are independent of one another, recurrent neural networks' outputs are
dependent on the preceding components in the sequence. Even if they would be helpful
variety have gained popularity for addressing long-term dependencies in sequential data
analysis. Traditional RNNs are susceptible to the vanishing gradient problem, which
LSTM is made to get around this problem. The input gate, forget gate, output gate, and
memory cell are the four main parts of the LSTM architecture.
The input gate chooses which data from the input to maintain, the forget gate
chooses which data from the memory cell to erase, and the output gate chooses which
data from the memory cell to output. The memory cell gradually stores the data while
selectively updating its contents in response to input. The decision of whether to admit
input or forget a piece of information is made by the input and forget gates using
sigmoid activation functions, which have a range of 0 to 1. The memory cell updates its
state using a hyperbolic tangent function, and the output gate employs a sigmoid
activation function.
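The gate mechanics can be made concrete with a single-unit LSTM step written in plain Python. This is a scalar sketch of the standard LSTM update, not the implementation used in this study; the weights and the input sequence are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM; w maps each gate to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = w[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))        # how much new information to admit
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate values for the memory cell
    c = f * c_prev + i * g           # updated memory cell
    h = o * math.tanh(c)             # new hidden state (the cell's output)
    return h, c

# Hypothetical weights and a short input sequence.
weights = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.1, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.6, 0.2, 0.0),
}
h, c = 0.0, 0.0
for value in [3.0, 1.0, 4.0]:
    h, c = lstm_step(value, h, c, weights)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state is carried forward almost unchanged, which is how the architecture preserves long-term information.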
Several studies have reported the usage of an LSTM model for crime detection.
A study comparing the federated LSTM and the LSTM (Salam, 2022) with
the same parameter values found almost identical metric values with no significant differences,
In a study that employed historical crime data of public property from 2015 to
2018 in a coastal city in China, Zhang et al. (2020) proposed an LSTM-based crime
prediction model. When simply using historical crime data, the results demonstrated
that the LSTM model outperformed alternative approaches such as KNN, RF, SVM,
model in a study conducted by Abdul Salam (2022). Based on the Boston crime dataset
((BPD), 2018), the suggested model is created using TensorFlow Federated (TFF)
model and the Keras API. To increase precision and reduce loss, the model's parameters
are adjusted. The results show that the federated LSTM model performs better than the
traditional LSTM model in terms of decreased loss and enhanced accuracy, at the cost of longer
training times.
Han et al. (2020) suggest a model of daily crime prediction in their research using
and effectively in Chicago, USA. The goal is to address crime-related issues in urban
neighborhoods. Based on the study, we can conclude that hybrid models have a better
(CNN) and a Long-Short Term Memory (LSTM) network (thus, CLSTM-NN) was used
to forecast the occurrence of criminal activity in Baltimore, USA. The results show that
while the significance thresholds for Random Forest and Ridge are both greater than
0.05, those for LSTM and the Integrated Model are both lower than 0.05. This implies
that while the findings from Ridge and Random Forest are not significant, the results
from LSTM and the Integrated Model are significant and trustworthy.
information. They evaluate the model using call-for-service data collected over a five-
year period between March 2012 and the end of December 2016 by the Portland,
Oregon Police Bureau (PPB). They contrast their model with the most advanced
Random Forests, K-Nearest Neighbours, and Decision Trees. The STNN(LSTM) model
observed.
namely, LR, SVM, NB, KNN, DT, MLP, RF, and XGBoost, and time sequence models
such as LSTM and ARIMA were used to forecast crime based on the criminal records
for the cities of Chicago and Los Angeles. It was found that in terms of root mean square
error (RMSE) and mean absolute error (MAE), LSTM performed reasonably well for
time series analysis compared to ARIMA on both data sets. However, the authors of
this work acknowledge several drawbacks to employing LSTM. First, since LSTM
models need a lot of training data to make accurate predictions, they performed better
when applied to predict crime using the Chicago dataset, which contains plenty of
instances, than when they were applied to the Los Angeles dataset, which contained
fewer instances. In addition, training LSTM models costs money and takes a long time.
such as CNNs, random forests, or support vector machines, to get around these
constraints. These hybrid models can increase the precision of crime detection models
method for processing temporal data, but because crime data includes both temporal
and geographical components, it might not be enough to detect crimes. The constraints
of crime detection models can be overcome, and their accuracy increased, by combining
Deep Neural Networks are widely used for image classification by translating
facts encoded into actual knowledge, more comprehensible. At each layer, Deep Neural
Networks (DNNs) change the data and generate a new representation. DNNs attempt to
categorize the data in a classification problem, improving this process layer by layer until
the desired result is obtained. This work may be thought of as the separation of lower-
dimensional manifolds in a data space, which is in accordance with the manifold
hypothesis, which claims that natural data forms lower-dimensional manifolds in its
mostly for natural language processing. It is a valuable tool for modelling the sequential
dependencies between words and phrases in both directions of the sequence since,
unlike ordinary LSTM, the input flows in both directions and it can use information
from both sides. One more LSTM layer is added by Bi-LSTM, which changes the
information flow's direction. The additional LSTM layer's input sequence flows
backward in this case, and the outputs from the two LSTM layers are then combined in
LSTM would take inputs in two separate directions, as opposed to typical LSTM
networks, one from the past to the future and the other from the future to the past.
Reversing the information preserves state information from the future. Thus, by
merging two hidden states, the network can always maintain data from the past and
the future. As a result, Chandra et al. (2021) investigated the performance of deep
encoder-decoder LSTM networks, and CNN and it was found that bi-directional LSTM
networks with encoder-decoders perform better than other models for both simulated
Butt et al. (2022) stated that Bi-LSTM and Exponential Smoothing (ES) hybrid
for crime forecasting. Using crime statistics from 2010 to 2017 for New York City, the
suggested method is assessed. The suggested method performed better than cutting-
edge Seasonal Autoregressive Integrated Moving Averages (SARIMA) with low Mean
Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and Mean
technique can aid law enforcement authorities in reducing and eliminating crime.
The Bi-LSTM neural network that Deepak et al. (2021) presented categorizes
the various sorts of crime based on information gathered from Google News and
Twitter. HREACC, HANNCC, HMNCC, and GSCA were contrasted with the
suggested technique. For all four different standard datasets, it was discovered that the
suggested method beat the others with an average improvement in accuracy of 9.2%
and a very low False-Negatives Ratio value of 0.11. Thus, by using a Bi-LSTM neural
network, which is an effective option and has been trained on a sizable dataset with a
wide range of data, it is possible to classify fresh items of evidence with greater accuracy
efficient multi-module strategy for forecasting crime. Decision Level Fusion and
Feature Level Fusion are the two modules of the suggested methodology. The Fusion
model, Stacked Bidirectional LSTM, and Temporal-based Attention LSTM are used by the
first module. The training data from the first two models is used by the Fusion model.
On the dataset of several cities, the key models for the transfer learning technique are
temporal-related models, hence, reducing learning model's training time. The second
module produces the final prediction using the Spatio-Temporal based Attention-
LSTM, Stacked Bidirectional LSTM, and the outcome of feature-level fusion. Based on
the information from the previous 24 hours, the suggested model forecasts the upcoming
hour. The output of the proposed model may be the estimated number of crimes in any
potential criminal activity based on category, location, and time. The experimental
analysis focused primarily on the American cities of San Francisco and Chicago.
For the San Francisco and Chicago datasets, the model's mean absolute error is
0.008 and 0.02, its coefficient of determination is 0.95 and 0.94, and its symmetric mean
absolute percent error is 1.03% and 0.6%, respectively. Thus, the suggested machine
learning model performs better than several other well-known models, such as
SARIMAX.
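Forecasting the upcoming hour from the previous 24 hours requires the time series to be cut into input windows and targets. The sketch below shows one simple way this could be done; the function name `make_windows` and the placeholder series are assumptions and do not reproduce the preprocessing of the cited study.

```python
def make_windows(series, window=24):
    """Pair each run of `window` consecutive values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets

# Placeholder hourly counts: 30 hours numbered 0..29 so the alignment is easy to check.
hourly_counts = list(range(30))
X, y = make_windows(hourly_counts, window=24)
print(len(X), X[0][0], X[0][-1], y[0])
```

Each row of `X` would be one input sequence for a recurrent model and the matching entry of `y` its one-step-ahead target.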
making use of the bidirectional feature of the LSTM architecture, Bi-LSTM models may
successfully capture both past and future associations in sequential crime data.
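The bidirectional idea can be illustrated without a deep learning library. The sketch below swaps the LSTM cells for a plain tanh recurrence to stay short, runs it once forward and once over the reversed sequence, and pairs the two hidden states at every time step. The weights and the sequences are hypothetical.

```python
import math

def run_rnn(sequence, w_x, w_h):
    """Simple tanh recurrence; returns the hidden state at every time step."""
    h, states = 0.0, []
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(sequence, fwd=(0.5, 0.8), bwd=(0.5, 0.8)):
    """Pair a past-to-future state with a future-to-past state at each step."""
    forward = run_rnn(sequence, *fwd)
    backward = run_rnn(sequence[::-1], *bwd)[::-1]  # re-align with time order
    return list(zip(forward, backward))

states_a = bidirectional([1.0, 0.0, 0.0, 2.0])
states_b = bidirectional([1.0, 0.0, 0.0, 5.0])  # only the last value differs
print(states_a[0], states_b[0])
```

Changing only the last value leaves the forward state at the first step untouched but alters the backward state there, which shows that the backward pass carries information from the future.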
2.4.2 LSTM-CNN
CNN) combination of two potent neural network models is the suggested machine
learning approach for identifying spatiotemporal crime. CNN is well suited to
processing images and signals, whereas LSTM is made to cope with sequential data.
Several studies have employed the LSTM-CNN model to forecast crime in diverse
contexts.
CNN and the LSTM model in combination can produce a reliable crime prediction
approach with a high forecast accuracy. CNN was used to extract the attributes and the
LSTM to forecast the crime rate. Given that the calculated R-squared and MAE values
were 0.99 and 0.0027, respectively, the forecast is accurate (Muthamizharasan &
Ponnusamy, 2022).
(CNN) and a Long-Short Term Memory (LSTM) network (thus, CLSTM-NN) was used
to forecast the occurrence of criminal activity in Baltimore, USA. The results show that
while the significance thresholds for Random Forest and Ridge are both greater than
0.05, those for LSTM and the Integrated Model are both lower than 0.05. This implies
that while the findings from the ridge and Random Forest models are not significant,
the results from the regression models using the LSTM and the Integrated Model are
Convolutional neural networks with long short-term memories (CNN-LSTM)
were used in surveillance systems by Esan et al. (2020) to identify unusual behavioural
patterns in an academic setting. While the LSTM uses the gate mechanism to store
important information for memory, the CNN extracts the image features from the
picture frame sequences. The outcomes are contrasted with those from detection models
that are already in use, such as models built using dictionaries, motion deep nets, social
force, and probabilistic principal analysis. The outcomes demonstrate that the suggested
approach performs better than the others stated with 86% accuracy.
detection. These algorithms are capable of precisely capturing the temporal and spatial
required.
2.4.3 Bi-LSTM-CNN
CNN architectures. The CNN-Bi-LSTM model effectively makes predictions via the
fully connected layer by selecting the spatial features of the data through CNN,
extracting the temporal information for Bi-LSTM's input, and combining three models
Figure 2.5: Bi-LSTM-CNN Model
The hybrid attention-based Bi-LSTM and CNN network was used by Kumar et
Character embedding was done through the CNN network, and word embedding was
done via attention-based Bi-LSTM. For word embedding, FastText was developed
using language-specific code-mixed Tamil and Malayalam text for the Tamil and
Malayalam models, respectively. One-hot encoding vectors were utilised for character
encoding. After combining the output from the CNN and attention-based Bi-LSTM
layers, a softmax layer is used to predict offensive and neutral text. Hyper-parameters
can affect how well deep neural networks operate. So, by varying the learning rate,
batch size, optimizer, epochs, loss function, and activation function, extensive testing
was run. The parameters that gave the proposed system the highest
performance were a learning rate of 0.001, a batch size of 32, Adam as the optimizer, 100
epochs, binary cross-entropy as the loss function, and ReLU activation in the internal layers
between fact descriptions and charges. In text classification, CNN seeks to identify
global text features while Bi-LSTM concentrates on local text features. The hybrid
model, which combines Bi-LSTM and CNN, was proposed to extract every
spatiotemporal feature of text description because neither model can do so on its own.
As a result, the model obtained more accurate text feature information and boosted the
Singh et al. (2023) studied the detection of violence using advanced deep
used by analysing fight scenes on surveillance videos. ConvLSTM model was reported
to be slower and less effective overall than the other two models. The CNN Bi-LSTM
model consistently beat the CNN LSTM in the CCTV dataset and achieved the highest
Two-dimensional convolutional neural networks were used in both the LSTM and Bi-
LSTM models to extract features from frames. However, the Bi-LSTM layer showed
better results and was found to be more efficient for identification since it analysed
temporal data in both directions rather than only the forward direction.
According to previous studies, the Bi-LSTM-CNN hybrid model has great potential
to be applied as a tool to study spatio-temporal data. Therefore, the hybrid model Bi-
Butt et al. (2022) state that a machine learning model needs to be carefully
number of data points and to investigate the proper application of the same models to
fresh datasets. In this study (Zhuang & Cao, 2022), the performance of the model in
predicting crime hotspots is validated using RMSE, MAPE, and MSE. R2 is mentioned
in another work by Butt et al. (2022) as being a crucial evaluation statistic for LSTM
hybrid models. The use of performance evaluation criteria is justified by the fact that
similar studies have frequently employed these evaluation metrics in prior research.
Additionally, MSE was chosen since it is scale-dependent and weights outliers heavily.
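For reference, the four evaluation metrics named in this section can be computed as in the following sketch, which uses their standard definitions. The actual and predicted values are made up for illustration and are not results from this study.

```python
import math

def mse(actual, predicted):
    """Mean squared error; squaring makes large errors (outliers) dominate."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    return math.sqrt(mse(actual, predicted))

def mape(actual, predicted):
    """Mean absolute percentage error in percent; undefined if an actual value is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(mse(actual, predicted), rmse(actual, predicted),
      mape(actual, predicted), r2(actual, predicted))
```

MSE and RMSE depend on the scale of the data, whereas MAPE and R² are scale-free, which is why the four are usually reported together.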
The big dataset in this study suggests (Butt, Letchmunan, Hassan, & Koh, 2022)
that RMSE is a reasonable performance metric to assess the model's performance and
heavily reliant on the percentage of data. It is also easily comprehensible and obliquely
2.6 Summary
tanh, epoch = 300,
hidden layer size = 3
Memory Neurons=20
STNN- 0.815 Learning Rate =
LSTM 0.0005
Activation = relu
Decision Criterion = gini, max
0.76
Tree depth = 5
There is no
Gaussion-
0.743 - limitation,
NB
(Zhuang, however future
Random Estimators = 10, min CFS
Almeida, 0.7625 development
Forest samples split =2 data,
Morabito, & will include
K=1, Distance Portland
Ding, 2017) KNN 0.6375 more features
measure = L2
into the hybrid
Logistic Epochs = 300,
0.75 model.
Regression penalty = L2
Hidden layer
size(100,50),
ML
0.7675 Learning
Perception
Rate(0.001),
Activation(relu)
Decision To adopt Boston
No parameter info,
(Aljuboori, Tree, Naïve Precision, gender and Crime
but DT performs best
Shaker, & Bayes, Recall and identity of Dataset
in all performance
Fadhil, 2022) Logistic F1 offender as
classifier
Regression attributes
LSTM Loss=Huber, Boston
Almost Consider the
Federated Optimizer = SGD, Crime
similar global
LSTM Adam, Metrics = all, Dataset
output, but optimization
(Salam, 2022) Data size = 319073,
FLSTM is and reduce the
Batch size = 4,
better communication
Window size = 60,
slightly overhead
round number = 5
25
ES-Bi- Using the New
LSTM knowledge of York
the criminal City
Lowest domain to Crime
(Butt, Seasonal=24, Batch
MAPE, improve Dataset
Letchmunan, size=48, Epoch = 10,
RMSE, transfer
Hassan, & optimizer=RMSProp,
MAE & learning to
Koh, 2022) MSE
R^2 increase the
accuracy of
crime
prediction
Naïve Bayes If class label is Denver
absent, the City
(Kanimozhi, probability of Crime
N, G, estimation will Dataset
Accuracy = No parameter
Ranjitha, & be zero. Use
93.07% information
Yuvarani, different
2021) models in
enhance
performance
ARIMA No parameter ARIMA Chicago
No value performs better and Los
information
in future Angeles
(SAFAT, LSTM
predictions & Crime
ASGHAR, &
RMSE = crime trend Dataset
GILLANI, Epochs=40,
8.78 LSTM,
2021) batch=31
MAE = 6 evaluated
using RMSE
and RAE only.
Kernel Bengalu
Performance is
(Boppuru, Density Accuracy = No parameter ru
like ARIMA
2023) Estimation 77.49% information Crime
model,
(KDE) Dataset
26
(Muthamizharasan & Ponnusamy, 2022) | LSTM-CNN | R^2 = 0.99, MAE = 0.0027 | No parameter information | Missing of data monthly, quarterly, or seasonal basis to produce more accurate result. | Chennai City Crime Dataset
(Noor, et al., 2022) | SARIMA | R^2 = 0.853, MAE = 0.066059 | Grid Search method to determine the optimal parameters | Not all types of crime are used for training the model. | Saudi Arabia Crime Dataset
(Stec & Klabjan, 2018) | Feed Forward; CNN; RNN; CNN + RNN | 71.3%; 72.7%; 74.1%; 75.6% | No parameter information | Accuracy drops not clear due to fewer training examples | Chicago and Portland Crime Dataset
(Mittal, Goyal, & Sethi, 2018) | Decision Tree; Random Forest; LR; NN | 85.75%; 88.61%; 89.61%; 88.31% | No parameter information | No limitation stated | NCRB
(Jogendra, Sravani, Akhil, Sureshkumar, & Yasaswi, 2022) | Decision Tree + K-means | 80% | No parameter information | Recommended to use CNN or DNN for the future use | Chicago Crime Dataset
(Kang & Kang, 2017) | DNN | 83.25% | Layer1 = 256, Layer2 = 256, Layer3 = 128, Layer size = (1024, 1024, 2), activation = softmax | Unable to implement DNN with insufficient dataset, which will cause performance degradation. | Chicago Crime Dataset
(Anuvarshini, Deeksha, C, & Krishna, 2022) | LSTM-CNN | MAE = 24.56, RMSE = 30.11 | Kernel size = 2, convo1D= 1, stride window size = 30, batch size = 32, epochs = 100 | Full CNN performs better than LSTM | Boston Crime Dataset
(Deepak, Rooban, & Santhanavijayan, 2021) | Bi-LSTM | Precision, recall, f-measure, accuracy, FNR | No parameter information | Performance varies with dataset, but has an average of 80% and above | Multiple crime dataset
(Singh, Rani, Bansal, & Techniques, 2023) | Bi-LSTM-CNN | Accuracy | MaxPooling2D, 16 filters of size 3×3 & 4x4, dropout = 0.75, Conv2D with 64 filters | Bi-LSTM-CNN has the highest performance | Image data
Table 2.1: Comparison of Algorithms
Overall, the hybrid models performed better compared to the singular models
excluding LSTM model in previous research. Therefore, in this study LSTM, Bi-LSTM,
spatio-temporal crime hotspots. The hybrid models, Bi-LSTM and LSTM-CNN, have
very convincing accuracy and training loss in previous studies based on structured
crime dataset. According to previous studies, Bi-LSTM-CNN has shown great potential
temporal crime hotspot prediction using Bi-LSTM-CNN model. Hence, this study
prediction. Based on the best model, evaluation metrics such as RMSE, MAPE,
CHAPTER 3
METHODOLOGY
3.1 Introduction
[Figure: methodology flowchart: Start → Experimental Setup (HW/SW) → Collection of Experimental Dataset → Dataset Preprocessing → Model Evaluation → End]
3.2 Experimental Setup
Components Specification
CPU Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
GPU NVIDIA Quadro RTX 3000
Storage 250GB SSD + 2TB HDD
RAM 32.0 GB
Table 3.1: Hardware Specification
Using the Keras package, Python code was developed to model all four algorithms
to predict the spatio-temporal crime hotspot locations. TensorFlow was used to choose
between CPU and GPU for training and testing. Below is the list of dependencies:
• Jupyter Notebook
• Keras Layers (LSTM, Conv1D, MaxPooling1D, Flatten, Dense, Dropout,
Bidirectional)
• Keras Optimizer (Adam)
• Tensorflow
• Pandas
• Numpy
• Matplotlib
• Sklearn
This dataset was collected from the Boston Police Department (BPD) and documents the
initial details surrounding incidents to which BPD officers respond ((BPD), 2018).
It contains records from the crime incident report system from
June 14, 2015 to September 3, 2018, and includes a reduced set of fields
focused on capturing the type of incident as well as when and where it occurred. The
dataset consists of 17 attributes (columns) and 327820 records (rows). The crime data
Description of attributes:
• OCCURRED_ON_DATE/YEAR/MONTH/DAY_OF_WEEK/HOUR:
Time of crime.
happened.
First, the dataset was checked for empty cells. The
“SHOOTING” column only contains ‘Y’ values, so its empty cells were filled with the value
‘N’. Some rows had shifted columns; these 2434 rows were removed to clean the
dataset. Then, the dataset was partitioned into 80% for training the models and the remaining 20% for testing.
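The two preprocessing steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the thesis code: the function names, the toy records, and the choice to shuffle before splitting (with a fixed seed) are all assumptions, since the text does not state whether the 80/20 partition was random or chronological.

```python
import random

def fill_missing_shooting(rows):
    """Return copies of the rows with empty SHOOTING cells set to 'N'."""
    return [{**row, "SHOOTING": row.get("SHOOTING") or "N"} for row in rows]

def train_test_split_80_20(rows, seed=42):
    """Shuffle a copy of the rows (assumed) and split it 80% / 20%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# Toy records standing in for the real BPD rows (values are made up).
records = [{"INCIDENT": i, "SHOOTING": "Y" if i % 5 == 0 else ""} for i in range(10)]

cleaned = fill_missing_shooting(records)
train_rows, test_rows = train_test_split_80_20(cleaned)
print(len(train_rows), len(test_rows))
```

With pandas, the same fill step would typically be a single `fillna("N")` call on the column; the plain-Python version is used here only to keep the sketch dependency-free.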
3.4 Model Building and Training
Below are the four algorithms built to train models on, and test against,
the Boston Crime Dataset. The following sections describe the hyperparameters
3.4.1 Building the Model
Below is the flow for LSTM model generation and testing the model:
a. LSTM
b. Bi-LSTM
c. LSTM-CNN
d. Bi-LSTM-CNN
ii. Input data was initialized and determined by number of time
3. A dropout Layer was added to all the models with a rate of 0.2.
4. A dense layer with a number of units equal to the number of classes in the target
3. The metric used to evaluate the model during training is set to accuracy.
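As an illustration of the build and compile steps listed above, the following is a hedged Keras sketch of a Bi-LSTM-CNN classifier. The dropout rate of 0.2, the output layer sized to the number of classes, the Adam optimizer, the categorical cross-entropy loss, and the accuracy metric come from the text. Everything else is an assumption made for the sketch: the function name `build_bilstm_cnn`, the layer order (Bi-LSTM before Conv1D), the 64 LSTM units, the 32 filters, the kernel and pool sizes, and the softmax activation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm_cnn(timesteps, n_features, n_classes):
    """Build and compile a small Bi-LSTM-CNN classifier (illustrative sizes)."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        # Bidirectional LSTM returns the full sequence so Conv1D can slide over it.
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Conv1D(filters=32, kernel_size=2, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dropout(0.2),  # rate stated in the text
        layers.Dense(n_classes, activation="softmax"),  # one unit per class
    ])
    model.compile(
        loss="categorical_crossentropy",
        optimizer=keras.optimizers.Adam(),
        metrics=["accuracy"],
    )
    return model

# Hypothetical input shape: 8 time steps, 5 features, 3 crime classes.
model = build_bilstm_cnn(timesteps=8, n_features=5, n_classes=3)
model.summary()
```

Dropping the `Bidirectional` wrapper gives an LSTM-CNN variant, and removing the `Conv1D`, `MaxPooling1D` and `Flatten` layers (with `return_sequences=False`) gives the plain LSTM and Bi-LSTM variants under the same assumptions.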
1. The input dataset was reshaped to have dimensions to match the input shape of
a. LSTM
b. Bi-LSTM
c. LSTM-CNN
d. Bi-LSTM-CNN
2. The categorical target variable (y_train) was converted into one-hot encoded
vectors.
3. The number of epochs was set to 50, which is the number of times the entire dataset is passed through the model during training.
4. The batch size, which is the number of samples used in each gradient update, was set to 32.
5. Optionally, progress display was also enabled (verbose=1).
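Two of the training steps above lend themselves to a small worked example: converting integer class labels into one-hot vectors, and counting how many gradient updates one epoch takes for a given batch size. The helper names and the toy labels below are illustrative assumptions; in Keras the encoding is usually done with `keras.utils.to_categorical`.

```python
import math

def one_hot(labels, n_classes):
    """Convert integer class labels into one-hot encoded vectors."""
    vectors = []
    for label in labels:
        vec = [0.0] * n_classes
        vec[label] = 1.0
        vectors.append(vec)
    return vectors

def updates_per_epoch(n_samples, batch_size=32):
    """Number of gradient updates needed to pass over all samples once."""
    return math.ceil(n_samples / batch_size)

y_train = [2, 0, 1]                      # toy integer labels
y_onehot = one_hot(y_train, n_classes=3)
steps = updates_per_epoch(n_samples=100, batch_size=32)
print(y_onehot, steps)
```

With 50 epochs, the total number of updates is simply 50 times the value returned by `updates_per_epoch` for the training set size.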
This entails refining the model's parameters and assessing the model's
effectiveness using validation data. The usefulness of the model in identifying crime
hotspots is assessed using a variety of measures, including Root Mean Squared Error
(RMSE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE),
R-squared (R²), training loss and accuracy. The model is then tested on new data to
evaluate its generalizability and robustness. The final step involves analysing the results
of the experiments and drawing conclusions about the effectiveness of the proposed
approach.
The square root of the mean of the square of all the errors is known as the root
mean squared error (RMSE). RMSE is regarded as a superior all-purpose error metric
variable and not between variables. It evaluates how well a regression line matches the
\[ RMSE = \sqrt{\frac{\sum_{i=1}^{N} (Predicted_i - Actual_i)^2}{N}} \]
Equation 3.1: RMSE Equation
3.5.2 Mean Absolute Percentage Error (MAPE)
forecasting technique's accuracy. It is the average of the absolute % errors of each item
in a dataset, which may be used to assess how accurate the predicted quantities were in
comparison to the actual numbers. MAPE, which necessitates the use of dataset values
other than zero, may frequently be used to examine large data sets successfully. The
\[ MAPE = \frac{1}{\text{sample size}} \sum \frac{|actual - forecast|}{|actual|} \times 100 \]
Equation 3.2: MAPE Equation
The mean squared error (MSE) is a means of assessing
how closely a regression line matches a collection of points. To do this, the distances
between the points and the regression line, also referred to as the "errors", are squared.
differences. This error type is called the mean squared error since you're averaging a
\[ MSE = \frac{1}{n} \sum (actual - forecast)^2 \]
n = number of objects
∑= summation notation
Actual = original or observed y-value,
Forecast = y-value from regression.
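The three error metrics defined so far (RMSE, MAPE and MSE) can be written directly from their equations. The sketch below uses plain Python; the function names and the toy values are assumptions for illustration, and MAPE is only defined when no actual value is zero, as noted in the text.

```python
import math

def mse(actual, forecast):
    """Mean of the squared differences between actual and forecast values."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Square root of the mean squared error."""
    return math.sqrt(mse(actual, forecast))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)

actual = [2.0, 4.0, 5.0]      # toy observed values
forecast = [1.0, 4.0, 7.0]    # toy predicted values

print(mse(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```

Because RMSE is the square root of MSE, the two metrics always rank models in the same order; RMSE is simply expressed in the same units as the target.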
3.5.4 R-Squared (R²)
a regression model. The degree to which the regression model accurately predicts the
observed data and the percentage of the dependent variable's variation that can be
accounted for by the model's independent variables are both measured. R-squared is
different models.
SSR: the sum of the squared differences between the predicted and the actual values.
SST: the sum of the squared differences between the actual values and the mean of the
dependent variable.
total variance. A value of 1 for R-squared indicates that the regression model perfectly
fits the data, explaining all the variability. A score of 0 on the other hand signifies that
the model does not account for any variability and is essentially comparable to
forecasting the dependent variable's mean. R-squared can also take on negative values,
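The three cases described above (a perfect fit, a fit no better than the mean, and a negative value) can be checked with a short sketch. It assumes the standard definition R² = 1 − SSR/SST, using SSR and SST as defined in the text; the function name and the toy values are illustrative.

```python
def r_squared(actual, predicted):
    """R-squared computed as 1 - SSR/SST (SST must be non-zero)."""
    mean_actual = sum(actual) / len(actual)
    ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ssr / sst

actual = [1.0, 2.0, 3.0]

perfect = r_squared(actual, [1.0, 2.0, 3.0])      # predictions match exactly
mean_only = r_squared(actual, [2.0, 2.0, 2.0])    # always predict the mean
worse = r_squared(actual, [3.0, 2.0, 1.0])        # worse than predicting the mean

print(perfect, mean_only, worse)
```

The last case shows how a model that fits worse than a constant mean prediction yields a negative R-squared.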
classification tasks with multiple classes which is proposed to calculate for all the
between predicted class probabilities and the true class labels. The categorical cross-entropy
loss quantifies the difference between the predicted probability distribution and
the true class labels. The equation for categorical cross-entropy loss is as follows:
Given:
Keras is used to calculate the categorical cross-entropy loss function for the model.
Thus, during training, the loss is calculated using this equation for each batch of training
samples.
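To make the per-batch loss concrete, the following sketch computes categorical cross-entropy for a small batch of one-hot labels and predicted probabilities, using the usual form: the negative sum of true label times log predicted probability, averaged over the batch. The function name, the toy batch, and the clipping constant `eps` are assumptions; clipping is a common numerical-stability measure and is not taken from the text.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Average categorical cross-entropy over a batch of one-hot labels."""
    total = 0.0
    for true_vec, pred_vec in zip(y_true, y_pred):
        for t, p in zip(true_vec, pred_vec):
            # Clip probabilities away from 0 and 1 to avoid log(0) (assumed eps).
            clipped = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(clipped)
    return total / len(y_true)

# Toy batch: two samples, three classes.
y_true = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

loss = categorical_cross_entropy(y_true, y_pred)
print(loss)
```

Only the probability assigned to the true class contributes to each sample's loss, so the loss falls towards zero as that probability approaches 1.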
The accuracy is calculated based on the predictions made by the model on the
test dataset input features (X_test) and the corresponding true labels from categorical
data (y_test). Below is the generic equation for how the accuracy is calculated:
The evaluate() method calculates the loss and accuracy for the provided test data
(X_test) and the true labels (y_test), which are converted into one-hot encoded vectors.
In summary, the accuracy is obtained directly from the model evaluation process and
indicates the proportion of samples in the test dataset that were properly categorised to
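Accuracy as described above is commonly computed as the number of correctly classified samples divided by the total number of samples, where the predicted class is the one with the highest predicted probability. The sketch below illustrates this in plain Python; the helper names and the toy arrays are assumptions and do not reproduce what `evaluate()` does internally.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(y_true_onehot, y_pred_probs):
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(
        1 for t, p in zip(y_true_onehot, y_pred_probs) if argmax(t) == argmax(p)
    )
    return correct / len(y_true_onehot)

# Toy test set: four samples, three classes.
y_test = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_prob = [
    [0.8, 0.1, 0.1],   # correct
    [0.2, 0.7, 0.1],   # correct
    [0.5, 0.3, 0.2],   # wrong (predicts class 0, true class 2)
    [0.1, 0.6, 0.3],   # correct
]

acc = accuracy(y_test, y_prob)
print(acc)
```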
CHAPTER 4
4.1 Introduction
This section consists of the results obtained based on the dataset used to predict
crime hotspot using LSTM, Bi-LSTM, LSTM-CNN, and Bi-LSTM-CNN. The models
are evaluated using evaluation metrics including RMSE, MAPE, MSE, R-squared
value, accuracy, and training loss in order to compare the performance of the provided
Below is the total number of crimes categorized under each type of crime.
Figure 4.2: District-wise Crime Distribution
Figure 4.4: Year-wise Crime Distribution
4.3 Results
Accuracy (%)
Models | 1st Training | 2nd Training | 3rd Training | 4th Training | 5th Training | Average
LSTM | 96.9 | 94.3 | 97.2 | 92.7 | 93.0 | 94.82
Bi-LSTM | 90.4 | 85.5 | 91.7 | 86.3 | 92.5 | 89.28
LSTM-CNN | 93.9 | 97.8 | 89.2 | 97.8 | 96.2 | 94.98
Bi-LSTM-CNN | 98.8 | 98.6 | 96.6 | 94.8 | 93.2 | 96.4
Table 4.1: Accuracy of Trained Models
[Figure: bar chart of Accuracy (%) by model: LSTM, Bi-LSTM, LSTM-CNN, Bi-LSTM-CNN]
Bi-LSTM-CNN | 0.051 | 0.056 | 0.112 | 0.136 | 0.137 | 0.0984
[Figure: bar chart of Training Loss by model: LSTM, Bi-LSTM, LSTM-CNN, Bi-LSTM-CNN]
4.3.3 RMSE
RMSE
Models | Training 1 | Training 2 | Training 3 | Training 4 | Training 5 | Average
LSTM | 2.454 | 4.125 | 2.824 | 4.258 | 4.755 | 3.6832
Bi-LSTM | 4.443 | 3.674 | 3.524 | 3.720 | 4.206 | 3.9134
LSTM-CNN | 3.216 | 2.487 | 4.127 | 2.122 | 2.22 | 2.8344
Bi-LSTM-CNN | 1.920 | 2.490 | 2.561 | 2.178 | 3.860 | 2.6018
Table 4.3: RMSE Average
[Figure: bar chart of RMSE by model: LSTM, Bi-LSTM, LSTM-CNN, Bi-LSTM-CNN]
4.3.4 MAPE
MAPE
Models | Training 1 | Training 2 | Training 3 | Training 4 | Training 5 | Average
[Figure: bar chart of MAPE by model: LSTM, Bi-LSTM, LSTM-CNN, Bi-LSTM-CNN]
4.3.5 MSE
MSE
Models | Training 1 | Training 2 | Training 3 | Training 4 | Training 5 | Average
LSTM | 6.026 | 17.020 | 7.977 | 18.133 | 22.614 | 14.354
Bi-LSTM | 19.747 | 13.501 | 12.424 | 13.844 | 17.691 | 15.4414
LSTM-CNN | 10.346 | 6.190 | 17.036 | 4.505 | 4.970 | 8.6094
Bi-LSTM-CNN | 3.688 | 6.204 | 6.559 | 4.747 | 14.902 | 7.22
[Figure: bar chart of MSE by model: LSTM, Bi-LSTM, LSTM-CNN, Bi-LSTM-CNN]
4.3.6 R-Squared
R-Squared
Models | Training 1 | Training 2 | Training 3 | Training 4 | Training 5 | Average
LSTM | 0.975 | 0.930 | 0.967 | 0.926 | 0.908 | 0.9412
Bi-LSTM | 0.919 | 0.945 | 0.949 | 0.943 | 0.928 | 0.9368
LSTM-CNN | 0.957 | 0.975 | 0.9307 | 0.981 | 0.979 | 0.96454
Bi-LSTM-CNN | 0.985 | 0.974 | 0.973 | 0.980 | 0.946 | 0.9716
[Figure: bar chart of R-Squared by model: LSTM, Bi-LSTM, LSTM-CNN, Bi-LSTM-CNN]
Each model was run 5 times, with the number of epochs set to 50 each time. Below is a
comparison of the time taken to train each model, in seconds.
[Figure: bar chart of Model Training Time (s) by model: LSTM, Bi-LSTM, LSTM-CNN, Bi-LSTM-CNN]
4.4 Summary
Based on the data analysis, the type of crime was taken as the categorical target
(y_test) for this experiment. In the crime type distribution, Motor Vehicle Accident
Response is the most frequently reported crime. District B2 has the
highest number of reported crimes and district A15 the lowest. Crimes were also analysed
by hour, month, and year. Throughout the day, the highest crime rates
are reported between 5pm and 7pm. The year-wise distribution graph shows no clear trend,
so crime rates may rise or fall in the coming years. By month,
months 7, 8, and 9 have the highest number of recorded crimes.
Among the models trained and compared, Bi-LSTM-CNN gives the most convincing results.
It recorded the highest average training accuracy (96.4%) and average R-squared
value (0.9716), although the difference from the LSTM and LSTM-CNN
models is not large. For the training loss, Bi-LSTM-CNN has the lowest average loss (0.0984).
Bi-LSTM-CNN also has the lowest average RMSE (2.6018), followed by LSTM-CNN
(2.8344). There is little difference in the average MAPE recorded
between Bi-LSTM-CNN (0.03132) and LSTM-CNN (0.03256). For the MSE, Bi-LSTM-CNN
has the lowest value recorded (7.22) compared to the other 3 models.
To choose the optimum model, a further evaluation based on training time
was also carried out. Although LSTM-CNN has a shorter average training time of around
833 s, Bi-LSTM-CNN takes only an additional 170 s and performs better on the
other evaluation metrics. Since training is a one-time step, the extra time is acceptable.
CHAPTER 5
5.1 Conclusion
Crime hotspot identification is critical for increasing public safety and law
allows law enforcement organizations to better allocate resources and conduct targeted
learning algorithms were investigated to reduce the focus of the investigation. It was
compares a non-hybrid model, LSTM and hybrid machine learning model including
outperformed the other models in terms of accuracy, R-squared value, RMSE, MAPE,
and MSE. An additional training time evaluation was also performed to show a
shorter training time, but Bi-LSTM-CNN model additional training time can be
compensated for its better accuracy. This demonstrates that the hybrid model, Bi-
5.2 Recommendations for Future Research
CNN algorithm to plot hotspots on Boston, to make the model more usable for real-
different region dataset to test the modularity of the algorithm. Another idea is to
develop a programme that will get daily real-time data from the crime department and
REFERENCES
International Conference on Information and Knowledge Management (CIKM
’18). Turin, Italy.
Jogendra, K., Sravani, M., Akhil, M., Sureshkumar, P., & Yasaswi, V. (2022). Crime
Rate Prediction Based on K-means Clustering and Decision Tree Algorithm.
Computer Networks and Inventive Communication Technologies, 451-462.
Kang, H.-W., & Kang, H.-B. (2017, April 17). Prediction of crime occurrence from
multimodal data using deep learning. Dept. of Digital Media, Catholic
University of Korea, Bucheon, Gyonggi-Do, Korea.
Kanimozhi, N., N, V. K., G, S. P., Ranjitha, G., & Yuvarani, S. (2021). CRIME TYPE
AND OCCURRENCE PREDICTION USING MACHINE LEARNING
ALGORITHM. International Conference on Artificial Intelligence and Smart
Systems. doi:10.1109/ICAIS50930.2021.9395953
Khan, M., Ali, A., & Alharbi, Y. (2022). Predicting and Preventing Crime: A Crime
Prediction Model Using San Francisco Crime Data by Classification
Techniques.
Kumar, A., Saumyab, S., & Singh, J. P. (2020). NITP-AI-NLP@HASOC-Dravidian-
CodeMix-FIRE2020: A Machine Learning Approach to Identify Offensive
Languages from Dravidian Code-Mixed Text. Forum for Information Retrieval
Evaluation. Hyderabad, India.
Lamari, Y., Freskura, B., Abdessamad, A., Eichberg, S., & Bonviller, S. d. (2020).
Predicting Spatial Crime Occurrences through an Efficient Ensemble-Learning
Model. International Journal of Geo-Information, 9(645).
Mittal, M., Goyal, L. M., & Sethi, J. K. (2018). Monitoring the Impact of Economic
Crisis on Crime in India Using Machine Learning. 1469-1485.
Muthamizharasan, M., & Ponnusamy, R. (2022). Forecasting Crime Event Rate with
a CNN-LSTM Model. Innovative Data Communication Technologies and
Application, 461-470.
Noor, T. H., Almars, A. M., Alwateer, M., Almaliki, M., Gad, I., & Atlam, E.-S.
(2022). SARIMA: A Seasonal Autoregressive Integrated Moving Average
Model for Crime Analysis in Saudi Arabia. Electronics 2022, 11(3986).
Olah, C. (2014, April 6). Neural Networks, Manifolds, and Topology. Retrieved from
https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Ribeiro, J., Meneses, L., Costa, D., Miranda, W., & Alves, R. (2022). Prediction of
Homicides in Urban Centers: A Machine Learning Approach. 344-361.
SAFAT, W., ASGHAR, S., & GILLANI, S. A. (2021, May 6). Empirical Analysis for
Crime Prediction and Forecasting Using Machine Learning and Deep Learning
Techniques. 9, 70080-70094.
Salam, M. A. (2022, April). Time Series Crime Prediction Using a Federated Machine
Learning Model.
Sharma, H. K., Choudhury, T., & Kandwal, A. (2021). Machine learning based
analytical approach for geographical analysis and prediction of Boston City
crime using geospatial dataset. GeoJournal.
Singh, S., Rani, i., Bansal, P., & Techniques, A. S. (2023). Designing of an Efficient
Model for Violence . 2023 International Conference on Advancement in
Computation & Computer Technologies (InCACCT), (pp. 533-538).
Stec, A., & Klabjan, D. (2018, June 5). Forecasting Crime with Deep Learning.
TASNIM, N., IMAM, I. T., & HASHEM, M. M. (2022). A Novel Multi-Module
Approach to Predict Crime Based on Multivariate Spatio-Temporal Data Using
Attention and Sequential Fusion Model. 10, 48009-48030.
Tong, X., Ni, P., Li, Q., Yuan, Q., Liu, J., Lu, H., & Li, G. (2021). Urban Crime Trends
Analysis and Occurrence Possibility Prediction based on Light Gradient
Boosting Machine. 2021 IEEE 4th International Conference on Big Data and
Artificial Intelligence.
Ye, X., Duan, L., & Peng, Q. (2021). Spatiotemporal Prediction of Theft Risk with
Deep Inception-Residual Networks. Smart Cities, 4, 204 - 216.
Yin, J., Michael, I. A., & Afa, I. J. (2020, February 9). Machine Learning Algorithms
for Visualization and Prediction Modeling of Boston Crime Data. p. 15.
ZHANG, X., LIU, L., & XIAO, L. (2020). Comparison of Machine Learning
Algorithms for Predicting Crime Hotspots. 8, 181302-181310.
Zhuang, W., & Cao, Y. (2022). Short-Term Traffic Flow Prediction Based on CNN-
BILSTM with Multicomponent Information. 12.
Zhuang, Y., Almeida, M., Morabito, M., & Ding, W. (2017). Crime Hot Spot
Forecasting:. IEEE International Conference on Big Knowledge. Lowell,
Massachusetts.
APPENDICES