
SPATIO-TEMPORAL CRIME HOTSPOT
DETECTION USING HYBRID MACHINE
LEARNING ALGORITHM TO IMPROVE
PREDICTION ACCURACY
THEEBAN PILLAI ANBALAGU
UNIVERSITI SAINS MALAYSIA
2023
Thesis submitted in fulfilment of the requirements

for the degree of
Master of Science
August 2023
ACKNOWLEDGEMENT
I would like to express my deepest gratitude to my esteemed Lecturer, Dr. Sukumar,
for their invaluable guidance, unwavering support, and encouragement throughout my
academic journey. Dr. Sukumar's expertise and dedication have been instrumental in
shaping my understanding of the subject matter and have inspired me to pursue
excellence in my research. I am also profoundly thankful to Umair Butt for their
mentorship and constructive feedback on building the code and evaluating the report.
Their insightful suggestions and continuous motivation have significantly contributed
to the success of this thesis. Their patient guidance and willingness to share knowledge
have been pivotal in enhancing my research skills and critical thinking abilities. I
would also like to extend my appreciation to all the faculty members of the department
for providing a stimulating academic environment and fostering an atmosphere of
learning. Lastly, I wish to acknowledge my family and friends for their constant
encouragement and understanding throughout this academic pursuit, not forgetting my
beloved late sister, Puntalir Anbalagu, who motivated and supported me to enrol in this
course after completing my degree. Their love and support have been the driving force
behind my achievements. Thank you to all who have contributed to this endeavour in
any capacity. Your support has been invaluable.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF APPENDICES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Research Questions
1.3 Problem Statement
1.4 Objective
1.5 Research Contributions
CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Data Pre-Processing
2.3 Single Machine Learning Algorithms
2.3.1 Decision Tree and Random Forest
2.3.2 Naïve Bayes
2.3.3 Linear Regression (LR)
2.3.4 Autoregressive Integrated Moving Average (ARIMA)
2.3.5 Kernel Density Estimation (KDE)
2.3.6 Gradient Boosting (GB)
2.3.7 Long Short-Term Memory
2.4 Hybrid Deep Learning Algorithms
2.4.1 Bidirectional-LSTM (Bi-LSTM)
2.4.2 LSTM-CNN
2.4.3 Bi-LSTM-CNN
2.5 Performance Evaluation Metrics
2.6 Summary
CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Experimental Setup
3.2.1 Hardware Setup
3.2.2 Software Setup
3.3 Experimental Dataset
3.3.1 Collection of Experimental Dataset
3.3.2 Data Cleaning and Preprocessing
3.4 Model Building and Training
3.4.1 Building the Model
3.4.2 Compiling the Model
3.4.3 Training the Model
3.5 Model Evaluation
3.5.1 Root Mean Squared Error (RMSE)
3.5.2 Mean Absolute Percentage Error (MAPE)
3.5.3 Mean Squared Error (MSE)
3.5.4 R-Squared (R²)
3.5.5 Training Loss
3.5.6 Accuracy of the Model Generated
CHAPTER 4 RESULTS AND DISCUSSION
4.1 Introduction
4.2 Data Analysis
4.3 Results
4.3.1 Accuracy of the Trained Models
4.3.2 Training Loss
4.3.3 RMSE
4.3.4 MAPE
4.3.5 MSE
4.3.6 R-Squared
4.3.7 Time Taken for Each Algorithm for Training
4.4 Summary
CHAPTER 5 CONCLUSION AND FUTURE RECOMMENDATIONS
5.1 Conclusion
5.2 Recommendations for Future Research
REFERENCES
APPENDICES
LIST OF TABLES
Table 2.1: Comparison of Algorithms
Table 3.1: Hardware Specification
Table 4.1: Accuracy of Trained Models
Table 4.2: Loss during Model Training
Table 4.3: RMSE Average
Table 4.4: MAPE Average
Table 4.5: MSE Average
Table 4.6: R-Squared Average
Table 4.7: Average Time Taken for Model Training
LIST OF FIGURES
Figure 2.1: Example of Gradient Boosting Model
Figure 2.2: LSTM Model
Figure 2.3: Bi-LSTM Model
Figure 2.4: CNN-LSTM Model
Figure 2.5: Bi-LSTM-CNN Model
Figure 3.1: Overall Methodology
Figure 3.2: Model Building and Training Flowchart
Figure 4.1: Crime Type Distribution
Figure 4.2: District-wise Crime Distribution
Figure 4.3: Hour-wise Crime Distribution
Figure 4.4: Year-wise Crime Distribution
Figure 4.5: Month-wise Crime Distribution
Figure 4.6: Average Accuracy Comparison
Figure 4.7: Average Training Loss Comparison
Figure 4.8: Average RMSE Comparison
Figure 4.9: Average MAPE Comparison
Figure 4.10: Average MSE Comparison
Figure 4.11: Average R-Squared Value Comparison
Figure 4.12: Average Time Taken for Model Training Comparison
LIST OF ABBREVIATIONS
ARIMA Autoregressive Integrated Moving Average

Bi-LSTM Bidirectional Long Short-Term Memory
BPD Boston Police Department
CNN Convolutional Neural Network
DDoS Distributed Denial-of-Service
DNN Deep Neural Network
DT Decision Tree
GBM Gradient Boosting Machine
KDE Kernel Density Estimation
KNN K-Nearest Neighbor
LR Linear Regression
LSTM Long Short-Term Memory
MAE Mean Absolute Error
MLP Multilayer Perceptron
MSE Mean Squared Error
NB Naïve Bayes
PCA Principal Component Analysis
ReLU Rectified Linear Unit
RF Random Forest
RMSE Root Mean Squared Error
RNN Recurrent Neural Network
SARIMA Seasonal Autoregressive Integrated Moving Average
STNN Spatio-Temporal Neural Network
SVM Support Vector Machine
TFF TensorFlow Federated
LIST OF APPENDICES
Appendix A Python Code for the 4 Algorithms
SPATIO-TEMPORAL CRIME HOTSPOT DETECTION USING
HYBRID MACHINE LEARNING ALGORITHM TO IMPROVE
PREDICTION ACCURACY
ABSTRACT
Crime hotspot detection and prediction are crucial for effective law
enforcement and proactive crime prevention. The goal of this research is
therefore to find a suitable machine learning algorithm for detecting spatiotemporal
crime hotspots. Based on previous research, numerous machine learning methods such
as Decision Tree, Random Forest, Naïve Bayes, Linear Regression, ARIMA, Kernel
Density Estimation, Gradient Boosting, and LSTM were explored. This thesis further
investigated hybrid models including Bi-LSTM, LSTM-CNN, and Bi-LSTM-CNN.
It was found that hybrid models outperform single models in forecasting
spatiotemporal crime hotspots. Using the dataset acquired from the Boston
Police Department, a comparison was then performed to establish the best
performing model. Bi-LSTM-CNN performed best among the compared models,
achieving the highest accuracy, the highest R² score, and the lowest RMSE, MAPE,
MSE, and training time. Overall, law enforcement agencies can use the Bi-LSTM-CNN
hybrid model to prevent crime more effectively.
CHAPTER 1
INTRODUCTION
The main objective of a smart city is to improve the quality of life of its
residents by making better use of the city's resources. The dramatic transformation of
urban areas has a huge influence on cities' socioeconomic growth. Smart city
infrastructure has been developed as a result of technological improvements, and it
primarily focuses on citizens' quality of life, better management of urban population
concerns, and sustainability in all aspects of their lives. Smart cities have enriched
human life by leveraging technology to address socioeconomic issues such as education,
health, transportation, economics, and public safety. However, cities' expanding
populations present challenges arising from the vast quantity of data generated by the
electronic devices that a large urban population uses, including sensors, cameras, and
tracking devices.
Security is a critical component of a country's foundation. It is the obligation
of a country's law enforcement authorities to control crime incidents and crime
threats for the welfare of society. Crime can have a tremendous influence on
a country's economic growth. As a result, countries spend a large share of their GDP
on law enforcement agencies in order to fight crime. Thus, collaboration among
developers, research teams, legal authorities, the industrial community, and residents is
critical for presenting and developing ideas to address smart city difficulties and attain
smart city goals. Cities are becoming overcrowded, pushing governments to launch
smart city programs to improve infrastructure management. Maintaining a safe and
secure environment can be challenging for government officials. For successful
policymaking toward improved and peaceful communities, law enforcement
authorities must study crime trends and patterns.
Intelligent technologies can forecast future crimes and patterns by examining
prior crimes. Researchers can now gather and analyse massive volumes of data thanks
to the rising use of powerful algorithms in criminal investigation. Crime detection
generates patterns from existing data gathered by law enforcement and criminals to
avoid possible human error during classification and identification. The analysis and
prediction of crimes can then be a quick and efficient procedure. Many existing studies
make use of artificial intelligence and machine learning to extract criminal trends and
detect crimes. Even though data processing and classification times are improving
rapidly, accuracy remains an important aspect to consider.
Data mining techniques are becoming increasingly common in the security
sector as businesses and organizations seek to improve their operations by collecting and
analysing huge volumes of data. This research examines crimes based on their
category, time, and location of occurrence. The targeted categories of crime
are rape, murder, robbery, and physical assault, which usually happen in public areas.
These crimes can be associated directly with the time and location of occurrence without
deviation.
1.1 Motivation
The motivation behind this research is to address the limitations of
traditional crime analysis methods and explore the potential of spatio-temporal crime
prediction models. By considering the spatial and temporal aspects of crime data, we
aim to develop more accurate and efficient methods for crime hotspot prediction.
Our main aim is to increase the accuracy of crime hotspot identification by
applying hybrid machine learning techniques. Improving accuracy is the first and
foremost requirement. Traditional approaches may struggle to capture the intricate
relationships and interactions between the various factors contributing to crime hotspots.
Hybrid algorithms have the potential to leverage the strengths of different models and
techniques, leading to more precise and reliable predictions.
Moreover, efficient resource allocation is important for crime prevention.
Hybrid machine learning algorithms can assist in optimizing resource allocation by
identifying crime hotspots with higher precision. Law enforcement agencies can
prioritize these areas, allocating personnel and surveillance systems accordingly,
thereby maximizing the impact of crime prevention efforts.
Apart from that, assessing the efficiency and accuracy of hybrid
machine learning algorithms for crime hotspot detection is crucial for their practical
implementation, so that predictions from low-accuracy models are not misinterpreted.
By evaluating and comparing the performance of different algorithms, we can identify
the most effective approach in terms of both prediction accuracy and efficiency in
detecting crime hotspots.
1.2 Research Questions
• How can we increase the accuracy of crime hotspot prediction for crime
prevention?
• How can the existing models be enhanced to improve crime hotspot
training efficiency and prediction accuracy?
1.3 Problem Statement
Crime analysis and prevention is an important aspect of maintaining public
safety. However, traditional methods of crime analysis often rely on manual inspection
of crime data, which can be time-consuming and prone to errors. To address this issue,
there is a need for an automated approach to crime hotspot detection that can analyze
spatio-temporal crime data and predict the likelihood of crime in certain locations and
times. Crime prediction algorithms for spatio-temporal crime hotspot detection using
machine learning are available, but their performance is lacking, which eventually
increases the rate at which authorities fail to detect and stop crime.
1.4 Objective
The objectives of this research are:
1. To study crime hotspot detection using the existing machine learning
algorithms.
2. To develop a more accurate and efficient machine learning algorithm based on
the available dataset.
3. To compare the existing machine learning algorithm performance with the new
algorithm.
1.5 Research Contributions
The following are the research contributions:
1. A hybrid algorithm for spatio-temporal crime prediction achieves
higher accuracy than traditional models.
2. A combination of a Bi-LSTM and a CNN algorithm is developed, which
results in high accuracy for crime hotspot prediction on the Boston Crime
dataset.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Crime is a major concern in every society, and identifying crime hotspots can
help prevent and reduce criminal activities. A crucial step in crime analysis and
prevention is the identification of spatiotemporal crime hotspots. Law enforcement
organizations can deploy resources and create effective crime-fighting
tactics if they are able to pinpoint the areas and times in which criminal activity is most
prevalent. In recent years, machine learning techniques have been used to identify
hotspots and analyse crime data. The sections below summarize the
application of various machine learning models to the detection of spatiotemporal
crime hotspots in terms of prediction accuracy, the dataset used, and the limitations of
each model.
2.2 Data Pre-Processing
Data preparation cleans, processes, and prepares data so that it can be used in
machine learning. This step is essential because missing values, outliers, or
unrelated attributes in the raw data may have a major impact on the model's accuracy.
Spatio-temporal data, such as the Boston Crime Dataset ((BPD), 2018), require specific
preprocessing to ensure optimal model training. Finding and resolving
missing or erroneous data is a common preprocessing step known as
"data cleaning". This process can be applied to eliminate inaccurate or unneeded data
as well as to impute missing values using techniques such as the mean, median, or mode.
For the Boston Crime Dataset, it is only necessary to drop unused columns
(Salam, 2022).
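The cleaning steps described above can be illustrated with a minimal sketch using only the Python standard library. The records and column names (district, hour, lat, unused_id) are hypothetical and are not the actual schema of the Boston Crime Dataset; the sketch only shows dropping unused columns and imputing missing values with the mean, median, or mode.

```python
from statistics import mean, median, mode

# Hypothetical records; the column names are illustrative assumptions only.
rows = [
    {"district": "A1", "hour": 13, "lat": 42.35, "unused_id": "x1"},
    {"district": "B2", "hour": None, "lat": 42.33, "unused_id": "x2"},
    {"district": None, "hour": 22, "lat": None, "unused_id": "x3"},
    {"district": "A1", "hour": 7, "lat": 42.31, "unused_id": "x4"},
]

def drop_columns(rows, unused):
    """Return copies of the rows without the unused columns."""
    return [{k: v for k, v in row.items() if k not in unused} for row in rows]

def impute(rows, column, strategy):
    """Fill missing (None) values in one column using mean, median, or mode."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]

cleaned = drop_columns(rows, unused={"unused_id"})
cleaned = impute(cleaned, "hour", "median")    # numeric, robust to outliers
cleaned = impute(cleaned, "lat", "mean")       # numeric, continuous
cleaned = impute(cleaned, "district", "mode")  # categorical

for row in cleaned:
    print(row)
```

The median is used for the hour and the mode for the categorical district purely as an example of matching the imputation technique to the column type; the appropriate choice depends on the data.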
2.3 Single Machine Learning Algorithms
2.3.1 Decision Tree and Random Forest
As suggested by its name, this supervised machine learning algorithm creates a
tree-like model with decision nodes and leaf nodes. A leaf node represents a
decision outcome, whereas each decision node splits into two or more branches.
A decision tree can handle both categorical and continuous data. The algorithm is
a simple and useful decision-making diagram. Trees are a simple and practical
way to understand how a decision is reached as well as to visualize the
outcomes of the algorithm. A decision tree's key benefit is that it can swiftly adapt to
the dataset.
The random forest is an algorithm that combines multiple
decision trees and aggregates their outputs, for example by averaging. It is widely
used for both classification and regression problems. A random forest and a decision
tree have almost identical hyperparameters. The ensemble of decision trees is built
on randomly split data. The whole ensemble may be compared to a forest in which
each tree grows from an independent random sample. When too many trees are
present, the random forest technique may become too slow and inefficient for real-time
prediction. In contrast to a single decision tree, the random forest approach produces
its results from numerous decision trees built on randomly selected observations and
features.
Based on a study (Yin, Michael, & Afa, 2020), the accuracy of the random forest
model increases with the number of decision trees used in the
algorithm, but the improvement in accuracy is not large. Although
the comparison between DT and RF was carried out, it is compatible with very large
datasets, since the study reached the bottleneck of the algorithm used. Additionally,
according to another study, decision trees outperform other algorithms for the Boston
crime dataset in terms of precision, recall, and F1-score, resulting in a robust tree that
employs longitude and latitude (Aljuboori, Shaker, & Fadhil, 2022).
Moreover, the Decision Tree algorithm can also be enhanced for the Boston
Crime dataset using Principal Component Analysis (PCA), a method used
for dimensionality reduction. It determines how to project the initial data into fewer
dimensions (Sharma, Choudhury, & Kandwal, 2021). Apart from this, another study
(Jogendra, Sravani, Akhil, Sureshkumar, & Yasaswi, 2022) mentioned that the Decision
Tree algorithm performs very well according to the evaluation metrics
(MAE, R-Squared, RMSE and Accuracy).
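The ensemble idea behind a random forest can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the implementation used in any of the cited studies: each "tree" is a one-level decision stump, each stump is trained on a bootstrap sample, and the forest predicts by majority vote. Random feature subsampling is omitted, and the coordinates and labels are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in range(len(points[0][0])):
        for x, _ in points:
            t = x[f]
            left = [y for xi, y in points if xi[f] <= t]
            right = [y for xi, y in points if xi[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, x):
    f, t, l_lab, r_lab = stump
    return l_lab if x[f] <= t else r_lab

def fit_forest(points, n_trees, rng):
    """Train each stump on a bootstrap sample (sampling with replacement)."""
    return [fit_stump([rng.choice(points) for _ in points])
            for _ in range(n_trees)]

def predict_forest(forest, x):
    """Majority vote across all stumps in the ensemble."""
    votes = Counter(predict_stump(s, x) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical (latitude, longitude) -> crime-level label data.
data = [((42.30, -71.10), "low"), ((42.31, -71.09), "low"),
        ((42.32, -71.08), "low"), ((42.35, -71.06), "high"),
        ((42.36, -71.05), "high"), ((42.37, -71.04), "high")]

forest = fit_forest(data, n_trees=25, rng=random.Random(0))
print(predict_forest(forest, (42.30, -71.10)))
print(predict_forest(forest, (42.37, -71.04)))
```

Because every stump must be evaluated for each prediction, the cost of prediction grows with the number of trees, which is the slowdown noted above for large forests.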
2.3.2 Naïve Bayes
Naïve Bayes is a well-known supervised machine learning approach for classification
applications such as text categorization. It models the input
distribution of each class or category and belongs to the family of generative learning
algorithms. The method is predicated on the assumption that, given the class, the
features of the input data are conditionally independent, enabling the algorithm to
predict outcomes rapidly and precisely.
Naive Bayes classifiers are among the most basic Bayesian network models, yet
when used in conjunction with kernel density estimation, they may attain excellent
accuracy levels. With the aid of this method, the classifier may perform better in
challenging situations where the data distribution is ill-defined by estimating the
probability density function of the input data using a kernel function. As a result, the
naive Bayes classifier is an effective machine learning tool, especially for sentiment
analysis, spam filtering, and text categorization, among other applications.
Mathematically it can be stated:
\[ p(h \mid x) = \frac{p(x \mid h)\, p(h)}{p(x)} \]
Equation 2.1: Naïve Bayes Equation
p(h|x) is the probability of event (h) occurring if (x) is true.
p(x|h) is the probability of event (x) occurring if (h) is true.
The Naïve Bayes model also has multiple variants, such as Gaussian Naïve Bayes
(GaussianNB), Multinomial Naïve Bayes (MultinomialNB), and Bernoulli Naïve Bayes
(BernoulliNB). In a study (Kanimozhi, N, G, Ranjitha, & Yuvarani, 2021) comparing
the three variants, MultinomialNB and GaussianNB had the highest accuracy with
quite low training time, which makes them well suited to real-time prediction as well.
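To make the Bayes rule above concrete, the following is a minimal from-scratch sketch of a Gaussian Naïve Bayes classifier. It is an illustration under stated assumptions, not the method of the cited study: the two features (number of past incidents nearby, distance to the city centre in km) and the labels are hypothetical, and a small constant is added to each variance to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior p(h) and per-feature mean/variance for each class h."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            # Small constant guards against zero variance (an assumption here).
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mu, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def log_gaussian(v, mu, var):
    """Log of the normal density, used as log p(x_i | h)."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)

def predict(model, features):
    """Return the class maximising log p(h) + sum_i log p(x_i | h).

    p(x) is the same for every class, so it can be dropped from the comparison.
    """
    scores = {
        label: math.log(prior) + sum(
            log_gaussian(v, mu, var) for v, (mu, var) in zip(features, stats))
        for label, (prior, stats) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical features: (past incidents nearby, distance to city centre in km).
X = [(9, 1.0), (8, 1.5), (10, 0.8), (1, 6.0), (2, 7.5), (0, 5.5)]
y = ["hotspot", "hotspot", "hotspot", "quiet", "quiet", "quiet"]

model = fit_gaussian_nb(X, y)
print(predict(model, (9, 1.2)))
print(predict(model, (1, 6.5)))
```

Summing per-feature log-likelihoods is exactly where the conditional independence assumption enters: the joint likelihood p(x|h) is treated as a product of one-dimensional densities.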
2.3.3 Linear Regression (LR)
Linear regression is a supervised machine learning technique. It forecasts an
event's outcome from the data for the independent variables by identifying relationships
that can predict that outcome. The technique typically fits a straight line that
corresponds as closely as possible to the various data points. The outcome is a
continuous value or a number. The output might include things like financial earnings
or sales, the quantity of goods sold, etc. In each of these scenarios, there might be one
or more independent variables. The mathematical notation for linear regression is:
y = β0 + β1x + ε
Equation 2.2: Linear Regression Equation
y = dependent variable
x = independent variable
β0 = intercept of the line
β1 = linear regression coefficient (slope of the line)
ε = random error
The linear relationship between a dependent variable (y) and one or more independent
variables (x) is shown by the linear regression procedure. In other words, it establishes
the way a change in the independent variable's value affects the value of the dependent
variable. Independent and dependent variables are related in a straight line with a slope.
In one study, linear regression outperformed three other algorithms
(Decision Tree, Random Forest, and Neural Network), achieving the highest
accuracy of 88% with an MAE score of 0.02538 (Mittal, Goyal, & Sethi, 2018).
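As a small worked example of Equation 2.2, the sketch below estimates β0 and β1 for a single independent variable with the ordinary least-squares closed form and then reports the MAE of the fitted line. The data points are hypothetical and chosen only for illustration; they are unrelated to the results of the cited study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of the intercept (beta0) and slope (beta1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def mean_absolute_error(ys, preds):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

# Hypothetical data: x could be a week index, y an incident count.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta0, beta1 = fit_line(xs, ys)
preds = [beta0 + beta1 * x for x in xs]
mae = mean_absolute_error(ys, preds)
print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, MAE={mae:.3f}")
```

The residuals y - (β0 + β1x) left over after fitting play the role of the random error ε in the equation.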
2.3.4 Autoregressive Integrated Moving Average (ARIMA)
In an autoregression model, the variable of interest is predicted using a linear
combination of the variable's prior values. As suggested by the term "autoregression,"
it is a regression of the variable against itself. Lagged values of the target
variable are used as the input variables to predict future values. An autoregressive
model of order p has the form:
\[ m_t = o + a_1 m_{t-1} + a_2 m_{t-2} + a_3 m_{t-3} + \cdots + a_p m_{t-p} \]
Equation 2.3: Autoregression Part of the Model Equation
In the equation above, the present value of m is a linear function of its prior p values.
The regression coefficients, indexed from 0 to p, are determined by training. One
common way to choose the value of p is to inspect plots of
the autocorrelation and partial autocorrelation functions. The "integrated" part
represents any differencing that must be applied to make the data stationary. The data
may be tested for stationarity using the Dickey-Fuller test, after which various
differencing factors can be tried out. A differencing factor of d = 1 denotes the first
difference, m_t - m_{t-1}. Rather than using past values in a regression-like model,
moving average methods use past prediction errors to forecast future values. A moving
average model can be represented by the following equation:
\[ m_t = o + a_1 e_{t-1} + a_2 e_{t-2} + \cdots + a_q e_{t-q} \]
Equation 2.4: Moving Average Part of the Model Equation
The order of the moving average component is denoted by the letter
"q," and the random residual deviations between the model and the target variable are
denoted by the error term (e) in the equation above. Since "e" can only be determined
after the model has been fitted, it is an unobservable quantity in this case.
SARIMA, which stands for Seasonal ARIMA, adds a seasonality component to the
forecast. The significance of seasonality is obvious, yet ARIMA fails to
capture that information implicitly. The addition of seasonality adds robustness to the
SARIMA model. A study comparing ARIMA and SARIMA found that
SARIMA achieved better accuracy by using the seasonal component and setting the
algorithm's parameters with a grid search (Noor, et al., 2022).
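The autoregressive and integrated parts described in this section can be sketched as a hand-rolled ARIMA(1, 1, 0) forecast: the series is differenced once (d = 1), an AR(1) model is fitted to the differences by least squares, and the forecast is integrated back to the original scale. This is a simplified illustration that leaves out the moving average and seasonal terms. The series is synthetic and was constructed so that its differences follow an exact AR(1) pattern.

```python
def difference(series):
    """First difference (d = 1): m_t - m_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(series):
    """Least-squares estimates of o and a_1 in m_t = o + a_1 * m_{t-1}."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    p_bar = sum(prev) / n
    c_bar = sum(curr) / n
    a1 = (sum((p - p_bar) * (c - c_bar) for p, c in zip(prev, curr))
          / sum((p - p_bar) ** 2 for p in prev))
    o = c_bar - a1 * p_bar
    return o, a1

def forecast_next(series):
    """One-step forecast: predict the next difference, then undo the differencing."""
    diffs = difference(series)
    o, a1 = fit_ar1(diffs)
    next_diff = o + a1 * diffs[-1]
    return series[-1] + next_diff

# Synthetic series whose differences satisfy d_t = 2 + 0.5 * d_{t-1}.
series = [100.0, 110.0, 117.0, 122.5, 127.25, 131.625]

o, a1 = fit_ar1(difference(series))
print(f"o={o:.3f}, a1={a1:.3f}, forecast={forecast_next(series):.4f}")
```

In practice a library implementation would also estimate the moving average coefficients and select p, d, and q, for example through the grid search mentioned above.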
2.3.5 Kernel Density Estimation (KDE)
Kernel Density Estimation (KDE) is a mathematical technique used
to estimate the probability density function of a random variable. The estimator
attempts to determine a population's characteristics based on a limited quantity of data.
KDE addresses a data smoothing problem and is often used in signal processing and
data science because it is a reliable approach for estimating probability density. The
technique essentially makes it possible to produce a smooth curve from a batch of
random data. However, the estimate may also be used to generate points that appear to
have come from a given sample set. This capability is particularly useful for modelling
items and for project simulation.
By visualizing the data, Kernel Density Estimation starts to shape the
distribution's curve. The curve's shape at a particular place in the distribution is
determined by weighting the distance from that place to each data point. The estimate
is larger where more points are clustered nearby, since there is a greater chance of
seeing a point there. The specific function used to weight the points throughout the
data set is called the kernel function. The kernel's shape changes depending on its
bandwidth. A smaller bandwidth restricts the function's application space and makes
the estimated curve appear rough and jagged. The size and shape of the estimate may
be altered by adjusting the kernel function's parameters, such as bandwidth and
amplitude.
For crimes in Bangalore, one study applied a KDE-based algorithm in which the
kernel density function k of features f is computed at each spatial point from the
distances between events (Boppuru, 2023).
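The weighting of nearby points by a kernel function can be illustrated with a one-dimensional Gaussian KDE written in plain Python. This is a minimal sketch and not the algorithm of the cited study: the incident positions (in km along a street) and the bandwidth values are hypothetical, and the estimator is the standard form f(x) = (1 / (n h)) Σ K((x - x_i) / h) with a Gaussian kernel K.

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a density function built from Gaussian kernels centred on the points."""
    n = len(points)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Each point contributes more the closer it lies to x.
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                   for p in points) / norm

    return density

# Hypothetical incident positions along a street, in km.
incidents = [1.0, 1.1, 1.2, 1.3, 4.0]

smooth = gaussian_kde(incidents, bandwidth=0.3)
rough = gaussian_kde(incidents, bandwidth=0.05)

for x in (1.15, 2.5, 4.0):
    print(f"x={x}: h=0.3 -> {smooth(x):.4f}, h=0.05 -> {rough(x):.4f}")
```

The cluster around 1.0 to 1.3 km produces the highest density, and the smaller bandwidth yields a spikier estimate, which matches the description of bandwidth above.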
2.3.6 Gradient Boosting (GB)
Gradient Boosting (GB) is a machine learning algorithm used to solve
classification and regression problems. It is an ensemble learning technique that
creates a powerful predictive model by combining several weak prediction models,
often decision trees.
The basic idea behind gradient boosting is to iteratively build an ensemble of
weak models and optimize them to minimize the errors of the previous models. The
ensemble is built by adding models sequentially, each one attempting to correct the
mistakes of its predecessors. This iterative process makes it a boosting algorithm.
Figure 2.1: Example of Gradient Boosting Model
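The sequential error-correcting idea can be sketched in a few lines. This is an illustrative sketch only: it assumes squared-error loss (so each weak model is fitted to the residuals of the current ensemble), uses single-split decision stumps as the weak models, and runs on hypothetical data. The function names are invented for this example and do not come from any library.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) for the best single split."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the last value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_rounds, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial model: the mean of the targets
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def boosting_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - boosting_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical data: hour of day against number of incidents.
hours = [0, 3, 6, 9, 12, 15, 18, 21]
incidents = [2, 1, 1, 4, 6, 7, 9, 5]

error_after_0 = training_mse(fit_boosting(hours, incidents, n_rounds=0), hours, incidents)
error_after_1 = training_mse(fit_boosting(hours, incidents, n_rounds=1), hours, incidents)
error_after_20 = training_mse(fit_boosting(hours, incidents, n_rounds=20), hours, incidents)
```

Each added stump lowers the training error of the ensemble, which is the sense in which later models correct the mistakes of their predecessors. Libraries such as LightGBM use deeper trees and more general loss functions than this sketch.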
Tong et al. (2021) applied the Light Gradient Boosting Machine (LightGBM) to forecast
crime occurrences based on a San Francisco dataset covering 2001 to 2020. According to
the authors, LightGBM was effective in predicting crime and provided more accurate
forecasts of crime probability than other classification models. Alsirhani et al. (2018)
developed a DDoS detection framework using a Gradient Boosting technique and the Apache
Spark processing engine. The authors found that integrating gradient boosted trees (GBT)
with Apache Spark performed extremely well for detecting DDoS attacks with a greater
depth of decision trees and a greater number of iterations. The results also show that
the size of the dataset, the number of features, the depth of the decision trees, and the
number of iterations have a direct impact on processing delays. Lamari et al. (2020) used
a gradient boosting model to forecast spatial crime occurrences, and it produced the most
accurate predictions for violent crimes, property crimes, motor vehicle thefts,
vandalism, and overall crime count when compared to the Poisson model and neural network
model. Khan et al. (2022) used crime data from San Francisco to compare Random Forest,
Gradient Boosting Decision Tree (GBDT), and Naïve Bayes models for crime prediction and
prevention. After analysing the precision and recall data, it can be concluded that the
GBDT performed the best.
2.3.7 Long-Short Term Memory
A recurrent neural network (RNN) is an artificial neural network that uses
sequential input or time series data. Well-known applications such as Siri, voice search,
and Google Translate use these deep learning techniques. They are commonly used for
ordinal or temporal problems in speech recognition, image captioning, and natural
language processing. Like feedforward and convolutional neural networks (CNNs),
recurrent neural networks require training data to learn.
RNNs are distinguished by their memory: information from previous inputs influences
the current input and output. Unlike traditional neural networks, which assume that
inputs and outputs are independent of one another, recurrent neural networks' outputs
depend on the preceding elements in the sequence. Unidirectional recurrent neural
networks cannot take future events into account in their predictions, even when those
events would be helpful in determining the output of a given sequence.
Long Short-Term Memory (LSTM) networks are a variety of recurrent neural network (RNN)
that has gained popularity for addressing long-term dependencies in sequential data
analysis. Traditional RNNs are susceptible to the vanishing gradient problem, which
makes it challenging to learn long-term dependencies. LSTM is designed to overcome this
problem by employing memory cells and gates to selectively forget or remember
information at certain time steps in a sequence. The input gate, forget gate, output
gate, and memory cell are the four main parts of the LSTM architecture.
Figure 2.2: LSTM Model
The input gate chooses which data from the input to maintain, the forget gate
chooses which data from the memory cell to erase, and the output gate chooses which
data from the memory cell to output. The memory cell gradually stores the data while
selectively updating its contents in response to input. The decision of whether to allow
input or forget a piece of information is made by the input and forget gates using
sigmoid activation functions, which have a range of 0 to 1. The memory cell updates its
state using a hyperbolic tangent function, and the output gate employs sigmoid
activation to choose which portion of the memory cell to output.
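The gate behaviour described above can be sketched for a single LSTM unit with scalar input and state. This is a simplified illustration, not the implementation used by any library: the weight names in the dictionary and their values are hypothetical, and real LSTM layers use weight matrices rather than scalars.

```python
import math


def sigmoid(x):
    """Squashes any value into the range 0 to 1, as used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with scalar input and state."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate: what to keep from the input
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate: what to erase from the cell
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate: what to expose from the cell
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate cell content (tanh)
    c = f * c_prev + i * g                                   # selectively updated memory cell
    h = o * math.tanh(c)                                     # output of this time step
    return h, c


# Hypothetical weights, all set to the same value for simplicity.
weights = {name: 0.5 for name in
           ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")}

# Run the cell over a short hypothetical sequence.
h, c = 0.0, 0.0
outputs = []
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, weights)
    outputs.append(h)

# With the forget gate forced open and the input gate forced shut,
# the memory cell carries its previous content forward unchanged.
frozen = dict(weights, bf=100.0, bi=-100.0)
_, carried = lstm_step(1.0, 0.0, 0.7, frozen)
```

The last two lines show why the design helps with long-term dependencies: when the forget gate stays near 1 and the input gate near 0, the cell state passes through a time step essentially untouched.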
Several studies have reported the usage of an LSTM model for crime detection. In a
comparison of federated LSTM and conventional LSTM using the same parameter values
(Salam, 2022), the two models produced almost identical metric values with no
significant differences, so the conventional LSTM can still be considered a good
traditional algorithm.
In a study that employed historical crime data of public property from 2015 to
2018 in a coastal city in China, Zhang et al. (2020) proposed an LSTM-based crime
prediction model. When simply using historical crime data, the results demonstrated
that the LSTM model outperformed alternative approaches such as KNN, RF, SVM,
NB, and CNN.
Similarly, a study by Abdul Salam (2022) proposes a federated long short-term memory
(LSTM) model for forecasting the frequency of crimes and contrasts it with a conventional
LSTM model. Based on the Boston crime dataset ((BPD), 2018), the proposed model is built
using TensorFlow Federated (TFF) and the Keras API. The model's parameters are adjusted
to increase precision and reduce loss. The results show that the federated LSTM model
performs better than the traditional LSTM model in terms of decreased loss and enhanced
accuracy, although it requires longer training times.
Han et al. (2020) propose a daily crime prediction model that combines a Long
Short-Term Memory (LSTM) network with a Spatial-Temporal Graph Convolutional Network
(ST-GCN) to identify high-risk regions automatically and effectively in Chicago, USA.
The goal is to address crime-related issues in urban neighbourhoods. Based on the study,
it can be concluded that hybrid models have a better prediction capability for crimes
based on the sliding time window.
In another study, Dewan et al. (2022) used a Convolutional Neural Network (CNN) together
with a Long Short-Term Memory (LSTM) network (hence CLSTM-NN) to forecast the occurrence
of criminal activity in Baltimore, USA. The results show that while the significance
values for Random Forest and Ridge are both greater than 0.05, those for LSTM and the
Integrated Model are both lower than 0.05. This implies that while the findings from
Ridge and Random Forest are not significant, the results from LSTM and the Integrated
Model are significant and reliable.
Zhuang et al. (2017) propose the Spatio-Temporal Neural Network (STNN) as a technique
for precisely anticipating crime hotspots by incorporating spatial information. They
evaluate the model using call-for-service data collected over a five-year period between
March 2012 and the end of December 2016 by the Portland, Oregon Police Bureau (PPB).
They compare their model with the most advanced classification methods, including
Multi-Layer Perceptron, Gaussian Naive Bayes, Random Forests, K-Nearest Neighbours, and
Decision Trees. The STNN (LSTM) model outperformed each of the conventional machine
learning techniques.
Safat et al. (2021) used several machine learning algorithms, namely LR, SVM, NB, KNN,
DT, MLP, RF, and XGBoost, and time series models such as LSTM and ARIMA to forecast
crime based on the criminal records for the cities of Chicago and Los Angeles. It was
found that in terms of root mean square error (RMSE) and mean absolute error (MAE), LSTM
performed reasonably well for time series analysis compared to ARIMA on both data sets.
However, the authors acknowledge several drawbacks to employing LSTM. First, since LSTM
models need a lot of training data to make accurate predictions, they performed better
on the Chicago dataset, which contains plenty of instances, than on the Los Angeles
dataset, which contains fewer instances. In addition, training LSTM models is
computationally expensive and takes a long time. Finally, to reach optimal performance,
LSTM models require meticulous hyperparameter tuning, which can be a difficult iterative
process.
Researchers frequently integrate LSTM with other machine learning methods, such as
CNNs, random forests, or support vector machines, to get around these constraints. These
hybrid models can increase the precision of crime detection models by utilizing the
advantages of several methodologies. Overall, LSTM is an effective method for processing
temporal data, but because crime data includes both temporal and geographical
components, it might not be enough on its own to detect crimes. Combining LSTM with
additional techniques can address the constraints of crime detection models and increase
their accuracy.
2.4 Hybrid Deep Learning Algorithms
Deep Neural Networks (DNNs) are widely used for image classification, translating
encoded data into more comprehensible knowledge. At each layer, DNNs transform the data
and generate a new representation. In a classification problem, DNNs attempt to
categorize the data, improving this process layer by layer until the desired result is
obtained. This work may be thought of as the separation of lower-dimensional manifolds
in a data space, in accordance with the manifold hypothesis, which claims that natural
data forms lower-dimensional manifolds in its embedding space (Fefferman, Mitter, &
Narayanan, 2016; Olah, 2014).
2.4.1 Bidirectional-LSTM (Bi-LSTM)
A Bidirectional LSTM (Bi-LSTM) is a recurrent neural network used mostly for natural
language processing. Unlike an ordinary LSTM, the input flows in both directions, so the
network can use information from both sides, which makes it a valuable tool for
modelling the sequential dependencies between words and phrases in both directions of
the sequence. Bi-LSTM adds one more LSTM layer that reverses the direction of
information flow: the input sequence flows backward through the additional LSTM layer,
and the outputs from the two LSTM layers are then combined in one of several ways,
including average, sum, multiplication, and concatenation.
Figure 2.3: Bi-LSTM Model
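The forward and backward passes and the ways of combining them can be sketched as follows. To keep the example short, a toy tanh recurrence stands in for each LSTM layer (an assumption made for illustration only), and the function names are hypothetical. The merge names `concat`, `sum`, `ave`, and `mul` mirror the `merge_mode` options of the Keras `Bidirectional` wrapper.

```python
import math


def run_recurrent(sequence, weight=0.5, recurrent_weight=0.8):
    """Toy recurrent layer (a stand-in for an LSTM): one output per time step."""
    h = 0.0
    outputs = []
    for x in sequence:
        h = math.tanh(weight * x + recurrent_weight * h)
        outputs.append(h)
    return outputs


def bidirectional(sequence, merge="concat"):
    """Run the layer in both directions and merge the outputs per time step."""
    forward = run_recurrent(sequence)
    # The backward layer reads the sequence in reverse; its outputs are
    # reversed again so that index t lines up with time step t.
    backward = run_recurrent(sequence[::-1])[::-1]
    if merge == "concat":
        return [[f, b] for f, b in zip(forward, backward)]
    if merge == "sum":
        return [f + b for f, b in zip(forward, backward)]
    if merge == "ave":
        return [(f + b) / 2.0 for f, b in zip(forward, backward)]
    if merge == "mul":
        return [f * b for f, b in zip(forward, backward)]
    raise ValueError(f"unknown merge mode: {merge}")


# Hypothetical input sequence with three time steps.
sequence = [1.0, 0.0, -1.0]
concatenated = bidirectional(sequence, merge="concat")
summed = bidirectional(sequence, merge="sum")
averaged = bidirectional(sequence, merge="ave")
```

At the first time step the forward output has seen only the first input, while the backward output has already seen the whole sequence. This is how the merged representation carries both past and future context.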
According to Chandra et al. (2021), bi-directional LSTM networks (BD-LSTM), initially
developed for word embedding in natural language processing, operate similarly to
BD-RNNs in accessing long-range context or state in both directions. As opposed to
typical LSTM networks, a BD-LSTM takes inputs in two separate directions, one from the
past to the future and the other from the future to the past. Processing the reversed
sequence preserves state information from the future. Thus, by merging the two hidden
states, the network can always maintain information from both the past and the future.
Chandra et al. (2021) investigated the performance of deep learning techniques such as
simple RNN, LSTM networks, Bi-LSTM networks, encoder-decoder LSTM networks, and CNN, and
found that bi-directional LSTM networks with encoder-decoders perform better than other
models for both simulated and real-world time series.
Butt et al. (2022) proposed a hybrid of Bi-LSTM and Exponential Smoothing (ES) for crime
forecasting. The proposed method was assessed using crime statistics from 2010 to 2017
for New York City. It performed better than the state-of-the-art Seasonal Autoregressive
Integrated Moving Average (SARIMA) model, with lower Mean Absolute Percentage Error
(MAPE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). As a result, by
predicting crime patterns, the proposed technique can aid law enforcement authorities in
reducing and eliminating crime.
The Bi-LSTM neural network that Deepak et al. (2021) presented categorizes
the various sorts of crime based on information gathered from Google News and
Twitter. HREACC, HANNCC, HMNCC, and GSCA were contrasted with the
suggested technique. For all four different standard datasets, it was discovered that the
suggested method beat the others with an average improvement in accuracy of 9.2%
and a very low False-Negatives Ratio value of 0.11. Thus, by using a Bi-LSTM neural
network, which is an effective option and has been trained on a sizable dataset with a
wide range of data, it is possible to classify fresh items of evidence with greater accuracy
and speed than with other methods.
Tasnim et al. (2022) conducted a study using deep learning techniques as an efficient
multi-module strategy for forecasting crime. Decision Level Fusion and Feature Level
Fusion are the two modules of the proposed methodology. The first module uses the Fusion
model, Stacked Bidirectional LSTM, and Temporal-based Attention LSTM, where the Fusion
model uses the training data from the other two models. The temporal models are the key
models for the transfer learning technique across the datasets of several cities, which
reduces the learning model's training time. The second module produces the final
prediction using the Spatio-Temporal based Attention-LSTM, Stacked Bidirectional LSTM,
and the outcome of feature-level fusion. Based on the information from the previous 24
hours, the proposed model forecasts the upcoming hour. Its output may be the estimated
number of crimes in any category for a specific location. Additionally, it gives law
enforcement information on potential criminal activity based on category, location, and
time. The experimental analysis focused primarily on the American cities of San
Francisco and Chicago.
For the San Francisco and Chicago datasets respectively, the model's mean absolute error
is 0.008 and 0.02, its coefficient of determination is 0.95 and 0.94, and its symmetric
mean absolute percent error is 1.03% and 0.6%. Thus, the proposed machine learning model
performs better than several other well-known models, such as SARIMAX.
In conclusion, Bidirectional LSTM (Bi-LSTM) has demonstrated promising performance as a
machine learning model for forecasting spatiotemporal crime. By making use of the
bidirectional feature of the LSTM architecture, Bi-LSTM models can successfully capture
both past and future associations in sequential crime data.
2.4.2 LSTM-CNN
The Long Short-Term Memory and Convolutional Neural Network (LSTM-CNN) model, a
combination of two potent neural network models, is a suggested machine learning
approach for identifying spatiotemporal crime. CNN is well suited to processing images
and signals, whereas LSTM is designed to cope with sequential data. Several studies have
employed the LSTM-CNN model to forecast crime in diverse contexts.
Figure 2.4: CNN-LSTM Model
According to a study by Muthamizharasan and Ponnusamy (2022), combining CNN with the LSTM
model can produce a reliable crime prediction approach with high forecast accuracy. CNN
was used to extract the attributes and LSTM to forecast the crime rate. The calculated
R-squared and MAE values were 0.99 and 0.0027, respectively, indicating that the
forecast is accurate (Muthamizharasan & Ponnusamy, 2022).
In another study, Dewan et al. (2022) used a Convolutional Neural Network (CNN) together
with a Long Short-Term Memory (LSTM) network (hence CLSTM-NN) to forecast the occurrence
of criminal activity in Baltimore, USA. The results show that while the significance
values for Random Forest and Ridge are both greater than 0.05, those for LSTM and the
Integrated Model are both lower than 0.05. This implies that while the findings from the
Ridge and Random Forest models are not significant, the results from the regression
models using the LSTM and the Integrated Model are significant and reliable.
Convolutional neural networks with long short-term memories (CNN-LSTM)
were used in surveillance systems by Esan et al. (2020) to identify unusual behavioural
patterns in an academic setting. While the LSTM uses the gate mechanism to store
important information for memory, the CNN extracts the image features from the
picture frame sequences. The outcomes are contrasted with those from detection models
that are already in use, such as models built using dictionaries, motion deep nets, social
force, and probabilistic principal analysis. The outcomes demonstrate that the suggested
approach performs better than the others stated with 86% accuracy.
To conclude, various studies have demonstrated encouraging outcomes when
using LSTM-CNN machine learning algorithms for spatio-temporal crime hotspot
detection. These algorithms are capable of precisely capturing the temporal and spatial
characteristics of crime data as well as identifying high-risk locations for criminal
activity. For these algorithms to be useful in various circumstances, more study is
required.
2.4.3 Bi-LSTM-CNN
A Bidirectional-LSTM-CNN model is a model that combines Bi-LSTM and CNN architectures.
The CNN-Bi-LSTM model makes predictions via the fully connected layer by selecting the
spatial features of the data through the CNN, extracting the temporal information with
the Bi-LSTM, and combining the three components (Zhuang & Cao, 2022).
Figure 2.5: Bi-LSTM-CNN Model
The hybrid attention-based Bi-LSTM and CNN network was used by Kumar et al. (2020) to
identify offensive language in Dravidian code-mixed text. Character embedding was done
through the CNN network, and word embedding was done via the attention-based Bi-LSTM.
For word embedding, FastText embeddings were developed using language-specific code-mixed
Tamil and Malayalam text for the Tamil and Malayalam models, respectively. One-hot
encoding vectors were utilised for character encoding. After combining the output from
the CNN and attention-based Bi-LSTM layers, a softmax layer is used to predict offensive
and neutral text. Hyperparameters can affect how well deep neural networks operate, so
extensive testing was run by varying the learning rate, batch size, optimizer, epochs,
loss function, and activation function. The parameters that gave the proposed system the
highest performance were a learning rate of 0.001, a batch size of 32, Adam as the
optimizer, 100 epochs, binary cross-entropy as the loss function, ReLU activation in the
internal layers of the network, and softmax activation at the output layer.
A Bi-attention-LSTM-CNN hybrid model was used by Guo et al. (2020) to distinguish the
discriminative characteristics of charges as an internal mapping between fact
descriptions and charges. In text classification, CNN seeks to identify global text
features while Bi-LSTM concentrates on local text features. The hybrid model, which
combines Bi-LSTM and CNN, was proposed to extract every spatiotemporal feature of the
text description because neither model can do so on its own. As a result, the model
obtained more accurate text feature information and boosted the accuracy of text
classification.
Singh et al. (2023) studied the detection of violence using advanced deep learning
techniques, where ConvLSTM, 2D LSTM-CNN, and 2D Bi-LSTM-CNN models were used to analyse
fight scenes in surveillance videos. The ConvLSTM model was reported to be slower and
less effective overall than the other two models. The CNN Bi-LSTM model consistently
beat the CNN LSTM on the CCTV dataset and achieved the highest degree of accuracy. On
the hockey dataset, it marginally outperformed CNN LSTM. Two-dimensional convolutional
neural networks were used in both the LSTM and Bi-LSTM models to extract features from
frames. However, the Bi-LSTM layer showed better results and was found to be more
efficient for identification since it analysed temporal data in both directions rather
than just the forward direction.
According to previous studies, the Bi-LSTM-CNN hybrid model has great potential to be
applied as a tool to study spatio-temporal data. Therefore, the Bi-LSTM-CNN hybrid model
is proposed for the study of spatio-temporal crime detection.
2.5 Performance Evaluation Metrics
Butt et al. (2022) state that a machine learning model needs to be carefully assessed
to ensure its accuracy in interpreting a complex phenomenon given a small number of data
points and to investigate the proper application of the same models to fresh datasets.
In the study by Zhuang and Cao (2022), the performance of the model in predicting crime
hotspots is validated using RMSE, MAPE, and MSE. R² is mentioned in another work by Butt
et al. (2022) as being a crucial evaluation statistic for LSTM hybrid models. The use of
these performance evaluation criteria is justified by the fact that similar studies have
frequently employed them in prior research. Additionally, MSE was chosen since it is
scale-dependent and weights outliers heavily.
Given the large dataset in this study, RMSE is a reasonable metric to assess the model's
performance and is appropriate for LSTM-related models (Butt, Letchmunan, Hassan, & Koh,
2022). Similarly, RMSE is sensitive to outliers and is heavily reliant on the percentage
of data. It is also easily comprehensible and indirectly illustrates the model's
expected accuracy.
2.6 Summary
| Ref. | Model | Accuracy | Parameters | Limitations | Dataset |
| (Yin, Michael, & Afa, 2020) | Decision Tree & Random Forest | 51.68% | The accuracy is very low and not significant | The accuracy is very low and not significant | Boston Crime Dataset |
| (Ribeiro, Meneses, Costa, Miranda, & Alves, 2022) | Random Forest | 0.76 | Max Depth = 12, bootstrap = true, n_estimator = 100, min sample leaf = 1, CCP alpha = 0.0, criterion = gini, max feature = log2 | RF performs the best compared to other models, but done on specific crime, homicides only. | Crime Data Para, Brazil |
|  | Decision Tree | 0.71 | Min sample leaf = 40, criterion = entropy, splitter = random, min samples split = 2 |  |  |
|  | Neural Network | 0.74 | Learning rate = invscaling, solver = adam, activation = tanh, epoch = 300, hidden layer size = 3 |  |  |
| (Zhuang, Almeida, Morabito, & Ding, 2017) | STNN-LSTM | 0.815 | Memory Neurons = 20, Learning Rate = 0.0005, Activation = relu | There is no limitation, however future development will include more features into the hybrid model. | CFS data, Portland |
|  | Decision Tree | 0.76 | Criterion = gini, max depth = 5 |  |  |
|  | Gaussian-NB | 0.743 | - |  |  |
|  | Random Forest | 0.7625 | Estimators = 10, min samples split = 2 |  |  |
|  | KNN | 0.6375 | K = 1, Distance measure = L2 |  |  |
|  | Logistic Regression | 0.75 | Epochs = 300, penalty = L2 |  |  |
|  | ML Perceptron | 0.7675 | Hidden layer size (100, 50), Learning Rate (0.001), Activation (relu) |  |  |
| (Aljuboori, Shaker, & Fadhil, 2022) | Decision Tree, Naïve Bayes, Logistic Regression | Precision, Recall and F1 | No parameter info, but DT performs best in all performance | To adopt gender and identity of offender as classifier attributes | Boston Crime Dataset |
| (Salam, 2022) | LSTM, Federated LSTM | Almost similar output, but FLSTM is better slightly | Loss = Huber, Optimizer = SGD, Adam, Metrics = all, Data size = 319073, Batch size = 4, Window size = 60, round number = 5 | Consider the global optimization and reduce the communication overhead | Boston Crime Dataset |
| (Butt, Letchmunan, Hassan, & Koh, 2022) | ES-Bi-LSTM | Lowest MAPE, RMSE, MAE & R^2 | Seasonal = 24, Batch size = 48, Epoch = 10, optimizer = RMSProp, MSE | Using the knowledge of the criminal domain to improve transfer learning to increase the accuracy of crime prediction | New York City Crime Dataset |
| (Kanimozhi, N, G, Ranjitha, & Yuvarani, 2021) | Naïve Bayes | Accuracy = 93.07% | No parameter information | If class label is absent, the probability of estimation will be zero. Use different models in enhance performance | Denver City Crime Dataset |
| (SAFAT, ASGHAR, & GILLANI, 2021) | ARIMA | No value | No parameter information | ARIMA performs better in future predictions & crime trend LSTM, evaluated using RMSE and RAE only. | Chicago and Los Angeles Crime Dataset |
|  | LSTM | RMSE = 8.78, MAE = 6 | Epochs = 40, batch = 31 |  |  |
| (Boppuru, 2023) | Kernel Density Estimation (KDE) | Accuracy = 77.49% | No parameter information | Performance is like ARIMA model, | Bengaluru Crime Dataset |
| (Muthamizharasan & Ponnusamy, 2022) | LSTM-CNN | R^2 = 0.99, MAE = 0.0027 | No parameter information | Missing of data monthly, quarterly, or seasonal basis to produce more accurate result. | Chennai City Crime Dataset |
| (Noor, et al., 2022) | SARIMA | R^2 = 0.853, MAE = 0.066059 | Grid Search method to determine the optimal parameters | Not all types of crime are used for training the model. | Saudi Arabia Crime Dataset |
| (Stec & Klabjan, 2018) | Feed Forward | 71.3% | No parameter information | Accuracy drops not clear due to fewer training examples | Chicago and Portland Crime Dataset |
|  | CNN | 72.7% |  |  |  |
|  | RNN | 74.1% |  |  |  |
|  | CNN + RNN | 75.6% |  |  |  |
| (Mittal, Goyal, & Sethi, 2018) | Decision Tree | 85.75% | No parameter information | No limitation stated | NCRB |
|  | Random Forest | 88.61% |  |  |  |
|  | LR | 89.61% |  |  |  |
|  | NN | 88.31% |  |  |  |
| (Jogendra, Sravani, Akhil, Sureshkumar, & Yasaswi, 2022) | Decision Tree + K-means | 80% | No parameter information | Recommended to use CNN or DNN for the future use | Chicago Crime Dataset |
| (Kang & Kang, 2017) | DNN | 83.25% | Layer1 = 256, Layer2 = 256, Layer3 = 128, Layer size = (1024, 1024, 2), activation = softmax | Unable to implement DNN with insufficient dataset, which will cause performance degradation. | Chicago Crime Dataset |
| (Anuvarshini, Deeksha, C, & Krishna, 2022) | LSTM-CNN | MAE = 24.56, RMSE = 30.11 | Kernel size = 2, convo1D = 1, stride window size = 30, batch size = 32, epochs = 100 | Full CNN performs better than LSTM | Boston Crime Dataset |
| (Deepak, Rooban, & Santhanavijayan, 2021) | Bi-LSTM | Precision, recall, f-measure, accuracy, FNR | No parameter information | Performance varies with dataset, but has an average of 80% and above | Multiple crime dataset |
| (Singh, Rani, Bansal, & Techniques, 2023) | Bi-LSTM-CNN | Accuracy | MaxPooling2D, 16 filters of size 3×3 & 4×4, dropout = 0.75, Conv2D with 64 filters | Bi-LSTM-CNN has the highest performance | Image data |
Table 2.1: Comparison of Algorithms
Overall, the hybrid models performed better than the singular models, excluding the LSTM
model, in previous research. Therefore, this study compares LSTM, Bi-LSTM, LSTM-CNN, and
Bi-LSTM-CNN to identify the better performing model in predicting spatio-temporal crime
hotspots. The hybrid models, Bi-LSTM and LSTM-CNN, have shown very convincing accuracy
and training loss in previous studies based on structured crime datasets. According to
previous studies, Bi-LSTM-CNN has shown great potential in classifying spatio-temporal
data, although few studies have applied the Bi-LSTM-CNN model to spatio-temporal crime
hotspot prediction. Hence, this study assesses the hybrid model Bi-LSTM-CNN for
spatio-temporal crime hotspot prediction. Based on the best model, evaluation metrics
such as RMSE, MAPE, MSE, and R² score can be used to obtain an optimum comparison
result.
CHAPTER 3
METHODOLOGY
3.1 Introduction
The overall workflow for this experiment is shown below.
Start → Experimental Setup (HW/SW) → Collection of Experimental Dataset → Dataset Preprocessing → Model Building & Training → Model Evaluation → End
Figure 3.1: Overall Methodology
3.2 Experimental Setup
3.2.1 Hardware Setup
The experiment was conducted using the hardware specification below:
| Components | Specification |
| CPU | Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz |
| GPU | NVIDIA Quadro RTX 3000 |
| Storage | 250GB SSD + 2TB HDD |
| RAM | 32.0 GB |
Table 3.1: Hardware Specification
3.2.2 Software Setup
Python code was developed with the Keras package to model all four algorithms for
predicting the spatio-temporal crime hotspot locations. TensorFlow was used to select
between CPU and GPU devices for training and testing. Below is the list of dependencies
and libraries installed prior to the code execution:
• Jupyter Notebook
• Keras Layers (LSTM, Conv1D, MaxPooling1D, Flatten, Dense, Dropout,
Bidirectional)
• Keras Optimizer (Adam)
• Tensorflow
• Pandas
• Numpy
• Matplotlib
• Sklearn
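The selection between CPU and GPU mentioned above could look like the following sketch. The thesis does not show its device-selection code, so this is only one plausible way to do it with TensorFlow; the variable names and the small matrix product are illustrative.

```python
import tensorflow as tf

# Ask TensorFlow which GPUs it can see; the list is empty on a CPU-only machine.
gpus = tf.config.list_physical_devices("GPU")

# Fall back to the CPU when no GPU is available.
device_name = "/GPU:0" if gpus else "/CPU:0"

# Operations created inside this block are placed on the chosen device.
with tf.device(device_name):
    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    product = tf.matmul(x, x)
```

Model building and training calls can be wrapped in the same `tf.device` block so that they run on the selected device.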
3.3 Experimental Dataset
3.3.1 Collection of Experimental Dataset
This dataset was collected from the Boston Police Department (BPD), which documents the
initial details surrounding an incident to which BPD officers respond ((BPD), 2018). The
dataset contains records from the crime incident report system from June 14, 2015 to
September 3, 2018, and includes a reduced set of fields focused on capturing the type of
incident as well as when and where it occurred. The dataset consists of 17 attributes
(columns) and 327,820 records (rows). The crime data was obtained in the form of a CSV
file.
Description of attributes:
• INCIDENT_NUMBER: Crime case number registered by the police.
• OFFENSE_CODE: Specific kind of crime code.
• OFFENSE_CODE_GROUP: Conducted crime activity name.
• OFFENSE_DESCRIPTION: Detailed specification of crime.
• DISTRICT: Neighbourhood 133 in Boston.
• REPORTING_AREA: Incident reported location to the police.
• SHOOTING: “Y” means shooting occurred during crime.
• OCCURRED_ON_DATE/YEAR/MONTH/DAY_OF_WEEK/HOUR:
Time of crime.
• UCR_PART: Rate of the crime, part 1 is the highest rank.
• STREET/LATITUDE/LONGITUDE/LOCATION: Location of crime
happened.
3.3.2 Data Cleaning and Preprocessing
First, the dataset was checked for empty cells. The “SHOOTING” column only contains ‘Y’
values, so its empty cells were filled with the value ‘N’. Some rows had shifted
columns; these 2434 rows were removed to clean the dataset. Then, the dataset was
partitioned so that 80% was used for training the models and the remaining 20% was used
for testing and evaluating the models.
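The cleaning steps above can be sketched with pandas on a small hypothetical table. Two points are assumptions rather than details from the thesis: a "shifted" row is detected here by a LATITUDE value that cannot be parsed as a number, and the 80/20 partition is taken in row order rather than at random.

```python
import pandas as pd

# Hypothetical miniature of the incident table with only the columns used here.
raw = pd.DataFrame({
    "INCIDENT_NUMBER": [f"I{n:03d}" for n in range(12)],
    "SHOOTING": ["Y", None, None, None, "Y", None, None, None, None, None, None, None],
    "LATITUDE": ["42.35", "42.36", "B2", "42.33", "42.30", "42.31",
                 "42.34", "WASHINGTON ST", "42.29", "42.32", "42.37", "42.28"],
})

clean = raw.copy()

# Empty SHOOTING cells mean that no shooting was recorded.
clean["SHOOTING"] = clean["SHOOTING"].fillna("N")

# Assumed criterion for a shifted row: LATITUDE does not hold a number.
clean["LATITUDE"] = pd.to_numeric(clean["LATITUDE"], errors="coerce")
clean = clean.dropna(subset=["LATITUDE"]).reset_index(drop=True)

# 80% of the remaining rows for training, the last 20% for testing.
split = int(len(clean) * 0.8)
train, test = clean.iloc[:split], clean.iloc[split:]
```

On this toy table, two of the twelve rows are dropped as shifted, leaving eight rows for training and two for testing.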
3.4 Model Building and Training
The four algorithms below were built to train models on, and test them against, the
Boston Crime Dataset. The following sections describe the hyperparameters used for this
project.
Figure 3.2: Model Building and Training Flowchart
3.4.1 Building the Model
Below is the flow for generating and testing the models:
1. A sequential model was initialized using the Sequential() function, to stack layers
one after another.
2. Below are the layers added for each respective model:
   a. LSTM
      i. An LSTM layer with 50 memory units.
      ii. Input data was initialized and determined by the number of time steps and the
      number of features, which are 1 for both.
   b. Bi-LSTM
      i. A Bi-LSTM layer was added to the model.
      ii. The layer wraps an LSTM layer with 50 memory units.
      iii. Input data was initialized and determined by the number of time steps and the
      number of features.
   c. LSTM-CNN
      i. A 1D convolutional layer with 32 filters and a kernel size of 3, using the
      ‘ReLU’ activation function.
      iii. A 1D max pooling layer with a pool size of 2.
      iv. An LSTM layer with 50 memory units.
   d. Bi-LSTM-CNN
      i. A 1D convolutional layer with 32 filters and a kernel size of 3, using the
      ‘ReLU’ activation function.
      iii. A 1D max pooling layer with a pool size of 2.
      iv. A Bi-LSTM layer with 50 memory units.
3. A dropout layer with a rate of 0.2 was added to all the models.
4. A dense layer with a number of units equal to the number of classes in the target
variable was added. The activation function used is ‘SoftMax’.
3.4.2 Compiling the Model
Compiling the model after the layers are added:
1. The optimizer was set to ‘Adam()’
2. The loss function was set to ‘categorical_crossentropy’.
3. The metric used to evaluate the model during training is set to accuracy.
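The layer stacks of Section 3.4.1 and the compile settings above can be sketched with Keras as follows. This is an illustrative reconstruction, not the thesis code: the function name `build_model`, the `kind` labels, and the example values of 8 time steps and 5 classes are assumptions. The layer sizes, dropout rate, activation functions, optimizer, loss, and metric are the ones stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_model(kind, time_steps, n_features, n_classes):
    """Build and compile one of the four models described in the text."""
    if kind not in ("lstm", "bi-lstm", "lstm-cnn", "bi-lstm-cnn"):
        raise ValueError(f"unknown model kind: {kind}")

    model = tf.keras.Sequential()
    model.add(layers.Input(shape=(time_steps, n_features)))

    # The CNN variants start with a convolution and a pooling layer.
    if kind.endswith("-cnn"):
        model.add(layers.Conv1D(filters=32, kernel_size=3, activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2))

    # Recurrent part: a plain LSTM or an LSTM wrapped in Bidirectional.
    if kind.startswith("bi-"):
        model.add(layers.Bidirectional(layers.LSTM(50)))
    else:
        model.add(layers.LSTM(50))

    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(n_classes, activation="softmax"))

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Hypothetical shapes: 8 time steps, 1 feature, 5 target classes.
models = {
    kind: build_model(kind, time_steps=8, n_features=1, n_classes=5)
    for kind in ("lstm", "bi-lstm", "lstm-cnn", "bi-lstm-cnn")
}
```

One design point is worth noting. With the default "valid" padding, a kernel size of 3 followed by a pool size of 2 needs at least 4 time steps of input, so the CNN variants in this sketch use a longer window than the single time step stated for the LSTM model. The thesis does not state the input shape used for the CNN variants.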
3.4.3 Training the Model
Training the model compiled based on the layers declared:
1. The input dataset was reshaped to have dimensions to match the input shape of
the model layers used:
a. LSTM
b. Bi-LSTM
c. LSTM-CNN
d. Bi-LSTM-CNN
2. The categorical target variable (y_train) was converted into one-hot encoded
vectors.
3. The number of epochs, which is the number of times the entire dataset is passed
through the model during training, was set to 50.
4. The batch size, which is the number of samples used in each gradient update, was set
to 32.
5. Optionally, progress display was also enabled (verbose=1).
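The reshaping and one-hot encoding steps above can be sketched with NumPy. The helper names `to_model_input` and `one_hot` and the sample values are hypothetical. The shape of one time step and one feature follows the LSTM description in Section 3.4.1, and the shapes used for the other three models are not stated in the text.

```python
import numpy as np


def to_model_input(values, time_steps, n_features):
    """Reshape a flat array into (samples, time_steps, n_features)."""
    x = np.asarray(values, dtype="float32")
    return x.reshape((-1, time_steps, n_features))


def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot encoded rows."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(n_classes, dtype="float32")[labels]


# Hypothetical feature values and integer-coded class labels.
x_train = to_model_input([3, 7, 1, 4], time_steps=1, n_features=1)
y_train = one_hot([0, 2, 1, 2], n_classes=3)

# With a compiled Keras model, training would then use the stated settings:
# model.fit(x_train, y_train, epochs=50, batch_size=32, verbose=1)
```

Keras also provides `tensorflow.keras.utils.to_categorical`, which performs the same one-hot conversion as the `one_hot` helper here.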
3.5 Model Evaluation
Model evaluation entails refining the model's parameters and assessing the model's
effectiveness using validation data. The usefulness of the model in identifying crime
hotspots is assessed using a variety of measures, including Root Mean Squared Error
(RMSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), F-2 Score,
training loss, and accuracy. The model is then tested on new data to evaluate its
generalizability and robustness. The final step involves analysing the results of the
experiments and drawing conclusions about the effectiveness of the proposed approach.
3.5.1 Root Mean Squared Error (RMSE)
The root mean squared error (RMSE) is the square root of the mean of the squared
errors. RMSE is widely regarded as a good general-purpose error metric for numerical
forecasts. Because RMSE is scale-dependent, it should only be used to compare the
prediction errors of different models or model configurations for a single variable,
not across variables. It measures how closely the predictions match the observed
data. The RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left(\mathrm{Predicted}_i - \mathrm{Actual}_i\right)^2}{N}}$$
Equation 3.1: RMSE Equation
$\mathrm{Predicted}_i$ = The predicted value for the i-th observation.
$\mathrm{Actual}_i$ = The observed (actual) value for the i-th observation.
$N$ = Total number of observations.
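As a small illustration of Equation 3.1, the sketch below computes the RMSE with the Python standard library only. The function name `rmse` and the toy values are assumptions made for this example and are not the thesis implementation.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Toy values (assumption): errors are 2, 0 and -2, so RMSE = sqrt(8 / 3).
example_rmse = rmse([3.0, 5.0, 2.0], [1.0, 5.0, 4.0])
```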
3.5.2 Mean Absolute Percentage Error (MAPE)
The mean absolute percentage error (MAPE) measures the accuracy of a forecasting
technique. It is the average of the absolute percentage errors of the items in a
dataset and indicates how accurate the predicted quantities are in comparison to the
actual values. MAPE is undefined when any actual value is zero, so it requires
non-zero values in the dataset; it is often used to evaluate large datasets. The
formula is shown below:
$$\mathrm{MAPE} = \frac{1}{\text{sample size}} \sum \frac{\left|\mathrm{actual} - \mathrm{forecast}\right|}{\left|\mathrm{actual}\right|} \times 100$$
Equation 3.2: MAPE Equation
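Equation 3.2 can be sketched in a few lines of standard-library Python. The function name `mape` and the toy values are hypothetical; the explicit check for zero actual values reflects the requirement stated above rather than any code from the thesis.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return sum(ratios) / len(ratios) * 100


# Toy values (assumption): both forecasts are off by 10 % of the actual value.
example_mape = mape([100.0, 200.0], [110.0, 180.0])
```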
3.5.3 Mean Squared Error (MSE)
The mean squared error (MSE) is a means of assessing how closely a regression line
matches a collection of points. To do this, the distances between the points and the
regression line, also referred to as the "errors", are squared. Squaring removes
negative signs and gives more weight to larger differences. The metric is called the
mean squared error because it averages a set of squared errors. The forecast becomes
more precise as the MSE decreases.
$$\mathrm{MSE} = \frac{1}{n} \sum \left(\mathrm{actual} - \mathrm{forecast}\right)^2$$
Equation 3.3: MSE Equation
n = number of observations
∑ = summation notation
actual = original or observed y-value
forecast = y-value from the regression
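The following sketch evaluates Equation 3.3 on toy data using only built-in Python. The function name `mse` and the values are assumptions for illustration.

```python
def mse(actual, forecast):
    """Mean squared error between observed and forecast values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Toy values (assumption): squared errors are 4, 0 and 4, so MSE = 8 / 3.
example_mse = mse([1.0, 5.0, 4.0], [3.0, 5.0, 2.0])
```

Taking the square root of this value gives the RMSE of Equation 3.1 for the same data.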
3.5.4 R-Squared (R²)
R-squared is a statistical evaluation metric used to assess the goodness of fit of
a regression model. It measures how accurately the regression model predicts the
observed data, expressed as the percentage of the dependent variable's variation that
can be accounted for by the model's independent variables. R-squared is commonly
used to understand the performance of regression models and to compare different
models.
$$R^2 = 1 - \frac{\text{Sum of Squared Residuals (SSR)}}{\text{Total Sum of Squares (SST)}}$$
Equation 3.4: R-Squared Formula
SSR: the sum of the squared differences between the predicted and the actual values.
SST: the sum of the squared differences between the actual values and the mean of the
dependent variable.
R-squared expresses the proportion of the total variance that is explained by the
model, which equals one minus the ratio of the residual sum of squares to the total
sum of squares. A value of 1 for R-squared indicates that the regression model
perfectly fits the data, explaining all the variability. A score of 0, on the other
hand, signifies that the model does not account for any variability and is
essentially comparable to forecasting the dependent variable's mean. R-squared can
also take on negative values; this happens when the model fits the data worse than
predicting the mean, for example when it is poorly fitted or overfitted.
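The three cases described above (a perfect fit, a mean-only forecast, and a fit worse than the mean) can be reproduced with a short sketch that computes R-squared as one minus SSR divided by SST, with SSR and SST as defined above. The function name `r_squared` and the toy data are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSR / SST."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residuals
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total
    if sst == 0:
        raise ValueError("R-squared is undefined for constant actual values")
    return 1 - ssr / sst


actual = [1.0, 2.0, 3.0, 4.0]                         # toy data (assumption)
perfect = r_squared(actual, [1.0, 2.0, 3.0, 4.0])     # model matches the data
mean_only = r_squared(actual, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean
reversed_fit = r_squared(actual, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
```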
3.5.5 Training Loss
Categorical cross-entropy is a loss function used in machine learning models for
classification tasks with multiple classes, and it is calculated for all the
proposed models for evaluation. It is designed to measure the dissimilarity or error
between the predicted class probabilities and the true class labels; that is, it
quantifies the difference between the predicted probability distribution and the true
class labels. The equation for categorical cross-entropy loss is as follows:
$$\mathrm{loss} = -\sum \left( y_{\mathrm{true}} \cdot \log\left(y_{\mathrm{pred}}\right) \right)$$
Equation 3.5: Training Loss Equation
Given:
• $y_{\mathrm{true}}$: True labels (one-hot encoded vectors).
• $y_{\mathrm{pred}}$: Predicted probabilities for each class.
Keras is used to calculate the categorical cross-entropy loss function for the model.
Thus, during training, the loss is calculated using this equation for each batch of training
samples.
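To make Equation 3.5 concrete, the sketch below evaluates the loss for single samples and averages it over a small batch. The function names, the clipping constant `eps`, and the toy probabilities are assumptions for illustration; the batch averaging mirrors how a per-batch loss is typically reported and is not taken from the thesis code.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(y_true * log(y_pred)) over the classes."""
    # eps guards against log(0); the exact value is an assumption.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def batch_loss(batch_true, batch_pred):
    """Mean of the per-sample losses in one batch."""
    losses = [categorical_cross_entropy(t, p) for t, p in zip(batch_true, batch_pred)]
    return sum(losses) / len(losses)


# Toy batch (assumption): the true classes receive probabilities 0.8 and 0.5.
y_true = [[0, 1, 0], [1, 0, 0]]
y_pred = [[0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]
sample_loss = categorical_cross_entropy(y_true[0], y_pred[0])
mean_loss = batch_loss(y_true, y_pred)
```

Because the true label is one-hot encoded, only the predicted probability of the true class contributes to the loss of a sample.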
3.5.6 Accuracy of the Model Generated
The accuracy is calculated from the predictions made by the model on the test
dataset input features (X_test) and the corresponding true categorical labels
(y_test). The general equation for accuracy is:
$$\mathrm{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$
Equation 3.6: Formula to Calculate Accuracy
The evaluate() method calculates the loss and accuracy for the provided test data
(X_test) and the true labels (y_test), which are converted into one-hot encoded vectors.
In summary, the accuracy is obtained directly from the model evaluation process and
indicates the proportion of samples in the test dataset that were correctly
classified.
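Equation 3.6 can be sketched for one-hot encoded labels and predicted class probabilities as follows. The helper names `argmax` and `accuracy` and the toy predictions are hypothetical; the sketch only illustrates the ratio and does not reproduce the evaluate() method.

```python
def argmax(values):
    """Index of the largest entry in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(y_true_one_hot, y_pred_probs):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true_one_hot) != len(y_pred_probs) or not y_true_one_hot:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(
        argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_pred_probs)
    )
    return correct / len(y_true_one_hot)


# Toy data (assumption): the third sample is misclassified, the rest are correct.
y_test = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]
test_accuracy = accuracy(y_test, y_prob)
```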
CHAPTER 4
RESULTS AND DISCUSSION
4.1 Introduction
This chapter presents the results obtained on the dataset used to predict crime
hotspots with LSTM, Bi-LSTM, LSTM-CNN, and Bi-LSTM-CNN. The models are evaluated
using RMSE, MAPE, MSE, R-squared, accuracy, and training loss in order to compare
their performance and select the best-performing model.
4.2 Data Analysis
The figures below show the total number of crimes by crime type, district, hour,
year, and month.
Figure 4.1: Crime Type Distribution
Figure 4.2: District-wise Crime Distribution
Figure 4.3: Hour-wise Crime Distribution
Figure 4.4: Year-wise Crime Distribution
Figure 4.5: Month-wise Crime Distribution
4.3 Results
4.3.1 Accuracy of the Trained Models
Accuracy (%)
Model         1st Training  2nd Training  3rd Training  4th Training  5th Training  Average
LSTM          96.9          94.3          97.2          92.7          93.0          94.82
Bi-LSTM       90.4          85.5          91.7          86.3          92.5          89.28
LSTM-CNN      93.9          97.8          89.2          97.8          96.2          94.98
Bi-LSTM-CNN   98.8          98.6          96.6          94.8          93.2          96.4
Table 4.1: Accuracy of Trained Models
[Chart: average accuracy (%) of LSTM, Bi-LSTM, LSTM-CNN, and Bi-LSTM-CNN]
Figure 4.6: Average Accuracy Comparison
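The Average column of Table 4.1 is the arithmetic mean of the five training runs. The sketch below reproduces it from the per-run accuracies reported in the table; the variable names are hypothetical, and the same calculation applies to the averages in the tables that follow.

```python
from statistics import mean

# Per-run accuracies (%) as reported in Table 4.1.
accuracy_runs = {
    "LSTM": [96.9, 94.3, 97.2, 92.7, 93.0],
    "Bi-LSTM": [90.4, 85.5, 91.7, 86.3, 92.5],
    "LSTM-CNN": [93.9, 97.8, 89.2, 97.8, 96.2],
    "Bi-LSTM-CNN": [98.8, 98.6, 96.6, 94.8, 93.2],
}

# Average over the five runs, rounded to two decimals as in the table.
averages = {model: round(mean(runs), 2) for model, runs in accuracy_runs.items()}

# Model with the highest average accuracy.
best_model = max(averages, key=averages.get)
```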
4.3.2 Training Loss
Loss During Training
Model         Training 1  Training 2  Training 3  Training 4  Training 5  Average
LSTM          0.108       0.133       0.105       0.145       0.185       0.1352
Bi-LSTM       0.264       0.355       0.200       0.320       0.188       0.2654
LSTM-CNN      0.128       0.091       0.257       0.079       0.097       0.1304
Bi-LSTM-CNN   0.051       0.056       0.112       0.136       0.137       0.0984
Table 4.2: Loss during Model Training
[Chart: average training loss per model; y-axis from 0 to 0.3]
Figure 4.7: Average Training Loss Comparison
4.3.3 RMSE
RMSE
Model         1       2       3       4       5       Average
LSTM          2.454   4.125   2.824   4.258   4.755   3.6832
Bi-LSTM       4.443   3.674   3.524   3.720   4.206   3.9134
LSTM-CNN      3.216   2.487   4.127   2.122   2.22    2.8344
Bi-LSTM-CNN   1.920   2.490   2.561   2.178   3.860   2.6018
Table 4.3: RMSE Average
[Chart: average RMSE per model; y-axis from 0 to 4.5]
Figure 4.8: Average RMSE Comparison
4.3.4 MAPE
MAPE
Model         1        2        3        4         5        Average
LSTM          0.0637   0.0670   0.0299   0.0828    0.0655   0.06178
Bi-LSTM       0.0737   0.103    0.0357   0.01504   0.0170   0.04889
LSTM-CNN      0.0421   0.022    0.0526   0.0211    0.0250   0.03256
Bi-LSTM-CNN   0.0180   0.025    0.0257   0.0618    0.061    0.03132
Table 4.4: MAPE Average
[Chart: average MAPE per model; y-axis from 0 to 0.07]
Figure 4.9: Average MAPE Comparison
4.3.5 MSE
MSE
Model         1        2        3        4        5        Average
LSTM          6.026    17.020   7.977    18.133   22.614   14.354
Bi-LSTM       19.747   13.501   12.424   13.844   17.691   15.4414
LSTM-CNN      10.346   6.190    17.036   4.505    4.970    8.6094
Bi-LSTM-CNN   3.688    6.204    6.559    4.747    14.902   7.22
Table 4.5: MSE Average
[Chart: average MSE per model; y-axis from 0 to 18]
Figure 4.10: Average MSE Comparison
4.3.6 R-Squared
R-Squared
Model         1       2       3        4       5       Average
LSTM          0.975   0.930   0.967    0.926   0.908   0.9412
Bi-LSTM       0.919   0.945   0.949    0.943   0.928   0.9368
LSTM-CNN      0.957   0.975   0.9307   0.981   0.979   0.96454
Bi-LSTM-CNN   0.985   0.974   0.973    0.980   0.946   0.9716
Table 4.6: R-Squared Average
[Chart: average R-squared per model; y-axis from 0.91 to 0.98]
Figure 4.11: Average R-Squared Value Comparison
4.3.7 Time Taken for Each Algorithm for Training
Each model was trained five times, with the number of epochs set to 50 each time.
The table below compares the training time of each model in seconds.
Training Time (s)
Model         1         2         3         4         5         Average
LSTM          1142.0    1183.13   1300.50   1168.66   1298.41   1218.54
Bi-LSTM       1429.61   1433.36   1481.91   1387.33   1444.75   1435.39
LSTM-CNN      823.2     812.15    840.09    802.90    888.34    833.336
Bi-LSTM-CNN   979.12    977.45    1099.96   994.04    986.88    1007.49
Table 4.7: Average Time Taken for Model Training
[Chart: average model training time (s) per model; y-axis from 0 to 1600]
Figure 4.12: Average Time Taken for Model Training Comparison
4.4 Summary
Based on the data analysis, the type of crime was taken as the categorical data
(y_test) for this experiment. In the crime type distribution, Motor Vehicle Accident
Response is the most frequently reported crime type. District B2 has the highest
number of reported crimes and district A15 the lowest. Crimes were also analysed
over time by hour, month, and year. Over the course of a day, the highest crime
rates are reported between 5 pm and 7 pm. The year-wise distribution shows no clear
trend, so crime rates may rise or fall in the coming years. By month, months 7, 8,
and 9 have the highest number of recorded crimes.
Among the models trained and compared, Bi-LSTM-CNN performs convincingly across
all evaluation metrics relative to the other algorithms. It records the highest
average training accuracy (96.4%) and the highest average R-squared value (0.9716),
although the margin over the LSTM and LSTM-CNN models is small. Bi-LSTM-CNN also
has the lowest average training loss (0.0984). Its average RMSE (2.6018) is the
lowest as well, followed by LSTM-CNN (2.8344). The average MAPE values of
Bi-LSTM-CNN (0.03132) and LSTM-CNN (0.03256) differ only slightly. For MSE,
Bi-LSTM-CNN records the lowest value (7.22) of the four models.
To choose the optimum model, training time was also evaluated. Although LSTM-CNN
has the shortest average training time (about 833 s), Bi-LSTM-CNN needs only about
174 s more (about 1,007 s) and performs better on the other evaluation metrics.
Because training is a one-time step, this additional cost is acceptable.
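The training-time trade-off can be checked directly from the averages in Table 4.7. The short calculation below is only an illustration; the variable names are hypothetical.

```python
# Average training times in seconds, as reported in Table 4.7.
avg_time = {"LSTM-CNN": 833.336, "Bi-LSTM-CNN": 1007.49}

# Additional time needed by Bi-LSTM-CNN, in seconds and relative to LSTM-CNN.
extra_seconds = avg_time["Bi-LSTM-CNN"] - avg_time["LSTM-CNN"]
relative_overhead = extra_seconds / avg_time["LSTM-CNN"] * 100
```

On these figures, Bi-LSTM-CNN takes roughly a fifth longer to train than LSTM-CNN.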
CHAPTER 5
CONCLUSION AND FUTURE RECOMMENDATIONS
5.1 Conclusion
Crime hotspot identification is critical for increasing public safety and law
enforcement operations. The capacity to detect and anticipate high-crime regions
allows law enforcement organizations to better allocate resources and conduct targeted
crime prevention initiatives. The development of sophisticated data analysis
techniques and machine learning algorithms has resulted in substantial advances in
spatiotemporal crime hotspot detection in recent years.
This study focuses on spatiotemporal crime hotspot detection with a hybrid
machine learning method to improve forecasting performance. Various machine
learning algorithms were reviewed to narrow the scope of the investigation. It was
found that hybrid machine learning algorithms outperform traditional machine
learning algorithms in forecasting spatiotemporal crime hotspots. This research
compares a non-hybrid model, LSTM, with hybrid machine learning models including
Bi-LSTM, LSTM-CNN, and Bi-LSTM-CNN. Bi-LSTM-CNN outperformed the other models
in terms of accuracy, R-squared value, RMSE, MAPE, and MSE. Training time was also
evaluated to show the differences among the models. LSTM-CNN has a shorter
training time, but the additional training time of the Bi-LSTM-CNN model is
compensated for by its better accuracy. This demonstrates that the hybrid model,
Bi-LSTM-CNN, may be employed for spatiotemporal crime hotspot detection, which could
lead to improved crime prevention when deployed by police agencies.
5.2 Recommendations for Future Research
For future work, it is recommended to use the Bi-LSTM-CNN algorithm to plot
hotspots on a map of Boston, making the model more usable in real-life applications
by the crime department. It is also recommended to include datasets from different
regions to test the generalizability of the algorithm. Another idea is to develop a
programme that retrieves daily real-time data from the crime department and
provides accurate hotspots based on the most recent information.
REFERENCES
(BPD), B. P. (2018). Crimes in Boston. Retrieved from

https://www.kaggle.com/datasets/AnalyzeBoston/crimes-in-boston
Aljuboori, F., Shaker, H., & Fadhil, A. (2022). Approaches, A Crime Data Analysis
of Prediction Based on Classification. Baghdad Science Journal, 19(5), 1073-
1077.
Alsirhani, A., Sampalli, S., & Bodorik, P. (2018). DDoS Detection System: Utilizing
Gradient Boosting Algorithm and Apache Spark. 2018 IEEE Canadian
Conference on Electrical & Computer Engineering (CCECE).
Annie, S. S., Pathmanaban, J., Kingsley, S., Sriman, B., K, S. N., & E, S. K. (2023,
March 14). Prediction and Prevention Analysis Using Machine Learning
Algorithms for Detecting the Crime Data. 2022 1st International Conference
on Computational Science and Technology (ICCST), (pp. 986-991).
Anuvarshini, S. R., Deeksha, N., C, D. S., & Krishna, S. K. (2022). Crime Forecasting
: A Theoretical Approach. 2022 IEEE 7th International Conference on Recent
Advances and Innovations in Engineering (ICRAIE).
Boppuru, P. R. (2023, February). Geo-spatial crime density attribution using optimized
machine algorithms. International Journal of Information Technology.
doi:10.1007/s41870-023-01160-7
Butt, U. M., Letchmunan, S., Hassan, F. H., & Koh, T. W. (2022, September 7). Hybrid
of deep learning and exponential smoothing for enhancing crime forecasting
accuracy.
Cai, C., Tao, Y., Zhu, T., & Deng, &. Z. (2021). Short-Term Load Forecasting Based
on Deep Learning Bidirectional LSTM Neural Network.
CHANDRA, R., GOYAL, S., & GUPTA, R. (2021). Evaluation of deep learning
models for multi-step ahead time series prediction.
Deepak, G., Rooban, S., & Santhanavijayan, A. (2021). A knowledge centric
hybridized approach for crime classification incorporating deep bi-LSTM
neural network. Multimedia Tools and Applications (2021), 28061–28085.
Dewan, A., Islam, K. M., Fariha, T. R., Murshed, M. M., Ishtiaque, A., Adnan, M. S.,
. . . Chowdhury, M. B. (2022). Spatial Pattern and Land Surface Features
Associated with Cloud-to-Ground Lightning in Bangladesh: An Exploratory
Study. Earth Systems and Environment (2022), 437-451.
Esan, D. O., Owolawi, P., & Tu, C. (2020). Detection of Anomalous Behavioural
Patterns In University Environment Using CNN-LSTM. 2020 International
Conference on Computational Science and Computational Intelligence
(CSCI), (pp. 29-35).
Fefferman, C., Mitter, S., & Narayanan, H. (2016). Testing the manifold hypothesis.
Journal of the Amer, 29(4), 983–1049.
Guo, J., Wu, B., & Zhou, P. (2020). 2020 IEEE Fifth International Conference on Data
Science in Cyberspace. Cyberspace (DSC). Beijing, China.
HAN, X., HU, X., WU, H., SHEN, B., & WU, J. (2020). Risk Prediction of Theft
Crimes in Urban Communities: An Integrated Model of LSTM and ST-GCN.
8, 217222-217230.
Huang, C., Zhang, J., Zheng, Y., & Chawla, N. V. (2018). DeepCrime: Attentive
Hierarchical Recurrent Networks for Crime Prediction. The 27th ACM
53
International Conference on Information and Knowledge Management (CIKM
’18). Turin, Italy.
Jogendra, K., Sravani, M., Akhil, M., Sureshkumar, P., & Yasaswi, V. (2022). Crime
Rate Prediction Based on K-means Clustering and Decision Tree Algorithm.
Computer Networks and Inventive COmmunication Technologies, 451-462.
Kang, H.-W., & Kang, H.-B. (2017, April 17). Prediction of crime occurrence from
multimodal data using deep learning. Dept. of Digital Media, Catholic
University of Korea, Bucheon, Gyonggi-Do, Korea.
Kanimozhi, N., N, V. K., G, S. P., Ranjitha, G., & Yuvarani, S. (2021). CRIME TYPE
AND OCCURRENCE PREDICTION USING MACHINE LEARNING
ALGORITHM. International Conference on Artificial Intelligence and Smart
Systems. doi:10.1109/ICAIS50930.2021.9395953
Khan, M., Ali, A., & Alharbi, Y. (2022). Predicting and Preventing Crime: A Crime
PredictionModel Using San Francisco Crime Data by Classification
Techniques.
Kumar, A., Saumyab, S., & Singh, J. P. (2020). NITP-AI-NLP@HASOC-Dravidian-
CodeMix-FIRE2020: A Machine Learning Approach to Identify Offensive
Languages from Dravidian Code-Mixed Text. Forum for Information Retrieval
Evaluation. Hyderabad, India.
Lamari, Y., Freskura, B., Abdessamad, A., Eichberg, S., & Bonviller, S. d. (2020).
Predicting Spatial Crime Occurrences through an Efficient Ensemble-Learning
Model. International Journal of Geo-Information, 9(645).
Mittal, M., Goyal, L. M., & Sethi, J. K. (2018). Monitoring the Impact of Economic
Crisis on Crime in India Using Machine Learning. 1469-1485.
Muthamizharasan, M., & Ponnusamy, R. (2022). Forecasting Crime Event Rate with
a CNN-LSTM Model. Innovative Data Communication Technologies and
Application, 461-470.
Noor, T. H., Almars, A. M., Alwateer, M., Almaliki, M., Gad, I., & Atlam, E.-S.
(2022). SARIMA: A Seasonal Autoregressive Integrated Moving Average
Model for Crime Analysis in Saudi Arabia. Electronics 2022, 11(3986).
Olah, C. (2014, April 6). Neural Networks, Manifolds, and Topology. Retrieved from
https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Ribeiro, J., Meneses, L., Costa, D., Miranda, W., & Alves, R. (2022). Prediction of
Homicides in Urban Centers: A Machine Leaning Approach. 344-361.
SAFAT, W., ASGHAR, S., & GILLANI, S. A. (2021, May 6). Empirical Analysis for
Crime Prediction and Forecasting Using Machine Learning and Deep Learning
Techniques. 9, 70080-70094.
Salam, M. A. (2022, April). Time Series Crime Prediction Using a Federated Machine
Learning Model.
Sharma, H. K., Choudhury, T., & Kandwal, A. (2021). Machine learning based
analytical approach for geographical analysis and prediction of Boston City
crime using geospatial dataset. GeoJournal.
Singh, S., Rani, i., Bansal, P., & Techniques, A. S. (2023). Designing of an Efficient
Model for Violence . 2023 International Conference on Advancement in
Computation & Computer Technologies (InCACCT), (pp. 533-538).
Stec, A., & Klabjan, D. (2018, June 5). Forecasting Crime with Deep Learning.
TASNIM, N., IMAM, I. T., & HASHEM, M. M. (2022). A Novel Multi-Module
Approach to Predict Crime Based on Multivariate Spatio-Temporal Data Using
Attention and Sequential Fusion Model. 10, 48009-48030.
54
Tong, X., Ni, P., Li, Q., Yuan, Q., Liu, J., Lu, H., & Li, G. (2021). Urban Crime Trends
Analysis and Occurrence Possibility Prediction based on Light Gradient
Boosting Machine. 2021 IEEE 4th International Conference on Big Data and
Artificial Intelligence.
Ye, X., Duan, L., & Peng, Q. (2021). Spatiotemporal Prediction of Theft Risk with
Deep Inception-Residual Networks. Smart Cities, 4, 204 - 216.
Yin, J., Michael, I. A., & Afa, I. J. (2020, February 9). Machine Learning Algorithms
for Visualization and Prediction Modeling of Boston Crime Data. p. 15.
ZHANG, X., LIU, L., & XIAO, L. (2020). Comparison of Machine Learning
Algorithms for Predicting Crime Hotspots. 8, 181302-181310.
Zhuang, W., & Cao, &. Y. (2022). Short-Term Traffic Flow Prediction Based on CNN-
BILSTM with Multicomponent Information. 12.
Zhuang, Y., Almeida, M., Morabito, M., & Ding, W. (2017). Crime Hot Spot
Forecasting:. IEEE International Conference on Big Knowledge. Lowell,
Massachussets.
55
APPENDICES
APPENDIX A PYTHON CODE FOR THE 4 ALGORITHMS
