
2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS)

A Predictive Model for Road Traffic Data Analysis and Visualization to Detect Accident Zones

979-8-3503-9737-6/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICACCS57279.2023.10112862

S. Subhashini
Department of Computer Applications
Hindustan Institute of Technology and Science
Chennai, Tamil Nadu, India
21248005@student.hindustanuniv.ac.in

R. Maruthi
Department of Computer Applications
Hindustan Institute of Technology and Science
Chennai, Tamil Nadu, India
rmaruthi@hindustanuniv.ac.in

Abstract: As vehicle use has increased throughout the world, the number of accidents has also increased globally. The number of traffic accidents is recognized as one of the world's major concerns, and it grows each year. Consequently, it has a significant impact on a nation's society, economy, and progress. As accidents occur in various places and at various times, identifying where they are more likely to occur can be complicated. This paper presents a predictive model to analyze and visualize road traffic accident zones in Tamil Nadu. The zones are categorized into four levels: low, medium, high and very high. Data analysis is carried out to make this determination. Real-time data are acquired and interpreted using Latent Class Clustering Analysis (LCA). The outcome is represented through the cartogram visualization technique. Accident analysis is done to identify the causes of an accident in an attempt to prevent similar instances from recurring. As a result, the zones are simpler to recognize and the data is easier for users to absorb.

Keywords: Road accident zone analysis, Road accidents, Predictive model, Data analysis, Data visualization, Latent class clustering analysis, Cartogram

I. INTRODUCTION
Road accidents are one of the leading causes of death around the world at present. To minimize the consequences of accidents, some measures are required. The study of accidents, which can distinguish and analyze the factors behind an accident, is also necessary to choose the most efficient measures. Analyzing accident data can also be used to identify street-, vehicle-, and motorist-related accident causes. Human error, driver collapse, poor road perception, vehicle mechanical failure, speeding and racing in violation of traffic rules, traffic jams, road encroachment, etc. are crucial factors in accidents. To pinpoint the areas with the highest concentration of accidents, data from road accidents are analyzed. Accidents occur under a variety of circumstances, including the time of day, the vehicle's type, the age of the rider, the number of passengers, the type of road, and the weather. Young people between the ages of 15 and 25 seem more likely to be affected. These young people use bikes for entertainment and do not have a significant amount of control. They include those in the age category without a license; government regulations prohibit driving a vehicle until the age of 18. The busiest regions in Tamil Nadu are where accidents occur most frequently. A busy city like Chennai is a major metropolis with a huge number of people and access to all forms of transportation. Accidents happen less frequently in places like Nilgiris, as fewer people live there. People choose such places for vacations, not as their permanent residence. In 2020, the total number of accidents recorded in Tamil Nadu was above 45,000. This number is expected to increase in the coming decades, as the state has seen the number of accidents rise each year. Visitors from other states and from abroad are frequent in Tamil Nadu. Through this study, they can learn about the accidents that occur in each district they visit. They can use it to view the accident zones and how often accidents occur there, so that they can travel safely and be aware of how busy the region is. This paper supports determining the location through a Google Maps representation. The data are collected, cleaned, integrated, verified, and analyzed. Latent Class Clustering Analysis (LCA) is used for data analysis.

II. RELATED WORKS
A descriptive model has been studied to find road traffic accident zones. That study also uses infographic visualization techniques to identify the location efficiently. It provides the data only in a simple format and gives no details about the future [1]. Another study detects hotspots (road accident zones) using kernel density and hotspot analysis. It does not provide statistical significance and it often gives the same results [2]. A logistic regression model for analyzing and visualizing road accidents in accident-prone areas was studied. This model is not applicable if the number of observations is less than the number of features [3]. Using the fuzzy clustering technique, the probability of road accidents in Medan city is investigated. It performs badly on datasets with clusters of various sizes [4]. The K-means technique was applied in a study to cluster accident-prone highway zones. It cannot deal with outliers and noisy data [5].

Traffic accident evaluation and prediction through machine learning techniques has been examined. The algorithms for logistic regression, decision tree, and random forests were discussed in that study. The one with
best accuracy is chosen [6]. A study suggested machine learning to predict road accidents, using a random forest model. More trees can make the algorithm ineffective for prediction [7]. A spatial analysis model is suggested in a study to classify traffic accident-prone roads. MCDM-ANN (Multi-Criteria Decision Making with an Artificial Neural Network) has been used in this study. For big neural networks it needs more processing time [8]. Machine learning techniques have been used in a study to examine traffic accidents and detect hotspots. Algorithms such as Naive Bayes, Decision Trees, Random Forests, Logistic Regression, DBSCAN and Hierarchical Clustering are compared, and the one that provides the highest accuracy is chosen [9]. In a spatiotemporal investigation of traffic accident hotspots, spatial autocorrelation is examined. It is built using geospatial techniques. The test statistic's value is exaggerated [10].

A study has been conducted on the prediction of road accidents using machine learning, through the Expectation Maximization method. This method has slow convergence [11]. By applying Geographic Information System (GIS) temporal-spatial statistical analytic techniques, a study is carried out to identify the regions with the highest rates of road accidents. Kernel density estimation has been used in this study. It is biased, particularly near the boundaries when the data is bounded [12]. R programming has been used to analyse and visualise traffic accident statistics in a study. R uses more memory than other programming languages [13]. A study has been suggested for Dubai road accident analysis using an Apriori-based algorithm. It requires more time to analyze large datasets [14]. Data mining has been used in a study to classify the locations of vehicle accidents. It makes use of association rule mining and the K-means clustering technique. The algorithm's performance is poor [15].

III. METHODOLOGY
A predictive model is proposed to analyze accident zones using road traffic data. Road accident zone data analysis is conducted using an unsupervised machine learning clustering technique. The data are clustered into different subgroups, analyzed based on their characteristics, and visualized using maps. The accident data set for the initial process is derived from the public domain.

The data set contains instances of, and locations near, traffic accidents in a specific region. Before the data set is put through pre-processing, it must be validated. Data cleaning should be done to remove any null data or redundant information that might be present. The cleaned data is used as pre-processed data and fed into the algorithm as input. To classify the data, the algorithm generates features from various dataset properties. The method is used to evaluate the data and, based on the features that are retrieved, a probability is produced for the processed data. The data is then visualized using a data visualization technique. Fig. 1 represents the steps for processing the data.

Fig. 1. Steps for processing data

A. Data Collection & Verification

Fig. 2. Process of data collection method & verification

Data collection is the process of gathering the information needed based on the question and evaluating outcomes, as represented in Fig. 2. Verification is the process of checking for inconsistencies in the collected data, which is necessary to measure progress. It determines whether the data was transferred accurately from source to destination. The collected data is measured, analyzed and verified against the sources so that outcomes and trends can be evaluated and forecast.

B. Data Pre-processing

Fig. 3. Process of data pre-processing method

Data analysis must involve the step of data pre-processing. It involves the conversion of unstructured data into a form that is understood by computers, as represented in Fig. 3. Raw data may be incomplete, may contain errors, and often lacks a regular format. The first step is cleaning out unwanted and duplicate data. Data integration is the process of combining data from different sources. Data transformation converts data from one form to another. The final step is data reduction, which reduces the storage the data requires.

C. Data Coding

Fig. 4. Process of data coding


Data coding is the process of deriving codes from the observed data, as shown in Fig. 4. Coding is a way of adjusting and refining the data. This process identifies the values in the data and the relationships between them. The data are sorted and categorized so that more specific issues can be analyzed. Coding then gives free-form data a structure so that it can be systematically evaluated.
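
The coding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`district`, `road_type`, `weather`), the sample values, and the helper names `build_codebook` and `encode` are assumptions made for illustration. The sketch normalizes free-form text values and assigns each distinct value an integer code, so that records can be sorted, categorized and counted.

```python
from collections import Counter

# Hypothetical free-form accident records; field names and values are
# illustrative assumptions, not taken from the paper's dataset.
records = [
    {"district": "Chennai", "road_type": "Highway ", "weather": "rain"},
    {"district": "Nilgiris", "road_type": "hill road", "weather": "Fog"},
    {"district": "Chennai", "road_type": "highway", "weather": "Rain"},
]


def normalize(value):
    """Strip surrounding whitespace and lowercase a free-form value."""
    return value.strip().lower()


def build_codebook(rows, field):
    """Map each distinct normalized value of `field` to an integer code."""
    values = sorted({normalize(row[field]) for row in rows})
    return {value: code for code, value in enumerate(values)}


def encode(rows, fields):
    """Return the codebooks and the rows with each field replaced by its code."""
    codebooks = {f: build_codebook(rows, f) for f in fields}
    coded = [
        {f: codebooks[f][normalize(row[f])] for f in fields}
        for row in rows
    ]
    return codebooks, coded


codebooks, coded = encode(records, ["district", "road_type", "weather"])

# Count coded road types to see how the records group after coding.
road_type_counts = Counter(row["road_type"] for row in coded)
```

After coding, the first and third records become identical, because their differences were only in letter case and stray whitespace. This is the kind of structure that makes free-form data systematically comparable.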

D. Data Summarization

Fig. 5. Process of data summarization

Data summarization is the summary of generated data. It is the process of reducing data to the required attributes. The process of summarizing data is shown in Fig. 5. It yields a simplified view of the data. The entire dataset is summarized to get the useful information. The extracted information can be presented in tabular or graphical form. The original data is compressed, and the result is known as summarized data.

E. Data Cleaning

Fig. 6. Process of data cleaning

Data cleaning is the process of removing noisy and duplicate data. Duplicate data is created when various datasets are combined; this data needs to be cleaned, and unnecessary observations should be eliminated. It is essential to address structural flaws and filter outliers. Missing data must also be handled properly. The flow diagram given in Fig. 6 represents the process of data cleaning: fixing errors, removing outliers and handling missing data.

F. Choosing Data Analysis strategy and tools
Analysis of data is the act of collecting and examining data to produce insightful conclusions. Before picking the kind of data analysis technique, the two types of data, qualitative and quantitative, should be taken into consideration. Fig. 7 represents the process of choosing the data analysis strategy and tools. The type of dataset and the desired outcome must be considered when choosing the data analysis tool.

Fig. 7. Tools and Strategy for Data Analysis

A statistical technique termed LCA (Latent Class Clustering Analysis) is used to analyze models like cluster, factor, and regression. This method involves collecting and analyzing data. Data connections are identified, and the data are clustered. Latent class cluster analysis indicates latent relationships that might be present between observations. LCA contains discrete latent categorical variables with a binomial distribution and is concerned with the structure of groups. Fig. 8 represents the working process of the Latent Class Clustering Analysis (LCA) algorithm.

Fig. 8. Process of Latent Class Clustering Analysis

Latent classes are used to identify unique groupings using discrete data. The basic premise of the method is that we can infer the presence of these categories by attempting to logically classify observations according to their attributes. These attributes are analyzed. The data have only been partly processed. To simplify the model, some questions that seemed more descriptive than normative are omitted. These questions are later examined as potential factors that ought to be included in the model to see if they have a bearing on the findings. The other questions were then rewritten to allow for a variety of responses and to simplify clustering.
Once a model with a specific number of latent classes has been chosen, responses are assigned to latent classes based on their posterior probability of class membership. The resulting groupings are then obtained, and their quality may be evaluated using classification statistics. The relationship between a group's class membership and
distal outcomes will then be examined. The bias-adjusted three-step methodology and multiple-group latent class analysis are the best methods for evaluation.
The measured variables are statistically independent of one another within each latent class. The associations between the measured variables are explained by the classes of the latent variable [16]. The latent class model can be written in one form as follows:

$$p_{i_1, i_2, \ldots, i_N} \approx \sum_{t}^{T} p_t \prod_{n}^{N} p^{n}_{i_n, t} \qquad (1)$$

where $T$ represents the total number of latent classes and $p_t$ indicates the unconditional probabilities, which should add up to one. The conditional or marginal probabilities are denoted by $p^{n}_{i_n, t}$.
The form is as follows for a two-way latent class model:

$$p_{ij} \approx \sum_{t}^{T} p_t \, p_{it} \, p_{jt} \qquad (2)$$

Non-negative matrix factorization and probabilistic latent semantic analysis are also connected to this two-way model.
The decision on the number of clusters is the first of two difficulties in model selection, and the structure of the model given the number of clusters is the second. A lower Bayesian Information Criterion (BIC) is generally preferable, considered together with goodness-of-fit statistics, multivariate regression latent variables, bootstrap samples, probability tests, and Wald tests. Another set of methods for evaluating LC cluster models is based on the separation of the clusters, or the uncertainty of classification [17].
Model fit, cluster separation and partition stability are the selection criteria for LCA. For the final selection of the number of clusters, additional factors such as parsimony, the size of the population shares, and evaluation of clusters must be taken into account. The chosen clusters can be explored and visualized. To further determine the number of classes, a range of criteria can be applied. No single criterion is generally accepted as the best [18].

Visualization means representing data and information in the form of pictures, graphs, charts, maps, plots and animations. Here, a cartogram, a map-based visualization technique, is used to plot the regions. A cartogram, or value-by-area map, is a form of map-based data visualization. The values are specified consistently with one another in order to convey the information through the map. Cartograms are a sort of map that illustrates a region's geography. The size of an area is determined by its value. According to the user's preferences, the regions can be coloured or shaded. Cartograms are typically used to illustrate data pertaining to countries and regions, for example election results or population. Here the regions are pointed out using marks whose specifications are based on the values.

IV. EXPERIMENTAL RESULTS

Fig. 9. Visualization of road traffic accident zones

TABLE 1. KEY SYMBOL TO POINT THE ZONES

Symbol            Description
Green square      Low
Yellow diamond    Medium
Blue circle       High
Red star          Very High

The map in Fig. 9 represents road traffic accident zones in Tamil Nadu using the cartogram technique. This real-time map shows part of Tamil Nadu. Four different pointers are marked on all the districts according to the dataset. Different colours and shapes were used to highlight the districts according to the accident rate predicted from previous years' records. Table 1 shows the key symbols which are used to point out the zones on the map. The green square represents the districts with a low accident rate. The yellow diamond represents the districts with a medium accident rate. The blue circle represents the districts with a high accident rate. The red star represents the districts with a very high accident rate. These different symbols and colours also make it simple to understand the category.

V. CONCLUSION
This study examines data on road traffic accidents that occurred in Tamil Nadu. To present this predictive model in a realistic way, a real-time dataset is used. The visualization is also presented in Google My Maps, which is related to Google Maps. The collected real-time data has been processed, analyzed and visualized. The data that has been visualized can be applied to prevent accidents. Users can get essential information on the zones. By knowing the
pointed zones, the users can gain awareness about the safety of driving in each district, so that they can drive safely in the corresponding places. The major objective of this paper is to use visualisation to forecast the zones based on earlier years' records using the data that has been obtained. It is concluded that the busiest city has seen the largest number of accidents. Individuals must learn more about safety and road regulations, especially in the busiest metropolis. For future work, accident data from interior zones would be preferred, so that users of this model can know more about the accident zones in each district.

REFERENCES
[1] Rabbani, Muhammad & Musarat, Muhammad Ali & Alaloul, Wesam & Ayub, Saba & Bukhari, Hamna & Altaf, Muhammad. (2022). Road Accident Data Collection Systems in Developing and Developed Countries: A Review. International Journal of Integrated Engineering. 14. 336-352. 10.30880/ijie.2022.14.01.031.
[2] Mesquitela, J.; Elvas, L.B.; Ferreira, J.C.; Nunes, L. Data Analytics Process over Road Accidents Data: A Case Study of Lisbon City. ISPRS Int. J. Geo-Inf. 2022, 11, 143. https://doi.org/10.3390/ijgi11020143
[3] Sreedhar, Megna. (2021). Road Traffic Accident Analysis and Visualization of Accident Prone Areas. International Journal for Research in Applied Science and Engineering Technology. 9. 552-561. 10.22214/ijraset.2021.33280
[4] Syahputri, Khalida & Sari, Rachida & Rizkya, Indah & Tarigan, Ukurta & Siregar, Ikhsan & Farhan, Tengku. (2020). Clustering the vulnerability of traffic accidents in Medan city with a fuzzy c-means algorithm. IOP Conference Series: Materials Science and Engineering. 801. 012030. 10.1088/1757-899X/801/1/012030.
[5] Puspitasari, Diah & Wahyudi, Mochamad & Rizaldi, Muhammad & Nurhadi, Acmad & Ramanda, Kresna & Sumanto. (2020). K-Means Algorithm for Clustering The Location Of Accident-Prone On The Highway. Journal of Physics: Conference Series. 1641. 012086. 10.1088/1742-6596/1641/1/012086.
[6] Vyshnavi K G, Dr. Nalini N. (2022). Machine Learning Algorithms for Road Accident Analysis and Forecasting. International Journal of Research in Engineering and Science (IJRES). Volume 10, Issue 7, July 2022, pp. 283-288.
[7] Dipanshu Gupta, Vagisha Goel, Rithik Gupta, Mohd Shariq, Rajesh Singh (2022). Road Accident Predictor Using Machine Learning. International Research Journal of Modernization in Engineering Technology and Science, Volume 04, Issue 05, May 2022.
[8] Anik Vega Vitianingsih, Nanna Suryana, Zahriah Othman (2021). Spatial analysis model for traffic accident-prone roads classification: a proposed framework. IAES International Journal of Artificial Intelligence (IJ-AI), Vol. 10, No. 2, June 2021.
[9] Santos, D.; Saias, J.; Quaresma, P.; Nogueira, V.B. Machine Learning Approaches to Traffic Accident Analysis and Hotspot Prediction. Computers 2021, 10, 157. https://doi.org/10.3390/computers10120157
[10] Syed Saqib Ali Kazmi, Mehreen Ahmed, Rafia Mumtaz, Zahid Anwar, Spatiotemporal Clustering and Analysis of Road Accident Hotspots by Exploiting GIS Technology and Kernel Density Estimation, The Computer Journal, Volume 65, Issue 2, February 2022, Pages 155–176, https://doi.org/10.1093/comjnl/bxz158.
[11] Asghar Pasha, Vijayalakshmi, MD Atique, MD Hussain, Harsh narnot, Bipin, Road Accident Prediction using Machine Learning, International Research Journal of Engineering and Technology (IRJET), Volume: 08 Issue: 07 | July 2021.
[12] Khanh Giang Le, Pei Liu and Liang-Tay Lin, Determining the road traffic accident hotspots using GIS-based temporal-spatial statistical analytic techniques in Hanoi, Vietnam, Geo-spatial Information Science 2020, Vol. 23, No. 2, 153–164. https://doi.org/10.1080/10095020.2019.1683437
[13] Sodikov, Jamshid. (2018). Road Traffic Accident Data Analysis and Visualization in R. International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR). 8. 25-32. 10.24247/ijcseierdjun20184.
[14] Maya John, Hadil Shaiba, Apriori-Based Algorithm for Dubai Road Accident Analysis, Procedia Computer Science, Volume 163, 2019, Pages 218-227, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2019.12.103. (https://www.sciencedirect.com/science/article/pii/S1877050919321428)
[15] Kumar, S., Toshniwal, D. A data mining approach to characterize road accident locations. J. Mod. Transport. 24, 62–72 (2016). https://doi.org/10.1007/s40534-016-0095-5
[16] https://en.wikipedia.org/wiki/Latent_class_model
[17] Jeroen K. Vermunt, Tilburg University, Jay Magidson, Statistical Innovations Inc, Latent Class Cluster Analysis, https://jeroenvermunt.nl/hagenaars2002b.pdf
[18] Olga Lezhnina, Gábor Kismihók, Latent Class Cluster Analysis: Selecting the number of clusters, MethodsX, Volume 9, 2022, 101747, ISSN 22150161, https://doi.org/10.1016/j.mex.2022.101747. (https://www.sciencedirect.com/science/article/pii/S2215016122001273)
