
SN Computer Science

A Scalable Real-Time Intrusion Detection and Prevention System for IoV Networks
Using XGBoost and Big Data Technologies
--Manuscript Draft--

Manuscript Number: SNCS-D-25-02365

Full Title: A Scalable Real-Time Intrusion Detection and Prevention System for IoV Networks
Using XGBoost and Big Data Technologies

Article Type: Original Research

Abstract: The Internet of Vehicles (IoV) is a smart transportation concept that integrates connected vehicles with other vehicles and the infrastructure. Nevertheless, new cybersecurity threats to IoV networks have appeared, such as DoS and Spoofing GAS attacks that can paralyze the network and alter sensor readings. This study was conducted in two phases. In the first phase, Random Forest, XGBoost, LightGBM, SGD Classifier, Extra Trees, Gaussian Naïve Bayes, and Logistic Regression were compared on the CICIoV2024, CIC-IDS-2017, and BoTNeTIoT datasets to identify the most suitable model for real-time intrusion detection, and XGBoost was chosen as the best model. In the second phase, a real-time Intrusion Detection and Prevention System (IDPS) was proposed that employs Kafka for data acquisition, Spark Streaming for large-scale computation, and XGBoost for attack identification. The system offers comprehensive monitoring, sends alerts for high attack levels, and reports real-time metrics on attack categories, countermeasures, and system resources. Experimental results demonstrate the effectiveness of the proposed IDPS, with a high recall of 98.5%, which means that only about 1.5% of attacks go undetected as false negatives, and low-latency processing with an average inference time of 0.49 ms. The study also shows the scalability and effectiveness of the system in protecting IoV infrastructures.


Ikram Hamdaoui1*, Khalid El Makkaoui1 and Zakaria El Allali1
1* LaMAO laboratory, MSC team, FPD, Mohammed First University, Nador, Morocco.

*Corresponding author(s). E-mail(s): ikram.hamdaoui2@[Link];
Contributing authors: [Link]@[Link]; [Link]@[Link];

Keywords: Big Data, Intrusion Detection and Prevention System, Machine Learning, Real-Time Processing, Security, Vehicular Networks

1 Introduction

The Internet of Vehicles (IoV) is changing the way transportation systems operate by utilizing real-time communication between vehicles, roads, and cloud-based servers (see Fig. 1). This interconnected ecosystem enhances road safety, reduces congestion, and enables autonomous driving through continuous data sharing. However, as the IoV expands, it generates a large amount of data at high velocity, which must be processed in real time and protected from cyber threats [1]. The application of Big Data (BD) and distributed computing systems has been identified as a solution for the management of data streams in the IoV context [2]. However, the security of these large-scale networks is still a major issue since IoV systems are based on multi-modal data collected from a variety of sensors, including the vehicles' Controller Area Network (CAN-BUS), GPS, Vehicle-to-Vehicle (V2V) [3], and Vehicle-to-Infrastructure (V2I) networks [4]. These vulnerabilities make IoV networks an attractive target for cyber attacks that can compromise the safety, reliability, and productivity of the system [5].

Fig. 1 IoV Communication Model

Among the most serious threats to IoV are coordinated cyberattacks, during which several attack vectors are launched at the same time to paralyze vehicular movement and smart city infrastructure [6]. One such scenario involves a coordinated attack on an IoV fleet, combining Spoofing GAS and Denial of Service (DoS) attacks to compromise vehicle functionality and traffic management systems. An attacker controls the gas sensor data on the vehicle CAN-BUS and reports incorrect information about fuel levels and emissions. As a result, the autonomous vehicles miscalculate their fuel load and break down at the peak of the traffic jam. At the same time, a DoS attack sends a flood of traffic to the V2I network, overwhelming traffic control systems and delaying critical automated responses to traffic conditions (see Fig. 2). This combined attack results in severe traffic congestion, loss of operational efficiency, and increased accident risks, exposing major security gaps in IoV networks [7].

Fig. 2 Coordinated Cyberattack on IoV Networks

These challenges can be met by a proactive real-time security control system that is able to detect cyber threats and counter them in real time. Conventional security measures are not capable of analyzing fast-moving vehicle data and detecting abnormal behaviors in real time. Therefore, the application of ML-based Intrusion Detection and Prevention Systems (IDPS) has been proposed to identify cyber threats in IoV networks [8].

This paper proposes a scalable real-time IDPS for IoV environments, which employs ML techniques to identify and counteract coordinated attacks. The approach involves two key phases:
1. Applying Random Forest (RF), XGBoost, LightGBM, SGD Classifier, Extra Trees, Gaussian Naïve Bayes, and Logistic Regression models to the CICIoV2024 [9], CIC-IDS-2017 [10], and BoTNeTIoT [11] datasets to identify the most suitable model for intrusion detection.
2. Designing a real-time IDPS that uses Kafka for data ingestion, Spark Streaming for data processing, and XGBoost for anomaly detection (see Fig. 3).

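
The second phase amounts to an ingest, micro-batch, classify loop. As a rough illustration only, the sketch below imitates that flow with the Python standard library: a `queue.Queue` stands in for a Kafka topic, a fixed-size batch loop stands in for Spark Streaming micro-batches, and `classify` is a hypothetical threshold rule standing in for the trained XGBoost model. The function names, the `packet_rate` field, and the threshold are assumptions made for the example and do not come from the paper.

```python
import queue
from typing import Callable, Dict, List

Record = Dict[str, float]


def classify(record: Record) -> str:
    """Hypothetical stand-in for the trained XGBoost model (assumption)."""
    # Toy rule: a very high packet rate is labelled as DoS.
    return "DoS" if record["packet_rate"] > 1000.0 else "Benign"


def drain_batch(topic: "queue.Queue[Record]", max_size: int) -> List[Record]:
    """Pull up to max_size records, mimicking one streaming micro-batch."""
    batch: List[Record] = []
    while len(batch) < max_size:
        try:
            batch.append(topic.get_nowait())
        except queue.Empty:
            break
    return batch


def run_pipeline(topic: "queue.Queue[Record]",
                 model: Callable[[Record], str],
                 batch_size: int = 3) -> List[str]:
    """Process the topic batch by batch and return one label per record."""
    labels: List[str] = []
    while True:
        batch = drain_batch(topic, batch_size)
        if not batch:
            break
        labels.extend(model(record) for record in batch)
    return labels


def make_topic(rates: List[float]) -> "queue.Queue[Record]":
    """Fill an in-memory queue that plays the role of the ingestion topic."""
    topic: "queue.Queue[Record]" = queue.Queue()
    for rate in rates:
        topic.put({"packet_rate": rate})
    return topic
```

Calling `run_pipeline(make_topic([10.0, 5000.0]), classify)` labels the first record as benign and the second as DoS. In a deployment along the lines the paper describes, the queue would be replaced by a Kafka consumer and the loop by a streaming job.
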
The security of IoV systems depends on real-time data processing, as delays in intrusion detection may result in traffic congestion, vehicle failures, and safety risks [12]. Previous research has shown that BD security solutions built on distributed computing systems such as Hadoop [13] and Spark can enhance the efficiency of real-time intrusion detection [14]. Hadoop-based architectures for intelligent traffic monitoring have been studied, with promising results in handling large amounts of vehicular data [15]. Moreover, real-time anomaly detection techniques based on ML algorithms have been investigated in IoT and vehicular networks to improve the response time against cyber threats [16]. Therefore, the integration of BD-driven security analytics enables scalable, high-speed threat detection and mitigation in IoV environments.

Fig. 3 Real-Time IDPS Workflow for IoV Security

Thus, the application of BD security analytics in this work improves the cyber resilience of the IoV by providing real-time threat detection and mitigation [17]. Our contributions include:

• A comparative evaluation of ML models on real-world IoV datasets to select the best-performing model for intrusion detection.
• The implementation of a real-time IDPS leveraging Kafka, Spark Streaming, and XGBoost, ensuring low-latency attack detection and prevention.
• The integration of BD security frameworks to enhance the scalability of IoV security solutions.

This paper aims to address the gap between IoV cybersecurity and BD analytics for secure management of coordinated cyber threats in smart transportation systems. The remainder of this article is structured as follows: The second section reviews related work on intrusion detection approaches in IoV networks and examines their focus on ML and real-time scalability. The third section discusses security challenges in IoV environments and elaborates on the mechanisms and impacts of DoS and Spoofing GAS attacks. The fourth section outlines the experimental evaluation of various ML models on the CICIoV2024, CIC-IDS-2017, and BoTNeTIoT datasets, describing preprocessing, feature selection, and hyperparameter optimization methods. The fifth section introduces the proposed scalable real-time IDPS, outlining its architecture for real-time data ingestion via Kafka, distributed processing with Spark Streaming, and intrusion detection with XGBoost. The sixth section presents experimental results and provides an in-depth discussion, assessing system effectiveness, real-time responsiveness, and resource utilization. Finally, the seventh section concludes the paper and outlines potential future directions, such as the integration of deep learning models and the extension to detect newly emerging coordinated attacks.

2 Related Work

The IoV needs an IDPS to protect against serious cyber threats, including DoS attacks and Spoofing GAS attacks that target CAN-BUS and V2I communication vulnerabilities. Several ML and deep learning (DL)-based approaches have been presented to improve the efficiency of IoV IDPS, focusing on intrusion detection accuracy, dataset optimization, and real-time scalability.

In [9], the authors described CICIoV2024 as a benchmark dataset for CAN-BUS network intrusion detection. This dataset includes real-world attacks, for example, Spoofing GAS and DoS, thus helping to build a realistic test bed for training ML-based IDPS models. It was observed that, for binary attack classification, the binary data representation yielded higher accuracy for Deep Neural Networks (DNNs) and RF models. Nevertheless, while CICIoV2024 enriches the data resources available for IoV security research, it does not include real-time preventive measures to deal with cyber threats.

Similarly, the authors in [18] proposed an optimized ML-based IDPS for IoV, which employed XGBoost, LightGBM, and Extra Trees classifiers to identify DDoS attacks in IoV networks. They integrated Bayesian optimization for hyperparameter tuning and the Synthetic Minority Oversampling Technique (SMOTE) for data balancing to improve detection accuracy and prevent bias. Nevertheless, the solution covers only attack detection and does not encompass real-time response, which is a problem in live IoV environments.

Detection of advanced IoV threats using DL approaches has recently started to gain attention. In [19], the authors suggested a deep transfer learning-based IDPS for electric vehicles that employed LeNet to detect anomalies in the CAN-BUS. Their model showed a marked improvement in detection precision and a reduction in training time compared to traditional ML models. Nevertheless, the high computational costs and the lack of real-time response mechanisms make their approach less suitable for low-latency IoV security applications. In addition, the authors of [20] proposed a transfer learning-based IDPS to detect CAN-BUS anomalies with 99.91% accuracy across datasets. Transfer learning improves the model's flexibility, but their research does not incorporate real-time mitigation strategies, which our study addresses by linking real-time analytics to Kafka and Spark Streaming.

In [21], Federated Learning (FL) was used as a privacy-preserving technique for intrusion detection in IoV. The authors proposed an FL-based IDPS that performs distributed intrusion detection without collecting or storing raw data. The system was evaluated on the CIC-IDS2017 dataset and achieved an accuracy of 96.23% while protecting user privacy. Nevertheless, FL incurs high communication overhead, and the system does not incorporate real-time countermeasures to handle attacks, which makes it unsuitable for large-scale IoV security applications.

Other researchers have also investigated hybrid approaches that combine ML-based IDPS with BD architectures. In [22], the authors developed a signature-based IDS using ML, DL, and fuzzy clustering to minimize the number of false positives. Nevertheless, their approach is purely diagnostic and does not entail real-time action. Similarly, the authors of [23] introduced an ensemble-based approach to IoV IDPS that selects the most appropriate ML model for each type of attack. Their model used XGBoost, LightGBM, and CatBoost, and achieved an F1 score of 99.9997% on the Car-Hacking dataset and 99.811% on the CICIDS2017 dataset. However, ensemble methods are computationally intensive and thus less suitable for real-time IoV security.

Table 1 summarizes the different approaches explored in recent research, comparing their datasets, methodologies, detection performance, and limitations in real-time IoV IDPS.

However, most existing research on ML-based IDPS lacks real-time prevention, focuses only on attack detection, or suffers from low efficiency. Our work addresses these gaps by:
1. Using several ML algorithms (RF, XGBoost, LightGBM, Extra Trees, CNNs, and RNNs) to compare their effectiveness in detecting intrusions in the IoV.
2. Combining real-time data streaming using Kafka for data ingestion, Spark Streaming for data processing, and XGBoost for attack classification to improve detection and response time.
3. Implementing active mitigation strategies, such as blocking malicious IPs for DoS and isolating a compromised vehicle in a Spoofing GAS attack, thus making our system a real-time IDPS for IoV security.

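
The third point maps each predicted attack class to a countermeasure. A minimal sketch of such a dispatch step follows, assuming the classifier emits string labels such as `"DoS"` and `"Spoofing_GAS"` and that each event carries a source IP and a vehicle identifier. `MitigationState`, `mitigate`, and the label strings are hypothetical names used for illustration; a real deployment would push rules to a firewall or gateway rather than update in-memory sets.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class MitigationState:
    """Hypothetical in-memory record of the countermeasures taken so far."""
    blocked_ips: Set[str] = field(default_factory=set)
    isolated_vehicles: Set[str] = field(default_factory=set)


def mitigate(state: MitigationState, label: str,
             source_ip: str, vehicle_id: str) -> str:
    """Apply the countermeasure matching the predicted label."""
    if label == "DoS":
        # DoS: block the offending source address.
        state.blocked_ips.add(source_ip)
        return f"blocked {source_ip}"
    if label == "Spoofing_GAS":
        # Spoofing GAS: isolate the compromised vehicle.
        state.isolated_vehicles.add(vehicle_id)
        return f"isolated {vehicle_id}"
    # Benign or unknown labels trigger no action in this sketch.
    return "no action"
```

Keeping the decision in one small function makes it easy to audit which label triggers which action, and to add further classes later.
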

3 Security Challenges of IoV Communication

The IoV depends on wireless communication, real-time exchange of sensor data, and cloud services to enhance smart transportation systems. However, these features are accompanied by serious security risks, which make IoV networks very attractive targets for cyber attacks. Among these threats, gas sensor spoofing (Spoofing GAS) and DoS attacks are particularly dangerous because they can endanger vehicular safety, traffic control, and autonomous control.

Table 1 Comparison of IoV IDPS Approaches

| Study | Approach | Dataset Used | Detection Accuracy | Real-Time Capability | Prevention Mechanism |
| --- | --- | --- | --- | --- | --- |
| [9] | CICIoV2024 dataset benchmark with ML evaluation | CICIoV2024 | High (F1-Score & RF performed best) | No (Batch processing, no real-time detection) | No (Detection-only, lacks prevention) |
| [18] | XGBoost-based ML optimization with Bayesian tuning | CICDDoS2019 | 99.02% | No (Batch processing, no Spark Streaming) | No (Focuses on detection only) |
| [19] | Deep Transfer Learning-based IDS with LeNet | Real-world CAN-BUS data (Flooding, Fuzzing, Spoofing attacks) | 98.10% | No (Batch processing, not real-time) | No (Detection-focused, no active prevention) |
| [20] | Transfer Learning-based IDS with CNN on CAN-BUS data | Car-Hacking (Source), OTIDS (Target) | 99.91% (Car-Hacking), 99.87% (OTIDS) | No (Batch training, no real-time detection) | No (Detection-only, no prevention measures) |
| [21] | Federated Learning-based IDS with SMOTE and outlier detection | CIC-IDS2017, KDD-Cup 99, UNSW-NB-15 | 96.23% | No (No real-time countermeasures mentioned) | No (Detection-focused, lacks mitigation strategies) |
| [22] | Hybrid ML & DL IDS with fuzzy clustering | Multiple datasets (not explicitly named) | Reduced false positives significantly | No (Batch processing, no real-time capability) | No (Detection-focused, lacks prevention measures) |
| [23] | Leader Class and Confidence Decision Ensemble (LCCDE) using XGBoost, LightGBM, CatBoost | Car-Hacking & CICIDS2017 | 99.9997% (Car-Hacking), 99.811% (CICIDS2017) | No (High computational complexity) | No (Focuses on classification improvement) |

Table 2 illustrates the combined impact of Spoofing GAS and DoS attacks, demonstrating their disruptive effects on traffic flow and smart transportation. This section analyzes these two security challenges, detailing their attack mechanisms, impacts, and potential consequences on IoV operations.

Table 2 Comparative Analysis of Spoofing GAS and DoS Attacks

| Threat Type | Target System | Attack Method | Impact |
| --- | --- | --- | --- |
| Spoofing GAS | CAN-BUS Sensors | Injecting false sensor data | Fuel miscalculations, vehicle shutdowns, regulatory violations |
| DoS Attack | V2I Communication (Roadside Unit, Traffic Systems) | Flooding networks with malicious traffic | Traffic light failures, delayed autonomous vehicle responses, emergency service disruption |

3.1 Spoofing GAS Attack on IoV Networks

Attack Mechanism
A Spoofing GAS attack targets gas sensor data that is transmitted through the CAN-BUS and results in wrong fuel level or emissions data being displayed on the vehicle's onboard computer. This attack exploits the lack of authentication and encryption in CAN-BUS communication, allowing adversaries to inject malicious CAN messages [24]. The attack proceeds in three steps:
Step 1: The attacker intercepts the vehicle's CAN-BUS communications through malware or physical access to the vehicle's network.
Step 2: The adversary injects false fuel level data, which makes the vehicle believe it has more fuel than it actually does.
Step 3: The driver or the AI-based navigation system calculates the wrong refueling requirements, and the vehicle shuts down at an unsuitable location.

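
The three steps rely on the fact that a CAN frame carries an identifier and a payload but no authenticated sender, so a receiver cannot tell a genuine gas sensor reading from an injected one. The toy simulation below illustrates this under stated assumptions: the arbitration ID `0x201`, the one-byte fuel percentage encoding, and the `CanFrame` type are hypothetical and are not taken from the paper or from any real vehicle.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

GAS_SENSOR_ID = 0x201  # hypothetical arbitration ID (assumption)


@dataclass(frozen=True)
class CanFrame:
    """Simplified CAN frame: an identifier and a payload, no sender identity."""
    arbitration_id: int
    data: bytes


def latest_fuel_percent(bus: List[CanFrame]) -> Optional[int]:
    """Receiver logic: trust the most recent frame with the gas sensor ID."""
    for frame in reversed(bus):
        if frame.arbitration_id == GAS_SENSOR_ID:
            return frame.data[0]
    return None


def simulate() -> Tuple[Optional[int], Optional[int]]:
    """Return the fuel reading before and after a spoofed frame is injected."""
    bus = [CanFrame(GAS_SENSOR_ID, bytes([12]))]      # genuine reading: 12%
    before = latest_fuel_percent(bus)
    bus.append(CanFrame(GAS_SENSOR_ID, bytes([80])))  # injected reading: 80%
    after = latest_fuel_percent(bus)
    return before, after
```

Here `simulate()` shows the receiver's view changing from 12 to 80 after the injected frame, which corresponds to Step 2: the vehicle now believes it has more fuel than it does.
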
Impact on IoV Systems
Unexpected Vehicle Failures: False fuel indicators make vehicles run out of gas in critical traffic situations, resulting in traffic jams.
Environmental Violations: Fake emissions data circumvents regulatory checks, so high-emission vehicles can avoid being caught.
Self-Driving Vehicle Disruptions: Self-driving systems may misjudge fuel usage and refueling distances because of tampered gas sensor information, which may lead to unexpected failures, incorrect routing, and safety risks.

3.2 DoS Attacks on IoV Networks

Attack Mechanism
A DoS attack on V2I networks [25] is an attempt to bring down roadside units (RSUs), base stations, and cloud-based traffic management systems by overwhelming them with excessive requests [26]. The attack progresses in three main stages:
Step 1: The attacker sends a large number of data packets to RSUs, base stations, or cloud services.
Step 2: The traffic management system is unable to process the genuine data, which leads to a delay in vehicle response.
Step 3: Such attacks affect navigation, emergency services, and traffic signals and result in chaos in smart transportation systems.

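
The delay described in Step 2 can be illustrated with a simple queueing model. The sketch below assumes a first-in, first-out queue at an RSU that serves a fixed number of messages per tick. The function name, the tick abstraction, and the numbers are illustrative assumptions, and the model ignores packet drops and any flood traffic arriving after the legitimate message.

```python
from collections import deque


def legit_wait_ticks(flood_per_tick: int, capacity_per_tick: int,
                     flood_ticks: int) -> int:
    """Ticks a legitimate message waits when it arrives behind a flood.

    Each tick, flood_per_tick attack messages arrive and the RSU serves
    up to capacity_per_tick messages in FIFO order. After flood_ticks
    ticks, one legitimate message joins the back of the queue.
    """
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    backlog = deque()
    for _ in range(flood_ticks):
        backlog.extend(["flood"] * flood_per_tick)
        for _ in range(min(capacity_per_tick, len(backlog))):
            backlog.popleft()
    backlog.append("legit")
    waited = 0
    while True:
        waited += 1
        # Serve one tick's worth of messages; stop when "legit" is served.
        for _ in range(min(capacity_per_tick, len(backlog))):
            if backlog.popleft() == "legit":
                return waited
```

With no flood, the legitimate message is served in the first tick. With 50 attack messages per tick against a capacity of 10 for 5 ticks, a backlog of 200 messages builds up and the legitimate message waits 21 ticks.
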
Impact on IoV Systems
Traffic Signal Failures: Real-time signal control updates are disrupted, resulting in uncoordinated and even conflicting traffic flows.
Autonomous Vehicle Malfunctions: Self-driving cars depend on V2I communications to update their maps, and any blockage of these signals can lead to incorrect route planning.
Emergency Response Delays: Ambulances, police, and fire trucks cannot receive priority clearance, which increases emergency response times.

4 Evaluating ML Models for IDS

To address the security challenges identified in Section 3, this section presents a comprehensive evaluation of intrusion detection datasets, preprocessing techniques, feature selection, and machine learning model configurations suitable for real-time IoV environments.

To design and assess an effective IDS for the IoV, we adopt three benchmark datasets: CICIoV2024, CIC-IDS-2017, and BoTNeTIoT. Together, the datasets provide a comprehensive evaluation framework for ML models, as they encompass a range of cyber threats, including CAN-BUS intrusions, network-based cyberattacks, and large-scale botnet-based anomalies.

4.1 CICIoV2024

The CICIoV2024 dataset was created to improve the security of in-vehicle communications by simulating different attack scenarios on CAN-BUS networks within IoV systems and using real-world vehicular traffic data from a 2019 Ford vehicle.
Key Characteristics:
• Real-world CAN-BUS traffic with normal and malicious packets is captured.
• Different types of attacks, like DoS attacks and spoofing-based attacks (e.g., Spoofing GAS, Steering Wheel Spoofing, Speed Spoofing), are included.
• Data is available in binary, decimal, and hexadecimal forms, which gives the ML models the chance to learn from different representations of the data.

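
To make the three representations concrete, the short sketch below renders one CAN payload as hexadecimal, decimal, and binary byte values. The example payload `7D0400` and the function name `payload_views` are illustrative assumptions and do not reproduce the dataset's actual file layout.

```python
from typing import Dict, List


def payload_views(hex_payload: str) -> Dict[str, List]:
    """Return the same payload bytes in hexadecimal, decimal, and binary form."""
    raw = bytes.fromhex(hex_payload)
    return {
        "hexadecimal": [f"{b:02X}" for b in raw],
        "decimal": list(raw),
        "binary": [f"{b:08b}" for b in raw],
    }
```

For the payload `7D0400`, the first byte appears as `7D`, `125`, and `01111101` respectively. The information is identical in all three views, but each gives a model a different number and range of input features.
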
4.2 CIC-IDS-2017

CIC-IDS-2017 is a general-purpose network intrusion dataset that this study uses to represent network-level IoV threats, such as V2I and V2V attacks, which CAN-BUS data alone does not cover.
Key Characteristics:
• Incorporates network-level intrusions relevant to IoV, such as DDoS, Spoofing, Man in the Middle (MITM), and SQL Injection.
• The dataset is meant for real-time attack detection and is therefore useful for assessing streaming-based IDS implementations.
• Network flow-based and packet-based intrusion detection features are provided by this dataset.

4.3 BoTNeTIoT

The BoTNeTIoT dataset represents a real-world threat that targets both IoV and Internet of Things (IoT) systems through botnet-based cyberattacks and includes extensive DDoS, DoS, and reconnaissance attacks on connected vehicles as well as smart city ecosystems.
Key Characteristics:
• Includes several types of botnet-related intrusions, such as DDoS attacks (TCP, UDP, and HTTP flooding), DoS attacks (protocol-specific exploits), scanning attacks (reconnaissance and OS fingerprinting), and service exploits and data exfiltration.
• It offers the data in PCAP, Argus, and CSV formats, which allows for various methods of intrusion detection.
• Use case: assessing large-scale botnet detection capabilities in IoV and IoT-integrated smart transportation systems.

Table 3 presents a comparative overview of these datasets, highlighting their focus areas, attack types, data formats, and key features.

Table 3 Comparison of IoV Intrusion Detection Datasets

| Dataset | Focus Area | Attack Types | Data Formats | Key Features |
| --- | --- | --- | --- | --- |
| CICIoV2024 | CAN-BUS Intrusion Detection | DoS, Spoofing GAS, Spoofing RPM, Spoofing SPEED, Spoofing STEERING | Hexadecimal, Decimal, Binary | Real-world CAN-BUS data, High accuracy for DNNs and RF, Evaluates vehicle sensor security |
| CIC-IDS-2017 | Network-Based IoV Intrusions | Brute Force (FTP, SSH), DoS, DDoS, Heartbleed, Web Attacks, Botnet | PCAP, CSV | 5-day real-world attack traffic, IoV network security evaluation |
| BoTNeTIoT | IoT and Vehicular Botnet Attacks | DDoS, DoS, Scanning, Service Exploits, Data Exfiltration | PCAP, Argus, CSV | Focuses on large-scale botnet threats in IoV, Evaluates IoV-IoT hybrid security |

4.4 Justification of Dataset Selection for IoV Contexts

This study employs three publicly available datasets, CICIoV2024, CIC-IDS-2017, and BoTNeTIoT, to evaluate the performance of several ML models in the context of IDPS for the IoV. CICIoV2024 directly addresses CAN-BUS attacks, while the inclusion of CIC-IDS-2017 and BoTNeTIoT expands the evaluation to encompass broader IoV threat vectors, particularly those affecting V2I and IoT-integrated infrastructure.

CICIoV2024 is explicitly designed for CAN-BUS intrusion scenarios, offering realistic spoofing attacks such as Spoofing GAS and Spoofing RPM. These directly affect in-vehicle sensor integrity and are critical for evaluating detection at the Electronic Control Unit (ECU) level.

CIC-IDS-2017 focuses on network-level attacks (e.g., DDoS, Heartbleed) and captures detailed flow-level features such as Flow Duration, Packet Length Mean, and Total Forward Packets. While originally general-purpose, these features are applicable to V2I attack modeling, particularly in simulating Roadside Unit (RSU) flooding or protocol abuse.

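
As an illustration of what these flow-level features measure, the sketch below derives three of them from a list of packets belonging to one flow. The `Packet` type, its field names, and the use of seconds for timestamps are assumptions made for the example. The dataset itself ships precomputed flow features, so this is not the procedure used to build it.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Packet:
    timestamp: float  # seconds since capture start (assumed unit)
    length: int       # packet length in bytes
    forward: bool     # True if sent by the flow initiator


def flow_features(packets: List[Packet]) -> Dict[str, float]:
    """Compute three flow-level features for one flow."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    times = [p.timestamp for p in packets]
    return {
        "flow_duration": max(times) - min(times),
        "packet_length_mean": mean(p.length for p in packets),
        "total_forward_packets": sum(1 for p in packets if p.forward),
    }
```

In an RSU flooding scenario one would expect many forward packets within a short flow duration, which is the kind of pattern a classifier can learn from these features.
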
BoTNeTIoT includes botnet-oriented attacks such as DDoS, service scanning, and data exfiltration. These are especially relevant in smart transportation environments where IoV systems coexist with IoT devices like smart traffic lights and edge sensors. The inclusion of BoTNeTIoT supports evaluation of hybrid threat scenarios and system generalizability.

To clarify this mapping, Table 4 summarizes how each dataset corresponds to distinct IoV security domains.

Table 4 Mapping of Datasets to IoV Threat Vectors

| Dataset | IoV Threat | Relevant Features / Attacks |
| --- | --- | --- |
| CICIoV2024 | CAN-BUS Spoofing | Raw sensor data (e.g., GAS, RPM, SPEED), spoofed CAN IDs |
| CIC-IDS-2017 | V2I DoS | Flow duration, Packet Length Mean, Forward Packets (RSU flooding simulation) |
| BoTNeTIoT | IoV-IoT Botnet | Packet rate, scanning behavior, exfiltration patterns from edge nodes |

4.5 Preprocessing and Feature Engineering

To improve the model's accuracy and generalization ability in intrusion detection, we applied data preprocessing, feature selection, and hyperparameter tuning to the CICIoV2024, CIC-IDS-2017, and BoTNeTIoT datasets. These steps were important to overcome challenges such as CAN-BUS intrusions, network-based cyberattacks, and large-scale botnet threats in IoV environments.

Data Preprocessing:
Each dataset required specific preprocessing steps to handle missing values, scale numerical features, transform categorical variables, and normalize the data before using it for ML model training.

• For CICIoV2024 (CAN-BUS Intrusions), the dataset did not contain any missing values, so there was no need to deal with missing data. Label encoding was used to convert the attack types from categorical to numerical values. Z-score normalization was applied to the numerical fields DATA_0 through DATA_7, which represent raw CAN-BUS sensor readings such as vehicle speed, engine RPM, throttle position, and steering wheel angle, so that they share a similar scale.
• In CIC-IDS-2017 (Network-Based IoV Intrusions), we replaced missing values in the Flow Bytes feature with 0 to ensure comparability of the data. The categorical variables Protocol Type and State Flags were transformed into numerical variables using one-hot encoding. Min-max scaling was used to normalize network flow-based attributes, mapping their varying ranges onto a common scale.
• In BoTNeTIoT (Botnet-Based Cyberattacks), missing values were handled by median imputation to preserve the statistical integrity of the data. The categorical protocol type, attack type, and state features were transformed into numerical variables through label encoding. Z-score normalization was used so that numerical features have comparable distributions. Due to the strong class imbalance in the dataset, SMOTE was used to balance attack categories.

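
The transformations named above are standard and can be written in a few lines. The sketch below uses only the Python standard library and operates on plain lists. The function names are illustrative, and it is an assumption that the study used library implementations of the same formulas: z-score is (x - mean) / standard deviation, and min-max is (x - min) / (max - min).

```python
from statistics import mean, median, pstdev
from typing import Dict, List, Optional, Tuple


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def z_score(values: List[float]) -> List[float]:
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values: List[float]) -> List[float]:
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(labels: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each category to an integer, in sorted order of category name."""
    mapping = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def one_hot(labels: List[str]) -> List[List[int]]:
    """Encode each category as a 0/1 indicator vector."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]
```

For example, `min_max([10.0, 20.0, 30.0])` gives `[0.0, 0.5, 1.0]`, and `label_encode(["DoS", "Benign", "DoS"])` maps `Benign` to 0 and `DoS` to 1. Scaling parameters should be fitted on the training split only and then reused on the test split, so that test statistics do not leak into training.
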
13 Feature Engineering:
14 To improve classification accuracy and model efficiency, feature selection techniques
15 were applied to remove redundant or irrelevant features.
16
17 • For CICIoV2024, seven key CAN-BUS parameters were identified as the most rele-
18 vant features, including speed, engine RPM, gas sensor readings, and steering wheel
19 position. To refine the feature set, SelectKBest with ANOVA F-score was applied,
20 ensuring the selection of only the most informative attributes for intrusion detection.
21 Additionally, log transformation was implemented to enhance data distribution,
22 reducing skewness and improving model interpretability.
• For CIC-IDS-2017, five critical network flow attributes were selected: Flow Duration,
Packet Length Mean, Total Forward Packets, Flow Bytes per Second, and Backward
Packet Length Mean. The Pearson correlation coefficient was used to eliminate
highly correlated variables, thereby preventing redundancy in the feature
representation. Furthermore, mutual information was employed to identify the top 10 most
informative features, so that only the most predictive attributes were retained
for ML model training.
• For BoTNeTIoT, feature selection focused on packet-level behaviors, extracting key
attributes such as packet rate, source-destination connection frequency, and protocol
behavior. To refine the dataset, Recursive Feature Elimination (RFE) was used
to systematically remove less relevant features while preserving those that contribute
most to attack detection. Additionally, Principal Component Analysis (PCA) was
applied to reduce dimensionality, retaining 12 critical packet-level attributes, which
improved computational efficiency while preserving classification accuracy.
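As an illustration of the correlation-based filtering mentioned for CIC-IDS-2017, the following sketch drops any feature whose absolute Pearson correlation with an already retained feature exceeds a threshold. The threshold of 0.95, the feature names, and the sample values are assumptions made for this example and are not taken from the manuscript.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def drop_correlated(columns, threshold=0.95):
    """Greedily keep a column only if |r| with every kept column is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical flow features; total_bytes is an exact multiple of fwd_packets.
columns = {
    "flow_duration": [1.0, 4.0, 2.0, 8.0, 5.0],
    "fwd_packets": [10.0, 12.0, 30.0, 14.0, 25.0],
    "total_bytes": [100.0, 120.0, 300.0, 140.0, 250.0],
}
selected = drop_correlated(columns)
```

Here `total_bytes` carries no information beyond `fwd_packets`, so it is removed while the two weakly correlated features are kept.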
Table 5 summarizes the key preprocessing, feature selection, and hyperparameter
tuning techniques applied to each dataset.

Table 5 Preprocessing, Feature Engineering, and Hyperparameter Optimization Summary
Dataset | Data Cleaning | Feature Selection | Normalization | Hyperparameter Tuning
CICIoV2024 | Checked for missing values, label-encoded categorical features | SelectKBest (F-score) | Standardization (Z-score) | XGBoost (Bayesian Optimization), RF (Grid Search)
CIC-IDS-2017 | Standardized network protocol fields, replaced missing values | Pearson Correlation, Mutual Information | Min-Max Scaling | LightGBM (Grid Search), Extra Trees (Random Search)
BoTNeTIoT | Applied SMOTE for class balancing, resolved mixed data types | Recursive Feature Elimination (RFE), PCA | Logarithmic Transformation | Logistic Regression (Grid Search), SGD Classifier (Random Search)

With the datasets preprocessed and normalized, we now proceed to fine-tune each
ML model to ensure optimal performance under the chosen feature configurations.
Hyperparameter Tuning:
To enhance model performance, we adopted three optimization strategies (Grid
Search, Random Search, and Bayesian Optimization), chosen according to each model's
parameter complexity and sensitivity [27, 28]. Gradient boosting models were tuned using
Bayesian Optimization due to their expansive and continuous search spaces. Simpler
models like Random Forest, Extra Trees, and Logistic Regression were optimized via
Grid Search, while the stochastic SGD Classifier was tuned using Random Search.
Although Gaussian Naïve Bayes (GNB) is commonly treated as parameter-free,
we explicitly tuned its var_smoothing parameter to improve numerical stability in
low-variance CAN-BUS data. Table 6 summarizes the optimization methods and
hyperparameters applied.
Table 6 Hyperparameter Optimization for Machine Learning Models
Model | Optimization Technique | Optimized Parameters
Random Forest | Grid Search (systematic exploration of every combination of hyperparameters within specified ranges) | n_estimators=100, max_depth=None, min_samples_split=2
XGBoost | Bayesian Optimization (probabilistic modeling and iterative refinement) | learning_rate=0.1, max_depth=6, n_estimators=100, subsample=0.8
LightGBM | Grid Search (evaluation of parameter grids for optimal settings) | num_leaves=31, learning_rate=0.05, feature_fraction=0.8, boosting_type=gbdt
Extra Trees | Grid Search (exhaustive search over hyperparameter combinations) | n_estimators=100, criterion=entropy, min_samples_leaf=1
SGD Classifier | Random Search (efficient search of large hyperparameter spaces) | loss=log_loss, penalty=l1, alpha=5e-5, learning_rate=optimal
Gaussian Naïve Bayes | Grid Search (tuned via preprocessing and feature selection) | var_smoothing=1e-09; feature selection with SelectKBest (k=7); log1p transformation and StandardScaler applied
Logistic Regression | Grid Search (evaluates full combination space) | penalty=l2, solver=saga, max_iter=1000, C=1
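To make the difference between Grid Search and Random Search concrete, the sketch below enumerates candidates both ways over a small search space. The candidate values and the `toy_score` function are hypothetical stand-ins (a real run would score each configuration by cross-validated model performance); only the Random Forest parameter names come from Table 6.

```python
import itertools
import random

def grid_candidates(space):
    """Yield every combination of the listed values (exhaustive Grid Search)."""
    names = list(space)
    for combo in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, combo))

def random_candidates(space, n_iter, seed=0):
    """Yield n_iter combinations sampled independently per parameter (Random Search)."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {name: rng.choice(values) for name, values in space.items()}

def best_config(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

# Hypothetical search space around the Random Forest values in Table 6.
space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

def toy_score(cfg):
    """Stand-in for cross-validated accuracy; peaks at the Table 6 values."""
    depth_penalty = 0 if cfg["max_depth"] is None else 1
    return -abs(cfg["n_estimators"] - 100) - depth_penalty - (cfg["min_samples_split"] - 2)

grid_best = best_config(grid_candidates(space), toy_score)
sampled = list(random_candidates(space, n_iter=5))
```

Grid Search evaluates all 18 combinations here, which guarantees that the best listed configuration is found but grows multiplicatively with each added parameter. Random Search evaluates only the requested number of samples, which is why it tends to be preferred for larger spaces.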
4.6 Evaluation metrics

To evaluate how effectively the trained ML models identify intrusions in the IoV
environment, we used several metrics. Together they assess both classification
quality and real-time efficiency, which are equally important for an IDPS: the model
must identify cyber threats correctly while keeping false alarms low in a real-time
system.
Accuracy:
Accuracy is the proportion of all predictions that are correct. Although it is a
useful overall performance measure, it can be misleading for imbalanced datasets.
Precision:
Precision is the ratio of correct positive predictions to all positive predictions
made by the model. A high precision value means a low rate of false positives, so the
system does not raise unnecessary alarms, which is important in IoV applications.
Recall:
Recall is the ratio of correctly detected attacks to all actual attacks. A high
recall value means that the model has flagged most of the intrusions, so that few
cyber threats go undetected.
F1-Score:
The F1-Score is the harmonic mean of precision and recall and therefore captures the
balance between the two. This metric is particularly relevant in IoV cybersecurity
contexts, where both false positives and false negatives can have severe
consequences.
Training time:
Training time is the time taken to build a given ML model. It is an important
factor in choosing an IDS model, since complex models may be too slow for real-time
use in IoV networks.
Prediction time:
Prediction time is the time it takes the trained model to make a decision on a
given input. Since IoV security needs to operate in real time, low prediction latency
is crucial for effective attack response.
Total execution time:
Total execution time is the sum of the training time and the prediction time. It gives
a general view of the computational resources a model requires for real-time
application.
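The classification metrics defined above can be computed as in the following sketch. It assumes support-weighted averaging over the classes, which is an inference rather than something the manuscript states: under that scheme recall always equals accuracy, which matches the identical Accuracy and Recall columns in Tables 7 to 9. The labels are hypothetical.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 over all classes."""
    n = len(y_true)
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n  # share of samples that truly belong to this class
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for three traffic classes.
y_true = ["benign", "benign", "benign", "dos", "dos", "spoof"]
y_pred = ["benign", "benign", "dos", "dos", "dos", "benign"]
scores = weighted_metrics(y_true, y_pred)
```

In this toy example four of six predictions are correct, so accuracy and weighted recall are both 4/6, while weighted precision is lower because the missed spoofing sample contributes zero precision.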
4.7 Results of Evaluation

Below, we summarize the comparative performance of the ML models on each
dataset.
CICIoV2024 Dataset
In CICIoV2024, the top-performing models in terms of accuracy, precision, and
recall were RF, XGBoost, LightGBM, and Extra Trees, each attaining an accuracy
of 99.64%. Gaussian Naïve Bayes and Logistic Regression scored lower, at 92.48%
and 89.49%, respectively. The SGD Classifier had the lowest accuracy, 89.18%,
making it less suitable for CAN-BUS intrusion detection (see Table 7).
Table 7 Performance Comparison of ML Models on CICIoV2024
Model | Accuracy | Precision | Recall | F1-Score | Train (s) | Predict (s) | Total (s)
Random Forest | 0.9964 | 0.9967 | 0.9964 | 0.9963 | 36.9300 | 1.2100 | 38.1400
XGBoost | 0.9964 | 0.9967 | 0.9964 | 0.9963 | 13.1000 | 0.1200 | 13.2200
LightGBM | 0.9964 | 0.9967 | 0.9964 | 0.9963 | 9.5600 | 1.2100 | 10.7700
Extra Trees | 0.9964 | 0.9967 | 0.9964 | 0.9963 | 30.0200 | 1.4900 | 31.5000
SGD Classifier | 0.8918 | 0.8343 | 0.8918 | 0.8456 | 58.7800 | 0.0200 | 58.8000
Gaussian NB | 0.9248 | 0.9495 | 0.9248 | 0.9260 | 0.1600 | 0.0800 | 0.2300
Logistic Regression | 0.8949 | 0.8368 | 0.8949 | 0.8521 | 49.6300 | 0.0200 | 49.6500
• Among the four top-accuracy models, XGBoost had the shortest prediction time
(0.12 seconds) and a training time of 13.10 seconds, making it an efficient option.
• LightGBM had the shortest training time (9.56 seconds) and total execution time
(10.77 seconds) among those four models.
• The SGD Classifier had the longest training time (58.78 seconds) despite
its lower accuracy.
• Gaussian Naïve Bayes was by far the fastest model (0.23 seconds in total) but had
markedly lower accuracy (92.48%).

The findings suggest that XGBoost and LightGBM are the strongest choices for
real-time CAN-BUS intrusion detection, owing to their balance of accuracy and
efficiency.
CIC-IDS-2017 Dataset
In CIC-IDS-2017, RF, XGBoost, and Extra Trees each exceeded 99.8% accuracy, and
LightGBM reached 98.73%, demonstrating the robustness of tree ensembles in
network-based intrusion detection (see Table 8).

Table 8 Performance Comparison of ML Models on CIC-IDS-2017
Model | Accuracy | Precision | Recall | F1-Score | Train (s) | Predict (s) | Total (s)
Random Forest | 0.9988 | 0.9987 | 0.9988 | 0.9987 | 854.3800 | 8.3200 | 862.6900
XGBoost | 0.9990 | 0.9990 | 0.9990 | 0.9990 | 137.2900 | 1.9700 | 139.2500
Optimized LightGBM | 0.9873 | 0.9890 | 0.9873 | 0.9879 | 110.9900 | 11.4600 | 122.4600
Extra Trees | 0.9983 | 0.9983 | 0.9983 | 0.9983 | 360.2600 | 10.8300 | 371.0900
SGD Classifier | 0.9531 | 0.9494 | 0.9531 | 0.9495 | 88.8800 | 0.2400 | 89.1200
Gaussian NB | 0.8030 | 0.6448 | 0.8030 | 0.7153 | 0.2200 | 0.5500 | 0.7800
Logistic Regression | 0.9728 | 0.9716 | 0.9728 | 0.9715 | 496.5600 | 0.2300 | 496.7900

• XGBoost achieved the highest accuracy (99.90%) while keeping its total
execution time (139.25 sec) far below that of RF (862.69 sec).
• LightGBM achieved 98.73% accuracy with a total execution time of 122.46 sec,
making it more efficient than RF.
• Logistic Regression and the SGD Classifier were less accurate, at 97.28% and
95.31%, respectively. The SGD Classifier had a shorter total execution time
(89.12 sec) than XGBoost and LightGBM, whereas Logistic Regression needed 496.79 sec.
• Gaussian Naïve Bayes had the weakest performance (80.30% accuracy), indicating
poor suitability for network-based IoV intrusion detection.

The results highlight XGBoost and LightGBM as strong candidates, combining high
detection rates with shorter execution times than RF and Extra Trees.
BoTNeTIoT Dataset
For BoTNeTIoT, RF, LightGBM, and Extra Trees achieved 100% accuracy, precision,
recall, and F1-Score, indicating their reliability in detecting
botnet-based attacks. XGBoost followed closely with 99.99% accuracy, making it a
strong competitor (see Table 9).
Table 9 Performance Comparison of ML Models on BoTNeTIoT
Model | Accuracy | Precision | Recall | F1-Score | Train (s) | Predict (s) | Total (s)
Random Forest | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 325.6500 | 1.8600 | 327.5100
XGBoost | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 88.3900 | 0.5400 | 88.9400
LightGBM | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 97.6300 | 2.4700 | 100.1000
SGD Classifier | 0.9165 | 0.9170 | 0.9165 | 0.9166 | 18.8500 | 0.1500 | 18.9900
Extra Trees | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 161.3700 | 2.1000 | 163.4700
Gaussian NB | 0.9993 | 0.9994 | 0.9993 | 0.9993 | 4.0000 | 1.2900 | 5.2900
Logistic Regression | 0.9991 | 0.9995 | 0.9991 | 0.9992 | 6277.0800 | 0.0900 | 6277.1700
• XGBoost had a much shorter total execution time (88.94 sec) than RF (327.51 sec).
• The SGD Classifier had a markedly lower accuracy (91.65%) but was computationally
efficient (18.99 sec in total).
• Gaussian Naïve Bayes performed well, with 99.93% accuracy and the lowest
training time (4.00 sec), making it an interesting choice for rapid deployments.
• Logistic Regression had the longest total execution time (6277.17 sec), making it
unsuitable for real-time botnet detection.

The results suggest that RF and LightGBM are the best options for BoTNeTIoT,
offering the highest accuracy while maintaining reasonable execution efficiency.
Summary of Comparative Model Performance
• CICIoV2024: XGBoost and LightGBM performed best, with high accuracy and low
execution time.
• CIC-IDS-2017: XGBoost and LightGBM showed superior accuracy and computational
efficiency.
• BoTNeTIoT: RF and LightGBM provided the best detection rates while maintaining
reasonable efficiency.

Overall, XGBoost was chosen as the primary model due to its balance of
accuracy, execution speed, and efficiency in real-time intrusion detection. In
CICIoV2024, XGBoost achieved 99.64% accuracy with a total execution time of 13.22 s,
significantly faster than RF (38.14 s) and Extra Trees (31.50 s). Similarly, in
CIC-IDS-2017, it maintained 99.90% accuracy with an execution time of 139.25 s, far
more efficient than RF's 862.69 s.
XGBoost's scalability, fast inference speed, and high precision make it well suited
to IoV security, supporting low-latency attack detection and real-time adaptability.
Its performance across diverse datasets confirms its suitability for intrusion
detection in dynamic vehicular networks.
5 The Proposed Intrusion Detection and Prevention System

Having evaluated the ML models and selected the most effective configurations, this
section describes the architecture and components of the proposed real-time IDPS
that deploys these models in practice.

5.1 System Overview

To protect the IoV from evolving cyber threats, we propose a real-time
IDPS that combines XGBoost-based detection with stream processing for effective
attack prevention. The system addresses IoV security issues by identifying
DoS attacks and Spoofing GAS intrusions, taking real-time countermeasures,
and continuously monitoring vehicular networks.
The proposed IDPS consists of three core phases: Detection, Prevention, and
Reporting.
• Detection Phase: Employs an XGBoost classifier trained on CAN-BUS and network
traffic to label incoming real-time data as normal or malicious.
• Prevention Phase: Implements automated responses based on attack type:
  • DoS Attack Mitigation: Blocks traffic from the malicious IP.
  • Spoofing GAS Attack Response: Isolates the compromised vehicle and triggers
    security alerts.
• Reporting Phase: Logs detected intrusions, alerts system administrators, and
visualizes real-time attack trends.
The system is built on a distributed and scalable architecture: Apache Kafka
[29] is used for real-time data ingestion and Apache Spark Streaming for high-speed
processing [30]. The system integrates XGBoost to achieve high detection rates with
efficient execution times, thus making it appropriate for resource-constrained
vehicular environments. The overall architecture and workflow of the proposed IDPS are
illustrated in Fig. 4.

Fig. 4 Real-Time IDPS Workflow for IoV Security

Key Components of the System
• Real-Time Data Acquisition: Collects network and CAN-BUS traffic from IoV
environments.
• ML-Based Detection: Applies an optimized model (XGBoost) to classify cyber
threats.
• Streaming and Processing Framework: Employs Kafka for message queuing and
Spark Streaming for real-time distributed analysis.
• Automated Prevention Mechanisms: Applies traffic filtering and anomaly response
measures.
• Visualization and Monitoring: Provides a real-time dashboard with insights into
attack patterns and security incidents.

5.2 Data Simulation

To evaluate the real-time performance of our IDPS, we used the CICIoV2024
dataset to simulate CAN-BUS traffic in an IoV environment. Instead of treating the
dataset as a static source, we implemented a real-time Kafka-based data streaming
pipeline so that data ingestion and processing closely resemble a dynamic IoV
ecosystem.
To achieve this, we first extracted relevant traffic patterns from the CICIoV2024
dataset, categorizing them into different event types based on their characteristics.
These events were then formatted into streaming messages and injected into a Kafka
topic at varying time intervals, simulating real-world vehicular communications.
The types of simulated traffic include:

• Benign Traffic: Typical vehicular communication with realistic sensor behavior.
• DoS Attack Traffic: High-frequency packet floods aimed at V2I communication,
leading to network disruptions.
• Spoofing GAS Attack Traffic: Modified gas sensor readings to deceive fuel-level
estimations in IoV systems.

To simulate real-world IoV dynamics, we introduced the following enhancements:
• Variable Data Injection Rates: Kafka producers send messages at randomized
time intervals to replicate network congestion and fluctuating traffic loads.
• Network Latency Simulation: Randomized delays were added before transmitting
messages to simulate real-world CAN-BUS and V2X network variability.
• Dynamic Attack Injection: Attacks were introduced at unpredictable intervals
rather than in fixed sequences, ensuring a non-deterministic test scenario.

For performance validation, we monitored:

• End-to-End Processing Latency: Kafka-to-Spark delays were recorded to verify
that intrusion detection runs in real time.
• Scalability Under High Loads: We tested increased Kafka producer rates to
evaluate system robustness and response under stress conditions.
• System Resource Utilization: CPU, memory, and network overhead were
monitored to identify performance bottlenecks.

5.3 Data Ingestion

To facilitate real-time processing, the generated traffic data is streamed using Apache
Kafka [31], a distributed messaging system that provides efficient and scalable data
transmission. The Kafka producer is responsible for publishing benign, Spoofing GAS,
and DoS attack traffic to predefined topics, enabling downstream consumers to process
the data in real time.
Kafka Producer Setup:
The Kafka producer is configured to:

• Stream messages continuously to simulate real-time IoV traffic.
• Serialize data in JSON format to ensure compatibility with consumers.
• Assign traffic to specific Kafka topics based on attack type.

Topic Configuration for Different Traffic Types:
To ensure structured data flow, Kafka topics are organized as follows:
• benign_traffic: Normal CAN-BUS messages.
• dos_attack: High-frequency flooding traffic mimicking DoS attacks.
• Spoofing_Gas: Manipulated fuel level readings injected into CAN-BUS.

The Spark consumer (detailed in the next subsection) subscribes to these topics
and processes the data accordingly, so that the attack detection mechanisms operate
in a structured and scalable manner.

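The producer behaviour described above (JSON serialization, per-type topic routing, randomized send intervals) can be sketched as follows. This is an illustrative standard-library sketch, not the authors' implementation: the record fields, the `send` callback, and the delay bound are hypothetical, and in practice `send` would wrap a real Kafka producer client.

```python
import json
import random

# Topic names follow the topic list in this subsection; the underscores are an
# assumption, since Kafka topic names cannot contain spaces.
TOPIC_BY_LABEL = {
    "benign": "benign_traffic",
    "dos": "dos_attack",
    "spoofing_gas": "Spoofing_Gas",
}

def to_message(record):
    """Return (topic, payload) for one labelled CAN-BUS record."""
    topic = TOPIC_BY_LABEL[record["label"]]
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return topic, payload

def stream(records, send, sleep, rng, max_delay_s=0.5):
    """Publish records one by one, pausing a random interval before each send."""
    for record in records:
        sleep(rng.uniform(0.0, max_delay_s))  # variable injection rate / latency
        topic, payload = to_message(record)
        send(topic, payload)

# Hypothetical records; the lambdas stand in for a producer and for time.sleep.
records = [
    {"label": "benign", "can_id": 291, "data": [0, 12, 0, 0, 55, 0, 0, 1]},
    {"label": "dos", "can_id": 0, "data": [0, 0, 0, 0, 0, 0, 0, 0]},
]
sent = []
stream(records, send=lambda t, p: sent.append((t, p)),
       sleep=lambda s: None, rng=random.Random(42))
```

Passing `send` and `sleep` as parameters keeps the routing logic testable without a running broker; swapping in a real producer and `time.sleep` would reproduce the randomized injection described in Section 5.2.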
5.4 Data Consumer and Real-Time Processing

The data consumer is implemented using Apache Spark Streaming, which continuously
retrieves and processes real-time traffic data from Kafka. This setup provides
low-latency and scalable intrusion detection for IoV security.
Spark Streaming Configuration for Consuming Kafka Data
Spark Streaming is configured to subscribe to the Kafka topics to which CAN-BUS and
network traffic are streamed in real time. The system:

• Connects to the Kafka topics that stream benign, DoS, and Spoofing GAS traffic.
• Reads incoming messages in a structured format, ensuring schema alignment with
the historical training data.
• Processes messages in micro-batches, allowing efficient real-time computation.
• Applies checkpointing and fault tolerance mechanisms to maintain stability in case
of failures.

Application of the XGBoost Model for Intrusion Detection
Once Spark Streaming receives the Kafka data, it applies real-time intrusion
classification using the pre-trained XGBoost model:
• Feature Extraction & Standardization: Incoming sensor values are normalized using
Z-score scaling, ensuring compatibility with the trained model.
• XGBoost Inference: The pre-trained XGBoost model receives the processed data and
classifies each record in real time as benign, DoS, or Spoofing GAS.
• Detection Output: The predicted attack labels are logged and forwarded to the
prevention system for automated mitigation.
By combining Spark Streaming for distributed processing with XGBoost for anomaly
detection, the system achieves high-speed, low-latency, and scalable real-time intrusion
detection for IoV security.

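The per-batch scoring steps listed above can be sketched in plain Python. The feature statistics and `stub_predict` are hypothetical placeholders: in the described system the statistics would come from the training set and the prediction call would be the pre-trained XGBoost model.

```python
def standardize(row, means, stds):
    """Apply training-time Z-score parameters to one feature vector."""
    return [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]

def classify_batch(batch, means, stds, predict):
    """Scale every row of a micro-batch and return (row, label) pairs."""
    scaled = [standardize(row, means, stds) for row in batch]
    labels = predict(scaled)
    return list(zip(batch, labels))

# Hypothetical training statistics for two features.
MEANS, STDS = [50.0, 2000.0], [10.0, 500.0]

def stub_predict(rows):
    """Stand-in for the trained model's predict(); flags an extreme first feature."""
    return ["dos" if abs(r[0]) > 3 else "benign" for r in rows]

results = classify_batch([[55.0, 2100.0], [120.0, 2000.0]], MEANS, STDS, stub_predict)
```

The design point worth noting is that the scaling parameters are fixed at training time. Recomputing the mean and standard deviation on each micro-batch would shift the feature distribution the model sees and could mask exactly the anomalies it is meant to detect.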
5.5 Prevention Mechanisms

The prevention mechanisms in our IDPS act in real time against
the cyber threats identified in the IoV. Based on the classification results from the
XGBoost model, the system applies automated responses to mitigate detected attacks,
minimizing disruption to vehicular communication and safety.
DoS Attack Mitigation:
DoS attacks flood the V2I communication channel with large amounts of traffic,
thereby rendering it incapable of functioning properly. To counteract
this:
• The system checks for anomalies and high-traffic patterns in the incoming data
streams to identify DoS attacks.
• If an attack is detected, the system prevents the malicious IP address from sending
traffic to the V2I infrastructure, allowing communication to be restored.
• The detected attack and the corresponding preventive action are logged for further
analysis.

Spoofing GAS Attack Mitigation:
Spoofing GAS attacks tamper with the gas sensor outputs with the
intention of misdirecting the vehicle's control unit. To mitigate this:
• The system isolates and quarantines the affected vehicle to prevent compromised
sensor data from influencing critical decision-making.
• The system continuously monitors vehicle responses to prevent further manipulation.
• An immediate alert is sent to operators and relevant authorities, notifying them of
the spoofed sensor data.
These mitigation strategies ensure that threats, once identified, are neutralized
right away to protect the security and reliability of IoV networks.

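The mapping from a predicted label to a response can be sketched as a small dispatcher. The event fields (`src_ip`, `vehicle_id`), the label strings, and the in-memory state are assumptions made for illustration; an actual deployment would enforce the block and the quarantine through firewall rules or network controls rather than Python sets.

```python
from dataclasses import dataclass, field

@dataclass
class PreventionState:
    """Minimal in-memory record of the actions taken so far."""
    blocked_ips: set = field(default_factory=set)
    quarantined_vehicles: set = field(default_factory=set)
    log: list = field(default_factory=list)

def respond(state, event):
    """Apply the response for one classified event and return the action taken."""
    label = event["label"]
    if label == "dos":
        state.blocked_ips.add(event["src_ip"])
        action = "blocked"
    elif label == "spoofing_gas":
        state.quarantined_vehicles.add(event["vehicle_id"])
        action = "quarantined"
    else:
        action = "none"
    state.log.append((label, action))  # every decision is logged for later analysis
    return action

state = PreventionState()
actions = [respond(state, e) for e in [
    {"label": "benign"},
    {"label": "dos", "src_ip": "10.0.0.7"},
    {"label": "spoofing_gas", "vehicle_id": "V-42"},
]]
```

The three possible actions correspond to the categories later reported on the dashboard: no action, blocked, and quarantined.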
6 Processed Results and Discussion

Following the system design presented in Section 5, this section reports the real-time
performance results of the proposed IDPS.
The results are shown as charts and tables in a real-time monitoring dashboard
that shows which intrusions are detected, how well the system is functioning,
and what preventive measures have been taken by the IDPS. The dashboard is
automatically updated every five seconds to show the latest security events. The following
is a detailed explanation of each visualization and its significance.

6.1 Attack Detection and Classification

The Attack Distribution Chart (Fig. 5) illustrates the shares of benign and
malicious traffic samples processed by the system. For example, benign traffic accounts
for 54.8% of the total data, DoS attacks make up 25.2%, and Spoofing GAS
attacks represent 20% of all traffic.
This real-time classification allows security analysts to monitor the distribution of
network activity and identify trends in cyber threats.

6.2 Prevention Effectiveness

The Prevention Effectiveness Pie Chart displays the proportion of actions taken for
the different types of detected attacks. For example, in Fig. 6, 54.8% of network traffic
required no action, as it was classified as benign; 25.1% of traffic was blocked due to
detected DoS attacks, preventing malicious actors from overloading the system; and
20% of traffic was quarantined due to Spoofing GAS attacks, isolating compromised
devices so that spoofed sensor data cannot propagate.
The chart confirms that the IDPS is actively mitigating detected threats in real
time, blocking suspicious activity before it can escalate into a security breach.

Fig. 5 Attack Distribution Chart

Fig. 6 Prevention Effectiveness Chart

6.3 Model Performance Metrics

The XGBoost performance metrics chart (Fig. 7) summarizes the real-time classification
results of our deployed IDPS:
Accuracy = 64.3%, Precision = 91.9%, Recall = 98.5%, F1-Score = 95.1%, and
Average Inference Latency ≈ 0.49 ms.
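As a quick consistency check, the reported F1-Score follows from the reported precision P and recall R through the harmonic-mean definition given in Section 4.6:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.919 \times 0.985}{0.919 + 0.985}
    = \frac{1.81043}{1.904}
    \approx 0.951
```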

Fig. 7 XGBoost Real-Time Performance Metrics

While the reported accuracy appears low compared to the offline evaluation (where
XGBoost exceeded 99%), this discrepancy is expected given the nature of real-time
streaming conditions. Unlike the offline setting, which used balanced test datasets,
the real-time stream reflects the class imbalance typical of real-world IoV
environments, where benign traffic significantly outweighs attack instances. As a result,
accuracy becomes a less informative metric, particularly in the presence of many false
positives.
Instead, high recall (98.5%) is the most critical indicator in this security-sensitive
context, since it means that nearly all attacks are detected. The F1-Score of 95.1%
indicates a strong balance between recall and precision: the model detects intrusions
effectively while keeping false negatives low. Although false positives inflate
the alert volume and contribute to the lower accuracy (64.3%), they do not compromise
security. Nonetheless, reducing false alarms is essential to maintain operational
efficiency and minimize unnecessary system interventions.
These real-time results indicate that the proposed system remains robust and
responsive under dynamic, imbalanced traffic, supporting its deployment in realistic
IoV scenarios.

6.4 Message Processing Rate Over Time

The Messages Over Time Chart (Fig. 8) tracks how many messages the Spark consumer
processes per batch.
The system processes about 50 messages per batch, with only minimal variation
attributable to network congestion or changes in the rate of incoming
traffic.

Fig. 8 Messages Over Time

The dashboard updates every 5 seconds to keep users informed of the message
traffic and possible bottlenecks in the processing.
This consistency indicates that the IDPS is scalable and can operate under continuous,
high-speed network traffic without significant slowdown.

6.5 Kafka Lag Monitoring

The Kafka lag gauge in Fig. 9 indicates the latency of the system by showing the number of
unprocessed records in the consumer queue.
Kafka lag is the number of messages produced to Kafka minus the
number of messages consumed by the Spark Streaming application.
In this case, the lag is 8 records, which means that the Spark consumer is consuming
messages in near real time with almost no delay.
This metric is continuously updated to reflect real-time streaming efficiency
so that administrators can monitor the health of the system and take the necessary action
if the lag begins to increase significantly. A high Kafka lag could
indicate a processing bottleneck that delays detecting
and responding to security threats.
Because the system maintained low lag throughout the experiment, attacks were
detected and prevented in a timely manner, supporting the system's suitability
for real-time intrusion detection.

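The lag definition above (messages produced minus messages consumed) can be expressed per partition and summed. The offsets below are hypothetical values chosen to reproduce a lag of 8 records; in a real deployment they would be read from the broker and the consumer group.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag: produced-but-unconsumed records summed over partitions."""
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )

# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1200, 1: 1180, 2: 1215}
committed = {0: 1197, 1: 1178, 2: 1212}
lag = consumer_lag(end_offsets, committed)  # 3 + 2 + 3 records
```

A lag that stays small and roughly constant means the consumer keeps pace with the producer, whereas a steadily growing value signals that batches take longer to process than they take to arrive.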
Fig. 9 Kafka Lag Gauge

6.6 System Resource Utilization

The Resource Utilization Chart (Fig. 10) presents CPU, memory, and network usage
statistics:
CPU Usage: 9.6%
Memory Usage: 51.7%
Network Bandwidth Consumption: 37.25%

Fig. 10 Resource Utilization Chart

These results suggest that the system operates efficiently with minimal
computational overhead. The relatively low CPU utilization indicates that the IDPS can be
deployed without placing excessive demands on server resources.

6.7 Discussion

The real-time IDPS dashboard gives a comprehensive view of the attacks that have been
detected, the countermeasures that have been taken, and the health of
the system, with an update interval of 5 seconds so that security operators
have the most current information at their disposal. This real-time responsiveness is
vital in cybersecurity, where the early detection and mitigation of incidents
can help prevent security breaches.
The system is highly effective in identifying and counteracting DoS and Spoofing
GAS attacks and has a high recall rate of 98.5%. The preventive measures, which
include blocking DoS attack sources and isolating devices affected by
Spoofing GAS, show that the system is not only detecting threats but also preventing
them. However, some aspects can be improved:
• Improving Classification Quality: As mentioned earlier, the recall
is high, but the overall accuracy of about 64% shows that more work is needed
to reduce the number of false positives and improve precision.
• Controlling Kafka Lag: The system currently has low lag (8 records), but
improving the Spark-Kafka integration can enhance the real-time processing and
robustness of the system.
• Optimizing Resource Utilization: The system handles the streaming data
with low CPU utilization (9.6%), but the memory utilization (51.7%) indicates that
there is room for further improvement in order to increase scalability.
In general, the IDPS achieves real-time security analysis, intrusion detection, and
prevention in a BD environment. It offers a practical and effective way of protecting
high-speed networks and can be extended to identify other types of attacks as
well.

34 7 Conclusion and Future Work
35
36 The proposed IDPS in real-time effectively mitigates and detects DoS and Spoof-
37 ing GAS attacks in IoV environments using XGBoost for classification and Kafka-
38 Spark Streaming for real-time data processing. The system achieves a high recall of
39 98.5% and has the ability to block any malicious activity and quarantine any infected
40 devices on its own. In addition, the real-time dashboard gives the user an ongoing view
41 of the security events, attack distributions, and system health metrics, which can help
42 security administrators to act quickly when needed.
Future work will investigate the integration of deep learning models such as CNNs
and LSTMs for improved feature extraction and sequence-based anomaly detection, with
the aim of better detecting evolving cyber threats. The system will also be extended
to identify new cyber threats affecting IoV, V2V, and V2I communications, broadening
its coverage of new attack sources. The Kafka-Spark Streaming pipeline will be further
optimized to reduce Kafka lag and enhance real-time processing. In addition,
multi-attack scenario detection will enable the system to handle several types of
coordinated cyber attacks and make it more resistant to complex adversarial threats.
These improvements are expected to increase the scalability, adaptability, and overall
efficiency of the system in protecting real-time IoV networks.

Declarations

• Competing Interests: On behalf of all authors, the corresponding author states
that there is no conflict of interest.
• Funding Information: Not Applicable.
• Author Contributions: Ikram Hamdaoui conducted the research, performed
experiments, and wrote the manuscript. Khalid El Makkaoui and Zakaria El Allali
supervised the work and provided review and feedback.
• Data Availability Statement: Publicly available datasets were used:
CICIoV2024, CIC-IDS-2017, and BoTNeTIoT.
• Research Involving Human and/or Animals: Not Applicable.
• Informed Consent: Not Applicable.