
Summer Training Report

on

Intrusion Detection System


at
Dept. of Computer Science

Submitted by
Om Rajani
2019UCP1339

Under the Mentorship of


Dr. Jyoti Grover
Assistant Professor
MNIT, Jaipur

MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY


JAIPUR -302017 (RAJASTHAN) INDIA

Certificate

[Certificate image]

Declaration

I, Om Rajani, bearing institute ID 2019UCP1339, hereby declare that this summer training work, presented in this report "Intrusion Detection System" and submitted to the Department of Computer Science & Engineering, Malaviya National Institute of Technology, Jaipur, is an authentic record of the work carried out by me at MNIT, Jaipur under the mentorship of Dr. Jyoti Grover, Assistant Professor, from 1 Oct. 2022 to 17 Nov. 2022.

The content of this report, in full or in parts, has not been reproduced from any existing work of any other person and has not been submitted anywhere else at any time for summer training.

Om Rajani
2019UCP1339
29 November 2022

Acknowledgement

I, Om Rajani, would like to express my gratitude to Dr. Jyoti Grover for giving me this opportunity to learn.

I would like to acknowledge my sub-mentor, Ms. Ritu Rai, a PhD scholar, who helped me a lot throughout the internship.
I would like to express gratitude to my parents who bestowed their confidence and belief in
me. I wish to express my deep sense of gratitude to my colleagues at MNIT, Jaipur for their
immense support.

At last, I would also like to thank my institute, MNIT Jaipur, and the Department of Computer Science and Engineering for providing me the knowledge and opportunity to reach this level.

Sincerely,
Om Rajani
Date: 24 November 2022
2019UCP1339
MNIT, Jaipur

Contents
Certificate i

Declaration ii

Acknowledgement iii

Contents iv

List of Figures vi

1 Introduction 1
1.1 Intrusion in CAN Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 CAN bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Technologies Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 CAN Bus 4
2.1 Electronic Control Unit(ECU) . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Working of the CAN bus . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 CAN Bus Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 Dataset 7
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Data attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Types of attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3.1 Denial of Service attack . . . . . . . . . . . . . . . . . . . . . . . 8
3.3.2 Fuzzy attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3.3 RPM Spoofing attack . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3.4 Gear Spoofing attack . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4 Summary of dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 Technologies used 11
4.1 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1.2 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1.3 Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1.4 Tensorflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Anaconda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5 Implementation 14
5.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1.1 File-type conversion . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1.2 Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1.3 Removal of unnecessary data . . . . . . . . . . . . . . . . . . . . . 14
5.1.4 Conversion from hexadecimal to decimal . . . . . . . . . . . . . . 14
5.1.5 Further Pre-processing for Multi-class Classification . . . . . . . . 15
5.2 Binary Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.2.1 Introduction: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.3 Multi Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3.3 Hyper-parameter Optimisation . . . . . . . . . . . . . . . . . . . . 28
5.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Conclusions and Outcomes 29

7 Bibliography 30

List of Figures
1.1 The IDS-protected vehicle architecture. . . . . . . . . . . . . . . . . . . . 2

2.1 Conceptual diagram of the message priority and inter-frame space (IFS). . . 5
2.2 Structure of a CAN message frame. . . . . . . . . . . . . . . . . . . . . . 6

3.1 Dataset attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7


3.2 DoS attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Fuzzy attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Malfunction attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4.1 python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

5.1 Sample of transformed image for each class . . . . . . . . . . . . . . . . . 16


5.2 Binary class classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.3 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.5 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.6 Accuracy Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.7 The proposed optimized CNN-based IDS framework. . . . . . . . . . . . . 22
5.8 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . 24
5.9 Accuracy-loss v/s No. of batches . . . . . . . . . . . . . . . . . . . . . . . 24
5.10 Accuracy-loss v/s No. of epochs . . . . . . . . . . . . . . . . . . . . . . . 24
5.11 VGG-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.12 Accuracy-loss v/s No. of batches . . . . . . . . . . . . . . . . . . . . . . . 26
5.13 Accuracy-loss v/s No. of epochs . . . . . . . . . . . . . . . . . . . . . . . 26
5.14 Inception V3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.15 Accuracy-loss v/s No. of batches . . . . . . . . . . . . . . . . . . . . . . . 27
5.16 Accuracy-loss v/s No. of epochs . . . . . . . . . . . . . . . . . . . . . . . 27
5.17 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Chapter 1
Introduction

1.1 Intrusion in CAN Bus

Advancement in technology has brought about the concept of intelligent vehicles, which are considered more efficient and safer for users. Intelligent vehicles tend to be connected to other vehicles, to roadside infrastructure such as the traffic management system, and to the internet, making them part of the Internet of Things. However, such high levels of connectivity mean that intelligent vehicles are at risk of cyber-attacks, which might interfere with different aspects of the vehicle, such as its communication systems, endangering the security and privacy of the vehicle as well as putting the lives of its passengers at risk.
Because the vehicle is connected to external systems such as GPS, various attacks can be injected into the intra-vehicular network, i.e. the CAN bus, compromising the system. This is called intrusion in the CAN bus. Such an attack can also be injected through a connection to the OBD-II port.

Connected vehicle technology has always been aimed at solving the challenges that are occasionally experienced with intelligent transport systems. An Intelligent Transport System usually allows intelligent vehicles to communicate with the roadside infrastructure, other vehicles on the road, and other road users. The communication system of an intelligent vehicle is usually referred to as Vehicle-to-Everything (V2X) or as a VANET, an abbreviation for Vehicular Ad hoc Network.
An ordinary VANET communication system is responsible for three main types of communication for a vehicle to be considered a smart automobile. Those types of communication are Vehicle-to-Infrastructure (V2I), Vehicle-to-Vehicle (V2V), and Vehicle-to-Pedestrian (V2P). V2I involves the vehicle communicating with roadside infrastructure, such as location sensors and other traffic monitoring systems. V2V involves a smart automobile sharing information with other vehicles on the road. V2P involves communication between the vehicle and pedestrians on the road.

The typical attack scenario and the architecture of an IDS-protected vehicle are shown in Fig. 1.1. There have been numerous concerns about the privacy and security of intelligent vehicles and intelligent transport systems, with various attacker models for smart vehicles having been demonstrated. Among these concerns are cyber-security threats on the VANET communication system, where cyber attackers may exploit any potential weaknesses within the system to jam and spoof its signals. This would affect the whole V2X system through deceptive signaling and delaying of the signal, so that the transmitted message is distorted and does not achieve its intended purpose. Other security threats faced by smart automobiles may include hacking through the internet, as connected vehicles have internet access, or physical access to the vehicle's intelligence system.

Figure 1.1: The IDS-protected vehicle architecture.

For example, in 2015 the security experts Charlie Miller and Chris Valasek wirelessly hacked the intelligence system of a Jeep Cherokee. Miller and Valasek demonstrated that the Jeep Cherokee's intelligence system had security vulnerabilities when they compromised its entertainment system, steering, brakes, and air conditioning while the car was being driven.
Another example is the Nissan Leaf, whose companion application was exploited by hackers using the vehicle identification number that is usually printed on the vehicle's windows. This vulnerability allowed the hackers to take control of the heating and air conditioning system.

Another example is that an attacker can change the gear of the car by injecting messages with the specific CAN identifier related to the gear function. Fabricated messages can be injected directly through the on-board diagnostics (OBD-II) port, as well as through the infotainment system or the wireless communication system. In recent years, various vulnerable features have been discovered in vehicle systems, such as the electric window lift, warning lights, airbag, and tire pressure monitoring system (TPMS).

1.2 CAN bus

The use of the CAN bus solves the problem of managing the increasing number of sensors and control devices in modern cars, which previously needed point-to-point connections. CAN ensures low design and implementation costs in concrete applications and in hostile environments (i.e., high noise and disturbances in the communications).

The CAN bus was developed by BOSCH as a multi-master, message-broadcast system that specifies a maximum signaling rate of 1 megabit per second (Mbps). Unlike a traditional network such as USB or Ethernet, CAN does not send large blocks of data point-to-point from node A to node B under the supervision of a central bus master. In a CAN network, many short messages, such as temperature or RPM, are broadcast to the entire network, which provides for data consistency in every node of the system. The data in a frame is transmitted serially, but in such a way that if more than one device transmits at the same time, the highest-priority device can continue while the others back off, thanks to bit-wise arbitration on the message identifier. Frames are received by all devices, including the transmitting device.

1.3 Objective

The purpose of this work is to develop an IDS (Intrusion Detection System) that can detect various types of attacks in intra-vehicle networks (IVNs). Cyber-attackers can inject attacks into IVNs through the On-Board Diagnostics II (OBD-II) interface and affect the normal functioning of the vehicle. Thus, the IDS is to detect attacks injected into the IVN from the collection of data frames transmitted by ECUs on the shared channel called the CAN bus.

1.4 Technologies Used

1. Python Programming Language

2. Anaconda

3. Google Colab

Chapter 2
CAN Bus

2.1 Electronic Control Unit (ECU)

Modern cars include dozens of independent processing units, the so-called Electronic Control Units (ECUs), that are interconnected by different buses or networks. ECUs execute many safety controls, such as skid detection, crash prediction or anti-lock braking. In general, the whole control system, which has changed considerably over the years, places soft or hard real-time requirements on network components in order to guarantee the correct operation and sufficient quality levels (i.e., performance, reliability, safety) of the whole system. The most widely used communication standard is the Controller Area Network (CAN), also known as CAN bus.

2.2 Working of the CAN bus

In a CAN bus, several messages containing information, such as RPM, steering angle, and current speed, are broadcast to the entire network, thereby maintaining the consistency of the entire system. These messages are identified by a CAN ID. A CAN message has a unique 11- or 29-bit identifier. The base ID field contains the 11-bit ID, and the extended ID field contains the remaining 18 bits. CAN 2.0A devices use only the base ID, whereas CAN 2.0B devices use both ID fields. ECUs can determine whether a received message is of interest based on the CAN ID; therefore, they can filter out messages that are of no interest.

The CAN ID is also used to determine the ECU that has priority in the arbitration phase. As the CAN bus is a broadcast system, there may exist a situation where multiple nodes attempt to send messages at the same time, resulting in a collision. In the arbitration phase, ECUs send their CAN IDs bit by bit. The ECU that has more dominant bits, i.e., a CAN ID with more leading zeros, wins the bus and obtains the chance to transmit its message. In other words, an ECU stops transmitting a message when it detects a dominant (zero) bit on the CAN bus while the current bit of its own ID is one. In addition, there is a mandatory gap between consecutive messages, known as the inter-frame space (IFS). Each message is separated from the preceding frame by an IFS that consists of at least three recessive bits. If a dominant bit is detected following the consecutive recessive bits, it is regarded as the SOF of the next message. Furthermore, for system consistency, an ECU broadcasts messages at regular intervals even if the data values have not changed. This gives each ECU its own message transmission cycle, as shown in Fig. 2.1.
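The arbitration rule described above, where dominant bits win so the numerically lowest CAN ID gets the bus, can be illustrated with a small Python toy model (an illustration only, not code from this training work):

```python
# Toy model of CAN bit-wise arbitration: IDs are compared bit by bit,
# MSB first; a dominant bit (0) beats a recessive bit (1), so the
# numerically lowest CAN ID wins the bus.

def arbitrate(can_ids, id_bits=11):
    """Return the CAN ID that wins arbitration among simultaneous senders."""
    contenders = list(can_ids)
    for bit in range(id_bits - 1, -1, -1):          # start from the MSB
        sent = [(cid >> bit) & 1 for cid in contenders]
        if 0 in sent:
            # ECUs that sent a recessive bit (1) see a dominant bit (0)
            # on the bus and back off until the next cycle.
            contenders = [c for c, b in zip(contenders, sent) if b == 0]
    return contenders[0]

print(hex(arbitrate([0x130, 0x545, 0x000])))        # 0x000 wins (prints 0x0)
```

This matches the priority example in Section 3.3.1, where ID 0x130 has priority over 0x545.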

Figure 2.1: Conceptual diagram of the message priority and inter-frame space (IFS).

2.3 CAN Bus Frame

The meaning of each field in a CAN frame (shown in Fig. 2.2) is described below:

1. SOF (1 bit): The start of frame (SOF) bit denotes the start of a new message. This
single bit is used for synchronizing all nodes on the bus.

2. Base identifier (11 bits): This is the first part of the identifier, which is used in both the
standard and extended frames.

3. SRR (1 bit): The substitute remote request (SRR) bit is fixed to one and used in the
extended frame.

4. IDE (1 bit): The identifier extension bit (IDE) is fixed to one and used in the extended
frame.

5. Extended identifier (18 bits): This is the second part of the identifier, which is used
only in the extended frame.

6. RTR (1 bit): The remote transmission request (RTR) is used when specific information is required from another node. The identifier specifies the node that has to respond.

7. Reserved bits (2 bits): These bits are reserved for future use.

8. DLC (4 bits): The data length code (DLC) represents the number of bytes of data.

9. Data (0-64 bits): The actual payload data, which can be up to 64 bits (8 bytes).

10. CRC (16 bits): The cyclic redundancy check (CRC) contains the checksum of the
previous data for error detection.

11. ACK (2 bits): The transmitter sends a recessive bit (1); others change this bit to a
dominant bit (0) when there is no error in the received message.

12. EOF (7 bits): The end of frame (EOF) denotes the end of a current CAN message.
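For illustration, the fixed-width fields listed above can be sliced out of a frame represented as a bit string. This is a simplified sketch (it assumes the full 64-bit data field and ignores bit stuffing), not a real CAN decoder:

```python
# Field layout of an extended CAN frame, following the sizes listed above.
FIELDS = [("SOF", 1), ("BaseID", 11), ("SRR", 1), ("IDE", 1),
          ("ExtID", 18), ("RTR", 1), ("Reserved", 2), ("DLC", 4),
          ("Data", 64), ("CRC", 16), ("ACK", 2), ("EOF", 7)]

def parse_frame(bits):
    """Split a string of '0'/'1' characters into named CAN frame fields."""
    fields, pos = {}, 0
    for name, width in FIELDS:
        fields[name] = int(bits[pos:pos + width], 2)
        pos += width
    return fields

# A frame with base ID 0x010, extended ID 0, DLC = 8, all-zero payload.
frame = parse_frame("0" + "00000010000" + "11" + "0" * 18 + "0" + "00" +
                    "1000" + "0" * 64 + "0" * 16 + "10" + "1111111")
print(frame["BaseID"], frame["DLC"])                # prints: 16 8
```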

Figure 2.2: Structure of a CAN message frame.

Chapter 3
Dataset

3.1 Introduction

This dataset, named the Car Hacking Dataset, is taken from the HCRL Lab.
In the present study, I focused on the following four attack scenarios that can immediately and severely impair in-vehicle functions or deepen the intensity of an attack and the degree of damage: Denial of Service (DoS), Fuzzy, Gear Spoofing and RPM Spoofing. To substantiate the four attack scenarios, two different datasets were produced. One of the datasets contained normal driving data without an attack. The other dataset included the abnormal driving data that occurred when an attack was performed. In particular, attack data was generated in which attack packets were injected for five seconds every 20 seconds.

3.2 Data attributes

It contains the attributes Timestamp, CAN ID, DLC, DATA[0], DATA[1], DATA[2], DATA[3], DATA[4], DATA[5], DATA[6], DATA[7] and Flag.

Figure 3.1: Dataset attributes

1. Timestamp: recorded time (s)

2. CAN ID: identifier of the CAN message in HEX (e.g., 018F)

3. DLC: number of data bytes, from 0 to 8

4. DATA[0]-DATA[7]: data values (bytes)

5. Flag: T or R; T represents an injected message, while R represents a normal message.

3.3 Types of attack

There are 4 types of attack and one normal message class.

3.3.1 Denial of Service attack


As mentioned in Chapter 2, CAN is a multi-master network, and the collision of CAN messages sent from multiple ECU nodes is arbitrated depending on their priority. A CAN message sent from an ECU node does not contain the address of the sender or receiver. Instead, when CAN messages from several different sender ECU nodes are transmitted simultaneously to a receiver ECU node, the values of their CAN IDs are compared to determine the priority of the CAN message to be accepted first. The lower the value of a CAN ID, the higher its priority (e.g., the priority of 0x130 is higher than that of 0x545).
Low-priority CAN messages are automatically re-transmitted in the following CAN bus cycle (the broadcast principle). The flooding attack allows an ECU node to occupy many of the resources allocated to the CAN bus by maintaining a dominant status on the bus. This attack can limit communications among ECU nodes and disrupt normal driving. We conducted the flooding attack by injecting a large number of messages with the CAN ID set to 0x000 into the vehicle network. In other words, DoS attack samples are high-frequency empty messages, which produce pure black patterns in the transformed images.

Figure 3.2: DoS attack

3.3.2 Fuzzy attack
Fuzzing is a software-testing technique used to find vulnerabilities by entering unexpected values or random data into computer programs. In the second attack scenario, the attacker performs indiscriminate attacks by iteratively injecting random CAN packets.
For the fuzzy attack, random numbers were generated with the "randint" function, which generates random integers within a specified range. Messages were sent to the vehicle once every 0.0005 seconds. This process was applied to both the ID field and the Data field. The randomly generated CAN IDs ranged from 0x000 to 0x7FF and included both CAN IDs originally extracted from the vehicle and CAN IDs that were not.
We determined that the reaction to a fuzzy attack using random numbers restricted to the CAN IDs extractable from the vehicle network is more immediate than the reaction using fully random numbers. When the attack was executed on a KIA Soul, a short beeping sound occurred repeatedly and the heater turned on. A navigation system error, in which the rear camera turned on regardless of the drive gear status, occurred as well. Thus, the feature patterns of fuzzy-attack images are more random than those of normal images.
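The random-message generation described above can be sketched as follows; the actual injection code is not part of this report, so this is only a hypothetical reconstruction using Python's randint:

```python
import random

# Hypothetical sketch of fuzzy-attack message generation: a random
# 11-bit CAN ID in 0x000-0x7FF and 8 random data bytes, drawn with
# randint; in the attack, one such message was sent every 0.0005 s.

def random_fuzzy_message():
    can_id = random.randint(0x000, 0x7FF)            # whole 11-bit ID range
    data = [random.randint(0x00, 0xFF) for _ in range(8)]
    return can_id, data

can_id, data = random_fuzzy_message()
assert 0x000 <= can_id <= 0x7FF and len(data) == 8
```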

Figure 3.3: Fuzzy attack

3.3.3 RPM Spoofing attack


The spoofing attack enabled us to deceive the original ECU and change the RPM (revolutions per minute) gauge and drive gear on the instrument panel. RPM spoofing attacks are launched by injecting messages with certain CAN IDs and packets to masquerade as legitimate users, so their images also have certain feature patterns.

Figure 3.4: Malfunction attack

3.3.4 Gear Spoofing attack


The spoofing attack enabled us to deceive the original ECU and change the gear information and drive gear on the instrument panel. Gear spoofing attacks are launched by injecting messages with certain CAN IDs and packets to masquerade as legitimate users, so their images also have certain feature patterns.

3.4 Summary of dataset

Given below are the numbers of attacked and normal messages in each dataset.

Dataset summary

Attack          Total messages   Normal messages   Injected messages
DoS             3665771          3078250           587521
Fuzzy           3838860          3347013           491847
Gear Spoofing   4443142          3845890           597252
RPM Spoofing    4621702          3966805           654897

Chapter 4
Technologies used

4.1 Python

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC
programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released
in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage
collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a
major revision that is not completely backward-compatible with earlier versions. Python 2
was discontinued with version 2.7.18 in 2020.

Python consistently ranks as one of the most popular programming languages.


A shell is a special user program which provides an interface for the user to use operating system services. The shell accepts human-readable commands from the user and converts them into something the kernel can understand. It is a command-language interpreter that executes commands read from input devices such as keyboards or from files. The shell starts when the user logs in or opens a terminal.

Figure 4.1: Python

4.1.1 NumPy
NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.

4.1.2 Pandas
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. Its name is also a play on the phrase "Python data analysis" itself. Wes McKinney started building what would become pandas at AQR Capital while he was a researcher there from 2007 to 2010.

4.1.3 Keras
Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.
Up until version 2.3, Keras supported multiple backends, including TensorFlow, Microsoft Cognitive Toolkit, Theano, and PlaidML. As of version 2.4, only TensorFlow is supported. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a Google engineer. Chollet is also the author of the Xception deep neural network model.

4.1.4 Tensorflow
TensorFlow is a free and open-source software library for machine learning and artificial
intelligence. It can be used across a range of tasks but has a particular focus on training and
inference of deep neural networks.
TensorFlow was developed by the Google Brain team for internal Google use in research
and production. The initial version was released under the Apache License 2.0 in 2015.
Google released the updated version of TensorFlow, named TensorFlow 2.0, in September
2019.

4.2 Anaconda

Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. The distribution includes data-science packages suitable for Windows, Linux, and macOS. It is developed and maintained by Anaconda, Inc., which was founded by Peter Wang and Travis Oliphant in 2012. As an Anaconda, Inc. product, it is also known as Anaconda Distribution or Anaconda Individual Edition, while other products from the company are Anaconda Team Edition and Anaconda Enterprise Edition, both of which are not free.
Package versions in Anaconda are managed by the package management system conda. This package manager was spun out as a separate open-source package as it ended up being useful on its own and for things other than Python. There is also a small bootstrap version of Anaconda called Miniconda, which includes only conda, Python, the packages they depend on, and a small number of other packages.

4.3 Google Colab

Colaboratory, or "Colab" for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources, including GPUs.

Chapter 5
Implementation

5.1 Pre-processing

The dataset is given in text form, so we need to convert it into numeric form. For this, the following steps were implemented.

5.1.1 File-type conversion


Python reads the data from a text file and creates a dataframe with rows equal to the number of lines present in the text file and columns equal to the number of fields present in a single line. Once the dataframe is created, we store it in CSV file format using the

DataFrame.to_csv() method.
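A minimal sketch of this step is shown below. The two sample lines and file names are stand-ins for the actual raw dataset files, whose exact layout may differ:

```python
import pandas as pd

# Tiny stand-in for a raw CAN log: one whitespace-separated line per
# message (timestamp, CAN ID, DLC, 8 data bytes, flag).
with open("sample_can.txt", "w") as f:
    f.write("1478198376.389427 0316 8 05 21 68 09 21 21 00 6f R\n")
    f.write("1478198376.389636 018f 8 fe 5b 00 00 00 3c 00 00 R\n")

# One dataframe row per line, one column per field in the line.
df = pd.read_csv("sample_can.txt", sep=r"\s+", header=None)
print(df.shape)                      # prints: (2, 12)

# Store the dataframe in CSV format with DataFrame.to_csv().
df.to_csv("sample_can.csv", index=False)
```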

5.1.2 Labeling
Add column labels at the header:

df.columns = [0,1,2,3,4,5,6,7,8,9,10]

5.1.3 Removal of unnecessary data


Drop all rows containing empty data using:

dataset = df.dropna()

The dropna() method removes the rows that contain NULL values. It returns a new DataFrame object unless the inplace parameter is set to True, in which case dropna() removes the rows in the original DataFrame instead.

5.1.4 Conversion from hexadecimal to decimal


The CAN ID and data fields (DATA[0]-DATA[7]) are given in hexadecimal form, so they must be converted into decimal form.
To convert a given hexadecimal number to a decimal number in Python, call the built-in function int() and pass the hex string and base=16 as arguments. int() converts the given value using the specified base and returns the decimal number.

for i in dataset.columns[:-1]:
    print(i)
    if dataset[i].dtype == 'O':
        dataset[i] = dataset[i].apply(int, base=16)

5.1.5 Further Pre-processing for Multi-class Classification


Extract data from each dataset

We have three datasets containing attacked and non-attacked combinations of data frames
transmitted on CAN Bus at different timestamps. All rows having attacked combination
of attributes in each dataset belong to a one of the three attacks i.e. Fuzzy, Flooding and
Malfunction.
Thus, we combined all the data of three datasets by retaining their classification into groups
of Normal, Fuzzy, Flooding and Malfunction.
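The combination step can be sketched as follows (the column and class names are assumptions for illustration; the real inputs are the pre-processed frames described above):

```python
import pandas as pd

# Sketch: give injected rows ("T" flag) the attack label of their source
# dataset, keep normal rows ("R") as "Normal", then concatenate.
def label_and_combine(datasets):
    """datasets: list of (DataFrame with a 'Flag' column, attack name)."""
    parts = []
    for df, attack in datasets:
        df = df.copy()
        df["Class"] = df["Flag"].map({"T": attack, "R": "Normal"})
        parts.append(df)
    return pd.concat(parts, ignore_index=True)

fuzzy = pd.DataFrame({"Flag": ["T", "R"]})
flooding = pd.DataFrame({"Flag": ["T", "T"]})
combined = label_and_combine([(fuzzy, "Fuzzy"), (flooding, "Flooding")])
print(sorted(combined["Class"].unique()))   # ['Flooding', 'Fuzzy', 'Normal']
```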

Data Normalisation

As CNN models work better on image sets and vehicular network traffic datasets are usually tabular, the original network data should be transformed into image form. The data transformation process starts with data normalization. Since the pixel values of images range from 0 to 255, the network data should also be normalized into the scale of 0-255.

Among the normalization techniques, min-max and quantile normalization are the two commonly used methods that can convert data values to the same range. As min-max normalization does not handle outliers well and may cause most data samples to have extremely small values, quantile normalization is used in the proposed framework. The quantile normalization method transforms the feature distribution to a normal distribution and re-calculates all the feature values based on that distribution. Therefore, the majority of variable values are close to the median, which is effective in handling outliers.
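One way to implement this step is with scikit-learn's QuantileTransformer; the report does not name the library used, so this is an assumed sketch:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Stand-in feature matrix with a skewed, outlier-heavy distribution.
rng = np.random.default_rng(0)
X = rng.exponential(scale=100.0, size=(1000, 9))

# Quantile-normalize each feature, then stretch the result onto the
# 0-255 pixel range expected for image construction.
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
X_pixels = (qt.fit_transform(X) * 255).astype(np.uint8)

assert X_pixels.min() >= 0 and X_pixels.max() <= 255
```

A uniform output distribution is used here so values map directly onto 0-255; output_distribution="normal" followed by rescaling would match the normal-distribution description more literally.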

Image Transformation

The data samples are converted into chunks based on the timestamps and feature sizes of the network traffic dataset. The dataset has 9 important features (CAN ID and DATA[0]-DATA[7]); each chunk of 27 consecutive samples with 9 features (27 × 9 = 243 feature values in total) is transformed into an image of shape 9 × 9 × 3 [14]. Thus, each transformed image is a square color image with three channels (red, green, and blue).
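Assuming the chunking described in [14], the reshape can be sketched as follows (the exact sample/feature ordering inside each image may differ from the original implementation):

```python
import numpy as np

# Every 27 consecutive samples with 9 features (27 x 9 = 243 values)
# become one 9 x 9 x 3 color image; leftover samples are discarded.
def to_images(X):
    """X: (n_samples, 9) array of 0-255 values -> (n_images, 9, 9, 3)."""
    n_images = X.shape[0] // 27
    return X[: n_images * 27].reshape(n_images, 9, 9, 3).astype(np.uint8)

X = np.random.randint(0, 256, size=(100, 9))
images = to_images(X)
print(images.shape)                  # prints: (3, 9, 9, 3)
```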

Figure 5.1: Sample of transformed image for each class

5.2 Binary Classification

5.2.1 Introduction:
In machine learning, binary classification is a supervised learning task that categorizes new observations into one of two classes.
In this project, we need to predict whether a data sample belongs to the T class (attacked) or the R class (normal).

Figure 5.2: Binary class classification

5.2.2 Classification
Three classification techniques were used:

Decision Tree

Decision tree learning is a supervised learning approach used in statistics, data mining and
machine learning. In this formalism, a classification or regression decision tree is used as a
predictive model to draw conclusions about a set of observations.
Tree models where the target variable can take a discrete set of values are called clas-
sification trees; in these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. Decision trees where the target vari-
able can take continuous values (typically real numbers) are called regression trees.
Decision trees are among the most popular machine learning algorithms given their in-
telligibility and simplicity.
In decision analysis, a decision tree can be used to visually and explicitly represent deci-
sions and decision making. In data mining, a decision tree describes data (but the resulting
classification tree can be an input for decision making).
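The core of classification-tree learning, choosing the feature/threshold split that minimizes node impurity, can be sketched as follows. This is a toy illustration of the split criterion, not the exact training procedure used in the experiments:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class shares."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search for the (feature, threshold) pair that
    minimizes the weighted Gini impurity of the two child nodes."""
    best = (None, None, np.inf)
    n = len(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            right = ~left
            if left.all() or right.all():
                continue  # degenerate split, one side empty
            score = (left.sum() * gini(y[left]) +
                     right.sum() * gini(y[right])) / n
            if score < best[2]:
                best = (f, t, score)
    return best

# Toy data: feature 0 is noise, feature 1 perfectly separates the classes.
X = np.array([[5, 1], [9, 1], [3, 0], [7, 0]])
y = np.array([1, 1, 0, 0])
f, t, score = best_split(X, y)
print(f, t, score)  # 1 0 0.0
```

Growing a full tree repeats this search recursively on each child node until the leaves are pure or a stopping criterion is met.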

Figure 5.3: Decision Tree

Confusion matrices obtained as output:

Fuzzy Attack
              Pred. T   Pred. R
    Actual T   814284         6
    Actual R       17    123449

DoS Attack
              Pred. T   Pred. R
    Actual T   762207         0
    Actual R        0    146439

RPM Spoofing Attack
              Pred. T   Pred. R
    Actual T   981017         0
    Actual R        0    164040

Gear Spoofing Attack
              Pred. T   Pred. R
    Actual T   951796         0
    Actual R        0    148949
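The reported accuracies can be recomputed directly from these matrices. For example, for the fuzzy-attack matrix of the decision tree:

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy from a 2x2 confusion matrix: correct / total."""
    return (tp + tn) / (tp + fn + fp + tn)

# Fuzzy-attack matrix above: 814284 T samples correct, 6 misclassified
# as R, 17 R samples misclassified as T, 123449 R samples correct.
acc = accuracy(814284, 6, 17, 123449)
print(f"{acc:.5f}")  # 0.99998
```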

Random Forest

Random Forest is a popular machine learning algorithm that belongs to the supervised learning family. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
A Random Forest is a classifier that trains a number of decision trees on various subsets of the given dataset and aggregates their predictions to improve the predictive accuracy on that dataset.
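The two ingredients of this ensemble, bootstrap sampling of the dataset and majority voting over the trees, can be sketched as below. The vote values are made up purely for illustration:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a bootstrap sample (rows sampled with replacement) for one tree."""
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]

def majority_vote(per_tree_preds):
    """Combine the 0/1 class predictions of several trees by majority vote."""
    preds = np.asarray(per_tree_preds)        # shape: (n_trees, n_samples)
    return (preds.mean(axis=0) >= 0.5).astype(int)

# Three hypothetical trees vote on four samples (classes 0/1):
votes = [[1, 0, 1, 0],
         [1, 1, 1, 0],
         [0, 0, 1, 0]]
print(majority_vote(votes).tolist())  # [1, 0, 1, 0]
```

Because each tree sees a different bootstrap sample (and, in practice, a random feature subset at each split), the individual errors tend to be uncorrelated and are averaged away by the vote.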

Figure 5.4: Random Forest

Confusion matrices obtained as output:

Fuzzy Attack
              Pred. T   Pred. R
    Actual T   814288         2
    Actual R        1    123465

DoS Attack
              Pred. T   Pred. R
    Actual T   762207         0
    Actual R        0    146439

RPM Spoofing Attack
              Pred. T   Pred. R
    Actual T   981017         0
    Actual R        0    164040

Gear Spoofing Attack
              Pred. T   Pred. R
    Actual T   951796         0
    Actual R        0    148949

Logistic regression

Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: yes or no, 0 or 1, true or false, etc. However, instead of giving exact values of 0 and 1, it gives probabilistic values lying between 0 and 1.

Logistic regression is similar to linear regression except in how it is used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two extreme values (0 or 1). The curve of the logistic function indicates the likelihood of something, such as whether cells are cancerous or whether a mouse is obese based on its weight.

Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets. It can classify observations using different types of data and can easily determine the most effective variables for the classification. The image below shows the logistic function:

Figure 5.5: Logistic Regression
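A minimal sketch of logistic regression trained by gradient descent on the cross-entropy loss follows. The toy data are made up for illustration; this is not the library implementation used in the experiments:

```python
import numpy as np

def sigmoid(z):
    """The S-shaped logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)           # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)  # gradient of the mean loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data: class is 1 when the feature exceeds ~2.5.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
probs = sigmoid(X @ w + b)
print((probs >= 0.5).astype(int))  # [0 0 1 1]
```

Thresholding the probabilities at 0.5 turns the probabilistic output into the hard T/R labels used in the confusion matrices.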

Confusion matrices obtained as output:

Fuzzy Attack
              Pred. T   Pred. R
    Actual T   812849      1441
    Actual R    10844    112622

DoS Attack
              Pred. T   Pred. R
    Actual T   762087       120
    Actual R        0    146439

RPM Spoofing Attack
              Pred. T   Pred. R
    Actual T   981017         0
    Actual R        0    164040

Gear Spoofing Attack
              Pred. T   Pred. R
    Actual T   951796         0
    Actual R        0    148949

5.2.3 Results
A comparison between the accuracies of the decision tree, random forest, and logistic regression models is given below:

Dataset          Decision Tree   Random Forest   Logistic Reg.
Fuzzy                0.999           0.999           0.986
DoS                  1.000           1.000           0.999
RPM Spoofing         1.000           1.000           1.000
Gear Spoofing        1.000           1.000           1.000

Figure 5.6: Accuracy Comparison

Random forest works well with high-dimensional data. It is faster to train than decision trees because each tree works on only a subset of the features, so the model can easily handle hundreds of features.

Since we are working with a high-dimensional dataset, Random Forest gives the best accuracy.

The main limitation of random forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train but quite slow at making predictions once trained.

In general, logistic regression performs better when the number of noise variables is less than or equal to the number of explanatory variables, while random forest achieves higher true-positive and false-positive rates as the number of explanatory variables in a dataset increases. Since this dataset contains a large number of variables, logistic regression is less accurate than Random Forest and Decision Tree.

5.3 Multi Classification

5.3.1 Introduction
In machine learning, multiclass classification is a supervised learning task that categorizes new observations into one of three or more classes. This is a multiclass classification application in which Normal, Fuzzy, Flooding, and Malfunction are the possible classes for each observation. In this project, we need to predict which of these classes each data sample belongs to.

Figure 5.7: The proposed optimized CNN-based IDS framework.

5.3.2 Classification
The following three classification techniques were used:

Convolutional Neural Network

The CNN is a type of deep neural network, which is a computational approach based on a
large collection of neural units. Unlike the traditional MLP models, each layer of the CNN
consists of a rectangular 3D grid of neurons. The neurons of a layer are only connected to
the neurons in a receptive field, which is a small region in the immediately preceding layer,
rather than the entire set of neurons.
CNN is a common DL model that is widely used in image classification and recognition problems [7]. Images can be fed directly into CNN models without additional feature extraction or data reconstruction processes. A typical CNN comprises three types of layers: convolutional layers, pooling layers, and fully connected layers [7]. In convolutional layers, the feature patterns of images are extracted automatically by convolution operations. In pooling layers, the data complexity is reduced without losing important information, exploiting local correlations to avoid over-fitting. Fully connected layers serve as a conduit that connects all features and generates the output.
CNNs contain a combination of layers that transform an image into an output the model can understand:

Convolutional layer: creates a feature map by applying a filter that scans the image several pixels at a time.
Pooling layer: scales down the information generated by the convolutional layer so it can be stored effectively.
Fully connected input layer: flattens the outputs into a single vector.
Fully connected layer: applies weights over the inputs generated by the feature analysis.
Fully connected output layer: generates the final probabilities that determine the image class.

Process:

Forward and backward propagation iterate through all of the training samples in the network until the optimal weights are determined and only the most powerful and predictive neurons are activated to make a prediction. The model trains over many epochs, taking one forward and one backward pass over all training samples each time. Forward propagation calculates the loss and cost functions by comparing the actual and predicted targets for each labeled image. Backward propagation uses gradient descent to update the weights and bias of each neuron, attributing more impact to the neurons with the most predictive power, until it arrives at an optimal activation combination. As the model sees more examples, it learns to predict the target better, causing the loss measure to decrease. The cost function takes the average loss across all samples, indicating overall performance.
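The convolution and pooling operations described above can be sketched in plain NumPy. This is a minimal illustration of the two layer types, not the trained model:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most
    deep-learning libraries): slide the kernel over the image and take
    a dot product at each position to build the feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest activation in
    each size x size window, shrinking the feature map."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])   # tiny horizontal edge-detector filter
fmap = conv2d(image, edge)       # feature map, shape (4, 3)
pooled = max_pool(fmap)          # pooled map, shape (2, 1)
print(fmap.shape, pooled.shape)
```

In a real CNN the kernel weights are not hand-chosen like the edge detector here; they are learned by the forward/backward propagation process described above.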

Figure 5.8: Convolutional Neural Network

Outcomes for batch size of 32.

Figure 5.9: Accuracy-loss v/s No. of batches

Figure 5.10: Accuracy-loss v/s No. of epochs

The real-time analysis of the performance of the convolutional neural network shows that training loss decreases drastically as the number of batches iterated over the model increases, and decreases slowly with the number of epochs, i.e., the number of times the learning model passes over the dataset.

VGG19

VGG19 is a variant of the VGG model with 19 weight layers (16 convolutional layers and 3 fully connected layers), plus 5 max-pooling layers and a final softmax layer. There are other variants of VGG, such as VGG11 and VGG16.

Input: VGG19 takes an image input of size 224×224.

Convolutional layers: VGG's convolutional layers use a minimal receptive field, i.e., 3×3, the smallest size that still captures up/down and left/right. Each convolution is followed by a ReLU (rectified linear unit) activation, a piecewise linear function that outputs its input if positive and zero otherwise. The stride is fixed at 1 pixel so that the spatial resolution is preserved after convolution.

Fully connected layers: VGG19 has 3 fully connected layers. The first 2 have 4096 nodes each, and the third has 4 nodes, the total number of classes in our dataset.
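The reason 3×3 kernels suffice is that receptive fields compose when layers are stacked. Using the standard receptive-field recurrence (each layer adds (k - 1) times the product of all earlier strides), two stacked 3×3 convolutions cover the same 5×5 region as one 5×5 kernel, with fewer weights:

```python
def receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of a stack of convolutional layers.

    rf grows by (k - 1) * (product of all earlier strides) per layer,
    so stacked 3x3 / stride-1 layers add 2 pixels each.
    """
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convs see a 5x5 region (2 * 3*3 = 18 weights per
# channel pair, versus 25 for a single 5x5 kernel); three see 7x7.
print(receptive_field([3, 3]))     # 5
print(receptive_field([3, 3, 3]))  # 7
```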

Figure 5.11: VGG-19

Outcomes for batch size of 32.

Figure 5.12: Accuracy-loss v/s No. of batches

Figure 5.13: Accuracy-loss v/s No. of epochs

Inception V3

Inception v3 is an image recognition model made up of symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, concatenations, dropouts, and fully connected layers. Batch normalization is used extensively throughout the model and is applied to activation inputs. Loss is computed using softmax. It is a deep learning model based on convolutional neural networks and is used for image classification.
A high-level diagram of the model is shown below:

Figure 5.14: Inception V3

Outcomes for batch size of 32.

Figure 5.15: Accuracy-loss v/s No. of batches

Figure 5.16: Accuracy-loss v/s No. of epochs

5.3.3 Hyper-parameter Optimisation
CNN models have a large number of hyper-parameters that need tuning. These hyper-parameters can be classified as model-design hyper-parameters and model-training hyper-parameters. Model-training hyper-parameters are used to balance training speed and model performance; they include the batch size, the number of epochs, and the early-stop patience. These hyper-parameters have a direct impact on the structure, effectiveness, and efficiency of CNN models. The batch sizes used are 32, 64, and 128, while the number of epochs is chosen to be either 10 or 20.
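The resulting search space can be explored with a small grid search, sketched below. The `train_and_evaluate` callable is a hypothetical stand-in for fitting one CNN configuration and returning its validation accuracy:

```python
import itertools

# The model-training hyper-parameter grid named above.
batch_sizes = [32, 64, 128]
epoch_options = [10, 20]

def tune(train_and_evaluate):
    """Try every (batch size, epochs) combination and keep the best one."""
    best_cfg, best_acc = None, -1.0
    for batch, epochs in itertools.product(batch_sizes, epoch_options):
        acc = train_and_evaluate(batch, epochs)
        if acc > best_acc:
            best_cfg, best_acc = (batch, epochs), acc
    return best_cfg, best_acc

# Dummy objective just to show the mechanics (favors small batches
# and more epochs); a real run would train the CNN here.
cfg, acc = tune(lambda b, e: e / 20 - b / 1000)
print(cfg)  # (32, 20)
```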

5.3.4 Results
A comparison between CNN, VGG-19, and Inception V3 is given below:

Model          Accuracy   Loss
CNN            1.0000     0.0000
VGG-19         1.0000     0.0000
Inception V3   1.0000     3.24 × 10^-9

Figure 5.17: Accuracy

The results of evaluating the optimized CNN models on the Car-Hacking dataset are shown above. All optimized base CNN models achieve 100% accuracy because the normal and attack patterns in the Car-Hacking dataset can be clearly distinguished in the transformed images shown in Fig. 5.1.

Chapter 6
Conclusions and Outcomes

1. In multi-class classification, VGG-19 and CNN are the two best performing models.

2. In binary classification, Random Forest and Decision Tree are the two best-performing models.

3. I learnt the principles of machine learning and experienced the implementation of one of its applications.

4. I learnt about the working of the electronic systems present inside a vehicle and the risks associated with them.

5. I learnt about possible solutions to the anomalies injected into the CAN bus system through the OBD-II port, and applied one of these solutions through ML models based on logistic classification and deep-learning algorithms.

Bibliography

[1] https://ocslab.hksecurity.net/Datasets/survival-ids

[2] https://sites.google.com/a/hksecurity.net/ocslab/Datasets/CAN-intrusion-dataset

[3] https://ocslab.hksecurity.net/Datasets/car-hacking-dataset

[4] https://ieeexplore.ieee.org/document/9838780

[5] https://www.sciencedirect.com/science/article/pii/S2214209619302451

[6] https://www.wikipedia.org
