
A Machine Learning Technique based Network Intrusion Detection System
for Cyber Security Applications

A Dissertation work
Submitted in Partial Fulfillment for the award of

POST GRADUATE DEGREE FOR MASTER OF TECHNOLOGY


IN
CYBER FORENSIC

SUBMITTED TO

RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA


BHOPAL (M.P.)

SUBMITTED BY
RISHIKA
ENROLL NO: 0002CF19MT04

UNDER THE GUIDANCE OF


Prof. Yogendra P.S. Maravi
(ASSISTANT PROFESSOR)

SCHOOL OF INFORMATION TECHNOLOGY


(UNIVERSITY TEACHING DEPARTMENT)
RGPV BHOPAL

2022
FORM
(For M.E./M.Tech/M.Pharm students)

I, Rishika, daughter of Shri Dr. Lalan Prasad Gupta, aged 24 years, resident of
Dalmianagar, Rohtas (Bihar), PIN 821305, do hereby solemnly state on oath that:

1. I took admission to the M.Tech programme in Cyber Forensic, session 2019, through
counselling/institute-level counselling (C.L.C.) under the General category, at
(S.O.I.T.), R.G.P.V. Bhopal.

2. I have been studying in the postgraduate course as a regular student since
5 September 2019.

I declare that during the period of this course I was not employed full-time in any other
private-sector institution, industrial group, or office.

Signature of deponent

To be verified by the Guide and the Director/Principal

It is verified that the information filled in above by the student named Rishika,
enrollment number 0002CF19MT04, is certified and correct.

Signature of Guide                Date:

Director/Principal

Signature/designation with seal

Name of institution: (S.O.I.T.), R.G.P.V. Bhopal
Institution code: 0002
Telephone no.: 0755-2678823
SCHOOL OF INFORMATION TECHNOLOGY
(University Teaching Department)
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA
(University Of Technology of Madhya Pradesh)

CERTIFICATE

This is to certify that the dissertation entitled “A Machine Learning
Technique based Network Intrusion Detection System for Cyber
Security Applications” is being submitted to Rajiv Gandhi Proudyogiki
Vishwavidyalaya, Bhopal (M.P.) in the School of Information
Technology by Rishika, Enrollment No. 0002CF19MT04, in partial fulfillment
for the award of the Master of Technology degree in Cyber Forensics with
specialization in Security. The matter embodied is the actual work by Rishika.
This work has not been submitted earlier in part or full for the award of any other
degree. This is a record of the bona fide work done by her under my guidance.

1) Candidate Name: Rishika              2) Prof. Yogendra P.S. Maravi
   Enrollment No: 0002CF19MT04             Supervisor
   SOIT, UTD, RGPV BHOPAL                  SOIT, UTD, RGPV BHOPAL

3) Dr. Sanjeev Sharma                   4) Director/Principal of the College
   Head of Department                      SOIT, UTD, RGPV, BHOPAL
   SOIT, UTD, RGPV, BHOPAL

APPROVAL CERTIFICATE

The dissertation work entitled “A Machine Learning Technique
based Network Intrusion Detection System for Cyber Security
Applications” submitted by Rishika (0002CF19MT04) has been examined
by us and is hereby approved for the award of the Master of Technology degree
in Cyber Forensics, for which it has been submitted to the School of Information
Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.).

Internal Examiner External Examiner


Date: Date:

DECLARATION

I, Rishika, a student of Master of Technology, Cyber Forensics,
Session: 2019-2021, School of Information Technology, Rajiv
Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.), hereby declare that
the work presented in this dissertation entitled “A Machine Learning
Technique based Network Intrusion Detection System for Cyber Security
Applications” is the outcome of my own work, and is bona fide and correct to
the best of my knowledge. This work has been carried out taking care of
engineering ethics, does not infringe any patented work, and has not been
submitted to any other university or anywhere else for the award of
any degree or any professional diploma.

Rishika
Enroll. No:- 0002CF19MT04
Date:-
Declaration of Plagiarism

I hereby declare that the work which is being presented in the dissertation
entitled “A Machine Learning Technique based Network Intrusion
Detection System for Cyber Security Applications”, in partial fulfillment
of the requirements for the award of the degree of Master of Technology in Cyber
Forensics, submitted in the School of Information Technology,
Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.), is an authentic
record of my own work carried out under the guidance of Prof. Yogendra P.S.
Maravi, Assistant Professor, School of Information Technology,
RGPV, Bhopal. I have not submitted the matter embodied in this report for
the award of any other degree.

I also declare that “A check of plagiarism has been carried out on the
thesis/dissertation and is found within the acceptable limit and report of which
is enclosed herewith”.

Rishika
Enroll.No:- 0002CF19MT04
Date:-

Supervisor

Director/Principal sign with seal


ACKNOWLEDGEMENTS
This dissertation is one of the milestones on my journey towards obtaining my M.Tech
and plays a very important role in my overall career. It would not have been
possible to complete this work without the help and support of the numerous people
around me. At the end of my thesis, it is a pleasant task to express my thanks to all
those who contributed in many ways to the success of this study and made it an
unforgettable experience for me.
I would like to express my deep and sincere gratitude to my principal supervisor
Prof. Yogendra P.S. Maravi for his excellent guidance, invaluable suggestions and
continuous encouragement at all stages of my research work. I have been
fortunate to have him as my guide, as he has been a great influence on me, both as a
person and as a professional.
I also gratefully acknowledge the Head of Department, Dr. Sanjeev Sharma, and
faculty members Dr. Jitendra Agarwal, Dr. Nishchol Mishra, Dr. Varsha
Sharma, Prof. Priyamwada Sharma and Prof. Vivek Sharma for their advice,
supervision, and crucial contributions. Their involvement and originality have
triggered and nourished my intellectual maturity, from which I will benefit for a long
time to come. I am grateful in every possible way and hope to keep up our
relationship in the future.
I thank Mr. Neelesh Sharma and all our college staff members, who were always
willing to help find solutions to any problem and were immeasurably
supportive and caring. I would also like to give special acknowledgement to my
friends, who shared their research work experience with me, and for their valuable
companionship, without which completing my thesis would have been a tough
task. I am extremely grateful to my parents for their love, prayers, care and
sacrifices. I extend my deepest gratitude to my elder brother Deepak Sav and
my family members for their invaluable support, guidance, love, affection and
encouragement, and I feel blessed to have such a caring family.

Rishika
0002CF19MT04
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER 1 INTRODUCTION

1.1 Overview
1.2 Internet of Things (IoT)
1.3 Intrusion Detection System
1.3.1 Network Intrusion Detection Systems
1.3.2 Host Intrusion Detection Systems
1.4 Detection Method
1.4.1 Signature-based
1.4.2 Anomaly-based
1.4.3 Intrusion prevention
1.4.4 IDS Placement
1.5 Motivation of Thesis
1.6 Objective of Research
1.7 Organization of Dissertation

CHAPTER 2 LITERATURE SURVEY

2.1 Literature Survey
2.2 Problem Identification

CHAPTER 3 TECHNICAL BACKGROUND

3.1 IoT Next Evolution
3.2 The Internet of Things
3.3 Addressing the Class Imbalance Problem
3.4 Cloud Computing Security
3.5 Machine Learning Techniques
3.5.1 Decision Tree
3.5.1.1 Iterative Dichotomiser 3 (ID3)
3.5.2 Naïve Bayes
3.5.3 Logistic Regression
3.5.4 K-Nearest Neighbors Algorithm (KNN)
3.5.5 Support Vector Machine
3.5.6 Random Forest

CHAPTER 4 PROPOSED METHODOLOGY

4.1 Proposed Work
4.2 Methodology
4.2.1 Data Selection and Loading
4.2.2 Data Pre-processing
4.2.3 Splitting Dataset into Train and Test Data
4.2.4 Feature Extraction
4.2.5 Classification
4.2.6 Prediction
4.2.7 Result Generation

CHAPTER 5 IMPLEMENTATION AND RESULT DISCUSSION

5.1 Simulation Software
5.2 Results Discussion

CHAPTER 6 CONCLUSION AND FUTURE SCOPE

6.1 Conclusion
6.2 Future Scope

REFERENCES
List of Publication
Plagiarism Report
LIST OF FIGURES

Figure No. Title

Figure 1.1 Attack system
Figure 1.2 A simple active defense architecture
Figure 3.1 Decision tree
Figure 3.2 ID3
Figure 3.3 Regression model
Figure 3.4 Prediction class metrics
Figure 3.5 Support vector machine
Figure 3.6 Random forest
Figure 4.1 Flow Chart
Figure 5.1 Snapshot of Spyder environment
Figure 5.2 Dataset
Figure 5.3 Missing data removal
Figure 5.4 Test data
Figure 5.5 Train Data
Figure 5.6 Confusion Matrix
Figure 5.7 Prediction class metrics
Figure 5.8 ROC
Figure 5.9 Result Parameters
Figure 5.10 Graphical representation of result comparison
Figure 5.11 Accuracy comparison
LIST OF TABLES

Table No. Title

Table 5.1 Simulation Result of DT with C4.5
Table 5.2 Comparison of proposed work with previous work
LIST OF ABBREVIATIONS

AI Artificial Intelligence
CNN Convolutional Neural Network
DNN Deep Neural Network
DT Decision Tree
HIDS Host-based Intrusion Detection System
IDPS Intrusion Detection and Prevention System
IDS Intrusion Detection System
IoT Internet of Things
IPS Intrusion Prevention System
KNN K-Nearest Neighbors
ML Machine Learning
NB Naïve Bayes
NIDS Network Intrusion Detection System
NN Neural Network
RF Random Forest
SIEM Security Information and Event Management
SVM Support Vector Machine
ABSTRACT

Intrusion detection is one of the important security problems in today’s
cyber world. A significant number of techniques based on machine learning
approaches have been developed to address it. In this work, a machine learning
algorithm is used to identify intrusions, and it can also identify the attacker’s
details. Intrusion detection systems (IDS) are mainly of two types: host-based and
network-based. A Host-based Intrusion Detection System (HIDS) monitors an
individual host or device and sends alerts to the user if suspicious activities are
detected, such as modification or deletion of a system file, an unwanted
sequence of system calls, or unwanted configuration changes. A
Network-based Intrusion Detection System (NIDS) is usually placed at network
points such as gateways and routers to check for intrusions in the network
traffic.

This dissertation presents the C4.5 decision tree algorithm for
classification. The C4.5 algorithm is used in data mining as a decision tree
classifier, which can be employed to generate a decision based on a
sample of data. The dataset used is the KDD dataset obtained from Kaggle. The
simulation is performed using Python 3.7 in the Spyder environment. The simulation
results show that the proposed approach gives good results in
terms of precision, recall, F1-score, error rate and accuracy. The overall
accuracy achieved is 96.3% (reported as approximately 97%), with an error rate
of about 3%.
CHAPTER 1
INTRODUCTION


1.1 OVERVIEW

In computers and computer networks, an attack is any attempt to expose, alter, disable,
destroy, steal or gain information through unauthorized access to or make unauthorized
use of an asset. A cyber attack is any offensive maneuver that targets computer information
systems, infrastructures, computer networks, or personal computer devices. An attacker is
a person or process that attempts to access data, functions, or other restricted areas of the
system without authorization, potentially with malicious intent. Depending on the
context, cyber attacks can be part of cyber warfare or cyber terrorism. A cyber attack can
be employed by sovereign states, individuals, groups, society, or organizations, and it
may originate from an anonymous source. A product that facilitates a cyber attack is
sometimes called a cyber weapon. [1]

A cyber attack may steal, alter, or destroy a specified target by hacking into a susceptible
system.[3] Cyber attacks can range from installing spyware on a personal computer to
attempting to destroy the infrastructure of entire nations. Legal experts are seeking to
limit the use of the term to incidents causing physical damage, distinguishing it from the
more routine data breaches and broader hacking activities. [4]

Figure 1.1: Attack system

1.2 INTERNET OF THINGS (IOT)

The Internet of things (IoT) describes the network of physical objects that are embedded
with sensors, software, and other technologies for the purpose of connecting and
exchanging data with other devices and systems over the Internet.

The field has evolved due to the convergence of multiple technologies, including real-time
analytics, machine learning, ubiquitous computing, commodity sensors, and embedded systems.
Traditional fields of embedded systems, wireless sensor networks, control systems,
automation (including home and building automation), and others all contribute to
enabling the Internet of things. In the consumer market, IoT technology is most
closely associated with products pertaining to the concept of the "smart home", including
devices and appliances (such as lighting fixtures, thermostats, home security systems and
cameras, and other home appliances) that support one or more common ecosystems and
can be controlled via devices associated with that ecosystem, such as smartphones and
smart speakers. The IoT can also be used in healthcare systems.

There are a number of serious concerns about dangers in the growth of the IoT, especially
in the areas of privacy and security, and consequently industry and governmental moves
to address these concerns have begun including the development of international
standards.

IoT system architecture, in its simplest view, consists of three tiers: Tier 1: Devices, Tier
2: the Edge Gateway, and Tier 3: the Cloud. Devices include networked things, such as
the sensors and actuators found in IoT equipment, particularly those that use protocols
such as Modbus, Bluetooth, Zigbee, or proprietary protocols to connect to an Edge
Gateway. The Edge Gateway layer consists of sensor data aggregation systems called
Edge Gateways that provide functionality such as pre-processing of the data, securing
connectivity to the cloud using systems such as WebSockets and the event hub, and, in
some cases, edge analytics or fog computing.

The Internet of things requires huge scalability in the network space to handle the surge
of devices. IETF 6LoWPAN would be used to connect devices to IP networks. With
billions of devices being added to the Internet space, IPv6 will play a major role in
handling the network layer scalability. IETF's Constrained Application Protocol,
ZeroMQ, and MQTT would provide lightweight data transport.

Fog computing is a viable alternative to prevent such a large burst of data flow through
the Internet. The computation power of edge devices to analyse and process data is
extremely limited. Limited processing power is a key attribute of IoT devices, as their
purpose is to supply data about physical objects while remaining autonomous. Heavy
processing requirements use more battery power, harming the IoT's ability to operate.
Scalability is easy because IoT devices simply supply data through the Internet to a server
with sufficient processing power.

1.3 INTRUSION DETECTION SYSTEM

An intrusion detection system (IDS) is a device or software application that monitors a
network or systems for malicious activity or policy violations. Any intrusion activity or
violation is typically reported either to an administrator or collected centrally using a
security information and event management (SIEM) system. A SIEM system combines
outputs from multiple sources and uses alarm filtering techniques to distinguish malicious
activity from false alarms. [1]

Figure 1.2: A simple active defense architecture

IDS types range in scope from single computers to large networks.[3] The most common
classifications are network intrusion detection systems (NIDS) and host-based intrusion
detection systems (HIDS). A system that monitors important operating system files is an
example of an HIDS, while a system that analyzes incoming network traffic is an
example of an NIDS. It is also possible to classify IDS by detection approach. The most
well-known variants are signature-based detection (recognizing bad patterns, such as
malware) and anomaly-based detection (detecting deviations from a model of "good"
traffic, which often relies on machine learning). Another common variant is reputation-
based detection (recognizing the potential threat according to the reputation scores). Some
IDS products have the ability to respond to detected intrusions. Systems with response
capabilities are typically referred to as an intrusion prevention system.[4]

Intrusion detection systems can also serve specific purposes by augmenting them with
custom tools, such as using a honeypot to attract and characterize malicious traffic.
A detailed investigation and analysis of various machine learning techniques has been
carried out to find the causes of the problems these techniques face
in detecting intrusive activities. Attack classification and a mapping of the
attack features are provided for each attack. Issues related to
detecting low-frequency attacks using a network attack dataset are also discussed, and
viable methods are suggested for improvement. Machine learning techniques have been
analyzed and compared in terms of their capability to detect the various
categories of attacks.

1.3.1 Network Intrusion Detection Systems

Network intrusion detection systems (NIDS) are placed at a strategic point or points
within the network to monitor traffic to and from all devices on the network. [8] The NIDS
performs an analysis of passing traffic on the entire subnet and matches the traffic that is
passed on the subnet against a library of known attacks. Once an attack is identified, or
abnormal behavior is sensed, an alert can be sent to the administrator. An example
would be installing an NIDS on the subnet where firewalls are located in order to see if
someone is trying to break into the firewall. Ideally one would scan all inbound and
outbound traffic; however, doing so might create a bottleneck that would impair the
overall speed of the network. OPNET and NetSim are commonly used tools for
simulating network intrusion detection systems. NIDS are also capable of
comparing signatures for similar packets to link and drop harmful detected packets which
have a signature matching the records in the NIDS. When NIDS designs are classified
according to system interactivity, there are two types: on-line and off-line
NIDS, often referred to as inline and tap mode, respectively. An on-line NIDS deals
with the network in real time. It analyses the Ethernet packets and applies rules to
decide whether the traffic is an attack or not. An off-line NIDS deals with stored data and
passes it through some processes to decide whether it is an attack or not.

NIDS can also be combined with other technologies to increase detection and prediction
rates. Artificial Neural Network (ANN) based IDS are capable of analyzing huge volumes of
data in a smart way, due to the self-organizing structure that allows ANN-based IDS to
recognize intrusion patterns more efficiently.[9] Neural networks assist IDS in predicting
attacks by learning from mistakes; ANN-based IDS help develop an early warning system based
on two layers. The first layer accepts single values, while the second layer takes the first
layer's output as input; the cycle repeats and allows the system to automatically recognize
new, unforeseen patterns in the network. Such a system can average a 99.9% detection and
classification rate, based on research results for 24 network attacks divided into four
categories: DoS, Probe, Remote-to-Local, and User-to-Root.

1.3.2 Host Intrusion Detection Systems

Host intrusion detection systems (HIDS) run on individual hosts or devices on the
network. A HIDS monitors the inbound and outbound packets from the device only and
will alert the user or administrator if suspicious activity is detected. It takes a snapshot of
existing system files and matches it to the previous snapshot. If the critical system files
were modified or deleted, an alert is sent to the administrator to investigate. An example
of HIDS usage can be seen on mission critical machines, which are not expected to
change their configurations.
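
The snapshot comparison described above can be illustrated with a minimal Python sketch. It assumes the monitored files are supplied as a list of paths and that a SHA-256 digest of each file serves as the snapshot; the function names `take_snapshot` and `compare_snapshots` are hypothetical and are not taken from any particular HIDS product.

```python
import hashlib
from pathlib import Path


def take_snapshot(paths):
    """Map each file path to the SHA-256 digest of its contents (None if missing)."""
    snapshot = {}
    for path in paths:
        p = Path(path)
        snapshot[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None
    return snapshot


def compare_snapshots(previous, current):
    """Return alert strings for files that changed between two snapshots."""
    alerts = []
    for path, old_digest in previous.items():
        new_digest = current.get(path)
        if new_digest == old_digest:
            continue  # unchanged, or still missing
        if new_digest is None:
            alerts.append(f"DELETED: {path}")
        elif old_digest is None:
            alerts.append(f"CREATED: {path}")
        else:
            alerts.append(f"MODIFIED: {path}")
    return alerts
```

In this sketch a baseline would be taken once with `take_snapshot`, and each later snapshot passed to `compare_snapshots`; any returned alert would then be forwarded to the administrator for investigation.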

1.4 DETECTION METHOD

Signature-based IDS refers to the detection of attacks by looking for specific patterns,
such as byte sequences in network traffic, or known malicious instruction sequences used
by malware.[14] This terminology originates from anti-virus software, which refers to
these detected patterns as signatures. Although signature-based IDS can easily detect
known attacks, it is difficult to detect new attacks, for which no pattern is available.

1.4.1 Signature-based

In signature-based IDS, the signatures are released by a vendor for all of its products.
Timely updating of the IDS with new signatures is a key aspect.
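
Signature matching on byte sequences can be sketched in a few lines of Python. The signature names and byte patterns below are illustrative assumptions only and do not come from any vendor's rule set; a real IDS uses far richer rule languages.

```python
# Hypothetical signature database: signature name -> byte pattern (illustrative only).
SIGNATURES = {
    "directory-traversal": b"../../",
    "sql-injection-probe": b"' OR '1'='1",
}


def match_signatures(payload: bytes, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose byte pattern occurs in the payload."""
    return sorted(name for name, pattern in signatures.items() if pattern in payload)
```

A payload that contains no known pattern yields an empty list, which reflects the limitation noted above: an attack with no released signature passes undetected until the signature set is updated.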

1.4.2 Anomaly-based

Anomaly-based intrusion detection systems were primarily introduced to detect unknown
attacks, in part due to the rapid development of malware. The basic approach is to use
machine learning to create a model of trustworthy activity, and then compare new
behavior against this model. Since these models can be trained according to the
applications and hardware configurations, machine learning-based methods generalize
better than traditional signature-based IDS. Although this
approach enables the detection of previously unknown attacks, it may suffer from false
positives: previously unknown legitimate activity may also be classified as malicious.
Most existing IDSs also suffer from a time-consuming detection process, which
degrades their performance.
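
The idea of comparing new behavior against a model of trustworthy activity can be shown with a deliberately simple stand-in. The sketch below is not a machine learning model: it substitutes a statistical baseline (mean and standard deviation of one traffic feature) and flags values by z-score. The function names and the threshold of 3.0 are assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_baseline(normal_values):
    """Learn a baseline (mean, sample standard deviation) from trusted observations."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value whose z-score against the baseline exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu  # any deviation from a constant baseline is flagged
    return abs(value - mu) / sigma > threshold
```

With a baseline fitted on request rates such as `[100, 102, 98, 101, 99]`, a value of 500 is flagged, but so is a modest legitimate increase that the baseline never saw, which is the false-positive behavior described above.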

1.4.3 Intrusion prevention

Some systems may attempt to stop an intrusion attempt but this is neither required nor
expected of a monitoring system. Intrusion detection and prevention systems (IDPS) are
primarily focused on identifying possible incidents, logging information about them, and
reporting attempts. In addition, organizations use IDPS for other purposes, such as
identifying problems with security policies, documenting existing threats and deterring
individuals from violating security policies. IDPS have become a necessary addition to
the security infrastructure of nearly every organization.

IDPS typically record information related to observed events, notify security
administrators of important observed events, and produce reports. Many IDPS can also
respond to a detected threat by attempting to prevent it from succeeding. They use several
response techniques, which involve the IDPS stopping the attack itself, changing the
security environment (e.g. reconfiguring a firewall) or changing the attack's content.

Intrusion prevention systems (IPS), also known as intrusion detection and prevention
systems (IDPS), are network security appliances that monitor network or system activities
for malicious activity. The main functions of intrusion prevention systems are to identify
malicious activity, log information about this activity, report it and attempt to block or
stop it.

Intrusion prevention systems are considered extensions of intrusion detection systems
because they both monitor network traffic and/or system activities for malicious activity.
The main difference is that, unlike intrusion detection systems, intrusion prevention systems
are placed in-line and are able to actively prevent or block intrusions that are detected.

1.4.4 IDS Placement

The placement of intrusion detection systems is critical and varies depending on the
network. The most common placement is behind the firewall, on the edge of the
network. This practice provides the IDS with high visibility of traffic entering the
network, and the IDS will not receive any traffic between users on the network. The edge
of the network is the point at which a network connects to the extranet. If more resources
are available, another practice is for a technician to
place the first IDS at the point of highest visibility and, depending on resource
availability, place another at the next highest point, continuing that process until all
points of the network are covered.

If an IDS is placed beyond a network's firewall, its main purpose is to defend against
noise from the internet and, more importantly, against common attacks such as
port scans and network mapping. An IDS in this position would monitor layers 4 through 7 of
the OSI model and would be signature-based. This is a very useful practice because,
rather than showing actual breaches into the network that made it through the firewall,
attempted breaches are shown, which reduces the number of false positives. The IDS
in this position also assists in decreasing the amount of time it takes to discover
successful attacks against a network.

1.5 MOTIVATION OF THESIS

Hacking incidents are increasing day by day as technology rolls out, and a large number
of them are reported by companies each year. Many existing systems
do not effectively classify and predict the attacks present in the network. The
main motivation of this research is to study and simulate the cyber attack system.

1.6 OBJECTIVE OF RESEARCH

• The main objective of this research is to propose a model that overcomes the
disadvantages of existing systems.
• The system aims to increase the accuracy of the classification results by classifying
the network traffic data using the C4.5 decision tree classification algorithm.
• It aims to enhance the performance of the overall classification results.
• Collect the Network Intrusion Detection System (NIDS) dataset, i.e. KDD, from the
Kaggle machine learning repository, then implement the proposed
approach based on a decision tree with the C4.5 algorithm.
• Simulate the proposed method in Spyder with Python 3.7, compute
parameters such as precision, recall, F-measure and accuracy, and finally
generate result graphs and compare them with previous work.
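
The evaluation parameters named in the objectives can all be computed from the four confusion-matrix counts (true positives, false positives, false negatives, true negatives). The sketch below uses the standard definitions; the function name is hypothetical, and the counts in the usage note are invented for illustration and are not results from this dissertation.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Precision: fraction of predicted attacks that really are attacks.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of real attacks that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    error_rate = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "error_rate": error_rate,
    }
```

For example, with the made-up counts `tp=90, fp=10, fn=5, tn=95`, the accuracy is 185/200 = 0.925 and the error rate is 15/200 = 0.075; accuracy and error rate always sum to 1 under these definitions.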

1.7 ORGANIZATION OF DISSERTATION

Chapter 1: This chapter presents the basic concepts of cyber attacks, intrusion detection
systems and machine learning classification techniques, along with the motivation and
objectives of the thesis.

Chapter 2: This chapter reviews previous research work. Approximately
20-25 IEEE papers are studied and summarized. After the literature review, the problem
statements are also discussed.

Chapter 3: This chapter covers the technical background of the work, including advantages
and disadvantages.

Chapter 4: This chapter presents the flow chart, the working of its steps and a
discussion of the methodology steps.

Chapter 5: This chapter presents the simulation software, the discussion of simulation
results, the parameter calculations and a comparison of parameters with existing results.

Chapter 6: This chapter concludes the work with the significant achievements of the
proposed design. Future work is also discussed in this chapter.

CHAPTER 2
LITERATURE SURVEY


This chapter provides a detailed survey of past work in the cyber attack field. It covers
theoretical and experimental work on different types of attacks, intrusion detection,
and host-based detection. It briefly describes various improvements in
performance in terms of precision, recall, F1-score, accuracy and error rate. The following
reviews provide a comprehensive survey of developments in state-of-the-art attack
prediction for IoT cyber attacks around the world.

2.1 LITERATURE SURVEY

H. Hou et al. [1]: With the continuous development of network technology, cyberattack
detection mechanisms play a vital role in ensuring the security of computers and network
systems. However, with the rapid growth of network traffic, traditional intrusion detection
systems (IDSs) are far from being able to quickly and accurately identify complex and
diverse network attacks, especially low-frequency attacks. To enhance the
overall security of the Internet, an IDS based on hierarchical long short-term memory
(HLSTM) networks is proposed. With the introduction of HLSTM, the network can learn
across multiple levels of temporal hierarchy over complex network traffic sequences. The
system is evaluated on the well-known benchmark data set NSL-KDD for comparison
with other existing methods. The experimental results demonstrate that, compared with
existing state-of-the-art methods, our system has better detection performance for
different types of cyberattacks. In addition, the low-frequency network attack types have
higher classification accuracy and a lower false detection rate.

P. Feng et al. [2]: With the widespread usage of Android smartphones in our daily lives,
the Android platform has become an attractive target for malware authors. There is an urgent
need to develop automatic malware detection approaches to prevent the spread of
malware. Traditional signature-based detection methods cannot handle the rapid evolution
of complex malware or the emergence of new types of malware. Due to the limited
code coverage and poor efficiency of dynamic analysis, in this work, we propose a
new Android malware detection approach based on static analysis via graph neural
network. Instead of extracting Application Programming Interface (API) call information,
we further analyze the source code of Android applications to extract high-level semantic
information, which increases the barrier to evading detection. In particular, we construct
an approximate call graph from function invocation relationships within an Android
application to represent the application, and further extract intra-function attributes,
including required permissions, security level and statistical instruction information, to
form the node attributes within graph structures. Then, we use a graph neural network
(GNN) to generate a vector representation of the application, and malware
classification is performed on this representation. We conduct experiments on real-world
application samples. The experimental results demonstrate that our approach achieves
highly effective malware detection and outperforms state-of-the-art detection approaches.

S. Liu et al.,[3] Internet of Things (IoT) integrates a variety of software (e.g.,


autonomous vehicles and military systems) in order to enable the advanced and intelligent
services. This software increases the potential of cyber-attacks because an adversary can
launch an attack using system vulnerabilities. Existing software vulnerability analysis
methods used to be relying on human experts crafted features, which usually miss many
vulnerabilities. It is important to develop an automatic vulnerability analysis system to
improve the countermeasures. However, source code is not always available (e.g., most
IoT related industry software are closed source). Therefore, vulnerability detection on
binary code is a demanding task. This article addresses the automatic binary-level
software vulnerability detection problem by proposing a deep learning-based approach.
The proposed approach consists of two phases: binary function extraction, and model
building. First, we extract binary functions from the cleaned binary instructions obtained
by using IDA Pro. Then, we employ the attention mechanism on top of a bidirectional
long short-term memory for building the predictive model. To show the effectiveness of
the proposed approach, we have collected datasets from several different sources. We
have compared our proposed approach with a series of baselines including source code-
based techniques and binary code-based techniques. We have also applied the proposed
approach to real-world IoT-related software such as the VLC media player and the LibTIFF
project that are used in autonomous vehicles. Experimental results show that our proposed
approach outperforms the baselines and is able to detect more vulnerabilities.

Y. Jin et al.,[4] Cybersecurity threats from malware attacks have become one of the most
serious issues on the Internet nowadays. Most types of malware, after intruding into an
individual computer, attempt to connect to the corresponding Command and Control (C&C)
servers using IP addresses or Fully Qualified Domain Names (FQDNs) and receive further
instructions (e.g. attacking target IP addresses and FQDNs) from them in order to conduct
subsequent cyber attacks. In recent years, it has become clear that DNS traffic is
used for communication between malware-infected computers and the C&C servers.
In this research, we focus on these peculiarities and propose a method for detecting
malware-infected computers by monitoring unintended DNS traffic on wireless networks
in collaboration with a DHCP (Dynamic Host Configuration Protocol) server. By
deploying the proposed system on campus wireless networks, computers within the
DHCP-configured environment can be detected when they are infected by some types of
malware and attempt to communicate with the corresponding C&C servers using the DNS
(Domain Name System) protocol. In this work, we describe the detailed design of the
proposed method and the future work includes prototype implementation as well as
evaluations.

B. Peng et al.,[5] Efficient and accurate detection of abnormal interaction information at
network-related terminals of new energy plants and stations can effectively improve the
network security and protection capabilities of new energy plants and stations. Firstly, the
network attack scenarios, malformed messages and irregular business instructions of new
energy plants and stations are analyzed. Secondly, the deep parallel parsing technology of
real-time interactive protocol for new energy plants and stations is proposed to improve
the efficiency of protocol parsing. Thirdly, a real-time interactive process anomaly
detection technology based on feature matching is proposed. The K-NN classification
algorithm is used to match the characteristic vectors of network data packets in the new
energy plant and station system, which realizes the anomaly detection of network attack
scenarios, malformed messages and irregular business instructions in the new energy
plant and station system. Finally, a simulation experiment environment of the new energy
plant and station system is built to verify the method proposed in this work. The
experimental results show that the algorithm has a high anomaly detection capability and
a low false alarm rate. It is of great significance to improve the level of network security
protection of new energy plants and stations, and to ensure the safe and stable operation
of new energy plants and stations.

W. Bi et al.,[6] The safety and security of the load frequency control (LFC) system are
threatened by cyber attacks and physical attacks. Sophisticated attackers can launch cyber-
physical attacks (CPAs), in which cyber attacks are used to increase the impacts of
physical attacks by sending wrong instructions to the LFC controller. Considering that
parameters of cyber attacks in CPAs may be either fixed or variable, a novel detection
scheme is designed against two potential CPAs on an LFC system, namely fixed cyber-
physical attacks (FCPAs) and variable cyber-physical attacks (VCPAs), respectively. As
for FCPAs, dynamic characteristics of Area Control Error (ACE) are utilized to detect the
compromised data. Compared with FCPAs, VCPAs are more deceptive. A relation-based
(RB) feature extraction method is introduced to distinguish the signals compromised by
VCPAs from the normal ones. A detection model that does not require compromised
samples is developed with the aid of support vector domain description. In the end, a
comprehensive detection scheme is designed to detect both FCPAs and VCPAs on the
LFC system.

K. Liu et al.,[7] This work reviews the problem of intrusion detection for the smart home
and different approaches to detecting intrusions. A hybrid intrusion detection method based
on Convolutional Neural Networks (CNN) and K-means is proposed in this work. At the smart
home device node, K-means is used to generate the rule base by clustering, then Principal
Component Analysis (PCA) is used to extract the dimensionality-reduced features. During
the test process, PCA is also used to extract the dimensionality-reduced features, and
feature matching is performed with the rule base to determine the intrusion data. At the
smart home server side, a CNN model is proposed to detect the specific type of intrusion.
Combined with the Synthetic Minority Oversampling Technique (SMOTE) and undersampling
techniques, the CNN model performs well in reducing the missing report
rate (MRR) in minority categories. The results of the experiment conducted on the KDD99
dataset show that such a hybrid method can improve the detection rate of a smart home
intrusion detection system and reduce the MRR in minority categories.

Y. Jin et al.,[8] Malware has become one of the most critical targets of network security
solutions nowadays. Many types of malware receive further instructions from the C&C
servers and the attack targets may be instructed by IP addresses which causes direct
attacks without DNS name resolution from the malware-infected computers. Meanwhile,
several programs that are hidden from the users (e.g. malware, virus, etc.)
may perform DNS name resolutions for cyber attacks or other communications. In this
work, we propose a client-based anomaly traffic detection and blocking mechanism by
monitoring DNS name resolution per application program. In the proposed mechanism,
by the collaboration of DNS proxy and packet filter, DNS traffic is monitored on the
client and the traffic destined to the IP addresses obtained without DNS name resolution
or the traffic from unrecognized programs will be detected and blocked. In addition, in
order to mitigate false positive detection, an alert-window will be shown to let the users
decide whether to allow the traffic or not. We implemented a prototype system on a
Windows 7 client and confirmed that the proposed mechanism worked as expected.

R. Velea et al.,[9] Malware detection is an important aspect of cyber security. The
process of identifying malicious code in files or network traffic is very complex and
requires a lot of computational resources. Most security solutions that deal with malware
detection implement advanced string matching algorithms or look for certain behavioral
patterns during program execution. These methods of detection can cause significant
performance penalties for real-time applications, can limit the scan surface or degrade
user experience. In this work we discuss a hybrid approach that leverages CPU and GPU
compute capabilities in order to accelerate pattern matching for malware signatures. The
solution presented focuses on improving performance and reducing power consumption
of string matching algorithms on devices such as ultrabooks and laptops.

S. Merat et al.,[10] The main focus of this work is the improvement of machine learning
where a number of different types of computer processes can be mapped in a multitasking
environment. A software mapping and modelling paradigm named SHOWAN is
developed to learn and characterize the cyber awareness behaviour of a computer process
against multiple concurrent threads. The examined process initially tended to manage
numerous tasks poorly, but it gradually learned to acquire and control
tasks in the context of anomaly detection. Finally, SHOWAN plots the abnormal
activities of a manually projected task and compares them with the loading trends of other
tasks within the group.

S. Han et al.,[11] Cyber-physical systems (CPSs) integrate the computation with physical
processes. Embedded computers and networks monitor and control the physical
processes, usually with feedback loops where physical processes affect computations and
vice versa. CPS was identified as one of the eight research priority areas in the August
2007 report of the President's Council of Advisors on Science and Technology, as CPS
will be the core component of many critical infrastructures and industrial control systems
in the near future. However, a variety of random failures and cyber attacks exist in CPS,
which greatly restrict their growth. Fortunately, an intrusion detection mechanism could
take effect for protecting CPS. When misbehavior is found by the intrusion detector, the
appropriate action can be taken immediately so that any harm to the system will be
minimized. As CPSs are yet to be defined universally, the application of the intrusion
detection mechanism remains open presently. As a result, the effort will be made to
discuss how to appropriately apply the intrusion detection mechanism to CPS in this
work. By examining the unique properties of CPS, it intends to define the specific
requirements first. Then, the design outline of the intrusion detection mechanism in CPS
is introduced in terms of the layers of system and specific detection techniques. Finally,
some significant research problems are identified for enlightening the subsequent studies.

M. Bousaaid et al.,[12] The advent of information and communication technologies
(ICT) in the domain of education represents a real opportunity for spreading knowledge.
Many results have already been obtained, which aimed mainly to facilitate the
arrangement of pedagogical contents by large and massive deployment of digital
environments of work. The development of technologies of multimedia, linked to that of
Internet and democratization of high output, has made henceforth E-learning possible for
learners being in virtual classes and geographically distributed. The quality and quantity
of asynchronous and synchronous communications are the key elements for E-learning
success. It is important to have a propitious supervision to reduce the feeling of isolation
in E-learning. This feeling of isolation is among the main causes of loss and high rates of
stalling in E-learning. To overcome this feeling of isolation experienced by learners, we
hope to give the trainer and learners an environment in which they behave as if they
were in a real class. One preferable means is to use USB cameras to analyse in
real time the movements linked to the behaviours of the trainer and learners. This also gives
the trainer feedback on learners' reactions like that he is used to
in a real class, notably taking into account each learner's attention and level of participation.
The researches to be conducted in this domain aim to bring solutions of convergence
coming from real time image for the capture and recognition of hand gestures. These
gestures will be analysed by the system and transformed as indicator of participation. This
latter is displayed in the table of performance of the tutor as a curve according to the time.
The indicator is defined as the frequency with which a learner raises a hand to participate
during a learning session.

A. J. Smith et al.,[13] Reverse Code Engineering (RCE) to detect anti-debugging
techniques in software is a very difficult task. Code obfuscation is an anti-debugging
technique that makes detection even more challenging. The Rule Engine Detection by
Intermediate Representation (REDIR) system for automated static detection of obfuscated
anti-debugging techniques is a prototype designed to help the RCE analyst improve
performance through this tedious task. Three tenets form the REDIR foundation. First,
Intermediate Representation (IR) improves the analyzability of binary programs by
reducing a large instruction set down to a handful of semantically equivalent statements.
Next, an Expert System (ES) rule-engine searches the IR and initiates a sense-making
process for anti-debugging technique detection. Finally, an IR analysis process confirms
the presence of an anti-debug technique. The REDIR system is implemented as a
debugger plug-in. Within the debugger, REDIR interacts with a program in the
disassembly view. Debugger users can instantly highlight anti-debugging techniques and
determine if the presence of a debugger will cause a program to take a conditional jump
or fall through to the next instruction.

M. Guri et al.,[14] Modern malicious programs often escape dynamic analysis, by
detecting forensic instrumentation within their own runtime environment. This has
become a major challenge for malware researchers and analysts. Current defensive
analysis of anti-forensic malware often requires painstaking step-by-step manual
inspection. Code obfuscation may further complicate proper analysis. Furthermore,
current defensive countermeasures are usually effective only against anti-forensic
techniques which have already been identified. In this work we propose a new method to
detect and classify anti-forensic behavior, by comparing the trace-logs of the suspect
program between different environments. Unlike previous works, the presented method is
essentially noninvasive (does not interfere with original program flow). We separately
trace the flow of instructions (Opcode) and the flow of Input-Output operations (IO). The
two dimensions (Opcode and IO) complement each other to provide reliable
classification. Our method can identify split behavior of suspected programs without prior
knowledge of any specific anti-forensic technique; furthermore, it relieves the malware
analyst from tedious step-by-step inspection. Those features are critical in the modern
Cyber arena, where rootkits and Advanced Persistent Threats (APTs) are constantly
adopting new sophisticated anti-forensic techniques to deceive analysis.

M. Hirabayashi et al.,[15] Vision-based object detection using camera sensors is an
essential piece of perception for autonomous vehicles. Various combinations of features
and models can be applied to increase the quality and the speed of object detection. A
well-known approach uses histograms of oriented gradients (HOG) with deformable
models to detect a car in an image. A major challenge of this approach is its
computational cost, which makes it hard to meet the real-time constraints of the real world. In this
work, we present an implementation technique using graphics processing units (GPUs) to
accelerate computations of scoring similarity of the input image and the pre-defined
models. Our implementation considers the entire program structure as well as the specific
algorithm for practical use. We apply the presented technique to the real-world vehicle
detection program and demonstrate that our implementation using commodity GPUs can
achieve speedups of 3x to 5x in frame-rate over sequential and multithreaded
implementations using traditional CPUs.

Lin Wang et al.,[16] A number of shocking cyber-attacks have happened in recent years,
and the damage they have caused has led to the emergence of cyber-security as a
consideration when designing embedded systems. Software vulnerabilities and physical
attacks are the most severe threats such systems face. This work provides information about
hardware designed to monitor potential intrusions and incidences of unauthorized access.
Crucially, it can also trace execution patterns and cryptographic schemes in relation to
memory authentication. The automated compiler extracts the intrusion detection model
and covers the important instructions with cipher text at the compile time. At runtime, the
proposed hardware monitors the instructions that change program trace and access
memory data, which ensures that the process and data follow the permissible behavior and
resist potential attacks. The security analysis shows that the proposed techniques can
recognize and eliminate a wide range of common software and physical threats with low
performance penalties and minimal overhead.

R. Tao et al.,[17] Application features such as port numbers are used by network-based
intrusion detection systems (NIDSs) to detect attacks coming from networks. System calls
and the operating system related information are used by host-based intrusion detection
systems (HIDSs) to detect intrusions towards a host. However, the relationship between
hardware architecture events and denial-of-service (DoS) attacks has not been well
revealed. When increasingly sophisticated intrusions emerge, some attacks are able to
bypass both the application and the operating system level feature monitors. Therefore, a
more effective solution is required to enhance existing HIDSs. In this work, we identify
the following hardware architecture features: instruction count, cache misses and bus traffic,
and integrate them into a novel HIDS framework based on a modern statistical gradient
boosting trees model. Through the integration of application, operating system and
architecture level features, our proposed HIDS demonstrates a significant improvement of
the detection rate in terms of sophisticated DoS intrusions.

K. A. Bowman et al.,[18] Microprocessor clock frequency (FCLK) is traditionally
determined based on maximum supply voltage (Vcc) droop and temperature
specifications. Since typical usage patterns usually run at nominal Vcc and temperature,
these infrequent dynamic variations severely limit FCLK. This work applies the concept of
timing-error detection and correction to explore the effectiveness of resilient circuits in
eliminating Vcc and temperature FCLK guardbands as well as exploiting path-activation
probabilities to maximize throughput (TP).

N. R. Yang et al.,[19] The energy consumed in instruction fetching accounts for a
significant portion of total processor energy consumption. Energy consumption as well as
performance should be considered when designing high performance embedded
processors. In this work, we present a hardware-based loop detection technique to reduce
the energy consumption in the instruction fetch unit (instruction cache and branch
prediction logic) for high performance embedded processors. The proposed instruction
fetch unit reduces the energy consumed in the instruction cache by replacing the accesses
to the large main instruction cache with those to the small selectively accessed cache
(SAC). It also reduces the energy consumed in the branch prediction logic by reducing
unnecessary accesses to the branch prediction logic. We evaluate the proposed design
using a simulation infrastructure based on SimpleScalar and CACTI. Simulation results
show that the proposed technique reduces the energy consumption in the instruction cache
and the branch prediction logic by 20% and 24% on average, respectively. Moreover,
the proposed scheme shows little performance loss compared to the traditional scheme.

S. Koohi et al.,[20] Real-time video transmission is considered an important means of
information distribution. One major application of it is e-learning, which requires
real-time video processing and transmission. On the other hand, the process of cut detection is
a fundamental component in automatic video browsing, indexing, searching, retrieval,
and archiving. This work introduces a new video cut detection technique that uses
dominant lines and angles extracted from edge information of the video contents. To the
best of our knowledge, it is the first work on cut detection in e-learning
applications. This method is compatible with our application requirements and has a low
complexity and high speed. We have compared the performance of our proposed method
against three established techniques and have evaluated the results using different video
sequences.

2.2 PROBLEM IDENTIFICATION

The literature survey identified the following problems in previous research work:

➢ Low accuracy of true data prediction on the given dataset.

➢ High classification error in the existing prediction models.

➢ Low precision, recall and F-measure scores.
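The problems listed above are stated in terms of standard classification metrics. As a minimal sketch of how these metrics relate to one another, the snippet below computes them from the counts of a binary confusion matrix. The function name and the counts are illustrative assumptions, not results from this work or from the surveyed papers.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from binary confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives
    and false negatives (hypothetical inputs for illustration).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    error = 1.0 - accuracy  # classification error
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "error": error,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up counts: 80 attacks caught, 10 false alarms,
# 95 normal records passed, 15 attacks missed.
metrics = classification_metrics(tp=80, fp=10, tn=95, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can look acceptable on imbalanced traffic data, which is why precision, recall and F-measure are reported alongside it.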

CHAPTER 3
TECHNICAL BACKGROUND

An intrusion detection system is a piece of software that detects unwanted manipulation
of a computer system. These manipulations are usually initiated through the internet and
may take the form of an attack by a cracker. These systems are used to detect various
types of malicious code and behaviors that can compromise the security and trust of a
computer system. These attacks can target services, data and applications, and include
attacks on user privileges, unauthorized logins, viruses and worms. An intrusion detection
system is composed of three different components. The first component is the sensors. Sensors are
used to generate security events which trigger the intrusion detection system. The second
component is a console. The console is used to monitor events and alerts and to control the
sensors. The third component of the intrusion detection system is an engine. The
engine records events found by the sensors in a database and uses a system of rules to
generate alerts from the security events received by the intrusion detection system [5].
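The three components described above can be sketched in a few lines of Python. This is only an illustrative toy, not the design of any particular product: the class names, the single "failed-login" rule and the sample events are all assumptions made for the example, and an in-memory list stands in for the event database.

```python
from dataclasses import dataclass


@dataclass
class SecurityEvent:
    source_ip: str
    description: str


class Sensor:
    """Generates security events from raw observations."""

    def observe(self, source_ip, description):
        return SecurityEvent(source_ip, description)


class Engine:
    """Records events and applies rules to generate alerts."""

    def __init__(self, rules):
        self.rules = rules    # list of (rule name, predicate) pairs
        self.event_log = []   # stands in for the event database

    def process(self, event):
        self.event_log.append(event)
        return [name for name, predicate in self.rules if predicate(event)]


class Console:
    """Collects alerts so an operator can monitor them."""

    def __init__(self):
        self.alerts = []

    def show(self, event, rule_names):
        for name in rule_names:
            self.alerts.append(
                f"ALERT [{name}] from {event.source_ip}: {event.description}"
            )


# Hypothetical rule: flag any event that mentions a failed login.
rules = [("failed-login", lambda e: "failed login" in e.description.lower())]
sensor, engine, console = Sensor(), Engine(rules), Console()

observations = [
    ("10.0.0.5", "Failed login for root"),
    ("10.0.0.7", "HTTP GET /index.html"),
]
for ip, text in observations:
    event = sensor.observe(ip, text)
    console.show(event, engine.process(event))

print(console.alerts)
```

Every event is logged by the engine, but only events matching a rule reach the console as alerts.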

There are several different types of intrusion detection systems. A network intrusion
detection system is an independent system which identifies intrusions by monitoring
network traffic. Network intrusion detection systems monitor all network traffic by connecting to
a hub or switch. A protocol-based intrusion detection system is an agent that sits at the front
end of a server and is used to monitor and analyze certain protocols between connected
devices. These are usually used with a web server to monitor the HTTPS protocol. An
application-based intrusion detection system is a system that monitors a certain
application within a group of servers on a network. Two or more intrusion detection
systems can also be combined to create a hybrid intrusion detection system for better protection [8].

When implementing an intrusion detection system, you will also need to decide which
kind to implement. There are two kinds of intrusion detection systems,
reactive and passive. A passive intrusion detection system detects that there is a potential
security breach and signals an alert on the console. A reactive intrusion detection system
provides a little extra protection compared to the passive system. When a reactive
intrusion detection system detects a potential security breach, it responds to the suspicious
activity and either changes a firewall setting to block the network traffic or resets the
connection to stop the activity. Intrusion detection systems can also be set up to monitor
traffic that is coming from inside the network.
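The difference between the two kinds can be illustrated with a small sketch. It assumes that both modes raise an alert and that the reactive mode additionally blocks the source; the function name, the mode strings and the set of blocked addresses (standing in for a firewall change or connection reset) are hypothetical.

```python
def handle_detection(mode, source_ip, blocked_ips, alerts):
    """Respond to a suspected breach in 'passive' or 'reactive' mode."""
    if mode not in ("passive", "reactive"):
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: both kinds of system signal an alert on the console.
    alerts.append(f"Potential breach from {source_ip}")
    if mode == "reactive":
        # Stand-in for changing a firewall setting or resetting the connection.
        blocked_ips.add(source_ip)


blocked, alerts = set(), []
handle_detection("passive", "192.0.2.10", blocked, alerts)
handle_detection("reactive", "192.0.2.20", blocked, alerts)

print(alerts)
print(blocked)
```

Only the address seen in reactive mode ends up blocked, while both detections are reported.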

Intrusion detection systems are an essential part of keeping your network secure. With the
help of an intrusion detection system you will have a better chance of keeping unauthorized
people from accessing your network, and you will be notified when they
attempt to access it. Many companies implement intrusion detection
systems. One popular intrusion detection system that is used in many large organizations
is Snort NIDS. Snort is an open source network intrusion prevention and detection
system that uses a rule-driven language, which combines the benefits of signature-, protocol-
and anomaly-based inspection [12].

3.1 IOT NEXT EVOLUTION

This research describes the methodology and the development process of creating an IoT
platform. This work also presents the architecture and implementation for the IoT
platform. The goal of this research is to develop an analytics engine which can gather
sensor data from different devices and provide the ability to gain meaningful information
from IoT data and act on it using machine learning algorithms [14].

Advantage

The proposed system introduces a messaging system to improve overall
system performance as well as provide easy scalability.

Disadvantage

Low-cost devices, from handhelds to coffee machines, are easily able to connect wirelessly
to the Internet, forming what is known as the Internet of Things (IoT).

3.2 THE INTERNET OF THINGS

The unique addressing of objects and the representation and storage of the exchanged
information become the most challenging issues, leading directly to a third, “Semantic
oriented”, perspective of IoT [15].

Advantage

People are informed of the scope and the way in which their movements are tracked by
the system (keeping people informed about possible leaks of their privacy is essential and
required by most legislation).

Disadvantage

The user can set the preferences of the proxy. When sensor networks and RFID systems
are included in the network, then the proxy operates between them and the services.

3.3 ADDRESSING THE CLASS IMBALANCE PROBLEM

A balanced dataset is very important for creating a good training set. Most learning
algorithms aim to optimize the overall accuracy without considering the relative distribution
of each class. Real-world data are usually imbalanced, and this is one of the main causes of
the decrease in generalization in machine learning algorithms [16].

Advantage

The aim was to reduce the ratio gap between the majority classes and the minority
class. The proposed method is found to be useful for such datasets where the class labels
are not certain and can also help to overcome the class imbalance problem of clinical
datasets and also for other data domains.

Disadvantage

The outcome labels of most of the clinical datasets are not consistent with the underlying
data. The conventional over-sampling and under-sampling techniques may not always be
appropriate for such datasets.
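For reference, the conventional random over-sampling mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the method of the cited work: the function name, the fixed seed and the toy samples are assumptions, and minority samples are simply duplicated rather than synthesized.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy, made-up data: four "normal" records and one "attack" record.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["normal", "normal", "normal", "normal", "attack"]

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

Because the duplicated records carry no new information, this kind of balancing can encourage overfitting, which is consistent with the caution expressed above.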

3.4 CLOUD COMPUTING SECURITY

This survey work provides a general overview on Cloud Computing. The topics that are
discussed include characteristics, deployment and service models as well as drawbacks [18].

Advantage

The major part of countermeasures focuses on Intrusion Detection Systems. Moving
towards Mobile Cloud Computing and Internet of Things, this survey work gives a
general explanation on the applications and potential that comes with the integration of
Cloud Computing with any device that has Internet connectivity as well as the challenges
that are before it.

Disadvantage

Several security issues and countermeasures are also discussed to show the major issues
and obstacles that Cloud Computing faces as it is being implemented further.

The convergence of computing and communication has produced a society that feeds on
information. Yet most of the information is in its raw form: data. If data is characterized
as recorded facts, then information is the set of patterns, or expectations, that underlie the
data. There is a huge amount of information locked up in databases—information that is
potentially important but has not yet been discovered or articulated. Our mission is to
bring it forth.

Advantage

In these cases the output took the form of decision trees and classification rules, which are
basic knowledge representation styles that many machine learning methods use.

Disadvantage

The weather problem is a tiny dataset that we will use repeatedly to illustrate
machine learning methods.

3.5 MACHINE LEARNING TECHNIQUES

Machine learning (ML) is the scientific study of algorithms and statistical models that computer
systems use to perform a specific task without using explicit instructions, relying
on patterns and inference instead. It is seen as a subset of artificial intelligence.
Machine learning algorithms build a mathematical model based on sample data,
known as "training data", in order to make predictions or decisions without being
explicitly programmed to perform the task.[1] Machine learning algorithms are used in a
wide variety of applications, such as email filtering and computer vision, where it is
difficult or infeasible to develop a conventional algorithm for effectively performing
the task.

Machine learning is closely related to computational statistics, which focuses on
making predictions using computers. The study of mathematical optimization delivers
methods, theory and application domains to the field of machine learning.
Data mining is a field of study within machine learning, and focuses on
exploratory data analysis through unsupervised learning [2]. In its application
across business problems, machine learning is also referred to as predictive analytics.

Classification of Algorithms

There are many types of classification algorithms in machine learning, such as:

1. Decision trees
2. Naive Bayes
3. Logistic regression
4. k-nearest neighbor
5. Support vector machines
6. Random forest

In machine learning, classification is a supervised learning approach which can be
thought of as a means of arranging or classifying some unknown items into a discrete set
of classes.

Classification attempts to learn the relationship between a set of feature
variables and a target variable of interest. The target attribute in classification is a
categorical variable with discrete values. Given a set of training data points
along with the target labels, classification determines the class label for an unlabeled
test case.

There are numerous types of classification algorithms in machine learning, such as
decision trees, naive Bayes, linear discriminant analysis, k-nearest neighbor, logistic
regression, neural networks, and support vector machines.

3.5.1 Decision Tree

Decision trees are built using recursive partitioning to classify the data, i.e., by splitting
the training set into distinct nodes, where one node contains all of or most of one category
of the data. A decision tree can be constructed by considering the attributes one by one:

Figure 3.1: Decision tree


• First, choose an attribute from the dataset.
• Calculate the significance of the attribute in splitting the data.
• Next, split the data based on the value of the best attribute.
• Then go to each branch and repeat the process for the rest of the attributes.
• After building the tree, you can use it to predict the class of unknown cases.

Decision trees are about testing an attribute and branching the cases based on the result of
the test:

1. Each internal node corresponds to a test
2. Each branch corresponds to a result of the test
3. Each leaf node assigns a case to a class

3.5.1.1 Iterative Dichotomiser 3 (ID3)


In decision tree learning, ID3 is an algorithm invented by Ross Quinlan used to generate a
decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically
used in the machine learning and natural language processing domains.

The ID3 algorithm begins with the original set as the root node. On each iteration of the
algorithm, it iterates through every unused attribute of the set and calculates
the entropy or the information gain of that attribute. It then selects the attribute which has
the smallest entropy (or largest information gain) value.
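The attribute selection step can be made concrete with a short sketch. It computes the entropy of a label set and the information gain of each attribute, then picks the attribute with the largest gain. The attribute names and the four toy records are made-up illustrations, not data from this work.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting
    the rows on the given attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy records with two hypothetical attributes.
rows = [
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "udp", "flag": "S0"},
]
labels = ["normal", "attack", "normal", "attack"]

gains = {a: information_gain(rows, labels, a) for a in ("protocol", "flag")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data the "flag" attribute separates the two classes perfectly, so it has the largest information gain and would be chosen for the split.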

Figure 3.2: ID3


The set is then split or partitioned by the selected attribute to produce subsets of the data.
(For example, a node can be split into child nodes based upon the subsets of the
population whose ages are less than 50, between 50 and 100, and greater than 100.) The
algorithm continues to recurse on each subset, considering only attributes never selected
before.
Recursion on a subset may stop in one of these cases:
 Every element in the subset belongs to the same class; in which case the node is
turned into a leaf node and labelled with the class of the examples.
 There are no more attributes to be selected, but the examples still do not belong to the
same class. In this case, the node is made a leaf node and labelled with the most
common class of the examples in the subset.
 There are no examples in the subset, which happens when no example in the parent
set was found to match a specific value of the selected attribute. An example could be
the absence of a person among the population with age over 100 years. Then a leaf
node is created and labelled with the most common class of the examples in the parent
node's set.

Throughout the algorithm, the decision tree is constructed with each non-terminal
node (internal node) representing the selected attribute on which the data was split,
and terminal nodes (leaf nodes) representing the class label of the final subset of this
branch.
3.5.2 Naïve Bayes

The Naïve Bayes classifier is a supervised algorithm that classifies the dataset on the
basis of Bayes' theorem, a mathematical rule for computing conditional probabilities.
The classifier rests on one basic assumption: the features (independent variables) are
independent of one another given the class.

Mathematical representation of Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

Here,

P(A): independent probability of A (prior probability)
P(B): independent probability of B
P(B|A): conditional probability of B given A (likelihood)
P(A|B): conditional probability of A given B (posterior probability)
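
A small worked example may make the terms concrete. The probabilities below are hypothetical values chosen only for illustration, with A standing for "the connection is an attack" and B for "the connection carries the flag S0"; they are not estimates from the KDD data.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b

# Hypothetical numbers, for illustration only
p_attack = 0.2             # P(A), prior probability of an attack
p_s0_given_attack = 0.6    # P(B|A), likelihood of flag S0 for attacks
p_s0_given_normal = 0.05   # P(B|not A), likelihood of flag S0 for normal traffic

# P(B) by the law of total probability
p_s0 = p_s0_given_attack * p_attack + p_s0_given_normal * (1 - p_attack)

p_attack_given_s0 = posterior(p_attack, p_s0_given_attack, p_s0)
print(round(p_attack_given_s0, 2))   # prints 0.75
```

Observing the flag raises the probability of an attack from the prior of 0.2 to a posterior of 0.75 under these assumed values.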
Naïve Bayes is a simple yet powerful algorithm for predictive modeling. It is an
effective and efficient classification algorithm that can handle large, complex,
non-linear data. The name has two parts, naïve and Bayes: the classifier is called naïve
because it assumes that the presence of a particular feature in a class is unrelated to
the presence of any other feature.
3.5.3 Logistic Regression

Logistic regression is a classification algorithm for categorical target variables. It
is analogous to linear regression, but it tries to predict a categorical or discrete
target field, such as 0 or 1 or yes or no, instead of a numeric one. The independent
variables should be continuous. If they are categorical, they should be dummy or
indicator coded, which means transforming them into numeric values. Logistic regression
can be used for both binary and multi-class classification. The sigmoid function is a
central part of logistic regression [20].

Logistic regression is suitable in the following situations:

First, when the target field in the data is categorical, and specifically binary, such
as zero/one, yes/no, true/false, churn/no churn or positive/negative.

Second, when the probability of the prediction is needed. Logistic regression returns a
probability score between zero and one for a given sample of data.

Third, when the data is linearly separable. The decision boundary of logistic regression
is a line, a plane or a hyperplane. The classifier assigns all the points on one side of
the decision boundary to one class and all those on the other side to the other class.

Fourth, when the impact of a feature has to be understood. The best features can be
selected based on the statistical significance of the logistic regression model
coefficients or parameters.

Logistic Regression vs. Linear Regression

While linear regression is suited to estimating continuous values (for example,
estimating house prices), it is not the best tool for predicting the class of an observed
data point. To estimate the class of a data point, we need some indication of the most
probable class for that point. For this, we use logistic regression.

Figure 3.3: Regression model

The difference between simple and multiple linear regression is that simple linear
regression contains a single independent variable while multiple regression contains
more than one. The best-fit line in linear regression is obtained through the least
squares method.

Linear regression finds a function that relates a continuous dependent variable, y, to
some predictors.

Logistic regression is a variation of linear regression, useful when the observed
dependent variable, y, is categorical. It produces an equation that predicts the
probability of the class label as a function of the independent variables. Logistic
regression fits a special S-shaped curve by taking the linear regression output and
transforming the numeric estimate into a probability with the sigmoid function,
σ(z) = 1/(1 + e^(-z)), where z is the linear combination of the inputs. In other words,
logistic regression passes the input through the logistic (sigmoid) function and then
treats the result as a probability.
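
The sigmoid transformation can be sketched in a few lines. The weights, bias and sample below are hypothetical, and the 0.5 decision threshold is an assumption rather than a value stated in this work.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of class 1 for one sample under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and sample, for illustration only
p = predict_proba(weights=[0.8, -0.4], bias=0.1, features=[2.0, 1.0])
label = 1 if p >= 0.5 else 0   # assumed threshold of 0.5
```

Here the linear part evaluates to z = 1.3, which the sigmoid turns into a probability of roughly 0.79, so the sample is assigned to class 1.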

The main differences between the two models are:

 Linear regression models data with continuous numeric values, whereas logistic
regression models data with binary values.
 Linear regression requires a linear relationship between the dependent and
independent variables, whereas logistic regression does not.
 Linear regression can tolerate some correlation among the independent variables,
whereas logistic regression assumes little or no correlation among them.
 Linear regression fits a straight line in which a random variable Y (the response
variable) is modeled as a linear function of another random variable X (the
predictor variable). Logistic regression instead models the probability of a
binary event through a linear function of a set of independent variables.

Logistic Regression Training

The main objective of training in logistic regression is to adjust the parameters of the
model so that it gives the best estimate of the labels of the samples in the dataset.

1. Initialize the parameters randomly
2. Feed the cost function with the training set and calculate the error
3. Calculate the gradient of the cost function
4. Update the weights with the new values
5. Go to step 2 until the cost is small enough
6. Predict the class of a new case X

 Cost function (mean squared error, MSE): J(θ) = (1/m) Σ (ŷ_i - y_i)^2, where ŷ_i is
the predicted probability and y_i the actual label of sample i

Minimizing the cost function with the gradient descent method gives the best parameters
for the model.

Gradient Descent
Gradient descent is a technique that uses the derivative of the cost function to change
the parameter values in order to minimize the cost or error.
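
The training loop can be sketched as follows. Two choices here are assumptions, not the dissertation's method: the sketch minimizes the cross-entropy (log loss) cost, which is the more common choice for logistic regression than MSE, and it starts from zero weights instead of random ones so that the run is reproducible. The data, learning rate and epoch count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean cross-entropy cost (assumed cost function)."""
    n_features = len(X[0])
    w = [0.0] * n_features   # zero start instead of random, for reproducibility
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # prediction error of the current parameters on one sample
            error = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(n_features):
                grad_w[j] += error * xi[j] / m
            grad_b += error / m
        # move the parameters against the gradient
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0

# Hypothetical one-feature data: small values are class 0, large values class 1
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

A fixed number of epochs replaces the "until the cost is small enough" test of step 5 to keep the sketch short.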

3.5.4 K-Nearest Neighbors Algorithm (KNN)


K-Nearest Neighbors is a supervised learning algorithm for classification that takes a
set of labeled points and uses them to learn how to label other points. The algorithm
classifies cases based on their similarity to other cases. In K-Nearest Neighbors, data
points that are close to one another are said to be neighbors, and the algorithm rests on
the premise that similar cases lie near each other. The distance between two cases is
thus a measure of their dissimilarity. There are various ways to compute the similarity
or, conversely, the distance or dissimilarity of two data points. For instance, this can
be done using the Euclidean distance.

In a classification problem, the K-Nearest Neighbors algorithm works as follows:

1. Pick a value for K
2. Calculate the distance from the new (held-out) case to each of the cases in the
dataset
3. Search for the K observations in the training data that are nearest to the
measurements of the unknown data point
4. Predict the response of the unknown data point using the most popular response
value from the K nearest neighbors

Pick a value for K –

 A low value of K results in a highly complex model and may cause overfitting.
 A high value of K, such as K = 20, makes the model overly generalized.
 The solution is to reserve a part of the data for testing the accuracy of the
model. Then choose K = 1, use the training part for modeling, and compute the
accuracy of prediction using all samples in the test set. Repeat this process with
increasing K and see which K is best for the model.

Calculate similarities between two data points –

 Use a specific type of Minkowski distance, the Euclidean distance, to calculate
the distance between the two data points.
 Euclidean distance: d(x, y) = sqrt( Σ (x_i - y_i)^2 )

Nearest neighbors analysis can also be used to compute values for a continuous target. In
this situation, the average or median target value of the nearest neighbors is used to obtain
the predicted value for the new case.
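
The classification procedure can be sketched as below, assuming Euclidean distance and a simple majority vote. The training points, labels and the choice K = 3 are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Label the query point by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training points
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = ["normal", "normal", "normal", "anomaly", "anomaly", "anomaly"]

print(knn_predict(train_X, train_y, [1.1, 1.0]))   # prints normal
print(knn_predict(train_X, train_y, [4.9, 5.0]))   # prints anomaly
```

For a continuous target, the vote would be replaced by the mean or median of the neighbors' target values.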

Evaluation Metrics in Classification

Evaluation metrics explain the performance of a model.

Jaccard Index (or Jaccard similarity coefficient) is defined as size of the intersection
divided by the size of the union of two label sets (picture a Venn diagram).

For example, if y is the set of actual labels (10 data points) and ŷ is the set of
predicted labels (10 data points), and 8 data points are predicted correctly by the
model, then J(y, ŷ) = 8/(10 + 10 - 8) = 8/12 ≈ 0.67.

F1-Score (Confusion Matrix) - best at 1

The confusion matrix shows the correct and wrong predictions in comparison with the
actual labels. Each row of the confusion matrix shows the actual (true) labels in the
test set, and the columns show the labels predicted by the classifier. A useful property
of the confusion matrix is that it shows the model's ability to correctly predict or
separate the classes.

Figure 3.4: Prediction class metrics

 Precision is a measure of the accuracy, provided that a class label has been
predicted. It is defined by:
Precision = True Positive/(True Positive + False Positive)
 Recall is the true positive rate:
Recall = True Positive/(True Positive + False Negative)
 F1-Score is the harmonic mean of the precision and recall, where an F1 score
reaches its best value at 1 (which represents perfect precision and recall) and its
worst at 0:
F1-Score = 2 x (precision x recall)/(precision + recall)

Logarithmic Loss (Log Loss) - best at 0

Logarithmic loss (also known as log loss) measures the performance of a classifier whose
predicted output is a probability value between 0 and 1. Classifiers with lower log loss
have better accuracy.
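
A minimal sketch of the measure follows, assuming the usual binary form LogLoss = -(1/n) Σ [y log(ŷ) + (1 - y) log(1 - ŷ)]. The clipping constant and the two sets of predicted probabilities are hypothetical.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # hypothetical predictions close to the true labels
hesitant = [0.6, 0.4, 0.5, 0.5]    # hypothetical predictions close to 0.5

loss_confident = log_loss(y_true, confident)
loss_hesitant = log_loss(y_true, hesitant)
```

Both sets of predictions fall on the correct side of 0.5 or on it, yet the first set has the lower log loss because its probabilities are closer to the true labels.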

3.5.5 Support Vector Machine

Support Vector Machine (SVM) is a supervised algorithm that classifies cases by dividing
a dataset into two or more classes using a separator. SVM works by first mapping the data
to a high-dimensional feature space so that the data points can be categorized
(kernelling), even when the data are not otherwise linearly separable.

A separator between the categories is then found, and the data are transformed in such
a way that the separator can be drawn as a hyperplane.

Following this, the characteristics of new data can be used to predict the group to
which a new record should belong.

The kernel function is the mathematical function used for mapping data into a
higher-dimensional space in such a way that a linearly inseparable dataset becomes
linearly separable. Kernels can be of different types, such as linear, polynomial,
radial basis function (RBF) and sigmoid.

1. Take a 1D linearly inseparable dataset (x)
2. Define a function to map it to 2D, ϕ(x) = [x, x^2]
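
The two steps can be illustrated with a short sketch. The data points and the separating threshold are hypothetical; a real SVM would learn the separator instead of having it fixed by hand.

```python
def phi(x):
    """Map a 1D point to 2D: phi(x) = [x, x^2]."""
    return [x, x * x]

# Hypothetical 1D data: class 1 sits between two groups of class 0,
# so no single threshold on x separates the classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]

mapped = [phi(x) for x in xs]

# In the mapped space the horizontal line x^2 = 2.5 separates the classes
# (hand-picked threshold, for illustration only).
threshold = 2.5
predicted = [0 if point[1] > threshold else 1 for point in mapped]
```

After the mapping, a straight line in the 2D space separates points that could not be separated by any single cut on the original axis.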

Figure 3.5: Support vector machine

One reasonable choice for the best hyperplane is the one that represents the largest
separation, or margin, between the two classes. The data points closest to the hyperplane
are the support vectors. Intuitively, only the support vectors matter for achieving this
goal, so the other training examples can be ignored. We try to find the hyperplane that
has the maximum distance to the support vectors. The hyperplane is learned from the
training data using an optimization procedure that maximizes the margin. Like many other
problems, this optimization problem can also be solved by gradient descent, which is
outside the scope of this dissertation.

The two main advantages of support vector machines are that they are:

 accurate in high-dimensional spaces
 memory efficient, because they use only a subset of the training points, called
support vectors, in the decision function

The disadvantages of support vector machines are that they:

 are prone to overfitting if the number of features is much greater than the number
of samples
 do not directly provide probability estimates, which are desirable in most
classification problems
 are not very efficient computationally if the dataset is very big, such as when it
has more than 1,000 rows

SVM applications:

 Image recognition
 Text category assignment
 Detecting spam
 Sentiment analysis
 Gene expression classifications
 Regression, outlier detection and clustering

3.5.6 Random Forest

Random forest is a flexible, easy-to-use machine learning algorithm that produces good
results most of the time, even without hyper-parameter tuning. It is also one of the most
widely used algorithms because of its simplicity and because it can be used for both
classification and regression tasks. This section describes how the random forest
algorithm works.

One major advantage of random forest is that it can be used for both classification and
regression problems, which make up the majority of current machine learning systems. The
discussion here focuses on random forest in classification, since classification is
sometimes considered the building block of machine learning [22].

Figure 3.6: Random forest

In a random forest, only a random subset of the features is taken into consideration by
the algorithm when splitting a node. The trees can be made even more random by
additionally using random thresholds for each feature instead of searching for the best
possible thresholds.

CHAPTER 4
PROPOSED METHODOLOGY


4.1 PROPOSED WORK


The main contributions of the proposed research work are as follows:
 To collect a network intrusion detection dataset, i.e. KDD, from the Kaggle machine
learning repository.
 To implement the proposed approach based on a decision tree with the C4.5 algorithm.
 To simulate the proposed method in Python 3.7 using the Spyder IDE.
 To compute performance parameters such as precision, recall, F-measure and accuracy.
 To generate result graphs and compare them with previous work.

Figure 4.1: Flow Chart

Steps-

 First, download the KDD dataset from the Kaggle website, a large provider of
datasets for research.
 Next, preprocess the data by handling missing values: remove the null values or
replace them with a common value such as 1 or 0.
 Then apply the classification method based on the machine learning approach. Here
the Decision Tree (DT) with the C4.5 algorithm is applied.
 Finally, calculate the performance parameters in terms of precision, recall,
F-measure, accuracy and error rate.

4.2 METHODOLOGY

The methodology of the proposed research is based on the following sub modules-

 Data Selection and Loading


 Data Preprocessing
 Splitting Dataset into Train and Test Data
 Feature Extraction
 Classification
 Prediction
 Result Generation

4.2.1 Data Selection and Loading

 Data selection is the process of selecting the data for detecting the attacks.
 In this project, the KDD dataset is used for detecting attacks.
 The dataset contains information about the duration, flag, service, src
bytes, dest bytes and class labels.
4.2.2 Data Pre-processing

 Data pre-processing is the process of removing unwanted data from the dataset. It
consists of two steps: missing data removal and encoding of categorical data.

 Missing data removal: In this step, null (missing) values are handled, i.e. removed
or imputed, using the imputer library.

 Encoding categorical data: Categorical data are variables with a finite set of
label values. Most machine learning algorithms require numerical input and output
variables, so integer encoding and one-hot encoding are used to convert categorical
data to numeric data.
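
The two encodings can be sketched in plain Python. The helper names are hypothetical and this is not the dissertation's code; in practice, scikit-learn's LabelEncoder and OneHotEncoder provide equivalent functionality.

```python
def integer_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    categories = sorted(set(values))
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Turn each category into a 0/1 vector with a single 1."""
    codes, mapping = integer_encode(values)
    width = len(mapping)
    return [[1 if i == code else 0 for i in range(width)] for code in codes]

# protocol_type takes the values tcp, udp and icmp in the KDD data
protocols = ["tcp", "udp", "icmp", "tcp"]
codes, mapping = integer_encode(protocols)   # codes == [1, 2, 0, 1]
one_hot = one_hot_encode(protocols)          # first row == [0, 1, 0]
```

Integer encoding imposes an artificial order on the categories, which is why one-hot encoding is often preferred for nominal features such as the protocol type.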

4.2.3 Splitting Dataset into Train and Test Data

 Data splitting is the act of partitioning the available data into two portions,
usually for cross-validation purposes.
 One Portion of the data is used to develop a predictive model and the other to
evaluate the model's performance.
 Separating data into training and testing sets is an important part of evaluating
data mining models.
 Typically, when you separate a data set into a training set and testing set, most of
the data is used for training, and a smaller portion of the data is used for testing.
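
A minimal sketch of such a split is given below. The 25% test fraction, the fixed seed and the function name are assumptions. The 999 placeholder records stand in for the rows of the dataset, and the resulting sizes correspond to the 749 training and 250 test records reported in Chapter 5.

```python
import random

def split_records(records, test_fraction=0.25, seed=42):
    """Shuffle the records and split them into training and test portions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 999 placeholder records standing in for the rows of the dataset
records = list(range(999))
train, test = split_records(records)
print(len(train), len(test))   # prints 749 250
```

Shuffling before the split prevents any ordering in the original file from leaking into one of the two portions.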

4.2.4 Feature Extraction


Feature scaling is a method used to standardize the range of the independent variables
or features of the data. In data processing it is also known as data normalization, and
it is generally performed during the data pre-processing step. It brings the data within
a particular range and can also speed up the calculations in an algorithm.
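
Two common scaling schemes can be sketched as follows. Which scheme the implementation used is not stated, so both are shown as assumptions; the sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean and divide by the standard deviation.
    Assumes the values are not all identical (non-zero standard deviation)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Rescale the values linearly to the range [0, 1].
    Assumes the minimum and maximum differ."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical src_bytes values
src_bytes = [491.0, 146.0, 0.0]
z_scores = standardize(src_bytes)
scaled = min_max_scale(src_bytes)
```

Decision trees are largely insensitive to feature scaling, whereas distance-based methods such as K-Nearest Neighbors depend on it strongly.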

4.2.5 Classification
The C4.5 algorithm is used in data mining as a decision tree classifier, which can be
employed to generate a decision based on a sample of data. A decision tree is a
classification tool in machine learning that uses a tree structure in which internal
nodes represent tests and leaves represent decisions. C4.5 makes use of
information-theoretic concepts such as entropy to classify the data. For each dataset
there should be two files: one that describes the classes and attributes, and one that
contains the actual data. The file for attributes and classes should list all the
classes on the first line and then, line by line, the attributes and their possible
values if the attribute is discrete. For continuous (numerical) attributes, the possible
value is given as "continuous". The iris dataset folder of the implementation shows
actual data and the exact syntax.

Algorithm

C4.5 is a computer program for inducing classification rules in the form of decision
trees from a set of given instances. C4.5 is a software extension of the basic ID3
algorithm designed by Quinlan. The overall procedure is:
 Select one attribute from a set of training instances
 Select an initial subset of the training instances
 Use the attribute and the subset of instances to build a decision tree
 Use the rest of the training instances (those not in the subset used for construction)
to test the accuracy of the constructed tree
 If all instances are correctly classified, stop
 If an instance is incorrectly classified, add it to the initial subset and construct a
new tree
 Iterate until either a tree is built that classifies all instances correctly or a
tree is built from the entire training set

The tree itself is built as follows:
 Let T be the set of training instances
 Choose the attribute that best differentiates the instances contained in T (C4.5 uses
the gain ratio to determine this)
 Create a tree node whose value is the chosen attribute
 Create child links from this node where each link represents a unique value for the
chosen attribute
 Use the child link values to further subdivide the instances into subclasses
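
The gain ratio criterion can be sketched as information gain divided by split information, following the usual C4.5 definition. The function names and the toy data are hypothetical, and the sketch handles discrete attributes only.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def gain_ratio(values, labels):
    """C4.5 split criterion: information gain divided by split information."""
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(v, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)   # entropy of the attribute's own value distribution
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: 'record_id' is unique per row, 'flag' has two values
labels = ["normal", "anomaly", "normal", "anomaly"]
flag = ["SF", "S0", "SF", "S0"]
record_id = ["r1", "r2", "r3", "r4"]

ratio_flag = gain_ratio(flag, labels)     # 1.0
ratio_id = gain_ratio(record_id, labels)  # 0.5
```

Both attributes have the same information gain here, but the gain ratio penalizes the many-valued record identifier, which is the bias of ID3 that C4.5 was designed to correct.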

4.2.6 Prediction
 Prediction is the process of accurately predicting the attacks in the network from
the dataset.
 The proposed approach predicts attacks from the dataset with the aim of improving
the overall prediction performance.

4.2.7 Result Generation

The final result is generated from the overall classification and prediction. The
performance of the proposed approach is evaluated using the following measures:
 True Positive
 True Negative
 False Positive
 False Negative
 Accuracy
 Precision
 Recall
 F1-Score

CHAPTER 5
IMPLEMENTATION AND
RESULT DISCUSSION


The proposed algorithm is implemented in Python 3.7 using the Spyder IDE. The sklearn,
numpy, pandas, matplotlib (pyplot), seaborn and os libraries provide the functions used
in the Spyder environment for methods such as decision tree, random forest and naive
Bayes.

5.1 SOFTWARE USED

Python - It is an interpreted, high-level, general-purpose programming language. Created
by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes
code readability with its notable use of significant whitespace. Its language constructs
and object-oriented approach aim to help programmers write clear, logical code for small
and large-scale projects.

Python is dynamically typed and garbage-collected. It supports multiple programming
paradigms, including procedural, object-oriented and functional programming. Python is
often described as a "batteries included" language because of its comprehensive standard
library.

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0,
released in 2000, introduced features such as list comprehensions and a garbage
collection system capable of collecting reference cycles. Python 3.0, released in 2008,
was a major revision of the language that is not completely backward-compatible, and
much Python 2 code does not run unmodified on Python 3.

The Python 2 language, i.e. Python 2.7.x, was officially discontinued on 1 January 2020
(first planned for 2015), after which security patches and other improvements are no
longer released for it. With Python 2's end-of-life, only Python 3.5.x and later are
supported.

Python interpreters are available for many operating systems. A global community of
programmers develops and maintains CPython, an open source reference implementation. A
non-profit organization, the Python Software Foundation, manages and directs resources
for Python and CPython development.

Spyder - It is an open source cross-platform integrated development environment (IDE)
for scientific programming in the Python language. Spyder integrates with a number of
prominent packages in the scientific Python stack, including NumPy, SciPy, Matplotlib,
pandas, IPython, SymPy and Cython, as well as other open source software. It is released
under the MIT license.

Initially created and developed by Pierre Raybaut in 2009, since 2012 Spyder has been
maintained and continuously improved by a team of scientific Python developers and the
community.

Figure 5.1: Snap shot of Spyder environment

Spyder is extensible with first- and third-party plugins, includes support for
interactive tools for data inspection and embeds Python-specific code quality assurance
and introspection instruments, such as Pyflakes, Pylint and Rope. It is available
cross-platform through Anaconda, on Windows, on macOS through MacPorts, and on major
Linux distributions such as Arch Linux, Debian, Fedora, Gentoo Linux, openSUSE and
Ubuntu.

Spyder uses Qt for its GUI and is designed to use either of the PyQt or PySide Python
bindings. QtPy, a thin abstraction layer developed by the Spyder project and later
adopted by multiple other packages, provides the flexibility to use either backend.

5.2 RESULTS DISCUSSION

Figure 5.2: Dataset


Figure 5.2 shows the KDD dataset. The dataset contains a total of 999 records with 42
columns (41 features and the class label):
'duration' real 'protocol_type' {'tcp', 'udp', 'icmp'} 'service' 'flag' 'src_bytes' real
'dst_bytes' real 'land' {'0', '1'} 'wrong_fragment' real 'urgent' real 'hot' real
'num_failed_logins' real 'logged_in' {'0', '1'} 'num_compromised' real
'root_shell' real 'su_attempted' real 'num_root' real 'num_file_creations' real
'num_shells' real 'num_access_files' real 'num_outbound_cmds' real
'is_host_login' {'0', '1'} 'is_guest_login' {'0', '1'} 'count' real
'srv_count' real 'serror_rate' real 'srv_serror_rate' real 'rerror_rate' real
'srv_rerror_rate' real 'same_srv_rate' real 'diff_srv_rate' real
'srv_diff_host_rate' real 'dst_host_count' real 'dst_host_srv_count' real
'dst_host_same_srv_rate' real 'dst_host_diff_srv_rate' real
'dst_host_same_src_port_rate' real 'dst_host_srv_diff_host_rate' real
'dst_host_serror_rate' real 'dst_host_srv_serror_rate' real 'dst_host_rerror_rate' real
'dst_host_srv_rerror_rate' real 'class' {'normal', 'anomaly'}
Example (rows 0, 1 and 998; the first value in each row is the row index)-
0 0 tcp ftp_data SF 491 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0.0 0.0 0.0 0.0 1.0 0.0 0.0 150 25 0.17 0.03 0.17 0.0 0.0 0.0 0.05 0.0 0
1 0 udp other SF 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 1 0.0 0.0 0.0 0.0 0.08 0.15 0.0 255 1 0.0 0.6 0.88 0.0 0.0 0.0 0.0 0.0 0
998 0 tcp private S0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 279 16 1.0 1.0 0.0 0.0 0.06 0.05 0.0 255 5 0.02 0.07 0.0 0.0 1.0 1.0 0.0 0.0 1

Figure 5.3: Missing data removal

Figure 5.3 shows the missing data removal, or preprocessing, of the dataset. The most
elementary strategy is to remove all rows that contain missing values or, in extreme
cases, entire columns that contain missing values. The pandas library provides the
dropna() function, which can be used to drop either rows or columns with missing data.

Figure 5.4: Test data

Figure 5.4 shows the test data from the given dataset. The test dataset provides the
standard used to evaluate the model. It is used only once a model is completely trained
(using the train and validation sets). The test set is generally what is used to
evaluate competing models (for example, in many Kaggle competitions the validation set
is released initially along with the training set, the actual test set is released only
when the competition is about to close, and the result of the model on the test set
decides the winner). The validation set is often used as the test set, but this is not
good practice. The test set is generally well curated: it contains carefully sampled
data that spans the various classes that the model would face when used in the real
world. Here a total of 250 records are used for testing.

Figure 5.5 Train Data

Figure 5.5 shows the training dataset from the given dataset. A training dataset is a
dataset of examples used during the learning process to fit the parameters (e.g.,
weights) of, for example, a classifier. For classification tasks, a supervised learning
algorithm looks at the training dataset to determine, or learn, the optimal combinations
of variables that will generate a good predictive model. The goal is to produce a trained
(fitted) model that generalizes well to new, unknown data. The fitted model is evaluated
using "new" examples from the held-out datasets (validation and test datasets) to
estimate the model's accuracy in classifying new data. To reduce the risk of issues such
as overfitting, the examples in the validation and test datasets should not be used to
train the model. Most approaches that search through training data for empirical
relationships tend to overfit the data, meaning that they can identify and exploit
apparent relationships in the training data that do not hold in general. Here a total of
749 records are used for training.

Figure 5.6: Confusion Matrix
Figure 5.6 shows the confusion matrix. A confusion matrix is a table that is often
used to describe the performance of a classification model (or "classifier") on a set of
test data for which the true values are known. The confusion matrix itself is relatively
simple to understand, but the related terminology can be confusing.
TP: True Positive: positive values correctly predicted as positive
FP: False Positive: negative values incorrectly predicted as positive
FN: False Negative: positive values incorrectly predicted as negative
TN: True Negative: negative values correctly predicted as negative

The performance measures are computed from the confusion matrix.

The matrix shows the correct and wrong predictions in comparison with the actual
labels. Each row of the confusion matrix shows the actual (true) labels in the test set,
and the columns show the labels predicted by the classifier. A useful property of the
confusion matrix is that it shows the model's ability to correctly predict or separate
the classes.

Figure 5.7: Prediction class metrics

 Precision is a measure of the accuracy, provided that a class label has been
predicted. It is defined by:

Precision = True Positive/(True Positive + False Positive)

 Recall is the true positive rate:

Recall = True Positive/(True Positive + False Negative)

 F1-Score is the harmonic mean of the precision and recall, where an F1 score
reaches its best value at 1 (which represents perfect precision and recall) and its
worst at 0:
F1-Score = 2 x (precision x recall)/(precision + recall)

Accuracy: It is defined as the percentage of correct predictions for the test data. It is
calculated by dividing the number of correct predictions by the total number of
predictions.

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Error Rate: The inaccuracy of the predicted output values is termed the error of the
method. If the target values are categorical, the error is expressed as an error rate,
which is the proportion of cases where the prediction is wrong.
Error Rate (%) = 100 - Accuracy (%)

Result Generation-

Confusion matrix parameters:-

True positive (TP) = 97.6 %
False positive (FP) = 2.4 %
False negative (FN) = 4.8 %
True negative (TN) = 95.19 %

Performance parameters:-

Accuracy = 96.39 %
Error Rate = 3.59 %
Precision = 97.6 %
Recall = 95.3125 %
F1-Score = 96.44 %

Therefore, the accuracy of the decision tree with the C4.5 algorithm is approximately
96.4%.
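
The performance parameters can be recomputed from the four confusion-matrix values with a short sketch. The function name is hypothetical. The inputs are the percentages reported above, used directly as the four cell values, which is an assumption about how they were obtained.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, accuracy and error rate from the four cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
    }

# Confusion-matrix values as reported above (in percent)
metrics = classification_metrics(tp=97.6, fp=2.4, fn=4.8, tn=95.19)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f} %")
```

Under this assumption the recomputed precision, recall and F1-score match the reported figures, and the accuracy and error rate agree with them to within about 0.01 percentage points.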

Figure 5.8: ROC

Figure 5.8 shows the Receiver Operating Characteristic (ROC) curve, a plot of signal
(true positive rate) against noise (false positive rate). The model performance is
assessed by the area under the ROC curve (AUC). The ROC curve shows the true positive
rate against the false positive rate at various cut points. It also demonstrates the
trade-off between sensitivity (recall) and specificity (the true negative rate).
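
How the curve and its area are obtained can be sketched as follows. The labels and scores are hypothetical and are not the model outputs behind Figure 5.8; the sketch sweeps the decision threshold over the scores and integrates with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = attack) and classifier scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(y_true, scores))
```

An area of 1 corresponds to a classifier that ranks every attack above every normal record, and an area of 0.5 to random ranking.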
Table 5.1: Simulation Result of DT with C4.5

Sr. No.   Parameters   Proposed Method (%)
1         Precision    97.6
2         Recall       95.3
3         F-measure    96.4
4         Accuracy     96.3
5         Error Rate   3.5

[Bar chart: Proposed Method (%) for Precision, Recall, F-measure and Accuracy; vertical axis from 94 to 98]

Figure 5.9: Result Parameters

Table 5.2: Comparison of proposed work with previous work

Sr. No.   Parameters       Previous work [1]   Previous work [2]   Proposed Work
1         Methodology      LSTM                SMOTE               DT with C4.5
2         Precision (%)    63.76               85.82               97.6
3         Recall (%)       66.36               84.49               95.3
4         F-measure (%)    65.04               85.14               96.4
5         Accuracy (%)     91.63               83.58               96.3
6         Error Rate (%)   8                   16                  3.5

[Bar chart: Precision (%), Recall (%) and F-measure (%) for Previous work [1], Previous work [2] and Proposed Work; vertical axis from 0 to 100]

Figure 5.10: Graphical representation of result comparison


[Bar chart: Accuracy (%) for Previous work [1], Previous work [2] and Proposed Work; vertical axis from 76 to 98]

Figure 5.11: Accuracy comparison


Table 5.2 compares the results of the previous and proposed research works. The
precision of the existing methods is 63.76% and 85.82%, while the proposed work achieves
97.6%. The recall achieved by the proposed technique is 95.3%, against 66.36% and 84.49%
for the previous works. The F-measure is 96.4% and the error rate is 3.5% for the
proposed technique, while the previous works report F-measures of 65.04% and 85.14% and
error rates of 8% and 16%. Finally, the proposed methodology achieves an accuracy of
96.3%, compared with 91.63% and 83.58% for the previous works.
These performance parameters show that the proposed work achieves significantly better
results than the existing approaches.

CHAPTER 6

CONCLUSION AND FUTURE SCOPE

6.1 CONCLUSION

This dissertation examined influential algorithms for intrusion detection based on
various machine learning techniques. The characteristics of ML techniques make it
possible to design IDSs that have high detection rates and low false positive rates
while adapting quickly to changing malicious behaviors. We divided these algorithms into
two types of ML-based schemes: Artificial Intelligence (AI) and Computational
Intelligence (CI). Although these two categories of algorithms share many similarities,
several features of CI-based techniques, such as adaptation, fault tolerance, high
computational speed and error resilience in the face of noisy information, meet the
requirements for building efficient intrusion detection systems.

This dissertation presents the C4.5 decision tree algorithm for classification. The C4.5
algorithm is used in data mining as a decision tree classifier, which generates a
decision based on a given sample of data. The dataset used is the KDD dataset, obtained
from Kaggle.
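C4.5 chooses the attribute to split on by its gain ratio, that is, the information gain divided by the split information. The following Python is a minimal sketch of that criterion for a categorical attribute only. It is not the implementation used in this work and omits the handling of continuous attributes and the pruning that C4.5 also performs. The function names and the toy protocol values and class labels are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` on the categorical attribute `values`."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute values themselves.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: a hypothetical protocol attribute and class labels.
classes = ["attack", "attack", "normal", "normal"]
informative = gain_ratio(["tcp", "tcp", "udp", "udp"], classes)
uninformative = gain_ratio(["tcp", "udp", "tcp", "udp"], classes)
print(informative, uninformative)
```

In the toy data, the first attribute separates the two classes perfectly and receives the highest possible gain ratio, while the second carries no information about the class.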

The simulation is carried out in Python using Spyder 3.7. The simulation results show
that the proposed approach gives good results in terms of precision, recall, F-measure,
error rate and accuracy. The previous works achieved precision of 63.76% and 85.82%,
while the proposed work achieved 97.6%. The proposed technique achieved a recall of
95.3%, against 66.36% and 84.49% for the previous works. The F-measure of the proposed
technique is 96.4% and its error rate is 3.5%, whereas the previous works reported
F-measures of 65.04% and 85.14% and error rates of 8% and 16%. Finally, the proposed
methodology achieved an accuracy of 96.3%, compared with 91.63% and 83.58% for the
previous works.

These performance parameters show that the proposed work achieves significantly better
results than the existing works.

6.2 FUTURE WORK

• Intrusion prevention will continue to grow rapidly because of its capability to shut
off attacks, potentially preventing damage and disruption altogether.
• The active defense approach, which evaluates the condition of systems and networks
and responds appropriately to remedy whatever is wrong, is new but is already
rapidly gaining popularity.
• Advances in data correlation and alert fusion methods are also likely to occur.
Correlation and fusion methods will meet a larger number of requirements, and
user interfaces for access to correlated data are likely to improve substantially.
• Advances in determining the origin of network connections are also extremely
probable. Finally, it is reasonable to expect that improved forensics
functionality will be built into IDSs and IPSs in the future and that honeypots will
be used much more in connection with intrusion detection and intrusion
prevention.

REFERENCES

1. H. Hou et al., "Hierarchical Long Short-Term Memory Network for Cyberattack
Detection," in IEEE Access, vol. 8, pp. 90907-90913, 2020, doi:
10.1109/ACCESS.2020.2983953.
2. P. Feng, J. Ma, T. Li, X. Ma, N. Xi and D. Lu, "Android Malware Detection
Based on Call Graph via Graph Neural Network," 2020 International Conference
on Networking and Network Applications (NaNA), 2020, pp. 368-374, doi:
10.1109/NaNA51271.2020.00069.
3. S. Liu, M. Dibaei, Y. Tai, C. Chen, J. Zhang and Y. Xiang, "Cyber Vulnerability
Intelligence for Internet of Things Binary," in IEEE Transactions on Industrial
Informatics, vol. 16, no. 3, pp. 2154-2163, March 2020, doi:
10.1109/TII.2019.2942800.
4. Y. Jin, M. Tomoishi and N. Yamai, "Anomaly Detection by Monitoring
Unintended DNS Traffic on Wireless Network," 2019 IEEE Pacific Rim
Conference on Communications, Computers and Signal Processing (PACRIM),
2019, pp. 1-6, doi: 10.1109/PACRIM47961.2019.8985052.
5. B. Peng, Q. Wang, X. Li, J. Cai, J. Fei and W. Chen, "Research on Abnormal
Detection Technology of Real-Time Interaction Process in New Energy Network,"
2019 International Conference on Internet of Things (iThings) and IEEE Green
Computing and Communications (GreenCom) and IEEE Cyber, Physical and
Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2019, pp. 433-
440, doi: 10.1109/iThings/GreenCom/CPSCom/SmartData.2019.00092.
6. W. Bi, K. Zhang, Y. Li, K. Yuan and Y. Wang, "Detection Scheme Against
Cyber-Physical Attacks on Load Frequency Control Based on Dynamic
Characteristics Analysis," in IEEE Systems Journal, vol. 13, no. 3, pp. 2859-2868,
Sept. 2019, doi: 10.1109/JSYST.2019.2911869.
7. K. Liu, Z. Fan, M. Liu and S. Zhang, "Hybrid Intrusion Detection Method Based
on K-Means and CNN for Smart Home," 2018 IEEE 8th Annual International
Conference on CYBER Technology in Automation, Control, and Intelligent
Systems (CYBER), 2018, pp. 312-317, doi: 10.1109/CYBER.2018.8688271.
8. Y. Jin, K. Kakoi, N. Yamai, N. Kitagawa and M. Tomoishi, "A Client Based
Anomaly Traffic Detection and Blocking Mechanism by Monitoring DNS Name
Resolution with User Alerting Feature," 2018 International Conference on
Cyberworlds (CW), 2018, pp. 351-356, doi: 10.1109/CW.2018.00070.
9. R. Velea and Ş. Drăgan, "CPU/GPU Hybrid Detection for Malware Signatures,"
2017 International Conference on Computer and Applications (ICCA), 2017, pp.
85-89, doi: 10.1109/COMAPP.2017.8079736.
10. S. Merat and W. Almuhtadi, "Artificial intelligence application for improving
cyber-security acquirement," 2015 IEEE 28th Canadian Conference on Electrical
and Computer Engineering (CCECE), 2015, pp. 1445-1450, doi:
10.1109/CCECE.2015.7129493.
11. S. Han, M. Xie, H. Chen and Y. Ling, "Intrusion Detection in Cyber-Physical
Systems: Techniques and Challenges," in IEEE Systems Journal, vol. 8, no. 4, pp.
1052-1062, Dec. 2014, doi: 10.1109/JSYST.2013.2257594.
12. M. Bousaaid, T. Ayaou, K. Afdel and P. Estraillier, "Hand gesture detection and
recognition in cyber presence interactive system for E-learning," 2014
International Conference on Multimedia Computing and Systems (ICMCS), 2014,
pp. 444-447, doi: 10.1109/ICMCS.2014.6911197.
13. A. J. Smith, R. F. Mills, A. R. Bryant, G. L. Peterson and M. R. Grimaila,
"REDIR: Automated static detection of obfuscated anti-debugging techniques,"
2014 International Conference on Collaboration Technologies and Systems
(CTS), 2014, pp. 173-180, doi: 10.1109/CTS.2014.6867561.
14. M. Guri, G. Kedma, T. Sela, B. Carmeli, A. Rosner and Y. Elovici, "Noninvasive
detection of anti-forensic malware," 2013 8th International Conference on
Malicious and Unwanted Software: "The Americas" (MALWARE), 2013, pp. 1-
10, doi: 10.1109/MALWARE.2013.6703679.
15. M. Hirabayashi, S. Kato, M. Edahiro, K. Takeda, T. Kawano and S. Mita, "GPU
implementations of object detection using HOG features and deformable models,"
2013 IEEE 1st International Conference on Cyber-Physical Systems, Networks,
and Applications (CPSNA), 2013, pp. 106-111, doi:
10.1109/CPSNA.2013.6614255.
16. Lin Wang, Xiang Wang, Zichen Zhou, Qinghai Liu and Hao Yang,
"Architectural-enhanced intrusion detection and memory authentication schemes
in embedded systems," 2010 IEEE International Conference on Information
Theory and Information Security, 2010, pp. 221-224, doi:
10.1109/ICITIS.2010.5688775.

17. R. Tao, L. Yang, L. Peng, B. Li and A. Cemerlic, "A case study: Using
architectural features to improve sophisticated denial-of-service attack detections,"
2009 IEEE Symposium on Computational Intelligence in Cyber Security, 2009,
pp. 13-18, doi: 10.1109/CICYBS.2009.4925084.
18. K. A. Bowman et al., "Energy-Efficient and Metastability-Immune Timing-Error
Detection and Instruction-Replay-Based Recovery Circuits for Dynamic-Variation
Tolerance," 2008 IEEE International Solid-State Circuits Conference - Digest of
Technical Papers, 2008, pp. 402-623, doi: 10.1109/ISSCC.2008.4523227.
19. N. R. Yang, G. Yoon, J. Lee, I. Hwang, C. H. Kim and J. M. Kim, "Loop
Detection for Energy-Aware High Performance Embedded Processors," 2008
IEEE Asia-Pacific Services Computing Conference, 2008, pp. 1578-1583, doi:
10.1109/APSCC.2008.66.
20. S. Koohi, M. Babagoli, T. Lotfi and S. Kasaei, "Video cut detection in E-Learning
applications," 2007 9th International Symposium on Signal Processing and Its
Applications, 2007, pp. 1-4, doi: 10.1109/ISSPA.2007.4555325.
21. S. Yinbiao and K. Lee, “Internet of Things: Wireless Sensor Networks Executive
summary,” 2014.
22. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “A survey on sensor
networks,” IEEE Commun. Mag., vol. 40, no. 8, pp. 102–105, 2002.
23. X. Chen, K. Makki, K. Yen, and N. Pissinou, “Sensor network security: a survey,”
IEEE Commun. Surv. Tutorials, vol. 11, no. 2, pp. 52–73, 2009.
24. A.-S. K. Pathan, H.-W. Lee, and C. S. Hong, “Security in wireless sensor
networks: issues and challenges,” 2006 8th Int. Conf. Adv. Commun. Technol.,
vol. 2, pp. 1043–1048, 2006.
25. P. Yi, Y. Jiang, Y. Zhong, and S. Zhang, “Distributed Intrusion Detection for
Mobile Ad Hoc Networks,” 2005 Symp. Appl. Internet Work. (SAINT 2005
Work.), pp. 94–97, 2005.
26. H. Sedjelmaci and M. Feham, “Novel Hybrid Intrusion Detection System for
Clustered Wireless Sensor Network,” Int. J. Netw. Secur. Its Appl. (IJNSA),
vol. 3, no. 4, pp. 1–14, July 2011.
27. L. Khan, M. Awad, and B. Thuraisingham, “A new intrusion detection system
using support vector machines and hierarchical clustering,” VLDB J., vol. 16, no.
4, pp. 507–521, 2007.

LIST OF PUBLICATIONS

[1] Rishika, Yogendra Maravi, Nischol Mishra, Jitendra Agarwal, “Survey of IOT
Cyber Security in Network Intrusion Detection Systems,” Journal of Xi’an Shiyou
University, Natural Science Edition, ISSN: 1673-064X, Volume 18, Issue 7, July 2022,
pp. 335-343.
[2] Rishika, Yogendra Maravi, Nischol Mishra, Jitendra Agarwal, “A Machine Learning
Technique based Network Intrusion Detection System for Cyber Security
Applications,” communicated to an international journal.

PLAGIARISM REPORT

Rishika Thesis
ORIGINALITY REPORT

Similarity Index: 12%
Internet Sources: 8%
Publications: 9%
Student Papers: 7%

PRIMARY SOURCES

No.   Source                                                     Type              Match
1     Submitted to National Institute of Technology              Student Paper     5%
2     www.jetir.org                                              Internet Source   2%
3     ko.coursera.org                                            Internet Source   2%
4     ijircce.com                                                Internet Source   2%
5     github.com                                                 Internet Source   1%
6     www.coursehero.com                                         Internet Source   1%
7     Submitted to Glasgow Caledonian University                 Student Paper     1%
8     Submitted to University of Wales Institute, Cardiff        Student Paper     1%
9     repository.sharif.edu                                      Internet Source   1%
10    en.wikipedia.org                                           Internet Source   1%
11    towardsdatascience.com                                     Internet Source   1%
12    academic.odysci.com                                        Internet Source   <1%
13    gist.github.com                                            Internet Source   <1%
14    golden.com                                                 Internet Source   <1%
15    ijea.jctjournals.com                                       Internet Source   <1%
16    ijesc.org                                                  Internet Source   <1%
17    wikimili.com                                               Internet Source   <1%
18    www.guru99.com                                             Internet Source   <1%
19    Submitted to Coventry University                           Student Paper     <1%
20    www.researchgate.net                                       Internet Source   <1%
21    aethos.readthedocs.io                                      Internet Source   <1%
22    Submitted to Sri Lanka Institute of Information Technology Student Paper     <1%
23    Submitted to International College of Auckland             Student Paper     <1%
24    Submitted to Sogang University                             Student Paper     <1%
25    www.bensblog.tech                                          Internet Source   <1%
26    Submitted to Amity University                              Student Paper     <1%
27    tudr.thapar.edu:8080                                       Internet Source   <1%
28    ftp.tugraz.at                                              Internet Source   <1%
29    xuanchengpan.com                                           Internet Source   <1%
30    docplayer.net                                              Internet Source   <1%
31    Lecture Notes in Computer Science, 2016.                   Publication       <1%
32    arxiv.org                                                  Internet Source   <1%

Exclude quotes: On
Exclude matches: Off
Exclude bibliography: On
