
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
SAGARMATHA ENGINEERING COLLEGE

A
MAJOR PROJECT REPORT
ON
DDOS ATTACK DETECTION USING ENSEMBLE MACHINE
LEARNING

BY
AAKASH KHANAL (074BCT001)
ALISH KARKI (074BCT005)
AMAR DHAKAL (074BCT007)

A PROJECT REPORT SUBMITTED TO THE DEPARTMENT OF
ELECTRONICS AND COMPUTER ENGINEERING IN PARTIAL
FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
BACHELOR IN COMPUTER ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING


SANEPA, LALITPUR, NEPAL
MARCH 2022
COPYRIGHT

The author has agreed that the library, Department of Electronics and Computer
Engineering, Sagarmatha Engineering College, may make this project freely
available for inspection. Moreover, the author has agreed that permission for
extensive copying of this project work for scholarly purposes may be granted by the
professor(s) who supervised the project work recorded herein or, in their absence,
by the Head of the Department wherein this project was done. It is understood
that due recognition will be given to the author of this project and to the Department
of Electronics and Computer Engineering, Sagarmatha Engineering College,
in any use of the material of this project. Copying, publication, or other use of
this project for financial gain without the approval of the Department of Electronics
and Computer Engineering, Sagarmatha Engineering College, and the author's written
permission is prohibited.
Request for permission to copy or to make any use of the material in this project
in whole or part should be addressed to:

Head of Department
Department of Electronics and Computer Engineering
Sagarmatha Engineering College
Sanepa, Lalitpur, Nepal

DECLARATION

We declare that the work hereby submitted for BACHELOR IN COMPUTER
ENGINEERING (BCT) at Sagarmatha Engineering College entitled “DDOS
ATTACK DETECTION USING ENSEMBLE MACHINE LEARNING”
is our own work and has not been previously submitted by any of us at any
university for any academic award.
We authorize Sagarmatha Engineering College and IOE, Pulchowk Campus to lend
this project to other institutions or individuals for the purpose of scholarly research.

AAKASH KHANAL (074BCT001)


ALISH KARKI (074BCT005)
AMAR DHAKAL (074BCT007)
Date: March, 2022

RECOMMENDATION

The undersigned certify that they have read and recommended to the Department
of Electronics and Computer Engineering for acceptance a project work entitled
“DDOS ATTACK DETECTION USING ENSEMBLE MACHINE
LEARNING”, submitted by Aakash Khanal, Alish Karki and Amar Dhakal
in partial fulfillment of the requirements for the award of the degree “Bachelor of
Engineering in Computer Engineering”.

..........................................................................
External Examiner: XYZ
Reader
Department of Electronics & Computer Engineering
Institute of Engineering, Pulchowk Campus, Tribhuvan University
Lalitpur, Nepal

..........................................................................
Supervisor: Er. Saurav Raj Pant
Department of Electronics and Computer Engineering,
Sagarmatha Engineering College
Tribhuvan University Affiliate
Sanepa, Lalitpur

......................................................................................
Project Coordinator: Er. Bipin Thapa Magar
Department of Electronics and Computer Engineering,
Sagarmatha Engineering College
Tribhuvan University Affiliate
Sanepa, Lalitpur

DEPARTMENTAL ACCEPTANCE

The project work entitled “DDOS ATTACK DETECTION USING ENSEMBLE
MACHINE LEARNING”, submitted by Aakash Khanal, Alish Karki
and Amar Dhakal in partial fulfillment of the requirements for the award of the
degree of “Bachelor of Engineering in Computer Engineering” has been
accepted as a bonafide record of work independently carried out by the team in the
department.

..............................................
Senior Lecturer: Er. Bharat Bhatta
Head of Department
Department of Electronics and Computer Engineering,
Sagarmatha Engineering College,
Tribhuvan University Affiliate,
Sanepa, Lalitpur,
Nepal.

ACKNOWLEDGEMENT

We would like to express our sincere gratitude to our project coordinator Er.
Bipin Thapa Magar and mentor Er. Saurav Raj Pant for their encouragement,
suggestions and continuous guidance throughout the course of our project work.
We pay our sincere gratitude to our HOD Er. Bharat Bhatta and all the
teachers of the Department of Electronics and Computer Engineering of Sagarmatha
Engineering College for providing the opportunity of learning and implementing
our knowledge in the form of project work.
We would also like to express our heartfelt gratitude to the students, teachers and
staff of Sagarmatha Engineering College for their continuous, invaluable
support during the field visits and data collection; without this support we would
not have come so far in our research work.

Aakash Khanal (074BCT001)
Alish Karki (074BCT005)
Amar Dhakal (074BCT007)

ABSTRACT

This report focuses on the Distributed Denial of Service (DDoS) attack, a form of
intrusion that is one of the major threats to network services. In our
project, we intend to create a model that detects DDoS attacks based on
our dataset. We provide a robust mechanism against previously seen intrusions
by merging the outcomes of multiple classifiers and modeling the intrusion
mechanism appropriately. Using a modified NSL-KDD dataset, the classifier
models are trained and then ensembled in order to increase efficiency beyond
existing systems. Model training is performed on a reduced feature set, with
features chosen according to the redundant effect of each feature on the model.
Keywords: NSL-KDD Dataset, DDoS, Ensemble Machine Learning, Classifiers

TABLE OF CONTENTS

COPYRIGHT ii

DECLARATION iii

RECOMMENDATION iv

DEPARTMENTAL ACCEPTANCE v

ACKNOWLEDGEMENT vi

ABSTRACT vii

TABLE OF CONTENTS viii

LIST OF FIGURES xi

LIST OF TABLES xii

LIST OF ABBREVIATIONS xiii

1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Feasibility Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4.1 Technical Feasibility . . . . . . . . . . . . . . . . . . . . . . 2
1.4.2 Social Feasibility . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4.3 Economic Feasibility . . . . . . . . . . . . . . . . . . . . . 3
1.4.4 Operational Feasibility . . . . . . . . . . . . . . . . . . . . . 3
1.4.5 Time Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Requirement Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.5.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . 3
1.5.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . 3
1.5.3 User Requirements . . . . . . . . . . . . . . . . . . . . . . . 3
1.5.4 Functional Requirement . . . . . . . . . . . . . . . . . . . . 4
1.5.5 Non-Functional Requirement . . . . . . . . . . . . . . . . . . 4

2 LITERATURE REVIEW 6

3 RELATED THEORY 8
3.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . 10
3.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4 METHODOLOGY 12
4.1 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Use-Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 Software Development Approach . . . . . . . . . . . . . . . . . . . . 14
4.4 Dataset Used For Experiment . . . . . . . . . . . . . . . . . . . . 14
4.5 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.5.1 Features Selection . . . . . . . . . . . . . . . . . . . . . . . 17
4.5.2 Filling Missing Data . . . . . . . . . . . . . . . . . . . . . . 17
4.5.3 Label Encoding . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.5.4 Division of Dataset into Independent and Dependent Variables 17
4.5.5 Splitting DataSet into Train and Test Dataset . . . . . . . . 18
4.5.6 Features Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.6 Classifiers Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.6.1 Support Vector Machine . . . . . . . . . . . . . . . . . . . . 18
4.6.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.6.3 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.7 PERFORMANCE EVALUATION . . . . . . . . . . . . . . . . . . . 23

4.7.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . 23
4.7.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.7.3 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.7.4 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.8 Model Development Tools . . . . . . . . . . . . . . . . . . . . . . . 29
4.8.1 Streamlit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.8.2 Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . 29
4.8.3 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 RESULT AND ANALYSIS 32


5.1 Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6 EPILOGUE 34
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2 Limitation And Future Enhancement . . . . . . . . . . . . . . . . 34

REFERENCES 35

7 APPENDIX A 36
7.0.1 Project Schedule . . . . . . . . . . . . . . . . . . . . . . . . . 36

LIST OF FIGURES

Figure 3.1 Working of Supervised Learning . . . . . . . . . . . . . . . . 9


Figure 3.2 Working of Unsupervised Learning . . . . . . . . . . . . . . 10

Figure 4.1 Overall System Block Diagram . . . . . . . . . . . . . . . . . 12


Figure 4.2 Use-case Diagram of DDoS Prevention System . . . . . . . . 13
Figure 4.3 Incremental model . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 4.4 Dataset features . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 4.5 Sub-classes of attacks . . . . . . . . . . . . . . . . . . . . . . 16
Figure 4.6 SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 4.7 possible hyperplane . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 4.8 Hyperplanes in 2D and 3D feature space . . . . . . . . . . . 20
Figure 4.9 Support vectors . . . . . . . . . . . . . . . . . . . . . . . . . 20
Figure 4.10 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Figure 4.11 Example of decision tree . . . . . . . . . . . . . . . . . . . . 23
Figure 4.12 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 4.13 Confusion Matrix of random forest . . . . . . . . . . . . . . 25
Figure 4.14 Confusion Matrix of decision tree . . . . . . . . . . . . . . . 26
Figure 4.15 Confusion Matrix of SVM . . . . . . . . . . . . . . . . . . . 27
Figure 4.16 Streamlit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 4.17 Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 4.18 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Figure 5.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


Figure 5.2 Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Figure 7.1 Project Schedule . . . . . . . . . . . . . . . . . . . . . . . . 36

LIST OF TABLES

LIST OF ABBREVIATIONS

DDoS : Distributed Denial of Service
SVC : Support Vector Classifier
SVM : Support Vector Machine

CHAPTER 1
INTRODUCTION

1.1 Background

A DDoS (Distributed Denial of Service) attack is one of the main threats in today's
internet. A DDoS attack uses many different sources to send a large number of
useless packets to the target in a short time. The aim of this attack is to
consume the target's resources and make the target's service unavailable: it makes an
online service unavailable by overwhelming it with traffic from multiple sources.
Attacked machines may include computers, websites and other networked resources
such as input/output devices. This can have serious consequences, especially for
companies that rely on their online availability to do business. In the not-so-distant
past, there have been some large-scale attacks targeting high-profile internet
sites. Consequently, a lot of effort is currently being made to come up
with mechanisms to detect and mitigate such attacks.
A distributed denial of service (DDoS) attack is designed to overwhelm victims
with traffic and prevent their network resources from working correctly for their
legitimate clients. DDoS attacks require a significant amount of bandwidth to
successfully attack a big adversary, such as a Web-based media company, so they
often command thousands of hosts in a botnet to simultaneously send traffic to a
victim. This action has the effect of aggregating bandwidth to match or surpass
the victim’s network resources, as well as making specific host filtering difficult,
since the attack is coming from so many places all at once.

1.2 Motivation

DDoS attacks are constantly evolving as the technology used and the
motivations of the attackers change. Even today, perpetrators are being
caught and charged with DDoS attacks launched via botnets that cause thousands
of dollars of damage to their victims. What motivates us to develop detection and
prevention for this type of attack is the need to keep our data and the internet
we depend on secure, and to understand the unique characteristics of these
attacks and how their effects can be mitigated. So, we are going to take a
different approach to developing our intrusion detection system: we are going
to use Machine Learning to differentiate attacks on our system from
normal traffic by training our model.

1.3 Objective

The objectives of this project are:

• To detect Distributed Denial of Service attack on a system

• To ensemble classifiers to enhance the system's reliability

1.4 Feasibility Study

One of the most important factors that we need to consider during the
development of a project is the feasibility study. Feasibility refers to the
degree to which a task can be done easily or conveniently. For this we consider
the following:

1.4.1 Technical Feasibility

The primary aim of a technical feasibility study is to remove uncertainty: it
is important to assess the risk associated with our project. The project is
technically feasible as we have all the platforms we need to build the project and,
on top of that, to execute it.

1.4.2 Social Feasibility

Social feasibility is the feasibility study in which the acceptance by the people
of the product to be launched is considered. Since our project aims at the
prevention of DDoS attacks, it should be equally accepted among people who are
concerned about cyber-security threats, and it does not harm the feelings of any
group regardless of color, gender, religion, culture, ethics, etc.

1.4.3 Economic Feasibility

Economic feasibility refers to the analysis of the cost-effectiveness of a project
in order to determine whether the project is economically viable. A few thousand
rupees may be required to carry out the project, as we may need to purchase
various resources from internal or external sources.

1.4.4 Operational Feasibility

Operational feasibility is the measure of how well a proposed system solves the
problems, and takes advantage of the opportunities identified during scope definition.
Our system can operate in any environment and detect the DDoS attacks. It can
be made operating system independent.

1.4.5 Time Feasibility

A time feasibility study takes into account the period the project will take to
reach completion. Typically, this means estimating how long the system will take
to develop and whether it can be completed in a given time period. With a time
frame of about six months, we can complete the project within this period by
following all the guidelines provided, with all project partners working in unison.

1.5 Requirement Analysis

1.5.1 Hardware Requirements

The hardware requirements of the system are quite minimal: only basic,
general-purpose computer hardware is required.

1.5.2 Software Requirements

The model to be designed will work on a basic operating system for PCs. The
programming is performed using Python.

1.5.3 User Requirements

Our proposed system should be able to perform the following tasks properly:

• Distributed Denial of Service attack detection on a system

• Ensembling of the classifiers to enhance system’s reliability

1.5.4 Functional Requirement

A functional requirement defines the functionality of a system. It describes what
a system must do in order to satisfy its fundamental reason for existence, along
with fulfilling the software requirement specification criteria. For the
proposed system, the functional requirements include:

• Detect DDoS attacks

• Extract the modified NSL-KDD dataset

• Preprocess the dataset

• Train individual classifiers and ensemble them

• Evaluate the system performance

1.5.5 Non-Functional Requirement

These requirements are used to judge the operation of the engineered application.
The non-functional requirements of our detection system are:
Usability:
Since users are assumed to have common knowledge of using a simple desktop
application, there are no specific usability requirements. We expect it takes
no more time to become productive with this application than with a general
application such as email.
Reliability:

• Availability: The application shall be available 24/7. Even when no data
is available, the application should continue to operate without failing.

• Mean time to repair (MTTR): The system is allowed to be out of operation
after it has failed, but the application should be repaired as soon as possible.

Supportability:
Data and operations in classes shall fit together. Comments shall explain why
things are done rather than what is done. Unit tests for every class shall be
implemented. Code repetition shall be avoided.
Accuracy:
The system to be engineered should be accurate and efficient at providing the
results it is designed to give. The higher the accuracy of the system, the better
the system will be.

CHAPTER 2
LITERATURE REVIEW

V. Priyadharshini and Dr. K. Kuppusamy jointly proposed a journal paper on the
prevention of DDoS attacks using a New Cracking Algorithm. The proposed system
identifies whether the number of entries by a client to the same server exceeds
five; if so, the client is saved as an attacker in a blocked list and the service
is not provided. Thus, the algorithm protects legitimate traffic from a huge
volume of DDoS traffic when an attack occurs. It maintains a status table where
it keeps the IP address of the current user and their status. If an IP logs in
more than five times, it is considered an attacker. The algorithmic steps consist
of packet filtering, where packet transfer between computers is inspected, and a
MAC generator distinguishes packets that contain genuine source IP addresses from
those that contain spoofed addresses. [1]
Anup Bhange et al. discuss the impact of DDoS attacks on network traffic and
approaches to their detection. The analysis studies flood attacks and flash
crowds and their development, classifying such attacks as either high-rate or
low-rate floods. The paper discusses a statistical approach to analyzing the
distribution of network traffic in order to recognize normal network traffic
behavior. The EM algorithm is discussed to approximate the distribution
parameters of a Gaussian mixture distribution model, and a time series analysis
method is also studied. The paper further discusses a method to recognize
anomalies in network traffic, based on a non-restricted α-stable first-order
model and statistical hypothesis testing. [2]
Ma Zhao-hui, Zhao Gan-sen et al. discuss DDoS attack detection in software-defined
networks (SDN). Their paper presents a DDoS detection scheme based on the k-means
algorithm in an SDN environment. After demonstrating the validity of the k-means
clustering algorithm, the paper proposes five flow-table features that can be used
to detect DDoS attacks. Finally, the DDoS detection scheme is tested by a
simulation experiment. It focuses on attack detection via feature extraction,
data training and attack identification. [3]
Mohd Azahari Mohd Yusof et al. propose a detection and defense algorithm for
different types of DDoS attack. The algorithm focuses on mitigating four types of
DDoS attack: UDP flood, TCP SYN flood, Ping of Death and Smurf attacks.
The proposed algorithm is evaluated using an existing Intrusion Detection and
Prevention tool to determine whether it is the best algorithm to mitigate DDoS
attacks in a network environment, and is measured in terms of false positive rate
and detection accuracy. [4]
Stephen M. Specht and Ruby B. Lee discuss Distributed Denial of Service
taxonomies of attacks, tools and countermeasures. In the paper they describe DDoS
attack models and propose taxonomies to characterize the scope of DDoS attacks,
the characteristics of the attack software, the tools used and the countermeasures
available. These taxonomies illustrate similarities and patterns in different DDoS
attacks and tools, to assist in countering DDoS attacks. [5]
Alekhya Kaliki proposes Machine Learning based application-layer DDoS attack
detection using the Firefly classification algorithm. The journal devises a
bio-inspired, anomaly-based App-DDoS attack detection scheme with the aim of
achieving fast and early detection of App-DDoS attacks carried out via HTTP
floods. The dataset is preprocessed using machine learning metrics such as time
interval, maximum number of sessions, average session time, page access count,
minimum time interval between two pages, ratio of divergent familiar sources,
packets observed per packet type and maximum bandwidth consumption. An algorithm
is then used to separate attack traffic from normal traffic. [6]
An essay published on UK Essays on controlling IP spoofing through inter-domain
packet filtering proposes an inter-domain packet filter (IDPF) architecture that
can alleviate the level of IP spoofing on the Internet. A key feature of the
scheme is that it does not require global routing information. IDPFs are
constructed from the information implicit in Border Gateway Protocol (BGP) route
updates and are deployed in network border routers. The scheme does not discard
packets with valid source addresses. Even with partial deployment on the
Internet, IDPFs can proactively limit the spoofing capability of attackers. [7]

CHAPTER 3
RELATED THEORY

Denial of Service and Distributed Denial of Service attacks are important sources
of vulnerability for internet services. Using machine learning techniques for
attack detection is not new. Attacks identified through machine learning
techniques are either signature-based or anomaly-based: signatures are used in
signature-based intrusion detection systems, while detecting unknown attacks is
the role of anomaly detection, where data flows generated by unknown patterns give
scope for studying DDoS, and DDoS attacks can be prevented. Attacks are defined
as violations of the security policies of the network. Usually, attacks are
classified into passive and active attacks: passive attacks never affect the
system, whereas active attacks take control of the system. The most frequent
attacks in real-time networks are Denial of Service (DoS) attacks. When a DoS
attack is deployed from legitimate systems in a distributed environment, it is
considered a Distributed Denial of Service (DDoS) attack: multiple systems, often
called zombies, target one single system and flood it with messages until its
services are denied. Some types of DDoS attacks are flooding, IP spoofing, TCP
SYN flood, PING flood, UDP flood and Smurf attacks; multiple machines are used
to construct the flooding in DDoS attacks. The DDoS attack detection systems
present today mostly use machine learning techniques, including ensemble methods.

3.1 Machine Learning

Machine learning is a branch of artificial intelligence (AI) and computer science
which focuses on the use of data and algorithms to imitate the way that humans
learn, gradually improving its accuracy. Machine learning is an important
component of the growing field of data science. Through the use of statistical
methods, algorithms are trained to make classifications or predictions, uncovering
key insights within data mining projects. These insights subsequently drive
decision making within applications and businesses, ideally impacting key growth
metrics.

3.1.1 Supervised Learning

Supervised learning, also known as supervised machine learning, is a subcategory
of machine learning and artificial intelligence. It is defined by its use of
labeled datasets to train algorithms to classify data or predict outcomes
accurately. As input data is fed into the model, the model adjusts its weights
until it has been fitted appropriately, which occurs as part of the
cross-validation process. Supervised learning helps organizations solve a variety
of real-world problems at scale, such as filtering spam into a separate folder
from your inbox.

Figure 3.1: Working of Supervised Learning

Suppose we have a dataset of different types of shapes, including squares,
rectangles, triangles and hexagons. The first step is to train the model on each
shape:

• If the given shape has four sides, and all the sides are equal, it will be
labelled as a square.

• If the given shape has three sides, it will be labelled as a triangle.

• If the given shape has six equal sides, it will be labelled as a hexagon.

After training, we test the model using the test set, and the task of the model
is to identify the shape. The machine is already trained on all types of shapes;
when it finds a new shape, it classifies the shape on the basis of its number of
sides and predicts the output.
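The shape example above can be sketched in a few lines of scikit-learn. This is only a toy illustration of supervised learning, not part of the project's detection system; the side-count features and labels are invented here.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy labeled training set: [number of sides, 1 if all sides are equal else 0]
X_train = [[4, 1], [3, 0], [6, 1], [4, 1], [3, 0], [6, 1]]
y_train = ["square", "triangle", "hexagon", "square", "triangle", "hexagon"]

# Fit a simple supervised model on the labeled examples
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A new, unseen shape with three sides is classified by its side count
print(model.predict([[3, 0]])[0])  # triangle
```

After fitting, the model predicts the label of any new shape from its side count, exactly as the bullet rules above describe.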

3.1.2 Unsupervised Learning

Unsupervised learning, also known as unsupervised machine learning, uses machine
learning algorithms to analyze and cluster unlabeled datasets. These algorithms
discover hidden patterns or data groupings without the need for human
intervention. Their ability to discover similarities and differences in
information makes them the ideal solution for exploratory data analysis,
cross-selling strategies, customer segmentation, and image recognition.

Figure 3.2: Working of Unsupervised Learning

Here, we take unlabeled input data, which means it is not categorized and
corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it. First, the model interprets the raw
data to find the hidden patterns in the data, and then a suitable algorithm such
as k-means clustering or hierarchical clustering is applied. Once the suitable
algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
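As a minimal sketch of this grouping step, k-means (one of the algorithms named above) can be run on a small unlabeled set; the points below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two obvious groups
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Ask k-means for two clusters; no labels are supplied
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Points in the same group receive the same cluster label
print(kmeans.labels_)
```

The algorithm assigns each point a cluster index purely from the similarities between the points, with no labels given.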

3.2 Data Sets

Data sets are collections of data that can be used for analysis and prediction.
Computers do not interpret data sets directly the way humans do, so many
techniques are applied to render them usable for machine learning; the machine
recognizes each data set through its own algorithm. We used the NSL-KDD dataset
for training our models.
NSL-KDD is a data set suggested to solve some of the inherent problems of the
KDD'99 data set. Although this new version of the KDD data set still suffers from
some of the problems discussed by McHugh and may not be a perfect representative
of existing real networks, because of the lack of public data sets for
network-based IDSs, we believe it can still be applied as an effective benchmark
data set to help researchers compare different intrusion detection methods.
Furthermore, the number of records in the NSL-KDD train and test sets is
reasonable. This advantage makes it affordable to run experiments on the complete
set without the need to randomly select a small portion; consequently, the
evaluation results of different research works will be consistent and comparable.

CHAPTER 4
METHODOLOGY

4.1 System Block Diagram

The design and architecture of our system can be described by the following
figures. The diagrams illustrated are the system block diagram and the use-case
diagram.

Figure 4.1: Overall System Block Diagram

In this section, we review the various proposed methods planned to be used for
completing the intrusion detection system to be designed. Its development
consists of various phases, ranging from the selection of the software development
approach to the algorithms used. We propose DDoS attack detection based on
Machine Learning: we build a Network Intrusion Detection System (NIDS) which
detects abnormal activities from the network traffic. We use the dataset to train
the classifiers and ensemble them using the majority voting method.
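As one possible sketch of this pipeline, the three classifiers discussed later in this chapter (SVM, random forest and decision tree) can be combined with scikit-learn's `VotingClassifier` using hard (majority) voting. The synthetic dataset here is only a stand-in for the preprocessed NSL-KDD features; the hyperparameters are illustrative, not the project's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for normal vs. attack traffic records
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Hard voting: each classifier casts one vote and the majority label wins
ensemble = VotingClassifier(
    estimators=[("svm", SVC(random_state=42)),
                ("rf", RandomForestClassifier(random_state=42)),
                ("dt", DecisionTreeClassifier(random_state=42))],
    voting="hard")
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))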

4.2 Use-Case Diagram

Figure 4.2: Use-case Diagram of DDos Prevention System

4.3 Software Development Approach

As we develop our system, we have to feed training data to the system and make
changes as needed to reach a more efficient architecture and maximum efficiency
within a small period of time. The software development model best suited to us
was found to be the Incremental Model.

Figure 4.3: Incremental model

4.4 Dataset Used For Experiment

In our project, we use the NSL-KDD dataset, an improvement of the KDD'99
dataset, for experimentation. It contains 42 features, as shown in the figure:

Figure 4.4: Dataset features

The attack classes are further divided into sub-classes as shown in the figure:

Figure 4.5: Sub-classes of attacks

4.5 Data Processing

4.5.1 Features Selection

In our dataset there are 42 different features, which makes computation
difficult. Processing a model using all 42 features requires very high
computational power and computing time, and the model's accuracy is constricted
by the presence of redundant features.
By selecting only the appropriate features, we can increase the computational
efficiency and enhance the model's accuracy.
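The report does not name a specific selection technique, so as one minimal illustration of dropping a redundant feature, scikit-learn's `VarianceThreshold` can remove columns that carry no information (the toy matrix below is hypothetical):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three features; the middle column is constant (redundant) across samples
X = np.array([[0.1, 5.0, 1.2],
              [0.4, 5.0, 0.9],
              [0.3, 5.0, 1.1]])

# Drop zero-variance columns, keeping only informative features
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```

Real feature selection on the 42 NSL-KDD features would use a more informed criterion (e.g. correlation or model-based importance), but the mechanics of reducing the feature matrix are the same.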

4.5.2 Filling Missing Data

There are various fields of missing data in our dataset. Missing data in various
features results in distorted working of our model, so the missing fields must be
filled with appropriate values.
In our project, the missing values are filled in with the means of the individual
columns, calculated from the data initially present. We used the SimpleImputer
class in our project to fill in the missing values of the various features.
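A minimal sketch of this mean-imputation step with scikit-learn's `SimpleImputer` (the toy matrix is hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two feature columns; one entry is missing (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Replace each missing value with the mean of its own column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)  # the nan becomes (1.0 + 3.0) / 2 = 2.0
```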

4.5.3 Label Encoding

There are various categorical features in the dataset. The classifier algorithms
do not understand non-numerical categorical values, so these values must be
converted into numerical values that can be understood by our model. In our
project this is done using a LabelEncoder, which performs the conversion by
assigning a whole-number code to each unique entity in a feature column.
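A short sketch of the encoding step with scikit-learn's `LabelEncoder`; the protocol column is a hypothetical example of a categorical NSL-KDD feature.

```python
from sklearn.preprocessing import LabelEncoder

protocols = ["tcp", "udp", "icmp", "tcp"]  # a categorical feature column

# Each unique value is assigned a whole-number code
encoder = LabelEncoder()
codes = encoder.fit_transform(protocols)
print(list(codes))             # one code per row
print(list(encoder.classes_))  # sorted classes: ['icmp', 'tcp', 'udp']
```

Note that `LabelEncoder` sorts the unique values, so the codes follow alphabetical order rather than order of appearance.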

4.5.4 Division of Dataset into Independent and Dependent Variables

The independent and dependent variables are divided into separate variables by slicing the label-encoded dataset. The independent variables are the inputs of the model, and the dependent variable is its output.
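Assuming the class label sits in the last column of the label-encoded array, the slicing described above can be sketched as:

```python
import numpy as np

# Toy label-encoded dataset: the last column is the class label (dependent
# variable); the remaining columns are the inputs (independent variables).
data = np.array([[0, 1, 5, 0],
                 [2, 0, 3, 1],
                 [1, 1, 4, 1]])

X = data[:, :-1]  # independent variables (model inputs)
y = data[:, -1]   # dependent variable (model output)
print(X.shape, y.shape)  # (3, 3) (3,)
```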

4.5.5 Splitting the Dataset into Train and Test Sets

The dataset is split into train and test portions. The training portion of our dataset is used to train our classifier models, while the testing portion is used to confirm the model's efficient working.
In our project, we divided the input dataset into 75 percent training and 25 percent testing.
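The 75/25 split can be reproduced with scikit-learn's train_test_split; the random_state below is an arbitrary choice for repeatability, not a value stated in the report:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# test_size=0.25 gives the 75/25 split used in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))  # 15 5
```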

4.5.6 Features Scaling

If the varying value ranges present in different fields of our dataset are fed to the model as-is, the model assumes that the features with higher values have more influence. For example, a height feature may have data ranging from 4 to 6, while weight may range from 40 to 100; fed as-is, the model treats weight as a higher-priority field than height. To remove this problem, we normalize each individual feature as:

X = (X − Xmin) / (Xmax − Xmin)    (4.1)
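Equation (4.1) is exactly the transformation scikit-learn's MinMaxScaler applies per column, illustrated here with the height/weight example from above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Height-like and weight-like columns with very different ranges.
X = np.array([[4.0, 40.0],
              [5.0, 70.0],
              [6.0, 100.0]])

scaler = MinMaxScaler()  # applies (X - Xmin) / (Xmax - Xmin) per column
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```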

4.6 Classifiers Used

4.6.1 Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used for classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features we have) with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best differentiates the two classes.
Support vectors are simply the coordinates of individual observations. The SVM classifier is a frontier that segregates the two classes using a hyperplane/line.

Figure 4.6: SVM

Figure 4.7: Possible hyperplanes

To separate the two classes of data points, there are many possible hyperplanes that
could be chosen. Our objective is to find a plane that has the maximum margin,
i.e., the maximum distance between data points of both classes. Maximizing the
margin distance provides some reinforcement so that future data points can be
classified with more confidence.
Hyperplanes and support vectors

Figure 4.8: Hyperplanes in 2D and 3D feature space

Hyperplanes are decision boundaries that help classify the data points. Data points
falling on either side of the hyperplane can be attributed to different classes. Also,
the dimension of the hyperplane depends upon the number of features. If the
number of input features is 2, then the hyperplane is just a line. If the number
of input features is 3, then the hyperplane becomes a two-dimensional plane. It
becomes difficult to imagine when the number of features exceeds 3.

Figure 4.9: Support vectors

Support vectors are data points that are closer to the hyperplane and influence the position
and orientation of the hyperplane. Using these support vectors, we maximize the
margin of the classifier. Deleting the support vectors will change the position of
the hyperplane. These are the points that help us build our SVM.
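A minimal SVM sketch on synthetic two-class data standing in for attack/normal traffic; the linear kernel is an assumption, as the report does not state which kernel was used:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters stand in for the two traffic classes.
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

clf = SVC(kernel="linear")
clf.fit(X, y)

# The fitted classifier exposes the support vectors that define its margin.
print(clf.support_vectors_.shape)
print(clf.score(X, y))
```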

4.6.2 Random Forest

Random Forests (RFs) are composed of multiple decision trees, each trained independently on a random subset of the data. Random forest is a classifier that evolves from decision trees; it consists of many decision trees. To
classify a new instance, each decision tree provides a classification for input data;
random forest collects the classifications and chooses the most voted prediction
as the result. The input of each tree is sampled data from the original dataset.
In addition, a subset of features is randomly selected from the optional features
to grow the tree at each node. Each tree is grown without pruning. Essentially,
random forest enables a large number of weak or weakly-correlated classifiers
to form a strong classifier. A Random Forest is an ensemble technique capable
of performing both regression and classification tasks with the use of multiple
decision trees and a technique called Bootstrap and Aggregation, commonly known
as bagging. The basic idea behind this is to combine multiple decision trees in
determining the final output rather than relying on individual decision trees.
Random Forest has multiple decision trees as base learning models. We randomly
perform row sampling and feature sampling from the dataset forming sample
datasets for every model. This part is called Bagging.
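The bagging behaviour described above corresponds to scikit-learn's RandomForestClassifier (bootstrap row sampling plus a random feature subset per split); the hyperparameter values below are illustrative, not those used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=1)

rf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # row sampling with replacement (bagging)
    random_state=1)
rf.fit(X, y)

print(len(rf.estimators_))  # one fitted tree per estimator
print(rf.predict(X[:3]))    # majority vote across all trees
```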

Figure 4.10: Random forest

4.6.3 Decision Tree

Decision tree is among the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. A tree has many analogies in real life, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. Though commonly used in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning. A decision tree is a tree-like graph with nodes representing the places where we pick an attribute and ask a question, edges representing the answers to the question, and leaves representing the actual output or class label. Decision trees are used for non-linear decision making with simple linear decision surfaces.

Figure 4.11: Example of decision tree

Decision trees classify the examples by sorting them down the tree from the root to
some leaf node, with the leaf node providing the classification to the example. Each
node in the tree acts as a test case for some attribute, and each edge descending
from that node corresponds to one of the possible answers to the test case. This
process is recursive in nature and is repeated for every subtree rooted at the new nodes.
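A small decision tree fitted to a standard dataset makes the node/edge/leaf structure concrete; the iris data and depth limit here are purely illustrative, not part of the project:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed structure small and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the attribute tests at each internal node and the
# class labels at the leaves.
print(export_text(tree, feature_names=iris.feature_names))
```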

4.7 Performance Evaluation

The performance of our model is evaluated by constructing a confusion matrix, from which we calculate the accuracy of the individual classifiers as well as that of the ensemble model.

4.7.1 Confusion Matrix

A way to evaluate the performance of a classifier is to look at its confusion matrix. The confusion matrix is used to determine the performance of a model for a given set of test data. It can only be determined if the true values for the test data are known. With the help of the confusion matrix, we can calculate different parameters for the model, such as accuracy, precision, etc.

Figure 4.12: Confusion Matrix

The above table has the following cases:

• True Negative: The model predicted No, and the actual value was also No.

• True Positive: The model predicted Yes, and the actual value was also Yes.

• False Negative: The model predicted No, but the actual value was Yes. It is also called a Type-II error.

• False Positive: The model predicted Yes, but the actual value was No. It is also called a Type-I error.

The confusion matrices of the classifiers used in the project are:

Figure 4.13: Confusion Matrix of random forest

Figure 4.14: Confusion Matrix of decision tree

Figure 4.15: Confusion Matrix of SVM

4.7.2 Accuracy

Accuracy is defined as the closeness of a measured value to the true value: the closer the measured value is to the true value, the better the accuracy.

Accuracy = (TruePositive + TrueNegative) / (TruePositive + TrueNegative + FalsePositive + FalseNegative)    (4.2)

4.7.3 Precision

Precision refers to the closeness of two or more measurements to each other. Accuracy and precision are independent of each other. Precision answers the question: out of the number of times the model predicted positive, how often was it correct?

Precision = TruePositive / (TruePositive + FalsePositive)    (4.3)

4.7.4 Recall

Recall measures what proportion of actual positive labels is correctly predicted as positive. As the number of false negatives decreases, recall increases, and vice versa. Recall is used when we want to focus on false negatives, i.e. decreasing the false-negative count and thereby increasing the recall value.

Recall = TruePositive / (TruePositive + FalseNegative)    (4.4)
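The three metrics in equations (4.2)-(4.4) can be checked against scikit-learn on a small hand-made example:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() on the 2x2 matrix yields (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3
print(accuracy_score(y_true, y_pred))   # (3+3)/8 = 0.75, matching Eq. (4.2)
print(precision_score(y_true, y_pred))  # 3/(3+1) = 0.75, matching Eq. (4.3)
print(recall_score(y_true, y_pred))     # 3/(3+1) = 0.75, matching Eq. (4.4)
```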

4.8 Model Development Tools

Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. During the development of our system we used the following tools and techniques.

4.8.1 Streamlit

Streamlit is a free and open-source framework to rapidly build and share beautiful
machine learning and data science web apps. It is a Python-based library specifically
designed for machine learning engineers. Data scientists or machine learning
engineers are not web developers and they’re not interested in spending weeks
learning to use these frameworks to build web apps. Instead, they want a tool
that is easier to learn and to use, as long as it can display data and collect
needed parameters for modeling. Streamlit allows you to create a stunning-looking
application with only a few lines of code.
Streamlit is the easiest way, especially for people with no front-end knowledge, to put their code into a web application:

• No front-end (html, js, css) experience or knowledge is required.

• You don’t need to spend days or months to create a web app, you can create
a really beautiful machine learning or data science app in only a few hours
or even minutes.

• It is compatible with the majority of Python libraries (e.g. pandas, matplotlib, seaborn, plotly, Keras, PyTorch, SymPy (LaTeX)).

• Less code is needed to create amazing web apps.

• Data caching simplifies and speeds up computation pipelines.

Figure 4.16: Streamlit

4.8.2 Jupyter Notebook

The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents that include live code, interactive widgets, plots, narrative text, equations, images, and video. Jupyter Notebook is maintained by the people at Project Jupyter.
Jupyter Notebooks are a spin-off from the IPython project, which used to have an IPython Notebook project of its own. The name Jupyter comes from the core programming languages it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your programs in Python, but there are currently over 100 other kernels that you can also use. These documents provide a complete and self-contained record of a computation that can be converted to various formats and shared with others using email, Dropbox, version control systems (like git/GitHub), or nbviewer.jupyter.org.

Figure 4.17: Jupyter Notebook

4.8.3 Python

Python is a high-level general-purpose programming language. Its design philosophy


emphasizes code readability with the use of significant indentation. Its language
constructs and object-oriented approach aim to help programmers write clear,
logical code for small- and large-scale projects. Python is dynamically-typed
and garbage-collected. It supports multiple programming paradigms, including
structured (particularly, procedural), object-oriented and functional programming.
It is often described as a ”batteries included” language due to its comprehensive
standard library.
Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly
fast. Debugging Python programs is easy: a bug or bad input will never cause a
segmentation fault. Instead, when the interpreter discovers an error, it raises an
exception. When the program doesn’t catch the exception, the interpreter prints a
stack trace. A source level debugger allows inspection of local and global variables,
evaluation of arbitrary expressions, setting breakpoints, stepping through the code
a line at a time, and so on. The debugger is written in Python itself, testifying to
Python’s introspective power. On the other hand, often the quickest way to debug
a program is to add a few print statements to the source: the fast edit-test-debug
cycle makes this simple approach very effective.

Figure 4.18: Python

CHAPTER 5
RESULT AND ANALYSIS

We used three different kinds of algorithms, namely decision tree, random forest, and support vector classifier (SVC). The accuracy obtained from the decision tree was found to be 89.57 percent, that of random forest 95.8 percent, and that of SVC 88.07 percent. Comparing the accuracy of the three classifiers, we found that the random forest classifier obtained the highest accuracy on the given dataset.

Figure 5.1: Accuracy

5.1 Output

Figure 5.2: Output

CHAPTER 6
EPILOGUE

6.1 Conclusion

This project work is a successful output of the course called ‘Major Project’
considered as the partial fulfillment of B.E. Computer Engineering at IOE. The
main objective of the project was to detect Distributed Denial of Service attack on
a system. DDoS attack detection by training our model with various classifiers
helps us protect our system when deployed and plays an important role in security.
In this report, we have analyzed various algorithms, namely Decision Tree, Random Forest, and Support Vector Machine. We built individual models using each of these algorithms and found their respective accuracies.
During the entire project development period we were able to develop our skills. This project enabled us to manage time and resources despite various constraints. We learned how to work in a group and thereby develop a system.
Hence, we can conclude that the security of a system can be enhanced using an ensembled machine learning model that incorporates the individual classifiers.

6.2 Limitations and Future Enhancements

The models we built in our project are confined to only a few PCs. The packets intended for a DDoS attack are detected based on the data provided by the NSL-KDD dataset, but real-life data packets may not be similar in form to the packets present in the dataset.
In the future, we intend to incorporate DDoS detection along with prevention and deploy the system in a deterministic environment. We also intend to make the model able to learn from real-time packets and adjust itself accordingly.

REFERENCES

[1] V. Priyadharshini and K. Kuppusamy. Prevention of DDoS attacks using new cracking algorithm. International Journal of Engineering Research and Applications (IJERA), 2, May-June 2012.

[2] Anup Bhange, Amber Syad, and Satyendra Singh Thakur. DDoS attacks impact on network traffic and its detection approach. International Journal of Computer Applications, 40, February 2012.

[3] Ma Zhao-hui, Zhao Gan-sen, and Lin Cheng-chuang. Research on DDoS attack detection in software defined network. 2018 International Conference on Cloud Computing, Big Data and Blockchain (ICCBB), November 2018.

[4] Mohamad Yusof Darus, Mohd Azahari Mohd Yusof, and Fakariah Hani Mohd Ali. Detection and defense algorithms of different types of DDoS attacks using machine learning. Computational Science and Technology, pages 370–379, February 2018.

[5] Stephen M. Specht and Ruby B. Lee. Distributed denial of service: Taxonomies of attacks, tools, and countermeasures. Proceedings of the ISCA 17th International Conference on Parallel and Distributed Computing Systems, January 2004.

[6] Alekhya Kaliki and K. Munivara Prasad. Machine learning based application layer DDoS attack detection using firefly classification algorithm. International Journal of Pure and Applied Mathematics, 2018.

[7] Xin Yuan, Zhenhai Duan, and Jaideep Chandrashekar. Controlling IP spoofing through inter-domain packet filters. IEEE Transactions on Dependable and Secure Computing, February 2008.

CHAPTER 7
APPENDIX A

7.0.1 Project Schedule

The project schedule is one of the most important parts of the project. It gives information about the overall tasks to be performed during the whole project period so that the project can be accomplished in time. In project planning, all tasks are important, but some need extra time and dedication. In our project, data normalization and the testing of data generation demanded a lot of time and attention. Similarly, the purification and cleaning of data, along with its testing and validation, were also quite challenging. Documentation took the maximum time out of our project schedule, as it records each and every detail of our system; to make it clear and easily understandable, we put in extra time and dedication. The schedule was updated on a regular basis. The Gantt chart of our project is shown in the figure below.

Figure 7.1: Project Schedule
