Machine Learning IDS Project Report

This document describes a project to develop an intrusion detection system using machine learning algorithms. The system will use Random Forest and Support Vector Machine classifiers to analyze network traffic data and detect known and unknown attacks. The performance of the algorithms will be evaluated and compared using various metrics. A web application will also be developed to allow users to input new data and receive classification results in real-time.


INTRUSION DETECTION SYSTEM USING

MACHINE LEARNING

Submitted in partial fulfilment of the requirements of the degree of


Bachelor of Engineering

by

Asfiya Yunus Shaikh (Roll No. 45)


Ansari Mohammad Sameer Shamsuddin (Roll No. 03)
Shaikh MD Umair Ismail (Roll No. 48)
Syed Ghazi Abbas (Roll No. 58)

Under the guidance of


Prof. Shahegul Afroz

Department of Computer Engineering,


Theem College Of Engineering
Village Betegaon, Boisar Chilhar Road, Boisar (E), Palghar
University Of Mumbai

2023-2024
Declaration

We declare that this written submission represents our ideas in our own words and
where others’ ideas or words have been included, we have adequately cited and
referenced the original sources. We also declare that we have adhered to all
principles of academic honesty and integrity and have not misrepresented, fabricated or
falsified any idea/data/fact/source in our submission. We understand that any violation
of the above will be cause for disciplinary action by the institute and can also evoke penal
action from the sources which have thus not been properly cited or from whom proper
permission has not been taken when needed.

Asfiya Yunus Shaikh (45)

Ansari Mohammad Sameer Shamsuddin (03)

Shaikh MD Umair Ismail (48)

Syed Ghazi Abbas (58)

Date:
Project Report Approval for Bachelor Of Engineering

The project report entitled Intrusion Detection System Using Machine Learning
by Asfiya Yunus Shaikh, Ansari Mohammad Sameer Shamsuddin, Shaikh MD Umair
Ismail, Syed Ghazi Abbas is approved for the degree of Bachelor of Engineering in
Computer Engineering.

Date: Examiners:

Place:

Name & Sign of External Examiner

Name & Sign of Internal Examiner


CERTIFICATE

This is to certify that the project entitled “Intrusion Detection System Using
Machine Learning” is a bona fide work of “Asfiya Yunus Shaikh (45), Ansari
Mohammad Sameer Shamsuddin (03), Shaikh MD Umair Ismail (48), Syed
Ghazi Abbas (58)”, submitted to the University of Mumbai in partial fulfilment of the
requirement for the award of the degree of “Bachelor of Engineering” in
“Computer Engineering”, and has been carried out under my supervision at the Department of
Computer Engineering of Theem College of Engineering, Boisar. The work is
comprehensive, complete and fit for evaluation.

Prof. Shahegul Afroz Prof. Shahegul Afroz


Project Guide Project Coordinator

Prof. Mubashir Khan [Link] Siddique


HOD Principal
Acknowledgement

First and foremost, we thank God Almighty for blessing us immensely and empowering
us at times of difficulty like a beacon of light. Without His divine intervention we
would not have been able to complete this project without hindrance.
We are also grateful to the Management of Theem College of Engineering for their
kind support. Moreover, we thank our beloved Principal [Link] Siddiqui and
our Director, Dr. N.K. Rana, for their constant encouragement and valuable advice
throughout the course.
We are profoundly indebted to Prof. Mubashir Khan, Head of the Department of
Computer Engineering, and Prof. Shahegul Afroz, Project Coordinator, for helping
us technically and giving valuable advice and suggestions from time to time. They are
always our source of inspiration.
Also, we would like to take this opportunity to express our profound thanks to our guide
Prof. Shahegul Afroz, Assistant Professor, Computer Engineering, for valuable
advice and wholehearted cooperation, without which this project would not have seen
the light of day.
We express our sincere gratitude to all Teaching/Non-Teaching staff members of the
Computer Engineering department for their co-operation and support during this
project.

Asfiya Yunus Shaikh (45)


Ansari Mohammad Sameer Shamsuddin (03)
Shaikh MD Umair Ismail (48)
Syed Ghazi Abbas (58)
ABSTRACT

The identification of new and evolving cyber threats is a major challenge for traditional
intrusion detection systems (IDS). These systems often rely on signature-based detection
methods, which are ineffective against novel attacks. To address this issue, we propose
an intrusion detection system using machine learning (IDS-ML) that leverages the power
of machine learning algorithms to detect both known and unknown attacks. Our
proposed IDS-ML system will employ Random Forest and Support Vector Machine
(SVM) algorithms to analyze network traffic data and identify malicious patterns. These
algorithms will be trained on comprehensive datasets such as CICIDS 2018 and UNSW-
NB15, ensuring exposure to a wide range of attack types. To evaluate the
performance of these algorithms, we will compare their true positive rates (TPR) and
false positive rates (FPR), as well as their precision and recall. Furthermore, we will
plot the ROC curve and compute the AUC-ROC for each algorithm. We will also
develop a user-friendly web application that will allow users to input new network
traffic data and receive real-time classification results. The system will analyze the
input data and determine whether it represents anomalous or normal network behavior.
To enhance the understanding of the algorithms’ performance, we will incorporate
visualization graphics that will provide insights into the classification process. Our
IDS-ML system has the potential to significantly improve the security of networks by
effectively detecting both known and unknown attacks. By employing advanced machine
learning algorithms, providing a user-friendly interface, and offering robust evaluation
metrics, our system will be a valuable tool for network administrators and security
professionals.

LIST OF FIGURES

[Link] System Architecture 1
[Link] System Architecture 2
[Link] Use Case Diagram
[Link] Activity Diagram
[Link] Sequence Diagram
[Link] Class Diagram
[Link] Home
[Link] About Us
[Link] Contact Us
[Link] Analysis 1
7.4.0.2 Analysis 2
[Link] Feature Description
[Link] Results

LIST OF TABLES

8.1 Project Implementation Plan


CONTENTS

Abstract
List Of Figures
List Of Tables

1 INTRODUCTION

2 LITERATURE REVIEW
2.1 Intrusion Detection System with Machine Learning Algorithms and Comparison Analysis:
2.2 Survey of intrusion detection systems: techniques, datasets and challenges:
2.3 Performance Study of Snort and Suricata for Intrusion Detection System:
2.4 Intrusion Detection System using Machine Learning Techniques: A Review:
2.5 Machine Learning-Based Anomaly Detection in NFV: A Comprehensive Survey:
2.6 MSTREAM: Fast Anomaly Detection in Multi-Aspect Streams:
2.7 A Novel Model for Anomaly Detection in Network Traffic Based on Support Vector Machine and Clustering:
2.8 Reinforcement Learning for Intrusion Detection: More Model Longness and Fewer Updates
2.9 A Neural Network Architecture Combining Gated Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection in Network Traffic Data
2.10 A Fine-grained System Driven of Attacks over Several New Representation Techniques using Machine Learning

3 LIMITATIONS OF EXISTING SYSTEM OR RESEARCH GAP

4 PROBLEM STATEMENT AND OBJECTIVE
4.1 Problem Statement
4.2 Objectives

5 PROPOSED SYSTEM
5.1 System Architecture 1
5.2 System Architecture 2
5.3 Model Training
5.3.1 Random Forest Algorithm:
5.3.2 Support Vector Machine Algorithm:
5.4 UML Diagrams
5.4.1 Use Case Diagram
5.4.2 Activity Diagram
5.4.3 Sequence Diagram
5.4.4 Class Diagram

6 Experimental Setup
6.1 Introduction:
6.2 Dataset Selection:
6.3 Data Preprocessing:
6.4 Model Training:
6.5 Model Evaluation:
6.6 Web Application Development:
6.7 Model Deployment:
6.8 Integration with Frontend:
6.9 Evaluation and Analysis Display:
6.10 Details about Inputs to the System
6.11 Evaluation Parameters
6.12 Software and Hardware Setup
6.12.1 Software and Packages Requirements:
6.12.2 Hardware Requirements:

7 Results and Discussion
7.1 HOME
7.2 ABOUT US
7.3 CONTACT US
7.4 ANALYSIS
7.5 FEATURE DESCRIPTION
7.6 RESULTS

8 Implementation Plan of Next Semester
8.1 Implementation Plan:

9 Conclusion
Chapter 1
INTRODUCTION

The detection of intrusions and anomalous activities remains a critical challenge in the
rapidly evolving cybersecurity landscape. Traditional intrusion detection systems (IDS)
have been the mainstay of network security for many years, but their efficacy in
defending against novel and sophisticated cyberattacks has waned due to the
relentless evolution of cyber threats and the emergence of new attack vectors.
Consequently, the need for innovative and adaptive IDS solutions has become
increasingly imperative.
The inherent limitations of traditional intrusion detection systems (IDS) in effectively
detecting and mitigating novel attacks pose a significant challenge to cybersecurity in
the digital age. Cybercriminals are constantly adapting and devising ingenious methods
to breach network defenses, while traditional rule-based or signature-based IDS systems
struggle to keep pace. Consequently, their ability to detect emerging threats remains
limited, jeopardizing the integrity and confidentiality of data.
Our project aims to bridge this gap by implementing a machine learning (ML)-based
IDS. By leveraging ML algorithms such as Random Forest and Support Vector
Machines (SVM), the system can learn from examples of the evolving threat
landscape and detect anomalies and intrusions that might otherwise remain concealed
within large volumes of network traffic data. The IDS is intended to detect multiple
types of suspicious activity, including denial-of-service (DoS) and port scanning attacks.
DoS attacks attempt to disrupt a server’s normal operation by flooding it with traffic.
Port scanning attacks are used to identify open ports on a server or network device,
which can then be exploited to launch further attacks.
Our IDS will employ Random Forest and SVM algorithms to analyze network traffic
data and identify malicious patterns. These algorithms will be trained on
comprehensive datasets such as CICIDS 2018 and UNSW-NB15, which include examples
of a wide range of attack types.
The Random Forest algorithm begins by constructing a set of decision trees. Each
tree is trained on a random sample of the data, and the features considered for splitting
the data at each node are also selected randomly [1]. Once the trees have been
constructed, they are used to make predictions on new data points. Each tree predicts a
class label for the new data point, and the final prediction of the forest is the most
common class label predicted by the individual trees. Random Forest can be used to
develop intrusion detection systems that detect both known and unknown attacks.
Support Vector Machines (SVMs) are supervised machine learning algorithms that can
be used for both classification and regression tasks. For classification, an SVM finds a
hyperplane in the feature space that separates the data points into two classes. The
hyperplane is chosen such that the margin between the two classes is maximized.
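The two ideas described above can be illustrated with a short sketch. It is not the project's implementation: it assumes scikit-learn and NumPy are available, uses synthetic data from make_classification as a stand-in for labelled traffic records, and the helper names fit_forest and predict_forest are hypothetical. The forest is built by hand from bootstrap samples and random feature subsets so that the majority vote is explicit, and a linear SVM is fitted to show the margin width 2/||w||.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees decision trees, each on a bootstrap sample of the rows."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random subset of features considered at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[rows], y[rows]))
    return trees


def predict_forest(trees, X):
    """Return the most common class label predicted by the individual trees."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Synthetic stand-in for labelled traffic records (assumption): 0 = normal, 1 = attack.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

trees = fit_forest(X_train, y_train)
y_pred = predict_forest(trees, X_test)
accuracy = float((y_pred == y_test).mean())
print(f"hand-built forest accuracy: {accuracy:.3f}")

# Linear SVM: the separating hyperplane is w.x + b = 0 and the margin width is 2 / ||w||.
svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
margin_width = 2.0 / np.linalg.norm(svm.coef_[0])
print(f"linear SVM margin width: {margin_width:.3f}")
```

In practice a library class such as scikit-learn's RandomForestClassifier would normally be used instead of the hand-built loop; the loop is shown only to make the sampling and voting steps visible.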
Other existing methods include the decision tree algorithm, which constructs a
tree-like structure where each internal node tests a feature of the data and each leaf
node represents a class label or prediction value; logistic regression, which despite its
name is used for classification and models the probability that the data belongs to a
particular category; and the K-Nearest Neighbors (KNN) algorithm, another supervised
machine learning algorithm used for solving classification problems. In KNN, the
distance between a query scenario and a set of scenarios is calculated using a distance
function, for example the Euclidean distance formula [1].
Many more algorithms can be used to detect anomalies. We will use Random Forest
because it has been reported to reduce false alarms effectively [2]. Additionally, we
will use the Support Vector Machine (SVM) algorithm, which has been reported to
achieve higher accuracy on some newer datasets [3].
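As an illustration of the KNN distance step mentioned above, the following minimal sketch classifies a query record by majority vote among its k nearest training records under Euclidean distance. The function name knn_predict, the two features and all values are hypothetical and chosen only for illustration.

```python
import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]  # indices of the k closest records
    return int(np.bincount(y_train[nearest]).argmax())


# Hypothetical records: [packets per second, distinct ports contacted]
X_train = np.array(
    [[10, 2], [12, 3], [11, 1], [900, 4], [850, 3], [15, 200], [20, 180]],
    dtype=float,
)
y_train = np.array([0, 0, 0, 1, 1, 1, 1])  # 0 = normal, 1 = attack

scan_like = knn_predict(X_train, y_train, np.array([14.0, 190.0]))
normal_like = knn_predict(X_train, y_train, np.array([11.0, 2.0]))
print(scan_like, normal_like)
```

Because Euclidean distance is sensitive to feature scale, features are usually standardized before KNN is applied to real traffic data.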

Chapter 2
LITERATURE REVIEW

2.1 Intrusion Detection System with Machine Learning Algorithms and Comparison Analysis:

This research describes a proposed intrusion detection system (IDS) that uses machine
learning algorithms to detect four different types of attacks: denial of service (DoS), user
to root (U2R), remote to local (R2L), and probe. The IDS is implemented in Python and
uses the KDD dataset for training and testing. The paper compares the performance
of four machine learning algorithms: decision tree, random forest, logistic regression,
and k-nearest neighbors (KNN). The results show that the random forest algorithm has
the highest accuracy, followed by the decision tree algorithm. The logistic regression
and KNN algorithms have lower accuracy, but they are still able to detect a
significant number of attacks.
The paper also discusses the use of Snort, a popular open-source IDS, to perform
vulnerability assessment and protocol analysis. The authors argue that Snort can be
used to complement the proposed IDS by providing additional detection capabilities.
However, there are also some potential limitations to the proposed IDS. First, machine
learning algorithms can be computationally expensive to train and deploy. This may
make the proposed IDS impractical for use in some environments. Second, the proposed
IDS is trained on the KDD dataset, which is a well-known dataset, but it is also relatively
old. It is possible that the proposed IDS may not perform as well on real-world data.
Overall, the proposed IDS is a promising approach to intrusion detection. It has several
advantages over traditional rule-based IDSs, but it is important to be aware of the
potential limitations before deploying it in a production environment.
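A comparison of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn with synthetic data from make_classification rather than the KDD dataset, default or near-default hyperparameters, and a single train/test split, so the resulting numbers say nothing about the accuracies reported in the reviewed paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for labelled connection records (assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Scaling matters for logistic regression and KNN, so both are wrapped in a pipeline.
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out split

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} accuracy = {score:.3f}")
```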

2.2 Survey of intrusion detection systems: techniques, datasets and challenges:

This paper discusses the challenges of designing and building intrusion detection
systems (IDSs) that are capable of detecting modern malware. The authors argue that
existing IDSs are often ineffective against zero-day attacks and other sophisticated
attacks that use evasion techniques.
The authors also review several machine learning techniques that have been proposed
for IDS research. However, they argue that these techniques can have difficulty
generating and updating information about new attacks, and can yield high false alarm
rates or poor accuracy. The authors conclude that there is a need for newer and
more comprehensive datasets that contain a wide spectrum of malware activities. They
also argue that IDS research should focus on developing systems that are capable of
overcoming evasion techniques.
Overall, the paper provides a valuable contribution to the field of IDS research. It
highlights the challenges of detecting modern malware, discusses the limitations of
existing IDSs, and suggests directions for future research, including the development of
new datasets and new IDS techniques.

2.3 Performance Study of Snort and Suricata for Intrusion Detection System:

This paper compares the performance of two network intrusion detection systems
(NIDS): Suricata and Snort. The authors evaluate the two systems based on accuracy,
performance, and scalability. They found that Suricata has higher accuracy than
Snort, meaning that it is better at detecting attacks. However, Suricata also has
higher system requirements than Snort, and it can reach its operational limit sooner
under high traffic loads. The authors also found that Suricata is more scalable than
Snort, meaning that it can handle more traffic without dropping packets, although
they note that Snort can be scaled by running multiple instances of Snort on multiple
CPU cores. The paper provides a valuable comparison of two popular NIDSs. The
authors’ findings are consistent with other research on the performance of Suricata
and Snort. One limitation of the paper is that it is based on a single dataset. It would
be interesting to see if the authors’ findings are consistent when evaluated on other
datasets. Another limitation of the paper is that it does not evaluate other NIDSs,
such as Zeek (formerly known as Bro). It would be interesting to see how Suricata
and Snort compare to other NIDSs in terms of accuracy, performance, and
scalability.

2.4 Intrusion Detection System using Machine Learning Techniques: A Review:

This paper discusses the use of machine learning classifiers in intrusion detection
systems (IDSs). The authors review a number of research papers published from 2015 to
2020 and find that ensemble and hybrid classifiers have better predictive accuracy
and detection rate than single classifiers. They also discuss several directions for future
research, including:

• Using hybrid and ensemble classifiers more often.

• Developing models that can perform efficiently on multiple datasets.

• Considering feature selection to improve the efficiency and detection rate of IDSs.

• Using more recently updated datasets to deal with the most recent malicious
intrusions and attacks.

The paper provides a valuable overview of the use of machine learning classifiers in IDSs.
The authors’ findings are consistent with other research on this topic. One limitation of
the paper is that it is based on a relatively small number of research papers. It would
be interesting to see if the authors’ findings are consistent when evaluated on a larger
set of research papers. Another limitation of the paper is that it does not discuss the
challenges of using machine learning classifiers in IDSs. For example, machine learning
classifiers can be susceptible to overfitting and adversarial attacks. Overall, the paper
provides a valuable contribution to the field of IDS research. The authors’ findings and
suggestions for future research can help researchers and practitioners to develop more
effective and robust IDSs.
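Two of the directions discussed in this review, feature selection and ensemble classifiers, can be combined in a single pipeline. The sketch below is an illustration only and is not taken from the reviewed paper: it assumes scikit-learn, synthetic data, an arbitrary choice of 10 selected features, and a hard-voting ensemble of three arbitrarily chosen classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with many uninformative features (assumption, for illustration).
X, y = make_classification(n_samples=800, n_features=30, n_informative=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Hard voting: each classifier casts one vote and the majority label wins.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("svm", SVC(kernel="rbf", random_state=1)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)

# Scale, keep the 10 features with the highest ANOVA F-score, then classify.
pipeline = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), ensemble)
pipeline.fit(X_train, y_train)

n_selected = int(pipeline.named_steps["selectkbest"].get_support().sum())
test_accuracy = pipeline.score(X_test, y_test)
print(f"features kept: {n_selected}, accuracy: {test_accuracy:.3f}")
```

Placing the selector inside the pipeline means it is fitted on the training split only, which avoids leaking information from the test split into the feature choice.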

2.5 Machine Learning-Based Anomaly Detection in NFV: A Comprehensive Survey:

This paper surveys the state-of-the-art research on network-based anomaly detection
in network function virtualization (NFV) networks. The authors review a number of
supervised and unsupervised machine learning algorithms that have been proposed for
anomaly detection in NFV networks. The authors conclude that supervised algorithms
are more accurate and efficient than unsupervised algorithms for anomaly detection in
NFV networks. They also argue that a separate module design is a better solution for
anomaly detection in NFV networks. The paper provides a valuable overview of the
state-of-the-art research on anomaly detection in NFV networks. The authors’ findings
are consistent with other research on this topic.
One limitation of the paper is that it does not discuss the challenges of implementing
anomaly detection in real-world NFV networks. For example, anomaly detection systems
can be computationally expensive to deploy and maintain. Additionally, anomaly
detection systems can be susceptible to false alarms and false negatives. Another
limitation of the paper is that it does not discuss the use of deep learning for
anomaly detection in NFV networks. Deep learning algorithms have shown promise for
anomaly detection in other domains, such as network intrusion detection and fraud
detection.
Overall, the paper provides a valuable contribution to the field of NFV security research.
The authors’ findings and suggestions for future research can help researchers and
practitioners to develop more effective and robust anomaly detection systems for NFV
networks.

2.6 MSTREAM: Fast Anomaly Detection in Multi-Aspect Streams:

The proposed research paper presents an algorithm called MSTREAM, designed to
detect anomalous patterns in streaming data with multi-aspect attributes. The need
for such a system arises from the challenge of identifying patterns without setting
prior limits on the duration of anomalous activities or the attributes involved.
MSTREAM uses hash functions (FEATUREHASH and RECORDHASH) to handle
mixed data types effectively. It employs temporal scoring inspired by MIDAS for
anomaly detection and introduces three variants (MSTREAM-PCA, MSTREAM-IB,
and MSTREAM-AE) to reduce data dimensionality while preserving feature correlation,
improving anomaly detection efficiency.
The authors evaluate MSTREAM on the CICIDS-DoS dataset and compare it to
several baseline approaches, including Elliptic Envelope, Local Outlier Factor, Isolation
Forest, DENSEALERT, and Random Cut Forest. The results show that MSTREAM
outperforms all of the baseline approaches in terms of accuracy and F1 score. The
authors also demonstrate the effectiveness of MSTREAM in detecting group anomalies.
The paper provides a valuable contribution to the field of anomaly detection. MSTREAM
is a novel and effective approach for detecting group anomalies in multi-aspect streams.
It is also a streaming algorithm, which means that it can be used to process data in
real time. One limitation of the paper is that it is based on a single dataset. It would
be interesting to see if the performance of MSTREAM is consistent when evaluated on
other datasets. Another limitation of the paper is that it does not discuss the
computational complexity of MSTREAM. It would be interesting to know how the
performance of MSTREAM scales with the size and dimensionality of the data.

2.7 A Novel Model for Anomaly Detection in Network Traffic Based on Support Vector Machine and Clustering:

This paper proposes a novel model called SVM-C for anomaly detection in network
traffic. SVM-C is a support vector machine (SVM) classifier that uses a linear projection
to extract features from network traffic. The authors evaluate SVM-C on three datasets
and compare it to several baseline classifiers, including Naive Bayes (NB), SVM, and
multilayer perceptron (MLP). The results show that SVM-C outperforms the baseline
classifiers on all three datasets in terms of accuracy, precision, and F1 score. SVM-C also
has a lower false positive rate than the baseline classifiers. One limitation of the paper is
that it does not discuss the computational complexity of SVM-C. It would be interesting
to know how the performance of SVM-C scales with the size and dimensionality of
the data. Another limitation of the paper is that it does not compare SVM-C to other
state-of-the-art anomaly detection algorithms. It would be interesting to see how SVM-C
compares to other algorithms in terms of accuracy, performance, and scalability. Overall,
the paper provides a valuable contribution to the field of anomaly detection. SVM-C
is a promising new approach for detecting anomalies in network traffic.
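The evaluation metrics used in comparisons like this one (precision, F1 score, false positive rate), together with the true positive rate and AUC-ROC planned for our own evaluation, can all be derived from a confusion matrix and the classifier scores. The sketch below assumes scikit-learn; the labels and scores are made-up values chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up ground truth (0 = normal, 1 = attack) and classifier scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.9, 0.8, 0.35, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # threshold of 0.5 is an arbitrary choice

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)        # true positive rate (recall, detection rate)
fpr = fp / (fp + tn)        # false positive rate (false alarm rate)
precision = tp / (tp + fp)  # fraction of raised alarms that are real attacks
f1 = 2 * precision * tpr / (precision + tpr)
auc = roc_auc_score(y_true, y_score)  # threshold-independent ranking quality

print(f"TPR={tpr:.3f} FPR={fpr:.3f} precision={precision:.3f} F1={f1:.3f} AUC={auc:.3f}")
```

TPR and FPR depend on the chosen threshold, whereas AUC-ROC summarizes performance over all thresholds, which is why both kinds of metric are usually reported.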

2.8 Reinforcement Learning for Intrusion Detection: More Model Longness and Fewer Updates

Network-based intrusion detection has seen a surge in the application of machine
learning techniques in recent years. While these methods offer high detection
accuracy, they often struggle to adapt to evolving network behaviors over time.
Existing solutions assume periodic model updates, which can be impractical in real-
world scenarios. This paper introduces a novel intrusion detection model using
reinforcement learning, designed to operate without frequent updates. The model
employs two strategies: long-term learning through reinforcement learning and model
updates via transfer learning and a sliding window mechanism. Experiments using a
large 8TB dataset and four years of real network traffic show that the proposed model
outperforms traditional approaches, achieving comparable accuracy without periodic
updates and significantly reducing false positives and false negatives with minimal
training data and computational resources. This research presents a promising solution
to the evolving challenges of network-based intrusion detection.

2.9 A Neural Network Architecture Combining Gated Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection in Network Traffic Data

Recurrent Neural Networks (RNNs), with variations such as the Long Short-Term
Memory (LSTM) and Gated Recurrent Unit (GRU), have gained prominence in
various machine learning tasks due to their ability to capture sequential dependencies.
These models have demonstrated effectiveness in diverse applications, including natural
language processing, speech recognition, and text classification. Traditionally, the final
output layer of RNNs often employs the Softmax function for prediction and the cross-
entropy loss function for training. However, the study at hand seeks to challenge
this convention by introducing an alternative approach. The research paper proposes

8
replacing the Softmax function with a linear Support Vector Machine (SVM) for the
final output layer of a GRU model. Additionally, the commonly used cross-entropy loss
function is replaced with a margin-based function. While previous studies have explored
similar approaches, this research specifically targets binary classification in intrusion
detection, utilizing network traffic data from 2013 honeypot systems at Kyoto University.
Results from the study indicate that the GRU-SVM model outperforms the
traditional GRU-Softmax model in binary classification. The proposed model achieves
a higher training accuracy of approximately 81.54%, as compared to the conventional
model’s training accuracy of about 63.07%. The testing accuracy of the GRU-SVM
model is also notably higher, at approximately 84.15%, in contrast to the GRU-Softmax
model’s testing accuracy of about 70.75%. These findings suggest that the introduction
of SVM in the final output layer enhances the model’s predictive capabilities. The study
goes further to investigate the computational efficiency of both models. It is observed
that the GRU-SVM model’s training and testing times are slightly shorter than those
of the GRU-Softmax model, despite the substantial amount of data processed. This
observation hints at the potential for SVM to outperform Softmax in terms of prediction
time, which aligns with theoretical expectations. Overall, this research contributes to
the evolving landscape of neural network models in intrusion detection, shedding light
on the viability of alternative output layers and loss functions in GRU models. The
results highlight the promise of SVM in enhancing model accuracy and efficiency,
providing valuable insights for further exploration in the domain of network security
and machine learning.
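
To illustrate the change of loss function described above, the following minimal sketch compares cross-entropy on a sigmoid output (the two-class case of Softmax) with a margin-based loss on the raw linear score. It is not the paper's implementation: the scores, labels, and function names are hypothetical, the GRU that would produce the scores is omitted, and the plain hinge loss is assumed here as one example of a margin-based function.

```python
import numpy as np

def cross_entropy_loss(scores, labels01):
    """Binary cross-entropy on sigmoid probabilities (labels in {0, 1})."""
    probs = 1.0 / (1.0 + np.exp(-scores))
    eps = 1e-12  # guards against log(0)
    return float(-np.mean(labels01 * np.log(probs + eps)
                          + (1 - labels01) * np.log(1 - probs + eps)))

def hinge_loss(scores, labels_pm1):
    """Margin-based hinge loss on raw scores (labels in {-1, +1})."""
    return float(np.mean(np.maximum(0.0, 1.0 - labels_pm1 * scores)))

# Hypothetical raw outputs of the final linear layer for four connections.
scores = np.array([2.5, 0.3, -1.7, -0.2])
labels01 = np.array([1, 1, 0, 0])   # 1 = attack, 0 = normal
labels_pm1 = 2 * labels01 - 1       # same labels mapped to {-1, +1}

ce = cross_entropy_loss(scores, labels01)
hinge = hinge_loss(scores, labels_pm1)
print("cross-entropy:", ce)
print("hinge:", hinge)
```

Samples that are classified correctly with a score beyond the margin contribute zero hinge loss, whereas cross-entropy continues to assign them a small penalty.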

2.10 A Fine-grained System Driven of Attacks over Several New
Representation Techniques using Machine Learning

In today’s technology-driven world, the integration of machine learning, particularly
deep learning, is pivotal for enhancing the functionality and security of real-world
systems. However, the proliferation of advanced attacks, including stealthy drive-by
downloads and malicious email attachments, poses significant threats to hosts and
networks. Traditional intrusion detection systems (IDS) face challenges in identifying
these evolving threats, necessitating the development of novel approaches. This research
paper introduces an innovative framework that employs cutting-edge representation
methodologies and attack taxonomy. Notably, it combines Convolutional Neural
Networks (CNN) and Support Vector Machines (SVM) within a Deep Neural Network
(DNN) to detect hostile network attacks. The findings reveal substantial improvements
in detection rates compared to traditional CNN and SVM models.
Furthermore, the study provides a valuable benchmark by comparing its results with
prior research that utilized the same dataset. The unique aspect of this research lies
in its utilization of signature-based representation grounded in spectral graph theory,
setting it apart from existing techniques. In sum, this study underscores the pivotal
role of machine learning in countering contemporary security challenges and highlights
the potential of novel approaches to enhance detection capabilities. It paves the way for
future advancements in the field of intrusion detection and network security.

Chapter 3
LIMITATIONS OF EXISTING SYSTEM OR
RESEARCH GAP

These ten research papers collectively contribute to the field of intrusion detection
and anomaly detection, offering valuable insights while also highlighting some common
limitations and research gaps. Several of the papers introduce novel intrusion detection
systems and machine learning techniques, demonstrating promising results but often
failing to delve into the computational complexities and practical challenges of deploying
these systems in real-world scenarios. Moreover, many of the papers rely on older datasets
for training and testing, raising concerns about the adaptability of the proposed methods
to more recent and diverse datasets. In addition, while some papers present comparisons
of various intrusion detection systems and machine learning algorithms, they often
focus on specific choices, leaving research gaps in broader comparative analyses. The
research papers collectively emphasize the importance of addressing these limitations by
optimizing computational efficiency, developing practical solutions for countering
modern malware and evasion techniques, and exploring more recent and diverse
datasets, as well as conducting comprehensive comparative studies. Additionally,
there is a need for research that addresses multi-class scenarios, adversarial attacks,
and the broader applicability of intrusion detection models in various network
environments, all while considering scalability and adaptability. These gaps in the
literature signify opportunities for future research to further enhance the field of
intrusion detection, bridging the divide between theoretical advancements and practical
implementation, and addressing the evolving challenges posed by modern network
threats.

Chapter 4
PROBLEM STATEMENT AND OBJECTIVE

4.1 Problem Statement

The effectiveness of intrusion detection systems (IDS) is often hindered by their reliance
on outdated datasets for training, leaving them vulnerable to new and evolving cyber
threats. This inadequacy stems from the rapid advancement of attack techniques and
the lack of timely updates to training data. As a result, organizations face an
increased risk of undetected intrusions, potentially leading to data breaches, financial
losses, and reputational damage. To address this critical issue, this project aims to
develop an IDS that leverages machine learning algorithms and incorporates up-to-date
datasets for effective intrusion detection. The proposed system will utilize the CICIDS
2018 and UNSW 2015 datasets, encompassing a comprehensive range of modern
attack patterns, to train machine learning models such as Random Forest and Support
Vector Machines (SVM). The resulting IDS will be capable of accurately identifying
anomalies in network traffic, enabling timely intervention and mitigation of potential
intrusions. Furthermore, the project will develop a user-friendly web application using
Flask, allowing users to input new traffic data and receive real-time predictions of
potential anomalies. This accessibility will empower network administrators and
security personnel to proactively monitor network activity and safeguard their systems
against emerging threats. By addressing the limitations of outdated training data
and providing a user-friendly interface, this project will contribute to the development
of a robust and adaptable IDS, enhancing cybersecurity preparedness and reducing the
risk of undetected intrusions.

4.2 Objectives

The objectives of this project are to:

• Develop a new IDS that is trained on the latest datasets, including the CICIDS
2018 and UNSW 2015 datasets.

• Use machine learning algorithms, such as random forest and support vector
machines (SVM), to develop a model that can accurately detect anomalies in
network traffic.

• Develop a web application using Flask that allows users to input new traffic
data and predict whether it is anomalous.

• Minimize false positives, i.e., alarms triggered by legitimate traffic, in order
to avoid disrupting normal operations.

• Ensure that the web application is easy to use for users of all skill levels.

Chapter 5
PROPOSED SYSTEM

5.1 System Architecture 1

This intrusion detection system employs a three-tiered architecture. The frontend,
built with Flask and web technologies, collects data and interacts with users. The
backend, powered by Python, preprocesses data and trains the intrusion detection model
using SVM and Random Forest algorithms. Cloud deployment, utilizing platforms like
AWS, ensures scalability and accessibility. The system offers high accuracy, scalability,
accessibility, and customization, making it a robust intrusion detection solution.

Figure [Link]: System Architecture 1

The system architecture consists of:

1. Frontend: The frontend of the system is responsible for collecting data from
the network or system being monitored and displaying the results to users. The
frontend is implemented using Flask, a lightweight web framework for Python.
HTML, CSS, and JavaScript are used to create the user interface of the web
application.

2. Backend: The backend of the system is responsible for preprocessing the collected
data, training the intrusion detection model, deploying the model, and performing
intrusion detection. The backend is implemented using Python. Two machine
learning algorithms, SVM and Random Forest, are used to train the intrusion
detection model. The CICIDS 2018 and UNSW 2015 datasets are used to train
and evaluate the model.

3. Cloud Deployment: The intrusion detection model is deployed on the cloud
using AWS or other cloud platforms. This allows the model to be scaled to meet
the needs of the system and to be accessed from anywhere in the world.

4. Evaluation and Analysis: This module is responsible for evaluating the
performance of the trained intrusion detection model on the two datasets and
displaying the results to users in the frontend. It can be implemented with
Python libraries such as scikit-learn and pandas.

The system architecture works as follows:

1. Data collection: The frontend collects data from the network or system being
monitored. This data can be in the form of network traffic packets, system logs,
or other types of data.

2. Data preprocessing: The backend preprocesses the collected data to extract
features and prepare it for machine learning training.

3. Model training: The backend trains the intrusion detection model using the
preprocessed data. Two machine learning algorithms, SVM and Random Forest,
are used in the system.

4. Model evaluation and analysis: The backend evaluates the performance of the
trained intrusion detection model on the two datasets and displays the results to
users in the frontend.

5. Model deployment: The trained intrusion detection model is deployed on the
cloud.

6. Intrusion detection: The web application uses the deployed intrusion detection
model to detect intrusions in the network or system being monitored.

7. Alerting: The web application alerts users of detected intrusions.

The following are some of the benefits of the proposed system architecture:

1. Accuracy: The system uses two machine learning algorithms, SVM and Random
Forest, both of which have been reported to achieve high accuracy in intrusion
detection.

2. Scalability: Deploying the intrusion detection model on the cloud allows the
system to be scaled to meet demand.

3. Accessibility: The web application can be accessed from anywhere in the world,
provided that there is an internet connection.

4. Flexibility: The system can be customized to meet the specific needs of the
organization.

5. Evaluation and Analysis: The system includes an Evaluation and Analysis
Module that evaluates the performance of the trained intrusion detection model
and displays the results to users in the frontend.

5.2 System Architecture 2

This architecture is based on the idea of using a large and diverse dataset of labeled
network traffic to train a machine learning model to detect intrusions.

Figure [Link]: System Architecture 2

The system architecture works as follows:
1. Data collection: The system collects network traffic data from a variety of
sources, such as production networks, honeypots, and sandboxes. Production
networks are real-world networks that are used by organizations. Honeypots
are fake networks that are designed to attract attackers. Sandboxes are isolated
environments that are used to test suspicious code or software.
The data is collected in a variety of formats, such as packet captures, flow logs,
and system logs. Packet captures contain all of the packets that are transmitted
on a network interface. Flow logs contain aggregated information about the traffic
that flows through a network. System logs contain information about events that
occur on a system, such as login attempts and process executions.

2. Data preprocessing and feature engineering: The collected data is
preprocessed and engineered to extract features that are relevant for intrusion
detection. This includes cleaning the data, removing noise, and converting the data
into a format that is compatible with the machine learning algorithms. Feature
engineering is the process of creating new features that are more informative for
intrusion detection. This can be done by combining different features, creating
new features based on existing features, or transforming features into a different
format.

3. Model training: The machine learning models are trained on the engineered
dataset. Two machine learning algorithms, SVM and Random Forest, are used
in the system. SVM is a supervised learning algorithm that can be used for
both classification and regression tasks. Random Forest is an ensemble learning
algorithm that combines the predictions of multiple decision trees to produce a
more accurate prediction.

4. Model evaluation: The trained models are evaluated on a held-out test set to
assess their performance. This helps to identify any areas where the models need
improvement. The evaluation metrics that are used depend on the specific machine
learning algorithms that are used. For example, common evaluation metrics for
classification tasks include accuracy, precision, recall, and F1 score.

5. Model deployment: The trained models are deployed to production. This can
be done by deploying them to a cloud platform or to on-premises servers. The
specific deployment method that is used depends on the specific requirements of
the organization.

6. Intrusion detection: The deployed models are used to detect intrusions in real
time. The models analyze network traffic and generate alerts when they detect
malicious activity. The alerts can be sent to security analysts or to automated
systems that can take action to respond to the threat.

7. Evaluation and analysis: The performance of the deployed models is
continuously evaluated and analyzed. This helps to identify any changes in the
attack landscape and to ensure that the models are still effective. The results
of the evaluation and analysis are displayed to users in the frontend of the web
application. This allows users to monitor the performance of the models over time
and to make adjustments as needed.

5.3 Model Training

The SVM and Random Forest algorithms will be implemented to train the IDS models.
SVM is a supervised learning algorithm that separates data points into different classes
using hyperplanes in high-dimensional space. Random Forest is an ensemble learning
method that constructs multiple decision trees and combines their predictions. The
training process involves splitting the datasets into training and testing sets. The models
will be trained on the training set and evaluated on the testing set to measure their
performance. Various metrics, such as accuracy, precision, recall, and F1-score, will be
used to assess the models’ effectiveness in detecting intrusions.
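
The training and evaluation procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's final code: a synthetic dataset from `make_classification` stands in for the preprocessed CICIDS 2018 and UNSW 2015 features, and the hyperparameters (100 trees, RBF kernel, C = 1.0, 80/20 split) are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for preprocessed traffic features (1 = attack, 0 = normal).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=42)

# Split the data into a training set and a held-out testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # SVMs are sensitive to feature scale, so scaling is part of the pipeline.
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print(name, results[name])
```

Stratifying the split keeps the ratio of attack to normal records the same in both sets, which matters when attacks are rare.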

5.3.1 Random Forest Algorithm:

Random Forest is a popular machine learning algorithm that is used for both
classification and regression tasks. It is an ensemble learning method that combines
multiple decision trees to make predictions. The algorithm gets its name from the fact
that it creates a "forest" of decision trees, where each tree is built using a random subset
of the training data. The key components of Random Forest are:

1. Bootstrapping: Random Forest employs bootstrapping, which is a sampling
technique where multiple subsets of the original training data are created by
randomly selecting samples with replacement. This means that some samples
may appear multiple times in a subset, while others may not appear at all.
Bootstrapping helps introduce diversity in the training data for each decision tree.

2. Random Feature Selection: In addition to using random subsets of the training
data, Random Forest also uses random feature selection. At each node of a decision
tree, only a subset of features is considered for splitting. This further enhances the
diversity among the trees and prevents any single feature from dominating the
decision-making process.

3. Building Decision Trees: Each decision tree in the Random Forest is built using
a recursive process called recursive partitioning. The goal is to split the data at
each node based on a selected feature and its corresponding threshold value. The
splitting criterion, such as Gini impurity or information gain, is used to determine
the best feature and threshold for each split. The process continues until a stopping
criterion is met, such as reaching a maximum depth or minimum number of
samples in a leaf node.

4. Voting and Aggregation: Once all the decision trees are built, the Random
Forest algorithm combines their predictions to make the final prediction. For
classification tasks, the most common class predicted by the individual trees is
selected as the final prediction. For regression tasks, the average or median of
the predicted values is taken as the final prediction. This voting and aggregation
process helps reduce the variance and improve the overall accuracy of the model.

Algorithm:

1. Initialize the number of trees (T) and the maximum depth of each tree (d)

2. For each tree:

(a) Draw a bootstrap sample from the training data by sampling with replacement
(b) At each node, select a random subset of the features and choose the best split
among them
(c) Grow the tree until the maximum depth (d) or another stopping criterion is
reached
(d) Repeat steps a to c until all T trees are trained

3. For each tree:

(a) Compute the prediction (y) for the test data using the trained decision tree
(b) Compute the error (e) between the predicted output (y) and the actual output
(y’)

4. Combine the predictions of all trees to produce the final output (y_rand_forest)
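
The bootstrapping, random feature selection, and majority voting steps described above can be sketched as follows. This is a simplified illustration rather than the project's implementation: scikit-learn's `DecisionTreeClassifier` is used as the base learner, the helper names `fit_forest` and `predict_forest` are hypothetical, and the data is synthetic. In practice `RandomForestClassifier` performs all of these steps internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, max_depth=5, seed=0):
    """Train n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrapping: draw n_samples row indices with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            max_features="sqrt",  # random feature subset considered at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over the class predictions of all trees."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # votes has shape (n_trees, n_samples); count the votes per sample.
    return np.array([np.bincount(column).argmax() for column in votes.T])

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=1)
trees = fit_forest(X[:450], y[:450])
y_pred = predict_forest(trees, X[450:])
accuracy = float(np.mean(y_pred == y[450:]))
print("held-out accuracy:", accuracy)
```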

Implementation in Scikit-learn:
For each decision tree, Scikit-learn calculates a node’s importance using Gini Importance,
assuming only two child nodes (binary tree):

\[ ni_j = w_j C_j - w_{\text{left}(j)} C_{\text{left}(j)} - w_{\text{right}(j)} C_{\text{right}(j)} \tag{5.1} \]

where,

• $ni_j$ = the importance of node j

• $w_j$ = weighted number of samples reaching node j

• $C_j$ = the impurity value of node j

• left(j) = child node from left split on node j

• right(j) = child node from right split on node j

The importance for each feature on a decision tree is then calculated as:

\[ fi_i = \frac{\sum_{j:\ \text{node } j \text{ splits on feature } i} ni_j}{\sum_{k \in \text{all nodes}} ni_k} \tag{5.2} \]

where,

• $fi_i$ = the importance of feature i

• $ni_j$ = the importance of node j

These can then be normalized to a value between 0 and 1 by dividing by the sum of all
feature importance values:

\[ normfi_i = \frac{fi_i}{\sum_{j \in \text{all features}} fi_j} \tag{5.3} \]

The final feature importance, at the Random Forest level, is its average over all the
trees. The sum of the feature’s importance values over all trees is calculated and divided
by the total number of trees:

\[ RFfi_i = \frac{\sum_{j \in \text{all trees}} normfi_{ij}}{T} \tag{5.4} \]

where,

• $RFfi_i$ = the importance of feature i calculated from all trees in the Random Forest
model

• $normfi_{ij}$ = the normalized feature importance for i in tree j

• T = total number of trees
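
As a check on equations (5.1) to (5.4), the sketch below recomputes the feature importances by hand from the fitted trees and compares them with the `feature_importances_` attribute that scikit-learn reports. The dataset is synthetic and the helper name `tree_feature_importance` is hypothetical. Leaf nodes are treated as having zero importance, so the denominator of (5.2) equals the sum over split nodes; after that division the values already sum to 1, which makes the normalization in (5.3) a no-op here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def tree_feature_importance(fitted_tree, n_features):
    """Normalized feature importance of one decision tree (eqs. 5.1 to 5.3)."""
    t = fitted_tree.tree_
    w = t.weighted_n_node_samples   # w_j: weighted samples reaching node j
    C = t.impurity                  # C_j: impurity of node j
    fi = np.zeros(n_features)
    for j in range(t.node_count):
        left, right = t.children_left[j], t.children_right[j]
        if left == -1:              # leaf node: no split, no contribution
            continue
        ni_j = w[j] * C[j] - w[left] * C[left] - w[right] * C[right]  # eq. 5.1
        fi[t.feature[j]] += ni_j    # numerator of eq. 5.2
    total = fi.sum()                # denominator of eq. 5.2
    return fi / total if total > 0 else fi

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Eq. 5.4: average the normalized per-tree importances over all T trees.
per_tree = np.array([tree_feature_importance(est, X.shape[1])
                     for est in rf.estimators_])
rf_importance = per_tree.mean(axis=0)

matches = bool(np.allclose(rf_importance, rf.feature_importances_))
print("matches scikit-learn:", matches)
```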

5.3.2 Support Vector Machine Algorithm:

Support Vector Machines (SVM) is a powerful machine learning algorithm that is widely
used for classification and regression tasks. It is based on the principles of statistical
learning theory and aims to find an optimal hyperplane that separates different classes or
predicts continuous values. In this explanation, we will delve into the mathematical
terms and concepts behind SVM. Let’s start by considering a binary classification
problem where we have a set of labeled training data points. Each data point is
represented by a feature vector x and belongs to one of two classes, either positive
(+1) or negative (-1). The goal of SVM is to find a hyperplane in the feature space
that maximally separates the two classes.
Mathematically, we can represent the hyperplane as a linear equation:

w·x+b=0 (5.5)

where w is the normal vector to the hyperplane and b is the bias term. The sign of
w · x + b determines on which side of the hyperplane a data point lies. If w · x + b > 0,
then the data point belongs to the positive class, otherwise it belongs to the negative
class.

To find the optimal hyperplane, SVM aims to maximize the margin between the
hyperplane and the closest data points from each class. These data points are known as
support vectors, hence the name "Support Vector Machines". The margin is defined as
the perpendicular distance between the hyperplane and these support vectors.

Let’s denote the set of support vectors as S. For any given data point $x_i$ in S, we have:

\[ w \cdot x_i + b = 1 \quad \text{if } y_i = +1 \tag{5.6} \]
\[ w \cdot x_i + b = -1 \quad \text{if } y_i = -1 \tag{5.7} \]

where $y_i$ represents the class label of $x_i$. The margin can be calculated as:

\[ \text{margin} = \frac{w}{\|w\|} \cdot (x_i^{+} - x_i^{-}) = \frac{2}{\|w\|} \tag{5.8} \]

where $x_i^{+}$ and $x_i^{-}$ are two support vectors from the positive and negative classes,
respectively. The objective of SVM is to maximize this margin, which can be formulated
as an optimization problem.
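
The step from (5.6) and (5.7) to (5.8) can be made explicit. The short derivation below is a sketch that uses only the symbols already defined: subtracting (5.7) from (5.6) removes the bias term, and projecting onto the unit normal gives the margin. The last line is the standard restatement of margin maximization as a minimization, which is the form taken by the objective in (5.9).

```latex
\begin{aligned}
(w \cdot x_i^{+} + b) - (w \cdot x_i^{-} + b) &= 1 - (-1) \\
w \cdot (x_i^{+} - x_i^{-}) &= 2 \\
\frac{w}{\|w\|} \cdot (x_i^{+} - x_i^{-}) &= \frac{2}{\|w\|} \\
\max_{w,\,b} \; \frac{2}{\|w\|} \;&\Longleftrightarrow\; \min_{w,\,b} \; \frac{1}{2}\|w\|^{2}
\end{aligned}
```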
To handle cases where the data is not linearly separable, SVM introduces the concept
of slack variables. These variables allow for a certain amount of mis-classification or
overlapping between the classes. Let’s denote the slack variables as ξi, where ξi ≥ 0 for
all data points. The optimization problem can then be formulated as:

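\[
\begin{aligned}
\text{minimize:} \quad & \frac{1}{2}\|w\|^{2} + C \sum_i \xi_i \\
\text{subject to:} \quad & y_i (w \cdot x_i + b) \ge 1 - \xi_i \\
& \xi_i \ge 0
\end{aligned}
\tag{5.9}
\]
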
where C is a hyperparameter that controls the trade-off between maximizing the
margin and minimizing the misclassification errors. A larger value of C allows for fewer
misclassifications but may result in a smaller margin, while a smaller value of C allows
for a larger margin but may lead to more misclassifications.

To solve this optimization problem, we can use techniques from convex optimization,
such as quadratic programming or Lagrange duality. The solution will provide us with
the optimal values of w and b, which define the hyperplane that separates the classes.
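
The quantities discussed in this section can be inspected on a small example. The sketch below fits a linear SVM with scikit-learn's `SVC` on six hypothetical two-dimensional points, reads off w and b, applies the decision rule sign(w · x + b), and computes the margin 2/∥w∥. A very large C is assumed to approximate the hard-margin case, and a small C is used to show the trade-off described above; both values are chosen only for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: +1 and -1 classes.
X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

# Large C: misclassification is penalized heavily (close to a hard margin).
hard = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = hard.coef_[0], hard.intercept_[0]

scores = X @ w + b                      # w . x + b for every point
pred = np.where(scores > 0, 1, -1)      # decision rule from the sign
margin = 2.0 / np.linalg.norm(w)        # margin width, eq. (5.8)

# Small C: slack is cheap, so the optimizer prefers a wider margin.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
soft_margin = 2.0 / np.linalg.norm(soft.coef_[0])

print("w =", w, "b =", b)
print("hard margin:", margin, "soft margin:", soft_margin)
print("support vectors:", hard.support_vectors_)
```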

5.4 UML Diagrams

5.4.1 Use Case Diagram

Figure [Link]: Use Case Diagram

Users can view alerts generated by the intrusion detection system, view the system
status, and download reports.
Actors: User

Use Cases:

1. Configure System

2. Train Model

3. Deploy Model

4. Monitor System

Users would be able to log into the web application to view the following:

1. A list of alerts generated by the intrusion detection system, including the time
of the alert, the type of alert, and the source IP address of the malicious
activity.
2. The status of the intrusion detection system, such as whether it is running,
stopped, or in maintenance mode.

3. Reports on the performance of the intrusion detection system, such as the number
of alerts generated, the types of attacks detected, and the accuracy of the
system.

This information would allow users to stay informed about the security of their
network and to take action to respond to threats.

5.4.2 Activity Diagram

Figure [Link]: Activity Diagram

This activity diagram shows the high-level steps involved in developing and deploying
an intrusion detection system using machine learning.
Actors:

• System administrator: Responsible for developing, deploying, and monitoring
the intrusion detection system.

Activities:

1. Collect and prepare data: Collect network traffic data and prepare it for
training and testing the machine learning models.

2. Train machine learning models: Train two machine learning models, SVM
and random forest, on the prepared data.

3. Evaluate machine learning models: Evaluate the performance of the two
machine learning models on a held-out test set.

4. Deploy machine learning models: Deploy the two machine learning models to
a cloud platform.

5. Develop web application: Develop a web application using Flask, HTML, CSS,
and JavaScript to allow users to view the results of the intrusion detection system
and manage the system.

6. Deploy web application: Deploy the web application to AWS or another cloud
platform.

7. Monitor intrusion detection system: Monitor the intrusion detection system
for performance and security issues.

5.4.3 Sequence Diagram

Figure [Link]: Sequence Diagram

The sequence diagram provides a high-level overview of the steps involved in the
intrusion detection system using machine learning. It shows the interactions between the
different components of the system, including the user, the web application, and the
intrusion detection model.
The sequence diagram shows the following steps involved in the intrusion detection
system using machine learning:

1. The user sends a request to the web application.

2. The web application forwards the request to the intrusion detection model.

3. The intrusion detection model analyzes the request and generates a prediction.

4. The intrusion detection model returns the prediction to the web application.

5. The web application displays the prediction to the user.

6. The web application collects data about the performance of the intrusion detection
models.

7. The web application analyzes the data to identify areas where the models can
be improved.

5.4.4 Class Diagram

Figure [Link]: Class Diagram

Class Diagram for Intrusion Detection System Using Machine Learning with 3 classes:

1. Detects_Intrusion: This class represents the overall intrusion detection system.
It has the following attributes and methods:

• Attack_ID: A unique identifier for the intrusion.
• Attack_Name: The type of intrusion.
• Detecting_Intrusion( ): This method takes network traffic as input and
returns a prediction of whether or not the traffic is malicious.

2. Random_Forest: This class represents a random forest machine learning model.
It has the following attributes and methods:

• protocol_type: Type of protocol (TCP, UDP, ...)
• service: Destination service (ftp, telnet, ...)
• srv_error_rate: Percentage of connections with SYN errors,
and many more features.
• Generating_random_vectors( ): This method generates random vectors
from the network traffic features.
• Building_trees( ): This method builds a set of decision trees from the
random vectors.

3. SVM: This class represents a support vector machine (SVM) machine learning
model. It has the following attributes and methods:

• protocol_type: Type of protocol (TCP, UDP, ...)
• service: Destination service (ftp, telnet, ...)
• srv_error_rate: Percentage of connections with SYN errors,
and many more features.
• Feature_vector_respresentation( ): This method converts the network
traffic features into a feature vector representation.
• Selecting_hyperplane( ): This method selects the hyperplane that best
separates the normal and malicious network traffic.

Relationships between the classes:

1. The Detects_Intrusion class uses the Random_Forest and SVM classes to
detect intrusions in network traffic.

2. The Random_Forest and SVM classes use the protocol_type, service, and
srv_error_rate attributes to build their machine learning models.

Chapter 6
EXPERIMENTAL SETUP

6.1 Introduction:

The purpose of this experimental setup is to develop an intrusion detection system (IDS)
using machine learning algorithms, specifically Support Vector Machines (SVM) and
Random Forest. The IDS will be trained and evaluated on two datasets: CICIDS 2018
and UNSW 2015. Additionally, a web application will be built using Flask, HTML,
CSS, and JavaScript to provide a user-friendly interface for interacting with the IDS.
The model will be deployed on a cloud platform, such as AWS, to ensure scalability and
availability.

6.2 Dataset Selection:

The first step in the experimental setup is to select appropriate datasets for training
and evaluation. The CICIDS 2018 dataset is a widely used benchmark dataset for
network intrusion detection research. It contains a large number of network traffic
records with various types of attacks and normal traffic. The UNSW 2015 dataset is
another popular dataset that includes both normal and attack traffic captured in a
controlled environment.

6.3 Data Preprocessing:

Before training the machine learning models, it is essential to preprocess the datasets to
ensure data quality and compatibility with the algorithms. This step involves removing
duplicates, handling missing values, normalizing features, and encoding categorical
variables if necessary. Additionally, feature selection techniques can be applied to reduce
dimensionality and improve model performance.
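
The preprocessing steps listed above can be sketched with pandas and scikit-learn as follows. The miniature table, its column names, and the helper `preprocess` are hypothetical and serve only to show the order of operations; the real CICIDS 2018 and UNSW 2015 files have many more columns. Median imputation, min-max scaling, and one-hot encoding are assumed here as one reasonable choice for each step.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature traffic table; column names are illustrative only.
raw = pd.DataFrame({
    "duration": [0.5, 0.5, 12.0, np.nan, 3.2],
    "src_bytes": [100.0, 100.0, 2500.0, 40.0, 800.0],
    "protocol_type": ["tcp", "tcp", "udp", "tcp", "icmp"],
    "label": [0, 0, 1, 0, 1],
})
NUMERIC_COLS = ["duration", "src_bytes"]
CATEGORICAL_COLS = ["protocol_type"]

def preprocess(df):
    """Deduplicate, impute, normalize, and encode a traffic table."""
    df = df.drop_duplicates().reset_index(drop=True)           # remove duplicates
    df[NUMERIC_COLS] = df[NUMERIC_COLS].fillna(
        df[NUMERIC_COLS].median())                             # handle missing values
    df[NUMERIC_COLS] = MinMaxScaler().fit_transform(
        df[NUMERIC_COLS])                                      # normalize to [0, 1]
    df = pd.get_dummies(df, columns=CATEGORICAL_COLS)          # encode categoricals
    labels = df.pop("label")
    return df, labels

X, y = preprocess(raw)
print(X)
```

When a separate test set is used, the scaler and the imputation values should be fitted on the training portion only and then applied to the test portion, so that no information from the test data leaks into training.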

6.4 Model Training:

In this step, the SVM and Random Forest algorithms will be trained on the preprocessed
datasets. SVM is a supervised learning algorithm that separates data points into
different classes using hyperplanes in high-dimensional space. Random Forest is an
ensemble learning method that combines multiple decision trees to make predictions.
Both algorithms have been proven effective in intrusion detection tasks.

6.5 Model Evaluation:

To evaluate the performance of the trained models, various metrics will be used,
including accuracy, precision, recall, F1-score, and area under the receiver operating
characteristic curve (AUC-ROC). The evaluation will be conducted using cross-validation
techniques to obtain robust estimates and to detect overfitting. Additionally, confusion matrices
and classification reports will be generated to provide detailed insights into the model’s
performance.
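
A minimal sketch of this evaluation procedure is shown below, assuming scikit-learn, a synthetic dataset in place of the real traffic data, and five stratified folds (the number of folds is an assumption, not a project decision). Only the Random Forest model is evaluated to keep the example short; the SVM would be evaluated in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_validate

# Synthetic stand-in for the preprocessed dataset (1 = attack, 0 = normal).
X, y = make_classification(n_samples=500, n_features=15, n_informative=6,
                           random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Score every fold on all five metrics in a single pass.
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
mean_scores = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
print(mean_scores)

# Out-of-fold predictions give one confusion matrix over the whole dataset.
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)   # rows: actual class, columns: predicted class
print(cm)
print(classification_report(y, y_pred, target_names=["normal", "attack"]))
```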

6.6 Web Application Development:

To make the IDS accessible to users, a web application will be developed using Flask,
HTML, CSS, and JavaScript. Flask is a lightweight web framework that allows easy
integration with Python-based machine learning models. The frontend of the web
application will be designed using HTML, CSS, and JavaScript to provide an intuitive
and user-friendly interface for interacting with the IDS.

6.7 Model Deployment:

The trained machine learning models will be deployed on a cloud platform, such as AWS
or another suitable platform. This ensures that the IDS is accessible from anywhere
and can handle a large number of concurrent requests. The deployment process involves
containerizing the models using technologies like Docker and deploying them on cloud
instances or serverless architectures.

6.8 Integration with Frontend:

The deployed models will be integrated with the frontend of the web application. This
integration allows users to input network traffic data through the web interface and
receive real-time predictions from the IDS. The frontend will communicate with the
deployed models through APIs or other communication protocols.
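
One possible shape for such an API is sketched below with Flask. The route name `/predict`, the JSON field `features`, and the constant `N_FEATURES` are hypothetical, and the model is a stand-in trained on synthetic data; a real deployment would load the trained model from storage instead. The endpoint is exercised in-process with Flask's test client, so no server needs to be started.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

N_FEATURES = 10  # hypothetical number of input features

# Stand-in model trained on synthetic data (1 = attack, 0 = normal).
X, y = make_classification(n_samples=300, n_features=N_FEATURES, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    """Classify one traffic record sent as {"features": [...]}."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return jsonify(error=f"expected a list of {N_FEATURES} numeric features"), 400
    try:
        row = np.asarray(features, dtype=float).reshape(1, -1)
    except (TypeError, ValueError):
        return jsonify(error="features must be numeric"), 400
    label = int(model.predict(row)[0])
    return jsonify(prediction="attack" if label == 1 else "normal")

# Exercise the endpoint in-process, without starting a server.
client = app.test_client()
response = client.post("/predict", json={"features": X[0].tolist()})
bad_response = client.post("/predict", json={"features": [1, 2]})
print(response.status_code, response.get_json())
print(bad_response.status_code, bad_response.get_json())
```

Validating the input before calling the model keeps malformed requests from raising server errors, which matters for a service that is exposed to users.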

6.9 Evaluation and Analysis Display:

The evaluation results obtained during model training and testing will be displayed to
the users in the frontend of the web application. This includes metrics such as accuracy,
precision, recall, F1-score, AUC-ROC, as well as visualizations like confusion matrices
and classification reports. The purpose of displaying these results is to provide users
with insights into the performance of the intrusion detection system and help them
make informed decisions.

6.10 Details about Inputs to the System

1. Datasets: The system will be trained on datasets of labeled network traffic. This
helps the system to learn to identify different types of intrusions. The datasets we
will be using are CICIDS 2018 and UNSW 2015.

2. Packet captures: Packet captures contain all of the packets that are transmitted
on a network interface. This information can be used to identify suspicious patterns
of traffic, such as a large number of packets from a single IP address or a large
number of packets with the same destination port.

3. Flow logs: Flow logs contain aggregated information about the traffic that flows
through a network. This information can be used to identify unusual traffic
patterns, such as a sudden increase in traffic to a particular service.

4. System logs: System logs contain information about events that occur on a
system, such as login attempts and process executions. This information can be
used to identify suspicious activity, such as a large number of failed login attempts
or a process that is using a lot of resources.

5. Configuration parameters: The system can be configured with a variety of
parameters, such as the types of attacks to detect, the alerting thresholds, and the
machine learning algorithms to use.
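To make the packet-capture and configuration inputs concrete, the following is a minimal sketch, assuming packets have already been parsed into (source IP, destination port) pairs. The `IDSConfig` fields, their default values, and the `flag_noisy_sources` helper are hypothetical illustrations, not part of the planned system's interface.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IDSConfig:
    """Illustrative configuration; names and defaults are assumptions."""
    attack_types: tuple = ("dos", "port_scan", "brute_force")
    max_packets_per_source: int = 100   # alerting threshold per capture window
    algorithm: str = "random_forest"    # or "svm"


def flag_noisy_sources(packets, config):
    """Return source IPs whose packet count exceeds the configured threshold."""
    counts = Counter(src_ip for src_ip, _dst_port in packets)
    return sorted(ip for ip, n in counts.items() if n > config.max_packets_per_source)


if __name__ == "__main__":
    config = IDSConfig(max_packets_per_source=3)
    packets = [("10.0.0.5", 80)] * 5 + [("10.0.0.7", 443)] * 2
    print(flag_noisy_sources(packets, config))
```

A simple count like this is only a heuristic for spotting suspicious patterns; in the planned system the trained classifiers make the actual decision.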

6.11 Evaluation Parameters

The intrusion detection system will be assessed using two machine learning
algorithms, Random Forest and Support Vector Machine, on two datasets, CICIDS 2018
and UNSW 2015. The following evaluation parameters will be used:

1. Accuracy: The accuracy of the intrusion detection system is the percentage of
correctly classified data points. It is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where,

• TP (true positive) is the number of correctly classified malicious data points.
• TN (true negative) is the number of correctly classified normal data points.
• FP (false positive) is the number of normal data points that are incorrectly
classified as malicious.
• FN (false negative) is the number of malicious data points that are incorrectly
classified as normal.

2. Precision: The precision of the intrusion detection system is the percentage of
predicted positive cases that are actually positive. It is calculated as follows:

Precision = TP / (TP + FP)

3. Recall: The recall of the intrusion detection system is the percentage of actual
positive cases that are correctly predicted. It is calculated as follows:

Recall = TP / (TP + FN)

4. F1 Score: The F1 score is the harmonic mean of precision and recall. It is calculated
as follows:

F1 score = 2 × (Precision × Recall) / (Precision + Recall)

5. Area under the ROC curve (AUC): The AUC is a measure of the overall
performance of a classifier. It is the area under the curve obtained by plotting the
true positive rate (TPR) against the false positive rate (FPR) at different thresholds.
The AUC ranges from 0 to 1, with a higher AUC indicating better performance; a value
of 0.5 corresponds to random guessing.

6. ROC curve: The ROC curve is a graphical representation of the performance of a
classifier. It is drawn by plotting the TPR against the FPR at different thresholds.
The ROC curve can be used to visualize the trade-off between sensitivity and
specificity.
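The metrics above can be sketched in plain Python as follows. This is a minimal illustration rather than the project's evaluation code (which is expected to rely on scikit-learn); the function names `confusion_metrics` and `roc_auc` and the example counts are made up for illustration. The AUC is computed with the rank-based formulation, which equals the area under the ROC curve: the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen normal one, with ties counted as one half.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """AUC as the fraction of (malicious, normal) pairs ranked correctly.

    labels: 1 for malicious, 0 for normal; scores: higher means more suspicious.
    Quadratic in the number of samples, so only suitable for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up counts: 80 attacks caught, 900 normal flows passed,
    # 20 false alarms, 20 missed attacks.
    print(confusion_metrics(tp=80, tn=900, fp=20, fn=20))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

With imbalanced traffic such as the made-up counts here, accuracy stays high even though one in five attacks is missed, which is why precision, recall, and F1 are reported alongside it.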

6.12 Software and Hardware Setup

6.12.1 Software and Packages Requirements:

1. Machine Learning Tools:

(a) Python: Python will be used to implement the entire system, including the
data preprocessing, feature extraction, machine learning model training and
evaluation, and web application development.
(b) Scikit-learn: Scikit-learn will be used to train and evaluate the machine
learning models for intrusion detection.
(c) Jupyter Notebook or JupyterLab: Jupyter will be used for experimenting with and
documenting the machine learning code.

2. Web Application:

(a) Flask: Flask will be used to develop a web application that can be used to
interact with the intrusion detection system.
(b) HTML, CSS and JavaScript: HTML, CSS and JavaScript will be used to
develop the user interface (UI) of the web application that is used to
interact with the intrusion detection system.

3. Data Analysis:

(a) Pandas: Pandas will be used to preprocess the network traffic data and
extract features that are relevant for intrusion detection.
(b) Matplotlib: Matplotlib will be used to visualize the network traffic data and
the results of the machine learning model evaluation.
(c) NumPy: NumPy will be used for numerical operations and linear algebra
computations.

4. Databases:

(a) PostgreSQL: A powerful open-source relational database management
system.
(b) MongoDB: A NoSQL database for flexible data storage and retrieval.

5. Model Deployment:

(a) Docker and Kubernetes: Tools for containerization and container
orchestration.
(b) Amazon SageMaker or other cloud-based machine learning platforms for
model deployment.

6.12.2 Hardware Requirements:

1. Operating System: These editions offer features and compatibility suitable for
development and deployment

2. Memory: 16 GB RAM or more. Machine learning tasks, especially when
working with larger datasets, benefit from ample RAM. A minimum of 16 GB
is recommended for smooth development and training experiences.

Chapter 7
RESULTS AND DISCUSSION

7.1 HOME

Figure: Home

The home page serves as the central hub of our system, providing users with a
welcoming introduction to our platform’s capabilities and features.

7.2 ABOUT US

Figure: About Us

The About Us page introduces our organization and the dedicated team behind our
solutions. It describes our mission, vision, and commitment to revolutionizing
cybersecurity.

7.3 CONTACT US

Figure: Contact Us

The Contact Us page lets users connect with us to learn more about our project
and services, and reach out to our team for inquiries and support.

7.4 ANALYSIS

Figure: Analysis 1

Figure: Analysis 2

The Analysis page provides an in-depth analysis of network traffic patterns and
intrusion detection outcomes. It gives insights into the effectiveness of our algorithms
and methodologies in identifying potential security threats, visualized with graphs.

7.5 FEATURE DESCRIPTION

Figure: Feature Description

The Feature Description page presents the comprehensive features of our intrusion
detection system. From real-time monitoring to predictive analytics, it shows how our
platform empowers organizations to safeguard their digital assets.

7.6 RESULTS

Figure: Results

The Results page presents detailed results and findings from our intrusion
detection system. It gives insights into network traffic categorization, model
predictions, and performance, and displays the results with visual graphs.
Chapter 8
IMPLEMENTATION PLAN OF NEXT SEMESTER

8.1 Implementation Plan:

Table 8.1: Project Implementation Plan

Week   Tasks                        Description

1-2    Project kickoff,             Define project scope, objectives, and
       requirements gathering,      deliverables. Identify specific user
       and dataset preparation.     requirements for the web application.

3      Research and dataset         Acquire the CICIDS 2018 and UNSW 2015
       preparation.                 datasets. Preprocess and clean the datasets,
                                    handling missing values and outliers.

4-5    Algorithm selection and      Select the machine learning algorithms (SVM
       development environment      and Random Forest) for intrusion detection.
       setup.                       Set up the development environment with
                                    Python, Jupyter Notebooks, and necessary
                                    libraries (scikit-learn, Flask, etc.).

6-7    Model training and           Divide the dataset into training and testing
       evaluation.                  sets. Train and evaluate the SVM and
                                    Random Forest models.

8-9    Web application              Design the user interface using HTML,
       development with Flask.      CSS, and JavaScript. Integrate the trained
                                    models into the web application for real-time
                                    detection.

10-11  Cloud deployment on a        Deploy the web application and machine
       chosen platform.             learning models on the cloud. Configure
                                    security measures for the cloud
                                    environment.

12     Evaluation, analysis, and    Implement evaluation metrics (e.g.,
       user testing.                accuracy, precision, recall, F1-score) for
                                    the models. Create visualizations and
                                    dashboards to display results to users.

13-15  Documentation, final         Document the project, including code,
       adjustments, presentation,   architecture, and user instructions.
       and project wrap-up.
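
As an illustration of the train/test division planned for weeks 6-7, the sketch below performs a stratified split in plain Python, so that rare attack classes keep the same share in both sets. It is an assumption-laden example, not the project's pipeline: the `stratified_split` helper, the 80/20 ratio, and the toy labels are hypothetical, and in practice scikit-learn's `train_test_split` with its `stratify` argument provides the same behaviour.

```python
import random
from collections import defaultdict


def stratified_split(rows, labels, test_fraction=0.2, seed=42):
    """Split rows into train/test sets while preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test sample for every class that has two or more rows.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(rows[i], labels[i]) for i in sorted(train_idx)]
    test = [(rows[i], labels[i]) for i in sorted(test_idx)]
    return train, test


if __name__ == "__main__":
    # Toy data: 90 normal flows and 10 attack flows (imbalanced, as IDS data often is).
    labels = ["normal"] * 90 + ["attack"] * 10
    rows = [[i] for i in range(100)]
    train, test = stratified_split(rows, labels)
    print(len(train), len(test))
    print(sum(1 for _, y in test if y == "attack"))
```

Stratifying matters for intrusion data because a purely random split can leave a rare attack class almost absent from the test set, which would distort recall and F1.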

Chapter 9
CONCLUSION

In summary, our project aims to develop an Intrusion Detection System (IDS) using
machine learning techniques. By leveraging Support Vector Machine (SVM) and Random
Forest algorithms and incorporating two diverse and extensive datasets, CICIDS 2018
and UNSW 2015, our project aspires to enhance network security and detect intrusions
effectively.

The web application built using Flask, HTML, CSS, and JavaScript will
serve as the user interface for our IDS. This application will provide a user-friendly
platform for security monitoring and intrusion alerts. The deployment of our machine
learning models to a cloud environment, potentially AWS or another suitable platform,
will ensure scalability and accessibility. This cloud-based approach is a strategic choice
to accommodate the dynamic nature of network security threats.

One of the primary
objectives of this project is the implementation of thorough evaluation and analysis
processes for our machine learning models. This analysis will provide valuable insights
into the performance and effectiveness of the IDS, which will be accessible to users via
the frontend of our web application. We expect the system to achieve high accuracy,
precision, recall, and F1 score on both datasets. We also expect the Random Forest
algorithm to outperform the SVM algorithm on both datasets.

The intrusion detection
system developed in this project will be able to protect networks from a variety of attacks,
including denial-of-service attacks, port scans, and brute-force attacks. The system can
be deployed in a variety of environments, including enterprise networks, government
networks, and cloud-based networks.

While our project is still in the planning phase,
these initial steps represent the foundation for a robust and advanced Intrusion
Detection System. The successful completion of this project will contribute significantly
to the field of network security and play a crucial role in safeguarding critical systems
against cyber threats.

