
2022 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO)

Amity University, Noida, India. Oct 13-14, 2022

Anomaly based Intrusion Detection Model using Supervised Machine Learning Techniques

978-1-6654-7433-7/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICRITO56286.2022.9965050


Swati Goel, Kalpna Guleria*, Surya Narayan Panda
Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
swati.goel@chitkara.edu.in, guleria.kalpna@gmail.com, snpanda@chitkara.edu.in

Abstract: An intrusion detection system (IDS) is a software system that keeps track of network traffic and looks for anomalies. Abnormal or unusual network changes could be signs of fraud at any phase, from the start of an attempt to a complete intrusion. Since data sharing primarily depends on the internet, it must be safe. For internet security, data encryption and authentication are insufficient, and firewalls are unable to identify fraudulent packets that are fragmented. Moreover, attackers frequently vary their strategy, equipment, methods and tactics, which can have disastrous effects such as lost productivity, financial loss and data loss. It therefore becomes essential to put in place an effective intrusion detection system, which is a very challenging task. In this paper, several supervised Machine Learning (ML) algorithms, namely J48, Random Forest, Random Tree, Hoeffding Tree and Logistic, are applied and compared on how accurately they detect attacks. The analysis was performed for three data split ratios, and the algorithm that gives the best accuracy is suggested for future predictions. Performance measures such as accuracy, execution time, precision, F-measure and ROC area are also analyzed. Random Forest exhibits the best accuracy at every split ratio, reaching 99.84% at a split ratio of 80:20. The execution time to build and test the model is lowest for Random Tree. As accuracy is the prime concern for an intrusion detection system, Random Forest is suggested as the best solution because it provides the highest accuracy.

Keywords: Anomaly based Intrusion detection model, Network Security, Machine Learning techniques, Network Attack.

I. INTRODUCTION

Data protection is of paramount importance in today's world. The vast amount of data flowing between corporations and consumers needs to be secured, considering the trust placed in them. A company can spend millions of dollars on the most secure servers, but it takes only a single hacker to ruin the goodwill between organizations. To prevent these malicious attacks, many automated security systems have been developed, which are known as Intrusion Detection Systems (IDS) [1] [2]. An IDS is a host or system that is inserted into a network to record traffic and detect malicious activity based on predetermined rules. This malicious conduct is then recorded and a notification of an intrusion is sent to the appropriate parties, thus identifying attempts to compromise the integrity, confidentiality or accessibility of resources and assets [3].

Threats to computer networks, infrastructure and equipment have grown rapidly in recent years. Hence, cyber security has become a significant issue. Malicious activity detection is a serious issue that needs to be solved, and an Intrusion Detection System can be used to deal with cyber-attacks. These systems continuously monitor a network to find anomalous activity or intrusions [4]. Various Machine Learning (ML) techniques, a subset of artificial intelligence capable of finding hidden patterns or trends in data and making precise predictions, can be used for intrusion detection [5].

The paper is organized as follows: Section II addresses the various types of IDS methods and attacks and the problem statement. Section III describes the materials and methods used. Section IV shows the implementation and result analysis of the proposed IDS, and Section V highlights the various challenges and the future scope of intrusion detection systems. Section VI concludes the overall paper with a discussion on future scope.

II. ANOMALY BASED INTRUSION DETECTION

A. Different types of intrusion detection methods and attacks:
The network size and associated data have significantly increased as a result of the fast developments in the internet and communication areas. The resulting growth of various unique threats has made it difficult for network security to properly identify attacks [6]. The different IDS methods for detecting attacks are broadly categorised as:

• Signature based detection
• Anomaly based detection
• Hybrid based detection

The specific types of attacks are categorized into four groups, which are used as a benchmark by various researchers to compare the performance of their intrusion detection systems [7]. These attacks are summarized in the table (TABLE I) below.

Traditional intrusion detection is losing effectiveness as new threats appear and communication protocols grow. Malicious activity detection is a crucial issue that needs to be addressed in future IDS, particularly for undiscovered threats. Additionally, the presence of intruders who intend to launch a variety of attacks within the network cannot be neglected and must be dealt with immediately. In order to effectively detect breaches across the network, a variety of ML algorithms are being implemented. Fig. 1 shows the block diagram of an IDS that makes use of an ML model to predict an attack. It uses two labels, normal and anomaly, to classify a connection.
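The grouping of attacks into the four classes of TABLE I, and the reduction to the two labels normal and anomaly, can be sketched as a simple lookup. This is only an illustration: the attack names are the lower-cased examples listed in TABLE I, and the dictionary and function names (`ATTACK_CLASS`, `attack_class`, `binary_label`) are hypothetical rather than part of the paper's implementation. Label spellings in an actual copy of the dataset may differ.

```python
# Illustrative lookup only: attack names are the examples given in TABLE I,
# lower-cased; the spellings used in a real dataset file may differ.
ATTACK_CLASS = {
    "worm": "Denial-of-Service",
    "smurf": "Denial-of-Service",
    "spy": "Remote to Local",
    "phf": "Remote to Local",
    "imap": "Remote to Local",
    "sql_attack": "User to Root",
    "satan": "Probing",
    "mscan": "Probing",
}


def attack_class(label: str) -> str:
    """Map a raw connection label to one of the four attack classes."""
    name = label.strip().lower()
    if name == "normal":
        return "Normal"
    # Attack names not listed in TABLE I are reported as unknown, not guessed.
    return ATTACK_CLASS.get(name, "Unknown")


def binary_label(label: str) -> str:
    """Collapse any raw label to the two labels used by the IDS."""
    return "normal" if label.strip().lower() == "normal" else "anomaly"


if __name__ == "__main__":
    for raw in ("normal", "smurf", "satan"):
        print(raw, attack_class(raw), binary_label(raw))
```

Under this scheme every non-normal record, whatever its attack class, is treated as an anomaly, which matches the two-label setup described in the text.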



Authorized licensed use limited to: Staffordshire University. Downloaded on March 28,2024 at 16:28:34 UTC from IEEE Xplore. Restrictions apply.
TABLE I. DIFFERENT ATTACK CLASSES WITH EXAMPLE

Attack Class | Definition | Example & Attack Type
-------------|------------|----------------------
Denial-of-Service | The users can no longer connect to the system because of unavailability of service. (It begins by resetting the target PCs forcibly.) | SYN flooding (worm, smurf etc.)
Remote to Local | Unauthorized access from a remote machine. (By connecting to the network without first owning a system account, the computer hacker sends packets to a remote system.) | Password guessing (spy, Phf, Imap etc.)
User to Root | Unauthorized access to local superuser (root) privileges. (The cybercriminal logs in as a regular user before upgrading to a super user, which can result in the system's multiple vulnerabilities being attacked.) | Buffer overflow (rainbow attack, sql_attack etc.)
Probing | Scanning the network for host data packets to acquire legitimate IP addresses. | Port scanning; surveillance and probing (Satan, Mscan etc.)

Fig. 1. An Intrusion Detection System using ML Model

B. Problem statement

To build an Intrusion Detection System that uses supervised ML models to distinguish between an attack (intrusion) and a benign (good/normal) connection. The system will use two labels to classify a connection: the normal class and the anomaly class.

III. MATERIALS AND METHODS

The proposed intrusion detection system applies various ML based algorithms to the NSL-KDD dataset to recognize an attack [8]. The IDS helps to determine the security of systems and raises an alarm if any intrusion is detected. The working of the proposed IDS is described below:

A. Data Set Description
The dataset used is NSL-KDD, a newer version of the KDD '99 data set, consisting of 42 attributes that define the properties of the data being sent over the network [9]. The NSL-KDD dataset consists of various features, and two labels are used to classify the intrusion, i.e. normal (if there is no attack) or anomaly (if there is an attack). NSL-KDD is preferred over the KDD99 data set because KDD99 suffers from various problems that are resolved in NSL-KDD, such as the removal of redundant records in the train set, a reasonable number of records and less biased results, which makes it better suited to evaluating the accuracy of different ML techniques [10]. The block diagram of the proposed IDS is shown in Fig. 2 below:

Fig. 2. Block diagram for the proposed IDS

B. Data Pre-processing:
The first step in creating an ML model is data preparation. This stage involves data cleaning, normalization, feature reduction and standardization [11]. Data cleaning and normalization are very important, as they involve identifying incomplete and irrelevant data and further refining the data based on the outcome. Normalization organizes the data and changes the values of numeric columns to a common scale so that classification becomes easier, without losing information or distorting differences in the ranges of values [12]. The dataset initially consists of 42 attributes; after applying normalization, the number of attributes becomes 123.

C. Data Splitting:
The data splitting process determines how to use the available data by defining the ratio of training to testing data [13]. Percentage split is the method used in the proposed system to divide the data into training and testing sets, as it is quick and easy; the split ratios employed are 80:20, 75:25 and 70:30. The training set's primary goal is to construct the model, whereas the test set is used to evaluate it. The model is first trained on the training portion at the specified ratio and then tested to measure the accuracy of attack detection [14].

D. Machine Learning Algorithms:
The ML based algorithms J48, Random Forest, Random Tree, Hoeffding Tree and Logistic are applied at three different split ratios, i.e. 70:30, 75:25 and 80:20. Based on accuracy, the best algorithm and split ratio are identified while classifying each connection as normal or anomaly.

E. Evaluation Metrics:
Accuracy was used as the major evaluation metric to measure the performance of the various ML algorithms for an IDS [15]. Besides accuracy, the other metrics considered are recall, precision, F-measure and execution time. These values help to understand and classify which approach yields the best results [16]. The algorithm that has low execution time but high precision, recall and accuracy is considered the best.

The following figure (Fig. 3) provides a step-by-step description, in the form of a flowchart, of the methodology adopted and also shows the type of result produced by the intrusion detection system. As shown in the figure, the ML based algorithms are applied to test the accuracy and classify whether there is an attack or not. Moreover, it identifies the best ML algorithm based on the highest accuracy value.

Fig. 3. Layout of the methodology used to implement an IDS

IV. IMPLEMENTATION AND RESULT ANALYSIS

The proposed IDS is implemented using the WEKA tool [17] [18]. The various ML algorithms are applied and tested at different split ratios to find the best accuracy. The proposed IDS uses different supervised ML algorithms, and its output is the detection of an attack. First, the best algorithm is determined on the basis of accuracy. The following table (TABLE II) shows the accuracy of the different classifiers at different ratios of training and testing data. We conducted a basic performance analysis using five machine learning algorithms. It is observed that Random Forest performs best at the ratio of 80:20. After Random Forest, the next best is the J48 ML algorithm.

TABLE II. ACCURACY METRICS FOR DIFFERENT CLASSIFIERS AT DIFFERENT RATIO OF TRAINING & TESTING DATA

                  Accuracy (in %) at different split ratio
Classifier        70:30      75:25      80:20
J48               99.5634    99.603     99.6626
Random Forest     99.7486    99.7618    99.8412
Random Tree       99.2723    99.2696    99.1862
Hoeffding Tree    97.4729    98.0629    98.0945
Logistic          97.764     97.6659    97.5387

The following figures (Fig. 4 and Fig. 5) show the results of the above table as a line graph and a bar graph to provide a clear visualization of the results.

Fig. 4. Line graph showing highest accuracy for Random Forest at the split ratio of 80:20

Fig. 5. Bar graph showing highest accuracy for Random Forest at the split ratio of 80:20

The next parameter considered after accuracy is execution time. The following table (TABLE III) shows the execution time (in seconds) of the different classifiers at the ratio of 80:20 for training and testing data. It is observed that Random Tree performs best among the five ML techniques applied. After Random Tree, the ML algorithm with the next lowest execution time is the Hoeffding Tree.

TABLE III. EXECUTION TIME (IN SECONDS) FOR DIFFERENT CLASSIFIERS AT 80:20 RATIO

Classifier        Time taken to build model    Time taken to test model on test split
                  (in seconds)                 (in seconds)
J48               18.39                        0.07
Random Forest     29.04                        0.45
Random Tree       0.72                         0.01
Hoeffding Tree    3.45                         0.08
Logistic          12.19                        0.05
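The rankings read off TABLE II and TABLE III can be cross-checked with a few lines of code. The sketch below only re-tabulates the values reported in the two tables. The dictionary and function names are assumptions made for illustration, not part of the WEKA workflow used in the experiments.

```python
# Accuracy (in %) per classifier and split ratio, copied from TABLE II.
ACCURACY = {
    "J48": {"70:30": 99.5634, "75:25": 99.603, "80:20": 99.6626},
    "Random Forest": {"70:30": 99.7486, "75:25": 99.7618, "80:20": 99.8412},
    "Random Tree": {"70:30": 99.2723, "75:25": 99.2696, "80:20": 99.1862},
    "Hoeffding Tree": {"70:30": 97.4729, "75:25": 98.0629, "80:20": 98.0945},
    "Logistic": {"70:30": 97.764, "75:25": 97.6659, "80:20": 97.5387},
}

# (build time, test time) in seconds at the 80:20 ratio, copied from TABLE III.
EXEC_TIME = {
    "J48": (18.39, 0.07),
    "Random Forest": (29.04, 0.45),
    "Random Tree": (0.72, 0.01),
    "Hoeffding Tree": (3.45, 0.08),
    "Logistic": (12.19, 0.05),
}


def best_classifier(split: str) -> str:
    """Classifier with the highest accuracy at the given split ratio."""
    return max(ACCURACY, key=lambda name: ACCURACY[name][split])


def best_split(name: str) -> str:
    """Split ratio at which the given classifier is most accurate."""
    return max(ACCURACY[name], key=ACCURACY[name].get)


def rank_by_total_time() -> list:
    """Classifiers ordered by build time plus test time, fastest first."""
    return sorted(EXEC_TIME, key=lambda name: sum(EXEC_TIME[name]))


if __name__ == "__main__":
    for split in ("70:30", "75:25", "80:20"):
        print(split, "->", best_classifier(split))
    for name in ACCURACY:
        print(name, "is most accurate at", best_split(name))
    print("Fastest first:", rank_by_total_time())
```

On these figures Random Forest leads at all three split ratios, three of the five classifiers (J48, Random Forest and Hoeffding Tree) peak at 80:20, and Random Tree followed by Hoeffding Tree have the lowest combined build and test time.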

Fig. 6. Bar graph showing time taken to build and test the model for different classifiers at 80:20 ratio

Other factors such as precision, recall, F-measure and ROC area have been used to evaluate the ML algorithms used in the IDS at a split ratio of 80:20, as it was found to be the best split ratio. The performance of a model mainly depends upon its accuracy, especially in the case of an IDS; however, the other parameters are also important and can be taken into consideration, as these values provide better insight into the prediction quality of the model. Values close to 1 indicate better performance in terms of precision, recall and F-measure. As shown in the table below (TABLE IV), Random Forest shows the highest precision, recall and F-measure values.

TABLE IV. PERFORMANCE METRICS SHOWN FOR DIFFERENT CLASSIFIERS AT 80:20 RATIO

ML Algorithm      Precision    Recall    F-Measure    ROC Area
J48               0.997        0.997     0.997        0.998
Random Forest     0.998        0.998     0.998        1.000
Random Tree       0.981        0.981     0.981        0.992
Hoeffding Tree    0.975        0.975     0.975        0.995
Logistic          0.967        0.967     0.967        0.981

Finally, the paper performs classification using the supervised ML techniques J48, Random Forest, Random Tree, Hoeffding Tree and Logistic. It has been observed that, among all the classification models used to classify an attack, Random Forest shows the highest accuracy of 99.8412% as well as the highest precision (0.998) and F-measure (0.998) at the 80:20 split ratio.

V. RESEARCH CHALLENGES AND FUTURE SCOPE

There are many research challenges that need to be addressed in the area of cyber security, particularly for intrusion detection. Despite the tremendous efforts of researchers, an IDS still faces a number of challenges in detecting new and unexpected intrusions. Since the majority of the suggested approaches are examined and validated in a lab environment using freely accessible, publicly available datasets, their results are not so practical when employed in real-time scenarios [19]. So, the biggest challenge for the proposed IDS is to be efficient enough for modern networks and to verify there the effectiveness demonstrated in the results. The available datasets are imbalanced in nature, which is also a major problem that needs to be considered. A conventional IDS cannot process large volumes of data and may not respond accurately to new threats [20].

Future research should focus on developing Deep Learning (DL) based intrusion detection systems that are compact, effective and capable of quickly identifying network intrusions. An IDS can be distributed among the sensor nodes or deployed at the points where network traffic from the internet enters the IoT network, and for this a lightweight IDS model can be used [21][22]. To increase the number of minority attack instances, it is necessary to provide an updated, real-time and balanced dataset on which effective methods and techniques can be applied to detect every kind of intrusion, thus resulting in a safe network.

VI. CONCLUSION

An IDS is a software application that is used to detect network intrusion. It uses various machine learning algorithms to detect whether there is an attack or not. An IDS keeps an eye out for malicious activities on a network or system and guards against unwanted access from anyone, including insiders. The results of the experiment show that an IDS works more accurately at the split ratio of 80:20 for most of the supervised machine learning algorithms. Among the five ML algorithms considered, the best result in terms of accuracy is produced by Random Forest. The main task of the IDS is to build a predictive model that is capable of distinguishing between a bad connection (i.e., an attack), represented by the label anomaly, and a good connection (i.e., not an attack), represented by the label normal. One of the biggest challenges in developing an IDS is building a lightweight IDS model for IoT devices that can be more effective in terms of attack detection rate.

REFERENCES

[1] Razan Abdulhammed, Miad Faezipour, Khaled M. Elleithy, "Network intrusion detection using hardware techniques: A review," Systems Applications and Technology Conference (LISAT) IEEE, Long Island, pp. 1-7, 2016.
[2] S. Ustebay, Z. Turgut, and M. A. Aydin, "Intrusion detection system with recursive feature elimination by using random Forest and deep learning classifier," International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT), pp. 71-76, 2018.
[3] S. Kumar, S. Gupta and S. Arora, "Research Trends in Network-Based Intrusion Detection Systems: A Review," IEEE Access, vol. 9, pp. 157761-157779, 2021.
[4] Khraisat, A., Alazab, A., "A critical review of intrusion detection systems in the internet of things: techniques, deployment strategy, validation strategy, attacks, public datasets and challenges," Cybersecurity, vol. 4, no. 18, 2021.
[5] Hasan, M., Islam, M. M., Zarif, M. I. I., & Hashem, M. M. A., "Attack and anomaly detection in IoT sensors in IoT sites using machine learning approaches," Internet of Things, vol. 7, no. 1, pp. 100059, 2019.
[6] B. Singh and S. N. Panda, "An Adaptive Approach to Mitigate Ddos Attacks in Cloud," Int. J. Adv. Comput. Sci. Appl., vol. 6, no. 10, pp. 47-52, 2015.
[7] Giovanni Vigna and Richard A. Kemmerer, "NetSTAT: A network-based intrusion detection system," Journal of Computer Security, vol. 7, pp. 37-71, Jan. 1999.

[8] Moustafa, N., & Slay, J., "The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSWNB15 data set and the comparison with the KDD99 data set," Information Security Journal: A Global Perspective, vol. 25, no. 1-3, pp. 18-31, 2016.
[9] https://www.kaggle.com/datasets/hassan06/nslkdd , accessed online on 10 June 2022.
[10] Miriam Seoane Santos, Jastin Pompeu Soares, Pedro Henrigues Abreu, Helder Araujo, and Joao Santos, "Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches," IEEE Computational Intelligence Magazine, vol. 13, no. 4, pp. 59-76, 2018.
[11] Umesh Kumar Lilhore, Sarita Simaiya, Devendra Prasad, Kalpna Guleria, "A Hybrid Tumour Detection and Classification Based on Machine Learning," Journal of Computational and Theoretical Nanoscience, vol. 17, no. 6, pp. 2539-2544, 2020.
[12] Amandeep Sharma, Kalpna Guleria, Nitin Goyal, "Prediction of Diabetes Disease using Machine Learning Models," International Conference on Communication, Computing and Electronics Systems, vol. 733, pp. 683-690, 2021.
[13] Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann, Alex Seewald, and David Scuse, WEKA manual version 3-8-1. University of Waikato, Hamilton, New Zealand, Dec. 2016.
[14] Zanariah Zainudin, Siti Mariyam Shamsuddin, and Shafaatunnur Hasan, "Deep learning for image processing in WEKA environment," International Journal of Advances in Soft Computing and its Applications, vol. 11, no. 1, pp. 1-21, 2019.
[15] Pradeepta Kumar Sarangi, Kalpna Guleria, Devendra Prasad, Deepak Kumar Verma, "Stock Movement Prediction Using Neuro Genetic Hybrid Approach and Impact on Growth Trend due to COVID-19," International Journal of Networking and Virtual Organisations, vol. 25, no. 3-4, 2021.
[16] S. Goel, K. Guleria, and S. N. Panda, "Machine Learning Techniques for Precision Agriculture Using Wireless Sensor Networks," ECS Transactions, vol. 25, no. 3, pp. 9229-9238, 2022.
[17] Belouch, M., Hadaj, S. E., & Idhammad, M., "Performance evaluation of intrusion detection based on machine learning using Apache Spark," Procedia Computer Science, vol. 127, pp. 1-6, 2018.
[18] Belavagi, M. C., & Muniyal, B., "Performance evaluation of supervised machine learning algorithms for intrusion detection," Procedia Computer Science, vol. 89, no. 1, pp. 117-123, 2016.
[19] Sridevi, S., Parthasarathy, S., & Rajaram, S., "An Effective Prediction System for Time Series Data Using Pattern Matching Algorithms," International Journal of Industrial Engineering, vol. 25, no. 2, pp. 123-136, 2018.
[20] SB Atham, K Guleria, "Smart City in Underwater Wireless Sensor Networks," Energy-Efficient Underwater Wireless Communications and Networking, pp. 287-301, 2021.
[21] S. Badotra, Di. Nagpal, S. N. Panda, S. Tanwar, and S. Bajaj, "IoT-Enabled Healthcare Network with SDN," ICRITO 2020 - IEEE 8th Int. Conf. Reliab. Infocom Technol. Optim. Trends Futur. Dir., pp. 38-42, 2020.
[22] Elsaeidy, A., Munasinghe, K. S., Sharma, D., & Jamalipour, A., "Intrusion detection in smart cities using Restricted Boltzmann Machines," Journal of Network and Computer Applications, vol. 135, no. 1, pp. 76-83, 2019.
