
Network Anomaly Detection System using Genetic Algorithm, Feature Selection and Classification


Elif İpek Uysal, SAAT Laboratory, Computer Engineering Department, Ankara University, Ankara, Turkey, euysal@havelsan.com.tr
Gülnur Demircioğlu, SAAT Laboratory, Computer Engineering Department, Ankara University, Ankara, Turkey, gulnurrd@gmail.com
Gülsade Kale, Computer Technologies Department, Kilis 7 Aralık University, Kilis, Turkey, gkale@kilis.edu.tr
Erkan Bostanci, SAAT Laboratory, Computer Engineering Department, Ankara University, Ankara, Turkey, ebostanci@ankara.edu.tr
Mehmet Serdar Güzel, SAAT Laboratory, Computer Engineering Department, Ankara University, Ankara, Turkey, mguzel@ankara.edu.tr
Sarmad N. Mohammed, Computer Science Department, Kirkuk University, Kirkuk, Iraq, sarmad_mohammed@uokirkuk.edu.iq

Abstract—Networks are dangerous environments containing numerous security vulnerabilities, and those vulnerabilities are likely to be exploited when attacking systems with the intent of stealing valuable information or stopping services. A system should be protected from already-known types of attacks and should also be able to detect unknown types of attacks to prevent theft of information. Unknown types of attacks may harm the system by stopping services that run effectively and stably. For that purpose, it has become necessary to develop a flexible and adaptable system which can collect instant data from the network, distinguish between harmless and harmful behaviors and take measures against them. The main goal of this work is to explain a network anomaly detection system that is developed using a genetic algorithm and Weka classification features to fulfill the purposes stated above. The genetic algorithm is used to generate various individuals with the aim of determining which attributes of an individual give a better result in learning the behavioral pattern of the network traffic. Furthermore, Weka classifiers are applied to the train and test datasets to calculate the best fitness value and to decide which of an individual's attributes are more effective in finding an anomaly occurring at a given instant.

Keywords—Network Anomaly Detection System, Genetic Algorithm, Weka, Feature Selection, Classification

I. INTRODUCTION
Computer networks have changed the tools people use, allowing specific activities to be carried out in an easier and faster way. Therefore, computer networks are used in many applications in order to benefit from the advantages of the Internet. While benefiting from those advantages, networks have started to face a growing number of various threats which change and diversify day by day.

Network Anomaly Detection Systems (NADS) check for the existence of patterns by processing the data acquired from network traffic and take on the task of deciding whether the input data is an anomaly or a normal data instance. The NADS described in this project uses Weka's machine learning algorithms to detect behavior that contains an anomaly. In this approach, the normal behavior previously introduced to the system and the real-time network behavior data are compared, and if there is an obvious deviation between them, the related notification is created.

The study presented here aims to employ genetic algorithms in order to increase the classification accuracy of a detection system while using the least number of attributes to decide on the behavior in anomaly testing, with data compiled from the KDD99 dataset.

A suitable chromosome structure is defined, and each chromosome produced by the genetic algorithm is evaluated for how well it expresses the solution. Since genetic operators are numerous and create chromosomes with different attributes in these algorithms, a large solution set is obtained, and these solutions have almost optimal values even in the worst cases.

Weka is open source data mining software that includes machine learning tools for data preparation, classification, feature selection, regression, etc. Data mining is a method used to obtain information about a dataset that was not known previously but is potentially important. In this project, the feature selection and classification methods provided by Weka were used.

Feature selection, in other words attribute selection, specifies which attributes in a dataset are more effective for the solution and determines the attributes to be used during machine learning. A model created by choosing the correct attributes has a higher accuracy level and less complexity. With this technique, each gene of a chromosome obtained by the genetic algorithm represents an attribute of the dataset, and the gene's value decides whether that attribute is used when creating the model.

Classification is the operation of organizing the data of a dataset so that similar data are placed in the same category and dissimilar data in different categories. This method creates a model by using the attributes chosen in each chromosome. The created model is tested with the data in the dataset, and the accuracy value of the model's behavior on the network is determined from the correctly categorized instances. With this accuracy value and the number of features used, the fitness value of the chromosome is estimated. Thus, it is seen how successful the chromosome is at detecting the behavior.

A NADS project developed by using these techniques is presented. The aim of this project is to detect anomalies and to provide a notification so that certain precautions can be taken in case of an anomaly

978-1-7281-3789-6/19/$31.00 ©2019 IEEE


that means the actions to be carried out by using the individuals with the best results.

The rest of this paper is organized as follows: Section II reviews recent work on network anomaly detection, Section III gives information about the methods used in this project, Section IV describes the system that aims to detect anomalies in a network, Section V shows the obtained results and Section VI presents the conclusions drawn from this project.

II. RELATED WORKS
There are two approaches used in intrusion detection. The first one is known as misuse detection and is based on introducing the known attacks to the system and considering the unknown cases normal. While this approach easily provides protection against known attacks, it cannot react to unknown attacks since it does not have the ability to detect them. In the second approach, anomaly detection, regular behavior is introduced to the system and behaviors that deviate obviously from it (outliers) are treated as anomalies (Denning, 1987). The biggest advantage of this approach is that it detects unknown attacks.

Considering the increase in computer systems using the Internet, and consequently in the personal data transferred to these systems, detecting intrusions that can happen on networks has become an important priority. Network Anomaly Detection Systems are one of the approaches developed and researched with this goal. Network anomaly detection methods are examined in six categories [1]. In this part, those categories are explained and shown in Figure 1.

1) Statistical-based detection: In statistical-based detection, a case that is suspected to be partially or completely unrelated to a model produced by a statistical stochastic model is considered to be an anomaly.

2) Classification-based detection: This method uses a classification algorithm to create a model for the cases that are expected to happen and thus classifies the traffic patterns of the network. In this method, a labeled dataset is needed while the behavioral pattern is trained. Here, data that does not belong to the normal behavior pattern class is considered to be an anomaly.

3) Clustering and outlier-based detection: In this method, unlabeled data are divided into clusters so that the data in the same cluster are more similar to each other than to the data in the other clusters. Outliers are points that are not expected to occur in the data model created after clustering. Thus, these outliers are considered to be anomalies.

4) Soft Computing: This method is used in cases that do not have an obviously known algorithm to reach a result, and in that respect it is suitable for NAD. Soft Computing consists of these methods: Genetic Algorithm (GA), Artificial Neural Network (ANN), Fuzzy Sets, Rough Sets, Ant Colony algorithms and Artificial Immune Systems (AIS).

a) Genetic algorithm (GA): Used to find the optimal result for search problems. “The approach is to evolve a population of possible solutions towards a better solution, which will, in turn, result in better detection rates of anomalies.” [2]. GA is further explained in Section III.

b) Artificial Neural Networks (ANN): As [3] says, “Systems motivated by the distributed, massively parallel computation in the brain that enables it to be so successful at complex control and recognition/classification tasks”. The human brain can carry out calculations faster than the fastest computer. The neurons inside the brain interact with each other through connections named synapses. ANN was inspired by this and uses a large number of connections to achieve good performance. In neural networks, the strength of the connections between actively communicating neurons is set by updating synapse weights.

c) Fuzzy Sets: Used while detecting specific or general possibilities created by network intrusion detection systems via fuzzy rules.

d) Rough Set: A mathematical tool that creates explainable rules by shrinking the dataset size with feature extraction, while being beneficial for finding hidden patterns in data and handling indefinite, insufficient data [2] [4].

e) Ant Colony Optimization (ACO): ACO and related algorithms are techniques that try to mimic how ants search for food between a nutrition source and their own colonies. These techniques are used to solve computational problems which can be reformulated as finding optimal paths through graphs.

Artificial immune systems (AIS) represent a computational method inspired by the principles of the human immune system. The human immune system is adept at performing anomaly detection [1].

5) Knowledge-Based NAD: Creates pre-defined rules or attack models to represent the unknown attacks in a general way. Network or host events are monitored against these. Examples of knowledge-based methods are expert systems, rule-based, ontology-based, logic-based and state-transition analysis [1].

6) Combination Learners: Systems that are created by using multiple network anomaly detection methods together can be examined under three headings.

a) Ensemble-based methods and systems: Used for the classification of previously unseen examples by combining multiple individual classifiers [5]. This technique weights the outputs obtained from the individual classifiers and combines them to reach the final decision [1].

b) Fusion-based methods and system: “With an evolving need of automated decision making, it is important to improve classification accuracy compared to the stand-alone general decision-based techniques even though such a system may have several disparate data sources. So, a suitable combination of these is the focus of the fusion approach.” [1]. Fusion-based techniques merge data on the data level, feature level and decision level. “Data fusion is efficient in terms of increasing timeliness of attack identification and helps reducing false alarm rates. Another advantage would be that decision level fusion has a high detection rate when given applicable training data. The downsides of these techniques are the high consumption of resources as well as the feature level fusion being a time-consuming task.” [2].
[Figure 1: taxonomy tree of network anomaly detection methods]
Network Anomaly Detection Methods
Statistical Based: Parametric, Non-Parametric
Classification Based
Clustering and Outlier Based
Soft Computing: GA based, ANN based, Fuzzy Set, Rough Set, Ant Colony
Knowledge Based: Rule and Expert System, Ontology and Logic Based
Combination Learners: Ensemble Based, Fusion Based, Hybrid

Fig. 1. Classification of network anomaly detection methods (GA-Genetic Algorithm, ANN-Artificial Neural Network, AIS-Artificial Immune System) (modified from [1])

c) Hybrid methods and system: Most intrusion detection systems use one of the two approaches, misuse detection or anomaly detection. However, while misuse detection cannot detect unknown intrusions, anomaly detection has the disadvantage of often having a high false positive rate. In order to handle the disadvantages of each approach used alone, a hybrid method is created based on using multiple network anomaly detection approaches together. “Hybridization of several methods increases performance of IDSs.” [1] [6]. “These methods benefit using features from both signature and anomaly-based network anomaly detection. This leads to being able to handle both known and unknown anomalies” [2].

The project described in this article is a hybrid system created by using GA-based soft computing and classification-based methods. The aim of this system is to detect attacks by using a labeled dataset that includes attack types along with the data labeled as normal behavior. In addition, it tries to detect a new attack type that is not considered normal and that differs from the known attack types. While executing these operations, high accuracy values and low false positive rates are targeted for the NADS.

III. METHODS
A. Genetic Algorithm
It is hard to achieve an absolute solution in problems which have more than one local optimum. In such problems, it is necessary to apply intelligent approaches such as meta-heuristics [7]. Examples of problems to which these approaches can be applied are Generator Maintenance Scheduling [8], Traveling Salesman [9] and Job Scheduling [10].

The genetic algorithm is a search heuristic inspired by Charles Darwin’s theory of natural evolution (1859), and the technique was developed by Holland (1992). “GA abstracts the process of a population evolving given an environment condition to adapt and thrive in the scenario. Through the genetic operators (Selection, Mutation, and Crossover), the solutions are improved each generation until a stopping condition is reached.” [7]. The stopping criteria are usually reaching the specified number of generations or reaching the desired fitness value in the solution. Insufficient or no improvement in the fitness values across generations is also accepted as a stopping criterion.

The process steps of the genetic algorithm are as follows:

• Step 1: Random individuals are created and added to the population of the first generation.
• Step 2: Fitness values of the chromosomes belonging to the individuals are calculated.
• Step 3: The stopping criteria are checked. If they are met, the process is completed and the fittest individual is considered to be the optimal solution. If not, the process continues from Step 4.
• Step 4: The selection process is applied considering the fitness values, so that the fittest individuals are more likely to join the crossover and thus fitter generations are obtained.
• Step 5: Crossover is applied among the selected individuals. In this way, the diversity of the new generation is increased.
• Step 6: With a small mutation rate, a change in a chromosome can form a different chromosome which has a higher fitness value.
• Step 7: Return to Step 2.

The chromosomes of individuals are generated in a manner that expresses the solution of the problem. A fitness function should be determined in accordance with the problem in order to calculate the fitness of the individuals in the population. The aim of optimization problems is to maximize or minimize the value of this function, depending on the problem.

“The genetic operators that GA performs add an aspect of randomness which helps evade local minimums and maxima while incrementally improving the initial set of solutions.” [7].

In this project, the MOEA Framework is used to implement the GA. MOEA is a free and open source Java library which supports genetic algorithms. The genetic processes were performed by using the NSGA-II, PAES and eMOEA algorithms provided by MOEA.
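The process steps listed above can be sketched as a small generational loop. This is a minimal single-objective illustration in Python, not the MOEA Framework and not NSGA-II, PAES or eMOEA; the function name `run_ga`, its parameters, the tournament selection, the one-point crossover and the toy fitness function are all assumptions made for the example.

```python
import random


def run_ga(fitness, n_genes, pop_size=20, generations=50,
           mutation_rate=0.02, target=None, seed=0):
    """Hypothetical generational GA over bit strings; maximizes `fitness`.

    Requires n_genes >= 2 and generations >= 1.
    """
    rng = random.Random(seed)
    # Step 1: random individuals form the first generation.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best_score, best = float("-inf"), None

    def tournament(scored):
        # Step 4 (assumed variant): the fitter of two random individuals wins.
        first, second = rng.sample(scored, 2)
        return first[1] if first[0] >= second[0] else second[1]

    for generation in range(generations):
        # Step 2: fitness of every chromosome in the current generation.
        scored = [(fitness(c), c) for c in population]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:
            best_score, best = top_score, top
        # Step 3: stop on the desired fitness or on the generation limit.
        reached = target is not None and best_score >= target
        if reached or generation == generations - 1:
            break
        next_population = []
        while len(next_population) < pop_size:
            parent_a, parent_b = tournament(scored), tournament(scored)
            # Step 5: one-point crossover between the selected parents.
            cut = rng.randint(1, n_genes - 1)
            child = parent_a[:cut] + parent_b[cut:]
            # Step 6: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        # Step 7: the new generation is evaluated in the next iteration.
        population = next_population
    return best, best_score


# Toy fitness (assumption): count of 1-genes, only to exercise the loop.
best, score = run_ga(sum, n_genes=12)
print(best, score)
```

In the system described in this paper the fitness call would instead train and evaluate a classifier on the attributes selected by the chromosome, which makes each evaluation far more expensive than in this toy example.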
B. Feature Selection
Success in machine learning is related to the quality of the
dataset used. If the dataset is inadequate, or if it contains
unnecessary and unrelated data, the results generated by
machine learning algorithms may be less accurate. The feature
selection method is used to find unnecessary and irrelevant
attributes within a dataset and to have higher accuracy for the
results obtained by preventing them from participating in the
process during machine learning.
In this study, features are selected depending on each
chromosome’s genes, which have a binary value of either 1 or 0. If a
gene has a value of 1, the feature corresponding
to this gene is removed from the dataset, and if it has a
value of 0, the feature is used when creating the
model.
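The gene-to-feature mapping described above can be illustrated with a short Python sketch. The helper names `kept_feature_indices` and `apply_mask` and the four-attribute rows are hypothetical; only the convention (gene 1 removes the attribute, gene 0 keeps it) comes from the text.

```python
def kept_feature_indices(chromosome):
    """Indices of the attributes that stay in the dataset (gene value 0)."""
    return [i for i, gene in enumerate(chromosome) if gene == 0]


def apply_mask(rows, chromosome):
    """Drop every column whose gene is 1 from each row."""
    kept = kept_feature_indices(chromosome)
    return [[row[i] for i in kept] for row in rows]


# Hypothetical data: two instances with four attributes each.
rows = [[0.1, 5, 7, 1],
        [0.4, 6, 8, 0]]
chromosome = [0, 1, 0, 1]  # attributes 1 and 3 are removed

print(apply_mask(rows, chromosome))  # [[0.1, 7], [0.4, 8]]
```

For the KDD99 data used in this work the chromosome would have 41 genes, one per attribute, and the class label column would be kept outside the mask.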
C. Classification with Weka
“Classification is the process of building a model of classes from a set of records that contain class labels.” [11]. Weka includes Java implementations of many classification algorithms.

In this project, Weka's J48, AdaBoostM1 and Bagging classifiers were selected to obtain the models with the highest accuracy that use the fewest features after feature selection.

J48 is Weka's Java implementation of the C4.5 algorithm and generates a decision tree. “Decision Tree Algorithm is to find out the way the attributes-vector behaves for a number of instances. The additional features of J48 are accounting for missing values, decision trees pruning, continuous attribute value ranges, derivation of rules, etc.” [12].

AdaBoostM1 (Adaptive Boosting) is a machine learning meta-algorithm which boosts the performance of a machine learning algorithm.

Bagging is an ensemble method which creates a new train dataset with randomly selected instances from the train dataset and retrains the learner using this new train dataset.

IV. NETWORK ANOMALY DETECTION SYSTEM
The NADS presented here follows the steps shown in Figure 2.

Fig. 2. Steps applied by the system to find best feature subset to achieve best anomaly-detecting solution combining Genetic Algorithm and classification methods.

The chromosomal structure expressing the solution is defined with 41 genes representing the 41 attributes of the KDD99 dataset used in the learning process.

At the beginning of the train stage, a certain number of chromosomes are created, and each gene belonging to a chromosome is assigned a random binary value of 1 or 0. These chromosomes are diversified by the MOEA Framework’s algorithms: NSGA-II, PAES and eMOEA. The fittest chromosomes, which use the fewest features and have the highest accuracy, are printed out by this framework at the end of the execution.

In the feature selection stage, an attribute whose gene has a value of 1 does not participate in the classification operation and is removed. The remaining attributes are selected, and the classification operation is completed with these attributes.

In the classification stage, Weka's J48, AdaBoostM1 and Bagging classifiers were used to perform the classification process. The accuracy value of the generated model and the number of attributes participating in the classification process are taken into consideration, and the fitness value of the chromosome is calculated.

The fitness function used in this work is:

$f_i = 0.8\,(\ldots) + 0.2\,(\ldots) \qquad (1)$

where $f_i$ is the fitness of chromosome $i$, $a_i$ is the accuracy of the model built by using chromosome $i$, and $m_i$ is the number of selected attributes.

When the program stops, the fittest individuals obtained from these iterations are recorded and stored for use at the anomaly detection stage. Here, the fittest individuals are the optimal solution for the determination of anomalies in the KDD99 dataset, having the lowest possible number of features and the highest accuracy value of the classification result.

All solutions obtained from these 9 executions (each of the three algorithms paired with each of the three classifiers) are brought together, and the test process is performed with a KDD99 test dataset. In the test stage, all of the models were reconstructed by using the recorded chromosomes, and the most successful models for the detection of anomalies in the network were determined by reclassification using a separate KDD99 test dataset. The results obtained from this work are shown in the Results section.

V. RESULTS
Models with the fewest features and the highest accuracy were selected as the best results. The accuracy rates of these models and the number of features that have been removed from the models with the best results are shown in Figure 3. According to Figure 3, the NSGA-II algorithm with the J48 classifier has the highest accuracy, 91.099%, and chooses the least number of attributes, 22.

Similarly, results for the training of the best models with the train dataset are shown in Figure 4. According to Figure 4, the NSGA-II algorithm with the J48 classifier has the highest accuracy, 97.032%, and chooses the least number of attributes, 22.
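The trade-off between accuracy and feature count that drives the fitness function of Section IV and the model selection above can be illustrated with a small Python sketch. The exact operands of Equation (1) are not recoverable from the text, so the form used here is an assumption: the 0.8 weight is applied to the accuracy and the 0.2 weight to the fraction of the 41 attributes that were not selected. The candidate models and their values are hypothetical and are not results from this work.

```python
TOTAL_FEATURES = 41  # number of attributes in the KDD99 dataset


def fitness(accuracy, n_selected, w_acc=0.8, w_feat=0.2):
    """Assumed weighted form: reward accuracy and a small feature count.

    accuracy   -- fraction of correctly classified instances, in [0, 1]
    n_selected -- number of attributes used by the model
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if not 0 <= n_selected <= TOTAL_FEATURES:
        raise ValueError("n_selected must be in [0, 41]")
    unused_fraction = 1.0 - n_selected / TOTAL_FEATURES
    return w_acc * accuracy + w_feat * unused_fraction


# Hypothetical candidates: name -> (accuracy, number of selected attributes).
candidates = {
    "model_a": (0.91, 22),
    "model_b": (0.91, 30),
    "model_c": (0.85, 10),
}

ranked = sorted(candidates, key=lambda name: fitness(*candidates[name]),
                reverse=True)
for name in ranked:
    print(name, round(fitness(*candidates[name]), 4))
```

Under this assumed form, two models with equal accuracy are separated by their feature counts, and a slightly less accurate model can still rank first if it uses far fewer attributes; how strongly that happens depends on the weights and on the real form of the feature term.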
Fig. 3. Test stage’s graphic of the accuracy values and the removed feature numbers that are obtained from the best models.

Fig. 4. Train stage’s graphic of the accuracy values and the removed feature numbers that are obtained from the best models.

As a result, this project can discover the most effective feature subset with the help of the genetic algorithm and it can detect the anomalies with a high accuracy rate.

VI. CONCLUSION
In this work, a NADS is created to identify behaviors that express an attack in network traffic. In the creation of this system, feature selection was performed on the KDD99 dataset using a genetic algorithm, and an optimal feature subset for anomaly determination was sought by using the selected features. The best results were obtained using the NSGA-II algorithm with the J48 classifier.

According to the results obtained after the completion of the train and test phases of the project, it was observed that the developed NADS could find the most effective features for detecting the attacks in the datasets with the help of chromosomes obtained by the genetic algorithm. It is shown that the models, which are created by using the selected features, were able to detect attacks with high accuracy and less latency.

REFERENCES
[1] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, Network anomaly detection: methods, systems and tools. IEEE Communications Surveys & Tutorials, 2013. 16(1): p. 303-336.
[2] A. T. Tran, Network anomaly detection. Future Internet (FI) and Innovative Internet Technologies and Mobile Communication (IITM) Focal Topic: Advanced Persistent Threats, 2017. 55.
[3] M. H. Hassoun, Fundamentals of artificial neural networks. Proceedings of the IEEE, New York, 1996.
[4] A. O. Adetunmbi, S. O. Falaki, O. S. Adewale, and B. K. Alese, Network intrusion detection based on rough set and k-nearest neighbour. International Journal of Computing and ICT Research, 2008. 2(1): p. 60-66.
[5] G. Folino and P. Sabatino, Ensemble based collaborative and distributed intrusion detection systems: A survey. Journal of Network and Computer Applications, 2016. 66: p. 1-16.
[6] A. M. Karim, M. S. Guzel, M. R. Tolun, H. Kaya, and F. V. Celebi, A new framework using deep auto-encoder and energy spectral density for medical waveform data classification and processing. Biocybernetics and Biomedical Engineering, 2019. 39(1): p. 148-159.
[7] A. H. Hamamoto, L. F. Carvalho, L. D. S. Sampaio, T. Abrao, and M. L. Proenca Jr, Network anomaly detection system using genetic algorithm and fuzzy logic. Expert Systems with Applications, 2018. 92: p. 390-402.
[8] S. Baskar, P. Subbaraj, M. Rao, and S. Tamilselvi, Genetic algorithms solution to generator maintenance scheduling with modified genetic operators. IEE Proceedings-Generation, Transmission and Distribution, 2003. 150(1): p. 56-60.
[9] S. Wang and A. Zhao, An improved hybrid genetic algorithm for traveling salesman problem. in Computational Intelligence and Software Engineering. 2009.
[10] Y. Xing, Z. Chen, J. Sun, and L. Hu, An improved adaptive genetic algorithm for job-shop scheduling problem. in Third International Conference on Natural Computation (ICNC 2007). 2007. IEEE.
[11] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical machine learning tools and techniques. 2016: Morgan Kaufmann.
[12] G. Kaur and A. Chhabra, Improved J48 classification algorithm for the prediction of diabetes. International Journal of Computer Applications, 2014. 98(22).
