Abstract—Networks are dangerous environments containing numerous security vulnerabilities, and those vulnerabilities are likely to be exploited in attacks on systems with the intent of stealing valuable information or stopping services. A system should be protected from already-known types of attacks and should also be able to detect unknown types of attacks to prevent theft of information. Unknown types of attacks may harm the system by stopping services that otherwise run effectively and stably. For that purpose, it has become necessary to develop a flexible and adaptable system which can collect instant data from the network, distinguish between harmless and harmful behaviors, and take measures against the latter. The main goal of this work is to describe a network anomaly detection system that is developed using a genetic algorithm and Weka classification features to fulfill the purposes stated above. The genetic algorithm is used to generate various individuals with the aim of determining which attributes of an individual give a better result in learning the behavioral pattern of the network traffic. Furthermore, Weka classifiers are applied to the training and test datasets to calculate the best fitness value and to decide which of an individual's attributes are more effective in finding the anomaly occurring at a given instant.

Keywords—Network Anomaly Detection System, Genetic Algorithm, Weka, Feature Selection, Classification

I. INTRODUCTION
Computer networks have changed the tools people use to carry out specific activities, making those activities easier and faster. Therefore, computer networks are used in many applications in order to benefit from the advantages of the Internet. While providing those advantages, networks have started to face a growing number of various threats which change and evolve day by day.

Network Anomaly Detection Systems (NADS) check for the existence of patterns by processing the data acquired from the network traffic and take on the task of deciding whether the input data is an anomaly or a normal data instance. The NADS described in this project uses Weka’s machine learning algorithms to detect behavior that contains an anomaly. In this approach, the normal behavior previously loaded into the system and the real-time network behavior data are compared, and if there is an obvious deviation between them, a corresponding notification is created.

This study aims to employ genetic algorithms to increase the classification accuracy of a detection system that uses the least number of attributes to decide the behavior in anomaly testing, with data compiled from the KDD99 dataset.

Each chromosome produced by the genetic algorithm, following a suitable chromosome structure, is evaluated for how well it expresses the solution. Since genetic operators are numerous and create chromosomes with different attributes in these algorithms, a large solution set is obtained, and these solutions have almost optimal values even in the worst cases.

Weka is open source data mining software and includes tools for machine learning tasks such as data preparation, classification, feature selection and regression. Data mining is a method used to obtain information about a dataset that was not previously known but is potentially important. In this project, Weka's feature selection and classification methods were used.

Feature selection, in other words attribute selection, specifies which attributes in a dataset are more effective for the solution and determines the attributes to be used during machine learning. A model created by choosing the correct attributes has higher accuracy and less complexity. Each chromosome obtained with the genetic algorithm decides, through the value of each gene, whether the attribute that the gene represents is used when creating the model.

Classification is the operation of organizing the data of a dataset so that similar instances are placed in the same category and dissimilar instances in different categories. In this method, a model is created using the attributes chosen by each chromosome. The created model is tested with the data in the dataset, and the accuracy of the model's assessment of network behavior is determined from the correctly classified instances. From this accuracy value and the number of features used, the fitness value of the chromosome is estimated. This shows how successful the chromosome is at detecting the behavior.

A NADS project developed using these techniques is presented. The aim of this project is to detect and provide a notification to take certain precautions in case of an anomaly
Fig. 1. Classification of network anomaly detection methods (GA-Genetic Algorithm, ANN-Artificial Neural Network, AIS-Artificial Immune System)
(modified from [1])
c) Hybrid methods and system: Most intrusion detection systems use either misuse detection or anomaly detection. However, while misuse detection cannot detect unknown intrusions, anomaly detection has the disadvantage of often having a high false positive rate. To overcome the disadvantages of each approach used alone, a hybrid method combines multiple network anomaly detection approaches. “Hybridization of several methods increases performance of IDSs.” [1] [6]. “These methods benefit using features from both signature and anomaly-based network anomaly detection. This leads to being able to handle both known and unknown anomalies” [2].

The project described in this article is a hybrid system created using GA-based soft computing and classification-based methods. The aim of this system is to detect attacks by using a labeled dataset that includes attack types along with data labeled as normal behavior. In addition, it tries to detect new attack types that are not considered normal and that differ from the known attack types. While executing these operations, the NADS targets high accuracy values and low false positive rates.

III. METHODS
A. Genetic Algorithm
It is hard to find an exact solution to problems that have more than one local optimum. For such problems, it is necessary to apply intelligent approaches such as meta-heuristics [7]. Examples of problems to which these approaches can be applied are Generator Maintenance Scheduling [8], Traveling Salesman [9] and Job Scheduling [10].

The genetic algorithm is a search heuristic inspired by Charles Darwin’s theory of natural evolution (1859), and the technique was developed by Holland (1992). “GA abstracts the process of a population evolving given an environment condition to adapt and thrive in the scenario. Through the genetic operators (Selection, Mutation, and Crossover), the solutions are improved each generation until a stopping condition is reached.” [7]. The usual stopping criteria are reaching the maximum number of generations to be created or reaching the desired fitness value in the solution. Insufficient or no improvement in the fitness values across generations is also accepted as a stopping criterion.

The process steps of the genetic algorithm are as follows:

• Step 1: Random individuals are created and added to the population of the first generation.
• Step 2: Fitness values of the chromosomes belonging to the individuals are calculated.
• Step 3: The stopping criteria are checked. If they are met, the process ends and the fittest individual is considered to be the optimal solution. If not, the process continues from Step 4.
• Step 4: Selection is applied based on fitness values, so that the fittest individuals are more likely to take part in crossover and thus produce fitter generations.
• Step 5: Crossover is applied among the selected individuals. In this way, the diversity of the new generation is increased.
• Step 6: Mutation is applied with a small mutation rate; a change in a chromosome can form a different chromosome that has a higher fitness value.
• Step 7: Return to Step 2.

The chromosomes of individuals are generated in a way that expresses the solution of the problem. A fitness function should be determined in accordance with the problem in order to calculate the fitness of the individuals in the population. The aim of an optimization problem is to maximize or minimize the value of this function, depending on the problem.

“The genetic operators that GA performs add an aspect of randomness which helps evade local minimums and maxima while incrementally improving the initial set of solutions.” [7].

In this project, the MOEA Framework is used to implement the GA. MOEA is a free and open source Java library which supports genetic algorithms. Genetic processes were performed using the NSGA-II, PAES and eMOEA algorithms provided by MOEA.
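The process steps listed in this subsection can be sketched as a plain single-objective loop over binary chromosomes. This is only a minimal illustration, not the MOEA Framework API and not the multi-objective algorithms (NSGA-II, PAES, eMOEA) used in the project. The class name GaSketch, the binary tournament selection, the single-point crossover and the toy fitness countOnes are all assumptions chosen for brevity.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Illustrative GA over binary chromosomes; all names here are hypothetical.
class GaSketch {

    /** Runs Steps 1-7 and returns the fittest chromosome evaluated. Requires genes >= 2. */
    static boolean[] run(int genes, int popSize, int maxGenerations, double targetFitness,
                         double mutationRate, ToDoubleFunction<boolean[]> fitness, long seed) {
        Random rng = new Random(seed);

        // Step 1: create random individuals for the first generation.
        boolean[][] population = new boolean[popSize][genes];
        for (boolean[] c : population)
            for (int g = 0; g < genes; g++) c[g] = rng.nextBoolean();

        boolean[] best = population[0].clone();
        double bestFit = fitness.applyAsDouble(best);

        for (int gen = 0; gen < maxGenerations; gen++) {
            // Step 2: evaluate the fitness of every chromosome.
            double[] fit = new double[popSize];
            for (int i = 0; i < popSize; i++) {
                fit[i] = fitness.applyAsDouble(population[i]);
                if (fit[i] > bestFit) { bestFit = fit[i]; best = population[i].clone(); }
            }
            // Step 3: stop on the desired fitness or on the generation limit.
            if (bestFit >= targetFitness || gen == maxGenerations - 1) break;

            boolean[][] next = new boolean[popSize][];
            for (int i = 0; i < popSize; i++) {
                // Step 4: binary tournament selection favours fitter parents.
                boolean[] p1 = population[tournament(fit, rng)];
                boolean[] p2 = population[tournament(fit, rng)];
                // Step 5: single-point crossover of the two parents.
                int cut = 1 + rng.nextInt(genes - 1);
                boolean[] child = new boolean[genes];
                for (int g = 0; g < genes; g++) child[g] = g < cut ? p1[g] : p2[g];
                // Step 6: bit-flip mutation with a small rate.
                for (int g = 0; g < genes; g++)
                    if (rng.nextDouble() < mutationRate) child[g] = !child[g];
                next[i] = child;
            }
            // Step 7: the new generation returns to Step 2.
            population = next;
        }
        return best;
    }

    /** Picks two random individuals and returns the index of the fitter one. */
    static int tournament(double[] fit, Random rng) {
        int a = rng.nextInt(fit.length), b = rng.nextInt(fit.length);
        return fit[a] >= fit[b] ? a : b;
    }

    /** Toy fitness (assumption): the number of genes set to true. */
    static double countOnes(boolean[] c) {
        int n = 0;
        for (boolean b : c) if (b) n++;
        return n;
    }

    public static void main(String[] args) {
        boolean[] best = run(20, 30, 200, 20.0, 0.02, GaSketch::countOnes, 42L);
        System.out.println(Arrays.toString(best) + " fitness=" + countOnes(best));
    }
}
```

In the project itself the fitness would come from a trained classifier rather than a toy function, and the framework's algorithms handle selection and variation internally.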
B. Feature Selection
Success in machine learning is related to the quality of the
dataset used. If the dataset is inadequate, or if it contains
unnecessary and unrelated data, the results generated by
machine learning algorithms may be less accurate. The feature
selection method is used to find unnecessary and irrelevant
attributes within a dataset and to obtain more accurate results
by excluding those attributes from the machine learning
process.
In this study, features are selected according to the genes of
each chromosome, each of which has a binary value of 1 or 0.
If a gene has a value of 1, the feature corresponding to that
gene is removed from the dataset; if it has a value of 0, the
feature is used when creating the model.
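The gene-to-feature mapping described here can be illustrated with a small sketch. The class FeatureMask and its method names are hypothetical. The sketch works on plain arrays, whereas the project applies the same idea to Weka datasets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decodes a binary chromosome into the attributes kept for the model.
class FeatureMask {

    /** Returns the zero-based indices of kept attributes (gene 0 = keep, gene 1 = remove). */
    static List<Integer> keptAttributes(int[] chromosome) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < chromosome.length; i++) {
            if (chromosome[i] == 0) {
                kept.add(i);
            } else if (chromosome[i] != 1) {
                throw new IllegalArgumentException("gene " + i + " is not binary");
            }
        }
        return kept;
    }

    /** Projects one data row onto the kept attributes, preserving their order. */
    static double[] project(double[] row, List<Integer> kept) {
        double[] out = new double[kept.size()];
        for (int j = 0; j < out.length; j++) out[j] = row[kept.get(j)];
        return out;
    }

    public static void main(String[] args) {
        int[] chromosome = {1, 0, 0, 1, 0};          // genes 0 and 3 remove their attributes
        List<Integer> kept = keptAttributes(chromosome);
        double[] reduced = project(new double[]{10, 20, 30, 40, 50}, kept);
        System.out.println(kept + " -> " + java.util.Arrays.toString(reduced));
    }
}
```

A KDD99 chromosome would have 41 genes instead of the five used in this example.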
C. Classification with Weka
“Classification is the process of building a model of classes from a set of records that contain class labels.” [11]. Weka includes Java implementations of many classification algorithms.

In this project, Weka's J48, AdaBoostM1 and Bagging classifiers were selected to obtain the models with the highest accuracy using the fewest features after feature selection.

J48 is Weka's Java implementation of the C4.5 algorithm and generates a decision tree. “Decision Tree Algorithm is to find out the way the attributes-vector behaves for a number of instances. The additional features of J48 are accounting for missing values, decision trees pruning, continuous attribute value ranges, derivation of rules, etc.” [12].

AdaBoostM1 (Adaptive Boosting) is a machine learning meta-algorithm that boosts the performance of any machine learning algorithm.

Bagging is an ensemble method which creates a new training dataset from randomly selected instances of the training dataset and retrains the learner using this new training dataset.

IV. NETWORK ANOMALY DETECTION SYSTEM
The NADS presented here follows the steps shown in Figure 2.

The chromosome structure expressing the solution has 41 genes representing the 41 attributes of the KDD99 dataset used in the learning process.

At the beginning of the training stage, a certain number of chromosomes are created, and each gene of a chromosome is assigned a random binary value of 1 or 0. These chromosomes are diversified using the MOEA Framework’s algorithms: NSGA-II, PAES and eMOEA. The fittest chromosomes, which use the fewest features and have the highest accuracy, are printed out by the framework at the end of the execution.

In the feature selection stage, each attribute whose gene has a value of 1 is removed and does not participate in the classification operation. The remaining attributes are selected, and the classification operation is completed with these attributes.

Fig. 2. Steps applied by the system to find best feature subset to achieve best anomaly-detecting solution combining Genetic Algorithm and classification methods.

In the classification stage, Weka's J48, AdaBoostM1 and Bagging classifiers were used to perform the classification. The accuracy of the generated model and the number of attributes participating in the classification are taken into consideration, and the fitness value of the chromosome is calculated.

The fitness function used in this work is:

[Equation (1): garbled in the source; only the weights 0.8 and 0.2 are legible] (1)

where f_i is the fitness of chromosome i, a_i is the accuracy of the model built using chromosome i, and m_i is the number of selected attributes.

When the program stops, the fittest individuals obtained from these iterations are recorded and stored for use at the anomaly detection stage. Here, the fittest individuals are the optimal solutions for detecting anomalies in the KDD99 dataset: those with the lowest possible number of features and the highest classification accuracy.

All solutions obtained from these 9 executions are brought together, and the test process is performed with a KDD99 test dataset. In the test stage, all of the models were reconstructed using the recorded chromosomes, and the most successful models for detecting anomalies in the network were determined by reclassification using a separate KDD99 test dataset. The results obtained from this work are shown in the Results section.

V. RESULTS
Models with the fewest features and the highest accuracy were selected as the best results. The accuracy rates of these models and the number of features removed from them are shown in Figure 3. According to Figure 3, the NSGA-II algorithm with the J48 classifier has the highest accuracy, 91.099%, and selects the fewest attributes, 22.

Similarly, the results of training the best models with the training dataset are shown in Figure 4. According to Figure 4, the NSGA-II algorithm with the J48 classifier has the highest accuracy, 97.032%, and selects the fewest attributes, 22.
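The trade-off between accuracy and attribute count behind these results can be made concrete with a small sketch. The exact form of Eq. (1) is not legible here beyond the weights 0.8 and 0.2, so the expression below is only an assumption: accuracy weighted by 0.8 plus the fraction of unused attributes weighted by 0.2. The class and method names are hypothetical, and the authors' actual formula may differ.

```java
// Hypothetical weighted fitness; the combination of terms is an ASSUMPTION, not Eq. (1) itself.
class FitnessSketch {
    static final double ACCURACY_WEIGHT = 0.8; // weight legible in Eq. (1)
    static final double FEATURE_WEIGHT = 0.2;  // weight legible in Eq. (1)

    /**
     * Assumed form: 0.8 * accuracy + 0.2 * (1 - selected / total).
     * accuracy is a fraction in [0, 1]; selected is the number of attributes used.
     */
    static double fitness(double accuracy, int selected, int total) {
        if (accuracy < 0.0 || accuracy > 1.0 || total <= 0 || selected < 0 || selected > total) {
            throw new IllegalArgumentException("arguments out of range");
        }
        double unusedFraction = 1.0 - (double) selected / total;
        return ACCURACY_WEIGHT * accuracy + FEATURE_WEIGHT * unusedFraction;
    }

    public static void main(String[] args) {
        // Reported test-stage accuracy and attribute count, with 41 KDD99 attributes in total.
        System.out.println(fitness(0.91099, 22, 41));
        // Same accuracy using every attribute scores lower under this assumed form.
        System.out.println(fitness(0.91099, 41, 41));
    }
}
```

Under this assumed form, two models with equal accuracy are ranked by how few attributes they use, which matches the stated goal of high accuracy with the fewest features.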
Fig. 3. Test stage’s graphic of the accuracy values and the removed feature numbers that are obtained from the best models.
Fig. 4. Train stage’s graphic of the accuracy values and the removed feature numbers that are obtained from the best models.

neighbour. International Journal of Computing and ICT Research,
2008. 2(1): p. 60-66.
[5] G. Folino and P. Sabatino, Ensemble based collaborative and
distributed intrusion detection systems: A survey. Journal of Network
and Computer Applications, 2016. 66: p. 1-16.
[6] A. M. Karim, M. S. Guzel, M. R. Tolun, H. Kaya, and F. V. Celebi, A
new framework using deep auto-encoder and energy spectral density
for medical waveform data classification and processing.
Biocybernetics and Biomedical Engineering, 2019. 39(1): p. 148-159.
[7] A. H. Hamamoto, L. F. Carvalho, L. D. S. Sampaio, T. Abrao, and M.
L. Proenca Jr, Network anomaly detection system using genetic
algorithm and fuzzy logic. Expert Systems with Applications, 2018.
92: p. 390-402.
[8] S. Baskar, P. Subbaraj, M. Rao, and S. Tamilselvi, Genetic algorithms
solution to generator maintenance scheduling with modified genetic
operators. IEE Proceedings-Generation, Transmission and
Distribution, 2003. 150(1): p. 56-60.
[9] S. Wang and A. Zhao, An improved hybrid genetic algorithm for
traveling salesman problem. in Computational Intelligence and
Software Engineering. 2009.
[10] Y. Xing, Z. Chen, J. Sun, and L. Hu, An improved adaptive genetic
algorithm for job-shop scheduling problem. in Third International
Conference on Natural Computation (ICNC 2007). 2007. IEEE.
[11] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical
machine learning tools and techniques. 2016: Morgan Kaufmann.
[12] G. Kaur and A. Chhabra, Improved J48 classification algorithm for the
prediction of diabetes. International Journal of Computer Applications,
2014. 98(22).