All content following this page was uploaded by Ahmed Riadh Baba-Ali on 25 July 2018.
ABSTRACT In this paper, we describe an incremental learning system based on the KNN (K Nearest Neighbor) classification algorithm. Our approach uses an incremental learning paradigm, in which the knowledge is improved over time, in contrast to the offline learning paradigm, where learning is done once and for all at the beginning. The incremental learning ability is attractive for problems with changing environments, such as network intrusion detection, where new and previously unknown attacks appear over time. This kind of problem requires a continuous adaptation of the system's knowledge. Our system has been tested with live data coming from a real network under state-of-the-art attacks. The results show that our approach is faster, with a substantially smaller error rate and a smaller false negative (FN) error rate, than the classical KNN.
KEY WORDS: incremental learning, classification, KNN, Network intrusion detection system
1 INTRODUCTION
For the purpose of intrusion detection, network events such as packets or flows are classified with supervised classifiers such as neural networks, support vector machines (SVM), decision rules, decision trees, etc. (Axelsson S., 2000). Among them, KNN classifiers have also been used successfully (Deepika D. and Richhariya V., 2012), (Liao Y. and Vemuri V.R., 2002). With KNN, a training database is used to measure the similarity of each event to classify against the data contained in that database. An important property of the KNN algorithm is that it is non-parametric: KNN does not rely on a model or a concept, but rather on a dataset. This property is important for problems with fast-changing environments, such as NIDS, where new attacks and changes appear over time. Consequently, it is mandatory for such systems to be able to adapt their own knowledge to possible changes. This important property is called concept drift adaptation (J. Gama et al., 2013).
2 RELATED WORK
Concept drift has been recognized as an important and challenging problem. Several approaches described in the literature deal with concept drift: some are based on decision trees, Bayes models, SVM, etc., whereas others are neighbor-based (J. Read et al., 2012).
To deal with concept drift, nearest neighbor (NN) algorithms rely on different strategies to prioritize the most recently acquired knowledge, which is considered "more true" than the oldest. Among the existing approaches, one can cite the time window approach, which considers only the instances acquired within a time window. Another approach is instance time weighting, where each instance has a weight that is greater when the instance's acquisition time is more recent (Zhang P. et al., 2011), (Beringer J., 2007). To the author's knowledge, none is based on the CNN algorithm that we use in our system. Our approach nevertheless obtained very promising results, as we will see in the next sections.
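As an illustration, the two neighbor-based drift-handling strategies above can be sketched as follows. This is a minimal sketch with our own class and function names, not code from any of the cited systems:

```python
import math
from collections import deque

class TimeWindowNN:
    """Time window approach: keep only instances acquired within the
    last `window` time units, discarding older knowledge."""
    def __init__(self, window):
        self.window = window
        self.data = deque()          # entries: (timestamp, features, label)

    def add(self, features, label, timestamp):
        self.data.append((timestamp, features, label))
        # evict every instance that has fallen out of the time window
        while self.data and timestamp - self.data[0][0] > self.window:
            self.data.popleft()

    def classify(self, features):
        # 1-NN restricted to the instances still inside the window
        _, _, label = min(self.data,
                          key=lambda t: sum((a - b) ** 2
                                            for a, b in zip(t[1], features)))
        return label

def recency_weight(age, decay=0.1):
    """Instance time weighting: the more recent the instance
    (smaller age), the larger its weight."""
    return math.exp(-decay * age)
```

The eviction step makes the memory self-limiting, while the exponential weight lets all instances be kept but counted less as they age; both are standard ways to favor recent knowledge.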
Compared to other well-known classifiers, neighborhood techniques are attractive thanks to their simplicity and their classification accuracy. However, one of the limitations of the KNN rule is the size of the training set (Suguna N. & Thanushkodi K., 2010), (Yu W. & Zheng'ou W., 2007). For KNN classification, the size of the training database is crucial: if the number of training instances is too small, the accuracy of the KNN classifier is not acceptable; but if the training set is too large, the KNN classification running time can be prohibitive. Consequently, finding the optimal size of the training database is very important and constitutes a difficult problem (Liu H., Motoda H., 2002). Given its importance, this problem has been addressed in the literature for many years; it is called the instance selection (IS) problem or the prototype selection problem. It can be tackled in two ways: either by using smaller data sets, or by using an improved classification algorithm that accelerates the computation (Shi B., Yi W. & Zhang'ou W., 2007).
Instance selection (IS) is the process of finding representative instances in the data, which helps reduce the size of the data (Murty M. N., Babu T. R., 2001). This problem is NP-hard (Zukhba A. V., 2010), which means that no polynomial-time algorithm is known to find an optimal solution. Existing approximation algorithms can however give acceptable solutions in reasonable time. As a consequence, instance selection algorithms are essentially based on heuristics and metaheuristics. First, heuristic algorithms are essentially greedy ones; examples are CNN (Hart P., 1968) and CBP (Nikolaidis et al., 2011). Second, metaheuristic algorithms are essentially evolutionary ones; an example is the genetic algorithm, which has proven its efficiency in dealing with difficult problems (Gil-Pita R. & Yao X., 2007), (Aci M., Inan C. & Avci M., 2010).
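Hart's CNN heuristic mentioned above can be sketched as follows: starting from a single stored instance, every training instance misclassified by 1-NN against the current store is added to it, and passes are repeated until a full pass adds nothing. This is an illustrative sketch of the classic rule, not the exact implementation used in our system:

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nn_label(store, x):
    """1-NN prediction: label of the closest stored (features, label) pair."""
    return min(store, key=lambda inst: euclidean(inst[0], x))[1]

def cnn_condense(data):
    """Condensed Nearest Neighbor (Hart, 1968): greedily build a subset
    (the 'store') that classifies every training instance correctly
    under the 1-NN rule."""
    store = [data[0]]
    changed = True
    while changed:
        changed = False
        for x, y in data:
            if nn_label(store, x) != y:
                store.append((x, y))   # absorb the misclassified instance
                changed = True
    return store
```

By construction, the condensed store is consistent with the full training set (1-NN on the store classifies every training instance correctly), which is why it can stand in for the full database at a fraction of the size.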
3 SYSTEM DESCRIPTION
The system includes the following components:
• The KNN classification module
• The instance selection module
• The KNN report validation module
• The incremental Learning module
Figure 1: System architecture. Training database → instance selection → reduced training database → KNN classification of network traffic → classification report → report analysis → incremental learning, with the classification errors fed back into the training database.
• The KNN classification module is the heart of the whole system, since it classifies the incoming events from the network as either normal or attack, using a training database containing labeled data.
• The instance selection module processes the initial training database in order to reduce its size by selecting the most meaningful data. This operation is necessary in order to reduce the classification time of each incoming traffic event.
• Thanks to the report validation module, classification outcomes are assessed, generally by the network administrators or automatically. Basically, this module corrects the classification errors made by the KNN classification module. It produces a file which contains all the misclassified events produced by the KNN classifier. This file is subsequently used as input data for the next module.
• The last module is the incremental learning module. It processes the previous classification errors in order to update the knowledge database. The goal is to avoid repeating previous errors.
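A minimal sketch of the report validation and incremental learning steps might look as follows. The function names and the oracle interface are hypothetical, chosen only to illustrate the error-driven update loop:

```python
def validate_report(report, oracle):
    """Report validation module: an oracle (e.g. the network
    administrator, or an automatic checker) supplies the true label for
    each classified event; misclassified events are collected, with
    their corrected labels, into an error list (the 'error file')."""
    errors = []
    for event, predicted in report:
        truth = oracle(event)
        if truth != predicted:
            errors.append((event, truth))
    return errors

def incremental_update(training_db, errors):
    """Incremental learning module: add each corrected error to the
    training database so that the same mistake is not repeated."""
    training_db.extend(errors)
    return training_db
```

Because only misclassified events enter the database, the added knowledge stays small relative to the traffic processed, which is consistent with the growth figures reported in the conclusion.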
Each instance Ii of the training set is composed of j input attributes and one output class ci:
Ii = <a1i, a2i, …, aji, ci>.
The class of the unknown instance is found according to the classes of the K most similar instances in
the training database. The similarity function is usually a distance function, such as the Euclidean distance.
To make a prediction for a test instance, the algorithm first computes its distance to every training
instance. Then, it keeps the k closest training instances, where k ≥ 1 is a fixed integer. The algorithm looks for
the most common class among these instances' classes; this class is the predicted class for the test instance.
There are two design choices to make: the value of k, and the distance function to use. It is common to select a
small, odd k (typically 1, 3 or 5). The 1-NN rule is the basis of the KNN algorithm: it simply classifies the
unclassified instance in the same way as its nearest neighbor.
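The k-NN decision rule described above can be sketched as follows (an illustrative implementation using Euclidean distance and majority vote, not the system's actual code):

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(training_db, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.
    Each training instance is a (features, class) pair."""
    neighbors = sorted(training_db,
                       key=lambda inst: euclidean(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With k=1 this reduces to the 1-NN rule: the query simply takes the class of its single nearest neighbor. An odd k avoids ties in two-class problems such as normal vs. attack.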
The instance selection module relies on a genetic algorithm whose fitness function combines two objectives. Its two weighting parameters (λ1 and λ2) allow a compromise to be found between the two objectives to be achieved. The fitness function thus includes the two aspects of the problem, leading to solutions that are as satisfactory as possible. More implementation details can be found in (Miloud-Aouidate A. et al., 2013).
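Since the exact fitness function is given in (Miloud-Aouidate A. et al., 2013) and not reproduced here, the following is only a hypothetical weighted-sum sketch showing how two weights λ1 and λ2 can trade off two objectives, here assumed to be classification accuracy and database size reduction:

```python
def fitness(accuracy, reduction, lam1=0.5, lam2=0.5):
    """Hypothetical weighted-sum fitness for the genetic instance
    selector: `accuracy` is the classification accuracy of the candidate
    subset and `reduction` its size-reduction ratio, both in [0, 1].
    lam1 and lam2 weight the two objectives against each other."""
    return lam1 * accuracy + lam2 * reduction
```

Raising λ1 favors candidate subsets that classify well even if large; raising λ2 favors aggressive condensation at some cost in accuracy.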
4 TESTS
The tests presented here have been conducted on real data, in order to compare the results obtained with and without the incremental learning concept.
4.2 Results
The instance selection module reduces the training database by 99%. This remarkable result is obtained by almost all instance selection approaches. It can be explained by the fact that network events are very similar during a short period of time: for example, the packets generated by a download, or by a video or audio stream, share many features such as addresses, ports, protocols, etc. This important size reduction of the training database has a proportional impact on KNN classification speed; in fact, the running time is reduced in the same proportion.
The following tests have been conducted in order to demonstrate the efficiency of our approach. We measure the error rate (misclassification rate) of the events produced by the KNN module, first without incremental learning and then with it. We measure the daily mean error rate over a period of about a month. We conducted the tests for different values of K, varying from 3 to 11. It appears that the optimal value is K=3 (see Table 1), since the error rate reaches its lowest value there. We can see that the error rate is reduced by a factor of about five.
Table 1: Comparison of the mean classification error rate
For K=3, figure 2 shows the evolution of the classification error rate over time with (light line) and without incremental learning (dark line). It starts from about 10% and decreases progressively to about 2.33%. This shows that the system is able to learn incrementally from its previous errors, since the error rate decreases regularly over time, as indicated by the dashed linear trend line.
Figure 2: Comparison of classification mean error rate with and without incremental learning vs time
The next tests are similar to the previous ones, except that the FN (false negative) rate is measured instead. This rate is important since it indicates the number of attacks that go undetected by the NIDS. The false positive (FP) rate is relatively less important in the NIDS context, since it represents the rate of falsely detected attacks.
K                               1      3      5      7      9      11
Classical KNN                   5.85   6.70   9.63   13.62  13.65  13.43
KNN with incremental learning   1.14   1.24   1.25   1.35   1.41   1.42
Table 2 shows a substantial reduction of the false negative error rate. In this case too, the error rate decreases regularly, which shows once more that the system is able to learn from its previous errors.
5 CONCLUSION
Network intrusion detection is an important computer security field. Its main computational requirements are the classification speed and the classification error rate. Our approach addresses both requirements in the following ways.
Since network speed is an important factor, processing high-speed networks has become mandatory for efficient network security. The instance reduction contributes to reducing the classification running time substantially: KNN classification speed is proportional to the training database size, so the smaller the database, the smaller the running time. It is important to note that this gain is eroded by the incremental learning process, since it constantly adds new knowledge to the training database. The experiments have shown that, when processing data of the size of the original training database, the added knowledge represents less than 2.5% of its size. So the advantage obtained by the instance reduction disappears over time, once the system has processed more than 40 times the original data size. However, even in this case it is possible to run the instance selection module offline to reduce the training database size again.
The tests also show a substantial reduction of the error rate, which dropped by a factor of about five. This result is due to the incremental learning process, which avoids repeating the same errors by enhancing the training database. Moreover, the false negative error rate has been substantially reduced, by a factor ranging from five to more than ten. This fact is very important in the intrusion detection context, since the false negative rate represents the proportion of attacks undetected by the system.
The author believes that the simplicity of the approach, which uses robust algorithms such as KNN, CNN and genetic algorithms, makes it remarkably efficient in terms of speed and accuracy. Moreover, it can be applied to other fields such as robotics and control. The author also believes that this approach could provide a robust solution to several challenges facing incremental learning, such as autonomous learning and the differentiation between noise and drift (J. Gama et al., 2013).
References
Aci M., Inan C. & Avci M., 2010. A hybrid classification method of k nearest neighbor, Bayesian methods and genetic algorithm, Expert Systems with Applications, 37, pp. 5061-5067.
Axelsson S., 2000. Intrusion Detection Systems: A Survey and Taxonomy, Engineering, Vol. 52 (99-15), pp. 541-554.
Beringer J., 2007. Efficient instance-based learning on data streams, Intelligent Data Analysis, Vol. 11, No. 6, pp. 627-650.
Deepika D. and Richhariya V., 2012. Intrusion detection with KNN classification and DS-theory, International Journal of Computer Science and Information Technology and Security, Vol. 2 (2), pp. 274-281.
Garcia S., Cano J. R. and Herrera F., 2008. A memetic algorithm for evolutionary prototype selection: A scaling up approach, Pattern Recognition, Vol. 41, pp. 2693-2709.
Gama J. et al., 2013. A Survey on Concept Drift Adaptation, ACM Computing Surveys, Vol. 1, No. 1, January 2013.
Gil-Pita R. & Yao X., 2007. Using a Genetic Algorithm for Editing k-Nearest Neighbor Classifiers, IDEAL 2007, LNCS 4881, pp. 1141-1150.
Hart P., 1968. The Condensed Nearest Neighbor Rule, IEEE Transactions on Information Theory, 14, pp. 515-516.
Liao Y. and Vemuri V.R., 2002. Use of K-Nearest Neighbor classifier for intrusion detection, Computers and Security, Vol. 21 (5), pp. 439-448.
Liu H., Motoda H., 2002. On issues of instance selection, Data Mining and Knowledge Discovery, Vol. 6 (2), pp. 115-130.
Miloud-Aouidate A. et al., 2013. IDS false alarm reduction using an instance selection KNN memetic algorithm, International Journal of Metaheuristics, Inderscience, Vol. 2 (4), pp. 333-252.
Murty M. N., Babu T. R., 2001. Comparison of genetic algorithm based instance selection schemes, Pattern Recognition, 34, pp. 523-525.
Nikolaidis et al., 2011. A class boundary preserving algorithm for data condensation, Pattern Recognition, 44, pp. 704-715.
Read J. et al., 2012. Batch-Incremental versus Instance-Incremental Learning in Dynamic and Evolving Data, Advances in Intelligent Data Analysis XI, LNCS Vol. 7619, pp. 313-323.
Suguna N. & Thanushkodi K., 2010. An Improved k-Nearest Neighbor Classification Using Genetic Algorithm, International Journal of Computer Science Issues, 7, pp. 18-21.
Shi B., Yi W. & Zhang'ou W., 2007. A fast KNN algorithm applied to web text categorization, Journal of The China Society for Scientific and Technical Information, 26 (1), pp. 60-64.
Yu W. & Zheng'ou W., 2007. A fast KNN algorithm for text categorization, Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, pp. 3436-3441.
Zhang P. et al., 2011. Enabling Fast Lazy Learning for Data Streams, IEEE International Conference on Data Mining, ICDM '11, pp. 932-941.
Zukhba A. V., 2010. NP-Completeness of the Problem of Instance Selection in the Nearest Neighbor Method, Pattern Recognition and Image Analysis, 20, pp. 484-494.