

AN INCREMENTAL LEARNING SYSTEM FOR ON-LINE KNN CLASSIFICATION:
Application to Network Intrusion Detection
Ahmed Riadh Baba-Ali
University of Science and Technology of Algiers, Algeria (USTHB)
email: rbaba-ali@usthb.dz

ABSTRACT In this paper, we describe an incremental learning system based on the KNN (K-Nearest Neighbor) classification algorithm. Our approach uses an incremental learning paradigm, in which the knowledge is improved over time, in contrast to the offline learning paradigm, where the learning is done once and for all at the beginning. The incremental learning ability is attractive for problems with changing environments, such as network intrusion detection, where new and previously unknown attacks appear over time. This kind of problem requires a continuous adaptation of the system's knowledge. Our system has been tested with live data coming from a real network under state-of-the-art attacks. The results show that our approach is faster, with a substantially smaller error rate and a smaller false negative (FN) error rate, compared to the classical KNN.

KEY WORDS: incremental learning, classification, KNN, network intrusion detection system

1 INTRODUCTION
For the purpose of intrusion detection, network events such as packets or flows are classified with supervised classifiers such as neural networks, support vector machines (SVM), decision rules, decision trees, etc. (Axelsson S., 2000). Among them, KNN classifiers are also successfully used (Deepika D. and Richhariya V., 2012), (Liao Y. and Vemuri V.R., 2002). With KNN, a training database is used to measure the similarity of each event to classify against the data contained in the training database. An important property of the KNN algorithm is that it is non-parametric: KNN does not rely on a model or a concept, but rather on a dataset. This property is important for problems with fast changing environments such as NIDS, where new attacks and changes appear over time. Consequently, it is mandatory for such systems to be able to adapt their own knowledge to possible changes. This important property is called concept drift adaptation (J. Gama et al., 2013).

2 RELATED WORK
Concept drift has been recognized as an important and challenging problem. Several approaches are described in the literature to deal with it: some are based on decision trees, Bayes models or SVM, whereas others are neighbor based (J. Read et al., 2012).
To deal with concept drift, nearest neighbor (NN) algorithms rely on different strategies to prioritize the most recently acquired knowledge, which is considered "more true" than the oldest one. Among the existing approaches, one can cite the time window approach, which considers only the instances acquired within a time window. Another approach is instance time weighting, where each instance has a weight that is greater when the instance acquisition time is more recent (Zhang P. et al., 2011), (Beringer J., 2007). To the author's knowledge, none is based on the CNN algorithm that we use in our system. Our approach nevertheless obtained very promising results, as we will see in the next sections.
Compared to other well-known classifiers, neighborhood techniques are attractive thanks to their simplicity and their classification accuracy. However, one of the limitations of the KNN rule is the size of the training set (Suguna N. & Thanushkodi K., 2010), (Yu W. & Zheng'ou W., 2007). For KNN classification, the size of the training database is crucial: if the number of training instances is too small, the accuracy of the KNN classifier is not acceptable, but if the training set is too large, the KNN classification running time can be prohibitive. Consequently, finding the optimal size of the training database is very important and constitutes a difficult problem (Liu H., Motoda H., 2002). Due to its importance, this problem has been addressed in the literature for many years; it is called the instance selection (IS) problem or the prototype selection problem. It can be tackled in two ways: either by reducing the data set (or the dimensionality of the feature space), or by using an improved classification algorithm which accelerates the computation (Shi B., Yi W. & Zhang'ou W., 2007).
Instance selection (IS) is the process of finding representative instances in the data, which helps reduce its size (Murty M. N., Babu T. R., 2001). This problem is NP-hard (Zukhba A. V., 2010), which means that no polynomial-time algorithm is known to find an optimal solution. Existing approximation algorithms can, however, give acceptable solutions in reasonable time. As a consequence, instance selection algorithms are essentially based on heuristics and metaheuristics. Heuristic algorithms are essentially greedy ones; examples are CNN (Hart P., 1968) and CBP (Nikolaidis et al., 2011). Metaheuristic algorithms, on the other hand, are essentially evolutionary ones; an example is the genetic algorithm, which has proven its efficiency in dealing with difficult problems (Gil-Pita R. & Yao X., 2007), (Aci M., Inan C. & Avci M., 2010).

3 SYSTEM DESCRIPTION
The system includes the following components:
• The KNN classification module
• The instance selection module
• The KNN report validation module
• The incremental learning module

Figure 1: System components. The training database is reduced by the instance selection module; the KNN classification module classifies the network traffic and produces a classification report; the report analysis and incremental learning modules feed the classification errors back into the training database.
• The KNN classification module is the heart of the whole system, since it classifies the incoming events from the network as either normal or attack, using a training database containing labeled data.
• The instance selection module processes the initial training database in order to reduce its size by selecting the most meaningful data. This operation is necessary in order to reduce the classification time of each incoming traffic event.
• Thanks to the report validation module, classification outcomes are assessed, generally by the network administrators or automatically. Basically, this module corrects the classification errors made by the KNN classification module. It produces a file which contains all the events misclassified by the KNN classifier. This file is subsequently used as input data for the next module.
• The last module is the incremental learning module. It processes the previous classification errors in order to update the knowledge database. The goal is to avoid repeating previously made errors.

3.1 The KNN Classification module


The K-nearest neighbor (KNN) classification rule is a powerful classification method that classifies an unknown instance according to the classes of the instances of a training set.

Each instance Ii of the training set is composed of inputs and one output:

Ii = <a1i, a2i, …, aji, ci>

The inputs (a1i, a2i, …, aji) are the features of the instance, and the output ci is its class.

The class of the unknown instance is determined by the classes of the K most similar instances in the training database. The similarity function is usually a distance function such as the Euclidean distance. To make a prediction for a test instance, the algorithm first computes the distance from the test instance to every training instance. It then keeps the k closest training instances, where k ≥ 1 is a fixed integer, and looks for the most common class among their classes. This class is the predicted class for the test instance. There are two design choices to make: the value of k and the distance function. It is common to select a small, odd k (typically 1, 3 or 5). The 1-NN rule is the basis of the KNN algorithm: it simply assigns to the unclassified instance the class of its nearest neighbor.
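As an illustration of this rule, here is a minimal, self-contained sketch in plain Python (our own illustration; the paper does not publish code), using the Euclidean distance and a majority vote:

import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_set, query, k=3):
    # training_set is a list of (features, label) pairs, following the
    # instance representation Ii = <a1i, ..., aji, ci> given above.
    neighbors = sorted(training_set, key=lambda inst: euclidean(inst[0], query))
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Toy usage: two 'normal' instances and one 'attack' instance.
data = [((0.0, 0.0), "normal"), ((0.1, 0.2), "normal"), ((5.0, 5.0), "attack")]
print(knn_classify(data, (0.2, 0.1), k=3))  # -> normal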

3.2 The instance selection module


Instance selection is the process of finding the most representative instances in a training dataset (Liu H., Motoda H., 2002). As noted above, this problem is NP-hard (Zukhba A. V., 2010), so no polynomial-time algorithm is known to find an optimal solution; existing heuristics can, however, give acceptable solutions in reasonable time. One of the techniques used for instance selection is the genetic algorithm, which belongs to the class of evolutionary algorithms (EA), a family of metaheuristics inspired by the theory of evolution and dedicated to solving various problems. These algorithms transform over time a set of candidate solutions to a given problem in order to find the best possible one. Here, a solution is represented as a chromosome, where each gene corresponds to an instance in the training database. Each gene contains either a 1 or a 0: a 1 means that the corresponding instance is selected, while a 0 means that it is deleted from the database. As fitness, we used a function which minimizes the training database size without degrading the classification accuracy:

fitness = λ1 · accuracy + λ2 · reduction,   where λ1 + λ2 = 1

The two parameters λ1 and λ2 allow a compromise to be found between the two objectives. The fitness function thus covers both aspects of the problem, leading to solutions that are as satisfactory as possible. More implementation details can be found in (Miloud-Aouidate A. et al., 2013).
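As a rough illustration of this encoding and fitness, here is a minimal sketch reusing knn_classify from the sketch in section 3.1 (our own simplification; the λ values below are illustrative, not the paper's settings, and the actual GA operators are described in (Miloud-Aouidate A. et al., 2013)):

import random

def fitness(chromosome, training_set, validation_set, lam1=0.5, lam2=0.5, k=3):
    # Decode the binary chromosome: keep instances whose gene is 1.
    subset = [inst for gene, inst in zip(chromosome, training_set) if gene == 1]
    if not subset:
        return 0.0  # an empty subset cannot classify anything
    # Accuracy of KNN on held-out data, using the selected subset only.
    correct = sum(1 for features, label in validation_set
                  if knn_classify(subset, features, k) == label)
    accuracy = correct / len(validation_set)
    # Reduction: fraction of the training set that was removed.
    reduction = 1.0 - len(subset) / len(training_set)
    # Weighted compromise between accuracy and reduction (lam1 + lam2 = 1).
    return lam1 * accuracy + lam2 * reduction

# A random initial chromosome for a 100-instance training database.
chromosome = [random.randint(0, 1) for _ in range(100)]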

3.3 The diagnostic validation module


The outcome of the classification module can take many forms: labeled events such as packets or flows, or more sophisticated alerts issued by an automatic system. In all cases the events are labeled, which means that each event has been classified as attack or normal. The label of each incoming event can then be verified, which leads to four cases. Taking attacks as the positive class, as is usual in intrusion detection: when an event actually is an attack and has been classified as an attack, it is called a true positive (TP); when an event is normal and has been classified as normal, it is called a true negative (TN). The diagnosis is wrong when an event actually is an attack but has been classified as normal: this is a false negative (FN), that is, a missed attack. The fourth case is when a normal event has been classified as an attack: this is a false positive (FP), that is, a false alarm.
The validation of the diagnosis can be done automatically, by software such as an alert classifier (Miloud-Aouidate A. et al., 2013), or manually, by a person who is generally the network administrator. In fact, after analysis and verification, the network administrator can establish that an alert does not correspond to an actual attack, or that a supposedly normal event actually corresponds to an attack. In both cases, these false diagnoses are subsequently used to improve the system's knowledge database.
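These four cases are the cells of a binary confusion matrix. A minimal helper that tallies them from verified (actual, predicted) label pairs could look like this (our own illustration):

from collections import Counter

def confusion_counts(events):
    # events is an iterable of (actual, predicted) label pairs, with
    # 'attack' as the positive class: FN = missed attack, FP = false alarm.
    counts = Counter()
    for actual, predicted in events:
        if actual == "attack":
            counts["TP" if predicted == "attack" else "FN"] += 1
        else:
            counts["TN" if predicted == "normal" else "FP"] += 1
    return counts

# Example: one missed attack (FN) among four verified events.
log = [("attack", "attack"), ("normal", "normal"),
       ("attack", "normal"), ("normal", "normal")]
print(confusion_counts(log))  # Counter({'TN': 2, 'TP': 1, 'FN': 1})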

3.4 The incremental learning module


The purpose of the incremental learning module is to update the training database used by the KNN classifier. This update is made using Hart's algorithm, the Condensed Nearest Neighbor rule or CNN (Hart P., 1968). CNN was one of the first methods used for training dataset size reduction. It permits the selection, among the corrected predictions, of the most appropriate knowledge, eliminating training instances that add no information or are redundant.
CNN is basically a greedy heuristic. Its main advantages are its speed and its good reduction rate. On the other hand, one major disadvantage of CNN is that the algorithm is sensitive to the order of the events used to update the training database. However, in the author's view, in the context of incremental learning this disadvantage is actually an advantage, since the KNN report validations come in a precise temporal order: the time at which the report errors have been discovered. Therefore, the updates of the training database must be done according to the precise time of the validation, in order to reflect the evolution of the knowledge. This evolution of knowledge is called concept drift adaptation (J. Gama et al., 2013). Concept drift, in the network intrusion detection context, may have several meanings: either a network security policy has changed, or new threats have been discovered. In both cases a quick improvement of the knowledge base is necessary, preferably in real time.
Besides, the speed of the Hart algorithm (Hart P., 1968) makes it suitable for real-time learning and compatible with the high throughput of current networks. In our case, the CNN algorithm selects, from the validated events, the most pertinent updates to the training data.
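For reference, here is a minimal single-pass sketch of the CNN rule (Hart P., 1968) applied to a stream of validated errors, reusing knn_classify from the sketch in section 3.1 (our own simplification: classical CNN rescans the data until the condensed set stabilizes, and the paper does not publish its exact implementation):

def cnn_update(condensed_set, corrected_events, k=1):
    # Process validated errors in temporal order (CNN is order sensitive).
    # An event is absorbed only if the current condensed set misclassifies
    # it; redundant events are discarded, keeping the database small.
    for features, label in corrected_events:
        if not condensed_set:
            condensed_set.append((features, label))
        elif knn_classify(condensed_set, features, k) != label:
            condensed_set.append((features, label))
    return condensed_set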

4 TESTS
The tests presented below have been conducted on real data, in order to compare the results obtained with and without incremental learning.

4.1 Used Data


The database contains descriptions of TCP connections, with 41 attributes per connection. These connections gather training and testing data collected over about a week of network traffic containing several attacks.

4.2 Results
The instance selection module reduces the training database by 99%. This remarkable result is obtained by almost all instance selection approaches, and can be explained by the fact that network events are very similar during a short period of time: for example, the packets generated by a download, or by a video or audio stream, share many features such as addresses, ports and protocols. This important size reduction of the training database has a proportional impact on the KNN classification speed; in fact, the running time is reduced in the same proportion.
The following tests have been conducted in order to demonstrate the efficiency of our approach. We measure the error rate (misclassification rate) of the events produced by the KNN module, first without and then with incremental learning. We measure the daily mean error rate over a period of about a month, with values of K varying from 3 to 11. It appears that the optimal value is K = 3 (see Table 1), since the error rate reaches its lowest value there. We can see that the error rate is roughly divided by four.
Table 1: Comparison of the mean classification error rate

For K = 3, figure 2 shows the evolution of the classification error rate over time with (light line) and without (dark line) incremental learning. The error rate starts from about 10% and decreases progressively to about 2.33%. This shows that the system is able to learn incrementally from its previous errors, since the error rate decreases regularly over time, as indicated by the dashed linear trend line.

Figure 2: Comparison of classification mean error rate with and without incremental learning vs time

The next tests are similar to the previous ones, except that the FN (false negative) rate is measured instead. This rate is important since it indicates the number of undetected attacks, i.e., attacks missed by the NIDS. The false positive (FP) rate, in the context of NIDS, is relatively less important since it represents the rate of false alarms.

Table 2: Comparison of mean classification false negative error rate (%)

K                                  1      3      5      7      9      11
Classical KNN                    5.85   6.70   9.63  13.62  13.65  13.43
KNN with incremental learning    1.14   1.24   1.25   1.35   1.41   1.42

Table 2 shows a substantial reduction of the false negative error rate. In this case also, the error rate decreases regularly, which shows once more that the system is able to learn from its previous errors.

Figure 3: Comparison of mean classification false negative error rate VS time


The last aspect we studied is the evolution of the training database size. Since new knowledge is continuously added to the training database over time, a natural question is how the size of the training database evolves. Figure 4 shows a constant increase of the database size, of about 2.5% of the input test data size. We conducted the tests for values of K varying from 3 to 11. In all cases, compared to the initial reduced training database size, the increase is about 20%; compared to the initial training database without reduction, the increase represents less than 1%.

Figure 4: Evolution of the size of the training data base vs time

5 CONCLUSION
Network intrusion detection is an important computer security field. Its main computational requirements are classification speed and classification error rate. Our approach addresses both requirements in the following ways.
Since processing high-speed networks has become mandatory for efficient network security, classification speed matters. The instance reduction contributes to reducing the classification running time substantially: KNN classification speed is proportional to the training database size, so the smaller the database, the smaller the running time. It is important to note that this gain is gradually eroded by the incremental learning process, since it constantly adds new knowledge to the training database. The experiments have shown that, when processing data of the size of the original training database, the added knowledge represents less than 2.5% of that size; the advantage obtained by the instance reduction therefore only disappears over time, once the system has processed more than 40 times the original data size. Even in this case, it is possible to run the instance selection module offline to reduce the training database size again.
The tests also show a substantial reduction of the error rate, which drops by a factor of about five. This result is due to the incremental learning process, which avoids repeating the same errors by enhancing the training database. Moreover, the false negative error rate is reduced by a factor ranging from five to more than ten. This is very important in the intrusion detection context, since the false negative rate represents the proportion of attacks undetected by the system.
The author believes that the simplicity of the approach, using robust algorithms such as KNN, CNN and genetic algorithms, makes it remarkably efficient in terms of speed and accuracy. Moreover, it can be applied to other fields such as robotics and control. The author also believes that this approach could provide a robust solution to several challenges facing incremental learning, such as autonomous learning and the differentiation between noise and drift (J. Gama et al., 2013).
References
Aci M., Inan C. & Avci M., 2010. A hybrid classification method of k nearest neighbor, Bayesian methods and genetic algorithm, Expert Systems with Applications, 37, pp. 5061-5067.
Axelsson S., 2000. Intrusion Detection Systems: A Survey and Taxonomy, Engineering, Vol. 52 (99-15), pp. 541-554.
Beringer J., 2007. Efficient instance-based learning on data streams, Intelligent Data Analysis, Vol. 11, No. 6, pp. 627-650.
Deepika D. and Richhariya V., 2012. Intrusion detection with KNN classification and DS-theory, International Journal of Computer Science and Information Technology and Security, Vol. 2 (2), pp. 274-281.
Gama J. et al., 2013. A Survey on Concept Drift Adaptation, ACM Computing Surveys, January 2013.
Garcia S., Cano J. R. and Herrera F., 2008. A memetic algorithm for evolutionary prototype selection: A scaling up approach, Pattern Recognition, Vol. 41, pp. 2693-2709.
Gil-Pita R. & Yao X., 2007. Using a Genetic Algorithm for Editing k-Nearest Neighbor Classifiers, IDEAL 2007, LNCS 4881, pp. 1141-1150.
Hart P., 1968. The Condensed Nearest Neighbor Rule, IEEE Transactions on Information Theory, 14, pp. 515-516.
Liao Y. and Vemuri V.R., 2002. Use of K-Nearest Neighbor classifier for intrusion detection, Computers and Security, Vol. 21 (5), pp. 439-448.
Liu H., Motoda H., 2002. On issues of instance selection, Data Mining and Knowledge Discovery, Vol. 6 (2), pp. 115-130.
Miloud-Aouidate A. et al., 2013. IDS false alarm reduction using an instance selection KNN memetic algorithm, International Journal of Metaheuristics, Inderscience, Vol. 2 (4), pp. 333-252.
Murty M. N., Babu T. R., 2001. Comparison of genetic algorithm based instance selection schemes, Pattern Recognition, 34, pp. 523-525.
Nikolaidis et al., 2011. A class boundary preserving algorithm for data condensation, Pattern Recognition, 44, pp. 704-715.
Read J. et al., 2012. Batch-Incremental versus Instance-Incremental Learning in Dynamic and Evolving Data, Advances in Intelligent Data Analysis XI, LNCS 7619, pp. 313-323.
Shi B., Yi W. & Zhang'ou W., 2007. A fast KNN algorithm applied to web text categorization, Journal of The China Society for Scientific and Technical Information, 26 (1), pp. 60-64.
Suguna N. & Thanushkodi K., 2010. An Improved k-Nearest Neighbor Classification Using Genetic Algorithm, International Journal of Computer Science Issues, 7, pp. 18-21.
Yu W. & Zheng'ou W., 2007. A fast KNN algorithm for text categorization, Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, pp. 3436-3441.
Zhang P. et al., 2011. Enabling Fast Lazy Learning for Data Streams, IEEE International Conference on Data Mining, ICDM '11, pp. 932-941.
Zukhba A. V., 2010. NP-Completeness of the Problem of Instance Selection in the Nearest Neighbor Method, Pattern Recognition and Image Analysis, 20, pp. 484-494.
