
A novel intrusion detection system based on hierarchical clustering and support vector machines

Presented by: Amal Walha & Selem Trabelsi

Proposed by: Mr Abbes Tarek


BIRCH hierarchical clustering algorithm


SVM with hierarchical clustering

Experimental results



A network intrusion detection system (NIDS) attempts to detect malicious activities on a network. A problem with NIDSs is that they detect only known network attacks. Many methods have been proposed for the design of NIDSs, such as:
- a decision tree based on C5, used by the KDD Cup 1999 winner
- Fuzzy rough C-means (FRCM), based on fuzzy set theory
- support vector machines (SVM)

This study proposes an SVM-based intrusion detection system that uses a hierarchical clustering algorithm to preprocess the KDD Cup 1999 dataset before SVM training.

BIRCH hierarchical clustering algorithm

Designed for very large data sets, where time and memory are limited:
- Incremental and dynamic clustering of incoming objects
- Only one scan of the data is necessary
- Constructs a tree called a clustering feature (CF) tree
- Does not need the whole data set in advance
- Able to handle noise effectively

Two key phases:
1. Scan the database to build an in-memory tree
2. Apply a clustering algorithm to cluster the leaf nodes

Clustering feature (CF)

Each node in the CF tree is composed of clustering features. A CF is a triplet that summarizes the information of a cluster.
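CF = (n, LS, SS)

where n is the number of data points P_i in the cluster, LS is their linear sum, and SS is their square sum:

$$LS = \sum_{i=1}^{n} P_i, \qquad SS = \sum_{i=1}^{n} P_i^2$$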

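Theorem to merge sub-clusters: CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)

Given a cluster of n instances P_1, ..., P_n we define:

Centroid:
$$P_0 = \frac{1}{n} \sum_{i=1}^{n} P_i = \frac{LS}{n}$$

Radius:
$$R = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert P_i - P_0 \rVert^2}$$

Euclidean distance between the centroids of two clusters:
$$D = \lVert P_0^{(1)} - P_0^{(2)} \rVert$$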
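The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `CF`, its methods, and the helper `centroid_distance` are hypothetical names chosen here, and LS and SS are stored per feature. The radius is computed from the triplet alone, using the identity R^2 = sum over features of (SS_j/n - (LS_j/n)^2).

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CF:
    """Clustering feature (n, LS, SS); hypothetical helper class for illustration."""
    n: int
    ls: List[float]  # per-feature linear sum
    ss: List[float]  # per-feature sum of squares

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "CF":
        return cls(1, [float(x) for x in p], [float(x) * float(x) for x in p])

    def __add__(self, other: "CF") -> "CF":
        # Additivity theorem: CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)
        return CF(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def radius(self) -> float:
        # Mean squared distance of the points to the centroid, from the triplet only.
        r2 = sum(q / self.n - (s / self.n) ** 2 for s, q in zip(self.ls, self.ss))
        return math.sqrt(max(r2, 0.0))  # clamp tiny negative rounding errors


def centroid_distance(a: CF, b: CF) -> float:
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(a.centroid(), b.centroid())
```

Because merging only adds the three components, a cluster can absorb a point or another cluster without revisiting the original data, which is what makes the single scan possible.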

A CF tree is a compact representation of a dataset

Each non-leaf node contains at most B entries of the form (CFi, child i).

Each leaf node represents a cluster that absorbs many data points and must satisfy the threshold requirement T.

The insertion procedure of CF tree has three steps.

Step 1: Identify the appropriate leaf

The algorithm computes the distances between the CF of the new entry, CFx, and each entry of the current node.

It starts from the root and traverses the CF tree recursively down to the leaf level, choosing at each level the child node whose centroid is closest to the new entry.

Step 2: Modify the leaf

Three possible options:
- The leaf absorbs the new entry without violating the radius threshold T.
- A new leaf entry is added for the new entry.
- If adding a new entry violates the branching factor B, the leaf node is split, choosing the farthest pair of entries as seeds.

Step 3: Modify entries on the path to the leaf

After inserting the new entry into a leaf node, the algorithm updates the CF information for each non-leaf entry along the path back to the root.
- If no leaf node splitting is invoked, the algorithm simply adds the CF to reflect the addition of the new entry.
- If leaf node splitting is invoked, the algorithm checks whether the parent node meets the branching factor constraint.
- If the parent node violates the B threshold, it is split in turn, and the same checks are repeated recursively back to the root.
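The leaf-level decision in Step 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the full algorithm: it handles a single leaf (a plain list of CF tuples), so the descent of Step 1 and the upward update of Step 3 are left out. The function names (`insert_into_leaf`, `cf_of`, `merge`) are hypothetical, and redistributing entries to the nearer seed after a split is an assumption about a detail the slides do not spell out.

```python
import math
from itertools import combinations


def cf_of(point):
    """CF triplet (n, LS, SS) of a single point."""
    return (1, [float(x) for x in point], [float(x) * float(x) for x in point])


def merge(a, b):
    """Additivity theorem applied to two CF triplets."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])


def centroid(cf):
    n, ls, _ = cf
    return [s / n for s in ls]


def radius(cf):
    n, ls, ss = cf
    r2 = sum(q / n - (s / n) ** 2 for s, q in zip(ls, ss))
    return math.sqrt(max(r2, 0.0))


def dist(a, b):
    return math.dist(centroid(a), centroid(b))


def insert_into_leaf(leaf, point, T, B):
    """Insert a point into a leaf; return a list with one leaf, or two after a split."""
    new = cf_of(point)
    if leaf:
        # Find the closest existing entry and try to absorb the point into it.
        i = min(range(len(leaf)), key=lambda k: dist(leaf[k], new))
        merged = merge(leaf[i], new)
        if radius(merged) <= T:          # option 1: absorb within threshold T
            leaf[i] = merged
            return [leaf]
    leaf.append(new)                     # option 2: add a new leaf entry
    if len(leaf) <= B:
        return [leaf]
    # Option 3: branching factor exceeded, split using the farthest pair as seeds.
    s1, s2 = max(combinations(range(len(leaf)), 2),
                 key=lambda ij: dist(leaf[ij[0]], leaf[ij[1]]))
    left, right = [], []
    for entry in leaf:
        nearer_first = dist(entry, leaf[s1]) <= dist(entry, leaf[s2])
        (left if nearer_first else right).append(entry)
    return [left, right]
```

In a full CF tree, a returned pair of leaves would be handed to the parent node, which adds an entry and may itself need to split, as described in Step 3.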

SVM with hierarchical clustering

Support vector machines (SVM)

An SVM is a supervised learning method. It performs classification by constructing an N-dimensional hyperplane that optimally separates the data into different categories.

SVMs have shown good results in data classification, but they cannot operate on such a large dataset: training fails because of insufficient memory.

Training complexity depends heavily on the size of the dataset.


SVM with hierarchical clustering

The proposed system combines a support vector machine with hierarchical clustering to build an intrusion detection system. The BIRCH algorithm transforms the dataset into a smaller, high-quality dataset before SVM training.
The process of the proposed system is as follows:
(1) Transform and scale the data.
(2) Construct CF trees for attacks and normal packets.
(3) Do feature selection for each type of attack.
(4) Train four SVM classifiers on the centroids of all entries in the leaf nodes of the CF trees.
(5) Combine the four SVM classifiers to build an intrusion detection system.

Data transformation and scaling

SVM requires each data point to be represented as a vector of real numbers, so every non-numerical attribute has to be transformed into numerical data first. For example, for the protocol_type attribute in KDD Cup 1999, tcp is replaced with 0, udp with 1, and icmp with 2. Data scaling prevents attributes with greater values from dominating attributes with smaller values. Each attribute is scaled linearly to the range [0, 1] by dividing every attribute value by that attribute's maximum value.
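A minimal sketch of the two preprocessing steps follows. The protocol codes are the ones given above; the function names, the record layout (protocol_type first, followed by numeric attributes), and the sample values are hypothetical and are not taken from the KDD Cup 1999 data. Dividing by the maximum maps values into [0, 1] only if they are non-negative, which this sketch assumes.

```python
# Codes for the protocol_type attribute, as given in the text.
PROTOCOL_CODES = {"tcp": 0, "udp": 1, "icmp": 2}


def encode(record):
    """Turn a record into a vector of real numbers.

    Assumes (for illustration only) that protocol_type is the first field
    and that the remaining fields are already numeric.
    """
    protocol, *rest = record
    return [float(PROTOCOL_CODES[protocol])] + [float(v) for v in rest]


def scale_by_max(rows):
    """Scale each attribute (column) by dividing by its maximum value."""
    maxima = [max(column) for column in zip(*rows)]
    # An all-zero column is left at 0.0 to avoid dividing by zero.
    return [[v / m if m else 0.0 for v, m in zip(row, maxima)] for row in rows]


# Toy records, invented for this example.
raw = [["tcp", 200, 0], ["udp", 50, 4], ["icmp", 100, 2]]
scaled = scale_by_max([encode(r) for r in raw])
```

After scaling, the second attribute (maximum 200) and the third (maximum 4) both lie in [0, 1], so neither dominates the distance computations used by BIRCH and the SVM.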


CF trees construction
The CF trees can be constructed with a single scan of the dataset. There are four kinds of attacks in KDD Cup 1999:
- DoS: denial of service
- R2L: illegitimate access from a remote machine
- U2R: acquiring the privileges of a super user
- Probing: port scanning
One CF tree is built for normal traffic.

In the KDD Cup 1999 data set there are 41 features. LS and SS contain the sum and the square sum of each of the features.

Feature selection
Not all features are needed in the design of a network intrusion detection system, so it is critical to identify the important features of network traffic data. Whether a feature is important is determined from the accuracy and the number of false positives of the system with and without the feature.
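The with-and-without comparison can be sketched as a leave-one-feature-out loop. This is an illustration under assumptions: the function `important_features`, the `evaluate` callback, and the decision rule (a feature counts as important if removing it lowers accuracy or raises the number of false positives) are hypothetical and are not stated in the slides.

```python
def important_features(features, evaluate):
    """Return the features whose removal hurts the system.

    `evaluate(subset)` is a caller-supplied function (hypothetical here) that
    trains and tests the system on the given feature subset and returns
    (accuracy, false_positives).
    """
    base_accuracy, base_false_positives = evaluate(features)
    important = []
    for feature in features:
        reduced = [f for f in features if f != feature]
        accuracy, false_positives = evaluate(reduced)
        # Assumed rule: worse accuracy or more false positives without the feature.
        if accuracy < base_accuracy or false_positives > base_false_positives:
            important.append(feature)
    return important


def toy_evaluate(subset):
    """Stand-in evaluator with invented numbers, used only to exercise the loop."""
    accuracy = 50 + 20 * ("a" in subset) + 10 * ("b" in subset)
    false_positives = 10 - 3 * ("b" in subset)
    return accuracy, false_positives


selected = important_features(["a", "b", "c"], toy_evaluate)
```

In the toy evaluator, feature "c" affects neither measure, so it is the one dropped.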


Experimental results

The datasets contained 24 training attack types classified into four kinds of attacks: DoS, U2R, R2L, and Probe.

The best performance of this system in terms of accuracy was 95.72%, with only a 0.73% false positive rate. The system showed superior performance on DoS and Probe attacks but performed poorly on U2R and R2L attacks, because the numbers of instances for these two attacks were too small in the original KDD Cup 1999 dataset.
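For reference, the two measures quoted above are usually computed from the confusion counts as in the sketch below. The slides do not give the exact formulas used in the study, so these standard definitions are an assumption, and the function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp, tn):
    """Fraction of normal instances wrongly flagged as attacks."""
    return fp / (fp + tn)
```

Here `tp` and `fn` count attack instances that were detected and missed, while `tn` and `fp` count normal instances that were passed and wrongly flagged.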

This table presents the original number of instances for each kind of attack and the number of instances produced by CF trees with different thresholds T.

Hierarchical clustering reduces the number of instances in the datasets. For example, with a CF tree (T=0.2) the number of DoS instances is very small compared to the original data set (only 271).


This table compares this system with the KDD Cup 1999 winner's system and other studies. This system provided the best detection rate for DoS and Probe attacks. The ESC-IDS showed the best detection rate for R2L attacks, and Multi-classifier showed the best detection rate for U2R attacks.
In terms of accuracy, this system achieved the best performance.

As shown in Table , the IDS detection rate of this study for new attacks is only 39.04%, and the worst detection is on new R2L attacks.




Many studies of NIDSs have applied SVMs because SVMs are well known for their generalization performance. This study proposed an SVM-based network intrusion detection system with BIRCH hierarchical clustering for data preprocessing. BIRCH hierarchical clustering provides high-quality, abstracted, and reduced datasets for SVM training. The resulting SVM classifiers showed better performance than SVM classifiers trained on the original, redundant dataset.