
International Journal of Computational Intelligence and Information Security, April 2013, Vol. 4 No. 4, ISSN: 1837-7823
A Novel Graph Based Clustering Approach for Network Intrusion Detection

D. P. Jeyepalan1, E. Kirubakaran2

1 Research Scholar, School of Computer Science, Engineering and Applications, Bharathidasan University, Tiruchirappalli, Tamilnadu, India.
2 Additional General Manager, SSTP (Systems), Bharat Heavy Electricals Ltd, Tiruchirappalli, India.
Abstract

Detecting the vulnerabilities in a network plays a vital role in the prevention of intrusions in a system. This paper describes a cluster based mechanism for detecting vulnerabilities and, in turn, intrusions. The network is analyzed and a graph representing the entire network is constructed. This graph is passed to a clustering algorithm that clusters the nodes. Since this clustering is essentially a process of edge elimination, neither the number of clusters nor their shape needs to be specified in advance. The process helps in sorting out the outliers, which are the nodes most vulnerable to attack. Analysis shows that our process achieves an accuracy rate of 0.91375.

Keywords: Intrusion detection; clustering; graph based clustering


1. Introduction
Due to the increase in the number of network related transactions, network related crimes have also shown a rapid increase. These crimes take the form of attacking a target system directly or stealing information during online transactions. In either form, a computer forms the base of the attack; this system is called the compromised node. Detecting these compromised nodes is a very important issue in intrusion detection. A compromised node has the ability to perform malicious activities such as sniffing packets, performing Denial of Service (DoS) attacks, transmitting viruses and worms and, much worse, converting other computers into compromised nodes. All other systems within the network become vulnerable to attacks due to the presence of a compromised node. Hence it becomes mandatory to blacklist these nodes and either remove them from the network or monitor their activities for malicious behavior and restore them to their initial state.
The increased use of data mining techniques in the area of intrusion detection has led to a growing number of specialized algorithms for detecting intrusions. These include association rule mining, frequency scene rule mining, classification, and clustering algorithms. The first three belong to the supervised learning category: they require training datasets describing all behaviors, and only after training can the system detect anomalies. Clustering algorithms, in contrast, come under the unsupervised learning category. They do not depend on training data; instead, they use similarity grouping to recognize the odd one out.
The rest of this paper is organized as follows. Section 2 describes the related works and section 3 describes the
overall system architecture and an outline of the complete functioning of the system. Section 4 describes the actual
intrusion detection mechanism in detail, section 5 shows the obtained results and their analysis and section 6
provides the conclusion.


2. Related Works
In general, anomaly detection focuses mainly on monitoring and recording user behavior, which helps distinguish unusual behavior from normal behavior. Any behavior that deviates from the normal is labeled as an anomaly or intrusion. Typical conventional anomaly detection studies [1, 2, 3] have used statistical approaches. The statistical methods have the strong point that the size of a profile for real-time intrusion detection can be minimized. However, statistical operators alone cannot provide the best results, and false positives cannot be avoided. Furthermore, statistical methods cannot handle infrequent but periodically occurring activities.
Leonid Portnoy [4] introduced a clustering algorithm to detect both known and new intrusion types without the need to label the training data, using a simple variant of single-linkage clustering to separate intrusion instances from normal instances. Though this algorithm overcomes the dependency on a predefined number of clusters, it requires a predefined clustering width W, which is not always easy to find. The assumption that "the normal instances constitute an overwhelmingly large portion (>98%)" is also too strong. In [5], Qiang Wang introduced Fuzzy-Connectedness Clustering (FCC) for intrusion detection, based on the concept of fuzzy connectedness introduced by Rosenfeld in 1979. FCC can detect known intrusion types and their variants, but it is difficult to find a generally accepted definition of the fuzzy affinity used by FCC.

3. System Architecture
The process of Intrusion detection can be performed as described in the Figure 1.

Figure 1: Intrusion Detection Mechanism

The initial phase creates the graph used for processing. Every system in the network is considered a node, and every connection between systems is marked as an edge. A complete graph is created along with weight details for further analysis. The graph is analyzed using the weight values, and all related nodes are grouped together to form clusters [10]. After the clusters are formed, cluster analysis [6] is performed, in which every cluster is checked for outlying items, i.e., items that are farthest from the cluster centre. These are isolated and considered to be the vulnerable nodes. The nodes are then monitored, and if traffic anomalies are detected, the node is labeled as an intruder.


4. Clustering Based Intrusion Detection


The clustering based intrusion detection is performed in four phases: graph creation, cluster creation, cluster analysis and monitoring.
The graph creation phase initially marks all the nodes and edges. All systems that come under the considered network form the nodes of the graph, and the connections between these systems form its edges. Since all the systems have two way connections, the edges in the graph represent two way paths. The distances between the nodes form the weights of the graph.
Let G = (V, E) be a graph where V and E are, respectively, its set of nodes and edges, and let n be the number of nodes of G. Each edge is represented by a pair (i, j), where i and j are nodes from V. Let A = [a_ij]_{n×n} be the adjacency matrix of graph G. Each element of the adjacency matrix has a binary value representing the relationship between two nodes: a_ij = 1 if nodes i and j are adjacent, i.e., if there is an edge linking node i to node j, and a_ij = 0 otherwise. This paper deals with weighted graphs. Let W = [w_ij]_{n×n} be the weight matrix for the edges of a weighted graph G. The element w_ij of W is defined as the weight of the edge that links node i to node j; if there is no edge between nodes i and j, then w_ij = 0. The degree of a node i, deg_i, from an unweighted or weighted graph, is the number of its adjacent objects. It is given by

deg_i = Σ_{j=1}^{n} a_ij
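As a concrete illustration, the adjacency matrix and degree computation above can be sketched as follows. The 4-node network is invented for illustration:

```python
import numpy as np

# Adjacency matrix A = [a_ij]: a_ij = 1 if an edge links nodes i and j.
# Connections are two-way, so A is symmetric.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# deg_i = sum_j a_ij : number of neighbors of each node.
degrees = A.sum(axis=1)
print(degrees.tolist())  # [2, 2, 3, 1]
```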

A measure that evaluates the clustering tendency in graphs is known as the clustering coefficient. It is based on the analysis of three node cycles (triangles) around a node i. A formulation of this measure for unweighted graphs is given by

c_i = ( 2 Σ_{j=1}^{n-1} Σ_{k=j+1}^{n} a_ij a_jk a_ik ) / ( deg_i (deg_i − 1) )

Note that Σ_{j=1}^{n-1} Σ_{k=j+1}^{n} a_ij a_jk a_ik corresponds to the number of triangles around node i, while the degree deg_i indicates the total number of neighbors of node i. The denominator measures the maximum possible number of edges that could exist between the vertices within the neighborhood.
This measure evaluates the tendency of the nearest neighbors of node i to be connected to each other.
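The clustering coefficient can be sketched directly from this definition; the example matrix below is an invented toy network:

```python
import numpy as np

def clustering_coefficient(A, i):
    """c_i = 2 * (# triangles through i) / (deg_i * (deg_i - 1))."""
    deg = A[i].sum()
    if deg < 2:
        return 0.0  # coefficient undefined for fewer than 2 neighbors
    n = A.shape[0]
    # Count pairs of neighbors (j, k) of i that are themselves connected.
    tri = 0
    for j in range(n):
        for k in range(j + 1, n):
            tri += A[i, j] * A[j, k] * A[i, k]
    return 2.0 * tri / (deg * (deg - 1))

A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
print(clustering_coefficient(A, 0))  # 1.0: both neighbors of node 0 are connected
print(clustering_coefficient(A, 2))  # node 2 has 3 neighbors, only 1 connected pair
```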

Figure 2: Sub-graph Creation

After constructing the graph, clustering is performed. The clustering process divides the graph into several subgraphs. Clustering [7][8] is performed by providing a threshold value τ, which is calculated using the formula

τ = min + (max − min) × CP

where min and max represent the minimum and maximum edge weight values respectively, and CP represents the Cluster Precision. An edge is cut from the graph if its weight is greater than the threshold τ. This results in the formation of subgraphs.
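A minimal sketch of this threshold-based edge cut, under two assumptions not spelled out in the text: min and max are taken over the nonzero edge weights, and the subgraphs are the connected components that remain after the heavy edges are removed. The weight values are invented:

```python
import numpy as np

def threshold_cluster(W, cp):
    """Cut edges heavier than tau = min + (max - min) * cp, then return
    the resulting sub-graphs as lists of node indices."""
    n = W.shape[0]
    weights = W[W > 0]  # assumption: zero entries mark absent edges
    tau = weights.min() + (weights.max() - weights.min()) * cp
    kept = (W > 0) & (W <= tau)  # edges surviving the cut
    # Collect connected components with a depth-first search.
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(u for u in range(n) if kept[v, u])
        clusters.append(sorted(comp))
    return clusters

W = np.array([
    [0, 1, 9, 0],
    [1, 0, 0, 0],
    [9, 0, 0, 2],
    [0, 0, 2, 0],
], dtype=float)
# tau = 1 + (9 - 1) * 0.5 = 5, so the weight-9 edge is cut.
print(threshold_cluster(W, 0.5))  # [[0, 1], [2, 3]]
```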
The cluster analysis phase detects the probable outliers in the subgraphs. The following definitions are used for outlier detection.
For any positive integer k, the k-distance of an object p, denoted k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that:
for at least k objects o′ ∈ D \ {p}, it holds that d(p, o′) ≤ d(p, o); and
for at most k−1 objects o′ ∈ D \ {p}, it holds that d(p, o′) < d(p, o).
Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance:

N_{k-distance}(p) = { q ∈ D \ {p} | d(p, q) ≤ k-distance(p) }

These objects q are called the k-nearest neighbors of p.
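The k-distance and k-distance neighborhood can be illustrated on one-dimensional data (the values are invented; note that the neighborhood may contain more than k objects when distances tie):

```python
def k_distance(p, data, k):
    """Distance from p to its k-th nearest other object."""
    dists = sorted(abs(p - o) for o in data if o != p)
    return dists[k - 1]

def k_neighborhood(p, data, k):
    """All objects no farther from p than k-distance(p)."""
    kd = k_distance(p, data, k)
    return [q for q in data if q != p and abs(p - q) <= kd]

D = [1.0, 2.0, 3.0, 10.0]
print(k_distance(2.0, D, 2))      # 1.0: both 1.0 and 3.0 are at distance 1
print(k_neighborhood(2.0, D, 2))  # [1.0, 3.0]
```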
Given the k-distance of p, consider p as the centre of a circle with radius k-distance(p); all objects in this circle form the k-distance neighborhood of p, and let p′ be the centre of mass of this circle. The Local Deviation Rate is then defined as:

LDR_k(p) = dis(p, p′) / |N_{k-distance}(p)|

where dis(p, p′) is the distance between object p and the centre of mass p′.
Given the k-distance neighborhood of p and the LDR, the Local Deviation Coefficient is defined as:

LDC_k(p) = ( Σ_{o ∈ N_{k-distance}(p)} LDR_k(o) ) / |N_{k-distance}(p)|

Intuitively, LDC aggregates the LDR values over the k-distance neighborhood of p. The coefficient reflects the degree of dispersion of an object's neighborhood: a greater LDC value means a higher probability that the object is an outlier, while a low LDC value indicates that the density of the object's neighborhood is high, so it is unlikely to be an outlier. All probable outliers are shortlisted in this phase. After this phase comes the monitoring phase: all shortlisted nodes considered vulnerable to attacks are monitored for attacks or abnormal activities, and the traffic flow to and from these nodes is monitored. If any abnormalities are discovered, a cleanup is performed on the node to remove the vulnerabilities.
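A self-contained sketch of these deviation measures on one-dimensional toy data. The reading of LDR as the distance from p to the centre of mass of its k-distance neighborhood, normalised by the neighborhood size, is an assumption recovered from the text:

```python
def k_neighborhood(p, data, k):
    dists = sorted(abs(p - o) for o in data if o != p)
    kd = dists[k - 1]                       # k-distance(p)
    return [q for q in data if q != p and abs(p - q) <= kd]

def ldr(p, data, k):
    """Local Deviation Rate: distance from p to the centre of mass p'
    of its k-distance neighborhood, divided by the neighborhood size."""
    nbrs = k_neighborhood(p, data, k)
    centre = sum(nbrs) / len(nbrs)          # centre of mass p'
    return abs(p - centre) / len(nbrs)

def ldc(p, data, k):
    """Local Deviation Coefficient: mean LDR over the neighborhood."""
    nbrs = k_neighborhood(p, data, k)
    return sum(ldr(o, data, k) for o in nbrs) / len(nbrs)

D = [1.0, 1.2, 1.4, 1.6, 8.0]
# The isolated value 8.0 lies far from the centre of mass of its
# neighborhood, so its LDR dominates the others.
print(max(D, key=lambda p: ldr(p, D, 2)))  # 8.0
```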


Figure 3: Intrusion Detection mechanism

5. Result Analysis
The current process is evaluated with various sets of data containing different numbers of data items, and the obtained values are recorded in a confusion matrix.
Table 1: Confusion Matrix

Actual \ Predicted | Positive | Negative
Positive           | TP       | FN
Negative           | FP       | TN

where TP = True Positive, FN = False Negative, FP = False Positive and TN = True Negative.
Two performance measures, sensitivity and specificity, are used for evaluating the results.
Sensitivity is the accuracy on the positive instances (equivalent to the True Positive Rate, TPR):

Sensitivity = TP / (TP + FN)

where TP is True Positive and FN is False Negative.
Specificity is the accuracy on the negative instances (its complement is the False Positive Rate: FPR = 1 − Specificity):

Specificity = TN / (TN + FP)

where TN is True Negative and FP is False Positive.
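Both measures reduce to simple ratios of the confusion-matrix counts; a quick sketch with invented counts:

```python
def sensitivity(tp, fn):
    """Accuracy on positives: the True Positive Rate."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Accuracy on negatives: equals 1 - FPR."""
    return tn / (tn + fp)

# Invented counts for illustration.
tp, fn, fp, tn = 90, 10, 5, 95
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.95
```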

Figure 4: A sample confusion matrix set with TPR and FPR


The simulation is conducted with the KDD-Cup 99 dataset. The process was paused at regular intervals to record the values of TP, FP, TN and FN, which form the basis for calculating the TPR and FPR. These readings are tabulated and the ROC curve [9] is plotted (Figure 5).

From Figure 5, we can see that during the initial stages, when the number of entries is minimal, the plots fall near the (0, 0) and (0, 1) points. As the number of entries increases, the plotted points cluster towards the northwest corner, above the diagonal. This shows that the process provides a high level of accuracy, almost reaching the ideal point (0, 1).
Figure 5: ROC Plot (TPR versus FPR)

Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that
are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. Hence we
can use this measure to find the relevance of the readings.

Figure 6: PR Curve (Precision versus Recall)
Usually, precision and recall [9] scores are not discussed in isolation. Instead, either the values of one measure are compared at a fixed level of the other, or both are combined into a single measure such as the F-measure, the weighted harmonic mean of precision and recall:

F = 2 · (precision · recall) / (precision + recall)

This is also known as the F1 measure, because recall and precision are evenly weighted. It is a special case of the general Fβ measure (for non-negative real values of β):

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Two other commonly used F measures are the F2 measure, which weights recall higher than precision, and the F0.5 measure, which puts more emphasis on precision than recall.
The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure.



E = 1 − 1 / ( α/P + (1 − α)/R )

Their relationship is Fβ = 1 − E, where α = 1/(1 + β²).

Figure 7: Precision, Recall and F-Measure Sample values
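The F1, F2 and F0.5 variants above differ only in the choice of β; a small sketch with invented precision and recall values:

```python
def f_beta(precision, recall, beta=1.0):
    """General F-measure: F_beta = (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.6
print(round(f_beta(p, r), 4))        # 0.6857  F1: harmonic mean
print(round(f_beta(p, r, 2.0), 4))   # 0.6316  F2: weights recall higher
print(round(f_beta(p, r, 0.5), 4))   # 0.75    F0.5: weights precision higher
```

Since precision exceeds recall here, emphasising precision (β = 0.5) raises the score and emphasising recall (β = 2) lowers it.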

6. Conclusion and Discussions


Discovering attacks in a network plays an important role in the management of a network. The attacks take place by
exploiting the vulnerabilities in a network node. Faster detection of these vulnerabilities helps in better network
maintenance. Analysis shows that our proposed system provides faster and more accurate detection rates when
compared to the existing methodologies [1][2][3][4][5].
Figure 8: Total number of nodes present versus nodes detected for vulnerabilities


Figure 8 shows the detection rate of our algorithm; 15% of the total nodes show abnormalities.
Figure 9: Number of nodes detected for vulnerabilities versus actual number of nodes attacked

Figure 9 shows the number of nodes detected and monitored for vulnerabilities versus the actual number of nodes attacked. We can see that our algorithm has managed to detect most of the vulnerable nodes; our system shows a detection percentage of 84.91729.
The F-Measure of our results is 0.84833, and we obtain an average accuracy rate of 0.91375. Further, our proposed structure reduces the number of nodes that need to be monitored, and hence a reduction in the amount of processing is observed. Since the number and shape of the clusters are not predefined, any type of network can be used for the clustering process. The current process can be further fine-tuned by incorporating artificial intelligence into the system, which could help create an evolutionary system that learns new types of attacks and evolves over time.

7. References

[1] Harold S. Javitz and Alfonso Valdes, "The NIDES Statistical Component Description and Justification," Annual Report, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, March 1994.
[2] Phillip A. Porras and Peter G. Neumann, "EMERALD: Event Monitoring Enabling Responses to Anomalous Live Disturbances," 20th NISSC, October 1997.
[3] H. S. Javitz and A. Valdes, "The SRI IDES Statistical Anomaly Detector," IEEE Symposium on Research in Security and Privacy, May 1991.
[4] L. Portnoy, E. Eskin and S. Stolfo, "Intrusion Detection with Unlabeled Data Using Clustering," ACM CSS Workshop on Data Mining Applied to Security, pp. 5-8, ACM Press, Philadelphia, 2001.
[5] W. Qiang and M. Vasileios, "A Clustering Algorithm for Intrusion Detection," SPIE Conference on Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security, Florida, vol. 5812, pp. 31-38, 2005.
[6] Joshua Oldmeadow, Siddarth Ravinutala and Christopher Leckie, "Adaptive Clustering for Network Intrusion Detection," PAKDD 2004, LNAI 3056, pp. 255-259, Springer-Verlag, Berlin Heidelberg, 2004.
[7] Xiong Jiajun, Li Qinghua and Tu Jing, "A Heuristic Clustering Algorithm for Intrusion Detection Based on Information Entropy," Wuhan University Journal of Natural Sciences, Vol. 11, No. 2, pp. 355-359, 2006.
[8] Maria C. V. Nascimento and Andre C. P. L. F. Carvalho, "A Graph Clustering Algorithm Based on a Clustering Coefficient for Weighted Graphs," Journal of the Brazilian Computer Society, 17: 19-29, DOI 10.1007/s13173-010-0027, 2011.
[9] Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC Curves," Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.
[10] Sang-Hyun Oh and Won-Suk Lee, "Anomaly Intrusion Detection Based on Dynamic Cluster Updating," in Z.-H. Zhou, H. Li and Q. Yang (Eds.), PAKDD 2007, LNAI 4426, pp. 737-744, Springer-Verlag, Berlin Heidelberg, 2007.

