

to day life, while attacks on networks have also increased rapidly. In recent years the Intrusion Detection System (IDS) has been used as an effective tool to reduce attacks. Although researchers have proposed many different systems, they remain deficient in detecting new types of attack. To improve the detection accuracy and reduce the false alarm rate, we first find the outliers. Many researchers have recently used data mining techniques in developing IDS. In this paper, we use a Spatio-Temporal Outlier Detection algorithm as the first level of detection, and C4.5 and fuzzy SVM classifiers as the next level for effective classification. We use the KDDcup99 data set for our experiment. The experimental results show that the proposed system increases the detection accuracy and greatly reduces the false alarm rate.

Key words- Outlier Detection, KDDcup99 data set, Data mining, Spatio-Temporal data, C4.5, SVM.

I. INTRODUCTION

In the last few decades, computers and their technology have come into use in every industry. Intrusion through networks is the modern form of war, so it falls to the computing community to detect and prevent intrusion efficiently. Intrusion Detection (ID) is commonly defined as a security management system for computers and networks. In other words, Intrusion Detection is the act of detecting actions that attempt to compromise the confidentiality, integrity or availability of a resource. An Intrusion Detection System (IDS) is a software application that performs this detection automatically: it monitors network and system activities for malicious threats or policy violations and reports them to the concerned management station. Outlier detection can be classified into five categories:

Distribution based, Clustering based, Depth based, Distance based and Density based. Outliers can be defined as observations which appear to be inconsistent with the remainder of the data set. Outlier detection is a data mining technique, like classification, clustering and association rules. A Spatial Outlier (S-outlier) is an object whose non-spatial attribute value is significantly different from the values of its spatial neighbours. Temporal outlier detection is also considered in our detection. A Temporal Outlier (T-outlier) is an object whose non-spatial attribute value is significantly different from those of other objects in its temporal neighbourhood. Here we combine the S-outlier and the T-outlier to form Spatio-Temporal Outlier detection (ST-outlier). This type of detection leads to the discovery of unexpected, interesting and implicit knowledge.

II. RELATED WORKS

Distribution based approaches use standard statistical distributions. Many tests are required to determine which model fits an arbitrary data set, so these approaches tend to produce unsatisfactory results and are very costly.

Clustering based approaches detect outliers as by-products. Some of the clustering algorithms are CLARANS, DBSCAN and CURE. Depth based methods are based on computational geometry and compute different layers. This method is usually applied for spatial outlier detection. Distance based approaches use a distance metric to measure the distance among the data points. Their output may differ depending on the parameters chosen for the data set. Density based approach was proposed by al. This method uses the Local Outlier Factor (LOF). Its drawback is that it does not consider temporal aspects. The concept of IDS was first suggested in a technical report by Anderson. He suggested that statistical methods should be applied to analyze user behavior and detect those masqueraders who access system resources. Su-Yun Wu et al. compared the accuracy, detection rate and false alarm rate of C4.5 and SVM, and showed that C4.5 has higher accuracy than SVM, but that SVM is better in false alarm rate. Mahbod Tavalle presented a data set for intrusion detection systems. His statistical analysis of the entire KDD data set revealed two important issues in the data set

which affect the performance of the evaluated systems and lead to a poor evaluation of anomaly detection. However, S. Peddabachigari et al. suggested that the C4.5 algorithm is an efficient algorithm for classifying data sets and performs well in constructing decision trees.

The overall architecture of the proposed system consists of the following modules: the Spatio-Temporal Outlier Detection module, the Classification module and the Detection module. The KDDcup99 dataset is given as input to outlier detection to separate normal and abnormal data. The normal data is first sent to the spatio-temporal outlier detection module, and its output is then sent to the C4.5 and SVM classifiers. A report on the attacked and non-attacked data is then sent to the administrator. Finally, C4.5 and SVM are compared.

KDDcup99 Dataset

The KDDcup99 set was used for the Third International Knowledge Discovery and Data Mining Tools Competition. There are approximately 4,940,000 records in the training data set, of which only 10% are provided, and there are 3,110,291 records in the test data set. Each network connection record has 41 network connection characteristics. The data types include nominal, binary and numeric. The training data contains 23 types of attacks, and the test data contains 37 types of attacks, more than the training data.








SPATIO TEMPORAL OUTLIER DETECTION

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Outlier detection for data mining is often based on distance measures, clustering and spatial methods. Spatial outlier detection concerns the spatial representation of the data, which specifies the memory location in a particular dataset. Temporal outlier detection concerns time, i.e. the date, time and year.

CLASSIFICATION MODULE

The KDDcup99 dataset is taken for analysis and, out of 41 attributes, 34 continuous valued attributes are selected. The 34 attributes are grouped into 5 different classes which form the fuzzy data. Definite and indefinite rules are generated using the maximum and minimum deviation values of the normal and attack attributes. Then we calculate the information gain for each attribute, and the attribute with the highest information gain is chosen as the splitting criterion for the decision tree construction. The Efficient C4.5 (EC4.5) algorithm is used for classification. It adopts the best among three strategies for computing the information gain of continuous attributes. All the strategies adopt a binary search of the threshold in the whole training set, starting from the local threshold computed at a node. The first strategy computes the local threshold using the C4.5 algorithm, which uses quick sort to sort the cases. The second strategy also uses the C4.5 algorithm but uses counting sort to sort the cases. The third strategy calculates the local threshold using a main memory version of the RainForest algorithm, which does not need sorting.

Support Vector Machines (SVM) are supervised learning machines that plot the training vectors in a high dimensional feature space, labelling each vector by its class. Here the data is taken to be linearly separable. The linear SVM searches for the hyperplane with the largest margin. Computing such a hyperplane to separate the data points leads to a quadratic optimization problem. There are two main reasons why we have used SVM in this paper for intrusion detection. The first is its performance in terms of execution speed and the second is its scalability. SVM does not depend on the dimensionality of the feature space.
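The information gain criterion described for the decision tree can be illustrated with a minimal sketch. It is not the EC4.5 implementation: it assumes plain information gain (not gain ratio), candidate thresholds at midpoints between consecutive sorted values, and a simple linear scan instead of the three strategies above. The function names `entropy` and `best_threshold` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) with the highest information gain
    for one continuous attribute, or (None, 0.0) if no split helps."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best
```

Running `best_threshold` on every continuous attribute and keeping the attribute with the largest gain gives the splitting criterion for one tree node under these assumptions.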

ST-OUTLIER DETECTION ALGORITHM

In our project, a three step approach is used to find the spatio-temporal outliers. The steps are:
1. Clustering
2. Checking spatial neighbours
3. Checking temporal neighbours
They are explained below in detail.

CLUSTERING

A clustering algorithm should satisfy the following requirements:
1. Discovery of clusters with arbitrary shape
2. Good efficiency on large databases
3. Some heuristics to determine the input parameters
The DBSCAN algorithm satisfies all the requirements, but it does not support temporal aspects and it cannot detect outliers when clusters have different densities. The DBSCAN algorithm therefore undergoes two modifications to overcome these problems. In the first modification, the tree is traversed to find both the spatial and the temporal neighbours of objects within the given radius. The second modification finds the outliers when clusters have different densities.

ALGORITHM PROCEDURE

The DBSCAN algorithm needs 2 parameters as input to define the density: EPS and MinPts. EPS is a radius value based on a distance metric such as Manhattan or Euclidean. MinPts specifies the minimum number of points that must occur within EPS. Our algorithm needs 4 parameters: EPS1, EPS2, MinPts and a fourth threshold parameter. EPS1 and EPS2 are the distance parameters for spatial and non-spatial attributes. The fourth parameter is used to prevent the discovery of combined clusters when there are only small differences in the values of neighbouring locations.

In our algorithm we use the Euclidean formula twice, to calculate the two distance measures used with EPS1 (for spatial values) and EPS2 (for non-spatial values):

$$\mathrm{Dist}(i,j) = \left( |x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \dots + |x_{in}-x_{jn}|^2 \right)^{1/2}$$

where $i = (x_{i1}, x_{i2}, \dots, x_{in})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jn})$ are two n-dimensional data objects.
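As a rough illustration of how the Euclidean formula is applied twice, the sketch below retrieves the neighbours of one object that lie within EPS1 on the spatial values and within EPS2 on the non-spatial values. It assumes each object is stored as a pair of tuples (spatial values, non-spatial values) and uses a linear scan rather than the tree traversal mentioned earlier. The names `euclidean` and `neighbours` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two n-dimensional data objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbours(objects, index, eps1, eps2):
    """Indices of objects within eps1 of objects[index] on the spatial
    values and within eps2 on the non-spatial values.

    Each object is assumed to be a (spatial, non_spatial) pair of tuples.
    """
    spatial, non_spatial = objects[index]
    return [
        j for j, (s, v) in enumerate(objects)
        if j != index
        and euclidean(spatial, s) <= eps1
        and euclidean(non_spatial, v) <= eps2
    ]
```

An object with fewer than MinPts neighbours under both radii would then be a candidate for the outlier checks that follow.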

CHECKING SPATIAL NEIGHBOURS

In the above step, potential outliers were detected while the data was being clustered. In this step the outliers found are checked to verify whether they are actually S-outliers or not. This verification needs correct information about the data. If we do not have prior knowledge about the data, we follow a neural network method to obtain knowledge about it. The method used to verify an S-outlier is as follows. We have a database of n data objects D = {o1, o2, ..., on}. Assume that object O is detected as a potential outlier in clustering. Then the average value of the spatial neighbours of O within the EPS1 radius is defined as

$$A = \frac{O_{neigh_1} + O_{neigh_2} + \dots + O_{neigh_m}}{m}$$

where m is the number of spatial neighbours of O within the EPS1 radius, and the standard deviation of the object O is defined as

$$\sigma = \sqrt{\frac{(O_{neigh_1}-A)^2 + (O_{neigh_2}-A)^2 + \dots + (O_{neigh_m}-A)^2}{m}}$$

The object O is classified as an S-outlier if it is outside the interval [L, U], where

$$L = A - K_0\,\sigma, \qquad U = A + K_0\,\sigma$$

and $K_0 > 1$ is some preselected value.

CHECKING TEMPORAL NEIGHBOURS

Objects are considered temporal neighbours if their values are observed in consecutive time units, such as consecutive days in the same year or the same day in consecutive years. If the characteristic value of an S-outlier does not differ significantly from those of its temporal neighbours, it is not an ST-outlier. Otherwise it is confirmed to be an ST-outlier.

MODULE 2

SUPPORT VECTOR MACHINE

SVM is a set of related supervised learning methods used for classification and prediction. The main goal of SVM is to build a hyperplane which separates tuples that belong to the two classes -1 and +1. The larger the distance from the hyperplane to the nearest training data points, the better the separation between the two classes. The separating hyperplane can be written as

$$W \cdot X + b = 0$$

where W = {w1, w2, ..., wn} are the weight vectors for the n attributes A = {A1, A2, ..., An}, b is a scalar and X = {x1, x2, ..., xn} are the values of the attributes.
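The S-outlier test from the CHECKING SPATIAL NEIGHBOURS step can be sketched as below. This is only an illustration under stated assumptions: A is taken as the arithmetic mean of the neighbour values, the standard deviation is the population form (dividing by m), and the default `k0=2.0` is an arbitrary example of a preselected value greater than 1. The function name `is_s_outlier` is hypothetical.

```python
import math


def is_s_outlier(value, neighbour_values, k0=2.0):
    """Return True if `value` lies outside [A - k0*sigma, A + k0*sigma],
    where A and sigma are the mean and (population) standard deviation
    of the spatial neighbours' values."""
    m = len(neighbour_values)
    if m == 0:
        raise ValueError("at least one spatial neighbour is required")
    mean = sum(neighbour_values) / m
    sigma = math.sqrt(sum((v - mean) ** 2 for v in neighbour_values) / m)
    lower, upper = mean - k0 * sigma, mean + k0 * sigma
    return value < lower or value > upper
```

A potential outlier that passes this test would then be compared with its temporal neighbours before being confirmed as an ST-outlier.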

Figure 2: EC4.5 and SVM

RESULTS AND ANALYSIS

False positive (FP): Also known as a false alarm. Corresponds to the number of instances detected as attacks that are in fact normal.
False negative (FN): Corresponds to the number of instances detected as normal that are actually attacks. In other words, these attacks are the target of intrusion detection systems.
True positive (TP): Corresponds to the number of instances detected as attacks that are in fact attacks.
True negative (TN): Corresponds to the number of instances detected as normal that are actually normal.

Detection rate: Detection rate refers to the percentage of detected attacks among all the attack data, and is defined as follows:
Detection rate = TP / (TP + FN) * 100

False alarm rate: False alarm rate refers to the percentage of normal data which is wrongly recognized as attack, and is defined as follows:
False alarm rate = FP / (FP + TN) * 100
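The two evaluation measures can be computed from the four counts as in the short sketch below. The detection rate follows the verbal definition (detected attacks among all attack data), so its denominator is TP + FN. The label value `"attack"` and the function names are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, attack_label="attack"):
    """Count (TP, FP, TN, FN), treating `attack_label` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == attack_label:
            if a == attack_label:
                tp += 1   # detected attack, really an attack
            else:
                fp += 1   # detected attack, actually normal
        else:
            if a == attack_label:
                fn += 1   # detected normal, actually an attack
            else:
                tn += 1   # detected normal, really normal
    return tp, fp, tn, fn


def detection_rate(tp, fn):
    """Percentage of attack records that were detected."""
    return 100.0 * tp / (tp + fn)


def false_alarm_rate(fp, tn):
    """Percentage of normal records wrongly recognized as attacks."""
    return 100.0 * fp / (fp + tn)
```

Computing both rates for the EC4.5 and SVM outputs on the same test records allows the two classifiers to be compared directly.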

CONCLUSION

The paper proposes a three step approach to detect spatio-temporal outliers in large databases. The steps are clustering, checking spatial neighbours and checking temporal neighbours to identify the spatio-temporal outliers. We introduce a new outlier detection algorithm which, according to the performance test, is able to process very large data sets. We used classification trees and Support Vector Machines as two prominent data mining techniques to detect intrusion in the network. Our approach can give better accuracy and a lower false alarm rate with EC4.5 and SVM.