You are on page 1of 8

Pattern Recognition 43 (2010) 222 -- 229

Contents lists available at ScienceDirect

Pattern Recognition
journal homepage: w w w . e l s e v i e r . c o m / l o c a t e / p r

A triangle area based nearest neighbors approach to intrusion detection
Chih-Fong Tsai ∗ , Chia-Ying Lin
Department of Information Management, National Central University, Taiwan

A R T I C L E

I N F O

A B S T R A C T

Article history: Received 3 December 2008 Received in revised form 8 March 2009 Accepted 24 May 2009 Keywords: Intrusion detection Machine learning Triangle area k-means k-nearest neighbors Support vector machines

Intrusion detection is a necessary step to identify unusual access or attacks to secure internal networks. In general, intrusion detection can be approached by machine learning techniques. In literature, advanced techniques by hybrid learning or ensemble methods have been considered, and related work has shown that they are superior to the models using single machine learning techniques. This paper proposes a hybrid learning model based on the triangle area based nearest neighbors (TANN) in order to detect attacks more effectively. In TANN, the k-means clustering is firstly used to obtain cluster centers corresponding to the attack classes, respectively. Then, the triangle area by two cluster centers with one data from the given dataset is calculated and formed a new feature signature of the data. Finally, the k-NN classifier is used to classify similar attacks based on the new feature represented by triangle areas. By using KDD-Cup '99 as the simulation dataset, the experimental results show that TANN can effectively detect intrusion attacks and provide higher accuracy and detection rates, and the lower false alarm rate than three baseline models based on support vector machines, k-NN, and the hybrid centroid-based classification model by combining k-means and k-NN. © 2009 Elsevier Ltd. All rights reserved.

1. Introduction The Internet has become a part of our daily life today. It is now an essential tool in the world and aids people in many areas such as business, biology, education, etc. Business in particular uses the Internet as an important component in their business models [1]. Not only businesses use the Internet in their daily operations and communications with their customers, but also customers use the Internet applications such as e-mail and websites to do some business activities. Therefore, information security needs to be concerned over the Internet environment. It is common to find that the Internet environment has some risks of attacking. In order to prevent the attacks, many systems are designed to thwart Internet-based attacks. Intrusion detection is one of the major information security research problems in the Internetbased attacks. The goal of an Intrusion Detection System (IDS) is to provide a layer of defense against malicious uses of computer systems by sensing a misuse or a breach of a security policy and alerting operators to an ongoing attack. An IDS is used to detect all types of malicious network traffic and computer usage that cannot be detected by a

∗ Corresponding author. Tel.: +886 3 422 7151; fax: +886 3 4254604. E-mail address: cftsai@mgt.ncu.edu.tw (C.-F. Tsai). 0031-3203/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2009.05.017

conventional firewall. Intrusion detection is based on the assumption that the behavior of the intruder differs from that of a legitimate user in ways that can be qualified [2]. In addition, an IDS is able to resist external attacks. Existing IDSs can be divided into two categories according to the detection approaches: anomaly detection and misuse detection or signature detection [3]. Anomaly detection tries to determine whether deviation from the established normal usage patterns can be flagged as intrusions. On the other hand, misuse detection uses patterns of well-known attacks or weak spots of the system to identify intrusions. These systems seem to be effective and efficient. However, there are still many drawbacks to these Internet-based IDSs. The major problem is that they fail to generalize to detect new attacks without known signatures [4]. Besides, a potential drawback of these techniques is the rate of false alarms. This can happen primarily because the previously unseen system's behaviors may also be recognized as anomalies, and hence flagged as potential intrusions [5]. In addition, Shon et al. [6] indicate that these systems become a single point of failure. If the deployed IDS system is disabled for any reason, then it often gives the attacker the time to compromise the systems and possibly gain a foothold in the network. In order to solve the problems mentioned above, numbers of anomaly detection systems are developed based on machine learning techniques. These systems use a “normal behavior” to detect those unexpected attacks. In particular, supervised and unsupervised

DT a b Dataset DARPA 1998 DARPA 1998 KDD-Cup '99 DARPA 1998 KDD-Cup '99 DARPA 1998 KDD-Cup '99 KDD-Cup '99 DARPA 1999 DARPA 1998 KDD-Cup '99 DARPA 1998 KDD-Cup '99 DARPA 1998 KDD-Cup '99 KDD-Cup '99 KDD-Cup '99 Problem domain Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly Anomaly detection detection detection detection detection and misuse and misuse and misuse detection detection detection detection detection detection and misuse detection detection Evaluation method DR . baseline classifiers for comparisons. [16] Peddabachigari et al. FP DR. which is a combination of the data vector − → dj belonging to that class. detection rate (DR). Section 4 presents the experimental setup and results. much related work for intrusion detection uses the datasets of DARPA1998 and KDD-Cup '99 for experiments. [7] Li and Guo [14] Liu et al. Regarding Table 1. Tsai. the classification technique can be used as the first component and the clustering technique for the second one. Centroid-based classification is one specific hybrid learning approach [10]. and its output is subsequently used as the input for the cluster to improve the clustering result. there is little computation is involved. FP Accuracy c d Baseline FL ANN. the data which cannot be clustered accurately can be regarded as noisy data or outliers. In this paper. anomaly detection is the mostly considered research problem in literature. Therefore. Literature review 2.229 223 machine learning techniques have both been considered for anomaly detection (c. FA FPf . [12] Kayacik et al. In particular. h i j SVM: support vector machines. Conclusion and future work are provided in Section 5. Centroid-based models find a representative data point for a particular class. That is. That is. SOM GA SVM. ANN detection detection detection detection GA: genetic algorithm. rather than the number of the original training data. datasets used. Section 2. DT SVM ANN. Therefore. DR FP. FA. . 2. we use DCj (1 Յ j Յ m) to represent the set of data vectors belonging to class Cj .-F. Section 3 introduces the proposed TANN approach. FP DR. a triangle area by two cluster centers obtained by k-means and one data (di ) from D is calculated to form a new feature for the data (di ). [9] Peddabachigari et al. . FN. dn }. DTk . ANN SVM. LR: logistic regression. [6] Wang et al.C. [13] Khan et al. accuracy TPj . k-NN SVM. evaluation methods considered. [19] Mukkamala et al. false negative (FN). For the problem domain. Finally. FN DR. Second. On the other hand. . . Related work Table 1 compares a number of recent related work in terms of their detection techniques developed. k-NN. This paper is organized as follows. . k l DT: decision trees. [8] Liu et al.2. Given a dataset D containing m examples (D = d1 .2). classified along a set C of m classes. This is because clustering is the unsupervised learning technique and it cannot distinguish data accurately like supervised one. there are several issues which can be discussed. FP Accuracy DR. which is called the centroid. the k-nearest neighbor (k-NN) classifier is used to measure the similarity of attacks based on the feature of triangle areas. we propose a novel method based on the idea of the triangle area based nearest neighbors (TANN) by combining unsupervised and supervised learning techniques to detect attacks. dm ). 2. FL: fuzzy logic. the KDD-Cup '99 dataset is the mostly used. Hybrid machine learning In general. [20] Zhang and Shen [21] Zhang et al. . FNg FP. the first clustering technique can be used to filter out unrepresentative data as performing the data reduction (or outlier detection) task. a classifier can be trained at first. FP. Lin / Pattern Recognition 43 (2010) 222 -. centroid-based models are very efficient during the training and classification stages as they are based on the number of centroids. Work Abadeh et al.1.-Y. FP DR. false alarm (FA). [22] a b c Technique GA +FL GA+ANNe SOM SOMi +SVM TCM k-NN SOM+ANN Genetic fuzzy classifier DT+SVM GA+SVM GA+ANN/k-NN/SVM Bayesian latent class SVM. k-NN SVM SVM. FN DR. FA DR. f FP: false positive. ANN SVM. Given a set of − → − → n data vectors D = {d1 . FP. . [17] Shon and Moon [1] Shon et al. For classification. support vector Table 1 Comparisons of related work. C = {C1 . ANN. TP: true positive.f. one-class SVM C-means clustering+ANN Nearest neighbor clustering+GA SVM. evaluation methods considered. ANN Ensemble of SVM/ANN Robust SVM. and accuracy are examined mostly. . DR: detection rate. 2 layer ANN SVMh . the representative data without the noisy data are used to train the classifier in order to improve the classification result. FP DR Accuracy DR. [18] Chen et al. true positive (TP). k-NN ANN GA SVM. The centroid of a particular class Cj is − → represented by a vector Cj . [11] Chen et al. Section 2 briefly describes the concept of hybrid learning techniques and a number of related studies are compared in terms of their techniques developed. . false positive (FP). For the evaluation methods. In particular. SOM: self-organizing maps. etc. a hybrid machine learning model is based on combining the clustering and classification techniques. d FA: false alarm. DT. g FN: false negative. Cm }. SVM LRl SVM. Then. . [15] Ozyer et al. . Related work developing the hybrid model is usually based on one clustering technique used as the first component for “pre-classification” and one classification technique as the second component for the final classification task [7–9]. e ANN: artificial neural networks. datasets used. . . and baseline models compared. C.

In particular. which can be obtained by the distance between cluster centers and data points. An example of forming triangle areas. T. .g. if the chosen dataset contains n-dimensional features for each data. . . where m is the total number of the data samples). Then.229 machines and k-NN are two of the most popular baseline classifiers for comparisons. 2. in the feature space. We define the data point Xi (i = 1. b.f. which includes one type of normal access and four types of Internet attacks. 1 shows the process of this stage. XiAC. [7. The triangle area based nearest neighbors approach The proposed approach (TANN) is composed of three stages. B and Xi.2.e. and k-NN training and testing based on the new data. 5). three points are selected in order to form a triangle area. . the k-means algorithm is used as the clustering method to find out five cluster centers of each category as the representative data points over the dataset. XiAD. . respectively). an ) and B=(b1 . and A and Xi. 3. Therefore. 3 shows the steps of forming the new data.1. Fig. m.e. b3 . two cluster centers obtained by k-means and one data point from the dataset are used to form a triangle area. the distances between A and B. New data formation by triangle areas To calculate a triangle area in the feature space. can be defined as dis AB = = (a1 − b1 )2 + (a2 − b2 )2 + · · · + (an − bn )2 (ai − bi )2 (1) The perimeter of the triangle is defined as G = a+b+c. bn ) in the n-feature space (see Fig. D. five cluster centers are extracted. ten triangle areas are obtained to form a new feature vector for the data point (Xi).) 3. Heron's formula states that the area. Lin / Pattern Recognition 43 (2010) 222 -.1. three data points need to be provided. XiAB. and XiDE. Then. and c = AXi (i. of a triangle whose sides have lengths a. the triangle area by Heron's formula can be calculated (see Fig.8.1). . Xi). 1. and E) and one data point (Xi). XiBC. XiAE. XiBE. where a=AB. Tsai. and the followings describe these steps. and c is T= S(S − a)(S − b)(S − c) (2) KDD-Cup 99 dataset k-means clustering five cluster centers Fig. (See Section 3 for the detailed description of this approach. The first one is based on one of the original data points in the training data (i.1. they always focus on solving the classification/detection problem over the original feature space. That is. New training data formation. the distance between an unknown data and its nearest centroid and other distances between this unknown data and other centroids can be all considered for classification.2. XiBD. the triangle area based features are regarded as the new feature space and they are used for the final classification decision. it is common to combine two techniques. . b = BXi . XiCE. . C. Forming the triangle area—Heron's formula After obtaining the perimeter of the triangle for each data point corresponding to two cluster centers. the classifiers they train and test are based on these features. which are clustering centers extraction. The Euclidean distance between points A=(a1 . Specifically. 3. Therefore. . 4). especially for combining some clustering technique as the first component and classification for the second one (e. the Euclidean distance formula is used to measure the distance between two points. A B E C Xi D Fig. 2 shows an example of the five cluster centers (A. an unknown data with any two centroids can result in a triangle area. respectively. the value of k is set by 5. . using the triangle area as the feature space for classification is the novelty of this paper. . Perimeter of the triangle—Euclidean distance First of all. 3. a3 . The other two are from the five cluster centers generated by k-means in the first stage. That is. . In related work. Section 4.-Y. which belongs to one of the hybrid approaches described above.2. the new data formation by the triangle area. The idea behind TANN is to extend the centroid-based and nearest neighbor classification approaches. a2 .2. our proposed approach is to transform the original multidimensional feature space into triangle areas.-F. C. In this stage. Note that as the KDD-Cup '99 dataset is used in this paper (c. XiCD. . However. B. Fig. That is. Cluster centers extraction.224 C. KDD-Cup 99 dataset five cluster centers Calculating the perimeter of the triangle by Euclidean Distance Calculating the area by Heron’s Formula New data Fig.15]). Extraction of cluster centers First of all. 3. thus TANN is proposed and intended to be able to improve classification performances over the centroid-based and nearest neighbor approaches. No matter what kind of approaches proposed in related work (including hybrid approaches). Fig. b2 . Subsequently. 3. we assume that all the centroids over a given dataset have their discrimination capabilities for distinguishing both similar and dissimilar classes.

Xi C3 C5 .. each pattern represents a network connection represented by a 41-dimensional feature vector. 3. T1 –T10 for each Xi are used to form the new data. port scanning. e.3. C4 . e. an) B (b1. where S is the semiperimeter of the triangle S = (a + b + c)/ 2 (3) 3.sigkdd. It should be noted that much related work described in Section 2. 10 triangles. b2. 10-fold cross-validation is used to avoid the variability of the samples that affects the performance of model training and testing. a2.3. Lin / Pattern Recognition 43 (2010) 222 -. the new data composed of triangle area based features are divided into training and testing sets based on n-fold cross-validation (c. e.org/kddcup/index. .1. Experiments 4.f. e. Nine of the 10 subsets are used for training and the remainder is served as the testing subset.1. 4. Xi C2 C4 . bn) Training data New data E Testing data C Xi D Training k-NN Fig. 6 shows the process of training and testing k-NN.1. Training and testing k-NN. “10% of KDDCup '99”. 6.1.2. Xi C3 C4 . Xi C1 C3 . guessing password. we define the sum total as 4 3 Xi . 5.C. the original data point Xi which is represented by f1 . Section 2. in which nine features are of the intrinsic types. they do not further consider another dataset for validation. C 5 = n=1 2 Xi Cn Cn+1 + n=1 Xi Cn Cn+2 (4) + n=1 Xi Cn Cn+3 + Xi C1 C5 That is. 13 features are of the content type. k-NN classifier Fig. various “buffer overflow” attacks. 1 http://www. and the remaining 19 features are of the traffic type.php?section = 1999&method = data . . the whole dataset is divided into 10 unduplicated subsets. the new data are used to train and test the k-NN classifier. Xi C2 C5 .1). Xi with two out of the five centers (C1 . fn as the n-dimensional features and oi as the output label of Xi are then transformed to Xi C1 C2 . respectively. The Euclidean distance between A and B. there are 10 classification results by 10-fold 4.-Y. • Remote to Local (R2L): unauthorized access to local super-user (root) privileges. and C5 ) generated by k-means (k = 5) form a triangle. Then. …. the distribution of normal and attack types of the connection records for classifier training and testing is summarized in Table 2. C. C 1 . . and “Corrected (Test)” of the KDD-Cup '99 dataset are used as the training/testing and validation sets. the dataset used in this paper is based on the KDD-Cup '99 dataset1 which has been considered mostly in related work (c. . In this process. Training and testing k-NN Finally. • User to Root (U2R): surveillance and other probing. The dataset Since there is no standard dataset for intrusion detection. respectively. Xi C2 C3 .g. f2 . which are normal traffic and four different classes of attacks as follows: • Probing: an attacker scans a network of computers to find known vulnerabilities. That is. Calculation of a triangle area. Each pattern of the dataset is labeled as belonging to one out of five classes. …. . C3 . Therefore.2).g.2 only uses one specific KDD-Cup '99 dataset for classifier training and testing where the testing result is regarded as the final classification performance. Section 4. • Denial of Service (DoS): unauthorized access from a remote machine. Xi C1 C4 . During the training and testing stages. port scanning. Xi C1 C5 . For the “10% of KDD-Cup '99” dataset. and Xi C4 C5 as the 10 triangle areas to represent the new feature of Xi. Therefore. .g. Formation of new data At first.. Experimental setup c Fig.g. in which the training and testing sets are used to train and test the k-NN classifier. C 2 . Fig. A B a E T C b Xi D 4. In order to reasonably assess the proposed approach.f.-F. In these two datasets. . C2 . In 10-fold cross-validation. Tsai.229 225 A (a1. .

4. is further used in order to validate the developed classifiers.98% average accuracy.73 100 their individual feature settings. In addition. 4. SVM Similar to k-NN. In order to reduce the problem that high dimensional data may affect (or degrade) detection performances. “num_failed_logins”.028 4. Table 3 shows the dataset information.4. In short. 4.1. the number of malicious executables falsely classified as benign. There are six features extracted. “num_shells”. Dimensionality reduction In general. In particular. SVM (degree = 2) performs the best. Particularly. features). to evaluate the proposed method.69 0. this paper further uses receiver operating characteristics (ROC) curves to compare the performances of the created classifiers. Section 2. . .e. In addition. Tables 5 and 6 show the classification results for each of the five classes and the binary classes (normal and attacks). the k-NN and SVM classifiers are compared. such as [15. which have similar characteristics to TANN.593 4166 231. In addition. the number of malicious executables correctly classified as malicious. Then. As a result.3.1. Baseline classifiers Regarding Table 1 (c.229 Table 4 Confusion matrix. “urgent”.f. Table 1). Samples percentage (%) 19. The ROC curve is a way of visualizing the trade-offs detection and false positive rates. Then. two other classifiers. SVM. i. k-NN and the centroid-based classifier.e. Evaluation methods The performance measures for intrusion detection can be calculated by a confusion matrix as shown in Table 4. • True positives. for k-NN. the diagonal line y = x represents the strategy of randomly guessing a class.2. 4.2. and false alarm of k-NN (k = 21) are 93. the polynomial kernel function is used where the polynomial degree is set from 1 to 5 to obtain the SVM classifier providing the best performance for comparisons.3.029 Samples percentage (%) 19. which is less considered in related work.4 1. k-NN is also considered for comparisons since it can be conveniently used as a benchmark for all the other classifiers [24].-Y. . On the other hand.23 100 Actual Predicted Normal Normal Intrusions (attacks) TN FN Intrusions (attacks) FP TP Table 3 Sample distributions of the validation dataset. 10 studies out of 17. of samples 97. they consider different input variables (i. In other words. Lin / Pattern Recognition 43 (2010) 222 -.01 0. “is_host_login”. detection and false alarm which are also examined in related work (c.33 74.97% detection rate.2. Class Normal Probe DoS U2R R2L No.16]. Experimental results 4.e.69%. the five cluster centers are used as the training data for k-NN (k = 1) classification rather than using the whole original training dataset to construct a k-NN classifier. It provides 94.020 C. five centers in this case). the 98. Then.87%. Class Normal Probe DoS U2R R2L No. Note that it is difficult to fairly compare different approaches proposed recently as shown in Table 1. k = 1. a validation dataset. i.2).39%.25) is examined in order to find out the optimal or best k-NN over the dataset. As there is no answer to which features of the KDD-Cup '99 dataset are more representative and no further funding for the good baseline(s) for comparisons. respectively. respectively. • False positives. 93. the rates of average accuracy. are also compared. much related work uses SVM as the baseline classifier. and “num_outbound_cmds”.226 Table 2 Sample distributions of the dataset.5 to extract important features from the chosen dataset.2. C.e. respectively.458 52 1126 494.455 88 14. Any classifier that appears in the upper left triangle performs better than random guessing [25]. • False negative. detection. a number of different k values (i. In the ROC curve. for SVM.1. . “Corrected (Test)”. Tsai.727 311. of samples 60. In addition. • True negatives (TN). The k-means algorithm is used to obtain the cluster centers (i. classification accuracy can be averaged by the 10 classification results. which may result in different classification results.277 4107 391. . the centroid-based classifier is also considered which is based on combining k-means and k-NN since the idea of TANN is derived from these two classification techniques.e. which are “land”.83 79. Tables 7 and 8 show the classification results of SVM for each of the five classes and the binary classes (normal and attacks). the number of benign programs correctly classified as benign. we consider the most widely used baseline. 11 studies out of 18 use SVM as the baseline (c. i.2) can be obtained by Accuracy = TP + TN TP + TN + FP + FN TP TP + FP (5) (6) (7) cross-validation. these approaches may perform well over Detection rate = False alarm = FP FP + TN Besides the three evaluation measures. these selected features for each data point of the dataset are used to create the baseline and TANN classifiers.e. the rate of accuracy.1. Therefore. the dataset for intrusion detection contains a very large and high dimensional data. the number of benign programs falsely classified as malicious. and 28.-F.f.2. Section 2. 4. we find out the best value for k is 21. and the 4. we set the factor loadings equal to or greater than 0.02% false alarm rate.4 0. k-NN After testing a number of different k values.f. This is because although much related work uses the same dataset.5. principal component analysis (PCA) is used to extract representative features.24 0. PCA has been widely used in the domain of intrusion detection [23].

Finally. the major task of the intrusion detection system is to filter out potential attacks and allow normal connection to access.84 Normal Intrusions (attacks) Accuracy (%) Table 10 Classification results of k-means+k-NN for the normal and attack classes. Actual Predicted Normal Normal Probe DoS U2R R2L FP (%) 6936 23 216 0 0 96. C.75 Probe 366 397 4360 0 7 7.74 DoS 11 3 32.00 78. Combining k-means and k-NN Table 9 and 10 show the classification results of the centroidbased classifier for each of the five classes and the binary classes (normal and attacks).587 The accuracy rate of TANN is 99.5.4.09 97. That is. Actual Predicted Normal Normal Probe DoS U2R R2L FP (%) 9350 11 2076 0 0 81.88%).12 96.00 80.90 R2L 0 0 34 0 89 72.27%) than k-NN (93.31 93.59 82. 8 shows an ROC curve depicting the relationship between false positive and detection rates. FA. FP.97%). detection. . TN. 16 and 17.01%.538 2 2 99.01%.92 Probe 278 390 3196 0 6 10. Tables 11 and 12 show the classification results of TANN for each of the five classes and the binary classes (normal and attacks).01%). Lin / Pattern Recognition 43 (2010) 222 -.67 Probe 2378 384 1324 0 9 9. FN.475 Table 8 Classification results of SVM for the normal and attack classes. 7 further examines the average accuracy of each class including one normal class and four different types of attacks. Tsai.601 2 1 99. detection. Besides. and accuracy performances over the testing and validation datasets as well as the accuracy of detecting four different types of attacks. TANN performs the best over the three baseline classifiers.87 40. the centroid-based classifier (95. Actual Predicted Normal Normal Intrusions (attacks) 9336 2088 Intrusions (attacks) 391 37. respectively. The rates of average accuracy. and false alarm rates which are shown in Tables 15.38 DoS 393 4 37. Further comparisons This section compares these classifiers in terms of their TP. respectively.87%). Finally.2. Actual Predicted Normal Normal Probe DoS U2R R2L FP (%) 9336 12 2076 0 0 81.72 Probe 378 397 4464 0 7 7.-F. 99. Tables 13 and 14 show the performances of these classifiers over the testing and validation datasets. and 3.00 85.85 60.76 Accuracy (%) Table 12 Classification results of TANN for the normal and attack classes. a high Table 9 Classification results of k-means+k-NN for the five classes.590 Table 11 Classification results of TANN for the five classes. it provides the highest accuracy and detection rates and the lowest false alarm rate. Actual Predicted Normal Normal Intrusions (attacks) 6936 239 Intrusions (attacks) 2791 39. and the centroid-based classifier (99.07 R2L 0 0 37 0 91 71. Actual Predicted Normal 9350 2087 227 Intrusions (attacks) 377 37.12 60.229 Table 5 Classification results of k-NN (k = 21) for the five classes. TANN also performs the best (2.63 96.98 96.08 DoS 13 6 35.2. In addition.69%).3.01% which outperforms the baselines of k-NN (93. SVM (4.95 U2R 20 0 69 2 6 2.43 95.-Y.99%) over k-NN (28.26 Accuracy (%) 4.65 71.02%) and the centroid-based classifier (3. the t test is used to examine the level of significance of these classifiers in terms of the accuracy. different k values of k-NN were also examined and we found out that when k = 17.2. respectively.433 2 1 99.01%). TANN performs the best.57 DoS 13 2 32. respectively. In other words.59 83. respectively. Actual Predicted Normal Normal Probe DoS U2R R2L FP (%) 9436 15 185 0 0 97. Fig. for the false alarm rate.39%).C. The comparative results allow us to see the performance of each classifier and identify the best one for intrusion detection. Based on the testing and validation datasets. TANN also provides the higher detection rate (99. In this case. The result indicates that TANN outperforms k-NN.06 R2L 0 0 9 1 97 90. Actual Predicted Normal Normal Intrusions (attacks) 9436 200 Intrusions (attacks) 291 39.53 Accuracy (%) Table 6 Classification results of k-NN (k = 21) for the normal and attack classes.436 Table 7 Classification results of SVM for the five classes. and the centroid-based classifier. 4. It should be noted here that although TANN cannot effectively detect each of the four types of attacks compared with the three baseline classifiers. That is. the rate of detecting whether the connection is the normal access or attacks should be as high as possible at the first line of security. SVM (94.94 U2R 0 0 127 3 15 2. SVM (98. DR.01 94. 4.36 95. TANN For the proposed approach.85 R2L 0 0 29 0 90 75.94 60. TANN performs the best.98%).89 90. SVM. The result can be used to examine whether the classification performances of these classifiers are significantly different.00 78.88%. and false alarm of this classifier are 95.94 U2R 0 0 139 3 16 1.528 2 1 98. Fig.94 U2R 0 0 143 3 16 1.

785 True negatives (TN) 8143 10.94% 83.000 (t = 17.587 37.272 (t = 1. the triangle area can be calculated by two out of the cluster centers and one data point in the dataset. p < 0. SVM.475) 0. TANN k-means+k-NN TANN k-means+k-NN SVM SVM k-NN Normal 0% 20% 40% 60% 80% 100% 0.79 3.436 37.01% 96. which are anomaly detection and misuse detection.98 93. The result indicates that TANN has a significant level of difference with k-NN.929 (t = −0. the performances of these classifiers would provide very similar results to this paper.094 (t = 1. a reliable conclusion can be made that TANN significantly perform better than k-NN.00% Table 15 Result of t test (accuracy). C.779) 0.59% 96.83 Accuracy (%) 88. True positive (TP) k-NN SVM k-means+k-NN TANN 39.00% 60. SVM. Conclusion TANN k-NN SVM k-means + k-NN Detection Rate (%) level of significant difference (i.516 37. Therefore.05) over accuracy.918) Fig.00% 90.228 Table 13 Comparisons of these models by the testing dataset.000 (t = −5.-F.780 35.55 96. As a result.163) SVM 0.31% Table 16 Result of t test (detection rate).-Y. In particular.000 (t = 14. . Tsai.372) 0.000 (t = −30.366) 0.987) 0. ROC curves of these models.89% 96.769) 0. 8.094) R2L U2R 40. namely TANN.092 False positives (FP) 4995 693 488 402 False negatives (FN) 484 2775 2696 1123 Detection rate (DR) (%) 87.095 (t = 1.98% 71. the detection rate. TANN TANN k-means+k-NN SVM k-means+k-NN 0.229 True negatives (TN) 6936 9336 9350 9436 False positives (FP) 2791 391 377 291 False negatives (FN) 239 2008 2087 200 Detection rate (DR) (%) 93.43% 97.97 99.91 92.76% 85.01 99.001 (t = −4.02 3.937) 0.000 (t = 8. (p < 0.12% 95. Intrusion Detection Systems (IDSs) are an integral package in any well configured and managed computer system or network.243) k-NN 0.01 Table 14 Comparisons of these models by the validation dataset.02 6. Accuracy of each attack class.88 2.08 98.813 (t = 0. Two main approaches to intrusion detection are currently used. True positive (TP) k-NN SVM k-means+k-NN TANN 35.68 98.27 False alarm (FA) (%) 28. we propose a novel hybrid method based on a triangle area based nearest neighbors approach. Then.05) between two classifiers implies that when using other datasets.000 (t = 6.853) 0.426 36. respectively A modern computer network should acquire mechanisms to ensure the security policy of data and equipments inside the network. and the centroid-based classifier over these three evaluation measures. In literature.00% 60.e.000 (t = 10. k-means is firstly used to extract a number of cluster centers where each cluster center represents one particular type of attacks.200 (t = −1.53% 78.59% 93.000 (t = 8.87% 94.99 Accuracy (%) 93. 7.988) SVM 0.69 4.26% 78.508 9702 10. and false alarm.14 95.87 95. and the centroid-based classifier.19 4.171) DoS Probe Table 17 Result of t test (false alarm).01 99.010 (t = 3. Lin / Pattern Recognition 43 (2010) 222 -.84% 60. TANN TANN k-means+k-NN SVM k-means+k-NN 0.590 39.847) 0.491) 0.550) 0.95 False alarm (FA) (%) 38. machine learning techniques (supervised and/or unsupervised learning) have been considered for intrusion detection.092) k-NN 0. to detect attacks. 100 80 60 40 20 0 False Positive (%) Fig.85% 95.000 (t = −34.475 C.39 98.91 TANN k-means + k-NN SVM k-NN 80. In this paper.000 (t = −31. especially for anomaly detection.12% 82. 5.75 98.

Zhang. Nur. Kamel.-P. J. Z.K. A. Lee. References [1] T. Intrusion detection systems using decision trees and support vector machines. K.-M. J. TANN performs better than k-NN. Abraham. Computer Networks 51 (2007) 3448–3470. Information Processing & Management. Knowledge-Based Systems. Pattern Recognition 37 (2004) 927–942. [25] T. Yi. B. Therefore. Network and Computer Applications 28 (2005) 167–182.D. An active learning based TCM-KNN algorithm for supervised network intrusion detection. [11] M. Duin. in 2008. About the Author—CHIH-FONG TSAI obtained a Ph. Neurocomputing. His current research focuses on multimedia information retrieval and data mining applications. By using the KDD-Cup '99 dataset with 10-fold cross-validation. Abraham. [10] A. The VLDB Journal 16 (2007) 507–521. false alarm. Kumar. A. Alhajj. J. pp. J.H.-H. [5] P. Sergi. . If the problem contains a larger number of classes. for classification. Journal of Systems and Software. Stolfo. A latent class modeling approach to detect network intrusion. the dimensionality of the new features as the number of triangle areas for each data will become very large.-N. Semi-supervised single-label text categorization using centroid-based classifiers. V. X. for the 5-class classification problem in this paper. 2006.S. Computer and Security 26 (2007) 459–467. G. Fawcett. Application of online-training SVMs for real-time intrusion detection with different considerations. in: Proceedings of the ACM Symposium on Applied Computing. Kim. Engineering Applications of Artificial Intelligence 20 (8) (2007) 1058–1069. [20] S.-H. 844–851. L. Park. Ozyer. IEEE Transitions on Pattern Analysis and Machine Intelligence 22 (1) (2000) 4–37. Chan. Zhang. Zhang. Pattern Recognition Letters 27 (8) (2006) 861–874. at School of Computing and Technology from the University of Sunderland. [18] Y. some issues could be considered. [16] T. Eskin. USA. Peddabachigari.229 229 these triangle areas represent a new feature for measuring similar attacks. it would be interesting to examine the performance of TANN over the datasets which contain different numbers of classes. S. the stability and scalability of TANN for different numbers of classes for classification need to be examined in the future. in: Proceedings of NSF Workshop on Next Generation Data Mining. The t test also shows that the experimental results of these models have a high level of significant difference.P. Cardoso-Cachopo. He has published over 20 refereed journal papers including ACM Transactions on Information Systems. Oliveira. For example. Lazarevic. there are 10 triangle areas as the new features for each data. Abraham. Patcha. second ed.G. [24] A. 89–100. I. say 20.-F. Applying genetic algorithm for classifying anomalous TCP/IP packets. Seoul. Expert Systems with Applications. Yang. Stallings. Hsu. Shon. Kayacik. About the Author—Miss CHIA-YING LIN received the master degree in the Accounting and Information Technology Department of National Chung Cheng University. Cryptography and Network Security Principles and Practices. [3] S. different clustering and classification techniques can be applied during the cluster center extraction and triangle area classification stages. Ertoz. Jain. Wang. [21] Z.-Y. [13] H. Intrusion detection by integrating boosting genetic fuzzy classifier and data mining criteria for rule pre-screening. C. P. Sung. Abadeh. L. [6] T. Real time data mining-based intrusion detection. cluster and/or classifier ensemble can be used in these two stages. Hershkop. Liu. UK in 2005 for the thesis entitled “Automatically Annotating Images with Keywords”. This may cause the “curse of the dimensionality” problem. [9] Y. the k-NN classifier is used based on the feature of triangle areas to detect intrusions. Yang. Khan. Engineering Applications of Artificial Intelligence 20 (2007) 439–451. Lin / Pattern Recognition 43 (2010) 222 -. Abraham. USA. [23] A. Chen. and the centroid-based classifier in terms of average accuracy. SVM. National Central University. M. A. 2000. Expert Systems. Network Intrusion Detection: An Analyst's Handbook. Liao. [4] W. M. Tsai. Guo.J. M. J. Journal of Network and Computer Applications 30 (2007) 114–132. Jiang. Moon. Liu. Heywood. S. An introduction to ROC analysis. Intrusion detection using hierarchical neural network. International Journal of Intelligent Systems 22 (2007) 337–352. E. Dokas. Srivastava. S. [7] L. [15] G. Hybrid flexible neural-tree-based intrusion detection systems. H. March 11–15. A genetic clustering method for intrusion detection. Her current research interest focuses on data mining and intrusion detection. Thomas. A. Neurocomputing 69 (2006) 2429–2433. C. Barker. and the ROC curve. A hierarchical intrusion detection model based on the PCA neural networks. J. 2007. An overview of anomaly detection techniques: existing solution and latest technological trends. For future work. in: Proceedings of the DARPA Information Survivability Conference & Exposition II. Fan.C. [8] C. J. A new intrusion detection system using support vector machines and hierarchical clustering. M. [14] Y. That is. Computer Communications 30 (2006) 93–100. 2002. [12] Y. Thuraisingham. Z. [19] W. He is now an Assistant Professor at the Department of Information Management. Li. Modeling intrusion detection system using hybrid intelligent systems. the detection rate. Shen. Korea. then. International Journal on Artificial Intelligence Tools. Shen. W. W. Prentice-Hall. Taiwan. New Riders Publishing. S.W. Shon. S. June 12–14. A. X. A parallel genetic local search algorithm for intrusion detection in computer networks. Kovah. Chen. P. Tan. Statistical pattern recognition: a review. A hybrid machine learning approach to network anomaly detection. Chen. Application of SVM and ANN for intrusion detection. Neurocomputing 70 (2007) 1561–1568. A hierarchical SOM-based intrusion detection system. Barzegar. Thomas. Awad. Miller. K. J. Ho. International Journal of Applied Science and Computations 11 (3) (2004) 118–134. etc. Journal of Network and Computer Applications 30 (2007) 99–113. Zhang. Mao. Pattern Recognition Letters 26 (2005) 779–791. Finally.K. First. pp. Mukkamala. Computer Communications 28 (2005) 1428–1442. Anaheim. Intrusion detection using an ensemble of intelligent paradigms. Taiwan. Grosan. M. H. [2] W. J. R. B. J.-H. Z. 2001. A. Northcutt. [17] S. respectively. A. [22] S. R. Data mining for network intrusion detection. Online Information Review. Mbateng. Novak.I. Information Sciences 177 (2007) 3799–3821. Secondly. Peddabachigari.-Y. Computer and Operations Research 32 (2005) 2617–2634. Moon. J. Habibi.