
Kdd99

The KDD Cup '99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by Lincoln Lab under contract to DARPA [Lippmann et al]. Since one cannot know the intention (benign or malicious) of every connection on a real-world network (if we could, we would not need research in intrusion detection), the artificial data was generated using a closed network, some proprietary network traffic generators, and hand-injected attacks. It was intended to simulate the traffic seen in a medium-sized US Air Force base (and was created in collaboration with the AFRL in Rome, NY, which could be characterized as a medium-sized US Air Force base).

Based on the published description of how the data was generated, McHugh published a fairly harsh criticism of the dataset. Among the issues raised, the most important seemed to be that no validation was ever performed to show that the DARPA dataset actually looked like real network traffic. Indeed, even a cursory examination of the data showed that the data rates were far below what would be experienced in a real medium-sized network. Nevertheless, IDS researchers continued to use the dataset (and the KDD Cup dataset that was derived from it) for lack of anything better.

In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that, due to the way the data was generated, all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254. This served to demonstrate to most people in the network security research community that the DARPA dataset (and by extension, the KDD Cup '99 dataset) was fundamentally broken, and that one could not draw any conclusions from any experiments run using them. Numerous researchers indicated to us (in personal conversations) that if they were reviewing a paper based solely on the DARPA dataset, they would reject it solely on that basis.
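The TTL irregularity reported by Mahoney and Chan can be illustrated with a small sketch. The packet records below are hand-made and hypothetical (they are not taken from the DARPA data), and `ttl_rule` and `evaluate` are names invented for this illustration. The point is only that a "detector" that looks at nothing but the TTL separates the classes when the labels correlate with TTL in the way described above.

```python
from collections import Counter

# Hypothetical, hand-made (ttl, label) records; NOT taken from the DARPA data.
packets = [
    (127, "benign"), (254, "benign"), (127, "benign"), (254, "benign"),
    (126, "attack"), (253, "attack"), (126, "attack"), (64, "benign"),
]

# TTL values the text above attributes to malicious packets.
ATTACK_TTLS = {126, 253}


def ttl_rule(ttl):
    """Label a packet using nothing but its TTL field."""
    return "attack" if ttl in ATTACK_TTLS else "benign"


def evaluate(records):
    """Count (true label, predicted label) pairs for the TTL rule."""
    outcome = Counter()
    for ttl, label in records:
        outcome[(label, ttl_rule(ttl))] += 1
    return outcome


results = evaluate(packets)
correct = sum(n for (label, pred), n in results.items() if label == pred)
accuracy = correct / len(packets)
print(results)
print(f"accuracy = {accuracy:.2f}")
```

A rule like this learns nothing about attacks; it only picks up a side effect of how the traffic was generated, which is why results obtained on such data are hard to interpret.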
Indeed, at the time we were conducting our own assessment of the DARPA dataset, using Snort [Caswell and Roesch]. Trivial detection using the TTL aside, we found that it was still useful to evaluate the true positive performance of a network IDS; however, any false positive results were meaningless [Brugger and Chow]. Anonymous reviewers at respectable information security conferences were unimpressed; one noted, "is there any interest to study the capacities of SNORT on such data?". A reviewer from another conference summarized their review with "The content of the paper is really out of date. If this paper appears five years ago, there is some value, but not much now."

While the DARPA (and KDD Cup '99) dataset has fallen from grace in the network security community, we still see it widely used in the greater KDD community. Examples in the past couple of years include [Kayacik et al.], [Sarasamma et al.], [Gao et al.], [Chan et al.], and [Zhang et al.]. While this sample doesn't necessarily represent the top-tier journals and conferences in the KDD community, these are, to the best of our knowledge, respectable, peer-reviewed publications. Obviously, the knowledge discovery researchers are well intentioned by wanting to show the usefulness of every technique imaginable to the network intrusion detection domain. Unfortunately, due to the problems with the dataset, such conclusions cannot be drawn. As a result, we strongly recommend that (1) all researchers stop using the KDD Cup '99 dataset, (2) the KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and (3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.

This paper appears in: Information Fusion, 2008 11th International Conference on
Issue Date: June 30 2008-July 3 2008
On page(s): 1 - 7
Location: Cologne
Print ISBN: 978-3-8007-3092-6
References Cited: 30
INSPEC Accession Number: 10365903
Date of Current Version: 26 September 2008

Abstract
Various intrusion detection systems reported in literature have shown distinct preferences for detecting a certain class of attacks with improved accuracy, while performing moderately on the other classes. With the advances in sensor fusion, it has become possible to obtain a more reliable and accurate decision for a wider class of attacks, by combining the decisions of multiple intrusion detection systems. In this paper, an architecture using data-dependent decision fusion is proposed. The method gathers an in-depth understanding about the input traffic and also the behavior of the individual intrusion detection systems by means of a neural network supervised learner unit. This information is used to fine-tune the fusion unit, since the fusion depends on the input feature vector. For illustrative purposes, three intrusion detection systems, namely PHAD, ALAD, and Snort, have been considered using the DARPA 1999 dataset in order to validate the proposed architecture. The overall performance of the proposed sensor fusion system shows considerable improvement with respect to the performance of individual intrusion detection systems.

This paper appears in: Parallel and Distributed Systems, IEEE Transactions on
Issue Date: Aug. 2010
Volume: 21 Issue: 8
On page(s): 1143 - 1149
ISSN: 1045-9219
INSPEC Accession Number: 11388501
Digital Object Identifier: 10.1109/TPDS.2009.142
Date of Publication: 28 August 2009
Date of Current Version: 28 June 2010
Sponsored by: IEEE Computer Society

Abstract

Intrusion Detection Systems (IDSs) are a major line of defense for protecting network resources from illegal penetrations. A common approach in intrusion detection models, specifically in anomaly detection models, is to use classifiers as detectors. Selecting the best set of features is central to ensuring the performance, speed of learning, accuracy, and reliability of these detectors as well as to remove noise from the set of features used to construct the classifiers. In most current systems, the features used for training and testing the intrusion detection systems consist of basic information related to the TCP/IP header, with no considerable attention to the features associated with lower level protocol frames. The resulting detectors were efficient and accurate in detecting network attacks at the network and transport layers, but unfortunately, not capable of detecting 802.11-specific attacks such as deauthentication attacks or MAC layer DoS attacks. In this paper, we propose a novel hybrid model that efficiently selects the optimal set of features in order to detect 802.11-specific intrusions. Our model for feature selection uses the information gain ratio measure as a means to compute the relevance of each feature and the k-means classifier to select the optimal set of MAC layer features that can improve the accuracy of intrusion detection systems while reducing the learning time of their learning algorithm. In the experimental section of this paper, we study the impact of the optimization of the feature set for wireless intrusion detection systems on the performance and learning time of different types of classifiers based on neural networks. Experimental results with three types of neural network architectures clearly show that the optimization of a wireless feature set has a significant impact on the efficiency and accuracy of the intrusion detection system.
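The abstract names the information gain ratio as the relevance measure for each feature. As a rough illustration of that measure only (not the authors' hybrid model, whose k-means selection step is not reproduced here), the sketch below computes the gain ratio for discrete features. The feature names `frame_type` and `retry_flag`, the labels, and all values are hypothetical toy data invented for this example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split information."""
    total = len(labels)
    # Group the labels by feature value to get the conditional entropy H(labels | feature).
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)  # entropy of the feature's own value distribution
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: two made-up MAC-layer features and class labels.
labels = ["attack", "attack", "normal", "normal"]
features = {
    "frame_type": ["deauth", "deauth", "data", "data"],  # tracks the label exactly
    "retry_flag": [0, 1, 0, 1],                          # unrelated to the label
}

scores = {name: gain_ratio(column, labels) for name, column in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

On this toy data the feature that tracks the label ranks first and the unrelated one scores zero; how the ranked features are then grouped or thresholded is a separate design step that the abstract assigns to k-means.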