(IJCSIS) International Journal of Computer Science and Information Security,Vol. 10, No. 8, August 2012
A.
KDD Dataset
The KDD Cup 1999 dataset was derived from the1998 DARPA Intrusion detection evaluation programprepared and managed by MIT Lincoln Laboratory. Thedataset was a collection of simulated raw TCP dump dataover a period of nine weeks. There are 4,898,430 labeledand 311,029 unlabeled connection records in the dataset [8].The labeled connection records consist of 41 attributes: 7symbolic and 34 numeric. The complete listing of the set of features in the dataset is given in Table 1.
TABLE I: List of attributes in KDD dataset
B.
Data Transformation
In data transformation, the data aretransformed or consolidated into forms appropriatefor mining. Min-max normalization performs a lineartransformation on the original data. Suppose thatminA and maxA are the minimum and maximumvalues of an attribute A. Min-max normalizationmaps a value, v, of A to v0 in the range [new_minA,new_maxA] by computingMin-max normalization preserves the relationshipsamong the original data values. It will encounter an
“out
-of-
bounds” error if a future input case for
normalization falls outside of the original data rangefor A.
C.
Data Preprocessing
Incomplete, noisy, and inconsistent data arecommonplace properties of large real world databasesand data warehouses. Incomplete data can occur for anumber of reasons. Attributes of interest may notalways be available. Other data may not be includedsimply because it was not considered important at thetime of entry. Relevant data may not be recorded dueto a misunderstanding, or because of equipmentmalfunctions. Data that were inconsistent with otherrecorded data may have been deleted. Furthermore,the recording of the history or modifications to thedata may have been overlooked. Missing data,particularly for tuples with missing values for someattributes, may need to be inferred. There are manypossible reasons for noisy data (having incorrectattribute values). Data cleaning (or data cleansing)routines attempt to fill in missing values, smooth outnoise while identifying outliers, and correctinconsistencies in the data. Handling Missing Values:The attribute mean or stddev to fill in the missingvalue.
D.
C45 ALGORITHM
Algorithm
:
Geneate_decision_treeInput
:
Data partition, D, which is a set of training tuples andtheir associated class labels. Attribute_list, the set of candidate attributes. Attribute_selection_method, a
procedure to determine the splitting criterion that “best”
partitions the data tuples into individual classes. Thiscriterion consists of a splitting_attribute and, possibly, eithera split point or splitting subset.Output
:
a decision treeMethod:(1) create a node N;(2) if tuples in D are all of the same class, C then(3) return N as a leaf nod labeled with the class C;(4) If attribute_list is empty then
28http://sites.google.com/site/ijcsis/ISSN 1947-5500