Proceedings of the International Conference on Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia, June 17-19, 2007 (B-28)

Data Mining Discretization Methods and Performances

Z. Marzuki, F. Ahmad
Faculty of Information Technology, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah, Malaysia
Phone: (60)-04-9284722; Fax: (60)-04-9284753
Email: {zaharin, fudz}

The discretization process is known to be one of the most important data preprocessing tasks in data mining. Many discretization methods are presently available, including Boolean Reasoning, Equal Frequency Binning, Entropy and others. Each method was developed for a specific problem or domain area; consequently, applying such a method in other areas might not be appropriate. Inappropriate use of a technique can cause serious problems in which important data is lost, leading to inaccurate results and unreliable models. This study evaluates the performance of various discretization methods on several domain areas. The experiments were validated using the 10-fold cross validation method, and a ranking of the methods' performances was derived from them. The results suggest that different discretization methods perform better in one or more domain areas.

1. Introduction

In data mining, the discretization process is known to be one of the most important data preprocessing tasks. Most existing machine learning algorithms are capable of extracting knowledge from databases that store discrete attributes. If the attributes are continuous, the algorithms can be integrated with a discretization algorithm that transforms them into discrete features. Discretization methods are used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals (5)(2). Discretization makes learning more accurate and faster.
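As a concrete illustration of the interval-based reduction described above, the following is a minimal Python sketch of equal-frequency binning, one of the methods compared in Section 2. The function name and the choice of integer bin labels are illustrative only and are not taken from any of the implementations used in this study.

```python
def equal_frequency_bins(values, n_bins):
    """Unsupervised, univariate equal-frequency binning: sort the
    values and cut them into n_bins groups of (near-)equal size;
    each group becomes one discrete bin label."""
    ordered = sorted(values)
    n = len(ordered)
    # Upper boundary of each bin: the largest value that falls in it.
    cuts = [ordered[min(n - 1, ((i + 1) * n) // n_bins - 1)]
            for i in range(n_bins)]

    def discretize(v):
        # Map a continuous value to the first bin whose upper
        # boundary is >= v.
        for idx, cut in enumerate(cuts):
            if v <= cut:
                return idx
        return n_bins - 1

    return [discretize(v) for v in values]
```

For example, discretizing six values into three bins assigns two values to each bin, regardless of how the values are spaced.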
The results (decision trees, induction rules) of the process are usually more compact, shorter and more accurate when discrete features are used instead of continuous features (7)(8). Furthermore, according to (3), "the most important performance criterion of the discretization method is the accuracy rate".

This paper begins with a brief description of the importance of discretization methods in data mining. Section 2 lists the steps of the discretization process and describes the chosen methods. The subsequent section explains the experimental evaluation conducted in this study and presents the results obtained. Discussion of the findings and concluding remarks follow in Section 4.

2. Discretization Process

A typical discretization process consists of four steps: (i) sort all the continuous values of the feature to be discretized; (ii) choose a cut point to split the continuous values into intervals; (iii) split or merge the intervals of continuous values; (iv) check the stopping criterion of the discretization process (8). To carry out this process, a discretization method has to be applied. The following subsections describe the different discretization methods used in this study, namely Boolean Reasoning, Equal Frequency Binning, Entropy Minimum Description Length (Entropy/MDL), Naïve and Semi Naïve.

2.1 Boolean Reasoning

As outlined by (10), a straightforward implementation of the algorithm combines the cuts found by the Boolean Reasoning procedure, discarding all but a small subset of attribute values while preserving discernibility. The remaining subset is a minimal set of cuts that preserves the discernibility inherent in the dataset. The algorithm operates by first creating a Boolean function f from the set of candidate cuts, and then computing a prime implicant of this function (9).

2.2 Equal Frequency Binning

Equal frequency binning is a simple unsupervised and univariate discretization algorithm which discretizes a continuous-valued attribute by creating a specified number of bins. The number of intervals determines the number of bins. Each bin is associated with a distinct discrete value, and an equal number of continuous values is assigned to each bin (8).

2.3 Entropy/MDL

According to (7), Entropy/MDL is an algorithm that recursively partitions the value set of each attribute so that a local measure of entropy is optimized. In this algorithm, the minimum description length principle defines a stopping criterion for the partitioning process.

2.4 Naïve

The Naïve algorithm takes both condition attributes and decision attributes into consideration (1). The algorithm sorts the condition attributes first, then considers a cut point between two values of each attribute.

2.5 Semi Naïve

Functionally similar to Naïve, but with additional logic to handle the case where value-neighbouring objects belong to different decision classes.

3. Experimental Evaluation

3.1 The Datasets

The datasets were obtained from the UCI Repository (4). Datasets from two different areas, medical and engineering, were selected; these datasets vary in size. From each area, four datasets were selected for the process. During each process, the five discretization methods were applied to the datasets. Summaries of the datasets are shown in Table 1 and Table 2.

Table 1. The medical datasets for the experiment

  Dataset                          Class   Ins   Cont   Attr
  Wisconsin breast cancer (wbc)      2     699     9      9
  Hepatitis (hep)                    2     155     6     20
  Lung Cancer (lung)                 3      32     -     57
  Lympho (lym)                       4     148    19     19

Table 2. The engineering datasets for the experiment

  Dataset                          Class   Ins   Cont   Attr
  Ionosphere (ion)                   2     351    34     34
  Image Segmentation (image)         7     210    19     19
  Relative CPU Performance (cpu)     8     209     8     10
  Auto-Mpg (auto)                  many    398     5      9

3.2 Experimental Design

During the experimental design, the missing values in the datasets must be handled; in this study, this step was accomplished using SPSS. Next, each dataset was discretized using the following discretization methods: Boolean Reasoning, Equal Frequency Binning, Entropy, Naïve and Semi Naïve. For training and testing purposes, a split factor of 0.2 was used to partition the datasets. A Genetic Algorithm was used to generate reducts and rules for all datasets. For each dataset, 10-fold cross validation was performed to train and test the Standard Voting classifier. The results of these steps were measured as the average percentage of correct classification across all iterations. The overall results of the experiment are presented in the next section.

3.3 Experimental Results

Table 3 and Table 4 show the experimental results for the Boolean Reasoning, Equal Frequency Binning, Entropy, Naïve and Semi Naïve discretization methods on the medical and engineering datasets respectively. The last row of each table gives the average accuracy of each discretization method on the dataset group.

Table 3. Classification accuracy (%) on the medical datasets

  Dataset   Boolean Reasoning   Equal Frequency   Entropy   Naïve
  wbc              81                  82            85       90
  hep              89                  81            48       85
  lung             79                  73            79       75
  lym              94                  50            90       93
  AVG             85.8                71.5          75.5     85.8

Table 4. Classification accuracy (%) on the engineering datasets

  Dataset   Boolean Reasoning   Equal Frequency   Entropy   Naïve
  ion              39                  93            76       93
  image            20                  71            68       78
  cpu             0.7                 1.6           0.6      1.7
  auto            1.9                 1.8           0.6      3.1
  AVG             15.4                41.9          36.3     44.0

Figure 1 and Figure 2 show the classification accuracy obtained for various numbers of classes in the medical and engineering datasets.

[Figure 1: Classification accuracy obtained for various numbers of classes of the medical dataset]

[Figure 2: Classification accuracy obtained for various numbers of classes of the engineering dataset]

4. Discussion and Conclusion

In this study, five data discretization methods, namely Boolean Reasoning, Equal Frequency Binning, Entropy, Naïve and Semi Naïve, were tested on two types of datasets: medical and engineering. Standard Voting was chosen as the classifier. The results illustrated in Table 4 show that the engineering datasets achieve less than 50% classification accuracy on average. In contrast, the results for the medical datasets, shown in Table 3, exhibit more than 70% classification accuracy. Based on this output, Naïve and Boolean Reasoning rank first among the methods used and are the most suitable discretization methods in the medical area; they provide better accuracy than the other three methods. Similarly, for engineering data with a specific class distribution, Naïve, Semi Naïve and Entropy give better results than the others.

From Table 4, the average classification accuracy of every discretization method on the cpu and auto datasets is less than 4%. This experiment reveals that datasets with more than 7 classes may give poor classification accuracy. Figure 1 and Figure 2 support this claim, as they show that classification accuracy decreases as the number of classes increases. Based on Figure 1, the Equal Frequency and Semi Naïve discretization methods produce lower classification accuracy when a larger number of classes is used. Similarly, Figure 2 reveals that all discretization methods give higher accuracy for small numbers of classes on the engineering datasets. Thus, the larger the number of classes, the lower the classification accuracy. In general, this experiment has shown that, for the engineering datasets, the number of classes does affect the classification accuracy of all the discretization methods tested. Further study into suitable class sizes for engineering datasets should be conducted; in addition, a study of the effect of different types of datasets should be carried out.

References

(1) Ohrn, A.: Discernibility and Rough Sets in Medicine: Tools and Applications. PhD Thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway. NTNU report 1999:133, IDI report 1999:14, ISBN 82-7984-014-1, 239 pages (1999).
(2) Kurgan, L. and Cios, K.J.: Discretization Algorithm that Uses Class-Attribute Interdependence Maximization. Proceedings of the 2001 International Conference on Artificial Intelligence (IC-AI 2001), pp. 980-987, Las Vegas, Nevada (2001).
(3) Chmielewski, M.R. and Grzymala-Busse, J.W.: Global Discretization of Continuous Attributes as Preprocessing for Machine Learning. International Journal of Approximate Reasoning, Vol. 15, pp. 319-331 (1996).
(4) The University of California: UCI Machine Learning Repository.
(5) Ratanamahatana, C.A.: CloNI: Clustering of √N-Interval Discretization. Proceedings of the 4th International Conference on Data Mining Including Building Application for CRM & Competitive Intelligence, Rio de Janeiro, Brazil (2003).
(6) Ohrn, A.: ROSETTA Technical Reference Manual. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway (2001).
(7) Dougherty, J., Kohavi, R. and Sahami, M.: Supervised and Unsupervised Discretization of Continuous Features. In Proc. Twelfth International Conference on Machine Learning, Los Altos, CA: Morgan Kaufmann, pp. 194-202 (1995).
(8) Liu, H. et al.: Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6, 393-423 (2002).
(9) Karthigasoo, S., Cheah, Y.N. and Manickam, S.: Improving the Accuracy of Medical Decision Support via a Knowledge Discovery Pipeline using Ensemble Techniques. Journal of Advancing Information and Management Studies, 2(1) (2005).
(10) Nguyen, H.S. and Skowron, A.: Quantization of Real-Valued Attributes. In Proc. Second International Joint Conference on Information Sciences, pp. 34-37 (1995).