
A Comparative Study of Pruned Decision Trees and Fuzzy Decision Trees

Houda Benbrahim and Amine Bensaid School of Science and Engineering, AlAkhawayn University in Ifrane (AUI) P.O.Box 1830, Ifrane 53000, Morocco. {H.Benbrahim, A.Bensaid}

Decision trees have been widely and successfully used in machine learning. However, they have suffered from overfitting in noisy domains. This problem has been remedied, in C4.5 for example, by tree pruning, resulting in improved performance. More recently, fuzzy representations have been combined with decision trees. How does the performance of fuzzy decision trees compare to that of pruned decision trees? In this paper, we propose a comparative study of pruned decision trees and fuzzy decision trees. Further, for continuous inputs, we explore different ways (1) for selecting the granularity of the fuzzy input variables, and (2) for defining the membership functions of the fuzzy input values. We carry out an empirical study using 12 data sets. The results show that a fuzzy decision tree constructed using FID3, combined with fuzzy clustering (to build membership functions) and cluster validity (to decide on granularity), is superior to pruned decision trees.

0-7803-6274-8/00/$10.00 © 2000 IEEE

I- Introduction

Decision trees are one of the most popular methods for learning and reasoning from feature-based examples. They have been studied thoroughly by the machine learning community. However, they have often been criticized for requiring near-perfect data and for degrading in performance in the presence of imperfect data. Data imperfection might be the result of noise, imprecise measurements, subjective evaluations, inadequate descriptive language, or simply missing data. These problems have been mitigated by tree pruning techniques and, more recently, by introducing fuzziness into decision trees.

Pruned decision trees try to overcome the problem of overfitting. Several methods now exist that prune or simplify trees in an attempt to prevent this phenomenon and construct trees with better generalization capabilities. Pruning results in improved performance, but no one pruning method outperforms all other methods in all application domains. Fuzzy decision trees treat features as fuzzy variables and also yield simple decision trees. Moreover, the use of fuzzy sets is expected to deal with uncertainty due to noise and imprecision.

We propose a comparative study of pruned decision trees and fuzzy decision trees. Important issues arise when evaluating the performance of a fuzzy decision tree. Performance depends on the way continuous crisp inputs are converted into fuzzy variables when no fuzzy terms are predefined. How can we decide on granularity for each input variable; i.e., what is the right number of input values (fuzzy sets) that each variable should take on? How can we build good membership functions for each input value?

This paper is organized as follows: Section 2 reviews crisp decision tree pruning. Section 3 introduces fuzzy decision trees and discusses issues related to selecting granularity for input variables and constructing the fuzzy sets for continuous-valued attributes. Section 4 presents experiments and results, comparing different techniques of pruning decision trees and fuzzy decision trees, tested on 12 different data sets.

II- Crisp Pruned Decision Trees

Decision trees were popularized by Quinlan with the ID3 algorithm [1]. Systems based on ID3 work well in symbolic domains. A large variety of extensions to the basic ID3 algorithm have been developed by different researchers. Practical issues in learning decision trees include determining how deeply to grow the decision tree, handling continuous attributes, choosing an appropriate attribute selection measure, and handling training data with missing attribute values.

A decision tree can become cumbersomely large for several reasons. One possible cause is noise: when examples have a large amount of feature noise (i.e., erroneous feature values) or class noise (i.e., mislabeled class values), the induction algorithm may expand the tree too far based on irrelevant distinctions between examples. Noise can cause some irrelevant features to be included among the selected tests. This leads to trees that overfit the training examples and perform poorly on unseen samples. The goal of pruning a tree is to eliminate tests that were dictated by noisy data.

Two categories of pruning methods are usually distinguished: pre-pruning and post-pruning techniques. Pre-pruning techniques decide to eliminate tree nodes and branches during the initial construction of the tree. Post-pruning is the more popular simplification method. The input to a post-pruning (or simply pruning) algorithm is an unpruned tree T, and its output is a pruned tree T', formed by removing one or more subtrees from T. Post-pruning techniques do not usually search all possible T'; instead, they rely on heuristics to guide their search. Some pruning methods use the same set for building and for pruning the tree. This is useful with small training set sizes; it allows taking advantage of all available examples during tree induction. However, it has the disadvantage of pruning the tree without reference to any previously unseen data.

In this paper, we consider post-pruning methods. Three of them (namely, "reduced error pruning," "critical value pruning" and "minimum error pruning") require a set for pruning the tree distinct from that used to grow it. The other two pruning methods we use are "error based pruning" and "pessimistic error pruning"; they use the same set for both growing and pruning the tree.
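To illustrate how a post-pruning method maps an unpruned tree T to a pruned tree T' with the help of a separate pruning set, the following is a minimal sketch of reduced error pruning. It is a simplified illustration, not the implementation used in the experiments: the `Node` structure, the helper names, and the rule "replace a subtree by a leaf when the error on the pruning examples reaching it does not increase" are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# An example is (attribute values, class label); all attributes are symbolic.
Example = Tuple[Dict[str, str], str]


@dataclass
class Node:
    majority: str                      # majority class of the training examples at this node
    attribute: Optional[str] = None    # attribute tested here; None marks a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)


def classify(node: Node, x: Dict[str, str]) -> str:
    """Follow the tests down the tree; fall back to the majority class
    when the example has a value with no matching branch."""
    while node.attribute is not None:
        child = node.children.get(x.get(node.attribute))
        if child is None:
            return node.majority
        node = child
    return node.majority


def errors(node: Node, examples: List[Example]) -> int:
    """Number of examples misclassified by the (sub)tree rooted at node."""
    return sum(classify(node, x) != y for x, y in examples)


def reduced_error_prune(node: Node, pruning_set: List[Example]) -> Node:
    """Bottom-up pruning: children are pruned first, then the subtree is
    replaced by a leaf if that does not increase the error on the pruning
    examples that reach this node."""
    if node.attribute is None:
        return node
    for value, child in list(node.children.items()):
        subset = [(x, y) for x, y in pruning_set if x.get(node.attribute) == value]
        node.children[value] = reduced_error_prune(child, subset)
    leaf = Node(majority=node.majority)
    if errors(leaf, pruning_set) <= errors(node, pruning_set):
        return leaf
    return node
```

The pruning set plays the role of previously unseen data: a test survives only if it reduces the error on examples that were not used to grow the tree.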

III- Fuzzy Decision Trees

Traditional ID3-like decision tree algorithms perform symbolic inductive learning. Therefore, they have limited use when some or all attributes are described by continuous features. C4.5 [4] and CART [2] have proposed ways to address this problem, as well as problems related to missing and noisy attributes. However, the resulting trees suffer from reduced comprehensibility, which is not always a welcome trade-off.

Fuzzy decision trees attempt to combine elements of symbolic and subsymbolic approaches. Fuzzy sets and fuzzy logic allow modeling language-related uncertainties, while providing a symbolic framework for knowledge comprehensibility. Fuzzy decision trees differ from the traditional crisp decision trees in three respects [6]:
• They use splitting criteria based on fuzzy restrictions.
• Their inference procedures are different.
• The fuzzy sets representing the data have to be defined.
The tree-building procedure in [6] follows that of ID3, except that the information utility of individual attributes is evaluated using fuzzy sets, memberships, and reasoning methods.

3.1 Fuzzy partitioning

In fuzzy decision trees, the interface between the comprehensible symbolic level and the sub-symbolic (numeric) level is provided by fuzzy sets. Fuzzy sets constitute the input to the fuzzy decision tree. They must be predefined; their membership functions must be constructed. Typically, these functions have been defined by the user, based on domain knowledge. The most widely used representations for these functions are the triangular and the trapezoidal functions. In the absence of domain knowledge, it is useful to define membership functions based on the data. Janikow [6] uses an inductive reasoning approach. We propose using fuzzy clustering.

3.1.1 Clustering with the Fuzzy c-Means

Cluster analysis refers to a collection of procedures that aim at exploring and sorting objects into groups or clusters so that the association or similarity is strong between members of the same group and weak between members of different groups.

A well known method for forming clusters is the family of c-means algorithms. The hard c-means (HCM) or k-means algorithm assumes that an object can belong to one and only one cluster. In practice, the separation between clusters is not always well defined; an object may exhibit characteristics of two or more clusters. The fuzzy c-means (FCM) algorithm [8] is a fuzzy extension of HCM. It minimizes the weighted-sum-of-squared-distances objective function:

$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, \|x_k - v_i\|_A^2$$

where:
• $m \in [1, \infty)$ is a weighting exponent on each fuzzy membership;
• $V = (v_1, v_2, \ldots, v_c)$ are c prototypes in $\mathbb{R}^p$;
• $u_{ik}$ is the fuzzy membership value of $x_k$ in cluster i;
• $U = [u_{ik}]$ is a fuzzy c-partition of $X = \{x_1, \ldots, x_n\}$;
• A is any positive definite (p × p) matrix;
• $\|x_k - v_i\|_A = \sqrt{(x_k - v_i)^T A \, (x_k - v_i)}$ is the distance from $x_k$ to $v_i$.

The basic idea in FCM is to minimize $J_m$ over the variables U and V, on the assumption that matrices U that are part of optimal pairs (U, V) of $J_m$ identify "good" partitions of the data.

3.2 Granularity of the fuzzy variables

When applying FCM to construct the membership functions for each input variable, the number of functions (or fuzzy sets) is assumed to be known. However, the "optimal" number may be different from one variable to the next. Furthermore, in the absence of relevant prior knowledge, it is often not clear how many functions to use for each variable. One way to determine this (granularity) number is to apply the clustering algorithm (e.g., FCM) several times with different numbers of clusters, then to evaluate the resulting partitions and pick the number that corresponds to the best partition.
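As a concrete illustration of applying FCM to a single continuous attribute with a fixed number of clusters c, the sketch below alternates the standard membership and prototype updates on scalar data with the Euclidean norm. It is a minimal sketch under stated assumptions rather than the implementation used in this study: the function names, the evenly spaced initialization of the prototypes, and the stopping rule based on the largest prototype shift are choices made for this example.

```python
from typing import List, Tuple


def memberships(xs: List[float], centers: List[float], m: float) -> List[List[float]]:
    """Standard FCM membership update; u[i][k] is the membership of xs[k] in cluster i."""
    u = [[0.0] * len(xs) for _ in centers]
    for k, x in enumerate(xs):
        d2 = [(x - v) ** 2 for v in centers]
        zeros = [i for i, d in enumerate(d2) if d == 0.0]
        if zeros:
            # x coincides with one or more prototypes: share full membership among them
            for i in zeros:
                u[i][k] = 1.0 / len(zeros)
        else:
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            for i in range(len(centers)):
                u[i][k] = inv[i] / total
    return u


def fcm_1d(xs: List[float], c: int, m: float = 2.0, eps: float = 1e-5,
           max_iter: int = 1000) -> Tuple[List[float], List[List[float]]]:
    """Fuzzy c-means on scalar data (requires m > 1 and non-constant xs).
    Returns the prototypes in ascending order and the membership matrix.
    This sketch does not guard against a cluster losing all membership."""
    lo, hi = min(xs), max(xs)
    # Assumed initialization: prototypes spread evenly over the data range.
    centers = [lo + (i + 0.5) * (hi - lo) / c for i in range(c)]
    for _ in range(max_iter):
        u = memberships(xs, centers, m)
        new_centers = []
        for i in range(c):
            w = [uik ** m for uik in u[i]]
            new_centers.append(sum(wk * x for wk, x in zip(w, xs)) / sum(w))
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:
            break
    centers.sort()
    return centers, memberships(xs, centers, m)
```

Each row of the returned matrix can be read as a sampled membership function for one fuzzy set of the attribute, which is how clustering supplies the fuzzy input values when no domain knowledge is available.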


Cluster validity measures have been used to evaluate partitions produced by clustering algorithms. Several measures have been proposed to validate both crisp and fuzzy partitions. In this work, we have used an instantiation of the crisp Generalized Dunn's Index (GDI) [9] that has been reported to yield good validity results for fuzzy partitions [10]. The partition produced by the fuzzy clustering algorithm (e.g., FCM) is first hardened, using the maximum-membership rule, for example. The resulting crisp partition is then submitted for evaluation by the crisp validity index. Dunn's index favors partitions with compact and well-separated clusters. The different instantiations of GDI differ in the way cluster compactness (or, alternatively, cluster diameter) and cluster separation are measured. The instantiation of GDI that we use defines the diameter $\Delta(X_i)$ of crisp cluster $X_i$ as the average of the distances from the elements of $X_i$ to its cluster center $v_i$:

$$\Delta(X_i) = \frac{1}{|X_i|} \sum_{x \in X_i} d(x, v_i)$$

where $v_i$ is the centroid of the points in $X_i$. The separation or inter-cluster distance $\delta(X_i, X_j)$ between clusters $X_i$ and $X_j$ is defined as the average of the distances between points in $X_i$ and points in $X_j$. Compared to indices based on the min or the max function, using the average function reduces the effect of a few noisy points that may be present in data set X:

$$\delta(X_i, X_j) = \frac{1}{|X_i| \, |X_j|} \sum_{x \in X_i} \sum_{y \in X_j} d(x, y)$$

GDI is then defined as:

To determine the granularity of a given continuous attribute A, the values that A takes on are stored in a set X. Then FCM is applied to X several times, each time with a different number (c) of clusters, $c = 2, 3, \ldots, c_{\max}$. Each resulting c-partition is evaluated using V(c). Then the number of clusters c = c* that minimizes V(c) is taken as the optimal value of c.
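The selection loop can be sketched as follows for a single scalar attribute. This is an illustration under stated assumptions, not the exact index used here: the sketch takes the validity index to be the usual Dunn-type ratio of the smallest separation to the largest diameter, built from the average-based diameter and separation described above, for which larger values indicate more compact and better separated clusters, so it selects the arg-max; if V(c) is defined so that smaller is better, the selection becomes an arg-min. The names `harden`, `gdi`, `select_granularity` and the `cluster_fn` callback (any routine returning a membership matrix, such as FCM) are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Membership = List[List[float]]  # u[i][k]: membership of xs[k] in cluster i


def harden(xs: List[float], u: Membership) -> List[List[float]]:
    """Maximum-membership rule: put xs[k] in the cluster with the largest u[i][k].
    Empty clusters are dropped."""
    clusters: List[List[float]] = [[] for _ in u]
    for k, x in enumerate(xs):
        best = max(range(len(u)), key=lambda i: u[i][k])
        clusters[best].append(x)
    return [cl for cl in clusters if cl]


def diameter(cluster: List[float]) -> float:
    """Average distance from the points of a cluster to its centroid."""
    v = mean(cluster)
    return mean(abs(x - v) for x in cluster)


def separation(a: List[float], b: List[float]) -> float:
    """Average distance between points of cluster a and points of cluster b."""
    return mean(abs(x - y) for x in a for y in b)


def gdi(clusters: List[List[float]]) -> float:
    """Assumed Dunn-type ratio: smallest separation over largest diameter."""
    if len(clusters) < 2:
        return 0.0
    max_diam = max(diameter(cl) for cl in clusters)
    min_sep = min(separation(a, b)
                  for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return min_sep / max_diam if max_diam > 0 else float("inf")


def select_granularity(xs: List[float],
                       cluster_fn: Callable[[List[float], int], Membership],
                       c_max: int) -> Tuple[int, Dict[int, float]]:
    """Cluster xs for c = 2..c_max, score each hardened partition, return the best c."""
    scores = {c: gdi(harden(xs, cluster_fn(xs, c))) for c in range(2, c_max + 1)}
    return max(scores, key=scores.get), scores
```

Because the index is computed on the hardened partition, the same routine can score the output of either a crisp or a fuzzy clustering algorithm.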

IV- Experiments

4.1 Description of the experiments

The experiments in this paper are divided into two parts. The first part deals with pruned decision trees; we used 5 post-pruning methods as an extension to C4.5 [4]. The goal of this part is to test the performance of C4.5 with different pruning methods. Several comparisons of pruning methods have already appeared in the machine learning literature; Quinlan [3] and Esposito, Malerba and Semeraro [5], among others, have addressed those comparisons. In our experiments, we have followed the procedure in [5].

In the second part, a fuzzy decision tree is used. We carry out the experiments using the FID3.0 system developed by Janikow [6]. We carry out different experiments based on the different alternatives for constructing the membership functions and choosing the granularity of the fuzzy sets:
• For the generation of the fuzzy sets, we have adopted two different approaches: an inductive reasoning based approach described in [6] and an FCM-based clustering approach. For the first approach, we used a preprocessor found in the FID3.0 software [11]. For the second one, we use clustering to construct the membership functions of each continuous-valued attribute (one attribute at a time) of each data set. We apply FCM with N = 1000 iterations, ε = 0.00001, m = 2, and the Euclidean norm for both distance comparison and error determination.
• Concerning the granularity, we have experimented with different strategies: (1) using heuristics based on a priori knowledge about the data, (2) selecting granularity randomly, and (3) using cluster validity to determine the number of clusters that best describes the data.
We have explored different combinations of the choice of granularity and the generation of fuzzy sets to evaluate the performance of the fuzzy decision trees.

We use cross validation to test the performance of the different crisp and fuzzy decision trees obtained on 12 data sets (Iris, Glass, Hypo, Hepatitis, Cleveland, Hungary, Switzerland, Long Beach, Heart, Pima, Australian and German) from the UCI Machine Learning Database Repository [12]. For each data set, and each decision tree, we perform 30 trials. In each trial, the data set is randomly split into 2 subsets: a training set (70 % of the data) and a testing set (the remaining 30 % of the data). The prediction error of a given decision tree for a given data set is computed as the average error over all 30 trials along with a corresponding standard deviation. Finally, we carry out statistical significance tests for performance comparison of the different crisp and fuzzy partitions.

4.2 Experimental results

1- Pruned Decision Tree results: all the pruning methods improved the performance of the trees with respect to the unpruned trees for 7 out of 12 data sets, and did not increase the accuracy of the trees for 4 data sets. For one data set, the "Reduced Error Pruning" method did worse than the unpruned tree. We also note that the "Critical Value Pruning" method has not changed the performance with respect to the unpruned trees for all the data sets. Finally, there is no one pruning method that did best for all the data sets, but in relative terms, the "Error Based Pruning" used in the standard C4.5 algorithm produced consistently good results.

2- Fuzzy Decision Tree results: We have carried out experiments using some heuristics to choose the granularity of the data. With this choice of granularity, we have found that the performance of the fuzzy decision tree with fuzzy membership functions built using the inductive reasoning method (IR) is slightly better than that of fuzzy trees with fuzzy membership functions built using the FCM-based method: IR did better for 4 data sets out of 12, worse for 3 data sets and equally well for 5 data sets. An alternative way of choosing the granularity automatically is to use cluster validity. We compared this approach to the one based on heuristics and to selecting granularity randomly. We report that:
• Using the inductive reasoning method to build the fuzzy membership functions, the fuzzy decision tree using cluster validity to decide on the granularity did better than the fuzzy decision tree with random granularity for 6 data sets, did worse for 2 data sets and did equally well for 4 data sets. We retain cluster validity as the preferred way of choosing granularity.
• Using cluster validity to choose granularity, the fuzzy decision tree using the FCM-based approach to build the membership functions did better than the fuzzy decision tree using the inductive reasoning method to build the membership functions for 3 data sets out of 12, did worse for 2 data sets and did equally well for 7 data sets.

3- Comparison of Pruned Decision Trees and Fuzzy Decision Trees results: we directly compared the results of crisp and fuzzy decision trees. We observed the following:
• The crisp decision tree pruned using the "Error Based Pruning" method did better than the fuzzy decision tree with random granularity and the inductive reasoning based method to build the membership functions for 5 data sets out of 12, did worse for 2 data sets, and did equally well for 5 data sets.
• The fuzzy decision tree with cluster validity to decide the granularity and the inductive reasoning based method to build the membership functions did better than the crisp decision tree pruned using the "Error Based Pruning" method for 5 data sets out of 12, did worse for 3 data sets, and did equally well for 4 data sets.
• The fuzzy decision tree with cluster validity to decide the granularity and the FCM-based method to build the membership functions did better than the crisp decision tree pruned using the "Error Based Pruning" method for 6 out of 12 data sets, did worse for 1 data set, and did equally well for 5 data sets.
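For reference, the evaluation protocol of Section 4.1 behind these counts (30 trials, each with a random 70 %/30 % split, reporting the mean error and its standard deviation) might be sketched as follows. This is a hedged illustration only: `repeated_holdout` and the `train_and_test` callback are hypothetical names, and `majority_baseline` is a placeholder learner standing in for the crisp and fuzzy decision trees.

```python
import random
from statistics import mean, stdev
from typing import Callable, List, Sequence, Tuple


def repeated_holdout(data: Sequence,
                     train_and_test: Callable[[List, List], float],
                     trials: int = 30,
                     train_frac: float = 0.7,
                     seed: int = 0) -> Tuple[float, float]:
    """Run `trials` random train/test splits and return (mean error, std deviation).
    train_and_test(train, test) must build a model on train and return its
    error rate on test."""
    rng = random.Random(seed)
    errs = []
    for _ in range(trials):
        shuffled = list(data)
        rng.shuffle(shuffled)
        cut = int(round(train_frac * len(shuffled)))
        errs.append(train_and_test(shuffled[:cut], shuffled[cut:]))
    return mean(errs), stdev(errs)


def majority_baseline(train: List, test: List) -> float:
    """Placeholder learner: always predict the most frequent training label.
    Examples are (features, label) pairs."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(y != guess for _, y in test) / len(test)
```

Running every tree variant through the same routine with the same seed gives each method identical splits, which keeps per-data-set comparisons such as the better/worse/equal counts above on a common footing.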

In this paper, we have performed a comparative study of pruned decision trees and fuzzy decision trees on 12 data sets. We experimented with different ways of building fuzzy membership functions and determining the granularity of fuzzy variables. We can summarize our conclusions as follows:
• The fuzzy decision tree, used with the FCM-based method to generate the membership functions of the continuous-valued attributes and with granularity decided using cluster validity, clearly outperforms the crisp decision tree pruned using any of the most widely used pruning algorithms.
• The use of cluster validity as an automatic way to decide on the granularity of the fuzzy input variables improves the results of the fuzzy decision trees, compared to random selection of granularity.
• There is no one pruning method that did best for all the data sets, but the "Error Based Pruning" method shows a certain stability over the other pruning methods.

[1] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees," Belmont, CA: Wadsworth, 1984.
[3] J. R. Quinlan, "Simplifying Decision Trees," Int'l J. Man-Machine Studies, vol. 27, pp. 221-234, 1987.
[4] J. R. Quinlan, "C4.5: Programs for Machine Learning," San Mateo, Calif.: Morgan Kaufmann, 1993.
[5] F. Esposito, D. Malerba, G. Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5, May 1997.
[6] C. J. Janikow, "Fuzzy Decision Trees: Issues and Methods," IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, vol. 28, no. 1, February 1998.
[7] C. J. Janikow, "Fuzzy Processing in Decision Trees," Proceedings of 6th International Symposium on Artificial Intelligence, pp. 360-367, 1993.
[8] J. C. Bezdek, R. J. Hathaway, M. J. Sabin, and W. T. Tucker, "Convergence Theory for Fuzzy C-Means: Counterexamples and Repairs," IEEE Trans. Syst., Man, Cybern., Sep/Oct 1987.
[9] J. C. Bezdek and N. R. Pal, "Cluster validation with generalized Dunn's indices," Proc. 1995 2nd NZ Int'l Two-Stream Conference on ANNES, ed. N. Kasabov and G. Coghill, IEEE Press, Piscataway, NJ, 1995.
[10] H. Hassar, A. Bensaid, "Validation of Fuzzy and Crisp c-Partitions," Proceedings of 18th International Conference of the North American Fuzzy Information Processing Society - NAFIPS, pp. 342-346, New York, 6-8 June, 1999.
[11]
[12] C. J. Merz, P. M. Murphy. Repository of machine learning data sets. Univ. of CA, Dept. of Information and Computer Science, 1996. http://www.ics.uci.edu/~mlearn/MLRepository.html