Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid
Ron Kohavi
Data Mining and Visualization
Silicon Graphics, Inc.
2011 N. Shoreline Blvd
Mountain View, CA 94043-1389
ronnyk@sgi.com
decision-trees and Naive-Bayes. The decision-tree segments the data, a task that is considered an essential part of the data mining process in large databases (Brachman & Anand 1996). Each segment of the data, represented by a leaf, is described through a Naive-Bayes classifier. As will be shown later, the induction algorithm segments the data so that the conditional independence assumptions required for Naive-Bayes are likely to be true.

The Induction Algorithms

We briefly review methods for induction of decision-trees and Naive-Bayes.

Decision-trees (Quinlan 1993; Breiman et al. 1984) are commonly built by recursive partitioning. A univariate (single attribute) split is chosen for the root of the tree using some criterion (e.g., mutual information, gain-ratio, gini index). The data is then divided according to the test, and the process repeats recursively for each child. After a full tree is built, a pruning step is executed, which reduces the tree size. In the experiments, we compared our results with the C4.5 decision-tree induction algorithm (Quinlan 1993), which is a state-of-the-art algorithm.

Naive-Bayes (Good 1965; Langley, Iba, & Thompson 1992) uses Bayes rule to compute the probability of each class given the instance, assuming the attributes are conditionally independent given the label. The version of Naive-Bayes we use in our experiments was implemented in MLC++ (Kohavi et al. 1994). The data is pre-discretized using an entropy-based algorithm (Fayyad & Irani 1993; Dougherty, Kohavi, & Sahami 1995). The probabilities are estimated directly from data based on counts (without any corrections, such as Laplace or m-estimates).

Accuracy Scale-Up: the Learning Curves

A Naive-Bayes classifier requires estimation of the conditional probabilities for each attribute value given the label. For discrete data, because only few parameters need to be estimated, the estimates tend to stabilize quickly and more data does not change the underlying model much. With continuous attributes, the discretization is likely to form more intervals as more data is available, thus increasing the representation power. However, even with continuous data, the discretization is global and cannot take into account attribute interactions.

Decision-trees are non-parametric estimators and can approximate any "reasonable" function as the database size grows (Gordon & Olshen 1984). This theoretical result, however, may not be very comforting if the database size required to reach the asymptotic performance is more than the number of atoms in the universe, as is sometimes the case. In practice, some parametric estimators, such as Naive-Bayes, may perform better.

Figure 2 shows learning curves for both algorithms on large datasets from the UC Irvine repository [1] (Murphy & Aha 1996). The learning curves show how the accuracy changes as more instances (training data) are shown to the algorithm. The accuracy is computed based on the data not used for training, so it represents the true generalization accuracy. Each point was computed as an average of 20 runs of the algorithm, and 20 intervals were used. The error bars show 95% confidence intervals on the accuracy, based on the left-out sample.

[1] The Adult dataset is from the Census Bureau; the task is to predict whether a given adult makes more than $50,000 a year based on attributes such as education, hours of work per week, etc.
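The count-based Naive-Bayes described in The Induction Algorithms can be sketched in a few lines. This is a minimal illustration, not the MLC++ implementation: it assumes discrete (pre-discretized) attribute values, estimates every probability directly from counts with no Laplace or m-estimate corrections, and the class name and toy data are ours.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Count-based Naive-Bayes for discrete attributes.

    Probabilities are raw relative frequencies, with no smoothing,
    matching the uncorrected estimates used in the paper's setup.
    """

    def fit(self, X, y):
        self.n = len(y)
        self.label_counts = Counter(y)
        # attr_counts[(attr_index, value, label)] -> count
        self.attr_counts = defaultdict(int)
        for xs, label in zip(X, y):
            for i, v in enumerate(xs):
                self.attr_counts[(i, v, label)] += 1
        return self

    def predict(self, xs):
        best_label, best_score = None, -1.0
        for label, lc in self.label_counts.items():
            score = lc / self.n                              # P(label)
            for i, v in enumerate(xs):
                score *= self.attr_counts[(i, v, label)] / lc  # P(value | label)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# toy usage with two discrete attributes (illustrative data)
X = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
y = ["pos", "pos", "neg", "neg"]
nb = NaiveBayes().fit(X, y)
```

Because the counts are used without corrections, any attribute value never seen with a label drives that label's score to zero, which is exactly the behavior the uncorrected estimates imply.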
[Figure 2 appears here: nine learning-curve panels plotting accuracy against the number of training instances for Naive-Bayes and C4.5, including the DNA, waveform-40, and led24 datasets.]

Figure 2: Learning curves for Naive-Bayes and C4.5. The top three graphs show datasets where Naive-Bayes outperformed C4.5, and the lower six graphs show datasets where C4.5 outperformed Naive-Bayes. The error bars are 95% confidence intervals on the accuracy.
In most cases it is clear that even with much more data, the learning curves will not cross. While it is well known that no algorithm can outperform all others in all cases (Wolpert 1994), our world does tend to have some smoothness conditions and algorithms can be more successful than others in practice. In the next section we show that a hybrid approach can improve both algorithms in important practical datasets.

NBTree: The Hybrid Algorithm

The NBTree algorithm we propose is shown in Figure 3. The algorithm is similar to the classical recursive partitioning schemes, except that the leaf nodes created are Naive-Bayes categorizers instead of nodes predicting a single class.

A threshold for continuous attributes is chosen using the standard entropy minimization technique, as is done for decision-trees. The utility of a node is computed by discretizing the data and computing the 5-fold cross-validation accuracy estimate of using Naive-Bayes at the node. The utility of a split is the weighted sum of the utility of the nodes, where the weight given to a node is proportional to the number of instances that go down to that node.

Intuitively, we are attempting to approximate whether the generalization accuracy for a Naive-Bayes classifier at each leaf is higher than a single Naive-Bayes classifier at the current node. To avoid splits with little value, we define a split to be significant if the relative (not absolute) reduction in error is greater than 5% and there are at least 30 instances in the node.

Direct use of cross-validation to select attributes has not been commonly used because of the large overhead involved in using it in general. However, if the data is discretized, Naive-Bayes can be cross-validated in time that is linear in the number of instances, number of attributes, and number of label values. The reason is that we can remove the instances, update the counters, classify them, and repeat for a different set of instances. See Kohavi (1995) for details.
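The recursion and the cross-validated split utility can be sketched as follows. This is an illustrative reconstruction, not the paper's code: it handles discrete (pre-discretized) attributes only (the entropy threshold search for continuous ones is omitted), the helper names `nb_accuracy_cv` and `nbtree` are ours, and the cross-validation here re-counts from scratch rather than using the incremental counter-update trick. The 5-fold cross-validation, the 5% relative error reduction, and the 30-instance minimum follow the text.

```python
import random
from collections import Counter, defaultdict

def nb_accuracy_cv(data, folds=5):
    """5-fold cross-validated accuracy of a count-based Naive-Bayes
    (discrete attributes, raw counts, no corrections)."""
    data = list(data)
    random.shuffle(data)
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [d for i, d in enumerate(data) if i % folds != f]
        label_counts = Counter(label for _, label in train)
        attr_counts = defaultdict(int)
        for xs, label in train:
            for i, v in enumerate(xs):
                attr_counts[(i, v, label)] += 1
        for xs, label in test:
            best_lab, best_s = None, -1.0
            for lab, lc in label_counts.items():
                s = lc / len(train)                        # P(label)
                for i, v in enumerate(xs):
                    s *= attr_counts[(i, v, lab)] / lc     # P(value | label)
                if s > best_s:
                    best_lab, best_s = lab, s
            correct += best_lab == label
    return correct / len(data)

def nbtree(data, min_instances=30, min_rel_gain=0.05):
    """Recursive NBTree induction (discrete attributes only)."""
    node_util = nb_accuracy_cv(data)
    best_attr, best_util, best_parts = None, node_util, None
    for a in range(len(data[0][0])):
        parts = defaultdict(list)
        for xs, label in data:
            parts[xs[a]].append((xs, label))
        # utility of a split: child utilities weighted by instance counts
        util = sum(len(p) / len(data) * nb_accuracy_cv(p)
                   for p in parts.values())
        if util > best_util:
            best_attr, best_util, best_parts = a, util, parts
    # a split is significant only if the *relative* error reduction
    # exceeds 5% and the node holds at least 30 instances
    node_err = 1.0 - node_util
    if (best_attr is None or len(data) < min_instances or node_err <= 0
            or (node_err - (1.0 - best_util)) / node_err <= min_rel_gain):
        return ("leaf", data)  # a Naive-Bayes classifier is trained here
    return ("split", best_attr,
            {v: nbtree(p) for v, p in best_parts.items()})

# usage: an XOR-style dataset -- a single Naive-Bayes cannot model it,
# but splitting on either attribute makes each leaf trivial
random.seed(0)
data = ([(("a", "x"), "pos")] * 15 + [(("a", "y"), "neg")] * 15 +
        [(("b", "x"), "neg")] * 15 + [(("b", "y"), "pos")] * 15)
tree = nbtree(data)
```

On this toy data the root splits (the leaves become perfectly predictable), while any node with fewer than 30 instances is turned into a leaf regardless of utility.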
Given m instances, n attributes, and ℓ label values, the complexity of the attribute selection phase for discretized attributes is O(m · n² · ℓ). If the number of attributes is less than O(log m), which is usually the case, and the number of labels is small, then the time spent on attribute selection using cross-validation is less than the time spent sorting the instances by each attribute. We can thus expect NBTree to scale up well to large databases.

Input: a set T of labelled instances.
Output: a decision-tree with Naive-Bayes categorizers at the leaves.

1. For each attribute X_i, evaluate the utility, u(X_i), of a split on attribute X_i. For continuous attributes, a threshold is also found at this stage.
2. Let j = arg max_i u_i, i.e., the attribute with the highest utility.
3. If u_j is not significantly better than the utility of the current node, create a Naive-Bayes classifier for the current node and return.
4. Partition T according to the test on X_j. If X_j is continuous, a threshold split is used; if X_j is discrete, a multi-way split is made for all possible values.
5. For each child, call the algorithm recursively on the portion of T that matches the test leading to the child.

Figure 3: The NBTree algorithm. The utility u(X_i) is described in the text.

Experiments

To evaluate the NBTree algorithm we used a large set of files from the UC Irvine repository. Table 1 describes the characteristics of the data. Artificial files (e.g., monk1) were evaluated on the whole space of possible values; files with over 3,000 instances were evaluated on a left-out sample of size one third of the data, unless a specific test set came with the data (e.g., shuttle, DNA, satimage); other files were evaluated using 10-fold cross-validation. C4.5 has a complex mechanism for dealing with unknown values. To eliminate the effects of unknown values, we have removed all instances with unknown values from the datasets prior to the experiments.

Figure 4 shows the absolute differences between the accuracies for C4.5, Naive-Bayes, and NBTree. Each line represents the accuracy difference for NBTree and one of the two other methods. The average accuracy for C4.5 is 81.91%, for Naive-Bayes it is 81.69%, and for NBTree it is 84.47%.

Absolute differences do not tell the whole story because the accuracies may be close to 100% in some cases. Increasing the accuracy of medical diagnosis from 98% to 99% may cut costs by half because the number of errors is halved. Figure 5 shows the ratio of errors (where error is 100% minus accuracy). The shuttle dataset, which is the largest dataset tested, has only a 0.04% absolute difference between NBTree and C4.5, but the error decreases from 0.05% to 0.01%, which is a huge relative improvement.

The number of nodes induced by NBTree was in many cases significantly smaller than that of C4.5. For example, for the letter dataset, C4.5 induced 2109 nodes while NBTree induced only 251; in the adult dataset, C4.5 induced 2213 nodes while NBTree induced only 137; for DNA, C4.5 induced 131 nodes and NBTree induced 3; for led24, C4.5 induced 49 nodes, while NBTree used a single node. While the complexity of each leaf in NBTree is higher, ordinary trees with thousands of nodes could be extremely hard to interpret.

Related Work

Many attempts have been made to extend Naive-Bayes or to restrict the learning of general Bayesian networks. Approaches based on feature subset selection may help, but they cannot increase the representation power as was done here, thus we will not review them.

Kononenko (1991) attempted to join pairs of attributes (make a cross-product attribute) based on statistical tests for independence. Experimentation results were very disappointing. Pazzani (1995) searched for attributes to join based on cross-validation estimates.

Recently, Friedman & Goldszmidt (1996) showed how to learn a Tree Augmented Naive-Bayes (TAN), which is a Bayes network restricted to a tree topology. The results are promising and running times should scale up, but the approach is still restrictive. For example, their accuracy for the Chess dataset, which contains high-order interactions, is about 93%, much lower than C4.5 and NBTree, which achieve accuracies above 99%.

Conclusions

We have described a new algorithm, NBTree, which is a hybrid approach suitable in learning scenarios when many attributes are likely to be relevant for a classification task, yet the attributes are not necessarily conditionally independent given the label.
Dataset         Attrs  Train size  Test size
adult              14      30,162     15,060
breast (L)          9         277      CV-10
breast (W)         10         683      CV-10
chess              36       2,130      1,066
cleve              13         296      CV-10
crx                15         653      CV-10
DNA               180       2,000      1,186
flare              10       1,066      CV-10
german             20       1,000      CV-10
glass               9         214      CV-10
glass2              9         163      CV-10
heart              13         270      CV-10
ionosphere         34         351      CV-10
iris                4         150      CV-10
led24              24         200      3,000
letter             16      15,000      5,000
monk1               6         124        432
mushroom           22       5,644      3,803
pima                8         768      CV-10
primary-tumor      17         132      CV-10
satimage           36       4,435      2,000
segment            19       2,310      CV-10
shuttle             9      43,500     14,500
soybean-large      35         562      CV-10
tic-tac-toe         9         958      CV-10
vehicle            18         846      CV-10
vote               16         435      CV-10
vote1              15         435      CV-10
waveform-40        40         300      4,700

Table 1: The datasets used, the number of attributes, and the training/test-set sizes (CV-10 denotes 10-fold cross-validation was used).
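The entropy-minimization threshold selection that NBTree (like C4.5-style trees) uses for continuous attributes can be sketched as a single binary cut. This is our own minimal illustration, not the multi-interval Fayyad & Irani procedure; the function names and toy data are ours.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the cut point on one continuous attribute that minimizes
    the instance-weighted entropy of the two resulting partitions."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    # candidate cut points lie between consecutive distinct values
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if e < best_e:
            best_t, best_e = t, e
    return best_t

# usage: the labels switch between values 4 and 6, so the cut lands at 5.0
# best_threshold([1, 2, 3, 4, 6, 7], ["a", "a", "a", "a", "b", "b"])
```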
[Figure 4 appears here: accuracy differences (NBTree minus C4.5, and NBTree minus Naive-Bayes) plotted for every dataset, with the y-axis running from -10.00 to 30.00.]

Figure 4: The accuracy differences. One line represents the accuracy difference between NBTree and C4.5 and the other between NBTree and Naive-Bayes. Points above the zero show improvements. The files are sorted by the difference of the two lines so that they cross once.
[Figure 5 appears here: error ratios (NBTree/C4.5 and NBTree/NB) plotted for every dataset, with the y-axis running from 0.00 to 1.50.]

Figure 5: The error ratios of NBTree to C4.5 and Naive-Bayes. Values less than one indicate improvement.
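The error ratios behind Figure 5 are simple arithmetic. The sketch below uses the medical-diagnosis and shuttle error rates quoted in the text; the function name is ours.

```python
def error_ratio(err_new, err_old):
    """Ratio of two error rates, where error = 100% - accuracy.
    Values below one mean the new classifier makes fewer errors."""
    return err_new / err_old

# going from 98% to 99% accuracy halves the errors (2% -> 1%)
half = error_ratio(1.0, 2.0)

# shuttle: errors drop from 0.05% to 0.01%, a 5x relative improvement
# even though the absolute accuracy difference is only 0.04%
shuttle = error_ratio(0.01, 0.05)
```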
NBTree induces highly accurate classifiers in practice, significantly improving upon both its constituents in many cases. Although no classifier can outperform others in all domains, NBTree seems to work well on the real-world datasets we tested and it scales up well in terms of its accuracy. In fact, for the three datasets with over 10,000 instances (adult, letter, shuttle), it outperformed both C4.5 and Naive-Bayes. Running time is longer than for decision-trees and Naive-Bayes alone, but the dependence on the number of instances for creating a split is the same as for decision-trees, O(m log m), indicating that the running time can scale up well.

Interpretability is an important issue in data mining applications. NBTree segments the data using a univariate decision-tree, making the segmentation easy to understand. Each leaf is a Naive-Bayes classifier, which can also be easily understood when displayed graphically, as shown in Figure 1. The number of nodes induced by NBTree was in many cases significantly smaller than that of C4.5.

Acknowledgments We thank Yeo-Girl (Yogo) Yun who implemented the original CatDT categorizer in MLC++. Dan Sommerfield wrote the Naive-Bayes visualization routines in MLC++.

References

Brachman, R. J., and Anand, T. 1996. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining, chapter 2, 37-57. AAAI Press and the MIT Press.

Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth International Group.

Dougherty, J.; Kohavi, R.; and Sahami, M. 1995. Supervised and unsupervised discretization of continuous features. In Prieditis, A., and Russell, S., eds., Machine Learning: Proceedings of the Twelfth International Conference, 194-202. Morgan Kaufmann.

Fayyad, U. M., and Irani, K. B. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1022-1027. Morgan Kaufmann Publishers, Inc.

Fayyad, U. M.; Piatetsky-Shapiro, G.; and Smyth, P. 1996. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, chapter 1, 1-34. AAAI Press and the MIT Press.

Friedman, N., and Goldszmidt, M. 1996. Building classifiers using Bayesian networks. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. To appear.

Good, I. J. 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M.I.T. Press.

Gordon, L., and Olshen, R. A. 1984. Almost surely consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis 15:147-163.

Hamel, G., and Prahalad, C. K. 1994. Competing for the Future. Harvard Business School Press and McGraw Hill.

Kohavi, R.; John, G.; Long, R.; Manley, D.; and Pfleger, K. 1994. MLC++: A machine learning library in C++. In Tools with Artificial Intelligence, 740-743. IEEE Computer Society Press. http://www.sgi.com/Technology/mlc.

Kohavi, R. 1995. Wrappers for Performance Enhancement and Oblivious Decision Graphs. Ph.D. Dissertation, Stanford University, Computer Science department. ftp://starry.stanford.edu/pub/ronnyk/teza.ps.

Kononenko, I. 1991. Semi-naive Bayesian classifiers. In Proceedings of the Sixth European Working Session on Learning, 206-219.

Kononenko, I. 1993. Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence 7:317-337.

Langley, P.; Iba, W.; and Thompson, K. 1992. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, 223-228. AAAI Press and MIT Press.

Murphy, P. M., and Aha, D. W. 1996. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn.

Pazzani, M. 1995. Searching for attribute dependencies in Bayesian classifiers. In Fifth International Workshop on Artificial Intelligence and Statistics, 424-429.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Los Altos, California: Morgan Kaufmann Publishers, Inc.

Utgoff, P. E. 1988. Perceptron trees: a case study in hybrid concept representation. In Proceedings of the Seventh National Conference on Artificial Intelligence, 601-606. Morgan Kaufmann.

Wolpert, D. H. 1994. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In Wolpert, D. H., ed., The Mathematics of Generalization. Addison Wesley.