
Discovery of Significant Classification Rules from Incrementally Induced Decision Tree Ensemble for Diagnosis of Disease

Minghao Piao, Jong Bum Lee, Khalid E.K. Saeed, and Keun Ho Ryu

Database/Bioinformatics Lab, Chungbuk National University,
361-763 Cheongju, Korea
{bluemhp,jongbumlee,abolkog,khryu}@dblab.chungbuk.ac.kr

Abstract. Previous studies show that using significant classification rules to accomplish the classification task is suitable for bio-medical research. Many significant rules can be discovered by applying ensemble methods to decision tree induction. However, those traditional approaches are not useful for incremental tasks. In this paper, we use an ensemble method named Cascading and Sharing to derive many significant classification rules from an incrementally induced decision tree and to improve the classifier's accuracy.

Keywords: Classification rules, Incremental tree induction, Ensemble method, Cascading and Sharing.

1 Introduction
Decision trees are commonly used for gaining information for the purpose of decision making. For inductive learning, decision trees are attractive for three reasons: (1) A decision tree generalizes well to unobserved instances, provided the instances are described in terms of features that are correlated with the target concept. (2) The methods are computationally efficient, with cost proportional to the number of observed training instances. (3) The resulting tree provides a representation of the concept that is explainable to humans.
Many existing decision trees are based on Hunt's algorithm; trees developed from it include ID3 [1], C4.5 [2], and C5.0 [3]. Tree induction in these non-incremental approaches relies on a consistent set of labeled examples, and inducing a decision tree is quite inexpensive because exactly one tree is generated, mapping a single batch of examples to a particular tree. ID4 [4], ID5R [5], and ITI [6] are incremental approaches, and their results are as correct as those of the non-incremental approaches.
To improve the classifier's accuracy, ensemble methods construct a set of base classifiers from the training data set and perform the classification by voting on the predictions made by each classifier. An ensemble of classifiers can be constructed in many ways [7]; the most widely used is to manipulate the training set, as in bagging and boosting. Three interesting observations are described in [8] based on a study of many ensemble methods: (1) Many ensembles constructed by
the boosting method were singletons. Due to this constraint, deriving classification rules has a limitation: decision trees are not encouraged to derive many significant rules, and the rules are mutually exclusive, covering the training samples exactly once. (2) Many top-ranked features possess similar discriminating merits with little difference for classification, which indicates that it is worthwhile to employ different top-ranked features as the root nodes for building multiple decision trees. (3) The fragmentation problem is another issue these ensemble methods have: less and less training data are used to search for the root nodes of sub-trees.
Based on these observations, we need a method that can break the singleton coverage constraint and solve the fragmentation problem. The method should also be able to deal with incrementally collected large data sets, handle the data with growing models, and guarantee accuracy.

2 Related Work
In this section, we describe non-incremental and incremental decision tree induction. We also illustrate some widely used ensemble methods and their use in incremental induction tasks.

2.1 Non-incremental Decision Tree Induction

One approach to the induction task is ID3, a simple decision tree learning algorithm developed by Ross Quinlan. ID3 is based on information theory and constructs the tree using a top-down, greedy search through the given sets, testing each attribute at every tree node. To select the attribute that is most useful for classifying an instance, it uses information gain [1], [10]. Given an example data set S, the entropy of S is defined by Equation 1:

$$\mathrm{Entropy}(S) = -P(p)\log_2 P(p) - P(n)\log_2 P(n) \qquad (1)$$

where P(p) is the proportion of positive examples in S and P(n) is the proportion of negative examples in S. The information gain is then given by Equation 2:

$$\mathrm{InformationGain}(S, A) = \mathrm{Entropy}(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \qquad (2)$$

where A is an attribute with V distinct values and S_v is the subset of records in S that take the v-th value of A.
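For illustration, the following Python sketch computes Equations 1 and 2 for a small categorical data set; the attribute names and data are hypothetical, and this is not the implementation used by ID3 or by the authors.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (Equation 1, generalized to any number of classes)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(records, labels, attribute):
    """Information gain of splitting `records` on `attribute` (Equation 2).
    `records` is a list of dicts, `labels` the corresponding class labels."""
    total = len(records)
    gain = entropy(labels)
    # Partition the labels by the value each record takes for `attribute`.
    partitions = {}
    for rec, lab in zip(records, labels):
        partitions.setdefault(rec[attribute], []).append(lab)
    for subset in partitions.values():
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy usage with a hypothetical two-attribute data set.
records = [{"texture": "high", "radius": "small"},
           {"texture": "high", "radius": "large"},
           {"texture": "low",  "radius": "large"},
           {"texture": "low",  "radius": "small"}]
labels = ["benign", "malignant", "malignant", "benign"]
print(information_gain(records, labels, "radius"))   # 1.0: radius separates the classes perfectly
print(information_gain(records, labels, "texture"))  # 0.0: texture carries no information here
```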
The decision tree algorithm C4.5 [2] extends ID3 in the following ways: handling missing data, handling continuous data, pruning, generating rules, and a different splitting criterion. For splitting, C4.5 uses the gain ratio instead of information gain, choosing the attribute with the largest gain ratio among those with a larger-than-average information gain. Given a data set D that is split into s new subsets S = {D1, D2, ..., Ds}:

$$\mathrm{GainRatio}(D, S) = \frac{\mathrm{Gain}(D, S)}{\mathrm{SplitINFO}} \qquad (3)$$

$$\mathrm{SplitINFO} = -\sum_{i=1}^{s} \frac{|D_i|}{|D|}\log_2 \frac{|D_i|}{|D|} \qquad (4)$$
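Equations 3 and 4 can be sketched in the same style; the information gain is passed in (for instance as computed by the `information_gain` helper above), and the record representation is again only illustrative.

```python
import math
from collections import Counter

def split_info(records, attribute):
    """SplitINFO of partitioning `records` on `attribute` (Equation 4)."""
    total = len(records)
    counts = Counter(rec[attribute] for rec in records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(gain, records, attribute):
    """Gain ratio (Equation 3), given the information gain of `attribute`."""
    si = split_info(records, attribute)
    return gain / si if si > 0.0 else 0.0  # attribute with a single value gives no useful split
```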
C5.0 (called See5 on Windows) is a commercial version of C4.5 that is now widely used in many data mining packages such as Clementine and the RuleQuest tools. It is targeted toward use with large data sets. Its decision tree induction is close to that of C4.5, but its rule generation is different; however, the precise algorithms used in C5.0 are not public. One major improvement to the accuracy of C5.0 comes from boosting. Reported results show that C5.0 improves memory usage by about 90 percent, runs between 5.7 and 240 times faster than C4.5, achieves an error rate less than half of that of C4.5 on some data sets, and produces more accurate rules [3].

2.2 Incremental Decision Tree Induction

An incremental classifier can be characterized as ID3-compatible if it constructs a decision tree that is nearly identical to the one ID3 would produce from the entire training set. This property is maintained by classifiers such as ID4, ID5, ID5R, and ITI. ID4 was the first ID3 variant to support incremental learning.
ID4 applies ID3 in an incremental manner, allowing objects to be presented one at a time. The heart of this modification lies in a series of tables located at each potential decision tree root. Each table contains entries for the values of all untested attributes and summarizes the number of positive and negative instances with each value. As a new instance is added to the tree, the positive and negative counts for each attribute value are incremented, and those counts are used to compute the E-score for each possible test attribute at a node. Each decision node contains the attribute with the lowest E-score; if the current attribute no longer has the lowest E-score, it is replaced by the non-test attribute with the lowest E-score, and the sub-trees below the decision node are discarded. ID4 builds the same tree as the basic ID3 algorithm whenever, at each node, one attribute is clearly the best among the others.
ID5 extended this idea by selecting the most suitable attribute for a node while a new instance is processed, and restructuring the tree so that this attribute is pulled up from the leaves towards that node. This is achieved by suitable tree manipulations that allow the counters to be recalculated without examining the past instances.
ID5R is a successor of the ID5 algorithm. When the test attribute at a decision node has to be changed, instead of discarding the sub-trees, ID5R uses a pull-up process to restructure the tree and retains the training instances in the tree. This pull-up process only recalculates the positive and negative counts of the training instances during the manipulation. An ID5R tree is defined as follows: a leaf node (answer node) contains a class name and the set of instance descriptions at the node belonging to that class; a non-leaf node (decision node) contains an attribute test with a branch to another decision tree for each possible value of the attribute, together with a set of non-test attributes at the node. Each test or non-test attribute is associated with positive and negative counts for each possible value. When classifying an instance, the tree is traversed from the root node until a node is reached whose instances all belong to the same class; at that point the class label for the instance is assigned, whether the node is a leaf or a non-leaf node.

The basic algorithm of ITI follows ID5R, but adds the ability to handle numeric variables, instances with missing values, and inconsistent training instances, and it handles multiple classes, not just two. To update the tree, ITI uses two steps: incorporating an instance into the tree, and restructuring the tree as necessary so that each decision node contains the best test. When picking the best attribute it uses the gain ratio described for C4.5. A table of frequency counts for each class and value combination is kept at each decision node and is used both to ensure the best test at that node and for tree revision.
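As a rough illustration of the per-node bookkeeping described above, the sketch below maintains (attribute value, class) frequency counts for categorical attributes only; the structure is our simplification, not ITI's actual data structures or numeric-value handling (see [6] for those).

```python
from collections import defaultdict

class CountTable:
    """Per-node table of (attribute value, class) frequency counts,
    updated incrementally as new training examples arrive."""
    def __init__(self):
        # counts[attribute][value][class_label] -> frequency
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def add_example(self, example, label):
        """Incorporate one training example (a dict of attribute -> value)."""
        for attribute, value in example.items():
            self.counts[attribute][value][label] += 1

    def best_test(self, metric):
        """Return the attribute whose counts maximize the supplied metric,
        e.g. a gain-ratio estimate computed from the stored counts."""
        return max(self.counts, key=lambda a: metric(self.counts[a]))
```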

2.3 Ensemble Methods

Bagging and boosting were the first approaches of this kind; they construct multiple base trees, each from a modified replicate of the original training data. Bagging [11] is a method for generating multiple decision trees and using them as an aggregated predictor. The multiple decision trees are formed by bootstrap aggregating, which repeatedly samples from the data set with replacement, so some instances may appear several times in the same training set while others may be omitted from it. Unlike bagging, boosting [12] assigns a weight to each training example and may adaptively change the weights at the end of each boosting round.
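For concreteness, a bootstrap replicate of the kind bagging uses can be drawn as in the sketch below; `learn_tree` stands in for any base tree learner and is not a specific library function.

```python
import random

def bootstrap_replicate(examples, rng=random):
    """Sample len(examples) items with replacement; some examples repeat,
    others are left out, exactly as described for bagging."""
    return [rng.choice(examples) for _ in range(len(examples))]

def bagged_committee(examples, learn_tree, n_trees=10):
    """Train n_trees base classifiers, each on its own bootstrap replicate."""
    return [learn_tree(bootstrap_replicate(examples)) for _ in range(n_trees)]
```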
However, bagging and boosting are difficult to use in an incremental tree induction process because of their expensive cost and because they must manipulate the training data. The CS4 algorithm [8], [13], instead of manipulating the training data, keeps the original data unchanged but changes the tree-learning phase: it forces the top-ranked features to be the roots of the trees, while the remaining nodes are constructed as in C4.5. This is called tree cascading; for classification, the algorithm combines the tree committees and shares the rules in the committee in a weighted manner. Together, the cascading idea guides the construction of tree committees, while the sharing idea aggregates the discriminating power of each individual decision tree.

3 Incrementally Induced Decision Tree Ensemble


In bio-medical data mining, extracting useful diagnostic or prognostic knowledge from the results is very important. Only explainable results can be analyzed and easily understood when applied to bio-medical research and the diagnosis of a disease [14], [15]. Among the many classification approaches, classification rules derived from decision tree induction are helpful for this task, and previous studies show that they are powerful. We define a rule as a set of conjunctive conditions with a predictive term. The general form of a rule is: IF condition1 & condition2 & ... & conditionm, THEN a predictive term. The predictive term in a rule refers to a single class label. For clinical diagnosis, such rules help address issues in understanding the mechanism of a disease and improve the discriminating power of the classifier. A significant rule is one with large coverage, where the coverage satisfies a given threshold; for example, if the threshold is 60%, a rule whose coverage is larger than 60% is called a significant rule.
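The notion of a significant rule can be made concrete with a small sketch; the rule representation below is our own, chosen only for illustration, not the representation used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: list      # e.g. [("radius", ">", 15.0), ("texture", "<=", 20.0)]
    predicted_class: str  # the single class label predicted by the rule
    coverage: float       # fraction of the predicted class's training examples covered

def significant_rules(rules, threshold=0.6):
    """Keep only rules whose coverage exceeds the threshold (60% in the example above)."""
    return [r for r in rules if r.coverage > threshold]
```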
However, traditional ensemble methods cannot be used to build and refine the tree committee incrementally and to derive significant classification rules. Therefore, we introduce a new incremental decision tree learning algorithm that uses the skeleton of ITI and adopts
the Cascading and Sharing ensemble method of CS4 to break the constraint of singleton classification rules, producing many significant rules from the committee of decision trees and combining the rules' discriminating power to accomplish the prediction process. We call this algorithm ICS4; its main steps are captured by the three functions below:

incremental_update(node, training_example)
{
    add_training_example_to_tree(node, training_example)
        { Add the example to the tree using tree revision; }
    ensure_best_test(node)
        { Ensure that each node has the desired test; }
    assign_class_label_for_test_example(test_example)
    {
        if there are test examples
            for each k-th top-ranked test   // except the first (best) test
                force the test to be installed at the root node;
                for the remaining nodes
                    ensure_best_test(node);
            from each constructed decision tree
                derive significant classification rules;
    }
}

Algorithm 1. The skeleton of ICS4
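A hedged Python sketch of the committee-construction step of Algorithm 1 is shown below. The helper functions passed as parameters (`rank_tests`, `grow_tree_with_fixed_root`, `derive_rules`) are stand-ins for the ITI routines of [6] and the rule-derivation step of CS4, so this should be read as an outline of the control flow rather than as the authors' implementation.

```python
def build_tree_committee(root_tree, training_examples, k_trees,
                         rank_tests, grow_tree_with_fixed_root, derive_rules):
    """Build k_trees cascaded trees: the i-th top-ranked test is forced to be the
    root of the i-th tree, the rest of each tree is grown normally, and the
    significant rules of every tree are pooled (cascading and sharing)."""
    committee = [root_tree]                 # first tree: incrementally induced, ITI-style
    ranked = rank_tests(training_examples)  # tests ordered by gain ratio
    for test in ranked[1:k_trees]:          # skip the best test: it already roots the first tree
        committee.append(grow_tree_with_fixed_root(test, training_examples))
    rules = []
    for tree in committee:
        rules.extend(derive_rules(tree))    # one rule per root-to-leaf path
    return committee, rules
```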

Details of the add_training_example_to_tree and ensure_best_test functions are given in [6]. The third function, assign_class_label_for_test_example, only runs when there are test examples or unknown instances that need to be assigned a class label. For constructing the tree committees, there are two options: construct them at the point when classification must be performed, or construct them incrementally from the beginning of tree induction. However, at the moment the tree committees are constructed, the top-ranked features are the same under both strategies because the examples seen are identical; building the committees incrementally therefore only wastes time and storage. After deriving the rules, we use an aggregate score to perform the prediction task. The classification score [9] for a specific class, say class C, is calculated as:
$$\mathrm{Score}_C(T) = \sum_{i=1}^{K_C} \mathrm{Coverage}(rule_{C_i}) \qquad (5)$$

Here, K_C denotes the number of rules for class C and rule_{C_i} denotes the i-th rule of class C. If the score for class C is larger than that of every other class, the instance T is assigned the class label C.
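Assuming rule objects like the Rule sketch earlier in this section, Equation 5 can be implemented as a simple aggregation; `rule_covers` is a hypothetical predicate that checks whether all conditions of a rule hold for the instance.

```python
def classify(test_example, rules, classes, rule_covers):
    """Assign the class whose applicable rules have the largest total coverage (Equation 5)."""
    scores = {c: 0.0 for c in classes}
    for rule in rules:
        if rule_covers(rule, test_example):          # all conditions of the rule hold
            scores[rule.predicted_class] += rule.coverage
    return max(scores, key=scores.get)
```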

4 Experiments and Results

Breast cancer is a cancer that starts in the cells of the breast in women and men. Worldwide, breast cancer is the second most common type of cancer after lung cancer and the fifth most common cause of cancer death. Breast cancer is about 100 times as frequent among women as among men, but survival rates are equal in both sexes. In this section, we report the empirical behavior of the algorithm discussed above on the Wisconsin Breast Cancer Dataset, taken from the University of California at Irvine (UCI) machine learning repository [16]. This dataset consists of 569 instances with 32 features. Here, we use only the 20 top-ranked features to construct the tree committees.

Table 1. Attribute information

Feature Name         Description
ID number            Not used for training
Diagnosis            M = malignant, B = benign
Features 3-32        Ten real-valued features are computed for each cell nucleus:
  radius             mean of distances from center to points on the perimeter
  texture            standard deviation of gray-scale values
  perimeter
  area
  smoothness         local variation in radius lengths
  compactness        perimeter^2 / area - 1.0
  concavity          severity of concave portions of the contour
  concave points     number of concave portions of the contour
  fractal dimension  "coastline approximation" - 1
  symmetry

The mean, standard error, and "worst" (largest, i.e., the mean of the three largest values) of these features were computed, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
In the following tables, "50 vs. 50", for example, means that the example data set is divided into 50% training and 50% test data.
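As an aside, the same UCI data and splits can be reproduced with scikit-learn, for example as below; this is only a convenience for readers, since the experiments reported here use ITI, ICS4, and C5.0 rather than scikit-learn.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()   # the WDBC data: 569 instances, 30 real-valued features
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.5, random_state=0)   # the "50 vs. 50" setting
```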

Table 2. Confusion Matrix

50 vs. 50               Predicted
Actual        Benign    Malignant
Benign        169       10
Malignant     6         100

70 vs. 30               Predicted
Actual        Benign    Malignant
Benign        114       5
Malignant     4         66

80 vs. 20               Predicted
Actual        Benign    Malignant
Benign        88        1
Malignant     7         46

All                     Predicted
Actual        Benign    Malignant
Benign        207       5
Malignant     4         353

Table 3. Detailed accuracy by class

            FP rate   Precision   Recall   F-measure   Class
50 vs. 50   0.057     0.966       0.944    0.955       Malignant
            0.056     0.909       0.943    0.926       Benign
80 vs. 20   0.132     0.926       0.989    0.957       Malignant
            0.011     0.979       0.868    0.92        Benign
70 vs. 30   0.057     0.966       0.958    0.962       Malignant
            0.042     0.93        0.943    0.936       Benign
All         0.011     0.981       0.976    0.979       Malignant
            0.024     0.986       0.989    0.987       Benign

[Figure: bar chart comparing the accuracy (88-102%) of ITI, ICS4, C5.0 rule, and C5.0 across the training/test splits TR50/TE50, TR70/TE30, TR80/TE20, and TR100/TE100.]
Fig. 1. Comparison of accuracy

In Fig. 1, ITI and ICS4 are tested in incremental mode, while C5.0 is tested in batch (non-incremental) mode for both its rule-based and tree-based classifiers. As the results above show, the ICS4 algorithm achieves high performance on the different partitions of the example data and relative to different types of decision tree learning algorithms. This means that using the Cascading and Sharing method in incremental tree induction to derive significant rules can provide accuracy competitive with an incremental induction algorithm and even with non-incremental approaches when tested on the same example data. Regarding execution time and storage, because ICS4 constructs the tree committees only at the start of testing or when new unknown instances arrive, it finishes the work in a time comparable to ITI.

5 Conclusion
In this paper, based on well-accepted design goals, we introduced an approach for discovering many significant classification rules from incrementally induced decision trees using the Cascading and Sharing ensemble method. Tree induction offers a highly practical method for generalizing from examples whose class labels are known, and the average cost of constructing the tree committee is much lower than the cost of
building a new decision tree committee. To test the performance of ensembles of incremental tree induction, we used the Wisconsin Breast Cancer Dataset for the diagnosis of breast cancer. All 31 features are used without variable selection, and the feature-choice threshold is set to 20, so that 20 trees are constructed by iteratively forcing the 20 top-ranked features to be the root nodes of new trees. The first tree is constructed in an incremental induction manner, and the others are constructed when there are instances that need to be assigned a class label. The results show that this new approach is suitable for bio-medical research.

Acknowledgment. This work was supported by a Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (R01-2007-000-10926-0) and by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (The Regional Research Universities Program / Chungbuk BIT Research-Oriented University Consortium).

References
1. Quinlan, J.R.: Induction of Decision Trees. Machine Learning, 81–106 (1986)
2. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco
(1993)
3. RuleQuest Research Data Mining Tools, http://www.rulequest.com/
4. Schlimmer, J.C., Fisher, D.: A case study of incremental concept induction. In: Proceed-
ings of the Fifth National Conference on Artificial Intelligence, pp. 496–501 (1986)
5. Utgoff, P.E.: Incremental Induction of decision trees. Machine Learning, 161–186 (1989)
6. Utgoff, P.E., Berkman, N.C., Clouse, J.A.: Decision Tree Induction Based on Efficient
Tree Restructuring. Machine Learning, 5–44 (1997)
7. Tan, P.N., Steinbach, M., Kumar, V.: Ensemble methods. In: Introduction to data mining,
pp. 278–280. Addison Wesley, Reading (2006)
8. Li, J., Liu, H., Ng, S.-K., Wong, L.: Discovery of significant rules for classifying cancer diagnosis data. Bioinformatics 19, 93–102 (2003)
9. Utgoff, P.E.: Decision Tree Induction Based on Efficient Tree Restructuring. Technical re-
port, University of Massachusetts (1994)
10. Tan, P.N., Steinbach, M., Kumar, V.: Decision tree induction. In: Introduction to data min-
ing, pp. 150–172. Addison Wesley, Reading (2006)
11. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
12. Freund, Y., Schapire, R.E.: Experiments with a New Boosting Algorithm. In: The Thir-
teenth International Conference on Machine Learning, pp. 148–156 (1996)
13. Li, J., Liu, H.: Ensembles of cascading trees. In: Third IEEE international conference on
data mining, pp. 585–588 (2003)
14. Lee, H.G., Noh, K.Y., Ryu, K.H.: Mining Biosignal Data: Coronary Artery Disease Diag-
nosis using Linear and Nonlinear Features of HRV. In: PAKDD 2007 Workshop, BioDM
2007. LNCS. Springer, Heidelberg (2007)
15. Ryu, K.H., Kim, W.S., Lee, H.G.: A Data Mining Approach and Framework of Intelligent
Diagnosis System for Coronary Artery Disease Prediction. In: The 4th Korea-Japan Int’l
Database Workshop 2008 (2008)
16. UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml/datasets.html
