You are on page 1of 8

Information Fusion 14 (2013) 423430

Contents lists available at SciVerse ScienceDirect

Information Fusion
journal homepage: www.elsevier.com/locate/inffus

Ensembles of decision trees based on imprecise probabilities


and uncertainty measures
Joaqun Abelln
Department of Computer Science and Articial Intelligence, ETS de Ingenieras Informtica y de Telecomunicacin, University of Granada, 18071 Granada, Spain

a r t i c l e i n f o a b s t r a c t

Article history: In this paper, we present an experimental comparison among different strategies for combining decision
Received 8 February 2011 trees built by means of imprecise probabilities and uncertainty measures. It has been proven that the
Received in revised form 14 March 2012 combination or fusion of the information obtained from several classiers can improve the nal process
Accepted 15 March 2012
of the classication. We use previously developed schemes, known as Bagging and Boosting, along with a
Available online 23 March 2012
new one based on the variation of the root node via the information rank of each feature of the class var-
iable. To this end, we applied two different approaches to deal with missing data and continuous vari-
Keywords:
ables. We use a set of tests on the performance of the methods analyzed here, to show that, with the
Imprecise probabilities
Credal sets
appropriate approach, the Boosting scheme constitutes an excellent way to combine this type of decision
Uncertainty measures tree. It should be noted that it provides good results, even compared with a standard Random Forest clas-
Supervised classication sier, a successful procedure very commonly used in the literature.
Decision trees  2012 Elsevier B.V. All rights reserved.
Ensemble methods

1. Introduction principally motivated by a signicant improvement in accuracy


when they are applied to articial and real-world datasets.
In the area of machine learning, supervised classication learn- Abelln and Moral [4] presented a method for building decision
ing can be considered an important tool for decision support. Clas- trees using the study on measures of information/uncertainty in
sication can be dened as a machine learning technique used to general convex sets of probability distributions (see [5]). In this
predict group membership for data instances. It can be applied to method, the split criterion used, called Imprecise Info-Gain (IIG) is
decision support in medicine, character recognition, astronomy, different to the characteristics of classical split criteria [4,6,7]. An
banking and other elds. A classier may be represented using a important characteristic of this criterion as opposed to the classic
Bayesian network, a neural network, a decision tree, etc. In the lit- ones is that it can produce negative information values for the class
erature, it can be seen that the information obtained by a single variable, i.e. the value of the information on the class variable using
classier can be improved with the combination or fusion of sev- a feature can be lower than the value obtained for the class vari-
eral classiers belonging to the same type or to different ones. able alone. We label this type of trees credal decision trees1 or to
A decision tree, also known as a classication tree, is a simple simplify, credal trees (CTs).
structure that can be used as a classier. An important aspect of In this paper, we perform a set of experiments on a wide range
decision trees is their inherent instability which means that they of datasets in order to check the performance of credal trees for dif-
are sensitive to changes in the training examples, and signicantly ferent ensemble schemes: Bagging, Boosting, and IRN schemes. The
different hypotheses are therefore generated for different training latter scheme of IRN is the method recently presented in Abelln
datasets [1]. Compared with single classiers, the strategies used and Masegosa [8], which uses the IIG criterion in two important
to combine classiers are known to increase classier accuracy, steps of the procedure: to select the set of informative features
providing they combine different models. Decision trees built from used as root nodes of the decision trees in the combination
different training sub-samples from a given problem domain will scheme;2 and as a split criterion for building each decision tree.
produce quite different models. This characteristic is essential with The experimental study by Abelln and Masegosa [8] shows that
regard to considering them as suitable classiers in an ensemble
scheme such as Bagging [2], Boosting [1] or Random Forest [3].
1
Ensembles of decision tree classiers (also called forests in the lit- The term credal used here refers to the use of convex sets of probability
erature), are well accepted in the data mining community. This is distribution, i.e. credal sets, for building precise classiers (see [6]). It differs from its
use in imprecise classication (see [9]).
2
Since the main characteristic of this method is the use of Informative Root Nodes,
E-mail address: jabellan@decsai.ugr.es we have termed this method IRN-Ensemble or IRN to simplify.

1566-2535/$ - see front matter  2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.inffus.2012.03.003
424 J. Abelln / Information Fusion 14 (2013) 423430

IRN, with a standard previous preprocessing approach, performs well extracted from the dataset for each case of the class variable using
compared to other classic ensemble methods that use classical deci- Walleys imprecise Dirichlet model (IDM) [14], which represents a
sion trees. The method does not need to set the number of trees to be specic kind of convex set of probability distributions (see [15]).
used; rather, it uses a number of decision trees which depends on The IDM depends on a given hyperparameter s which does not
the number of informative features (using IIG criterion) in relation depend on the sample space [14]. The IDM estimates that the prob-
to the class variable. These informative features are used as root abilities for each value of the class variable C are within the
nodes. interval:
We study the performance of the following methods: the origi-  
nal IRN method, Bagging CTs (noted as BAGCT), and Boosting CTs ncj ncj s
pcj 2 ; ; j 1; . . . ; k;
(noted as BOOCT). To compare the results, we also use the following Ns Ns
methods: the IRN method replacing its nal procedure with a vote
procedure (noted as IRNV), Bagging decision trees built with Quin- with ncj as the frequency of the set of values (C = cj) in the dataset;
lans classic Information Gain (Info-Gain) split criterion [10] (noted and N the sample size. The value of parameter s determines the
as BAGIG), Boosting decision trees built with the Info-Gain split cri- speed at which the upper and lower probability values converge
terion (noted as BOOIG), and the Random Forest method. when the sample size increases. Higher values of s give a more cau-
To check the accuracy of the above-mentioned methods, we tious inference. Walley [14] does not give a denitive recommenda-
used two different approaches to deal with missing data and con- tion for the value of this parameter but he suggests two candidates:
tinuous variables. We show that for two of the classication meth- s = 1 or s = 2.
ods used here, the type of approach can be strongly related to their For this type of probability intervals, the maximum entropy is
accuracy. We compare the performance of the classication meth- estimated. This is a total uncertainty measure which is well known
ods analyzed by conducting a set of tests, including a Friedman for this type of sets (see [16,17]).
rank [11,12] on the methods used in each approach, and a set of The procedure for building credal trees is very close to the one
known post hoc tests. The most important outcome of this exper- used in Quinlans [10] well-known ID3 algorithm, replacing its
imental study is that the Boosting scheme, with the appropriate Info-Gain split criterion with the Imprecise Info-Gain (IIG) split cri-
approach, is the best way to ensemble credal decision trees. It ob- terion. This criterion can be dened as follows: in a classication
tains excellent results compared with other ensemble procedures for problem, let C be the class variable, {Z1, . . . , Zk} be the set of fea-
this type of decision trees. Indeed, it should be noted that Boosting tures, and Z be a feature; then
credal decision trees obtains a better Friedman rank score than a P
IIGD C; Z S K D C  PZ zi S K D CjZ zi ;
standard Random Forest method in the comparative study conducted i
among the seven methods presented in this paper. The results
D D
obtained with the set of tests carried out reinforce this statement. where K C and K CjZ zi are the credal sets obtained using the
The rest of the paper is organized as follows: in Section 2, we IDM for the variables C and (CjZ = zi) respectively, for a partition D of
present previous knowledge on the method for building credal the dataset (see [4]); and S is the maximum entropy function
decision trees and the ensemble schemes utilized with credal deci- (see [15]).
sion trees in this paper. In Section 3, we check all the ensemble The IIG criterion is different from the classical ones. It is based
methods studied here, using two different approaches to deal with on the principle of maximum uncertainty (see [5]), broadly used
missing data and continuous variables, on a set of datasets widely in the classic information theory, in which it is known as the max-
used in classication. Section 4 is devoted to the conclusions. imum entropy principle [18]. The use of the maximum entropy
function in the procedure for building decision trees is justied
2. Background in Abelln and Moral [6]. It is important to note that for a feature
Z and a partition D; IIGD C; Z can be negative. This situation does
2.1. Method for building credal decision trees not appear with classical split criteria such as the Info-Gain crite-
rion used in ID3.3 This characteristic enables the IIG criterion to re-
A decision tree is a simple structure that can be used as a clas- veal features that worsen the information on the class variable.
sier. An important reference on its origin is the work Classication Each node No in a decision tree produces a partition of the data-
and Regression Trees by Breiman et al. [13]. Quinlans well-known set (for the root node, D is considered to be the entire dataset). Fur-
ID3 algorithm for building decision trees, which was presented la- thermore, each node No has an associated list L of feature labels
ter, should also be mentioned. (that are not in the path from the root node to No). The procedure
In a decision tree, each internal (non-leaf) node represents an for building credal trees is explained in the algorithm in Fig. 1.
attribute variable (or predictive attribute or feature) and each Considering this algorithm, when an Exit situation is attained,
branch represents one of the states of this variable. Each tree leaf i.e. when there are no more features to introduce in a node, or
species an expected value of the class variable (the variable to when the uncertainty is not reduced (steps 1 and 4 of the algo-
be predicted). Important aspects of the procedures for building rithm, respectively), a leaf node is produced. In a leaf node, the
decision trees are: the split criterion, i.e. the criterion used for most probable state or value of the class variable for the partition
branching; and the stop criterion, i.e. the criterion used to stop associated with that leaf node is inserted. To avoid obtaining
branching. unclassied instances, if we do not have one single most probable
Decision trees are built using a set of data referred to as the class value, we can select the one obtained in its parent node, and
training dataset. A different set, called the test dataset, is used to so on recursively (see [8]).
check the model. When we obtain a new sample or instance of Credal decision trees can be somewhat smaller than the ones
the test dataset, we can make a decision or prediction about the produced with similar procedures replacing the IIG criterion with
state of the class variable by following the path in the tree from the classic ones (see [7]). This normally produces a reduction of
the root node to a leaf node, using the sample values and the tree the overtting of the model (see [6]), which could advise against
structure. its use in bagging or boosting schemes (see [19]).
The split criterion employed to build credal decision trees [4] is
based on the application of uncertainty measures on convex sets of 3
The Info-Gain criterion is actually a particular case of the IIG criterion using the
probability distributions. Specically, probability intervals are parameter s = 0.
J. Abelln / Information Fusion 14 (2013) 423430 425

ers where each one depends upon its predecessors; and that
each classier considers the error of the previous classier in
order to decide what to focus on during the next iteration of
the data. Boosting assumes training dataset to be perfect, so
this procedure does not perform as well as other strategies
with noisy training datasets. A special type of Boosting is
the algorithm of Adaboost [1] which has demonstrated
excellent experimental performance on benchmark datasets
and real applications. AdaBoost is an adaptive algorithm in
the sense that classiers built subsequently are tweaked in
favor of instances misclassied by previous classiers. It is
a feature selector with a principled strategy: minimization
of upper bound on empirical error.
(iii) The IIG criterion is used in two important ways in the recent
Fig. 1. Procedure to build a credal decision tree. procedure IRN-Ensemble method of Abelln and Masegosa
[8]. As a split criterion, a single credal tree needs to be built
and used to obtain the set of features that will make up the
root node of each tree used in the ensemble scheme. As sta-
Credal trees and the IIG criterion have been successfully used in
ted in the preceding sections, this criterion can use certain
other procedures and tools in data mining as a part of a procedure
features to provide negative values of information for the
for selecting variables [20] and on datasets with classication noise
class variable. Therefore, all the features can be considered
[7]. An extended version of the IIG criterion has been used to dene
as potential root nodes or, to the contrary, none of them
a semi-nave Bayes classier [21].
can be considered as root nodes. The method is not based
upon a nal voting procedure, and can be expressed as
2.2. Ensemble methods using credal decision trees follows:
 Use IIG rank to obtain the set of features {Z1, . . . , Zm},
The general idea normally used in the procedures for combining where IIG (CjZj) > 0, "j 2 {1, . . . , m}.
decision trees is based on the generation of a set of different deci-  Build m credal trees {T1, . . . , Tm}. Each Tj is built using Zj
sion trees combined with a majority vote criterion. When a new as the root node and the rest of nodes, for this tree, are
unclassied instance arises, each single decision tree makes a pre- selected following the procedure described in the previ-
diction and the instance is assigned to the class value with the ous section, using the IIG criterion.
highest number of votes. In an ensemble scheme, if all decision  In order to classify a new observation, apply the new case
trees are quite similar, ensemble performance will not be much to each of the m credal trees and consider the frequency
better than a single decision tree. However, if the ensemble com- (number of cases) of each state of the class variable in
prises a broad set of different decision trees exhibiting good indi- each of the leaf nodes. For a tree Tj, a new observation
vidual performance, the ensemble will become more robust, with is associated with a leaf node, and this also has an asso-
a better prediction capacity. There are many different approaches ciated partition of the dataset. Then njci is the frequency
to this problem, but Bagging [2], Random Forests [3], and AdaBoost (number of cases) with which ci appears in the partition
[1] stand out as the best known and most competitive ones. generated by the leaf node of Tj. Calculate the following
The schemes used here with credal trees can be briey de- relative frequencies (probabilities):
scribed as follows:
P njci
qi P j ;
(i) The Bootstrap Aggregating ensemble method, known as j ci nci
Breimans Bagging [2] is an intuitive and simple method
for each ci.
that performs very well. Diversity in Bagging is obtained
 For a new observation, the value obtained by the
by using bootstrap replicates of the original training dataset:
IRN-Ensemble classication method is ch, where: ch =
different training datasets are randomly drawn with
arg(maxi{qi}).
replacement. Subsequently, a single decision tree is built
with each instance of the training dataset using the standard
As can be inferred from the above explanation, the IRN method
approach [13]. Thus, each tree can be dened by a different
has an important characteristic with respect to the rest: it does not
set of variables, nodes and leaves. Finally, their predictions
x the number of trees (or iterations) to be used. It depends on the
are combined by a majority vote.
number of informative features with respect to the class variable
(ii) Boosting is an ensemble method based on the work of
(see [8]). This implies that it is possible to use a number of trees
Kearns [22]. This procedure can be described in the follow-
equal to the number of features in the dataset, or even zero.4 As
ing steps: (i) applying a weak classier (such as a decision
a very low number of trees can be used, the authors of the IRN meth-
tree) to the learning data, where each observation is
od do not consider a nal voting procedure appropriate to classify a
assigned an initially equal weight; (ii) applying weights to
new observation.
the observations in the learning sample that are inversely
proportional to the accuracy of the classication of the com-
puted predicted classications; (iii) going to the step (i) M- 3. Experimentation
times (M previously xed); and (iv) combining predictions
from individual models (weighted by accuracy of the mod- Our aim is to study the performance of the methods: the origi-
els). In the Boosting procedure: for misclassied samples nal IRN method (noted as IRN); the IRN method replacing its nal
the weights are increased, while for correctly classied sam-
ples the weights are decreased. Thus, the principal idea of 4
In that situation the classication procedure always returns the case of the class
this method is that it uses a sequence of successive classi- variable with greater frequency in the dataset.
426 J. Abelln / Information Fusion 14 (2013) 423430

procedure with a voting procedure (noted as IRNV); Bagging CTs Table 1


(noted as BAGCT); Bagging decision trees built with Quinlans clas- Dataset description. Column N is the number of instances in the datasets, column
Feat is the number of features or attribute variables, column Num is the number
sic Info-Gain split criterion [10] (noted as BAGIG); Boosting CTs of numerical variables, column Nom is the number of nominal variables, column k
(noted as BOOCT); Boosting decision trees built with the Info-Gain is the number of cases or states of the class variable (always a nominal variable) and
split criterion (noted as BOOIG); and the Random Forest method column Range is the range of states of the nominal variables of each dataset.
(noted as RF). Dataset N Feat Num Nom k Range
In order to check the above procedures, we used a broad and di-
Anneal 898 38 6 32 6 210
verse set of 25 known datasets, obtained from the UCI repository of Audiology 226 69 0 69 24 26
machine learning datasets which can be downloaded directly from Autos 205 25 15 10 7 222
http://archive.ics.uci.edu/ml/. We took datasets that are different Breast-cancer 286 9 0 9 2 213
with respect to the number of cases of the variable to be classied, Colic 368 22 7 15 2 26
cmc 1473 9 2 7 2 24
dataset size, feature types (discrete or continuous) and number of
Credit-german 1000 20 7 13 2 211
cases of the features. A brief description of the datasets can be Diabetes-pima 768 8 8 0 2
found in Table 1. Glass-2 163 9 9 0 2
For our experimentation, we used Weka software [23] on Java Hepatitis 155 19 4 15 2 2
1.5, and we added the methods required to build decision trees Hypothyroid 3772 29 7 22 4 24
Ionosphere 351 35 35 0 2
using the IIG split criterion. The ensembles of credal trees used
kr-vs-kp 3196 36 0 36 2 23
here: IRN-Ensembles (IRN and IRNV), Bagging credal trees (BAGCT) Labor 57 16 8 8 2 23
and Adaboost credal trees (BOOCT) were implemented using data Lymph 146 18 3 15 4 28
structures of Weka.5 The parameter of the IDM was set to s = 1, i.e. Mushroom 8123 22 0 22 2 212
Segment 2310 19 19 0 7
the value used in the original method by Abelln and Moral [4], be-
Sick 3772 29 7 22 2 2
cause the procedure to obtain the maximum entropy value (see [15]) Solar-are1 323 12 0 12 2 26
reaches its lowest computational cost for this value. Bagging and Sonar 208 60 60 0 2
Boosting ensembles were built with 100 decision trees.6 Although Soybean 683 35 0 35 19 27
the number of trees can strongly affect ensemble performance, this Sponge 76 44 0 44 3 29
Vote 435 16 0 16 2 2
is a reasonable number of trees for the low-medium size of the data-
Vowel 990 12 10 2 11 2
sets used in this study, and moreover it was used in related research, Zoo 101 16 1 15 7 2
such as Freund and Schapire [1]. We used the Random Forest meth-
od (RF) that can be found in Weka software as a benchmark to com-
pare the results. We used it with its default conguration, except
dure 10 times.8 Alternative methods of processing missing data have
that the number of trees used for this method was 100, the same
been presented in Pelckmans et al. [25].
as for the above-mentioned ensemble methods.
Tables 2 and 3 show the accuracy of the methods with ap-
We used the following approaches to deal with missing data
proaches (A) and (B), respectively.
and continuous variables in different implementations of the
Following the recommendations of Demsar [26], we used a ser-
methods:
ies of tests to compare the methods.9 Also, we have considered it
would be equally interesting to take into account the work by Garca
(A) For each feature, missing values within datasets were
and Herrera [27], where the work of Demsar [26] is completed. We
replaced with mean (for continuous features) and mode
used the following tests:
(for discrete features) values, with Wekas own lters. In
the same way, continuous features were discretized using
To compare two classiers on multiple datasets:
the known Fayyad and Iranis discretization method [24].
Wilcoxons test (Wilcoxon [28]): a non-parametric test
The approach was applied by means of the training dataset
which ranks the differences in performance of two classiers
and it was then transferring to the test dataset, for each pair
for each dataset, ignoring the signs, and compares the ranks
of training/test datasets.
for the positive and the negative differences.
(B) Assume that missing values are randomly distributed. In
To compare multiple classiers on multiple datasets:
order to compute the scores, the instances are split into
Friedmans test (Friedman [11,12]): a non-parametric test
units. The initial weight of an instance is equal to the unit,
that ranks the algorithms separately for each dataset, the
but when it drops down a branch, it receives a weight equal
best performing algorithm being assigned the rank of 1,
to the proportion of instances belonging to this branch
the second best, rank 2, etc. The null hypothesis is that all
(weights sum to 1). For continuous variables, only binary
the algorithms are equivalent. If the null-hypothesis is
split attributes are considered and each possible split point
rejected, we can compare all the algorithms to each other
is evaluated and nally selected. The one which induced a
using Holms test (Holm [29]).
partition of the samples with the highest split score, using
We have considered the above set of tests following a trade-off
the IIG criterion, is ultimately selected.7
on the recommendations by Demsar [26] and by Garca and Herre-
ra [27]. As suggested in those works, the Friedman test may report
It should be noted that both approaches were applied by using
a signicant difference but the post hoc test fails to detect it. This is
the training dataset and then translated the result to the test data-
due to the lower power of the post hoc test conducted. Nemenyis
set. For each dataset, we repeated a 10-fold cross validation proce-
test [30] is recommended by Demsar, but in some situations can be
5
In the event of ties, the ensemble procedures implemented in Weka (such as the
8
ones used in this paper) select the rst case of the class variable they nd, considering The dataset is separated into 10 subsets, and each one is used as a test dataset.
their order in the dataset le. The set obtained by joining the other 9 subsets is used as the training dataset. Thus,
6
The IRN-Ensemble method does not have a xed number of trees to be used. It we have 10 training datasets and 10 test datasets. This procedure is repeated 10 times
uses a number of trees less or equal than the number of features of each dataset. with prior random reordering. Finally, it produces 100 training datasets and 100 test
Always it is less than 70 decision trees in this experimental study. datasets. The percentage of correct classications for each dataset, given in tables, is
7
This approach is the one presented for the C4.5 method in Weka, adapted for the average of these 100 trials.
9
credal trees using the IIG criterion. All the tests were carried out using Keel software, available at http://www.keel.es.
J. Abelln / Information Fusion 14 (2013) 423430 427

Table 2
Percentage of correct classication of the methods with the (A) approach.

Dataset IRN IRNV BAGCT BAGIG BOOCT BOOIG RF


Anneal 99.67 99.64 99.71 99.60 99.72 99.68 99.41
Audiology 82.08 80.40 83.12 80.49 81.73 75.05 80.01
Autos 82.99 81.97 82.83 81.40 83.02 80.52 84.24
Breast-cancer 72.42 72.59 71.57 67.36 67.21 67.46 69.98
cmc 48.99 48.76 49.15 46.95 47.70 46.71 47.79
Horse-colic 83.94 84.42 83.88 82.47 80.22 80.03 83.20
German-credit 71.28 71.40 71.63 70.25 70.59 70.82 73.82
Pima-diabetes 74.38 74.35 73.88 73.22 73.23 73.10 73.36
Glass2 78.36 79.46 78.30 78.30 77.46 78.24 79.15
Hepatitis 81.53 80.99 80.75 78.78 79.10 79.61 82.05
Hypothyroid 99.36 99.36 99.36 99.34 99.40 99.31 99.21
Ionosphere 92.22 91.97 92.14 91.34 91.77 91.20 92.08
kr-vs-kp 99.52 99.50 99.48 99.59 99.64 99.59 99.28
Labor 87.97 87.80 86.23 87.00 88.40 89.17 88.97
Lymphography 79.74 79.95 76.93 76.78 84.14 79.08 82.74
Mushroom 100.00 100.00 100.00 100.00 99.99 100.00 100.00
Segment 96.43 96.13 95.10 95.49 95.94 95.10 96.63
Sick 97.87 97.80 97.89 97.63 97.62 97.50 97.71
Solar-are 96.97 97.04 97.22 96.11 95.74 95.95 96.91
Sonar 75.69 75.65 76.62 76.81 76.06 76.45 76.79
Soybean 93.56 92.93 92.87 89.72 92.28 89.13 93.89
Sponge 95.00 95.00 94.75 94.25 95.00 93.80 95.00
Vote 95.67 95.79 95.90 94.90 95.35 95.28 96.36
Vowel 77.15 74.80 76.48 79.03 79.02 78.72 80.86
Zoo 95.92 95.92 96.12 95.92 95.43 95.92 96.32
Average 86.35 86.14 86.08 85.31 85.83 85.10 86.63

a less sensitive test than others, as it is described by Garca and the same results of the tests on the methods when the approach (B)
Herrera. With Nemenyis test we can encounter situations where is used.11
the differences expressed by the Friedman test were not detected.
Hence, as the latter authors recommended, we considered con- 3.1. Comments on the results
ducting a post hoc Holms test.
The study of the p-values is very interesting because, citing a From Tables 2 and 3 we can obtain the differences between the
paragraph by Garca and Herrera [27]: A p-value provides infor- results of the methods using approaches (A) and (B). Some differ-
mation about whether a statistical hypothesis test is signicant ences can be seen in favor of the results obtained from approach
or not, and it also indicates something about how signicant the re- (B) compared to the results from approach (A). This is particularly
sult is: The smaller p-value, the stronger the evidence against the evident for some small datasets such as glass2, hepatitis and sonar,
null hypothesis. Most important, it does this without committing and for medium ones such as vowel. These differences are more
to a particular level of signicance. noticeable in methods BOOCT and RF.
When a p-value comes from a multiple comparison, it does not We conducted Wilcoxons tests on each separate model with
take into account the other comparisons in the multiple study. To each approach, i.e. each method using approach (A) against the
solve this problem, we can use a study of the Adjusted P-values same method using approach (B). The results are given in Table
(APVs) [31]. Using the APVs in a statistical analysis gives us more 4. In these tests, we obtained that the IRN method with the (A) ap-
information because it is not focused on a xed level of signi- proach exhibits no signicant differences from the same method
cance. The work by Garca and Herrera [27] provides an explana- and the (B) approach. The same situation also arises for the IRNV,
tion of how to obtain the values. In some situations, such as this BAGCT, BAGIG and BOOIG methods. However, it is important to re-
one, certain methods are more similar in performance. Therefore, mark that the tests performed for the BOOCT and RF methods ex-
to make a clear distinction between them, it is very interesting press that there are signicant positive differences in accuracy
the study of the APVs of the tests conducted in the experiments. when the (B) approach is used.
Table 4 gives the results of the Wilcoxons test carried out in The Friedman test carried out for each set of methods with each
each method separately in order to compare their performance approach rejects the null hypothesis. Therefore, Holms tests were
by using approaches (A) and (B), respectively. The table shows conducted. The Friedmans ranks of the methods using approach
the approach in which the method is better using Wilcoxons (A), expressed in Table 5, show that the best methods are IRN
test.10 and RF in this order; whereas the worst methods are those in
Tables 5 and 6 show Friedmans ranks of the methods with ap- which the Info-Gain criterion was used for the decision trees:
proaches (A) and (B), respectively (the null hypothesis is rejected in BOOIG and BAGIG. The situation is different for the best methods
both cases). when approach (B) is used (Table 6); the best methods here are
Table 7 shows the p-values of the tests carried out on the meth- BOOCT and RF in this order, but the situation is similar for the worst
ods when the approach (A) is used. The column p-valueH corre- methods, again the BOOIG and BAGIG methods have the worse
sponds to the p-values of the Holms tests on the methods in results.
each row. The column APVH corresponds to the adjusted p-values Focusing on the results of the Holms tests, Tables 7 and 8 ex-
of each comparison obtained from the Holms tests. Table 8 shows press the p-values of each comparison between two methods with

11
In both tables, for a weaker level of signicance a = 0.1, Holms procedure rejects
10
It is the same table of results obtained for a weaker level of signicance of 0.1 and those hypotheses that have a p-valueH 6 0.003846. With the Nemenyis test the
for a stronger one of 0.01. results are very similar to the ones obtained with the Holms test.
428 J. Abelln / Information Fusion 14 (2013) 423430

Table 3
Percentage of correct classication of the methods with the (B) approach.

Dataset IRN IRNV BAGCT BAGIG BOOCT BOOIG RF


Anneal 99.10 99.09 98.89 98.86 99.78 99.31 99.68
Audiology 78.41 75.49 80.41 77.67 81.91 72.15 80.32
Autos 82.37 79.98 80.32 67.77 82.99 61.04 84.19
Breast-cancer 71.31 70.57 70.35 69.84 68.58 67.00 70.12
cmc 53.03 52.69 53.23 54.64 50.46 52.46 50.50
Horse-colic 85.83 85.48 84.91 84.75 81.27 81.92 85.62
German-credit 74.27 74.81 74.65 74.46 74.95 72.06 76.10
Pima-diabetes 74.45 74.48 75.80 76.05 73.97 74.43 75.97
Glass2 81.75 80.21 83.00 81.05 88.05 82.41 87.90
Hepatitis 81.66 81.76 80.99 81.43 85.20 79.97 83.58
Hypothyroid 99.56 99.52 99.59 99.55 99.66 99.49 99.52
Ionosphere 91.91 91.77 91.23 91.37 93.79 90.89 93.45
kr-vs-kp 99.42 99.39 99.40 99.06 99.62 99.31 99.28
Labor 87.40 88.23 86.53 86.33 88.73 82.60 87.27
Lymphography 77.55 77.48 76.24 77.44 83.57 77.53 83.15
Mushroom 100.00 100.00 100.00 100.00 100.00 99.98 100.00
Segment 97.56 97.55 97.45 96.83 98.67 96.52 98.16
Sick 98.98 98.98 98.97 98.76 99.00 98.69 98.42
Solar-are 97.10 97.10 97.31 97.84 95.86 97.44 96.91
Sonar 82.99 83.14 80.78 78.00 86.81 78.96 84.72
Soybean 90.91 89.82 90.47 87.60 91.86 89.12 93.37
Sponge 92.50 92.50 92.63 92.50 93.71 91.46 95.00
Vote 96.00 95.77 96.34 95.54 95.47 94.94 96.43
Vowel 91.38 91.26 91.13 88.08 95.80 81.70 96.68
Zoo 92.01 92.01 92.40 92.51 95.65 92.10 96.33
Average 87.10 86.76 86.92 85.92 88.21 84.54 88.51

Table 4
Results of the Wilcoxons test conducted with a level of signicance of 0.05,
performed in each method separately using approaches (A) and (B). The table shows Table 6
the approach in which the method is better using the results of the Wilcoxons test. Friedmans ranks of the algorithms
indicates non signicant statistical differences. with the approach (B).

Wilcoxon Test Method Rank


IRN IRNV BAGCT BABIG BOOCT BOOIG RF BOOCT 2.7
RF 2.8
(A)/(B) (B) (B)
IRN 3.52
BAGCT 3.94
IRNV 4.18
BAGIG 4.9
Table 5
BOOIG 5.96
Friedmans ranks of the algorithms
with the approach (A).

Method Rank
IRN 3.02
RF 3.04 Table 7
BAGCT 3.42 P-values table for approach (A). For a = 0.05, Holms procedure rejects those
IRNV 3.5 hypotheses that have a p-valueH 6 0.003333. The APVH column corresponds to the
BOOCT 4.42 adjusted p-values obtained from the Holms tests.
BAGIG 5.16
BOOIG 5.44 i Methods p-valueH APVH
1 IRN vs. BOOIG 0.002381 0.00157
2 BOOIG vs. RF 0.0025 0.001714
3 IRN vs. BAGIG 0.002632 0.008761
each approach. Table 7 shows that with approach (A) the best 4 BAGIG vs. RF 0.002778 0.00938
methods are IRN and the RF. They obtain very similar results, i.e. 5 BAGCT vs. BOOIG 0.002941 0.016088
similar number of wins in the tests carried out.12 The IRNV and 6 IRNV vs. BOOIG 0.003125 0.023968
7 BAGCT vs. BAGIG 0.003333 0.066046
BAGCT methods are also winner ones when they are compared with
8 IRNV vs. BAGIG 0.003571 0.092279
the worst method: BOOIG. Table 8 shows that with the approach (B) 9 IRN vs. BOOCT 0.003846 0.285308
the best methods are BOOCT and the RF methods.13 The IRN and IRNV 10 BOOCT vs. RF 0.004167 0.286933
methods are also winners when they are compared with the worst 11 BOOCT vs. BOOIG 0.004545 1.045492
method: BOOIG. 12 BAGCT vs. BOOCT 0.005 1.045492
13 IRNV vs. BOOCT 0.005556 1.18929
It is not easy to compare the best methods for each approach 14 BAGIG vs. BOOCT 0.00625 1.806828
with the results obtained in this work, but as Garca and Herrera 15 IRN vs. IRNV 0.007143 3.024777
16 IRNV vs. RF 0.008333 3.024777
17 IRN vs. BAGCT 0.01 3.024777
12
18 BAGCT vs. RF 0.0125 3.024777
IRN has one more win in the Holms test than the RF if a weaker level of 19 BAGIG vs. BOOIG 0.016667 3.024777
signicance of 0.1 is used. 20 IRNV vs. BAGCT 0.025 3.024777
13
We must remark that BOOCT has one more win in the Holms test than the RF if a 21 IRN vs. RF 0.05 3.024777
weaker level of signicance of 0.1 is used.
J. Abelln / Information Fusion 14 (2013) 423430 429

Table 8 scheme with the appropriate approach is the best way to combine
P-values table for approach (B). For a = 0.05, Holms procedure rejects those credal decision trees. Using the best approach for each method ((B)
hypotheses that have a p-valueH 6 0.003333. The APVH column corresponds to the
adjusted p-values obtained from the Holms tests.
approach), we used a set of tests carried out on the result obtained
from the seven methods analyzed here to show that Boosting cre-
i Methods p-valueH APVH dal trees and the Random Forest methods are the best ones, with
1 BOOCT vs. BOOIG 0.002381 0.000002 Boosting credal trees being better than the Random Forest method
2 BOOIG vs. RF 0.0025 0.000005 in comparison with any other method.
3 IRN vs. BOOIG 0.002632 0.001238
4 BAGIG vs. BOOCT 0.002778 0.005715
5 BAGIG vs. RF 0.002941 0.010002 Acknowledgments
6 BAGCT vs. BOOIG 0.003125 0.015142
7 IRNV vs. BOOIG 0.003333 0.05366 This work has been supported by the Spanish Consejera de
8 IRNV vs. BOOCT 0.003571 0.215965
Econom a, Innovacin y Ciencia de la Junta de Andaluca under
9 IRNV vs. RF 0.003846 0.310844
10 IRN vs. BAGIG 0.004167 0.310844 Project TIC-06016.
11 BAGCT vs. BOOCT 0.004545 0.466564 Im very grateful to the anonymous reviewers of this paper for
12 BAGCT vs. RF 0.005 0.620745 their valuable suggestions and comments to improve its quality.
13 BAGIG vs. BOOIG 0.005556 0.744935
14 BAGCT vs. BAGIG 0.00625 0.929148
15 IRN vs. BOOCT 0.007143 1.257081
References
16 IRN vs. RF 0.008333 1.431879
17 IRNV vs. BAGIG 0.01 1.431879 [1] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in:
Thirteenth International Conference on Machine Learning, San Francisco, 1996,
18 IRN vs. IRNV 0.0125 1.431879
pp. 148156.
19 IRN vs. BAGCT 0.016667 1.431879
[2] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123140.
20 IRNV vs. BAGCT 0.025 1.431879
[3] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 532.
21 BOOCT vs. RF 0.05 1.431879 [4] J. Abelln, S. Moral, Building classication trees using the total uncertainty
criterion, International Journal of Intelligent Systems 18 (12) (2003) 1215
1225.
[5] G.J. Klir, Uncertainty and Information: Foundations of Generalized Information
recommend for this type of situations, we can use the study of the Theory, John Wiley, Hoboken, NJ, 2006.
APVs to analyse the differences between the best methods in this [6] J. Abelln, S. Moral, Upper entropy of credal sets, applications to credal
classication, International Journal of Approximate Reasoning 39 (2-3) (2005)
study in greater depth. Table 7 shows that for approach (A) there 235255.
exist very similar results in the strength of the rejection of the null [7] J. Abelln, A. Masegosa, An experimental study about simple decision trees for
hypothesis (values of the APVs) of the comparisons IRN vs. M and bagging ensemble on data sets with classication noise, in: C. Sossai, G.
Chemello (Eds.), ECSQARU, LNCS, vol. 5590, Springer, 2009, pp. 446456.
RF vs. M, with M being any other method. However, in Table 8, [8] J. Abelln, A. Masegosa, An ensemble method using credal decision trees,
about the tests on the methods with approach (B), we can observe European Journal of Operational Research 205 (1) (2010) 218226.
that this situation is more clearly in favor of the BOOCT method [9] M. Zaffalon, The naive credal classier, Journal of Statistical Planning and
Inference 105 (2002) 521.
when it is compared with the RF method. For example, the APV
[10] J. R Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81106.
of the comparison RF vs. BOOIG is 2.5 times greater than the [11] M. Friedman, The use of rank to avoid the assumption of normality implicit in
one of the comparison BOOCT vs. BOOIG; and the APV of the com- the analysis of variance, Journal of the American Statistical Association 32
(1937) 675701.
parison RF vs. BAGIG is 1.75 times the one of the comparison
[12] M. Friedman, A comparison of alternative tests of signicance for the problem
BOOCT vs. BAGIG. of m rankings, Annals of Mathematical Statistics 11 (1940) 8692.
We can use similar analysis to check the differences between [13] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classication and Regression
the worst methods for each approach. We have the same situation Trees, Wadsworth Statistics, Probability Series, Belmont, 1984.
[14] P. Walley, Inferences from multinomial data: learning about a bag of marbles,
with both approaches: comparisons BOOIG vs. M always give an Journal of the Royal Statistical Society B 58 (1996) 357.
APV that is notably lower14 than the one for BAGIG vs. M, with [15] J. Abelln, Uncertainty measures on probability intervals from Imprecise
M being any other method. Thus, we can say that clearly the worst Dirichlet model, International Journal of General Systems 35 (5) (2006) 509
528.
method of our study with the two approaches is BOOIG. [16] J. Abelln, G.J. Klir, S. Moral, Disaggregated total uncertainty measure for credal
Using the analysis reported in this paper, we can summarize the sets, International Journal of General Systems 35 (1) (2006) 2944.
following: the methods using approach (A) are worse than or equal [17] J. Abelln, A. Masegosa, Requirements for total uncertainty measures in
DempsterShafer theory of evidence, International Journal of General Systems
to the equivalent using approach (B); the RF method obtains the 37 (6) (2008) 733747.
best average value with both approaches but it can not be consid- [18] E.T. Jaynes, On the rationale of maximum-entropy methods, Proceeding of the
ered the best one for any approach; with approach (B), the RF and IEEE 70 (9) (1982) 939952.
[19] F. Provost, P. Domingos, Tree induction for probability-based ranking, Machine
BOOCT methods can be considered the best methods, with BOOCT
Learning 52 (3) (2003) 199215.
being better than the RF in the comparisons with the other [20] J. Abelln, A. Masegosa, A lter-wrapper method to select variables for the
methods. naive bayes classier based on credal decision trees, International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems 17 (6) (2009) 833854.
[21] J. Abelln, Application of uncertainty measures on credal sets on the naive
4. Conclusions bayes classier, International Journal of General Systems 35 (6) (2006) 675
686.
[22] M. Kearns, Thoughts on hypothesis boosting, Unpublished manuscript, 1988.
In this paper, we analyzed different strategies for combining [23] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and
credal decision trees, i.e. decision trees built using imprecise prob- Techniques, second ed., Morgan Kaufmann, San Francisco, 2005.
abilities (using the IDM); and uncertainty measures (using the [24] U.M. Fayyad, K.B. Irani, Multi-valued interval discretization of continuous-
valued attributes for classication learning, in: Proceedings of the 13th
maximum entropy method). We used previously created Bagging International Joint Conference on Articial Intelligence, Morgan Kaufman, San
and Boosting schemes and a new one specically developed for Mateo, 1993, pp. 10221027.
credal decision trees. We experimentally showed that different ap- [25] K. Pelckmans, J. De Brabanter, J.A.K. Suykens, B. De Moor, Handling missing
values in support vector machine classiers, Neural Networks 18 (5-6) (2005)
proaches of dealing with missing data and continuous variables
684692.
can signicantly vary the results of some methods for creating [26] J. Demsar, Statistical comparison of classiers over multiple data sets, Journal
ensemble credal decision trees. We can conclude that the Boosting of Machine Learning Research 7 (2006) 130.
[27] S. Garca, F. Herrera, An extension on statistical comparisons of classiers over
multiple data sets for all pairwise comparisons, Journal of Machine Learning
14
In this case, lower signies stronger evidence in favor of the M method. Research 9 (2008) 26772694.
430 J. Abelln / Information Fusion 14 (2013) 423430

[28] F. Wilcoxon, Individual comparison by ranking methods, Biometrics 1 (1945) [30] P.B. Nemenyi, Distribution-free multiple comparison, PhD thesis, Princenton
8083. University, 1963.
[29] S. Holm, A simple sequentially rejective bonferroni test procedure, [31] P.H. Westfall, S.S. Young, Resampling-Based Multiple Testing: Examples and
Scandinavian Journal of Statistics 6 (1979) 6570. Methods for p-value Adjustment, John Wiley and Sons, 2004.