
Information Fusion

journal homepage: www.elsevier.com/locate/inffus

Ensembles of decision trees based on imprecise probabilities and uncertainty measures

Joaquín Abellán

Department of Computer Science and Artificial Intelligence, ETS de Ingenierías Informática y de Telecomunicación, University of Granada, 18071 Granada, Spain

Article info

Article history:
Received 8 February 2011
Received in revised form 14 March 2012
Accepted 15 March 2012
Available online 23 March 2012

Keywords:
Imprecise probabilities
Credal sets
Uncertainty measures
Supervised classification
Decision trees
Ensemble methods

Abstract

In this paper, we present an experimental comparison among different strategies for combining decision trees built by means of imprecise probabilities and uncertainty measures. It has been proven that the combination or fusion of the information obtained from several classifiers can improve the final process of the classification. We use previously developed schemes, known as Bagging and Boosting, along with a new one based on the variation of the root node via the information rank of each feature of the class variable. To this end, we applied two different approaches to deal with missing data and continuous variables. We use a set of tests on the performance of the methods analyzed here, to show that, with the appropriate approach, the Boosting scheme constitutes an excellent way to combine this type of decision tree. It should be noted that it provides good results, even compared with a standard Random Forest classifier, a successful procedure very commonly used in the literature.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

In the area of machine learning, supervised classification learning can be considered an important tool for decision support. Classification can be defined as a machine learning technique used to predict group membership for data instances. It can be applied to decision support in medicine, character recognition, astronomy, banking and other fields. A classifier may be represented using a Bayesian network, a neural network, a decision tree, etc. In the literature, it can be seen that the information obtained by a single classifier can be improved with the combination or fusion of several classifiers belonging to the same type or to different ones.

A decision tree, also known as a classification tree, is a simple structure that can be used as a classifier. An important aspect of decision trees is their inherent instability, which means that they are sensitive to changes in the training examples, and significantly different hypotheses are therefore generated for different training datasets [1]. Compared with single classifiers, the strategies used to combine classifiers are known to increase classifier accuracy, providing they combine different models. Decision trees built from different training sub-samples from a given problem domain will produce quite different models. This characteristic is essential with regard to considering them as suitable classifiers in an ensemble scheme such as Bagging [2], Boosting [1] or Random Forest [3]. Ensembles of decision tree classifiers (also called forests in the literature) are well accepted in the data mining community. This is especially the case when they are applied to artificial and real-world datasets.

Abellán and Moral [4] presented a method for building decision trees using the study of measures of information/uncertainty on general convex sets of probability distributions (see [5]). In this method, the split criterion used, called Imprecise Info-Gain (IIG), has characteristics different from those of classical split criteria [4,6,7]. An important characteristic of this criterion, as opposed to the classic ones, is that it can produce negative information values for the class variable, i.e. the value of the information on the class variable using a feature can be lower than the value obtained for the class variable alone. We label this type of trees credal decision trees¹ or, to simplify, credal trees (CTs).

In this paper, we perform a set of experiments on a wide range of datasets in order to check the performance of credal trees for different ensemble schemes: Bagging, Boosting, and IRN schemes. The latter scheme, IRN, is the method recently presented in Abellán and Masegosa [8], which uses the IIG criterion in two important steps of the procedure: to select the set of informative features used as root nodes of the decision trees in the combination scheme;² and as a split criterion for building each decision tree. The experimental study by Abellán and Masegosa [8] shows that

¹ The term credal used here refers to the use of convex sets of probability distributions, i.e. credal sets, for building precise classifiers (see [6]). It differs from its use in imprecise classification (see [9]).
² Since the main characteristic of this method is the use of Informative Root Nodes, we have termed this method IRN-Ensemble, or IRN to simplify.

E-mail address: jabellan@decsai.ugr.es

1566-2535/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.inffus.2012.03.003

J. Abellán / Information Fusion 14 (2013) 423–430

IRN, with a standard previous preprocessing approach, performs well compared to other classic ensemble methods that use classical decision trees. The method does not need to set the number of trees to be used; rather, it uses a number of decision trees which depends on the number of informative features (using the IIG criterion) in relation to the class variable. These informative features are used as root nodes.

We study the performance of the following methods: the original IRN method, Bagging CTs (noted as BAGCT), and Boosting CTs (noted as BOOCT). To compare the results, we also use the following methods: the IRN method replacing its final procedure with a vote procedure (noted as IRNV), Bagging decision trees built with Quinlan's classic Information Gain (Info-Gain) split criterion [10] (noted as BAGIG), Boosting decision trees built with the Info-Gain split criterion (noted as BOOIG), and the Random Forest method.

To check the accuracy of the above-mentioned methods, we used two different approaches to deal with missing data and continuous variables. We show that for two of the classification methods used here, the type of approach can be strongly related to their accuracy. We compare the performance of the classification methods analyzed by conducting a set of tests, including a Friedman rank [11,12] on the methods used in each approach, and a set of known post hoc tests. The most important outcome of this experimental study is that the Boosting scheme, with the appropriate approach, is the best way to ensemble credal decision trees. It obtains excellent results compared with other ensemble procedures for this type of decision trees. Indeed, it should be noted that Boosting credal decision trees obtains a better Friedman rank score than a standard Random Forest method in the comparative study conducted among the seven methods presented in this paper. The results obtained with the set of tests carried out reinforce this statement.

The rest of the paper is organized as follows: in Section 2, we present previous knowledge on the method for building credal decision trees and the ensemble schemes utilized with credal decision trees in this paper. In Section 3, we check all the ensemble methods studied here, using two different approaches to deal with missing data and continuous variables, on a set of datasets widely used in classification. Section 4 is devoted to the conclusions.

2. Background

2.1. Method for building credal decision trees

A decision tree is a simple structure that can be used as a classifier. An important reference on its origin is the work Classification and Regression Trees by Breiman et al. [13]. Quinlan's well-known ID3 algorithm for building decision trees, which was presented later, should also be mentioned.

In a decision tree, each internal (non-leaf) node represents an attribute variable (or predictive attribute, or feature) and each branch represents one of the states of this variable. Each tree leaf specifies an expected value of the class variable (the variable to be predicted). Important aspects of the procedures for building decision trees are: the split criterion, i.e. the criterion used for branching; and the stop criterion, i.e. the criterion used to stop branching.

Decision trees are built using a set of data referred to as the training dataset. A different set, called the test dataset, is used to check the model. When we obtain a new sample or instance of the test dataset, we can make a decision or prediction about the state of the class variable by following the path in the tree from the root node to a leaf node, using the sample values and the tree structure.

The split criterion employed to build credal decision trees [4] is based on the application of uncertainty measures on convex sets of probability distributions. Specifically, probability intervals are extracted from the dataset for each case of the class variable using Walley's imprecise Dirichlet model (IDM) [14], which represents a specific kind of convex set of probability distributions (see [15]). The IDM depends on a given hyperparameter s which does not depend on the sample space [14]. The IDM estimates that the probabilities for each value of the class variable C are within the interval:

    p(c_j) ∈ [ n_{c_j} / (N + s), (n_{c_j} + s) / (N + s) ],   j = 1, ..., k,

with n_{c_j} as the frequency of the set of values (C = c_j) in the dataset, and N the sample size. The value of the parameter s determines the speed at which the upper and lower probability values converge when the sample size increases. Higher values of s give a more cautious inference. Walley [14] does not give a definitive recommendation for the value of this parameter, but he suggests two candidates: s = 1 or s = 2.

For this type of probability intervals, the maximum entropy is estimated. This is a total uncertainty measure which is well known for this type of sets (see [16,17]).

The procedure for building credal trees is very close to the one used in Quinlan's [10] well-known ID3 algorithm, replacing its Info-Gain split criterion with the Imprecise Info-Gain (IIG) split criterion. This criterion can be defined as follows: in a classification problem, let C be the class variable, {Z_1, ..., Z_k} be the set of features, and Z be a feature; then

    IIG^D(C, Z) = S(K^D(C)) − Σ_i P(Z = z_i) · S(K^D(C | Z = z_i)),

where K^D(C) and K^D(C | Z = z_i) are the credal sets obtained using the IDM for the variables C and (C | Z = z_i) respectively, for a partition D of the dataset (see [4]); and S() is the maximum entropy function (see [15]).

The IIG criterion is different from the classical ones. It is based on the principle of maximum uncertainty (see [5]), broadly used in classic information theory, in which it is known as the maximum entropy principle [18]. The use of the maximum entropy function in the procedure for building decision trees is justified in Abellán and Moral [6]. It is important to note that for a feature Z and a partition D, IIG^D(C, Z) can be negative. This situation does not appear with classical split criteria, such as the Info-Gain criterion used in ID3.³ This characteristic enables the IIG criterion to reveal features that worsen the information on the class variable.

Each node No in a decision tree produces a partition of the dataset (for the root node, D is considered to be the entire dataset). Furthermore, each node No has an associated list L of feature labels (those that are not in the path from the root node to No). The procedure for building credal trees is explained in the algorithm in Fig. 1.

Considering this algorithm, when an Exit situation is attained, i.e. when there are no more features to introduce in a node, or when the uncertainty is not reduced (steps 1 and 4 of the algorithm, respectively), a leaf node is produced. In a leaf node, the most probable state or value of the class variable for the partition associated with that leaf node is inserted. To avoid obtaining unclassified instances, if we do not have one single most probable class value, we can select the one obtained in its parent node, and so on recursively (see [8]).

Credal decision trees can be somewhat smaller than the ones produced with similar procedures replacing the IIG criterion with the classic ones (see [7]). This normally produces a reduction of the overfitting of the model (see [6]), which could advise against its use in bagging or boosting schemes (see [19]).

³ The Info-Gain criterion is actually a particular case of the IIG criterion, using the parameter s = 0.

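The recursive procedure for building a credal tree (the algorithm of Fig. 1, whose listing is not reproduced in this extraction) can be sketched roughly as follows. The split-score function is passed in as a parameter and stands in for the IIG criterion; all names here are illustrative, not the paper's:

```python
from collections import Counter

def build_credal_tree(data, features, score):
    """Sketch of the credal-tree procedure (cf. Fig. 1).
    data:     list of (feature_dict, class_label) pairs (the partition D)
    features: list L of feature labels not yet used on this path
    score:    stand-in for the IIG split criterion, score(data, f) -> float
    A leaf holds the most probable class of its partition; Exit situations
    are: no features left, or no feature with a positive score."""
    majority = Counter(c for _, c in data).most_common(1)[0][0]
    if not features:
        return majority                      # Exit: step 1 of the algorithm
    best = max(features, key=lambda f: score(data, f))
    if score(data, best) <= 0:
        return majority                      # Exit: step 4, uncertainty not reduced
    rest = [f for f in features if f != best]
    return (best, {v: build_credal_tree([(x, c) for x, c in data if x[best] == v],
                                        rest, score)
                   for v in {x[best] for x, _ in data}})

def classify(tree, x, default=None):
    """Follow the path from the root node to a leaf using the sample values."""
    while isinstance(tree, tuple):
        feature, children = tree
        if x.get(feature) not in children:
            return default                   # branch value unseen in training
        tree = children[x[feature]]
    return tree
```

The fallback to a parent's majority class for unclassified instances, mentioned in the text, is simplified here to a `default` argument.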

Fig. 1. Procedure to build a credal decision tree.

Credal trees and the IIG criterion have been successfully used in other procedures and tools in data mining, as a part of a procedure for selecting variables [20] and on datasets with classification noise [7]. An extended version of the IIG criterion has been used to define a semi-naive Bayes classifier [21].

2.2. Ensemble methods using credal decision trees

The general idea normally used in the procedures for combining decision trees is based on the generation of a set of different decision trees combined with a majority vote criterion. When a new unclassified instance arises, each single decision tree makes a prediction and the instance is assigned to the class value with the highest number of votes. In an ensemble scheme, if all decision trees are quite similar, ensemble performance will not be much better than that of a single decision tree. However, if the ensemble comprises a broad set of different decision trees exhibiting good individual performance, the ensemble will become more robust, with a better prediction capacity. There are many different approaches to this problem, but Bagging [2], Random Forests [3], and AdaBoost [1] stand out as the best known and most competitive ones.

The schemes used here with credal trees can be briefly described as follows:

(i) The Bootstrap Aggregating ensemble method, known as Breiman's Bagging [2], is an intuitive and simple method that performs very well. Diversity in Bagging is obtained by using bootstrap replicates of the original training dataset: different training datasets are randomly drawn with replacement. Subsequently, a single decision tree is built from each replicate of the training dataset using the standard approach [13]. Thus, each tree can be defined by a different set of variables, nodes and leaves. Finally, their predictions are combined by a majority vote.

(ii) Boosting is an ensemble method based on the work of Kearns [22]. This procedure can be described in the following steps: (i) applying a weak classifier (such as a decision tree) to the learning data, where each observation is assigned an initially equal weight; (ii) applying weights to the observations in the learning sample that are inversely proportional to the accuracy of the computed predicted classifications; (iii) going to step (i) M times (M previously fixed); and (iv) combining the predictions from the individual models (weighted by the accuracy of the models). In the Boosting procedure, the weights are increased for misclassified samples and decreased for correctly classified samples. Thus, the principal idea of this method is that it uses a sequence of successive classifiers where each one depends upon its predecessors, and that each classifier considers the error of the previous classifier in order to decide what to focus on during the next iteration over the data. Boosting assumes the training dataset to be perfect, so this procedure does not perform as well as other strategies with noisy training datasets. A special type of Boosting is the AdaBoost algorithm [1], which has demonstrated excellent experimental performance on benchmark datasets and real applications. AdaBoost is an adaptive algorithm in the sense that classifiers built subsequently are tweaked in favor of instances misclassified by previous classifiers. It is a feature selector with a principled strategy: minimization of an upper bound on the empirical error.

(iii) The IIG criterion is used in two important ways in the recent IRN-Ensemble procedure of Abellán and Masegosa [8]. As a split criterion, a single credal tree needs to be built and used to obtain the set of features that will make up the root node of each tree used in the ensemble scheme. As stated in the preceding sections, this criterion can use certain features to provide negative values of information for the class variable. Therefore, all the features can be considered as potential root nodes or, on the contrary, none of them can be considered as root nodes. The method is not based upon a final voting procedure, and can be expressed as follows:

- Use the IIG rank to obtain the set of features {Z_1, ..., Z_m} where IIG(C | Z_j) > 0, for all j ∈ {1, ..., m}.
- Build m credal trees {T_1, ..., T_m}. Each T_j is built using Z_j as the root node; the rest of the nodes for this tree are selected following the procedure described in the previous section, using the IIG criterion.
- In order to classify a new observation, apply the new case to each of the m credal trees and consider the frequency (number of cases) of each state of the class variable in each of the leaf nodes. For a tree T_j, a new observation is associated with a leaf node, and this also has an associated partition of the dataset. Then n^j_{c_i} is the frequency (number of cases) with which c_i appears in the partition generated by the leaf node of T_j. Calculate the following relative frequencies (probabilities):

      q_i = ( Σ_j n^j_{c_i} ) / ( Σ_j Σ_{c_i} n^j_{c_i} ),   for each c_i.

- For a new observation, the value obtained by the IRN-Ensemble classification method is c_h, where c_h = arg max_i {q_i}.

As can be inferred from the above explanation, the IRN method has an important characteristic with respect to the rest: it does not fix the number of trees (or iterations) to be used. It depends on the number of informative features with respect to the class variable (see [8]). This implies that it is possible to use a number of trees equal to the number of features in the dataset, or even zero.⁴ As a very low number of trees can be used, the authors of the IRN method do not consider a final voting procedure appropriate to classify a new observation.

3. Experimentation

Our aim is to study the performance of the methods: the original IRN method (noted as IRN); the IRN method replacing its final

⁴ In that situation, the classification procedure always returns the case of the class variable with the greatest frequency in the dataset.


procedure with a vote procedure (noted as IRNV); Bagging CTs (noted as BAGCT); Bagging decision trees built with Quinlan's classic Info-Gain split criterion [10] (noted as BAGIG); Boosting CTs (noted as BOOCT); Boosting decision trees built with the Info-Gain split criterion (noted as BOOIG); and the Random Forest method (noted as RF).

In order to check the above procedures, we used a broad and diverse set of 25 known datasets, obtained from the UCI repository of machine learning datasets, which can be downloaded directly from http://archive.ics.uci.edu/ml/. We took datasets that are different with respect to the number of cases of the variable to be classified, dataset size, feature types (discrete or continuous) and number of cases of the features. A brief description of the datasets can be found in Table 1.

Table 1
Dataset description. Column N is the number of instances in the dataset, column Feat is the number of features or attribute variables, column Num is the number of numerical variables, column Nom is the number of nominal variables, column k is the number of cases or states of the class variable (always a nominal variable), and column Range is the range of states of the nominal variables of each dataset.

Dataset          N     Feat  Num  Nom  k   Range
Anneal           898   38    6    32   6   2–10
Audiology        226   69    0    69   24  2–6
Autos            205   25    15   10   7   2–22
Breast-cancer    286   9     0    9    2   2–13
Colic            368   22    7    15   2   2–6
cmc              1473  9     2    7    2   2–4
Credit-german    1000  20    7    13   2   2–11
Diabetes-pima    768   8     8    0    2   –
Glass-2          163   9     9    0    2   –
Hepatitis        155   19    4    15   2   2
Hypothyroid      3772  29    7    22   4   2–4
Ionosphere       351   35    35   0    2   –
kr-vs-kp         3196  36    0    36   2   2–3
Labor            57    16    8    8    2   2–3
Lymph            146   18    3    15   4   2–8
Mushroom         8123  22    0    22   2   2–12
Segment          2310  19    19   0    7   –
Sick             3772  29    7    22   2   2
Solar-flare1     323   12    0    12   2   2–6
Sonar            208   60    60   0    2   –
Soybean          683   35    0    35   19  2–7
Sponge           76    44    0    44   3   2–9
Vote             435   16    0    16   2   2
Vowel            990   12    10   2    11  2
Zoo              101   16    1    15   7   2

For our experimentation, we used Weka software [23] on Java 1.5, and we added the methods required to build decision trees using the IIG split criterion. The ensembles of credal trees used here, IRN-Ensembles (IRN and IRNV), Bagging credal trees (BAGCT) and AdaBoost credal trees (BOOCT), were implemented using data structures of Weka.⁵ The parameter of the IDM was set to s = 1, i.e. the value used in the original method by Abellán and Moral [4], because the procedure to obtain the maximum entropy value (see [15]) reaches its lowest computational cost for this value. Bagging and Boosting ensembles were built with 100 decision trees.⁶ Although the number of trees can strongly affect ensemble performance, this is a reasonable number of trees for the low-to-medium size of the datasets used in this study, and moreover it was used in related research, such as Freund and Schapire [1]. We used the Random Forest method (RF) that can be found in the Weka software as a benchmark to compare the results. We used it with its default configuration, except that the number of trees used for this method was 100, the same as for the above-mentioned ensemble methods.

We used the following approaches to deal with missing data and continuous variables in the different implementations of the methods:

(A) For each feature, missing values within datasets were replaced with mean (for continuous features) and mode (for discrete features) values, using Weka's own filters. In the same way, continuous features were discretized using the known Fayyad and Irani discretization method [24]. The approach was applied by means of the training dataset and then transferred to the test dataset, for each pair of training/test datasets.

(B) Assume that missing values are randomly distributed. In order to compute the scores, the instances are split into units. The initial weight of an instance is equal to the unit, but when it drops down a branch, it receives a weight equal to the proportion of instances belonging to this branch (weights sum to 1). For continuous variables, only binary split attributes are considered; each possible split point is evaluated, and the one which induces a partition of the samples with the highest split score, using the IIG criterion, is ultimately selected.⁷

It should be noted that both approaches were applied using the training dataset, and the result was then translated to the test dataset. For each dataset, we repeated a 10-fold cross-validation procedure 10 times.⁸ Alternative methods of processing missing data have been presented in Pelckmans et al. [25].

Tables 2 and 3 show the accuracy of the methods with approaches (A) and (B), respectively.

Following the recommendations of Demšar [26], we used a series of tests to compare the methods.⁹ Also, we have considered it equally interesting to take into account the work by García and Herrera [27], where the work of Demšar [26] is completed. We used the following tests:

To compare two classifiers on multiple datasets:
- Wilcoxon's test (Wilcoxon [28]): a non-parametric test which ranks the differences in performance of two classifiers for each dataset, ignoring the signs, and compares the ranks for the positive and the negative differences.

To compare multiple classifiers on multiple datasets:
- Friedman's test (Friedman [11,12]): a non-parametric test that ranks the algorithms separately for each dataset, the best performing algorithm being assigned the rank of 1, the second best rank 2, etc. The null hypothesis is that all the algorithms are equivalent. If the null hypothesis is rejected, we can compare all the algorithms to each other using Holm's test (Holm [29]).

We have considered the above set of tests following a trade-off on the recommendations by Demšar [26] and by García and Herrera [27]. As suggested in those works, the Friedman test may report a significant difference but the post hoc test fails to detect it. This is due to the lower power of the post hoc test conducted. Nemenyi's test [30] is recommended by Demšar, but in some situations can be

⁵ In the event of ties, the ensemble procedures implemented in Weka (such as the ones used in this paper) select the first case of the class variable they find, considering their order in the dataset file.
⁶ The IRN-Ensemble method does not have a fixed number of trees to be used. It uses a number of trees less than or equal to the number of features of each dataset; it is always less than 70 decision trees in this experimental study.
⁷ This approach is the one presented for the C4.5 method in Weka, adapted for credal trees using the IIG criterion.
⁸ The dataset is separated into 10 subsets, and each one is used as a test dataset. The set obtained by joining the other 9 subsets is used as the training dataset. Thus, we have 10 training datasets and 10 test datasets. This procedure is repeated 10 times with prior random reordering. Finally, it produces 100 training datasets and 100 test datasets. The percentage of correct classifications for each dataset, given in the tables, is the average of these 100 trials.
⁹ All the tests were carried out using Keel software, available at http://www.keel.es.
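The evaluation protocol of footnote 8, ten repetitions of 10-fold cross-validation giving 100 train/test splits per dataset, can be sketched as follows (an illustrative implementation, not the Weka/Keel code used in the paper; names are our own):

```python
import random

def repeated_cv_indices(n, folds=10, repeats=10, seed=0):
    """10x10-fold cross-validation as described in the text: reorder the n
    instances at random, split them into 10 folds, use each fold once as the
    test set and the remaining 9 folds as the training set; repeat 10 times,
    yielding 100 train/test index splits whose accuracies are then averaged."""
    rng = random.Random(seed)
    idx = list(range(n))
    for _ in range(repeats):
        rng.shuffle(idx)                      # prior random reordering
        for f in range(folds):
            test = idx[f::folds]
            held = set(test)
            train = [i for i in idx if i not in held]
            yield train, test
```

Averaging the 100 resulting accuracies per dataset gives the percentages reported in Tables 2 and 3.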


Table 2
Percentage of correct classification of the methods with the (A) approach.

Dataset          IRN     IRNV    BAGCT   BAGIG   BOOCT   BOOIG   RF
Anneal           99.67   99.64   99.71   99.60   99.72   99.68   99.41
Audiology        82.08   80.40   83.12   80.49   81.73   75.05   80.01
Autos            82.99   81.97   82.83   81.40   83.02   80.52   84.24
Breast-cancer    72.42   72.59   71.57   67.36   67.21   67.46   69.98
cmc              48.99   48.76   49.15   46.95   47.70   46.71   47.79
Horse-colic      83.94   84.42   83.88   82.47   80.22   80.03   83.20
German-credit    71.28   71.40   71.63   70.25   70.59   70.82   73.82
Pima-diabetes    74.38   74.35   73.88   73.22   73.23   73.10   73.36
Glass2           78.36   79.46   78.30   78.30   77.46   78.24   79.15
Hepatitis        81.53   80.99   80.75   78.78   79.10   79.61   82.05
Hypothyroid      99.36   99.36   99.36   99.34   99.40   99.31   99.21
Ionosphere       92.22   91.97   92.14   91.34   91.77   91.20   92.08
kr-vs-kp         99.52   99.50   99.48   99.59   99.64   99.59   99.28
Labor            87.97   87.80   86.23   87.00   88.40   89.17   88.97
Lymphography     79.74   79.95   76.93   76.78   84.14   79.08   82.74
Mushroom        100.00  100.00  100.00  100.00   99.99  100.00  100.00
Segment          96.43   96.13   95.10   95.49   95.94   95.10   96.63
Sick             97.87   97.80   97.89   97.63   97.62   97.50   97.71
Solar-flare      96.97   97.04   97.22   96.11   95.74   95.95   96.91
Sonar            75.69   75.65   76.62   76.81   76.06   76.45   76.79
Soybean          93.56   92.93   92.87   89.72   92.28   89.13   93.89
Sponge           95.00   95.00   94.75   94.25   95.00   93.80   95.00
Vote             95.67   95.79   95.90   94.90   95.35   95.28   96.36
Vowel            77.15   74.80   76.48   79.03   79.02   78.72   80.86
Zoo              95.92   95.92   96.12   95.92   95.43   95.92   96.32
Average          86.35   86.14   86.08   85.31   85.83   85.10   86.63

a less sensitive test than others, as it is described by Garca and the same results of the tests on the methods when the approach (B)

Herrera. With Nemenyis test we can encounter situations where is used.11

the differences expressed by the Friedman test were not detected.

Hence, as the latter authors recommended, we considered con- 3.1. Comments on the results

ducting a post hoc Holms test.

The study of the p-values is very interesting because, citing a From Tables 2 and 3 we can obtain the differences between the

paragraph by Garca and Herrera [27]: A p-value provides infor- results of the methods using approaches (A) and (B). Some differ-

mation about whether a statistical hypothesis test is signicant ences can be seen in favor of the results obtained from approach

or not, and it also indicates something about how signicant the re- (B) compared to the results from approach (A). This is particularly

sult is: The smaller p-value, the stronger the evidence against the evident for some small datasets such as glass2, hepatitis and sonar,

null hypothesis. Most important, it does this without committing and for medium ones such as vowel. These differences are more

to a particular level of signicance. noticeable in methods BOOCT and RF.

When a p-value comes from a multiple comparison, it does not We conducted Wilcoxons tests on each separate model with

take into account the other comparisons in the multiple study. To each approach, i.e. each method using approach (A) against the

solve this problem, we can use a study of the Adjusted P-values (APVs) [31]. Using the APVs in a statistical analysis gives us more information because it is not focused on a fixed level of significance. The work by García and Herrera [27] provides an explanation of how to obtain these values. In some situations, such as this one, certain methods are very similar in performance; therefore, to make a clear distinction between them, it is very useful to study the APVs of the tests conducted in the experiments.

Table 4 gives the results of the Wilcoxon's test carried out on each method separately in order to compare its performance under approaches (A) and (B). The table shows the approach with which each method performs better according to the Wilcoxon's test.^10

Tables 5 and 6 show Friedman's ranks of the methods with approaches (A) and (B), respectively (the null hypothesis is rejected in both cases).

Table 7 shows the p-values of the tests carried out on the methods when approach (A) is used. The column p-valueH corresponds to the p-values of the Holm's tests on the methods in each row. The column APVH corresponds to the adjusted p-values of each comparison obtained from the Holm's tests. Table 8 shows the corresponding results when approach (B) is used.^11

To compare the two approaches, each method using approach (A) was compared with the same method using approach (B). The results are given in Table 4. In these tests, we obtained that the IRN method with the (A) approach exhibits no significant differences from the same method with the (B) approach. The same situation also arises for the IRNV, BAGCT, BAGIG and BOOIG methods. However, it is important to remark that the tests performed for the BOOCT and RF methods express that there are significant positive differences in accuracy when the (B) approach is used.

The Friedman test carried out for each set of methods with each approach rejects the null hypothesis. Therefore, Holm's tests were conducted. The Friedman's ranks of the methods using approach (A), shown in Table 5, indicate that the best methods are IRN and RF, in this order, whereas the worst methods are those in which the Info-Gain criterion was used for the decision trees: BOOIG and BAGIG. The situation is different for the best methods when approach (B) is used (Table 6); the best methods here are BOOCT and RF, in this order, but the situation is similar for the worst methods: again, the BOOIG and BAGIG methods obtain the worst results.

Focusing on the results of the Holm's tests, Tables 7 and 8 express the p-values of each comparison between two methods with

^10 It is the same table of results obtained for a weaker level of significance of 0.1 and for a stronger one of 0.01.
^11 In both tables, for a weaker level of significance α = 0.1, Holm's procedure rejects those hypotheses that have a p-valueH ≤ 0.003846. With the Nemenyi's test the results are very similar to the ones obtained with the Holm's test.
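The APV idea discussed above can be illustrated with a short sketch. This code is not from the paper: the function name and the raw p-values are invented for illustration. It applies Holm's step-down procedure to a list of raw pairwise p-values; a comparison is rejected at level α exactly when its APV is at or below α, so no fixed significance level has to be chosen in advance.

```python
# Illustrative sketch (not the paper's code): Holm's step-down procedure and
# the corresponding adjusted p-values (APVs) for a set of pairwise comparisons.
# The raw p-values below are hypothetical, for demonstration only.

def holm_adjusted_pvalues(pvalues):
    """Return Holm-adjusted p-values (APVs) in the original order."""
    m = len(pvalues)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Step-down multiplier (m - rank) for the rank-th smallest p-value.
        candidate = (m - rank) * pvalues[idx]
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = min(1.0, running_max)
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.27]   # hypothetical raw p-values
apvs = holm_adjusted_pvalues(raw)
# Reject a comparison at level alpha iff its APV <= alpha.
print([round(p, 3) for p in apvs])
```

Reporting the APVs themselves, rather than accept/reject decisions at one fixed α, is what makes the comparison between closely matched methods informative.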

428 J. Abellán / Information Fusion 14 (2013) 423–430

Table 3
Percentage of correct classification of the methods with the (B) approach.

Dataset          IRN     IRNV    BAGCT   BAGIG   BOOCT   BOOIG   RF
Anneal           99.10   99.09   98.89   98.86   99.78   99.31   99.68
Audiology        78.41   75.49   80.41   77.67   81.91   72.15   80.32
Autos            82.37   79.98   80.32   67.77   82.99   61.04   84.19
Breast-cancer    71.31   70.57   70.35   69.84   68.58   67.00   70.12
cmc              53.03   52.69   53.23   54.64   50.46   52.46   50.50
Horse-colic      85.83   85.48   84.91   84.75   81.27   81.92   85.62
German-credit    74.27   74.81   74.65   74.46   74.95   72.06   76.10
Pima-diabetes    74.45   74.48   75.80   76.05   73.97   74.43   75.97
Glass2           81.75   80.21   83.00   81.05   88.05   82.41   87.90
Hepatitis        81.66   81.76   80.99   81.43   85.20   79.97   83.58
Hypothyroid      99.56   99.52   99.59   99.55   99.66   99.49   99.52
Ionosphere       91.91   91.77   91.23   91.37   93.79   90.89   93.45
kr-vs-kp         99.42   99.39   99.40   99.06   99.62   99.31   99.28
Labor            87.40   88.23   86.53   86.33   88.73   82.60   87.27
Lymphography     77.55   77.48   76.24   77.44   83.57   77.53   83.15
Mushroom        100.00  100.00  100.00  100.00  100.00   99.98  100.00
Segment          97.56   97.55   97.45   96.83   98.67   96.52   98.16
Sick             98.98   98.98   98.97   98.76   99.00   98.69   98.42
Solar-flare      97.10   97.10   97.31   97.84   95.86   97.44   96.91
Sonar            82.99   83.14   80.78   78.00   86.81   78.96   84.72
Soybean          90.91   89.82   90.47   87.60   91.86   89.12   93.37
Sponge           92.50   92.50   92.63   92.50   93.71   91.46   95.00
Vote             96.00   95.77   96.34   95.54   95.47   94.94   96.43
Vowel            91.38   91.26   91.13   88.08   95.80   81.70   96.68
Zoo              92.01   92.01   92.40   92.51   95.65   92.10   96.33
Average          87.10   86.76   86.92   85.92   88.21   84.54   88.51
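The per-method comparison of approaches (A) and (B) reported in Table 4 rests on Wilcoxon's signed-rank test, which pairs the two accuracies obtained on each dataset. The following sketch is not the paper's code and the accuracy lists are hypothetical; it computes the signed-rank statistics from such paired results, leaving the p-value step (e.g. via scipy.stats.wilcoxon) aside.

```python
# Illustrative sketch (not the paper's code): the Wilcoxon signed-rank
# statistics used to compare one method's accuracies under approach (A)
# vs. approach (B), paired per dataset. The accuracy lists are hypothetical.

def wilcoxon_statistic(a, b):
    """Return (W_plus, W_minus): summed ranks of positive/negative differences.
    Zero differences are discarded; ties in |difference| get average ranks."""
    diffs = [y - x for x, y in zip(a, b) if y != x]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        # extend the run of tied absolute differences
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus

acc_a = [85.2, 90.1, 78.4, 92.0, 88.5]   # hypothetical accuracies, approach (A)
acc_b = [86.0, 90.1, 79.9, 92.5, 88.1]   # hypothetical accuracies, approach (B)
print(wilcoxon_statistic(acc_a, acc_b))
```

A markedly larger W_plus than W_minus favours the second approach; the test's p-value then decides whether the difference is significant.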

Table 4
Results of the Wilcoxon's test conducted with a level of significance of 0.05, performed on each method separately using approaches (A) and (B). The table shows the approach with which the method performs better according to the Wilcoxon's test; = indicates non-significant statistical differences.

         IRN   IRNV   BAGCT   BAGIG   BOOCT   BOOIG   RF
(A)/(B)   =     =      =       =      (B)      =     (B)

Table 5
Friedman's ranks of the algorithms with approach (A).

Method   Rank
IRN      3.02
RF       3.04
BAGCT    3.42
IRNV     3.5
BOOCT    4.42
BAGIG    5.16
BOOIG    5.44

Table 6
Friedman's ranks of the algorithms with approach (B).

Method   Rank
BOOCT    2.7
RF       2.8
IRN      3.52
BAGCT    3.94
IRNV     4.18
BAGIG    4.9
BOOIG    5.96

Table 7
P-values table for approach (A). For α = 0.05, Holm's procedure rejects those hypotheses that have a p-valueH ≤ 0.003333. The APVH column corresponds to the adjusted p-values obtained from the Holm's tests.

i    Methods            p-valueH   APVH
1    IRN vs. BOOIG      0.002381   0.00157
2    BOOIG vs. RF       0.0025     0.001714
3    IRN vs. BAGIG      0.002632   0.008761
4    BAGIG vs. RF       0.002778   0.00938
5    BAGCT vs. BOOIG    0.002941   0.016088
6    IRNV vs. BOOIG     0.003125   0.023968
7    BAGCT vs. BAGIG    0.003333   0.066046
8    IRNV vs. BAGIG     0.003571   0.092279
9    IRN vs. BOOCT      0.003846   0.285308
10   BOOCT vs. RF       0.004167   0.286933
11   BOOCT vs. BOOIG    0.004545   1.045492
12   BAGCT vs. BOOCT    0.005      1.045492
13   IRNV vs. BOOCT     0.005556   1.18929
14   BAGIG vs. BOOCT    0.00625    1.806828
15   IRN vs. IRNV       0.007143   3.024777
16   IRNV vs. RF        0.008333   3.024777
17   IRN vs. BAGCT      0.01       3.024777
18   BAGCT vs. RF       0.0125     3.024777
19   BAGIG vs. BOOIG    0.016667   3.024777
20   IRNV vs. BAGCT     0.025      3.024777
21   IRN vs. RF         0.05       3.024777

each approach. Table 7 shows that with approach (A) the best methods are IRN and RF. They obtain very similar results, i.e. a similar number of wins in the tests carried out.^12 The IRNV and BAGCT methods are also winners when they are compared with the worst method: BOOIG. Table 8 shows that with approach (B) the best methods are BOOCT and RF.^13 The IRN and IRNV methods are also winners when they are compared with the worst method: BOOIG.

It is not easy to compare the best methods for each approach with the results obtained in this work, but as García and Herrera

^12 IRN has one more win in the Holm's test than RF if a weaker level of significance of 0.1 is used.
^13 We must remark that BOOCT has one more win in the Holm's test than RF if a weaker level of significance of 0.1 is used.
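The ranks in Tables 5 and 6 are Friedman average ranks: on each dataset the methods are ranked by accuracy (rank 1 = best, ties averaged), and a method's score is its mean rank over all datasets. A minimal sketch, not from the paper and using a hypothetical accuracy matrix, is:

```python
# Illustrative sketch (not the paper's code): Friedman's average ranks, as in
# Tables 5 and 6. Each dataset ranks the methods; a method's Friedman rank is
# its mean rank over the datasets. The accuracy matrix below is hypothetical.

def friedman_ranks(scores_per_dataset, higher_is_better=True):
    """scores_per_dataset: one row per dataset, one score per method.
    Returns each method's average rank (ties receive average ranks)."""
    n_methods = len(scores_per_dataset[0])
    totals = [0.0] * n_methods
    for row in scores_per_dataset:
        # best method first; Python's sort is stable, so ties keep input order
        order = sorted(range(n_methods), key=lambda j: row[j],
                       reverse=higher_is_better)
        i = 0
        while i < n_methods:
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
            for k in range(i, j + 1):
                totals[order[k]] += avg
            i = j + 1
    return [t / len(scores_per_dataset) for t in totals]

# three hypothetical datasets, three methods
scores = [[90.0, 88.0, 91.0],
          [85.0, 85.0, 80.0],
          [70.0, 75.0, 72.0]]
print(friedman_ranks(scores))
```

The Friedman test then asks whether these average ranks differ more than chance would allow; only when its null hypothesis is rejected are post hoc procedures such as Holm's applied.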


recommend for this type of situation, we can use the study of the APVs to analyse the differences between the best methods in this study in greater depth. Table 7 shows that for approach (A) there exist very similar results in the strength of the rejection of the null hypothesis (values of the APVs) for the comparisons IRN vs. M and RF vs. M, with M being any other method. However, in Table 8, about the tests on the methods with approach (B), we can observe that this situation is more clearly in favor of the BOOCT method when it is compared with the RF method. For example, the APV of the comparison RF vs. BOOIG is 2.5 times greater than that of the comparison BOOCT vs. BOOIG; and the APV of the comparison RF vs. BAGIG is 1.75 times that of the comparison BOOCT vs. BAGIG.

We can use a similar analysis to check the differences between the worst methods for each approach. We have the same situation with both approaches: comparisons BOOIG vs. M always give an APV that is notably lower^14 than the one for BAGIG vs. M, with M being any other method. Thus, we can say that clearly the worst method in our study, with both approaches, is BOOIG.

Using the analysis reported in this paper, we can summarize the following: the methods using approach (A) are worse than or equal to the equivalent methods using approach (B); the RF method obtains the best average value with both approaches, but it cannot be considered the best one for either approach; with approach (B), the RF and BOOCT methods can be considered the best methods, with BOOCT being better than the RF in the comparisons with the other methods.

^14 In this case, lower signifies stronger evidence in favor of the M method.

Table 8
P-values table for approach (B). For α = 0.05, Holm's procedure rejects those hypotheses that have a p-valueH ≤ 0.003333. The APVH column corresponds to the adjusted p-values obtained from the Holm's tests.

i    Methods            p-valueH   APVH
1    BOOCT vs. BOOIG    0.002381   0.000002
2    BOOIG vs. RF       0.0025     0.000005
3    IRN vs. BOOIG      0.002632   0.001238
4    BAGIG vs. BOOCT    0.002778   0.005715
5    BAGIG vs. RF       0.002941   0.010002
6    BAGCT vs. BOOIG    0.003125   0.015142
7    IRNV vs. BOOIG     0.003333   0.05366
8    IRNV vs. BOOCT     0.003571   0.215965
9    IRNV vs. RF        0.003846   0.310844
10   IRN vs. BAGIG      0.004167   0.310844
11   BAGCT vs. BOOCT    0.004545   0.466564
12   BAGCT vs. RF       0.005      0.620745
13   BAGIG vs. BOOIG    0.005556   0.744935
14   BAGCT vs. BAGIG    0.00625    0.929148
15   IRN vs. BOOCT      0.007143   1.257081
16   IRN vs. RF         0.008333   1.431879
17   IRNV vs. BAGIG     0.01       1.431879
18   IRN vs. IRNV       0.0125     1.431879
19   IRN vs. BAGCT      0.016667   1.431879
20   IRNV vs. BAGCT     0.025      1.431879
21   BOOCT vs. RF       0.05       1.431879

4. Conclusions

In this paper, we analyzed different strategies for combining credal decision trees, i.e. decision trees built using imprecise probabilities (via the IDM) and uncertainty measures (via the maximum entropy method). We used previously created Bagging and Boosting schemes and a new one specifically developed for credal decision trees. We experimentally showed that different approaches to dealing with missing data and continuous variables can significantly vary the results of some methods for creating ensembles of credal decision trees. We can conclude that the Boosting scheme with the appropriate approach is the best way to combine credal decision trees. Using the best approach for each method (the (B) approach), we used a set of tests carried out on the results obtained from the seven methods analyzed here to show that Boosting credal trees and the Random Forest methods are the best ones, with Boosting credal trees being better than the Random Forest method in comparison with any other method.

Acknowledgments

This work has been supported by the Spanish Consejería de Economía, Innovación y Ciencia de la Junta de Andalucía, under Project TIC-06016.

I'm very grateful to the anonymous reviewers of this paper for their valuable suggestions and comments to improve its quality.

References

[1] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp. 148–156.
[2] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.
[3] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[4] J. Abellán, S. Moral, Building classification trees using the total uncertainty criterion, International Journal of Intelligent Systems 18 (12) (2003) 1215–1225.
[5] G.J. Klir, Uncertainty and Information: Foundations of Generalized Information Theory, John Wiley, Hoboken, NJ, 2006.
[6] J. Abellán, S. Moral, Upper entropy of credal sets, applications to credal classification, International Journal of Approximate Reasoning 39 (2–3) (2005) 235–255.
[7] J. Abellán, A. Masegosa, An experimental study about simple decision trees for bagging ensemble on data sets with classification noise, in: C. Sossai, G. Chemello (Eds.), ECSQARU, LNCS, vol. 5590, Springer, 2009, pp. 446–456.
[8] J. Abellán, A. Masegosa, An ensemble method using credal decision trees, European Journal of Operational Research 205 (1) (2010) 218–226.
[9] M. Zaffalon, The naive credal classifier, Journal of Statistical Planning and Inference 105 (2002) 5–21.
[10] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.
[11] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32 (1937) 675–701.
[12] M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Annals of Mathematical Statistics 11 (1940) 86–92.
[13] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth Statistics, Probability Series, Belmont, 1984.
[14] P. Walley, Inferences from multinomial data: learning about a bag of marbles, Journal of the Royal Statistical Society B 58 (1996) 3–57.
[15] J. Abellán, Uncertainty measures on probability intervals from the Imprecise Dirichlet model, International Journal of General Systems 35 (5) (2006) 509–528.
[16] J. Abellán, G.J. Klir, S. Moral, Disaggregated total uncertainty measure for credal sets, International Journal of General Systems 35 (1) (2006) 29–44.
[17] J. Abellán, A. Masegosa, Requirements for total uncertainty measures in Dempster–Shafer theory of evidence, International Journal of General Systems 37 (6) (2008) 733–747.
[18] E.T. Jaynes, On the rationale of maximum-entropy methods, Proceedings of the IEEE 70 (9) (1982) 939–952.
[19] F. Provost, P. Domingos, Tree induction for probability-based ranking, Machine Learning 52 (3) (2003) 199–215.
[20] J. Abellán, A. Masegosa, A filter-wrapper method to select variables for the naive Bayes classifier based on credal decision trees, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 17 (6) (2009) 833–854.
[21] J. Abellán, Application of uncertainty measures on credal sets on the naive Bayes classifier, International Journal of General Systems 35 (6) (2006) 675–686.
[22] M. Kearns, Thoughts on hypothesis boosting, Unpublished manuscript, 1988.
[23] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed., Morgan Kaufmann, San Francisco, 2005.
[24] U.M. Fayyad, K.B. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Proceedings of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo, 1993, pp. 1022–1027.
[25] K. Pelckmans, J. De Brabanter, J.A.K. Suykens, B. De Moor, Handling missing values in support vector machine classifiers, Neural Networks 18 (5–6) (2005) 684–692.
[26] J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[27] S. García, F. Herrera, An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
[28] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics 1 (1945) 80–83.
[29] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6 (1979) 65–70.
[30] P.B. Nemenyi, Distribution-free multiple comparisons, PhD thesis, Princeton University, 1963.
[31] P.H. Westfall, S.S. Young, Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment, John Wiley and Sons, 2004.
