

Analysis and extension of decision trees based on imprecise probabilities: Application on noisy data

Carlos J. Mantas, Joaquín Abellán *
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
* Corresponding author. Tel.: +34 958 242376. E-mail addresses: cmantas@decsai.ugr.es (C.J. Mantas), jabellan@decsai.ugr.es (J. Abellán).

Keywords: Imprecise probabilities; Imprecise Dirichlet model; Uncertainty measures; Credal Decision Trees; Noisy data

Abstract: An analysis of a procedure to build decision trees based on imprecise probabilities and uncertainty measures, called CDT, is presented. We compare this procedure with the classic ones based on Shannon's entropy for precise probabilities. We found that the handling of the imprecision is a key part of obtaining improvements in the method's performance, as has been shown for class noise problems in classification. We present a new procedure for building decision trees that extends the imprecision in the CDT procedure to the processing of all the input variables. We show, via an experimental study on data sets with general noise (noise in all the input variables), that this new procedure builds smaller trees and gives better results than the original CDT and the classic decision trees.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

In the area of machine learning, supervised classification learning can be considered an important tool for decision support. Classification can be defined as a machine learning technique used to predict group membership for data instances. It can be applied to decision support in medicine, character recognition, astronomy, banking and other fields. A classifier may be represented using a Bayesian network, a neural network, a decision tree, etc.

A Decision Tree (DT) is a very useful tool for classification. Its structure is simple and easy to interpret. Moreover, the time normally required to build a classification model based on a DT is low.

The ID3 algorithm (Quinlan, 1986) and its extension C4.5 (Quinlan, 1993) for designing decision trees are widely used. These algorithms use the divide-and-conquer technique. The splits are carried out in terms of the input variable values. The variable selection process in each node is based on the probabilities calculated from the training examples. The probabilities are used via an uncertainty measure: Shannon's entropy (Shannon, 1948).

In this manner, the variable selection process for a node is determined by the arrangement of the examples in the descendant nodes. However, this process does not depend on the training set size in each node. Because overspecialization relies on the number of examples in the training set, the trees designed by ID3 and C4.5 have this problem.

On the other hand, several formal theories for the manipulation of imprecise probabilities have recently been developed (Walley, 1996; Wang, 2010; Weichselberger, 2000). The use of imprecise probabilities instead of precise ones implies some advantages:

• The manipulation of total ignorance is coherently solved.
• Indeterminacy and inconsistency are adequately represented.

The different manipulation of two training sets with distinct sizes is possible thanks to the second advantage. The probabilities of the larger training set will be more reliable than those of the smaller set.

By using the theory of imprecise probabilities presented in Walley (1996), Abellán and Moral (2003) have developed an algorithm for designing decision trees. The variable selection process for this algorithm is determined from operations based on imprecise probabilities and uncertainty measures. This method obtains good experimental results, as shown in Abellán and Moral (2005), Abellán and Masegosa (2009), Abellán and Masegosa (2009).

An analysis of this kind of tree, called Credal Decision Trees (CDTs), is presented in this paper. CDTs complement the ID3 and C4.5 algorithms because they use a total uncertainty measure that adds good properties to the Shannon entropy (H). This new measure is the Maximum Entropy of a Probability Convex Set (H*), which has been previously defined and analyzed (Abellán & Moral, 2003; Abellán, Klir, & Moral, 2006; Abellán & Masegosa, 2008).

It has been demonstrated in Abellán et al. (2006) that H* is a disaggregated measure of information that combines two elements:

(a) A randomness measure that indicates the arrangement of the samples of each class in the training set. This measure corresponds to the entropy of the probabilities in the convex set.


(b) A non-specificity measure that shows the uncertainty derived from the training set size. This measure corresponds with the size of the convex set.

Two objectives are achieved by using the total uncertainty measure for the variable selection process. It serves to:

(1) Consider the arrangement of the samples in a node when the variable is selected. This fact is similar to the process followed by the ID3 and C4.5 algorithms.
(2) Take into account the training set size of a node and the available number of examples for each class. This property is different to the process carried out by the ID3 and C4.5 algorithms.

Differences between the CDT methodology and the ID3 and C4.5 algorithms will be analyzed later on in this paper.

On the other hand, recent data mining literature (Khoshgoftaar & Van Hulse, 2009; Nettleton, Orriols-Puig, & Fornells, 2010; Van Hulse, Khoshgoftaar, & Huang, 2007; Zhu & Wu, 2004) is paying more attention to classification problems with attribute noise. As mentioned in Zhu and Wu (2004), class information in real-world data is usually much cleaner than attribute information. A medical data set can be considered as an example. In this case, when new data (for example, a patient's data) are entered, medical staff pay more attention to the diagnosis and to the accuracy of the prediction made.

However, research on handling attribute noise has not made much progress. Research on noise classification has focused on class noise. This is the case for CDTs, where only the probability distribution for the class variable is considered imprecise.

Let us see the following example in order to show the importance of considering attribute noise when a classification model is designed.

Example 1. Let us suppose a medical binary classification problem from patients' data (type_A versus type_B), and the following instance from a data set with possible incorrect values:

(Age = 23, Weight = 100, Class = type_A)

It is possible that the error of this instance is produced by the measurement of the class variable, type_B instead of type_A (class noise). However, another error source can be the measurement of the input attributes. For instance, let us suppose that we know the exact classification rule, and it is:

If Age = 25 and Weight = 100 then Class = type_A
Otherwise Class = type_B

In this case, the error of the instance can be derived from an imprecision on the input variable Age (value 23 instead of 25). Hence, it can be interesting to design classification models where attribute noise is taken into account.

With this motivation, an extension of the CDT procedure is presented in this paper. This new classification model considers that both the probabilities for the class variable and the ones for the attributes (or features) are imprecise.

The new model, called Complete Credal Decision Trees (CCDTs), is coherently defined by using the principle of maximum uncertainty (see Klir (2006)), analyzed and experimentally compared with other models. The conclusions will be that CCDTs are more adequate for general noisy data and build smaller decision trees.

Section 2 briefly describes the necessary previous knowledge on decision trees and split criteria. Section 3 analyses the performance of the Credal Decision Tree (CDT). Section 4 presents an extension of the CDT procedure where the imprecision for manipulating all input variables is considered. In Section 5 we describe the experimentation carried out on a wide range of data sets and comment on the results. Here, we apply different percentages of noise on all the variables in the experiments. Finally, Section 6 presents the conclusions.

2. Previous knowledge

2.1. Decision Trees

Decision Trees (DTs), also known as Classification Trees or hierarchical classifiers, started to play an important role in machine learning with the publication of Quinlan's ID3 (Iterative Dichotomiser 3) (Quinlan, 1986). Subsequently, Quinlan also presented the C4.5 algorithm (Classifier 4.5) (Quinlan, 1993), which is an advanced version of ID3. Since then, C4.5 has been considered a standard model in supervised classification. It has also been widely applied as a data analysis tool to very different fields, such as astronomy, biology, medicine, etc.

Decision trees are models based on a recursive partition method, the aim of which is to divide the data set using a single variable at each level. This variable is selected with a given criterion. Ideally, they define a set of cases in which all the cases belong to the same class.

Their knowledge representation has a simple tree structure. It can be interpreted as a compact set of rules in which each tree node is labelled with an attribute variable that produces branches for each value. The leaf nodes are labelled with a class label.

The process for inferring a decision tree is mainly determined by the following aspects:

(i) The criteria used to select the attribute to insert in a node and branching (split criteria).
(ii) The criteria to stop the tree from branching.
(iii) The method for assigning a class label or a probability distribution at the leaf nodes.
(iv) The post-pruning process used to simplify the tree structure.

Many different approaches for inferring decision trees, which depend upon the aforementioned factors, have been published. Quinlan's ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993) stand out among all of these.

Decision trees are built using a set of data referred to as the training data set. A different set, called the test data set, is used to check the model. When we obtain a new sample or instance of the test data set, we can make a decision or prediction on the state of the class variable by following the path in the tree from the root to a leaf node, using the sample values and the tree structure.

2.1.1. Split criteria

Let us suppose a classification problem. Let C be the class variable, {X1, ..., Xn} the set of features, and X a feature. We can find the following split criteria to build a DT.

2.1.1.1. Info-Gain. This metric was introduced by Quinlan as the basis for his ID3 model (Quinlan, 1986). The model has the following main features: it was defined to obtain decision trees with discrete variables, it does not work with missing values, a pruning process is not carried out and it is based on Shannon's entropy H (Shannon, 1948).

The split criterion of this model is Info-Gain (IG), which is defined as:

IG(C, X) = H(C) − Σ_i P(X = x_i) · H(C | X = x_i).

2.1.1.2. Info-Gain Ratio. In order to improve the ID3 model, Quinlan introduces the C4.5 model (Quinlan, 1993), where the Info-Gain split criterion (the split criterion of ID3) is replaced by an Info-Gain Ratio criterion that penalizes variables with many states. The C4.5 model is defined to work with continuous variables and missing data. It has a complex subsequent pruning that is introduced to improve the results and obtain less complex structures.

The split criterion for this model is Info-Gain Ratio (IGR), which is defined as

IGR(C, X) = IG(C, X) / H(X).
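To make the two classic criteria concrete, the following minimal Python sketch computes IG and IGR from lists of discrete values. It is illustrative only (the paper's experiments used Weka), and all function and variable names are assumptions of this sketch.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def info_gain(class_col, feature_col):
    """IG(C, X) = H(C) - sum_i P(X = x_i) * H(C | X = x_i)."""
    n = len(class_col)
    cond = 0.0
    for x in set(feature_col):
        subset = [c for c, v in zip(class_col, feature_col) if v == x]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(class_col) - cond

def info_gain_ratio(class_col, feature_col):
    """IGR(C, X) = IG(C, X) / H(X); returns None when H(X) = 0 (undefined)."""
    hx = entropy(feature_col)
    return None if hx == 0 else info_gain(class_col, feature_col) / hx

# Minimal usage with toy data
C = ['a', 'a', 'b', 'b', 'b']
X = [0, 0, 1, 1, 1]
print(info_gain(C, X), info_gain_ratio(C, X))
```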

2.2. Credal Decision Trees

The split criterion employed to build Credal Decision Trees (CDTs) (see Abellán & Moral (2003)) is based on imprecise probabilities and the application of uncertainty measures on convex sets of probability distributions (credal sets). The mathematical basis of this theory is described in the next section.

2.2.1. Mathematical foundation

Let there be a variable Z whose values belong to {z1, ..., zk}. Let us suppose a probability distribution p(zj), j = 1, ..., k, defined for each value zj from a data set.

A formal theory of imprecise probability has been presented, called Walley's Imprecise Dirichlet Model (IDM) (Walley, 1996). Probability intervals are extracted from the data set for each value of the variable Z. The IDM estimates that the probabilities for each value zj are within the interval:

p(zj) ∈ [ nzj / (N + s), (nzj + s) / (N + s) ],  j = 1, ..., k,

with nzj the frequency of the set of values (Z = zj) in the data set, N the sample size and s a given hyperparameter that does not depend on the sample space (Representation Invariance Principle, Walley (1996)). The value of the parameter s determines the speed at which the upper and lower probability values converge when the sample size increases. Higher values of s give a more cautious inference. Walley (1996) does not give a definitive recommendation for the value of this parameter, but he suggests two candidates: s = 1 or s = 2.

One important point is that the intervals are wider if the sample size is smaller. Therefore, this method produces more precise intervals as N increases.

This representation originates a specific kind of convex set of probability distributions on the variable Z, K(Z) (Abellán, 2006). The set is defined as

K(Z) = { p | p(zj) ∈ [ nzj / (N + s), (nzj + s) / (N + s) ], j = 1, ..., k }.

Using this theory, uncertainty measures on convex sets of probability distributions can be defined. The entropy of the set will be estimated as the maximum entropy of all the probability distributions that belong to this set. This function, denoted as H*, is defined as:

H*(K(Z)) = max { H(p) | p ∈ K(Z) },

where the function H is Shannon's entropy function (Shannon, 1948). H* is a total uncertainty measure which is well known for this type of set (see Abellán & Masegosa (2008)) and which coherently separates conflict and non-specificity (Abellán et al., 2006).

The procedure for calculating H* has a low computational cost for values s ∈ [1, 2] (see Abellán & Moral (2006)). The specific procedure for the IDM reaches the lowest cost for s = 1 (see Abellán (2006)). Moreover, computing H* is very simple for s = 1 (for this reason, we will use this value for s in the experimentation section). Firstly, the procedure consists in determining the set

A = { zj | nzj = min_i { nzi } }   (1)

then the distribution with maximum entropy is

p*(zi) = nzi / (N + s) if zi ∉ A;  p*(zi) = (nzi + s/l) / (N + s) if zi ∈ A,  i = 1, ..., k,

where l is the number of elements of A.

As the imprecise intervals are wider with smaller sample sizes, we will tend to obtain greater values for H* with smaller sample sizes. This property will be important to differentiate the action of CDTs as opposed to the behavior of other kinds of DTs.

2.2.2. Method for building Credal Decision Trees

The procedure for building credal trees is very similar to the one used in the well-known Quinlan's ID3 algorithm (Quinlan, 1986), replacing its Info-Gain split criterion with the Imprecise Info-Gain (IIG) split criterion. This criterion can be defined as follows: in a classification problem, let C be the class variable, {X1, ..., Xm} the set of features, and X a feature; then

IIG^D(C, X) = H*(K^D(C)) − Σ_i P(X = x_i) · H*(K^D(C | X = x_i)),

where K^D(C) and K^D(C | X = x_i) are the credal sets obtained via the IDM for the C and (C | X = x_i) variables respectively, for a partition D of the data set (see Abellán & Moral (2003)).

The IIG criterion is different from the classical criteria. It is based on the principle of maximum uncertainty (see Klir (2006)), widely used in classic information theory, where it is known as the maximum entropy principle (Jaynes (1982)). The use of the maximum entropy function in the decision tree building procedure is justified in Abellán and Moral (2005). It is important to note that for a feature X and a partition D, IIG^D(C, X) can be negative. This situation does not appear with classical split criteria, such as the Info-Gain criterion used in ID3.¹ This characteristic enables the IIG criterion to reveal features that worsen the information on the class variable.

Each node No in a decision tree causes a partition of the data set (for the root node, D is considered to be the entire data set). Furthermore, each node No has an associated list L of feature labels (those that are not in the path from the root node to No). The procedure for building credal trees is explained in the algorithm in Fig. 1.

Fig. 1. Procedure to build a credal decision tree.

Considering this algorithm, when an Exit situation is attained, i.e. when there are no more features to insert in a node (L = ∅, step 1) or when the uncertainty measure is not reduced (max_{Xj ∈ L} IIG^D(C, Xj) ≤ 0, step 4), a leaf node is produced.

¹ The Info-Gain criterion is actually a specific case of the IIG criterion, using the parameter s = 0.
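The quantities defined in this section can be sketched in a few lines of Python. The snippet below is an illustrative reimplementation of the formulas quoted above, not the authors' Weka code: max_entropy_idm builds the maximum-entropy distribution p* of the IDM credal set (default s = 1) and iig uses it to compute the Imprecise Info-Gain; every identifier is an assumption of this sketch.

```python
from math import log2

def shannon(probs):
    """Shannon entropy of a probability vector (zero entries ignored)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def max_entropy_idm(counts, s=1.0):
    """H* of the IDM credal set built from a list of frequencies,
    using the maximum-entropy distribution p* of Section 2.2.1."""
    counts = list(counts)
    N = sum(counts)
    m = min(counts)
    A = [i for i, c in enumerate(counts) if c == m]   # set A of Eq. (1)
    l = len(A)
    p_star = [(c + (s / l if i in A else 0.0)) / (N + s)
              for i, c in enumerate(counts)]
    return shannon(p_star)

def iig(class_col, feature_col, s=1.0):
    """Imprecise Info-Gain: H*(K(C)) - sum_i P(X=x_i) H*(K(C|X=x_i))."""
    labels = sorted(set(class_col))
    counts_of = lambda rows: [sum(1 for c in rows if c == lab) for lab in labels]
    n = len(class_col)
    total = max_entropy_idm(counts_of(class_col), s)
    cond = 0.0
    for x in set(feature_col):
        subset = [c for c, v in zip(class_col, feature_col) if v == x]
        cond += (len(subset) / n) * max_entropy_idm(counts_of(subset), s)
    return total - cond

# Example: class counts [4, 1] and [4, 11] give H* values consistent with
# the 0.91 and 0.89 quoted later in Section 4.2.
print(max_entropy_idm([4, 1]), max_entropy_idm([4, 11]))
```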

In a leaf node, the most probable state or value of the class variable for the partition associated with that leaf node is inserted, that is, the class label for the leaf node No associated with the partition D is:

Class(No, D) = max_{ci ∈ C} | { Ij ∈ D / class(Ij) = ci, j = 1, ..., |D| } |,

where class(Ij) is the class of the instance Ij ∈ D and |D| is the number of instances of D.

If we do not have one single most probable class value, we obtain an unclassified leaf node. To avoid this situation, we can select the class obtained in its parent node (see Abellán & Masegosa (2010)).

CDTs can be somewhat smaller than the trees produced with similar procedures that replace the IIG criterion with classic ones (see Abellán & Masegosa (2009)). This normally produces a reduction of the overfitting of the model (see Abellán & Moral (2005)).

Credal trees and the IIG criterion have been used successfully in other procedures and data mining tools: as a part of a procedure for selecting variables (see Abellán & Masegosa (2009)) and on data sets with classification noise (see Abellán & Masegosa (2009)). An extended version of the IIG criterion has been used to define a semi-naïve Bayes classifier (see Abellán (2006)).

3. Analysis of Credal Decision Trees

Next, some properties and examples are described in order to show the differences between CDT and the standard methodologies used to build DTs (C4.5 and ID3).

Let Xi (i = 1, ..., m) be a set of features, and X be a feature with the values xi (i = 1, ..., m); let N be a data set with |N| = N, and let A^N_X be the set of values with minimum frequency for X by using N according to (1). Let us suppose² |A^N_X| = 1.

² This helps us to simplify the expressions; the results are not different to the ones obtained with other values.

First, we want to analyze the differences between the standard entropy H(X) and the credal set maximum entropy H*(K^N(X)). Both expressions are the result of adding components with the form

f(y) = −y log(y)

with y ∈ [0, 1]. H(X) uses values y = nxi/N, and H*(K^N(X)) uses values y = nxi/(N + s) if xi ∉ A^N_X and y = (nxi + s)/(N + s) if xi ∈ A^N_X.

We need the following trivial results and definition in order to analyze the differences between each component of H(X) and H*(K^N(X)).

Proposition 1.
if xi ∈ A^N_X then (nxi + s)/(N + s) − nxi/N = (N − nxi)·s / (N(N + s)) > 0
if xi ∉ A^N_X then nxi/(N + s) − nxi/N = −s·nxi / (N(N + s)) < 0

Proposition 2. The function f(y) = −y log(y) is increasing for y ∈ [0, 1/e] and decreasing for y ∈ [1/e, 1], where e is Euler's number.

Proposition 3. Let slope(yi) be the slope of the function f(y) = −y log(y) at y = yi and |·| the absolute value function.
if y1 < y2 < 1/e then |slope(y1)| > |slope(y2)|
if 1/e < y1 < y2 then |slope(y2)| > |slope(y1)|.

Definition 1. The function dif(xi, N) is defined as:

dif(xi, N) = −((nxi + s)/(N + s)) log((nxi + s)/(N + s)) + (nxi/N) log(nxi/N)   if xi ∈ A^N_X
dif(xi, N) = −(nxi/(N + s)) log(nxi/(N + s)) + (nxi/N) log(nxi/N)   if xi ∉ A^N_X

The function dif(xi, N) measures the imprecision provided by the value xi ∈ X when H(X) is substituted by H*(K^N(X)).

According to the preceding propositions, we can analyze dif(xi, N) for every case of xi ∈ X:

(a) If xi ∈ A^N_X and nxi/N ∈ [0, 1/e] then dif(xi, N) > 0.
(b) If xi ∈ A^N_X and nxi/N ∈ [1/e, 0.5] then dif(xi, N) < 0.
(c) If xi ∉ A^N_X and nxi/N ∈ [0, 1/e] then dif(xi, N) < 0.
(d) If xi ∉ A^N_X and nxi/N ∈ [1/e, 1] then dif(xi, N) > 0.

The negative terms in cases (b) and (c) are compensated because H*(K^N(X)) ≥ H(X). That is:

- Case (b) is compensated with case (d) because of Prop. 2, Prop. 3 and the fact that, if xi ∈ A^N_X, then there is a value xj with nxj/N > 1/e that fulfils case (d).
- Case (c) is compensated with case (a) because of Prop. 2, Prop. 3 and the fact that, if xi ∉ A^N_X, then there is a value xj with nxj/N < nxi/N < 1/e that fulfils case (a).

On the other hand, the variation of H*(K^N(X)) − H(X) is proportional to the expression 1/(N(N + s)) according to Prop. 1. Hence, we can deduce the following proposition.

Proposition 4. Let N1 and N2 be two data sets with |N1| = N1, |N2| = N2 and N1 < N2, and let C be a variable with H^{N1}(C) = H^{N2}(C). Then H*(K^{N1}(C)) > H*(K^{N2}(C)).

A conclusion of Prop. 4 is that the ID3 and CDT methodologies have similar behaviour in the nodes where the number of instances of the partition D is large. This situation is normally found in the upper levels of the DT. On the other hand, ID3 and CDTs will behave differently when |D| is not large. This is usually found in the deeper levels of the tree.

The following is an example to illustrate this property.

Example 2. Let us suppose |C| = 2 and two nodes with the data sets N1 and N2, respectively. The size of the data sets is |N1| = 10 and |N2| = 1000. The arrangement of the instances in these nodes is:

|C = C1| in N1 = 7,  |C = C2| in N1 = 3
|C = C1| in N2 = 700,  |C = C2| in N2 = 300

In this case, H^{N1}(C) = H^{N2}(C) and H*(K^{N1}(C)) > H*(K^{N2}(C)). The imprecision for node1 is greater than the one for node2. This is reasonable because the sample size of node1 is the smallest. This characteristic is taken into consideration in the CDT methodology and not in ID3.

Next, an analysis shows that the CDT methodology has an intermediate action between ID3 and C4.5 when feature variables with a large number of values are processed.

It is known that the IG used by ID3 is biased in favor of feature variables with a large number of values. The IGR measure used by C4.5 was created to compensate for this bias. IGR penalizes the selection of feature variables with a large number of values. However, the IGR might not always be defined (the denominator may
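A quick numeric illustration of Proposition 4 and Example 2 follows, reusing the hypothetical max_entropy_idm sketch from Section 2.2 (illustrative code only, not the paper's implementation):

```python
from math import log2

def shannon(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def max_entropy_idm(counts, s=1.0):
    counts = list(counts); N = sum(counts); m = min(counts)
    A = [i for i, c in enumerate(counts) if c == m]; l = len(A)
    return shannon([(c + (s / l if i in A else 0.0)) / (N + s)
                    for i, c in enumerate(counts)])

node1 = [7, 3]        # |N1| = 10
node2 = [700, 300]    # |N2| = 1000

h  = shannon([0.7, 0.3])        # identical precise entropy for both nodes
h1 = max_entropy_idm(node1)     # small sample: wider intervals, more non-specificity
h2 = max_entropy_idm(node2)     # large sample: close to the precise entropy
print(h, h1, h2)                # expected ordering: h1 > h2 > h
```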

be zero) and might choose feature variables with very low H(X) rather than those with a high gain. The IIG measure does not have these drawbacks. CDTs will have an intermediate behaviour between ID3 and C4.5 when variables with many values are processed.

Let X′ be a variable constructed from X by splitting one of its values into two, so X′ will have a larger number of values than X. According to Quinlan (1986), it could be anticipated that excessive fineness would tend to obscure structure in the training set, so that X′ should in fact be less useful than X. However, it can be proved that IG(C, X′) > IG(C, X), that is, the ID3 algorithm chooses X′ for splitting a node instead of X. On the other hand, if we use credal sets, it is expected that

IIG^N(C, X′) − IIG^N(C, X) < IG(C, X′) − IG(C, X)

because the partitions generated by X′ are finer than the ones generated by X (see Prop. 4). The variable X′ has an imprecision increment greater than X when imprecise probabilities and credal sets are used. Thus, variables with a large number of values are penalized by CDT, but they are less punished than when C4.5 is used. Let us see an example.

Example 3. Let a data set have |N| = 16 instances, a binary class variable with |C = c1| = 6 and |C = c2| = 10, and two feature variables X1 ∈ {0, 1} and X2 ∈ {0, 1, 2, 3}. If we have the following arrangement of the instances:

for X1 = 0, |C = c1| = 0 and |C = c2| = 5
for X1 = 1, |C = c1| = 6 and |C = c2| = 5
for X2 = 0, |C = c1| = 0 and |C = c2| = 5
for X2 = 1, |C = c1| = 2 and |C = c2| = 2
for X2 = 2, |C = c1| = 2 and |C = c2| = 2
for X2 = 3, |C = c1| = 2 and |C = c2| = 1

then

IG(C, X1) < IG(C, X2)
IIG^N(C, X1) = IIG^N(C, X2)
IGR(C, X1) > IGR(C, X2)

Hence, we can see that the IIG criterion has an intermediate action between IG and IGR when feature variables with a large number of values are manipulated.

4. Extension of Credal Decision Trees

The standard method for building the CDTs presented in Section 2.2.2 is derived from considering the probability distribution of the class variable C as imprecise. This is coherent when the information on the class of each instance from the data set is not very reliable.

However, there are other variables apart from the class variable that are used when a decision tree is built. These are the feature variables {X1, ..., Xn}. If the information about the feature variables is considered unreliable, then it could be interesting to manage their probability distributions with imprecision. Let us see the following example.

Example 4. Let us suppose the classification problem defined by the following rule:

If X1 > 10 and X2 > 5 then Class_A
Otherwise Class_B.

The problem space is illustrated in Fig. 2. It is usual that noisy points appear when a training data set is extracted from a classification problem. In this case, we can suppose that the point (X1 = 15, X2 = 4) is classified as Class_A instead of the correct Class_B (see Fig. 2), that is, it is a noise point in this problem. This situation can come from two sources: (i) the class variable has been incorrectly obtained (Class_A instead of Class_B, class noise); or (ii) the feature variable X2 has not been well measured (value 4 instead of value 6, attribute noise).

Fig. 2. Example of a binary classification problem with a noise point.

Example 4 shows that it can be interesting to define a method to build decision trees that takes into account the different sources of noise.

The design of a new method for building Credal Decision Trees is presented in this section. This new method considers that the probability distributions of the class variable C and the feature variables {X1, ..., Xn} are imprecise.

These new CDTs are experimentally compared with the standard CDT and better results are obtained.

4.1. Complete Credal Decision Trees

Complete Credal Decision Trees (CCDTs) are CDTs where both the class variable and the feature variables Xi (i = 1, ..., n) are modeled by using the IDM. In this manner, the value of each variable is estimated by a convex set of probability distributions.

The procedure for building a CCDT is similar to the one used by CDTs (see Section 2.2.2), replacing the IIG split criterion with the Complete Imprecise Info-Gain (CIIG) criterion, which can be defined as follows:

CIIG^D(C, X) = H*(K^D(C)) − Σ_i P^D(X = x_i) · H*(K^D(C | X = x_i)),

where P^D(X = x_i) (i = 1, ..., n) is a probability distribution that belongs to the convex set K^D(X).

Again, the principle of maximum uncertainty (Klir, 2006) is the basis for choosing P^D from K^D(X). This principle indicates that the probability distribution with the maximum entropy, compatible with the available restrictions, must be chosen. Hence, we choose the probability distribution P^D from K^D(X) which maximizes the following expression:

Σ_i P(X = x_i) · H(C | X = x_i).

It is simple to calculate this probability distribution. Let x_{j0} be a value for X such that H(C | X = x_{j0}) is the maximum. Then the probability distribution P^D will be
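Continuing the illustrative Python sketch of the previous sections, the CIIG criterion and the choice of P^D can be written as follows. The selection of x_{j0} follows the description above (the value of X with maximum conditional entropy), and every identifier is an assumption of this sketch rather than the authors' code.

```python
from math import log2

def shannon(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def max_entropy_idm(counts, s=1.0):
    counts = list(counts); N = sum(counts); m = min(counts)
    A = [i for i, c in enumerate(counts) if c == m]; l = len(A)
    return shannon([(c + (s / l if i in A else 0.0)) / (N + s)
                    for i, c in enumerate(counts)])

def ciig(class_col, feature_col, s=1.0):
    """Complete Imprecise Info-Gain with P^D taken from K^D(X) as above."""
    labels = sorted(set(class_col))
    counts_of = lambda rows: [sum(1 for c in rows if c == lab) for lab in labels]
    xs = sorted(set(feature_col))
    N = len(class_col)
    subsets = {x: [c for c, v in zip(class_col, feature_col) if v == x] for x in xs}
    # Plain conditional entropies H(C | X = x), used to pick x_{j0}
    h_cond = {x: shannon([c / len(subsets[x]) for c in counts_of(subsets[x]) if c > 0])
              for x in xs}
    j0 = max(xs, key=lambda x: h_cond[x])
    # P^D: the extra mass s goes to the value x_{j0} (maximum uncertainty choice)
    p_d = {x: (len(subsets[x]) + (s if x == j0 else 0.0)) / (N + s) for x in xs}
    total = max_entropy_idm(counts_of(class_col), s)
    cond = sum(p_d[x] * max_entropy_idm(counts_of(subsets[x]), s) for x in xs)
    return total - cond

# The first arrangement of Section 4.2 (8 c1 / 12 c2; X=0 -> 4/1, X=1 -> 4/11)
C = ['c1'] * 4 + ['c2'] * 1 + ['c1'] * 4 + ['c2'] * 11
X = [0] * 5 + [1] * 15
print(ciig(C, X))
```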

P^D(x_i) = nxi / (N + s) if i ≠ j0;  P^D(x_i) = (nxi + s) / (N + s) if i = j0.

The procedure for building complete credal trees is explained by the algorithm in Fig. 3. Note that the differences with the method for designing credal trees (Fig. 1) are only step 3, for calculating P^D(X = x_i) (i = 1, ..., n) on the convex set K^D(X), and the use of the CIIG criterion instead of IIG.

Fig. 3. Procedure to build a complete credal decision tree.

The process to label the class variable of each leaf node is similar to the one used by CDTs, explained in Section 2.2.2. It is expected that CCDTs obtain better results than CDTs when data sets with general (attribute and class) noise are processed. This will be shown in the experimental section.

4.2. Analysis of Complete Credal Decision Trees

An advantage of CDTs is that they build trees smaller than the ones provided by the classic models (Abellán & Masegosa, 2009). The reason for this property is that the IIG criterion can return negative values for some nodes. Hence, the tree building process is stopped in these nodes and the trees are small.

As the probability distribution P^D for CIIG is selected by using the principle of maximum uncertainty, the following condition is usually fulfilled:

CIIG^D(C, X) ≤ IIG^D(C, X).

Hence, the good property of CDTs with regards to tree size is reinforced for CCDTs.

The property CIIG^D(C, X) ≤ IIG^D(C, X) may not be fulfilled in some cases. For example, let us suppose a data set with N = 20 instances, a binary class variable with |C = c1| = 8 and |C = c2| = 12, and a binary feature variable X ∈ {0, 1}. If we have the following arrangement of instances:

for X = 0, |C = c1| = 4 and |C = c2| = 1
for X = 1, |C = c1| = 4 and |C = c2| = 11

then CIIG^D(C, X) > IIG^D(C, X). The reason is that the data set for X = 0 (N′) is small. Hence, the value of H*(K^{N′}(C|X = 0)) is greatly increased by non-specificity. In this manner, the order relation

H(C|X = 0) = 0.72 < H(C|X = 1) = 0.83

is inverted by the function H*:

H*(K^{N′}(C|X = 0)) = 0.91 > H*(K^{N′}(C|X = 1)) = 0.89.

The previous situation is not usual. Normally, the condition CIIG^D(C, X) ≤ IIG^D(C, X) is fulfilled. As an example, let us suppose the same conditions as in the preceding example. If we have the following arrangement of instances:

for X = 0, |C = c1| = 2 and |C = c2| = 10
for X = 1, |C = c1| = 6 and |C = c2| = 2

then CIIG^D(C, X) < IIG^D(C, X). In this case, the data set sizes are not very different for each value of X. Hence, the values for H* are affected by non-specificity in a similar proportion. The order relation

H(C|X = 0) = 0.65 < H(C|X = 1) = 0.81

is not altered by H*:

H*(K^{N′}(C|X = 0)) = 0.78 < H*(K^{N′}(C|X = 1)) = 0.92.

As a consequence, it is expected that CCDTs are smaller than CDTs. This is experimentally shown in the next section.

5. Experimental Analysis

Our aim is to study the performance of several decision trees built with different split criteria. These criteria are IG, IGR, IIG and CIIG. The trees built with those criteria will be called IGT, IGRT, CDT and CCDT in this section, respectively.

In order to check the above procedures, we used a broad and diverse set of 60 known data sets, obtained from the UCI repository of machine learning data sets, which can be directly downloaded from http://archive.ics.uci.edu/ml. We took data sets that are different with respect to the number of cases of the variable to be classified, data set size, feature types (discrete or continuous) and number of cases of the features. A brief description of these can be found in Table 1.

We used Weka software (Witten & Frank, 2005) on Java 1.5 for our experimentation. We added the necessary methods to build decision trees using the split criteria explained in the previous sections.

We applied the following preprocessing procedure: missing values were replaced with mean values (for continuous variables) and the mode (for discrete variables) using Weka's own filters. In the same way, continuous variables were discretized using Fayyad and Irani's known discretization method (Fayyad & Irani, 1993). We note that the preprocessing was applied using the training set and then translated to the test set.

Using Weka's filters, we added the following percentages of random noise to the features and to the class variable: 0%, 5%, 10%, 20% and 30%, only in the training data set. The procedure to introduce noise was the following for a variable: a given percentage of instances of the training data set was randomly selected, and then their current variable values were randomly changed to other possible values. The instances belonging to the test data set were left unmodified.

We repeated 10 times a 10-fold cross validation procedure for each data set.³

Following the recommendation of Demsar (Demsar, 2006), we used a series of tests to compare the methods.⁴ We used the following tests to compare multiple classifiers on multiple data sets, with a level of significance of α = 0.05:

• Friedman test (Friedman (1937), Friedman (1940)): a non-parametric test that ranks the algorithms separately for each data set, the best performing algorithm being assigned the rank of 1, the second best, rank 2, etc. The null hypothesis is that all the algorithms are equivalent. If the null hypothesis is rejected, we can compare all the algorithms to each other using Holm's test (Holm (1979)).

³ The data set is separated into 10 subsets. Each one is used as a test set and the set obtained by joining the other 9 subsets is used as a training set. So, we have 10 training sets and 10 test sets. This procedure is repeated 10 times with a previous random reordering. Finally, it produces 100 training sets and 100 test sets. The percentage of correct classifications for each data set, presented in the tables, is the average of these 100 trials.
⁴ All the tests were carried out using Keel software (Alcalá-Fdez et al., 2009), available at www.keel.es.
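The noise-injection step described above (a Weka filter in the original experiments) can be mimicked with a short, hypothetical Python helper: it flips a given percentage of randomly chosen positions of one training column to a different possible value, while test data are left untouched. All names here are assumptions of this sketch.

```python
import random

def add_random_noise(column, percentage, seed=0):
    """Return a copy of `column` (a list of discrete values) in which
    `percentage`% of randomly chosen positions are changed to a different
    possible value of the same variable."""
    rng = random.Random(seed)
    values = sorted(set(column))
    noisy = list(column)
    k = round(len(column) * percentage / 100.0)
    for i in rng.sample(range(len(column)), k):
        alternatives = [v for v in values if v != noisy[i]]
        if alternatives:                      # constant columns cannot be flipped
            noisy[i] = rng.choice(alternatives)
    return noisy

# Applied to every feature and to the class variable of the training folds only;
# the test folds are left unmodified, as in the experimental setup above.
print(add_random_noise(['a', 'b', 'a', 'a', 'b', 'b', 'a', 'b', 'a', 'a'], 20))
```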

Table 1
data set description. Column ‘‘N’’ is the number of instances in the data sets, column ‘‘Feat’’ is the number of features or attribute variables, column ‘‘Num’’ is
the number of numerical variables, column ‘‘Nom’’ is the number of nominal variables, column ‘‘k’’ is the number of cases or states of the class variable (always
a nominal variable) and column ‘‘Range’’ is the range of states of the nominal variables of each data set.

Data set N Feat Num Nom k Range


anneal 898 38 6 32 6 2–10
arrhythmia 452 279 206 73 16 2
audiology 226 69 0 69 24 2–6
autos 205 25 15 10 7 2–22
balance-scale 625 4 4 0 3 -
breast-cancer 286 9 0 9 2 2–13
wisconsin-breast-cancer 699 9 9 0 2 –
car 1728 6 0 6 4 3–4
cmc 1473 9 2 7 3 2–4
horse-colic 368 22 7 15 2 2–6
credit-rating 690 15 6 9 2 2–14
german-credit 1000 20 7 13 2 2–11
cylinder-bands 540 39 18 21 2 2–429
dermatology 366 34 1 33 6 2–4
pima-diabetes 768 8 8 0 2 –
ecoli 366 7 7 0 7 –
Glass 214 9 9 0 7 –
haberman 306 3 2 1 2 12
cleveland-14-heart-disease 303 13 6 7 5 2–14
hungarian-14-heart-disease 294 13 6 7 5 2–14
heart-statlog 270 13 13 0 2 –
hepatitis 155 19 4 15 2 2
hypothyroid 3772 30 7 23 4 2–4
ionosphere 351 35 35 0 2 –
iris 150 4 4 0 3 -
kr-vs-kp 3196 36 0 36 2 2–3
labor 57 16 8 8 2 2–3
letter 20000 16 16 0 26 –
liver-disorders 345 6 6 0 2 –
lymphography 146 18 3 15 4 2–8
mfeat-factors 2000 216 216 0 10 –
mfeat-fourier 2000 76 76 0 10 –
mfeat-karhunen 2000 64 64 0 10 –
mfeat-morphological 2000 6 6 0 10 –
mfeat-pixel 2000 240 0 240 10 4–6
mfeat-zernike 2000 47 47 0 10 –
mushroom 8123 22 0 22 2 2–12
nursery 12960 8 0 8 4 2–4
optdigits 5620 64 64 0 10 –
page-blocks 5473 10 10 0 5 –
pendigits 10992 16 16 0 10 –
postoperative-patient-data 90 8 0 8 3 3–4
primary-tumor 339 17 0 17 21 2–3
segment 2310 19 16 0 7 –
shuttle-landing-control 15 6 0 5 2 2–4
sick 3772 29 7 22 2 2
solar-flare2 1066 12 0 6 3 2–8
sonar 208 60 60 0 2 -
soybean 683 35 0 35 19 2–7
spambase 4601 57 57 0 2 –
spectrometer 531 101 100 1 48 4
splice 3190 60 0 60 3 4–6
Sponge 76 44 0 44 3 2–9
tae 151 5 3 2 3 2
vehicle 946 18 18 0 4 –
vote 435 16 0 16 2 2
vowel 990 11 10 1 11 2
waveform 5000 40 40 0 3 –
wine 178 13 13 0 3 –
zoo 101 16 1 16 7 2

5.1. Results

Next, we show the results obtained by the decision trees built with the split criteria IG, IGR, IIG and CIIG. These trees are called IGT, IGRT, CDT and CCDT, respectively.

Tables 2–6 present the accuracy results of each method when applied to data sets with a percentage of random noise in the features and the class variable equal to 0%, 5%, 10%, 20% and 30%, respectively.

Tables 7 and 8 present the average results for the accuracy and tree size (number of nodes) of each method when applied to data sets with percentages of random noise equal to 0%, 5%, 10%, 20% and 30%. The results on tree size are also illustrated in Fig. 4.
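As a side illustration of how the rankings reported below (Table 9) are obtained, the following sketch computes average Friedman ranks and the Friedman test from an accuracy matrix, assuming NumPy and SciPy are available; the toy numbers are placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

acc = np.array([            # toy accuracies: rows = data sets, columns = classifiers
    [78.9, 79.0, 79.6, 79.6],
    [62.9, 63.7, 66.6, 67.1],
    [79.7, 83.6, 80.4, 80.9],
])
ranks = np.vstack([rankdata(-row) for row in acc])   # rank 1 = best accuracy
print("average ranks:", ranks.mean(axis=0))
print(friedmanchisquare(*acc.T))                     # test statistic and p-value
```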

Table 2 (left half of each row) and Table 3 (right half of each row)
Accuracy results of IGT, IGRT, CDT and CCDT when applied on data sets with a percentage of random noise equal to 0% (Table 2) and 5% (Table 3). Each row lists the data set followed by the four accuracies for 0% noise, then the data set again followed by the four accuracies for 5% noise.

data set IGT IGRT CDT CCDT (noise 0%) data set IGT IGRT CDT CCDT (noise 5%)
anneal 99.70 99.64 99.66 99.34 anneal 98.91 99.39 99.21 99.01
arrhythmia 62.93 63.66 66.58 67.08 arrhythmia 62.59 62.56 66.71 66.86
audiology 79.69 83.59 80.44 80.94 audiology 73.78 80.66 76.62 77.69
autos 81.59 79.21 79.14 78.27 autos 80.14 77.14 76.89 75.97
balance-scale 69.59 69.59 69.59 69.59 balance-scale 71.02 71.02 71.02 71.02
breast-cancer 66.93 65.01 71.58 72.03 breast-cancer 64.27 66.09 70.13 70.17
wisconsin-breast-cancer 93.61 93.81 94.72 94.74 wisconsin-breast-cancer 93.69 93.66 94.64 94.64
car 92.88 92.93 91.64 90.28 car 88.17 88.37 91.10 90.27
cmc 46.37 46.33 48.63 48.70 cmc 45.42 46.17 48.16 48.19
horse-colic 78.15 78.15 82.25 83.17 horse-colic 77.83 77.73 81.55 82.04
credit-rating 79.70 81.01 83.75 84.06 credit-rating 79.81 81.01 83.67 84.09
german-credit 67.01 67.17 68.99 69.53 german-credit 66.79 67.21 69.12 69.17
cylinder-bands 60.24 66.96 70.57 71.28 cylinder-bands 59.76 66.76 71.28 71.39
dermatology 92.12 92.59 93.95 94.58 dermatology 90.38 92.24 93.11 93.76
pima-diabetes 73.53 73.44 74.20 74.22 pima-diabetes 73.14 73.24 73.91 73.89
ecoli 79.80 79.94 80.24 80.03 ecoli 79.56 79.76 79.61 79.67
Glass 70.69 72.97 69.10 68.83 Glass 71.19 72.75 67.76 67.24
haberman 73.59 73.59 73.59 73.59 haberman 72.17 72.17 72.49 72.49
cleveland-14-heart-disease 76.43 75.78 75.74 76.23 cleveland-14-heart-disease 74.78 74.07 74.64 74.87
hungarian-14-heart-disease 79.35 79.83 78.41 78.62 hungarian-14-heart-disease 79.16 79.98 77.84 78.35
heart-statlog 81.19 81.48 82.30 82.11 heart-statlog 79.89 80.22 81.63 81.85
hepatitis 77.59 75.60 79.60 80.32 hepatitis 75.48 74.80 78.29 79.13
hypothyroid 99.26 99.24 99.36 99.37 hypothyroid 98.99 99.12 99.20 99.19
ionosphere 89.00 89.46 89.72 89.75 ionosphere 88.58 88.29 89.09 89.55
iris 93.80 93.80 93.53 93.73 iris 93.60 93.47 93.40 93.47
kr-vs-kp 99.61 99.63 99.50 99.49 kr-vs-kp 99.28 99.36 99.35 99.38
labor 86.03 85.83 84.40 83.27 labor 82.87 83.87 83.03 84.37
letter 80.86 79.76 78.15 77.55 letter 80.01 79.62 77.29 76.72
liver-disorders 56.85 56.85 56.85 56.85 liver-disorders 56.62 56.62 56.62 56.62
lymphography 73.39 75.64 73.12 74.50 lymphography 74.08 74.00 74.79 75.51
mfeat-factors 82.60 80.08 81.53 81.89 mfeat-factors 81.27 79.11 80.26 80.61
mfeat-fourier 70.71 66.39 68.59 68.17 mfeat-fourier 68.88 64.54 66.76 66.62
mfeat-karhunen 76.55 70.88 72.63 72.81 mfeat-karhunen 74.27 68.55 70.55 70.28
mfeat-morphological 69.89 69.91 70.76 70.74 mfeat-morphological 69.53 69.52 70.35 70.30
mfeat-pixel 75.78 78.19 79.89 80.31 mfeat-pixel 75.16 77.55 78.65 78.94
mfeat-zernike 61.47 61.03 63.58 63.71 mfeat-zernike 60.24 60.15 62.14 62.72
mushroom 100.00 100.00 100.00 100.00 mushroom 100.00 100.00 100.00 100.00
nursery 98.87 98.84 96.28 96.31 nursery 92.75 93.06 96.41 96.38
optdigits 77.54 79.03 78.73 79.33 optdigits 76.31 78.05 77.84 78.23
page-blocks 96.32 96.42 96.27 96.26 page-blocks 96.20 96.09 96.16 96.15
pendigits 90.32 89.84 89.06 88.87 pendigits 89.29 89.01 88.48 88.27
postoperative-patient-data 54.67 54.44 71.00 71.00 postoperative-patient-data 55.22 54.89 69.78 69.11
primary-tumor 35.11 36.52 38.99 38.73 primary-tumor 35.72 34.49 37.61 37.88
segment 95.27 94.90 94.47 94.18 segment 94.52 94.28 93.96 93.81
shuttle-landing-control 56.50 56.00 45.00 45.00 shuttle-landing-control 57.50 53.00 38.50 42.00
sick 97.54 97.55 97.79 97.80 sick 97.53 97.50 97.84 97.87
solar-flare2 99.20 99.06 99.35 99.46 solar-flare2 98.74 99.05 98.96 99.02
sonar 73.90 73.71 73.82 73.92 sonar 73.89 72.54 73.65 73.80
soybean 89.22 92.72 91.82 92.21 soybean 87.42 91.52 91.13 91.58
spambase 90.59 91.48 91.65 91.85 spambase 90.35 91.54 91.73 91.81
spectrometer 45.39 43.54 44.77 45.22 spectrometer 44.39 42.03 43.79 44.58
splice 90.96 91.68 92.96 93.17 splice 90.42 90.90 92.62 92.87
sponge 93.07 93.34 94.11 94.63 sponge 91.54 93.63 95.00 95.00
tae 46.78 46.78 46.78 46.78 tae 45.85 45.85 45.91 45.91
vehicle 69.22 69.11 69.21 69.53 vehicle 67.83 68.36 68.98 69.28
vote 93.38 93.15 95.74 96.18 vote 93.42 93.40 95.63 95.72
vowel 83.76 82.03 77.32 75.60 vowel 82.54 80.90 75.90 74.33
waveform 71.60 70.85 74.22 74.44 waveform 70.14 69.45 73.00 73.31
wine 93.77 91.29 92.25 92.08 wine 92.59 90.52 92.58 92.59
zoo 96.31 97.04 95.92 95.83 zoo 94.35 94.70 94.43 93.84
Average 78.96 78.97 79.56 79.63 Average 78.00 78.09 78.84 78.99

Table 9 shows Friedman's ranks of IGT, IGRT, CDT and CCDT when applied on data sets with percentages of random noise equal to 0%, 5%, 10%, 20% and 30% (the null hypothesis is rejected in all cases).

Tables 10–12 show the p-values of the Holm's test on the IGT, IGRT, CDT and CCDT methods when they are applied on data sets with a percentage of random noise. Holm's procedure rejects hypotheses that have:

• p-value ≤ 0.01 (for noise 0%).
• p-value ≤ 0.025 (for noise 5% and 10%).
• p-value ≤ 0.016667 (for noise 20% and 30%).
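For completeness, here is a small hand-rolled sketch of the Holm step-down rule just summarized: the i-th smallest p-value is compared against α/(k − i + 1), so for α = 0.05 and six pairwise comparisons the thresholds are 0.008333, 0.01, 0.0125, 0.016667, 0.025 and 0.05. The paper's tests were actually run with Keel, and the p-values below are invented for the example.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (len(p_values) - step):
            reject[i] = True
        else:
            break                     # once a hypothesis survives, stop rejecting
    return reject

print(holm([0.002, 0.008, 0.011, 0.02, 0.03, 0.2]))   # toy p-values
```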
5.2. Comments about the results

The results shown in the previous section are analyzed according to the following aspects: average accuracy, tree size, Friedman's ranking and Holm's test.

Table 4 (left half of each row) and Table 5 (right half of each row)
Accuracy results of IGT, IGRT, CDT and CCDT when applied on data sets with a percentage of random noise equal to 10% (Table 4) and 20% (Table 5). Each row lists the data set followed by the four accuracies for 10% noise, then the data set again followed by the four accuracies for 20% noise.

data set IGT IGRT CDT CCDT (noise 10%) data set IGT IGRT CDT CCDT (noise 20%)
anneal 98.70 99.37 99.00 98.80 anneal 98.53 99.19 98.73 98.43
arrhythmia 63.54 62.26 67.04 67.17 arrhythmia 62.73 60.65 65.40 65.87
audiology 71.25 79.08 73.99 75.13 audiology 67.98 76.91 70.71 70.97
autos 78.80 76.87 76.11 75.17 autos 76.47 76.24 73.97 73.87
balance-scale 71.88 71.88 71.88 71.88 balance-scale 72.88 72.88 72.93 72.93
breast-cancer 64.70 64.99 69.78 70.16 breast-cancer 63.86 66.10 69.06 69.03
wisconsin-breast-cancer 93.89 93.73 94.35 94.49 wisconsin-breast-cancer 93.59 93.69 94.21 94.26
car 83.75 84.32 90.89 90.19 car 75.19 76.16 89.03 88.88
cmc 45.08 45.91 47.85 48.00 cmc 44.46 45.27 47.30 47.36
horse-colic 78.62 77.55 81.47 81.90 horse-colic 77.18 76.78 81.65 82.22
credit-rating 80.04 81.51 84.65 84.81 credit-rating 80.26 81.10 84.45 84.61
german-credit 66.81 67.18 69.49 69.49 german-credit 66.53 67.23 69.34 69.62
cylinder-bands 59.63 66.59 71.50 71.98 cylinder-bands 59.00 59.28 71.48 71.44
dermatology 89.94 91.93 93.24 93.66 dermatology 89.07 91.53 91.27 91.79
pima-diabetes 72.92 72.98 73.82 73.80 pima-diabetes 72.96 73.05 73.69 73.75
ecoli 79.74 79.82 79.58 79.28 ecoli 78.97 79.10 78.99 79.05
Glass 70.68 71.22 67.61 67.15 Glass 69.41 69.53 66.46 66.04
haberman 71.08 71.08 71.11 71.11 haberman 70.43 70.43 70.43 70.43
cleveland-14-heart-disease 72.77 72.82 74.08 74.25 cleveland-14-heart-disease 71.21 71.07 72.72 72.40
hungarian-14-heart-disease 78.93 80.17 78.69 78.76 hungarian-14-heart-disease 78.86 78.51 79.37 79.30
heart-statlog 80.37 80.30 82.63 82.78 heart-statlog 79.70 79.74 81.11 81.63
hepatitis 76.35 76.05 78.86 78.87 hepatitis 76.58 76.49 78.80 78.85
hypothyroid 98.91 99.08 99.16 99.14 hypothyroid 98.96 99.09 99.05 99.06
ionosphere 88.14 87.84 88.89 88.86 ionosphere 86.95 86.81 87.27 87.27
iris 93.53 93.53 93.93 93.87 iris 93.73 93.73 93.67 93.87
kr-vs-kp 99.19 99.28 99.26 99.29 kr-vs-kp 98.87 99.01 99.14 99.16
labor 86.00 85.23 86.03 85.50 labor 83.97 83.73 83.07 83.10
letter 79.31 79.23 76.50 75.93 letter 77.99 77.91 75.27 74.82
liver-disorders 56.85 56.85 56.85 56.85 liver-disorders 56.85 56.85 56.85 56.85
lymphography 73.23 75.06 74.00 74.94 lymphography 69.73 73.88 74.26 74.25
mfeat-factors 80.47 78.28 79.56 80.11 mfeat-factors 77.65 76.30 76.98 77.49
mfeat-fourier 68.11 64.04 65.92 65.58 mfeat-fourier 66.53 62.64 63.66 63.81
mfeat-karhunen 73.06 66.72 68.77 68.86 mfeat-karhunen 70.69 64.10 66.13 65.97
mfeat-morphological 69.34 69.41 70.20 70.13 mfeat-morphological 68.80 69.14 69.49 69.51
mfeat-pixel 74.58 76.77 78.56 78.69 mfeat-pixel 73.93 75.54 77.37 77.55
mfeat-zernike 59.06 59.24 61.21 61.38 mfeat-zernike 57.53 57.74 59.56 60.05
mushroom 100.00 100.00 100.00 100.00 mushroom 100.00 100.00 100.00 100.00
nursery 86.64 87.22 96.39 96.35 nursery 74.93 75.95 95.58 95.62
optdigits 75.12 77.64 76.66 76.97 optdigits 72.69 76.92 74.40 74.69
page-blocks 96.03 95.85 95.97 95.96 page-blocks 95.87 95.60 95.78 95.73
pendigits 88.56 88.17 87.84 87.72 pendigits 87.21 86.90 86.69 86.55
postoperative-patient-data 55.22 53.00 67.67 67.89 postoperative-patient-data 54.00 53.78 66.11 66.78
primary-tumor 33.68 33.54 38.21 38.00 primary-tumor 32.52 32.85 36.02 36.29
segment 93.77 93.79 93.39 93.32 segment 92.48 92.78 92.67 92.46
shuttle-landing-control 58.00 55.50 46.00 46.50 shuttle-landing-control 52.50 52.50 39.00 39.50
sick 97.51 97.43 97.78 97.81 sick 97.51 97.47 97.78 97.79
solar-flare2 98.65 98.98 99.01 99.06 solar-flare2 98.59 98.84 98.89 98.93
sonar 73.20 73.38 72.66 72.57 sonar 72.21 72.25 72.23 72.51
soybean 86.81 91.30 90.18 90.42 soybean 84.58 90.41 89.69 89.90
spambase 89.93 91.27 91.57 91.67 spambase 89.72 91.17 91.18 91.27
spectrometer 43.58 42.52 43.50 43.48 spectrometer 43.11 40.23 41.68 42.00
splice 89.53 89.87 92.29 92.51 splice 87.69 87.91 90.33 90.84
sponge 91.41 93.48 93.96 93.95 sponge 90.13 92.57 92.63 93.04
tae 44.52 44.52 44.52 44.52 tae 42.00 42.00 42.06 42.06
vehicle 68.39 68.23 68.51 68.78 vehicle 67.60 67.88 67.91 67.98
vote 93.86 93.40 95.58 95.74 vote 93.65 93.45 95.14 95.03
vowel 81.59 79.75 74.54 72.75 vowel 78.31 76.72 72.07 70.47
waveform 69.37 68.72 72.30 72.54 waveform 67.01 66.81 70.23 70.66
wine 91.30 89.77 91.19 91.41 wine 88.36 88.67 88.91 89.42
zoo 93.35 94.35 92.96 91.58 zoo 90.60 91.91 91.02 89.34
Average 77.49 77.66 78.65 78.66 Average 76.02 76.38 77.51 77.57

• Average accuracy: According to this factor, we can observe that CDT and CCDT obtain better average accuracy results than the trees built with the classic split criteria IGT and IGRT.
• Tree size: CDT and CCDT build classification trees smaller than the ones obtained by IGT and IGRT. In particular, CCDT is the method that provides the smallest trees.
• Friedman's ranking: According to this ranking, we can say that CCDT is the best model for classifying data sets with or without noise (lower rank value in each comparison).
• Holm's test: According to this test, we can observe that CCDT is better than IGT and IGRT with a significant statistical difference when they are compared on data sets with or without noise. CDT is also better than IGT and IGRT when they are applied on data sets with noise. CDT and CCDT do not have a significant statistical difference when they are applied on data sets without noise or with high noise (20% and 30%), but in these cases it must be remarked that CCDT has smaller p-values than the ones provided by CDT when they are compared with IGT and IGRT

Table 6
Accuracy results of IGT, IGRT, CDT and CCDT when applied on data sets with a percentage of random noise equal to 30% (columns: data set, IGT, IGRT, CDT, CCDT). The first four rows additionally carry, after the accuracy values, the rows of Table 8: average tree size (number of nodes) of each method (Tree) for random noise equal to 0%, 5%, 10%, 20% and 30%.

data set IGT IGRT CDT CCDT | Tree noise 0% noise 5% noise 10% noise 20% noise 30%
anneal 98.21 98.74 98.46 98.25 IGT 962.77 1160.79 1318.50 1596.72 1828.06
arrhythmia 61.53 60.98 65.25 65.52 IGRT 1010.63 1211.42 1368.65 1704.06 1926.71
audiology 65.24 73.45 68.68 69.09 CDT 454.68 533.94 597.27 708.78 820.37
autos 71.48 71.90 70.23 70.34 CCDT 392.81 474.83 538.42 650.13 759.87
balance-scale 73.23 73.23 73.18 73.18
breast-cancer 63.80 64.66 69.89 70.21
wisconsin-breast-cancer 93.86 93.61 94.59 94.65
car 66.54 67.31 86.02 86.25 2500
IGT
cmc 44.16 45.01 46.77 46.81 IGRT
CDT
horse-colic 76.63 76.55 80.37 80.65 CCDT
credit-rating 80.45 80.70 83.77 84.04 2000

Tree size (nodes number)


german-credit 65.95 67.57 69.33 69.24
cylinder-bands 57.20 57.52 70.69 70.72
dermatology 85.81 88.54 89.29 89.59 1500
pima-diabetes 72.40 72.38 72.88 72.97
ecoli 78.68 78.69 78.58 78.43
Glass 68.12 68.68 64.39 64.10
1000
haberman 69.90 69.90 70.16 70.26
cleveland-14-heart-disease 69.63 70.29 71.84 72.17
500
hungarian-14-heart-disease 79.77 79.32 78.93 79.27
heart-statlog 79.44 79.52 81.19 81.89
hepatitis 75.46 75.20 78.27 78.39 0
hypothyroid 98.87 98.97 98.97 98.97 0% 5% 10% 20% 30%
ionosphere 87.01 86.69 87.49 87.38 Noise
iris 93.47 93.47 93.87 93.93
kr-vs-kp 98.74 98.87 99.02 99.06 Fig. 4. Average tree size results of each method when is applied on data sets with
labor 83.47 82.63 82.20 81.30 each percentage of random noise.
letter 76.37 76.50 73.89 73.45
liver-disorders 56.85 56.85 56.85 56.85
lymphography 71.01 72.31 75.20 74.29
mfeat-factors 75.50 74.40 75.21 75.19
(see Tables 10–12). On the other hand, CCDT is better than CDT
mfeat-fourier 65.33 61.09 62.06 62.21
mfeat-karhunen 67.70 61.71 63.62 63.44
with a significant statistical difference when it is applied on
mfeat-morphological 68.18 68.54 68.94 68.88 data sets with moderate level of noise (5% and 10%).
mfeat-pixel 72.75 74.69 75.78 76.10
mfeat-zernike 55.37 56.00 57.32 57.48 We can express the following comments from the results:
mushroom 99.99 99.99 100.00 100.00
nursery 64.62 65.69 93.51 93.90
optdigits 70.32 76.23 72.52 72.70  Trees built with the IIG and CIIG split criteria (CDT and CCDT)
page-blocks 95.47 95.22 95.45 95.43 obtain better classification results than the trees designed with
pendigits 85.88 85.60 85.67 85.61 the classic IG and IGR split criteria (IGT and IGRT). Moreover,
postoperative-patient-data 53.22 53.89 66.22 66.33
CDTs and CCDTs are smaller trees than IGTs and IGRTs.
primary-tumor 31.12 31.35 34.78 34.87
segment 91.44 91.94 91.09 91.05
 CCDT obtains better accuracy results and smaller trees than
shuttle-landing-control 53.50 54.00 49.00 47.50 CDT. For data sets without noise, CCDT statistically improves
sick 97.42 97.37 97.71 97.74 the classic methods (IGT and IGRT), but CDT is not able to
solar-flare2 98.59 98.77 98.90 98.96 improve these methods with a significant statistical difference
sonar 71.93 71.08 71.54 71.79
on the large set of data sets used. These facts advise the use
soybean 82.24 89.78 88.75 89.22
spambase 89.52 90.82 90.89 90.99 of CCDT instead of CDT.
spectrometer 42.79 39.06 41.24 41.47  For data with noise in all variables, CCDT and CDT manage to
splice 85.54 84.85 87.49 88.13 improve the classic methods IGT and IGRT with significant sta-
sponge 88.32 91.52 90.96 91.13
tistical differences. This conclusion is in accordance with the
tae 39.09 39.09 39.02 39.02
vehicle 66.11 66.35 66.86 66.85
design philosophy for CCDT and CDT, which are optimized for
vote 93.08 93.19 94.60 94.69 imprecise variables manipulation.
vowel 74.94 74.11 68.15 66.79  An important advantage of CDT and CCDT is that they build
waveform 65.48 65.19 68.46 68.84 small trees by using only one step. That is, they do not need a
wine 87.89 87.24 87.79 87.78
postprocessing phase for pruning a high tree. In this manner,
zoo 89.09 89.88 89.50 89.10
Average 74.76 75.14 76.72 76.74 the construction time of a tree is reduced. This is a beneficial
property when very large data sets are processed.

Finally, we remark the principal points of this experimental


comparative study as follows:
Table 7
Average accuracy results of IGT, IGRT, CDT and CCDT when are applied on data sets
 CDT and CCDT vs. IGT and IGRT: CDT and CCDT obtain always a
with percentage of random noise equal to 0%, 5%, 10%, 20% and 30%.
smaller Friedman’s rank than IGT and IGRT in each experimen-
Tree noise 0% noise 5% noise 10% noise 20% noise 30% tal study for data sets with or without noise. CDT and CCDT
IGT 78.96 78.00 77.49 76.02 74.76 have better classification results than IGT and IGRT and the tree
IGRT 78.97 78.09 77.66 76.38 75.14 size is smaller. CDT is better than IGT and IGRT for data sets
CDT 79.56 78.84 78.65 77.51 76.72
with noise, and CCDT is better than IGT and IGRT for data sets
CCDT 79.63 78.99 78.66 77.57 76.74
with or without noise according with Holm’s test.
2524 C.J. Mantas, J. Abellán / Expert Systems with Applications 41 (2014) 2514–2525

Table 9
Friedman's ranks (α = 0.05) of IGT, IGRT, CDT and CCDT when applied on data sets with percentages of random noise equal to 0%, 5%, 10%, 20% and 30%.

Tree Rank (noise 0%) Rank (noise 5%) Rank (noise 10%) Rank (noise 20%) Rank (noise 30%)
IGT 2.625 2.9 2.9167 3.0917 2.9667
IGRT 2.775 2.9333 2.8333 2.7917 2.7833
CDT 2.475 2.2917 2.2 2.2333 2.2667
CCDT 2.125 1.875 2.05 1.8833 1.9833

Table 10
p-values of the Holm's test for α = 0.05 on the methods IGT, IGRT, CDT and CCDT when they are applied on data sets with a percentage of random noise equal to 0%. Holm's procedure rejects those hypotheses that have a p-value ≤ 0.01.

i Methods p-values
6 IGRT vs. CCDT 0.008333
5 IGT vs. CCDT 0.01
4 CDT vs. CCDT 0.0125
3 IGRT vs. CDT 0.016667
2 IGT vs. IGRT 0.025
1 IGT vs. CDT 0.05

Table 11
p-values of the Holm's test for α = 0.05 on the methods IGT, IGRT, CDT and CCDT when they are applied on data sets with a percentage of random noise equal to 5%. Holm's procedure rejects those hypotheses that have a p-value ≤ 0.025.

i Methods p-values
6 IGRT vs. CCDT 0.008333
5 IGT vs. CCDT 0.01
4 IGRT vs. CDT 0.0125
3 IGT vs. CDT 0.016667
2 CDT vs. CCDT 0.025
1 IGT vs. IGRT 0.05

Table 12
p-values of the Holm's test for α = 0.05 on the methods IGT, IGRT, CDT and CCDT when they are applied on data sets with a percentage of random noise equal to 10%, 20% and 30%. Holm's procedure rejects those hypotheses that have a p-value ≤ 0.025 for noise 10%, and a p-value ≤ 0.016667 for noise 20% and 30%.

i Methods p-values
6 IGRT vs. CCDT 0.008333
5 IGT vs. CCDT 0.01
4 CDT vs. CCDT 0.0125
3 IGRT vs. CDT 0.016667
2 IGT vs. IGRT 0.025
1 IGT vs. CDT 0.05

• CCDT vs. CDT: CCDT has a smaller Friedman's rank than CDT in each experimental study for data sets with or without noise. CCDT has better classification results than CDT, and the trees from CCDTs are smaller than CDTs. CCDT obtains smaller p-values than the ones of CDT in each experimental study for data sets with or without noise. According to Holm's test, CCDT is better than CDT with a significant statistical difference for data sets with moderate noise (5% and 10%). Besides, CCDT is better than IGRT and IGT with a significant statistical difference for data sets without noise, but CDT is not better than IGRT and IGT in that case.

The above remarked points allow us to express the following final conclusions about the experimental study carried out on data sets with general noise: (i) there exists a clear advantage of the methods using imprecision based on imprecise probabilities over the ones based on classic probabilities; (ii) CCDT improves CDT in each experimental study for each level of noise, CCDT is significantly better than CDT on data sets with moderate noise, and it is always better than CDT when both are compared with the classic methods in each experimental study; (iii) CCDT builds the smallest trees compared with the rest of the methods.

6. Conclusions

In this paper, we have presented an analysis of a procedure to build decision trees based on imprecise probabilities and uncertainty measures, called CDT. We have compared this procedure with the classic ones based on Shannon's entropy for precise probabilities. We show that handling imprecision is a key part of obtaining improvements in the method's performance. This analysis allows us to extend the CDT procedure to present a new method with a higher level of imprecision. In this extension, both the class variable and the input features are manipulated with imprecise probabilities.

Finally, we carried out an experimental study on data sets with different levels of general noise, i.e. data sets with noise in all the variables, not only in the class variable, where it has been shown that the CDT procedure has an excellent performance (Abellán & Masegosa, 2012). In our experimental study we use CCDT, CDT and classic procedures based on precise probabilities. We show that CCDT builds smaller trees and obtains better results than the rest of the methods for each level of general noise, and it is even statistically better than CDT when both are applied on data sets with a moderate level of general noise.

Acknowledgments

This work has been supported by the Spanish "Consejería de Economía, Innovación y Ciencia de la Junta de Andalucía" under Project TIC-6016.

References

Abellán, J. (2006). Uncertainty measures on probability intervals from Imprecise Dirichlet model. International Journal of General Systems, 35(5), 509–528.
Abellán, J. (2006). Application of uncertainty measures on credal sets on the naive Bayes classifier. International Journal of General Systems, 35, 675–686.
Abellán, J., Klir, G. J., & Moral, S. (2006). Disaggregated total uncertainty measure for credal sets. International Journal of General Systems, 35(1), 29–44.
Abellán, J., & Masegosa, A. (2008). Requirements for total uncertainty measures in Dempster–Shafer theory of evidence. International Journal of General Systems, 37(6), 733–747.
Abellán, J., & Masegosa, A. (2009). A filter-wrapper method to select variables for the Naive Bayes classifier based on credal decision trees. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 17(6), 833–854.
Abellán, J., & Masegosa, A. R. (2009). An experimental study about simple decision trees for Bagging ensemble on data sets with classification noise. In C. Sossai & G. Chemello (Eds.), ECSQARU, LNCS (Vol. 5590, pp. 446–456). Springer.
Abellán, J., & Masegosa, A. R. (2010). An ensemble method using credal decision trees. European Journal of Operational Research, 205(1), 218–226.
Abellán, J., & Masegosa, A. (2012). Bagging schemes on the presence of noise in classification. Expert Systems with Applications, 39(8), 6827–6837.
Abellán, J., & Moral, S. (2003). Maximum entropy for credal sets. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(5), 587–597.

Abellán, J., & Moral, S. (2003). Building classification trees using the total uncertainty criterion. International Journal of Intelligent Systems, 18(12), 1215–1225.
Abellán, J., & Moral, S. (2005). Upper entropy of credal sets. Applications to credal classification. International Journal of Approximate Reasoning, 39(2–3), 235–255.
Abellán, J., & Moral, S. (2006). An algorithm that computes the upper entropy for order-2 capacities. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 14(2), 141–154.
Alcalá-Fdez, J., Sánchez, L., García, S., Del Jesus, M. J., Ventura, S., Garrell, J. M., et al. (2009). KEEL: A software tool to assess evolutionary algorithms to data mining problems. Soft Computing, 13(3), 307–318.
Demsar, J. (2006). Statistical comparison of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022–1027). San Mateo: Morgan Kaufmann.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675–701.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11, 86–92.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
Jaynes, E. T. (1982). On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9), 939–952.
Khoshgoftaar, T. M., & Van Hulse, J. (2009). Empirical case studies in attribute noise detection. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 39(4), 379–388.
Klir, G. J. (2006). Uncertainty and information. Foundations of generalized information theory. Hoboken, NJ: John Wiley.
Nettleton, D. F., Orriols-Puig, A., & Fornells, A. (2010). A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 33(4), 275–306.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann (Morgan Kaufmann Series in Machine Learning).
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656.
Van Hulse, J. D., Khoshgoftaar, T. M., & Huang, H. (2007). The pairwise attribute noise detection algorithm. Knowledge and Information Systems, 11(2), 171–190.
Walley, P. (1996). Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society, Series B, 58, 3–57.
Wang, Y. (2010). Imprecise probabilities based on generalised intervals for system reliability assessment. International Journal of Reliability and Safety, 4(30), 319–342.
Weichselberger, K. (2000). The theory of interval-probability as a unifying concept for uncertainty. International Journal of Approximate Reasoning, 24(2–3), 149–170.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.
Zhu, X., & Wu, X. (2004). Class noise vs. attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3), 177–210.

Further Reading

Abellán, J. (2011). Combining nonspecificity measures in Dempster–Shafer theory of evidence. International Journal of General Systems, 40(6), 611–622.
Yager, R. (1992). On the specificity of a possibility distribution. Fuzzy Sets and Systems, (3), 279–292.
