Abstract
Data Mining (DM) aims to extract knowledge from data, easing the decision-making process. However, DM deals with a large amount of data and most learning algorithms are not prepared to process it directly. One approach used to ease this problem is related to data sampling and the construction of ensembles of classifiers. However, ensembles usually cannot explain their decisions to the user. This work explores the construction of ensembles of symbolic classifiers, combined by voting mechanisms, that are able to explain their decisions to the user. These methods were implemented in the ELE system, and the experimental results obtained are encouraging.
1 Introduction

Data Mining (DM) aims to extract knowledge from large datasets [3]. To accomplish this task, supervised learning algorithms can be used. However, most
learning algorithms are not prepared to deal with large datasets. Several tech-
niques have been proposed to overcome this problem, among them techniques of
data sampling [4, 5, 9, 16]. After sampling, ensembles can be constructed in order to combine the classifiers induced in each sample. To classify a new instance, the predictions of the individual classifiers are combined in some way. Furthermore, ensembles can be more accurate than the base-classifiers they are made of [8].
An important property of classifiers in many applications is the ability to explain their decisions to the user, and ensemble methods often lack this property. To fill this gap, this work explores several ways of constructing ensembles of symbolic classifiers in such a way that it is possible to explain the ensemble's decisions to the user.
The rest of this work is organized as follows: Section 2 introduces the notation and definitions used; Section 3 describes related work; Section 4 explains our proposed method; the remaining sections present the experimental evaluation and conclude the work.
2 Definitions and Notation
In supervised learning, a learning algorithm receives a set of N training examples (x1, y1), ..., (xN, yN) for some unknown function y = f(x). The xi values are typically vectors of the form (xi1, xi2, ..., xim) whose components are discrete or real values, called features or attributes. Thus, xij denotes the value of the j-th feature Xj of xi. In what follows, the i subscript will be left out when implied by the context. For classification purposes, the y values are drawn from a discrete set of NCl classes, i.e., y ∈ {C1, C2, ..., CNCl}. A symbolic classifier h can be expressed as a set of NR rules, i.e., h = {R1, R2, ..., RNR}. Recall that most classification rule learning algorithms follow the covering approach: in each iteration, the algorithm searches for the best rule and removes the examples it covers. This process is repeated on the remaining examples until all examples have been covered or some stopping
criterion is met. Finally, a classifier is built gathering the rules to form an or-
dered rule list (or decision list) in case all covered examples were removed in
each iteration, or to form an unordered rule set in case only examples correctly
covered were removed in each iteration. Algorithms from the second family, such as decision tree learners, use the divide-and-conquer strategy; the induced trees can also be converted into rules. In both cases, rule conditions are built from feature tests of the form Xi op Value, where Xi is a feature name, op is an operator in the set {=, ≠, <, ≤, >, ≥},
and Value is a valid Xi feature value. A rule R assumes the form if B then H, also written B → H, where H stands for the head, or rule conclusion, and B for the body, or rule condition. H and B are both complexes with no features in common. The instances that satisfy the B part compose the covered set of R, called B set in this work; in other words, these instances are covered by R. Instances that satisfy both B and H are correctly covered by R, and these instances belong to the B ∩ H set; instances that satisfy B but not H are incorrectly covered by the rule, and belong to set B ∩ H̄. On the other hand, instances that do not satisfy the B part are not covered by the rule, and belong to set B̄. Given a rule and a set of N instances, denoting the cardinality of a set A by a = |A|, then b and h in Table 1 denote the number of instances in sets B and H, while b̄ and h̄ denote the number of instances in sets B̄ and H̄. The contingency matrix of a rule R enables the calculation of several rule quality measures, such as negative confidence, accuracy and Laplace accuracy.
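As an illustration, feature tests of this form can be evaluated against an example as follows. This is a minimal sketch: the encoding of an example as a dict and of a rule body as (feature, operator, value) triples is an assumption for illustration, not the paper's actual representation.

```python
import operator

# Sketch: evaluating feature tests of the form "Xi op Value" over an
# example represented as a dict {feature_name: value}. The (feature,
# op, value) triple encoding is an illustrative assumption.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def covers(body, x):
    """True if example x satisfies every feature test in the body B."""
    return all(OPS[op](x[feat], value) for feat, op, value in body)
```

For instance, a body `[("parents", "=", "usual"), ("children", "<=", 3)]` covers an example only when both tests hold.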
An ensemble is a set of classifiers whose individual predictions are combined in some way, in order to predict the label of new instances. Although ensembles can reduce both bias and variance, ensembles can be very large [8]. Another problem is related to the fact that most ensembles cannot explain their decisions.
Table 1: Contingency matrix for B → H

          B       B̄
  H       bh      b̄h      h
  H̄       bh̄     b̄h̄     h̄
          b       b̄       N
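A minimal sketch of how the contingency counts of Table 1, and the rule quality measures used later in this work (Acc, Lap, NegRel), might be computed. The exact formulas for Laplace accuracy and negative confidence are assumptions based on common formulations, since the paper's own definitions are not reproduced here.

```python
# Sketch: contingency counts for a rule B -> H over a labeled dataset,
# and three rule quality measures. The Laplace and negative-confidence
# formulas are assumed (common textbook definitions), not quoted.

def contingency(covers, label, examples):
    """covers: predicate x -> bool (x satisfies body B);
    label: the rule head H; examples: list of (x, y) pairs.
    Returns (bh, bh', b'h, b'h') as in Table 1."""
    bh = sum(1 for x, y in examples if covers(x) and y == label)
    bnh = sum(1 for x, y in examples if covers(x) and y != label)
    nbh = sum(1 for x, y in examples if not covers(x) and y == label)
    nbnh = sum(1 for x, y in examples if not covers(x) and y != label)
    return bh, bnh, nbh, nbnh

def accuracy(bh, bnh):
    return bh / (bh + bnh)                    # Acc = bh / b

def laplace(bh, bnh, n_classes=2):
    # Lap = (bh + 1) / (b + NCl), with b = bh + bnh  [assumed form]
    return (bh + 1) / (bh + bnh + n_classes)

def neg_reliability(nbh, nbnh):
    # NegRel: fraction of uncovered instances that are indeed
    # negative, i.e. b'h' / b'  [assumed form]
    return nbnh / (nbh + nbnh)
```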
3 Related Work
Ensembles have been used, for instance, for on-line classification of large databases that do not fit into the memory [5, 4]. There are different ways in which ensembles can be generated and the resulting output combined to classify new instances (see [12] for a review of methods and applications). The construction of an ensemble can be broken into two sub-tasks [8]. The first one consists of generating a set of base-classifiers. The second one consists of deciding how to combine the classifications of the base-classifiers. Most research concentrates on varying the data used for training in order to construct the base-classifiers, using techniques such as bagging [2], boosting [10], wagging [1], and others [20]. To reach improvements, the base-classifiers should be diverse. On the other hand, recent papers focus on the combination of classifiers (second sub-task), which can be induced independently using different learning algorithms [15], such as stacking, which constructs a meta-classifier to combine the base-classifiers' predictions. Other approaches include the use of different aggregation operators [7], which are used to fuse the predictions of the classifiers, since multiple classifier fusion may generate more accurate predictions. However, a drawback of most of these ensembles is that they are not able to explain their classification decisions on new examples, like symbolic classifiers do. Our proposal combines symbolic base-classifiers and three voting mechanisms that enable the ensemble to explain its decisions
using a small number of base-classifiers.
Figure 1 shows the proposed method for constructing an ensemble from a training dataset S. First of all, L samples S1, ..., SL, which can be drawn with or without replacement, are extracted from S, and each sample S1, ..., SL is used to induce one base-classifier. Note that a different symbolic learning algorithm can be used in each of the L samples to induce the L base-classifiers. The last step of the method is where Combine(h1(x), ..., hL(x)) constitutes the symbolic ensemble h∗(x).

There are two ways to classify x using the L base-classifiers, i.e., two ways of finding h1(x), ..., hL(x):

1. Using each classifier as a "black-box", where the class label directly assigned by the classifier to x is used;

2. Exploring the rules inside the classifier, where the best classifier's rule that covers x, according to some rule quality measure, is used to classify x. In unordered rule sets, the rules that cover a new example can be considered as a sort of explanation of its classification.
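The two classification criteria can be sketched as follows. The representation of a classifier as a list of (body, head, quality) triples, with the body as a predicate over x, is an illustrative assumption, not the paper's actual data structure.

```python
# Sketch: two ways of obtaining h_l(x) from a rule-based classifier.
# A classifier is modeled as a list of (body, head, quality) triples;
# this encoding is an assumption made for illustration.

def classify_black_box(rules, x, default):
    """Criterion 1: use the classifier as a black-box -- return the
    head of the first rule whose body covers x (ordered rule list)."""
    for body, head, _ in rules:
        if body(x):
            return head
    return default

def classify_best_rule(rules, x, default):
    """Criterion 2: among all rules covering x, pick the one with the
    highest quality measure value and return its head."""
    covering = [(quality, head) for body, head, quality in rules if body(x)]
    if not covering:
        return default
    return max(covering)[1]
```

The second criterion is what makes the ensemble's decisions explainable: the single best covering rule of each base-classifier can be shown to the user.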
Figure 1: A method for constructing ensembles of classifiers
In this work, the best rule is selected using accuracy, Laplace accuracy and negative confidence, defined in Section 2, as quality measures.

For the Combine(h1(x), ..., hL(x)) function, the following three methods were implemented:

1. Unweighted Voting – UV: the class label of x is the one that receives the largest number of votes among the L classifiers;

2. Weighted by Mean Voting – WMV: the vote of each classifier is weighted using the classifier's mean error rate m err(hl), and the class label of x is the one having maximum total weight from the L classifiers:
WMV(x, Cv) = max over Ci ∈ {C1, ..., CNCl} of  Σ (l = 1 to L) g(hl(x), Ci)

where

g(hl(x), Ci) = lg((1 − m err(hl)) / m err(hl))   if hl(x) = Ci,
g(hl(x), Ci) = 0                                 otherwise.
3. Weighted by Mean and Standard Error Voting – WMSV: similar to the previous one, but also considering the standard error se err(hl) of each classifier's error rate:
WMSV(x, Cv) = max over Ci ∈ {C1, ..., CNCl} of  Σ (l = 1 to L) g(hl(x), Ci)

where

g(hl(x), Ci) = lg((1 − m err(hl)) / m err(hl)) + lg((1 − se err(hl)) / se err(hl))   if hl(x) = Ci,
g(hl(x), Ci) = 0                                                                     otherwise.
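The three Combine() voting methods can be sketched as follows, assuming each mean error rate and standard error lies strictly between 0 and 1; the base of the logarithm "lg" only rescales the weights, so natural log is used here.

```python
import math

# Sketch of the three Combine() voting methods. preds holds h_l(x) for
# l = 1..L; m_err and se_err hold each classifier's mean error rate
# and standard error (assumed strictly between 0 and 1).

def weight(err):
    return math.log((1 - err) / err)    # "lg" in the text; log-odds

def vote(preds, weights, classes):
    totals = {c: sum(w for p, w in zip(preds, weights) if p == c)
              for c in classes}
    return max(classes, key=lambda c: totals[c])

def uv(preds, classes):                   # Unweighted Voting
    return vote(preds, [1.0] * len(preds), classes)

def wmv(preds, m_err, classes):           # Weighted by Mean Voting
    return vote(preds, [weight(e) for e in m_err], classes)

def wmsv(preds, m_err, se_err, classes):  # Weighted by Mean and Std. Error
    return vote(preds, [weight(e) + weight(s)
                        for e, s in zip(m_err, se_err)], classes)
```

Note how WMV can overturn a simple majority: a single very accurate classifier (low m err) can outweigh two mediocre ones.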
Method 1 (UV) gives the same weight to all hypotheses, while method 2 (WMV) favours hypotheses having lower mean error rates. In similar fashion, method 3 (WMSV) favours hypotheses having lower mean error rates as well as lower standard error rates. In other words, the aim of function g(hl(x), Ci) is to increment the weight of class Ci whenever the error rate and/or the standard error are less than 50%, and to decrement it otherwise. In case the error rate and/or the standard error are equal to 50%, the corresponding term is zero.

The ELE system includes the implementation of the three methods for combining classifiers — UV, WMV and WMSV — described in this work, as well as the methods for classifying examples using the classifiers as black-boxes or their best rules. The system was developed following the object orientation paradigm, and can be easily extended to support other quality measures.
This integration enables us to easily incorporate into the ELE system several learning algorithms to induce the base-classifiers of the ensembles.
In order to illustrate the ELE system, several experiments were conducted using three datasets from the UCI repository [11]: Nursery, Chess (King-Rook versus King-Pawn, Chess for short) and Splice. These datasets are often used in the ensemble literature. Table 2 summarizes the main characteristics of the datasets used in this study, showing, for each dataset: number of instances (# Inst.); number of features (# Features), as well as the number of continuous and discrete features; class distribution (Class %); majority error rate; presence of unknown values; and number of duplicate or conflicting examples. The symbolic learning algorithms used to induce the base-classifiers were CN2 [6] and C4.5 [18]. The experiments were conducted using a combination of the two learning algorithms in five scenarios, described in Table 3.
Table 2: Datasets characteristics summary

Dataset               Nursery               Chess            Splice
# Inst.               12958                 3196             3190
# Features            8                     36               60
  (cont., disc.)      (0, 8)                (0, 36)          (0, 60)
Class (Class %)       not recom (33.34%)    nowin (47.78%)   EI (24.01%)
                      very recom (2.53%)    won (52.22%)     IE (24.08%)
                      priority (32.92%)                      N (51.88%)
                      spec prior (31.21%)
Majority Error        66.66%                47.78%           48.12%
                      in not recom          in won           in N
Unknown Values        N                     N                N
Dup./Conf. Examples   0 (0.00%)             0 (0.00%)        184 (5.77%)
Two classification criteria were evaluated: the classifiers used as black-boxes (referred by "Class"), and the individual rules' classification, in which we used the accuracy, Laplace accuracy and negative confidence rule quality measures, defined in Section 2 (referred respectively by "Acc", "Lap" and "NegRel"). These criteria were combined with the three voting methods (unweighted, weighted by mean, and weighted by mean and standard error voting, referred respectively by UV, WMV and WMSV). For example: UV-Class indicates that each classifier was used as a "black-box" to classify a new example, and their classifications were combined using Unweighted Voting (UV), while WMV-Lap indicates that the rule with the best Laplace accuracy measure value of each classifier was used to classify a new example and the final decision was obtained using Weighted by Mean Voting (WMV). The experiments were carried out using five scenarios, described in Table 3. CN2 was used in scenarios Scn 1 and Scn 3, while C4.5 was used in scenarios Scn 2 and Scn 5. In scenario Scn 4, both algorithms, CN2 (in three samples) and C4.5 (in two samples), were used. As in these experiments sampling was carried out without replacement, the size of the datasets does not allow a greater number of partitions.
The error rates were estimated using 10-fold stratified cross-validation (10-FSCV for short) in the following way. First of all, the dataset was partitioned into 10 independent folds. In each iteration j, one fold was used as test set tej and the nine remaining folds as training set trj. The training set trj was further divided into the L samples Si used to induce the base-classifiers. Only for the purpose of estimating the error rate m err(hi(x)) and the standard error se err(hi(x)) used as weights in the proposed ensembles' voting mechanism, 10-FSCV was again applied to each Si sample. In each j-th iteration, the ensemble was constructed using the base-classifiers induced using all examples in each Si sample, and the j-th ensemble error rate was estimated on the remaining tej test set. The same test set tej was used to estimate the base-classifiers' error rate for each sample Si, i = 1, ..., L. This process was iterated from j = 1 to 10.

Table 3: Experiment description

Experiment   # of Partitions   Learning algorithms
Scn 1        3                 CN2, CN2, CN2
Scn 2        3                 C4.5, C4.5, C4.5
Scn 3        5                 CN2, CN2, CN2, CN2, CN2
Scn 4        5                 CN2, CN2, CN2, C4.5, C4.5
Scn 5        5                 C4.5, C4.5, C4.5, C4.5, C4.5
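The inner cross-validation used to estimate the voting weights can be sketched as follows. The learner is replaced here by a trivial majority-class inducer, an assumption made purely so the sketch is self-contained; any symbolic learner would take its place.

```python
import statistics

# Sketch: k-fold CV over one sample S_i to estimate m_err(h_i) and
# se_err(h_i), used as voting weights. `induce` is a placeholder
# majority-class "learner" (an illustrative assumption).

def folds(data, k):
    # deterministic striding split into k folds
    return [data[i::k] for i in range(k)]

def induce(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)   # majority class

def error_rate(model, test):
    # the "model" here is just the predicted label
    return sum(1 for _, y in test if y != model) / len(test)

def inner_cv_error(sample, k=10):
    """Mean error and standard error of the mean over k CV folds."""
    fs = folds(sample, k)
    errs = []
    for i, te in enumerate(fs):
        if not te:
            continue
        tr = [e for j, f in enumerate(fs) if j != i for e in f]
        errs.append(error_rate(induce(tr), te))
    m_err = statistics.mean(errs)
    se_err = statistics.stdev(errs) / len(errs) ** 0.5
    return m_err, se_err
```

The pair (m err, se err) returned for each sample feeds directly into the WMV and WMSV weights defined earlier.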
Tables 4, 5 and 6 summarize, respectively, the error rate and the standard
error obtained with datasets Nursery, Chess and Splice. The first column identi-
fies the experiment and the next five columns show the results obtained in each
scenario. The first five rows, labeled as S1 , S2 , S3 , S4 and S5 , present the results
related to each base-classifier, while the other rows present the results related
to the ensembles’ construction methods. Results in bold indicate that the en-
semble is better than each one of the base-classifiers with a 95% confidence level
according to a paired t-test. Results from Tables 4 to 6 are also depicted in the graphs shown in Figures 2 to 4, where the X-axis refers to the ensemble's error rate and the Y-axis refers to the minimum error over all the ensemble's base-classifiers. Points labeled "Class - Scn 1", "Class - Scn 2", and so on, are related to results from Scn 1, Scn 2, and so on, using the classifiers as black-boxes, while points labeled "Rule - Scn 1", "Rule - Scn 2", and so on, are related to results using only the best rule of each classifier. Points above the main diagonal line indicate that the ensemble presents better results than its best base-classifier.

Several observations can be drawn from the experimental results. They show that the constructed ensembles can improve the results obtained with their best base-classifiers in most cases. Besides, results
in Tables 4 to 6 show that in two (nursery and splice) out of the three datasets,
the ensembles’ error rate is smaller than their base-classifiers’ error rate. More-
over, according to t-test, these results (ensemble versus base-classifiers) are all
significant with a 95% confidence level. For the chess dataset, the three combination methods (UV, WMV and WMSV) present identical results. However, it
should be observed that the proposed ensemble approach did not degrade the
performance of base-classifiers, even when the ensembles’ error rate are not all
significantly smaller than their base-classifiers' error rate (chess dataset). Taking into account the limited number of base-classifiers used in the experiments (minimum of 3 and maximum of 5), the results can be considered very good.

Another positive point is that the results obtained using the best rule classification criterion are comparable to the ones obtained using the classifiers as "black-boxes". This is an interesting result, since at most one rule from each base-classifier is used to classify, and to help explain, each new example.
Considering the experiments using C4.5 (Scn 2 and Scn 5), it can be observed that they present the same error rate independently of the classification criterion used, i.e., all UV methods (UV-Class, UV-Acc, UV-Lap and UV-NegRel) have equal results, as well as all WMV and WMSV methods. The reason is that C4.5 induces disjoint rules (decision trees), which means there is only one rule covering each example, whatever the quality measure used to rank the rules. Note that, due to the use of exclusive partitions and limited size datasets, the mean error rate of the base-classifiers tends to increase for a higher number of classifiers. This is related to the fact that the used datasets are not so large, and using exclusive partitions as samples implies smaller training sets for each base-classifier.
Table 4: Results using nursery dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 5.60 6.16 7.64 7.75 7.92
(0.13) (0.23) (0.24) (0.19) (0.18)
S2 5.66 6.47 7.24 7.86 7.77
(0.25) (0.21) (0.21) (0.10) (0.24)
S3 5.43 5.97 7.93 7.47 7.80
(0.18) (0.09) (0.26) (0.27) (0.17)
S4 - - 7.83 7.50 7.73
- - (0.22) (0.17) (0.24)
S5 - - 7.82 7.94 7.46
- - (0.16) (0.29) (0.24)
UV-Class 3.86 4.51 4.42
(0.13) (0.16) (0.15)
WMV-Class 4.13 4.81 5.06 4.81 6.41
(0.13) (0.18) (0.18) (0.11) (0.14)
WMSV-Class 4.18 4.95 4.88
(0.14) (0.19) (0.13)
UV-Acc 3.30 3.86 4.41
(0.18) (0.14) (0.11)
WMV-Acc 3.53 4.81 4.47 4.75 6.41
(0.19) (0.18) (0.16) (0.08) (0.14)
WMSV-Acc 3.57 4.38 4.81
(0.21) (0.14) (0.12)
UV-Lap 3.85 4.47 4.51
(0.14) (0.15) (0.14)
WMV-Lap 4.11 4.81 5.02 4.85 6.41
(0.14) (0.18) (0.16) (0.12) (0.14)
WMSV-Lap 4.17 4.91 4.93
(0.15) (0.15) (0.13)
UV-NegRel 3.80 4.30 4.24
(0.14) (0.17) (0.16)
WMV-NegRel 4.24 4.81 5.03 4.65 6.41
(0.14) (0.18) (0.21) (0.14) (0.14)
WMSV-NegRel 4.24 4.94 4.70
(0.16) (0.21) (0.14)
Figure 2: Results using nursery dataset.
Table 5: Results using chess dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 2.72 1.69 3.82 3.63 3.04
(0.35) (0.29) (0.73) (0.72) (0.35)
S2 2.47 1.25 3.57 2.75 3.00
(0.32) (0.19) (0.44) (0.36) (0.46)
S3 2.72 1.75 2.88 3.00 2.53
(0.49) (0.28) (0.27) (0.35) (0.35)
S4 - - 2.94 2.22 3.13
- - (0.47) (0.28) (0.51)
S5 - - 3.04 2.50 3.04
- - (0.44) (0.27) (0.48)
UV-Class, WMV-Class, WMSV-Class       2.50    0.91    2.25    1.60    2.32
                                      (0.33)  (0.16)  (0.33)  (0.25)  (0.35)
UV-Acc, WMV-Acc, WMSV-Acc             2.72    0.91    2.32    1.63    2.32
                                      (0.35)  (0.16)  (0.32)  (0.27)  (0.35)
UV-Lap, WMV-Lap, WMSV-Lap             2.50    0.91    2.25    1.60    2.32
                                      (0.33)  (0.16)  (0.33)  (0.25)  (0.35)
UV-NegRel, WMV-NegRel, WMSV-NegRel    2.53    0.91    2.25    1.56    2.32
                                      (0.33)  (0.16)  (0.33)  (0.26)  (0.35)
Figure 3: Results using chess dataset.
Table 6: Results using splice dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 18.50 9.03 15.33 14.61 11.25
(1.85) (0.38) (0.97) (0.73) (0.70)
S2 15.52 9.37 14.80 16.02 13.01
(1.04) (0.47) (0.92) (1.20) (0.67)
S3 15.92 9.00 15.30 19.75 11.41
(1.28) (0.48) (0.64) (1.73) (0.40)
S4 - - 15.45 11.72 12.13
- - (1.45) (0.74) (0.55)
S5 - - 16.77 11.32 13.26
- - (0.93) (0.52) (0.71)
UV-Class 11.54 7.55 9.72 7.30 8.68
(0.49) (0.39) (0.45) (0.70) (0.42)
WMV-Class 11.13 7.34 9.84 7.08 8.53
(0.39) (0.27) (0.41) (0.66) (0.53)
WMSV-Class 11.19 7.59 9.59 7.15 8.53
(0.37) (0.38) (0.37) (0.62) (0.52)
UV-Acc 11.69 7.55 10.50 7.74 8.68
(0.46) (0.39) (0.58) (0.64) (0.42)
WMV-Acc 11.69 7.34 10.50 7.74 8.53
(0.46) (0.27) (0.58) (0.64) (0.53)
WMSV-Acc 11.69 7.59 10.38 7.68 8.53
(0.46) (0.38) (0.55) (0.58) (0.52)
UV-Lap 10.75 7.55 9.66 7.27 8.68
(0.44) (0.39) (0.42) (0.78) (0.42)
WMV-Lap 10.82 7.34 9.81 7.05 8.53
(0.52) (0.27) (0.46) (0.74) (0.53)
WMSV-Lap 10.88 7.59 9.56 7.05 8.53
(0.49) (0.38) (0.40) (0.66) (0.52)
UV-NegRel 12.85 7.55 9.94 7.52 8.68
(0.61) (0.39) (0.43) (0.65) (0.42)
WMV-NegRel 12.13 7.34 10.16 7.34 8.53
(0.28) (0.27) (0.45) (0.64) (0.53)
WMSV-NegRel 12.26 7.59 9.87 7.40 8.53
(0.26) (0.38) (0.38) (0.60) (0.52)
Figure 4: Results using splice dataset.
In order to illustrate the explanation facility of the symbolic ensembles constructed by our proposed method, we consider the ensemble constructed using the Nursery dataset and the UV-Class combination method in Scn 4. The estimated error rate of this ensemble, which consists of five base-classifiers h1, ..., h5, is about 62% lower than that of its best base-classifier (Table 4). Consider that example x = (usual, less proper, incomplete, 3, convenient, convenient, problematic, recommended), having true class label priority, is given to this ensemble. Example x is classified as priority by the ensemble, and the rule bodies listed next explain this classification to the user. Moreover, observe that some rules are specializations of other rules. For instance, rule 2 is a specialization of rule 1, and rule 1 is itself a specialization of rule 6.
The proposed methods were evaluated using three limited size datasets, few base-classifiers and five different scenarios.
For two datasets, the constructed ensembles using the proposed methods always
showed an error rate which was smaller than their base-classifiers’ error rates,
with a 95% of confidence level according to the paired t-test. For the other
dataset, although such good results at a 95% confidence level were obtained in
only one scenario, the ensembles' error rates in the remaining scenarios were not
Example x: (usual, less proper, incomplete, 3, convenient, convenient, problematic, recommended) - Class: priority

Combination Method: UV-Class - Scenario: Scn 4 - Ensemble classification: priority
Body (conditions) of rules that explain classification given by ensemble to exam-
ple x:
1 – parents = usual AND has nurs = less proper AND health = recommended
2 – parents = usual AND has nurs = less proper AND form = incomplete AND
health = recommended
3 – parents = usual AND has nurs = less proper AND housing = convenient AND
health = recommended
4 – parents = usual AND has nurs = less proper AND housing = convenient AND
finance = convenient AND health = recommended
5 – parents = usual AND has nurs = less proper AND social = problematic AND
health = recommended
6 – health = recommended AND has nurs = less proper
7 – health = recommended AND has nurs = less proper AND parents = usual
AND housing = convenient AND finance = convenient
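With rule bodies encoded as sets of feature tests, the specialization relation among rules such as the ones above can be checked mechanically. The set encoding and the helper below are illustrative assumptions, not part of the ELE system.

```python
# Sketch: a rule is a specialization of another if its body contains
# every condition of the other's body, plus at least one more.
# Conditions are encoded as (feature, value) pairs -- an assumption.

def is_specialization(specific, general):
    return set(general) <= set(specific) and set(general) != set(specific)

# Bodies of rules 1, 2 and 6 from the listing above:
r1 = {("parents", "usual"), ("has_nurs", "less_proper"),
      ("health", "recommended")}
r2 = r1 | {("form", "incomplete")}
r6 = {("health", "recommended"), ("has_nurs", "less_proper")}
```

Rule 2 specializes rule 1 (it adds the `form` test), and rule 1 specializes rule 6 (it adds the `parents` test).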
higher than their base-classifiers' error rates. These results are encouraging, since only a small number of base-classifiers was used.

The current ELE explanation mechanism shows the user all the different individual classifiers' rules that correctly cover the example, i.e., the fired rules from the classifiers that voted for the predicted class. Ongoing work includes filtering the set of explanatory rules shown to the user, for instance by removing rules that are specializations of other shown rules. Ongoing work also includes incorporating other rule quality measures into the ELE system.
Acknowledgments
This research was supported by the Brazilian research councils CAPES and CNPq. We also thank the anonymous reviewers for their comments on this paper.
References
[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1/2):105–139, 1999.
[5] N. V. Chawla, T. E. Moore, L. O. Hall, K. W. Bowyer, W. P. Kegelmeyer, and C. Springer. Distributed learning with bagging-like performance. Pattern Recognition Letters, 24(1–3):455–471, 2003.
[6] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Proc. 5th European Working Session on Learning (EWSL-91), pages 151–163. Springer, 1991.
6(2):131–152, 2002.
[11] S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
Wiley, 2003.
[14] N. Lavrac, P. Flach, and B. Zupan. Rule evaluation measures: a unifying view. In Proc. 9th International Workshop on Inductive Logic Programming (ILP-99), pages 174–185. Springer, 1999.
methods in bagging. In KDD ’03: Proc. 9th ACM SIGKDD Inter. Conf. on
Knowledge Discovery and Data Mining, pages 595–600. ACM Press, 2003.
[16] Huan Liu and Hiroshi Motoda. Instance Selection and Construction for Data Mining. Kluwer Academic Publishers, 2001.