Abstract
Data Mining (DM) aims to extract knowledge from data, easing the decision-making process. However, DM deals with a large amount of data and most learning algorithms are not prepared to process it directly. One approach used to ease this problem is related to data sampling and the construction of ensembles of classifiers. However, ensembles usually cannot explain their decisions to the user. This work explores the construction of ensembles of symbolic classifiers, combined by voting mechanisms, that are able to explain their decisions to the user. These methods were implemented in the ELE system, and the experimental results obtained are encouraging.
1 Introduction

Data Mining (DM) aims to extract knowledge from large datasets [3]. To accomplish this task, supervised learning algorithms can be used. However, most
learning algorithms are not prepared to deal with large datasets. Several tech-
niques have been proposed to overcome this problem, among them techniques of
data sampling [4, 5, 9, 16]. After sampling, ensembles can be constructed in order to combine the classifiers induced in each sample. To classify a new instance, the predictions of the individual classifiers are combined in some way. Furthermore, ensembles can be more accurate than the base-classifiers they are made of [8].
An important property of classifiers in many applications is the ability to explain their decisions to the user, and ensemble methods often lack this property. To fill this gap, this work explores several ways of constructing ensembles of symbolic classifiers in such a way that it is possible to explain the ensemble's decisions to the user.
The rest of this work is organized as follows: Section 2 introduces the notation and definitions used; Section 3 describes related work; Section 4 explains our proposed method; the remaining sections present the experimental evaluation and conclude the work.
2 Definitions and Notation
In supervised learning, a learning algorithm receives a set of N training examples (x1, y1), ..., (xN, yN) for some unknown function y = f(x). The xi values are typically vectors of the form (xi1, xi2, ..., xim) whose components are discrete or real values, called features or attributes. Thus, xij denotes the value of the j-th feature Xj of xi. In what follows, the i subscript will be left out when implied by the context. For classification purposes, the y values are drawn from a discrete set of NCl classes, i.e., y ∈ {C1, C2, ..., CNCl}. A symbolic classifier h can be expressed as a set of NR rules, i.e., h = {R1, R2, ..., RNR}. Recall that most classification rule learning algorithms follow the covering approach: in each iteration, the algorithm searches for the best rule and removes the examples it covers. This process is repeated on the remaining examples until all examples have been covered or some stopping
criterion is met. Finally, a classifier is built gathering the rules to form an or-
dered rule list (or decision list) in case all covered examples were removed in
each iteration, or to form an unordered rule set in case only examples correctly
covered were removed in each iteration. Algorithms from the second family, such as decision tree learners, use the divide-and-conquer strategy; the induced trees can also be converted into rules. In both cases, rule conditions are built from feature tests of the form Xi op Value, where Xi is a feature name, op is an operator in the set {=, ≠, <, ≤, >, ≥},
and Value is a valid Xi feature value. A rule R assumes the form if B then H, also written B → H, where H stands for the head, or rule conclusion, and B for the body, or rule condition. H and B are both complexes with no features in common. The instances that satisfy the B part compose the covered set of R, called B set in this work; in other words, these instances are covered by R. Instances that satisfy both B and H are correctly covered by R, and these instances belong to the B ∩ H set; instances that satisfy B but not H are incorrectly covered by the rule, and belong to set B ∩ H̄. On the other hand, instances that do not satisfy the B part are not covered by the rule, and belong to set B̄. Given a rule and a set of N instances, denoting the cardinality of a set A by a = |A|, then b and h in Table 1 denote the number of instances in sets B and H, while b̄ and h̄ denote the number of instances in sets B̄ and H̄. The contingency matrix of a rule R enables the calculation of several rule quality measures, such as negative confidence, accuracy and Laplace accuracy.
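As an illustration, feature tests of this form can be evaluated against an example as follows. This is a minimal sketch: the encoding of an example as a dict and of a rule body as (feature, operator, value) triples is an assumption for illustration, not the paper's actual representation.

```python
import operator

# Sketch: evaluating feature tests of the form "Xi op Value" over an
# example represented as a dict {feature_name: value}. The (feature,
# op, value) triple encoding is an illustrative assumption.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def covers(body, x):
    """True if example x satisfies every feature test in the body B."""
    return all(OPS[op](x[feat], value) for feat, op, value in body)
```

For instance, a body `[("parents", "=", "usual"), ("children", "<=", 3)]` covers an example only when both tests hold.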
An ensemble is a set of classifiers whose individual predictions are combined in some way, in order to predict the label of new instances. Although ensembles can reduce both bias and variance, ensembles can be very large [8]. Another problem is related to the fact that most ensembles cannot explain their decisions.
Table 1: Contingency matrix for B → H

          B       B̄
  H       bh      b̄h      h
  H̄       bh̄     b̄h̄     h̄
          b       b̄       N
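A minimal sketch of how the contingency counts of Table 1, and the rule quality measures used later in this work (Acc, Lap, NegRel), might be computed. The exact formulas for Laplace accuracy and negative confidence are assumptions based on common formulations, since the paper's own definitions are not reproduced here.

```python
# Sketch: contingency counts for a rule B -> H over a labeled dataset,
# and three rule quality measures. The Laplace and negative-confidence
# formulas are assumed (common textbook definitions), not quoted.

def contingency(covers, label, examples):
    """covers: predicate x -> bool (x satisfies body B);
    label: the rule head H; examples: list of (x, y) pairs.
    Returns (bh, bh', b'h, b'h') as in Table 1."""
    bh = sum(1 for x, y in examples if covers(x) and y == label)
    bnh = sum(1 for x, y in examples if covers(x) and y != label)
    nbh = sum(1 for x, y in examples if not covers(x) and y == label)
    nbnh = sum(1 for x, y in examples if not covers(x) and y != label)
    return bh, bnh, nbh, nbnh

def accuracy(bh, bnh):
    return bh / (bh + bnh)                    # Acc = bh / b

def laplace(bh, bnh, n_classes=2):
    # Lap = (bh + 1) / (b + NCl), with b = bh + bnh  [assumed form]
    return (bh + 1) / (bh + bnh + n_classes)

def neg_reliability(nbh, nbnh):
    # NegRel: fraction of uncovered instances that are indeed
    # negative, i.e. b'h' / b'  [assumed form]
    return nbnh / (nbh + nbnh)
```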
3 Related Work
Ensembles have been used, for instance, for on-line classification of large databases that do not fit into the memory [5, 4]. There are different ways in which ensembles can be generated and the resulting output combined to classify new instances (see [12] for a review of methods and applications). The construction of an ensemble can be broken into two sub-tasks [8]. The first one consists of generating a set of base-classifiers. The second one consists of deciding how to combine the classifications of the base-classifiers. Most research concentrates on varying the data used for training in order to construct the base-classifiers, using techniques such as bagging [2], boosting [10], wagging [1], and others [20]. To reach improvements, the base-classifiers should be diverse. On the other hand, recent papers focus on the combination of classifiers (second sub-task), which can be induced independently using different learning algorithms [15], such as stacking, which constructs a meta-classifier to combine the base-classifiers' predictions. Other approaches include the use of different aggregation operators [7], which are used to fuse the predictions of the classifiers, since multiple classifier fusion may generate more accurate predictions. However, a drawback of most of these ensembles is that they are not able to explain their classification decisions on new examples, like symbolic classifiers do. Our proposal combines symbolic base-classifiers and three voting mechanisms that enable the ensemble to explain its decisions
using a small number of base-classifiers.
Figure 1 shows the proposed method for constructing an ensemble from a training dataset S. First of all, L samples S1, ..., SL, which can be drawn with or without replacement, are extracted from S, and each sample S1, ..., SL is used to induce one base-classifier. Note that a different symbolic learning algorithm can be used in each of the L samples to induce the L base-classifiers. The last step of the method is where Combine(h1(x), ..., hL(x)) constitutes the symbolic ensemble h∗(x).

There are two ways to classify x using the L base-classifiers, i.e., two ways of finding h1(x), ..., hL(x):

1. Using each classifier as a "black-box", where the class label directly assigned by the classifier to x is used;

2. Exploring the rules inside the classifier, where the best classifier's rule that covers x, according to some rule quality measure, is used to classify x. In unordered rule sets, the rules that cover a new example can be considered as a sort of explanation of its classification.
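The two classification criteria can be sketched as follows. The representation of a classifier as a list of (body, head, quality) triples, with the body as a predicate over x, is an illustrative assumption, not the paper's actual data structure.

```python
# Sketch: two ways of obtaining h_l(x) from a rule-based classifier.
# A classifier is modeled as a list of (body, head, quality) triples;
# this encoding is an assumption made for illustration.

def classify_black_box(rules, x, default):
    """Criterion 1: use the classifier as a black-box -- return the
    head of the first rule whose body covers x (ordered rule list)."""
    for body, head, _ in rules:
        if body(x):
            return head
    return default

def classify_best_rule(rules, x, default):
    """Criterion 2: among all rules covering x, pick the one with the
    highest quality measure value and return its head."""
    covering = [(quality, head) for body, head, quality in rules if body(x)]
    if not covering:
        return default
    return max(covering)[1]
```

The second criterion is what makes the ensemble's decisions explainable: the single best covering rule of each base-classifier can be shown to the user.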
Figure 1: A method for constructing ensembles of classifiers
In this work, the best rule is selected using accuracy, Laplace accuracy and negative confidence, defined in Section 2, as quality measures.

For the Combine(h1(x), ..., hL(x)) function, the following three methods were implemented:

1. Unweighted Voting – UV: the class label of x is the one that receives the largest number of votes among the L classifiers;

2. Weighted by Mean Voting – WMV: the vote of each classifier is weighted using the classifier's mean error rate m err(hl), and the class label of x is the one having maximum total weight from the L classifiers:
WMV(x, Cv) = max over Ci ∈ {C1, ..., CNCl} of  Σ (l = 1 to L) g(hl(x), Ci)

where

g(hl(x), Ci) = lg((1 − m err(hl)) / m err(hl))   if hl(x) = Ci,
g(hl(x), Ci) = 0                                 otherwise.
3. Weighted by Mean and Standard Error Voting – WMSV: similar to the previous one, but also considering the standard error se err(hl) of each classifier's error rate:
WMSV(x, Cv) = max over Ci ∈ {C1, ..., CNCl} of  Σ (l = 1 to L) g(hl(x), Ci)

where

g(hl(x), Ci) = lg((1 − m err(hl)) / m err(hl)) + lg((1 − se err(hl)) / se err(hl))   if hl(x) = Ci,
g(hl(x), Ci) = 0                                                                     otherwise.
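The three Combine() voting methods can be sketched as follows, assuming each mean error rate and standard error lies strictly between 0 and 1; the base of the logarithm "lg" only rescales the weights, so natural log is used here.

```python
import math

# Sketch of the three Combine() voting methods. preds holds h_l(x) for
# l = 1..L; m_err and se_err hold each classifier's mean error rate
# and standard error (assumed strictly between 0 and 1).

def weight(err):
    return math.log((1 - err) / err)    # "lg" in the text; log-odds

def vote(preds, weights, classes):
    totals = {c: sum(w for p, w in zip(preds, weights) if p == c)
              for c in classes}
    return max(classes, key=lambda c: totals[c])

def uv(preds, classes):                   # Unweighted Voting
    return vote(preds, [1.0] * len(preds), classes)

def wmv(preds, m_err, classes):           # Weighted by Mean Voting
    return vote(preds, [weight(e) for e in m_err], classes)

def wmsv(preds, m_err, se_err, classes):  # Weighted by Mean and Std. Error
    return vote(preds, [weight(e) + weight(s)
                        for e, s in zip(m_err, se_err)], classes)
```

Note how WMV can overturn a simple majority: a single very accurate classifier (low m err) can outweigh two mediocre ones.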
Method 1 (UV) gives the same weight to all hypotheses, while method 2 (WMV) favours hypotheses having lower mean error rates. In similar fashion, method 3 (WMSV) favours hypotheses having lower mean error rates as well as lower standard error rates. In other words, the aim of function g(hl(x), Ci) is to increment the weight of class Ci whenever the error rate and/or the standard error are less than 50%, and to decrement it otherwise. In case the error rate and/or the standard error are equal to 50%, the corresponding term is zero.

The ELE system includes the implementation of the three methods for combining classifiers — UV, WMV and WMSV — described in this work, as well as the methods for classifying examples using the classifiers as black-boxes or their best rules. The system was developed following the object orientation paradigm, and can be easily extended to support other quality measures.
This integration enables us to easily incorporate into the ELE system several learning algorithms to induce the base-classifiers of the ensembles.
In order to illustrate the ELE system, several experiments were conducted using three datasets from the UCI repository [11]: Nursery, Chess (King-Rook versus King-Pawn, Chess for short) and Splice. These datasets are often used in the ensemble literature. Table 2 summarizes the main characteristics of the datasets used in this study, showing, for each dataset: number of instances (# Inst.); number of features (# Features), as well as the number of continuous and discrete features; class distribution (Class %); majority error rate; presence of unknown values; and number of duplicate or conflicting examples. The symbolic learning algorithms used to induce the base-classifiers were CN2 [6] and C4.5 [18]. The experiments were conducted using a combination of the two learning algorithms in five scenarios, described in Table 3.
Table 2: Datasets characteristics summary

Dataset               Nursery               Chess            Splice
# Inst.               12958                 3196             3190
# Features            8                     36               60
  (cont., disc.)      (0, 8)                (0, 36)          (0, 60)
Class (Class %)       not recom (33.34%)    nowin (47.78%)   EI (24.01%)
                      very recom (2.53%)    won (52.22%)     IE (24.08%)
                      priority (32.92%)                      N (51.88%)
                      spec prior (31.21%)
Majority Error        66.66%                47.78%           48.12%
                      in not recom          in won           in N
Unknown Values        N                     N                N
Dup./Conf. Examples   0 (0.00%)             0 (0.00%)        184 (5.77%)
Two classification criteria were evaluated: the classifiers used as black-boxes (referred by "Class"), and the individual rules' classification, in which we used the accuracy, Laplace accuracy and negative confidence rule quality measures, defined in Section 2 (referred respectively by "Acc", "Lap" and "NegRel"). These criteria were combined with the three voting methods (unweighted, weighted by mean, and weighted by mean and standard error voting, referred respectively by UV, WMV and WMSV). For example: UV-Class indicates that each classifier was used as a "black-box" to classify a new example, and their classifications were combined using Unweighted Voting (UV), while WMV-Lap indicates that the rule with the best Laplace accuracy measure value of each classifier was used to classify a new example and the final decision was obtained using Weighted by Mean Voting (WMV). The experiments were carried out using five scenarios, described in Table 3. CN2 was used in scenarios Scn 1 and Scn 3, while C4.5 was used in scenarios Scn 2 and Scn 5. In scenario Scn 4, both algorithms, CN2 (in three samples) and C4.5 (in two samples), were used. As in these experiments sampling was carried out without replacement, the size of the datasets does not allow a greater number of partitions.
The error rates were estimated using 10-fold stratified cross-validation (10-FSCV for short) in the following way. First of all, the dataset was partitioned into 10 independent folds. In each iteration j, one fold was used as test set tej and the nine remaining folds as training set trj. The training set trj was further divided into the L samples Si used to induce the base-classifiers. Only for the purpose of estimating the error rate m err(hi(x)) and the standard error se err(hi(x)) used as weights in the proposed ensembles' voting mechanism, 10-FSCV was again applied to each Si sample. In each j-th iteration, the ensemble was constructed using the base-classifiers induced using all examples in each Si sample, and the j-th ensemble error rate was estimated on the remaining tej test set. The same test set tej was used to estimate the base-classifiers' error rate for each sample Si, i = 1, ..., L. This process was iterated from j = 1 to 10.

Table 3: Experiment description

Experiment   # of Partitions   Learning algorithms
Scn 1        3                 CN2, CN2, CN2
Scn 2        3                 C4.5, C4.5, C4.5
Scn 3        5                 CN2, CN2, CN2, CN2, CN2
Scn 4        5                 CN2, CN2, CN2, C4.5, C4.5
Scn 5        5                 C4.5, C4.5, C4.5, C4.5, C4.5
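The inner cross-validation used to estimate the voting weights can be sketched as follows. The learner is replaced here by a trivial majority-class inducer, an assumption made purely so the sketch is self-contained; any symbolic learner would take its place.

```python
import statistics

# Sketch: k-fold CV over one sample S_i to estimate m_err(h_i) and
# se_err(h_i), used as voting weights. `induce` is a placeholder
# majority-class "learner" (an illustrative assumption).

def folds(data, k):
    # deterministic striding split into k folds
    return [data[i::k] for i in range(k)]

def induce(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)   # majority class

def error_rate(model, test):
    # the "model" here is just the predicted label
    return sum(1 for _, y in test if y != model) / len(test)

def inner_cv_error(sample, k=10):
    """Mean error and standard error of the mean over k CV folds."""
    fs = folds(sample, k)
    errs = []
    for i, te in enumerate(fs):
        if not te:
            continue
        tr = [e for j, f in enumerate(fs) if j != i for e in f]
        errs.append(error_rate(induce(tr), te))
    m_err = statistics.mean(errs)
    se_err = statistics.stdev(errs) / len(errs) ** 0.5
    return m_err, se_err
```

The pair (m err, se err) returned for each sample feeds directly into the WMV and WMSV weights defined earlier.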
Tables 4, 5 and 6 summarize, respectively, the error rate and the standard
error obtained with datasets Nursery, Chess and Splice. The first column identi-
fies the experiment and the next five columns show the results obtained in each
scenario. The first five rows, labeled as S1 , S2 , S3 , S4 and S5 , present the results
related to each base-classifier, while the other rows present the results related
to the ensembles’ construction methods. Results in bold indicate that the en-
semble is better than each one of the base-classifiers with a 95% confidence level
according to a paired t-test. Results from Tables 4 to 6 are also depicted in the graphs shown in Figures 2 to 4, where the X-axis refers to the ensemble's error rate and the Y-axis refers to the minimum error over all the ensemble's base-classifiers. Points labeled "Class - Scn 1", "Class - Scn 2", and so on, are related to results from Scn 1, Scn 2, and so on, using the classifiers as black-boxes, while points labeled "Rule - Scn 1", "Rule - Scn 2", and so on, are related to results using only the best rule of each classifier. Points above the main diagonal line indicate that the ensemble presents better results than its best base-classifier.

Several observations can be drawn from the experimental results. They show that the constructed ensembles can improve the results obtained with their best base-classifiers in most cases. Besides, results
in Tables 4 to 6 show that in two (nursery and splice) out of the three datasets,
the ensembles’ error rate is smaller than their base-classifiers’ error rate. More-
over, according to t-test, these results (ensemble versus base-classifiers) are all
significant with a 95% confidence level. For the chess dataset, the three combination methods (UV, WMV and WMSV) present identical results. However, it
should be observed that the proposed ensemble approach did not degrade the
performance of base-classifiers, even when the ensembles’ error rate are not all
significantly smaller than their base-classifiers' error rate (chess dataset). Taking into account the limited number of base-classifiers used in the experiments (minimum of 3 and maximum of 5), the results can be considered very good.

Another positive point is that the results obtained using the best rule classification criterion are comparable to the ones obtained using the classifiers as "black-boxes". This is an interesting result, since at most one rule from each base-classifier is used to classify, and to help explain, each new example.
Considering the experiments using C4.5 (Scn 2 and Scn 5), it can be observed that they present the same error rate independently of the classification criterion used, i.e., all UV methods (UV-Class, UV-Acc, UV-Lap and UV-NegRel) have equal results, as well as all WMV and WMSV methods. The reason is that C4.5 induces disjoint rules (decision trees), which means there is only one rule covering each example, whatever the quality measure used to rank the rules. Note that, due to the use of exclusive partitions and limited size datasets, the mean error rate of the base-classifiers tends to increase for a higher number of classifiers. This is related to the fact that the used datasets are not so large, and using exclusive partitions as samples implies smaller training sets for each base-classifier.
Table 4: Results using nursery dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 5.60 6.16 7.64 7.75 7.92
(0.13) (0.23) (0.24) (0.19) (0.18)
S2 5.66 6.47 7.24 7.86 7.77
(0.25) (0.21) (0.21) (0.10) (0.24)
S3 5.43 5.97 7.93 7.47 7.80
(0.18) (0.09) (0.26) (0.27) (0.17)
S4 - - 7.83 7.50 7.73
- - (0.22) (0.17) (0.24)
S5 - - 7.82 7.94 7.46
- - (0.16) (0.29) (0.24)
UV-Class 3.86 4.51 4.42
(0.13) (0.16) (0.15)
WMV-Class 4.13 4.81 5.06 4.81 6.41
(0.13) (0.18) (0.18) (0.11) (0.14)
WMSV-Class 4.18 4.95 4.88
(0.14) (0.19) (0.13)
UV-Acc 3.30 3.86 4.41
(0.18) (0.14) (0.11)
WMV-Acc 3.53 4.81 4.47 4.75 6.41
(0.19) (0.18) (0.16) (0.08) (0.14)
WMSV-Acc 3.57 4.38 4.81
(0.21) (0.14) (0.12)
UV-Lap 3.85 4.47 4.51
(0.14) (0.15) (0.14)
WMV-Lap 4.11 4.81 5.02 4.85 6.41
(0.14) (0.18) (0.16) (0.12) (0.14)
WMSV-Lap 4.17 4.91 4.93
(0.15) (0.15) (0.13)
UV-NegRel 3.80 4.30 4.24
(0.14) (0.17) (0.16)
WMV-NegRel 4.24 4.81 5.03 4.65 6.41
(0.14) (0.18) (0.21) (0.14) (0.14)
WMSV-NegRel 4.24 4.94 4.70
(0.16) (0.21) (0.14)
Figure 2: Results using nursery dataset.
Table 5: Results using chess dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 2.72 1.69 3.82 3.63 3.04
(0.35) (0.29) (0.73) (0.72) (0.35)
S2 2.47 1.25 3.57 2.75 3.00
(0.32) (0.19) (0.44) (0.36) (0.46)
S3 2.72 1.75 2.88 3.00 2.53
(0.49) (0.28) (0.27) (0.35) (0.35)
S4 - - 2.94 2.22 3.13
- - (0.47) (0.28) (0.51)
S5 - - 3.04 2.50 3.04
- - (0.44) (0.27) (0.48)
UV-Class, WMV-Class, WMSV-Class       2.50    0.91    2.25    1.60    2.32
                                      (0.33)  (0.16)  (0.33)  (0.25)  (0.35)
UV-Acc, WMV-Acc, WMSV-Acc             2.72    0.91    2.32    1.63    2.32
                                      (0.35)  (0.16)  (0.32)  (0.27)  (0.35)
UV-Lap, WMV-Lap, WMSV-Lap             2.50    0.91    2.25    1.60    2.32
                                      (0.33)  (0.16)  (0.33)  (0.25)  (0.35)
UV-NegRel, WMV-NegRel, WMSV-NegRel    2.53    0.91    2.25    1.56    2.32
                                      (0.33)  (0.16)  (0.33)  (0.26)  (0.35)
Figure 3: Results using chess dataset.
Table 6: Results using splice dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 18.50 9.03 15.33 14.61 11.25
(1.85) (0.38) (0.97) (0.73) (0.70)
S2 15.52 9.37 14.80 16.02 13.01
(1.04) (0.47) (0.92) (1.20) (0.67)
S3 15.92 9.00 15.30 19.75 11.41
(1.28) (0.48) (0.64) (1.73) (0.40)
S4 - - 15.45 11.72 12.13
- - (1.45) (0.74) (0.55)
S5 - - 16.77 11.32 13.26
- - (0.93) (0.52) (0.71)
UV-Class 11.54 7.55 9.72 7.30 8.68
(0.49) (0.39) (0.45) (0.70) (0.42)
WMV-Class 11.13 7.34 9.84 7.08 8.53
(0.39) (0.27) (0.41) (0.66) (0.53)
WMSV-Class 11.19 7.59 9.59 7.15 8.53
(0.37) (0.38) (0.37) (0.62) (0.52)
UV-Acc 11.69 7.55 10.50 7.74 8.68
(0.46) (0.39) (0.58) (0.64) (0.42)
WMV-Acc 11.69 7.34 10.50 7.74 8.53
(0.46) (0.27) (0.58) (0.64) (0.53)
WMSV-Acc 11.69 7.59 10.38 7.68 8.53
(0.46) (0.38) (0.55) (0.58) (0.52)
UV-Lap 10.75 7.55 9.66 7.27 8.68
(0.44) (0.39) (0.42) (0.78) (0.42)
WMV-Lap 10.82 7.34 9.81 7.05 8.53
(0.52) (0.27) (0.46) (0.74) (0.53)
WMSV-Lap 10.88 7.59 9.56 7.05 8.53
(0.49) (0.38) (0.40) (0.66) (0.52)
UV-NegRel 12.85 7.55 9.94 7.52 8.68
(0.61) (0.39) (0.43) (0.65) (0.42)
WMV-NegRel 12.13 7.34 10.16 7.34 8.53
(0.28) (0.27) (0.45) (0.64) (0.53)
WMSV-NegRel 12.26 7.59 9.87 7.40 8.53
(0.26) (0.38) (0.38) (0.60) (0.52)
Figure 4: Results using splice dataset.
In order to illustrate the explanation facility of the symbolic ensembles constructed by our proposed method, we consider the ensemble constructed using the Nursery dataset and the UV-Class combination method in Scn 4. The estimated error rate of this ensemble, which consists of five base-classifiers h1, ..., h5, is about 62% lower than that of its best base-classifier (Table 4). Consider that example x = (usual, less proper, incomplete, 3, convenient, convenient, problematic, recommended), having true class label priority, is given to this ensemble. Example x is classified as priority by the ensemble, and the rule bodies listed next explain this classification to the user. Moreover, observe that some rules are specializations of other rules. For instance, rule 2 is a specialization of rule 1, and rule 1 is itself a specialization of rule 6.
The proposed methods were evaluated using three limited size datasets, few base-classifiers and five different scenarios.
For two datasets, the constructed ensembles using the proposed methods always
showed an error rate which was smaller than their base-classifiers’ error rates,
with a 95% of confidence level according to the paired t-test. For the other
dataset, although such good results at a 95% confidence level were obtained in
only one scenario, the ensembles' error rates in the remaining scenarios were not
Example x: (usual, less proper, incomplete, 3, convenient, convenient, problematic, recommended) - Class: priority

Combination Method: UV-Class - Scenario: Scn 4 - Ensemble classification: priority
Body (conditions) of rules that explain classification given by ensemble to exam-
ple x:
1 – parents = usual AND has nurs = less proper AND health = recommended
2 – parents = usual AND has nurs = less proper AND form = incomplete AND
health = recommended
3 – parents = usual AND has nurs = less proper AND housing = convenient AND
health = recommended
4 – parents = usual AND has nurs = less proper AND housing = convenient AND
finance = convenient AND health = recommended
5 – parents = usual AND has nurs = less proper AND social = problematic AND
health = recommended
6 – health = recommended AND has nurs = less proper
7 – health = recommended AND has nurs = less proper AND parents = usual
AND housing = convenient AND finance = convenient
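With rule bodies encoded as sets of feature tests, the specialization relation among rules such as the ones above can be checked mechanically. The set encoding and the helper below are illustrative assumptions, not part of the ELE system.

```python
# Sketch: a rule is a specialization of another if its body contains
# every condition of the other's body, plus at least one more.
# Conditions are encoded as (feature, value) pairs -- an assumption.

def is_specialization(specific, general):
    return set(general) <= set(specific) and set(general) != set(specific)

# Bodies of rules 1, 2 and 6 from the listing above:
r1 = {("parents", "usual"), ("has_nurs", "less_proper"),
      ("health", "recommended")}
r2 = r1 | {("form", "incomplete")}
r6 = {("health", "recommended"), ("has_nurs", "less_proper")}
```

Rule 2 specializes rule 1 (it adds the `form` test), and rule 1 specializes rule 6 (it adds the `parents` test).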
higher than their base-classifiers' error rates. These results are encouraging, since only a small number of base-classifiers was used.

The current ELE explanation mechanism shows the user all the different individual classifiers' rules that correctly cover the example, i.e., the fired rules from the classifiers that voted for the predicted class. Ongoing work includes filtering the set of explanatory rules shown to the user, for instance by removing rules that are specializations of other shown rules. Ongoing work also includes incorporating other rule quality measures into the ELE system.
Acknowledgments
This research was supported by the Brazilian research councils CAPES and CNPq. We also thank the anonymous reviewers for their comments on this paper.
References
[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1/2):105–139, 1999.
[5] N. V. Chawla, T. E. Moore, L. O. Hall, K. W. Bowyer, W. P. Kegelmeyer, and C. Springer. Distributed learning with bagging-like performance. Pattern Recognition Letters, 24(1–3):455–471, 2003.
[6] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Proc. 5th European Working Session on Learning (EWSL-91), pages 151–163. Springer, 1991.
6(2):131–152, 2002.
[11] S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
Wiley, 2003.
[14] N. Lavrac, P. Flach, and B. Zupan. Rule evaluation measures: a unifying view. In Proc. 9th International Workshop on Inductive Logic Programming (ILP-99), pages 174–185. Springer, 1999.
methods in bagging. In KDD ’03: Proc. 9th ACM SIGKDD Inter. Conf. on
Knowledge Discovery and Data Mining, pages 595–600. ACM Press, 2003.
[16] Huan Liu and Hiroshi Motoda. Instance Selection and Construction for Data Mining. Kluwer Academic Publishers, 2001.