
Constructing Ensembles of Symbolic Classifiers

Flávia Cristina Bernardini, Maria Carolina Monard and


Ronaldo C. Prati
Laboratory of Computational Intelligence — LABIC
Institute of Mathematics and Computer Science — ICMC
University of São Paulo — USP
P. O. Box 668, 13560-970, São Carlos, SP, Brazil
{fbernard,mcmonard,prati}@icmc.usp.br

28th July 2006

Abstract

Learning algorithms are an integral part of the Data Mining (DM) process. However, DM deals with large amounts of data, and most learning algorithms do not operate on massive datasets. A technique often used to ease this problem is data sampling combined with the construction of ensembles of classifiers. Several methods to construct such ensembles have been proposed. However, these methods often lack an explanation facility. This work proposes methods to construct ensembles of symbolic classifiers. These ensembles can be further explored in order to explain their decisions to the user. The methods were implemented in the ELE system, also described in this work. Experimental results on two out of three datasets show improvement over all base-classifiers. Moreover, according to the obtained results, methods based on single-rule classification might be used to improve the explanation facility of ensembles.


1 Introduction

Data Mining (DM) aims to extract knowledge from large datasets [3]. To accomplish this task, supervised learning algorithms can be used. However, most learning algorithms are not prepared to deal with large datasets. Several techniques have been proposed to overcome this problem, among them data sampling [4, 5, 9, 16]. After sampling, ensembles can be constructed in order to combine the classifiers induced from each sample. To classify a new instance, an ensemble usually combines the set of base-classifiers using a voting mechanism. Thus, an ensemble can be considered by nature a hybrid system, since it is a collection of models whose predictions are somehow combined. Furthermore, ensembles can be more accurate than the base-classifiers they are made of [8].

Apart from accuracy in classifying new instances, an important issue in predictive DM is explanation, i.e. what knowledge has been used to classify a new example. However, most well-known ensemble construction methods lack this property. To fill this gap, this work explores several voting methods to construct ensembles using sets of symbolic base-classifiers, in such a way that it is possible to explain the ensemble's decisions to the user. The rest of this work is organized as follows: Section 2 introduces the notation and definitions used; Section 3 describes related work; Section 4 explains our proposal to construct what we call symbolic ensembles; Section 5 outlines the system we have implemented to validate our proposal; Section 6 shows some experimental results; and Section 7 concludes this work.

2 Definitions and Notation

A training dataset T is a set of N classified instances {(x1, y1), ..., (xN, yN)} for some unknown function y = f(x). The xi values are typically vectors of the form (xi1, xi2, ..., xim) whose components are discrete or real values, called features or attributes. Thus, xij denotes the value of the j-th feature Xj of xi. In what follows, the subscript i is left out when implied by the context. For classification purposes, the y values are drawn from a discrete set of NCl classes, i.e. y ∈ {C1, C2, ..., CNCl}. Given a set S ⊆ T of training examples, a learning algorithm induces a classifier h, which is a hypothesis about the true unknown function f. Given new x values, h predicts the corresponding y values.

In this work we consider that a symbolic classifier is a classifier whose description language can be transformed into a set of NR unordered or disjoint rules, i.e. h = {R1, R2, ..., RNR}. Recall that most classification rule learning algorithms belong to one of two families, namely separate-and-conquer and divide-and-conquer algorithms. Algorithms from the first family generally use an iterative greedy set-covering strategy, searching in each iteration for the best rule and removing the examples it covers. This process is repeated on the remaining examples until all examples have been covered or some stopping criterion is met. Finally, a classifier is built by gathering the rules to form an ordered rule list (or decision list), in case all covered examples were removed in each iteration, or an unordered rule set, in case only correctly covered examples were removed in each iteration. Algorithms from the second family — divide-and-conquer — construct a global classifier using a top-down strategy to consecutively refine a partial theory. Generally, the classifier is expressed as a decision tree, which can be written as a set of disjoint unordered rules.

A complex is a disjunction of conjunctions of feature tests in the form Xi op Value, where Xi is a feature name, op is an operator in the set {=, ≠, <, ≤, >, ≥} and Value is a valid value for feature Xi. A rule R assumes the form if B then H, or symbolically B → H, where H stands for the head, or rule conclusion, and B for the body, or rule condition. H and B are both complexes with no features in common. In a classification rule, the head H assumes the form class = Ci, where Ci ∈ {C1, ..., CNCl}.

The coverage of a rule is defined as follows. Considering a rule R = B → H, the instances that satisfy the B part compose the covered set of R, called the B set in this work; in other words, these instances are covered by R. Instances that satisfy both B and H are correctly covered by R, and belong to the set B ∩ H. Instances satisfying B but not H are incorrectly covered by the rule, and belong to the set B ∩ H̄. On the other hand, instances that do not satisfy the B part are not covered by the rule, and belong to the set B̄. Given a rule and a dataset, one way to assess the rule's performance is by computing its contingency matrix [14], as shown in Table 1. Denoting the cardinality of a set A as a, i.e. a = |A|, b and h in Table 1 denote the number of instances in the sets B and H respectively, i.e. b = |B| and h = |H|. Similarly, b̄ = |B̄|; h̄ = |H̄|; bh = |B ∩ H|; b̄h = |B̄ ∩ H|; bh̄ = |B ∩ H̄|; and b̄h̄ = |B̄ ∩ H̄|. The contingency matrix of a rule R enables the calculation of several rule quality measures, such as negative confidence (NegRel(R) = b̄h̄/b̄), accuracy (Acc(R) = bh/b), Laplace accuracy (Lap(R) = (bh + 1)/(bh + bh̄ + NCl)), and others [6, 14].
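To make the definitions above concrete, the following is a minimal sketch of how a rule's contingency counts and the three quality measures could be computed. It is illustrative only, not the implementation used in this work: `rule_body` and `rule_head` are assumed to be boolean predicates over an instance and its class label, and `n_cl` stands for NCl.

```python
# Illustrative sketch: contingency counts and rule quality measures
# for a rule B -> H over a dataset of (x, y) pairs.

def contingency(rule_body, rule_head, instances):
    """Return (bh, bh_bar, b_bar_h, b_bar_h_bar) for rule B -> H."""
    bh = bh_bar = b_bar_h = b_bar_h_bar = 0
    for x, y in instances:
        covered = rule_body(x)    # does x satisfy B?
        correct = rule_head(y)    # does x satisfy H, i.e. class = Ci?
        if covered and correct:
            bh += 1               # correctly covered
        elif covered:
            bh_bar += 1           # incorrectly covered
        elif correct:
            b_bar_h += 1          # not covered, class Ci
        else:
            b_bar_h_bar += 1      # not covered, other class
    return bh, bh_bar, b_bar_h, b_bar_h_bar

def acc(bh, bh_bar):
    return bh / (bh + bh_bar)                     # Acc(R) = bh/b

def lap(bh, bh_bar, n_cl):
    return (bh + 1) / (bh + bh_bar + n_cl)        # Laplace accuracy

def neg_rel(b_bar_h, b_bar_h_bar):
    return b_bar_h_bar / (b_bar_h + b_bar_h_bar)  # NegRel(R) = b̄h̄/b̄
```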

An ensemble consists of a set of individual base-classifiers whose predictions are combined in some way in order to predict the label of new instances. Majority voting, or some other voting mechanism, is one combination approach. Although under certain conditions ensembles can reduce classification error by reducing bias and variance, ensembles can be very large [8]. Another problem is related to the interpretability of ensembles by humans, since even an ensemble of symbolic classifiers is not necessarily symbolic.

Table 1: Contingency matrix for B → H

         B       B̄
  H     bh      b̄h      h
  H̄    bh̄     b̄h̄     h̄
         b       b̄       N

3 Related Work

Ensembles are increasingly gaining acceptance in the data mining community. Apart from a significant improvement in accuracy, this is due to their potential for on-line classification of large databases that do not fit into memory [5, 4]. There are different ways in which ensembles can be generated and the resulting outputs combined to classify new instances (see [12] for a review of methods and algorithms). In general, methods to construct ensembles can be broken down into two sub-tasks [8]. The first consists of generating a set of base-classifiers. The second consists of deciding how to combine the classifications of the base-classifiers to classify new instances.

Popular approaches to generating ensembles include changing the instances used for training to construct the base-classifiers, using techniques such as bagging [2], boosting [10], wagging [1], and others [20]. To reach improvements in accuracy, these methods rely on a large number of base-classifiers. On the other hand, recent papers focus on the combination of classifiers (the second sub-task). Some proposals are related to inducing the base-classifiers independently using different learning algorithms [15], such as stacking, which constructs a classifier to combine the base-classifiers' decisions [19]. Other approaches include the use of different aggregation operators [7] to fuse the predictions of the classifiers, since multiple classifier fusion may yield more accurate classification than each of the constituent classifiers [13]. However, unlike symbolic classifiers, ensembles are a sort of "black-box" classifier, since they are unable to explain their classification decisions on new examples.

In what follows, we describe a method to construct ensembles using symbolic classifiers, along with three voting mechanisms that enable the ensemble to explain its decisions to the user. Furthermore, we are able to achieve meaningful results using a small number of base-classifiers.

4 Combining Multiple Classifiers

In this work, the set of base-classifiers (first sub-task) is constructed in the following way: let L be the number of base-classifiers to be induced given a training dataset S. First, L samples S1, ..., SL, which can be drawn with or without replacement, are extracted from S, and each sample is used as input to a symbolic learning algorithm. In other words, sample S1 is used by Alg1 to induce h1, sample S2 is used by Alg2 to induce h2, and so forth. The same or different symbolic learning algorithms can be used on each of the L samples to induce the L base-classifiers. In fact, the only restriction is that the learning algorithms must induce unordered rules. Afterwards, given a new instance (example) x to be classified, the individual decisions of the set of L classifiers are combined to output its label. Figure 1 illustrates the method, where Combine(h1(x), ..., hL(x)) constitutes the symbolic ensemble h∗(x).

Figure 1: A method for constructing ensembles of classifiers
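The following is a minimal sketch of this construction step under the assumptions of this section: L disjoint samples are drawn without replacement from the training set and each is handed to a symbolic learner. Decision-tree learners (readable as sets of disjoint rules) stand in here for algorithms such as CN2 and C4.5; the function name is illustrative, not part of the ELE system.

```python
# Illustrative sketch: induce L base-classifiers from L disjoint samples.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def induce_base_classifiers(X, y, L, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    samples = np.array_split(idx, L)   # disjoint samples: without replacement
    return [DecisionTreeClassifier().fit(X[s], y[s]) for s in samples]
```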

The use of symbolic classifiers enables us to explore two different ways of classifying x using the L base-classifiers, i.e., two ways of finding h1(x), ..., hL(x):

1. Using the classifier as a "black-box", where each induced classifier is responsible for classifying x;

2. Exploring the rules inside the classifier, where the best rule of the classifier that covers example x, according to a given rule quality measure, is responsible for classifying x. As the ensembles' base-classifiers consist of unordered rule sets, the rules that cover a new example can be considered a sort of specialized classifier with respect to the given rule measure. In this work, we use accuracy, Laplace accuracy and negative confidence, defined in Section 2, as quality measures; a sketch of this best-rule mode is given below.
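A minimal sketch of classification mode 2 follows. A base-classifier is taken here to be a list of (body, label, quality) triples, where `body` is a predicate over x and `quality` is the rule's Acc, Lap or NegRel value computed on the training sample; these names are an assumed representation for illustration, not the ELE API.

```python
# Illustrative sketch: the single best-quality covering rule classifies x.

def best_rule_prediction(rules, x):
    """Class given by the best-quality rule covering x, or None."""
    covering = [(quality, label) for body, label, quality in rules if body(x)]
    if not covering:
        return None
    return max(covering)[1]   # label of the highest-quality covering rule
```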

Within the Combine(h1(x), ..., hL(x)) function, the following three methods have been implemented in order to construct the final ensemble h∗:

1. Unweighted Voting – UV: the class label of x is the one that receives the most votes from the L classifiers;

2. Weighted by Mean Voting – WMV: the class label given to x by each classifier is weighted by the classifier's mean error rate m_err(hl), and the class label of x is the one having the maximum total weight over the L classifiers:

$$ WMV(x, C_v) = \max_{C_i \in \{C_1,\ldots,C_{N_{Cl}}\}} \sum_{l=1}^{L} g(h_l(x), C_i) $$

where

$$ g(h_l(x), C_i) = \begin{cases} \lg\left(\dfrac{1 - m\_err(h_l)}{m\_err(h_l)}\right) & \text{if } h_l(x) = C_i, \\ 0 & \text{otherwise.} \end{cases} $$

3. Weighted by Mean and Standard Error Voting – WMSV: similar to the previous one, but also considering the standard error se_err(hl) of the classifier's mean error rate when estimating the corresponding weight:

$$ WMSV(x, C_v) = \max_{C_i \in \{C_1,\ldots,C_{N_{Cl}}\}} \sum_{l=1}^{L} g(h_l(x), C_i) $$

where

$$ g(h_l(x), C_i) = \begin{cases} \lg\left(\dfrac{1 - m\_err(h_l)}{m\_err(h_l)}\right) + \lg\left(\dfrac{1 - se\_err(h_l)}{se\_err(h_l)}\right) & \text{if } h_l(x) = C_i, \\ 0 & \text{otherwise.} \end{cases} $$

Voting method 1 (UV) is a straightforward voting mechanism. Methods 2 (WMV) and 3 (WMSV) aim to improve on method 1. To this end, method 2 (WMV) favours hypotheses having lower mean error rates. In similar fashion, method 3 (WMSV) favours hypotheses having lower mean error rates as well as lower standard errors. In other words, the aim of the function g(hl(x), Ci), where lg is the logarithm function, is to increase the class weight whenever the error rate and/or the standard error is less than 50%, and to decrease it otherwise. In case the error rate and/or the standard error is zero, a maximum system-defined weight is used; a sketch of these weighting schemes is given below. In order to validate our proposal, we have implemented a computational system called Ensemble Learning Environment (ELE), described next.
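The following is a minimal sketch of the two weighted voting schemes above. Two details are assumptions, not values given in the text: lg is taken to be the base-10 logarithm, and MAX_WEIGHT plays the role of the maximum system-defined weight used when an error estimate is zero.

```python
# Illustrative sketch: the weight function g and WMV/WMSV voting.

from collections import defaultdict
from math import log10

MAX_WEIGHT = 10.0   # assumed cap; the actual system-defined value is not given

def weight(err):
    """lg((1 - err) / err); positive iff err < 50%."""
    if err <= 0.0:
        return MAX_WEIGHT      # zero error: maximum system-defined weight
    if err >= 1.0:
        return -MAX_WEIGHT
    return log10((1.0 - err) / err)

def weighted_vote(predictions, m_err, se_err=None):
    """predictions[l] = h_l(x). se_err=None gives WMV; otherwise WMSV."""
    totals = defaultdict(float)
    for l, c in enumerate(predictions):
        w = weight(m_err[l])
        if se_err is not None:
            w += weight(se_err[l])   # WMSV adds the standard-error term
        totals[c] += w
    return max(totals, key=totals.get)
```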

5 The Ensemble Learning Environment

In order to validate our proposed approach for constructing symbolic ensembles, we have implemented a computational environment that we call the Ensemble Learning Environment — ELE. The current version of ELE includes an implementation of the three methods for combining classifiers — UV, WMV and WMSV — described in this work, as well as the methods for classifying examples. An interesting property of ELE is that it has been implemented using the object-oriented paradigm, and can be easily extended to support other combination methods, as well as other classification methods using other rule quality measures.

Furthermore, ELE is integrated into a larger computational environment called Discover [17], which is under development in our research laboratory. This integration enables us to easily incorporate into ELE several symbolic machine learning algorithms, such as C4.5 and CN2, as well as to use the experimental workbench available in Discover. In other words, ELE is a very flexible system, which enables us to explore several symbolic learning algorithms that can be combined in a variety of ways to construct symbolic ensembles.

6 Experiments and Results

In order to illustrate the ELE system, several experiments were conducted using three datasets from the UCI repository [11]: Nursery¹, Chess-Kr-Vs-Kp — Chess for short — and Splice. These datasets are often used in the ensemble literature for empirical evaluation. Table 2 describes the characteristics of the datasets used in this study, showing, for each dataset: the number of instances (# Inst.); the number of features (# Features), as well as the number of continuous and discrete features; the class distribution (Class %); the majority error rate; the presence or absence of unknown values (Unknown Values); and the number (and percentage) of duplicate or conflicting instances (Dup./Conf. Examples).

Base-classifiers were induced using the symbolic learning algorithms CN2 [6] and C4.5 [18]. The experiments were conducted combining the two different ways of classifying an instance (the base-classifiers' classification method, referred to as "Class", and the individual rules' classification, in which we used the accuracy, Laplace accuracy and negative confidence rule quality measures defined in Section 2, referred to as "Acc", "Lap" and "NegRel", respectively) with the three methods for combining the individual classifications (the unweighted, weighted by mean, and weighted by mean and standard error voting methods, referred to as UV, WMV and WMSV, respectively).

¹ The original Nursery dataset was modified to remove one of the classes, which has only 2 instances. In this work, all references to this dataset refer to this modified version.

Table 2: Datasets characteristics summary

Dataset                   Nursery               Chess            Splice
# Inst.                   12958                 3196             3190
# Features (cont.,disc.)  8 (0,8)               36 (0,36)        60 (0,60)
Class (Class %)           not_recom (33.34%)    nowin (47.78%)   EI (24.01%)
                          very_recom (2.53%)    won (52.22%)     IE (24.08%)
                          priority (32.92%)                      N (51.88%)
                          spec_prior (31.21%)
Majority Error            66.66%                47.78%           48.12%
                          in not_recom          in won           in N
Unknown Values            N                     N                N
Dup./Conf. Examples       0 (0.00%)             0 (0.00%)        184 (5.77%)

For example, UV-Class indicates that

each classifier was used as a “black-box” to classify a new example, and their

decisions were combined using Unweighted Voting (UV); WMV-Lap indicates

that the rule with the best Laplace accuracy measure value of each classifier

was used to classify a new example and the final decision was obtained using

Weighted by Mean Voting (WMV). The experiments were carried out using five different set-up scenarios, described in Table 3. The learning algorithm CN2 was used in scenarios Scn 1 and Scn 3, while C4.5 was used in scenarios Scn 2 and Scn 5. In scenario Scn 4, both CN2 (in three samples) and C4.5 (in two samples) were used. As sampling was carried out without replacement in these experiments, the size of the datasets does not allow a greater number of partitions.

Table 3: Experiment description

Experiment   # of Partitions   Learning algorithms
Scn 1        3                 CN2, CN2, CN2
Scn 2        3                 C4.5, C4.5, C4.5
Scn 3        5                 CN2, CN2, CN2, CN2, CN2
Scn 4        5                 CN2, CN2, CN2, C4.5, C4.5
Scn 5        5                 C4.5, C4.5, C4.5, C4.5, C4.5

All results were assessed using 10-fold stratified cross-validation (10-FSCV for short) in the following way. First, the dataset was partitioned into 10 independent folds. In each iteration j, one fold was used as the test set tej and the nine remaining folds as the training set trj. The training set trj was further divided into L independent (without replacement) samples Si, i = 1, ..., L, to induce the base-classifiers. Only for the purpose of estimating the error rate m_err(hi) and the standard error se_err(hi) used as weights in the proposed ensembles' voting mechanisms, 10-FSCV was again applied to each sample Si. In each j-th iteration, the ensemble was constructed using the base-classifiers induced from all examples in each sample Si, and the j-th ensemble error rate was estimated on the remaining test set tej. The same test set tej was used to estimate the base-classifiers' error rate for each sample Si, i = 1, ..., L. This process was iterated from j = 1 to 10; a sketch of this protocol is given below.
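The following is a minimal sketch, under simplifying assumptions, of the evaluation protocol just described: an outer 10-FSCV, with an inner 10-FSCV on each sample used only to estimate m_err and se_err. DecisionTreeClassifier stands in for the symbolic learners, and `combine` can be any voting function such as the `weighted_vote` sketch above; neither is the actual ELE code. It assumes X and y are NumPy arrays and that every class appears at least 10 times in each sample.

```python
# Illustrative sketch: nested cross-validation for ensemble evaluation.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_ensemble(X, y, L, combine):
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    rng = np.random.default_rng(0)
    fold_errors = []
    for train, test in outer.split(X, y):
        samples = np.array_split(rng.permutation(train), L)  # disjoint samples
        classifiers, m_err, se_err = [], [], []
        for s in samples:
            # inner 10-fold CV, used only to estimate the voting weights
            acc = cross_val_score(DecisionTreeClassifier(), X[s], y[s], cv=10)
            err = 1.0 - acc
            m_err.append(err.mean())
            se_err.append(err.std(ddof=1) / np.sqrt(len(err)))
            # the base-classifier itself is induced from the whole sample
            classifiers.append(DecisionTreeClassifier().fit(X[s], y[s]))
        preds = [combine([h.predict(xi.reshape(1, -1))[0] for h in classifiers],
                         m_err, se_err)
                 for xi in X[test]]
        fold_errors.append(np.mean(np.array(preds) != y[test]))
    return np.mean(fold_errors)    # estimated ensemble error rate
```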

Tables 4, 5 and 6 summarize, respectively, the error rates and standard errors obtained with the Nursery, Chess and Splice datasets. The first column identifies the experiment and the next five columns show the results obtained in each scenario. The first five rows, labelled S1, S2, S3, S4 and S5, present the results for each base-classifier, while the remaining rows present the results for the ensemble construction methods. Results in bold indicate that the ensemble is better than each one of the base-classifiers with a 95% confidence level according to a paired t-test. The results from Tables 4 to 6 are also depicted graphically, as shown in Figures 2 to 4. In these figures, the X-axis refers to the ensemble's error rate and the Y-axis refers to the minimum error over all of the ensemble's base-classifiers. Furthermore, legends labelled "Class - Scn 1", "Class - Scn 2", and so on, refer to results from Scn 1, Scn 2, and so on, using each base-classifier as a "black-box" to classify an example, while "Rule - Scn 1", "Rule - Scn 2", and so on, refer to results using only the best rule of each base-classifier, according to a rule quality criterion — Acc(R), Lap(R) or NegRel(R) — instead of the whole classifier. Points above the main diagonal indicate that the ensemble presents better results than its best base-classifier.

Figures 2 to 4 are provided to facilitate the visualization of the experimental results. They show that the constructed ensembles improve on the results obtained by their best base-classifiers in most cases. Besides, the results in Tables 4 to 6 show that in two (Nursery and Splice) out of the three datasets, the ensembles' error rates are smaller than their base-classifiers' error rates. Moreover, according to the t-test, these results (ensemble versus base-classifiers) are all significant at the 95% confidence level. For the Chess dataset, the three combination methods (UV, WMV and WMSV) present identical results. However, it should be observed that the proposed ensemble approach did not degrade the performance of the base-classifiers, even where the ensembles' error rates are not all significantly smaller than their base-classifiers' error rates (Chess dataset). Taking into account the limited number of base-classifiers used in the experiments (a minimum of 3 and a maximum of 5), the results can be considered very good, as the ensembles often significantly improve upon the base-classifiers.

Another positive point is that the results obtained using the best-rule classification criterion are comparable to the ones obtained using the classifiers as

“black-boxes”. This is an interesting result since at most one rule from each

base-classifier is used to classify a new example. Therefore, this can be further

explored to enhance the ensemble explanation capability.

Considering the experiments using C4.5 (Scn 2 and Scn 5), it can be observed that they present the same error rate independently of the classification criterion used, i.e. all UV methods (UV-Class, UV-Acc, UV-Lap and UV-NegRel) have equal results, as do all WMV and all WMSV methods. The reason is that C4.5 induces disjoint rules (decision trees), which means that exactly one rule in each classifier covers a new example.

As expected, considering the sampling method used (without replacement) and the limited size of the datasets, the mean error rate of the base-classifiers tends to increase with the number of classifiers. This is due to the fact that the datasets used are not so large, and using exclusive partitions as samples implies fewer examples in each sample.

Table 4: Results using nursery dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 5.60 6.16 7.64 7.75 7.92
(0.13) (0.23) (0.24) (0.19) (0.18)
S2 5.66 6.47 7.24 7.86 7.77
(0.25) (0.21) (0.21) (0.10) (0.24)
S3 5.43 5.97 7.93 7.47 7.80
(0.18) (0.09) (0.26) (0.27) (0.17)
S4 - - 7.83 7.50 7.73
- - (0.22) (0.17) (0.24)
S5 - - 7.82 7.94 7.46
- - (0.16) (0.29) (0.24)
UV-Class 3.86 4.51 4.42
(0.13) (0.16) (0.15)
WMV-Class 4.13 4.81 5.06 4.81 6.41
(0.13) (0.18) (0.18) (0.11) (0.14)
WMSV-Class 4.18 4.95 4.88
(0.14) (0.19) (0.13)
UV-Acc 3.30 3.86 4.41
(0.18) (0.14) (0.11)
WMV-Acc 3.53 4.81 4.47 4.75 6.41
(0.19) (0.18) (0.16) (0.08) (0.14)
WMSV-Acc 3.57 4.38 4.81
(0.21) (0.14) (0.12)
UV-Lap 3.85 4.47 4.51
(0.14) (0.15) (0.14)
WMV-Lap 4.11 4.81 5.02 4.85 6.41
(0.14) (0.18) (0.16) (0.12) (0.14)
WMSV-Lap 4.17 4.91 4.93
(0.15) (0.15) (0.13)
UV-NegRel 3.80 4.30 4.24
(0.14) (0.17) (0.16)
WMV-NegRel 4.24 4.81 5.03 4.65 6.41
(0.14) (0.18) (0.21) (0.14) (0.14)
WMSV-NegRel 4.24 4.94 4.70
(0.16) (0.21) (0.14)

Figure 2: Results using nursery dataset.

Table 5: Results using chess dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 2.72 1.69 3.82 3.63 3.04
(0.35) (0.29) (0.73) (0.72) (0.35)
S2 2.47 1.25 3.57 2.75 3.00
(0.32) (0.19) (0.44) (0.36) (0.46)
S3 2.72 1.75 2.88 3.00 2.53
(0.49) (0.28) (0.27) (0.35) (0.35)
S4 - - 2.94 2.22 3.13
- - (0.47) (0.28) (0.51)
S5 - - 3.04 2.50 3.04
- - (0.44) (0.27) (0.48)
UV/WMV/WMSV-Class    2.50    0.91    2.25    1.60    2.32
                     (0.33)  (0.16)  (0.33)  (0.25)  (0.35)
UV/WMV/WMSV-Acc      2.72    0.91    2.32    1.63    2.32
                     (0.35)  (0.16)  (0.32)  (0.27)  (0.35)
UV/WMV/WMSV-Lap      2.50    0.91    2.25    1.60    2.32
                     (0.33)  (0.16)  (0.33)  (0.25)  (0.35)
UV/WMV/WMSV-NegRel   2.53    0.91    2.25    1.56    2.32
                     (0.33)  (0.16)  (0.33)  (0.26)  (0.35)

(For this dataset the three voting methods UV, WMV and WMSV yield identical results, so each row covers all three.)

Figure 3: Results using chess dataset.

Table 6: Results using splice dataset.
Sample Scn 1 Scn 2 Scn 3 Scn 4 Scn 5
S1 18.50 9.03 15.33 14.61 11.25
(1.85) (0.38) (0.97) (0.73) (0.70)
S2 15.52 9.37 14.80 16.02 13.01
(1.04) (0.47) (0.92) (1.20) (0.67)
S3 15.92 9.00 15.30 19.75 11.41
(1.28) (0.48) (0.64) (1.73) (0.40)
S4 - - 15.45 11.72 12.13
- - (1.45) (0.74) (0.55)
S5 - - 16.77 11.32 13.26
- - (0.93) (0.52) (0.71)
UV-Class 11.54 7.55 9.72 7.30 8.68
(0.49) (0.39) (0.45) (0.70) (0.42)
WMV-Class 11.13 7.34 9.84 7.08 8.53
(0.39) (0.27) (0.41) (0.66) (0.53)
WMSV-Class 11.19 7.59 9.59 7.15 8.53
(0.37) (0.38) (0.37) (0.62) (0.52)
UV-Acc 11.69 7.55 10.50 7.74 8.68
(0.46) (0.39) (0.58) (0.64) (0.42)
WMV-Acc 11.69 7.34 10.50 7.74 8.53
(0.46) (0.27) (0.58) (0.64) (0.53)
WMSV-Acc 11.69 7.59 10.38 7.68 8.53
(0.46) (0.38) (0.55) (0.58) (0.52)
UV-Lap 10.75 7.55 9.66 7.27 8.68
(0.44) (0.39) (0.42) (0.78) (0.42)
WMV-Lap 10.82 7.34 9.81 7.05 8.53
(0.52) (0.27) (0.46) (0.74) (0.53)
WMSV-Lap 10.88 7.59 9.56 7.05 8.53
(0.49) (0.38) (0.40) (0.66) (0.52)
UV-NegRel 12.85 7.55 9.94 7.52 8.68
(0.61) (0.39) (0.43) (0.65) (0.42)
WMV-NegRel 12.13 7.34 10.16 7.34 8.53
(0.28) (0.27) (0.45) (0.64) (0.53)
WMSV-NegRel 12.26 7.59 9.87 7.40 8.53
(0.26) (0.38) (0.38) (0.60) (0.52)

Figure 4: Results using splice dataset.

In order to illustrate the explanation facility of the symbolic ensembles constructed by our proposed method, consider the ensemble constructed using the Nursery dataset and the UV-Class combination method in Scn 4. The estimated error rate of this ensemble, which consists of five base-classifiers h1, ..., h5, is about 62% lower than that of its best base-classifier — Table 4. Suppose the example x = (usual, less_proper, incomplete, 3, convenient, convenient, problematic, recommended), having true class label priority, is given to this ensemble. Example x is correctly classified by the ensemble, and its classification can be explained by the bodies of the fired rules listed in Table 7.

Example x: (usual, less_proper, incomplete, 3, convenient, convenient, problematic, recommended) - Class: priority
Combination method: UV-Class - Scenario: Scn 4 - Ensemble classification: priority
Body (conditions) of the rules that explain the classification given by the ensemble to example x:

1 – parents = usual AND has_nurs = less_proper AND health = recommended
2 – parents = usual AND has_nurs = less_proper AND form = incomplete AND health = recommended
3 – parents = usual AND has_nurs = less_proper AND housing = convenient AND health = recommended
4 – parents = usual AND has_nurs = less_proper AND housing = convenient AND finance = convenient AND health = recommended
5 – parents = usual AND has_nurs = less_proper AND social = problematic AND health = recommended
6 – health = recommended AND has_nurs = less_proper
7 – health = recommended AND has_nurs = less_proper AND parents = usual AND housing = convenient AND finance = convenient

Table 7: Bodies of the ensemble's fired rules for example x.

Moreover, observe that some rules are specializations of other rules. For instance, rules 2, 3, 4, 5 and 7 are specializations of rule 1. Furthermore, rule 6 is a generalization of rule 1. In this case, we can synthesize the explanation into a single rule, rule 6. We are currently implementing such a synthesis facility in our system; a sketch of the underlying test is given below.
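The following is a minimal sketch of the synthesis idea, assuming a rule body is represented as a set of (feature, operator, value) tests; this is an assumed representation for illustration, not the ELE one. Rule A specializes rule B when B's tests are a proper subset of A's, so keeping only the most general fired bodies shrinks the explanation: applied to Table 7, rule 6 survives while rules 1-5 and 7 are dropped.

```python
# Illustrative sketch: drop fired rule bodies that specialize another
# fired body, keeping only the most general explanations.

def is_specialization(body_a, body_b):
    """True if body_a adds tests to body_b (body_b is more general)."""
    return body_b < body_a        # proper subset of feature tests

def synthesize(fired_bodies):
    """Keep only bodies that do not specialize another fired body."""
    return [a for a in fired_bodies
            if not any(is_specialization(a, b)
                       for b in fired_bodies if b is not a)]
```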

7 Conclusions and Future Work

Ensembles can be constructed using several methods to combine the decisions of individual base-classifiers. In this work we propose several methods to construct ensembles of symbolic base-classifiers such that the ensembles can be further explored in order to explain their classification decisions. We conducted several experiments using three limited size datasets, few base-classifiers and five different scenarios. For two of the datasets, the ensembles constructed using the proposed methods always showed an error rate smaller than their base-classifiers' error rates, with a 95% confidence level according to the paired t-test. For the other dataset, although such good results at the 95% confidence level were obtained in only one scenario, the ensembles' error rates in the remaining scenarios were not higher than their base-classifiers' error rates. These results are encouraging since they were obtained using small datasets.

The current ELE explanation mechanism shows the user all the individual classifiers' rules that correctly cover the example, i.e. the fired rules from classifiers that participate in the final (combined) ensemble classification. We are currently improving ELE's explanation facility, aiming to reduce and improve the set of explanatory rules shown to the user. Ongoing work also includes further experiments using massive datasets.

Acknowledgments

This research was supported by the Brazilian research councils CAPES and

FAPESP (Process no. 02/06914-0). We would also like to thank the anonymous

referee, who provided many insightful comments on an earlier version of this

paper.

References

[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1/2):105–139, 1999.

[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[3] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi. Discovering

Data Mining: from Concept to Implementation. Prentice Hall, 1998.

[4] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Learning ensembles from bites: A scalable and accurate approach. Journal of Machine Learning Research, 5:421–451, 2004.

[5] N. V. Chawla, T. E. Moore, L. O. Hall, K. W. Bowyer, W. P. Kegelmeyer, and C. Springer. Distributed learning with bagging-like performance. Pattern Recognition Letters, 24(1-3):455–471, 2003.

[6] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Y. Kodratoff, editor, Proc. of the 5th European Working Session on Learning (EWSL 91), pages 151–163, 1991.

[7] M. Detyniecki. Fundamentals on aggregation operators — AGOP, 2001. Manuscript — Berkeley Initiative in Soft Computing, University of California, Berkeley. http://www.cs.berkeley.edu/~marcin/agop.pdf.

[8] T. G. Dietterich. Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems, LNCS volume 1857, pages 1–15, New York, 2000.

[9] C. Domingo, R. Gavaldà, and O. Watanabe. Adaptive sampling methods

for scaling up knowledge discovery algorithms. Data Min. Knowl. Discov.,

6(2):131–152, 2002.

[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. of Computer and System Sciences, 55(1):119–139, 1997.

[11] S. Hettich, C.L. Blake, and C.J. Merz. UCI repository of machine learning

databases, 1998.

[12] L. I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms.

Wiley, 2003.

[13] L. I. Kuncheva, J. C. Bezdek, and R. P. W. Duin. Decision templates for multiple classifier fusion: An experimental comparison. Pattern Recognition, 34(2):299–314, 2001.

[14] N. Lavrac, P. Flach, and B. Zupan. Rule evaluation measures: a unifying view. In Proc. 9th Inter. W. on Inductive Logic Programming, LNAI volume 1634, pages 174–185, 1999.

[15] K. T. Leung and D. S. Parker. Empirical comparisons of various voting

methods in bagging. In KDD ’03: Proc. 9th ACM SIGKDD Inter. Conf. on

Knowledge Discovery and Data Mining, pages 595–600. ACM Press, 2003.

[16] Huan Liu and Hiroshi Motoda. Instance Selection and Construction for

Data Mining. Kluwer Academic Publishers, Norwell, MA, USA, 2001.

[17] R. C. Prati, M. R. Geromini, and M. C. Monard. An integrated environment for data mining. In IV Congress of Logic Applied to Technology — LAPTEC 2003, volume 2, pages 55–62, Brazil, 2003.

[18] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[19] L. Todorovski and S. Dzeroski. Combining classifiers with meta decision

trees. Machine Learning, 50(3):223–249, 2003.

[20] G. Valentini and F. Masulli. Ensembles of learning machines. Neural Nets

WIRN Vietri-02. Lecture Notes in Computer Science, 2486:3–19, 2002.

