
A Bottom-Up Oblique Decision Tree Induction Algorithm

Rodrigo C. Barros, Ricardo Cerri, Pablo A. Jaskowiak and Andre C. P. L. F. de Carvalho


Department of Computer Science, ICMC
University of Sao Paulo (USP)
Sao Carlos - SP, Brazil
{rcbarros,cerri,pablo,andre}@icmc.usp.br

Abstract: Decision tree induction algorithms are widely
used in knowledge discovery and data mining, especially
in scenarios where model comprehensibility is desired.
A variation of the traditional univariate approach is the
so-called oblique decision tree, which allows multivariate
tests in its non-terminal nodes. Oblique decision trees
can model decision boundaries that are oblique to the
attribute axes, whereas univariate trees can only perform
axis-parallel splits. The majority of the oblique and
univariate decision tree induction algorithms perform a
top-down strategy for growing the tree, relying on an
impurity-based measure for splitting nodes. In this paper,
we propose a novel bottom-up algorithm for inducing
oblique trees named BUTIA. It does not require an
impurity-measure for dividing nodes, since we know a
priori the data resulting from each split. For generating the
splitting hyperplanes, our algorithm implements a support
vector machine solution, and a clustering algorithm is used
for generating the initial leaves. We compare BUTIA to
traditional univariate and oblique decision tree algorithms,
C4.5, CART, OC1 and FT, as well as to a standard SVM
implementation, using real gene expression benchmark
data. Experimental results show the effectiveness of the
proposed approach in several cases.
Keywords: oblique decision trees; bottom-up induction;
clustering; SVM; hybrid intelligent systems

I. INTRODUCTION
Decision tree induction algorithms are widely used
in a variety of domains for knowledge discovery
and pattern recognition. The induced knowledge in
the form of hierarchical trees can be regarded as
a disjunction of conjunctions of constraints on the
attribute values [1]. Each path from the root to a leaf
is actually a conjunction of attribute tests, and the
tree itself allows the choice of different paths, i.e., a
disjunction of these conjunctions. Such a representation
is intuitive and easy to assimilate by humans, which
partially explains the large number of studies that
make use of these techniques. Another reason for their
popularity is their good predictive accuracy in several
application domains, such as medical diagnosis and
credit risk assessment [2].
A major issue in decision tree induction is which
attribute(s) to choose for splitting an internal node.
For the case of axis-parallel decision trees (also known
as univariate), the problem is to choose the attribute
that better discriminates the input data. A decision
rule based on such an attribute is thus generated, and
the input data is filtered according to the consequents
of this rule. For oblique decision trees (also known
as multivariate), the goal is to find a combination of
attributes with good discriminatory power.
Oblique decision trees are not as popular as the
univariate ones, mainly because they are harder
to interpret. Nevertheless, researchers argue that
multivariate splits can improve the performance of
the tree in several datasets, while generating smaller
trees [3]-[5]. Clearly, there is a tradeoff to consider in
allowing multivariate tests: simple tests may result in
large trees that are hard to understand, yet multivariate
tests may result in small trees with tests that are hard
to understand [6].
One of the advantages of oblique decision trees is
that they are able to produce polygonal (polyhedral)
partitions of the attribute space, i.e., hyperplanes
at an oblique orientation to the attribute axes.
Univariate trees, on the other hand, can only produce
hyper-rectangles parallel to the attribute axes. The tests
at each node of an oblique tree have the form
$w_0 + \sum_{i=1}^{m} w_i x_{ji} \leq 0$, where $w_i$ is a
real-valued coefficient associated to the $i$th attribute
of a given instance $x_j$, and $w_0$ is the disturbance
coefficient (bias) of the test.
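The difference between the two kinds of test can be illustrated with a minimal sketch. It assumes the test is satisfied when the linear combination is at most zero; the function names `oblique_test` and `univariate_test` are hypothetical and not taken from any implementation discussed in this paper.

```python
from typing import Sequence


def oblique_test(w0: float, w: Sequence[float], x: Sequence[float]) -> bool:
    """Return True when w0 + sum_i w[i] * x[i] <= 0 (assumed direction)."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return w0 + sum(wi * xi for wi, xi in zip(w, x)) <= 0


def univariate_test(attribute: int, threshold: float, x: Sequence[float]) -> bool:
    """Axis-parallel test x[attribute] <= threshold.

    It is written as an oblique test with a single nonzero coefficient,
    which shows that univariate splits are a special case.
    """
    w = [0.0] * len(x)
    w[attribute] = 1.0
    return oblique_test(-threshold, w, x)
```

For instance, `oblique_test(-1.0, [1.0, 1.0], x)` checks whether `x` lies on or below the line x1 + x2 = 1, a boundary that no single axis-parallel test can express.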
For either the growth of oblique or axis-parallel
decision trees, there is a clear preference in the
literature for algorithms that rely on a greedy,
top-down, recursive partitioning strategy, i.e.,
top-down induction. The most well-known algorithms
for decision tree induction indeed implement this
strategy, e.g., CART [7], C4.5 [8] and OC1 [9]. These
algorithms make use of impurity-based criteria to
decide which attribute(s) will split the data in purer
subsets (a pure subset is one whose instances belong to
the same class). Since these algorithms are top-down,
it is not possible to know a priori which instances will
result in each subset of a partition. Thus, in top-down
induction, trees are usually grown until every leaf
node is pure, and a pruning method is employed to
avoid data overfitting.
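As an illustration of an impurity-based criterion, the sketch below scores candidate thresholds of a single attribute with the Gini index, the impurity measure used by CART. It is only a simplified example under that assumption: the helper names are hypothetical, and real inducers add many details (other measures such as gain ratio, missing values, stopping rules).

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def gini(labels: Sequence[Hashable]) -> float:
    """Gini index of a set of class labels; 0.0 means the set is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values: Sequence[float], labels: Sequence[Hashable],
                   threshold: float) -> float:
    """Weighted Gini index of the two subsets produced by value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values: Sequence[float],
                   labels: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (impurity, threshold) for the best midpoint between distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        raise ValueError("at least two distinct values are needed to split")
    return min((split_impurity(values, labels, t), t) for t in candidates)
```

Note that the subsets are only known after a threshold has been chosen, which is exactly why top-down inducers need such a measure to compare candidate splits.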
Works that implement a bottom-up strategy are quite
rare in the literature. The key ideas behind bottom-up
induction were first presented by Landeweerd et
al. [10]. The authors propose growing a decision tree
from the leaves to the root, assuming that each class
is represented by a leaf node, and that the closest
nodes (according to the Mahalanobis distance) are
recursively merged into a parent node. Albeit simple,
their approach presents several deficiencies, e.g., it
allows only a single leaf per class, which means that
binary-class problems will always be modeled by a
3-node tree. This is quite problematic since there are
complex binary-class problems in which a 3-node
decision tree cannot model accurately the attribute
space. We believe this deficiency is one of the reasons
why researchers have been discouraged from further
investigating the
bottom-up induction of decision trees. Another reason
may be the extra computational effort required to
compute the costly Mahalanobis distance.
In this paper, we propose alternatives to solve the
deficiencies of the typical bottom-up approach. For
instance, we propose the application of a well-known
clustering algorithm to allow each class to be
represented by more than one leaf node. In addition,
we incorporate in our algorithm a support vector
machine (SVM) [11] solution to build the hyperplanes
that divide the data within each non-terminal node
of the oblique decision tree. We call our approach
BUTIA (Bottom-Up oblique Tree Induction Algorithm),
and we evaluate its performance in gene expression
benchmark datasets.
This paper is organized as follows. In Section II
we detail the proposed algorithm, which combines
clustering and SVM for generating oblique decision
trees. In Section III we conduct a comparison
among BUTIA and traditional top-down decision tree
induction algorithms C4.5 [8], CART [7] and OC1 [9].
Additionally, we compare BUTIA to Sequential
Minimal Optimization (SMO) [12] and Functional
Trees [13]. Section IV presents studies related to our
approach, whereas in Section V we discuss the main
conclusions of this work.
II. BUTIA
We propose a new bottom-up oblique decision
tree induction algorithm, named BUTIA (Bottom-Up
oblique Tree Induction Algorithm). It employs two
machine learning algorithms in different steps of tree
growth, namely Expectation-Maximization (EM) [14]
and Sequential Minimal Optimization (SMO) [12]. Our
motivation for building bottom-up trees is twofold:
• In a bottom-up approach we have a priori
information on which group of instances belongs to a
given node of the tree. This means we know the result of
each node split before even generating the separating
hyperplane. In fact, our algorithm uses this a priori
information for generating hyperplanes that maximize
the separation margin between instances of two
nodes. Hence, there is no need to rely on an
impurity measure to evaluate the goodness of a split;
• The top-down induction strategy usually
overgrows the decision tree until every leaf node
is pure, and then a pruning procedure is responsible
for simplifying the tree in order to avoid data
overfitting. In bottom-up induction, a pruning step
is not necessary because we are not overgrowing the
tree. Since we start growing the tree from the leaves
to the root, our approach significantly reduces the
chances of overfitting by clustering the data instances.
Given a space of instances $X = \{x_1, \ldots, x_n\}$,
$x_i \in \mathbb{R}^m$, and a set of classes
$C = \{c_1, \ldots, c_l\}$, the
BUTIA algorithm (Algorithm 1) works according to the
following steps:
1) Divide the training data into pure subsets, i.e.,
instances belonging to the same subset have the same
class label. Formally, for an $l$-class problem, generate $l$
subsets $S_i = \{x \mid C(x) = c_i\}$, where $C(x_j)$ returns the
class label of instance $x_j$ (procedure butia, line 4).
2) Apply EM over each subset $S_i$, $1 \leq i \leq l$.
For defining the number of clusters for each subset,
we perform a 10-fold cross-validation procedure,
evaluating the partition's stability according to the
log-likelihood measure (procedure butia, line 5).
3) Compute the centroid (i.e., the cluster mean
vector) of each cluster generated in step 2. Each of
the previously generated clusters becomes a leaf node
(procedure butia, lines 6-11).
4) Generate a new internal node by merging the
two most similar nodes, i.e., those whose centroids are
closest according to the Euclidean distance (procedure
merging, lines 9-19). Once a node is created, its
instances (that came up from its children) have their
class label replaced by a new unique meta-class
(procedure merging, line 20). During the merging
procedure two constraints are imposed: (i) only nodes
comprised of instances from different classes are
allowed to be merged; and (ii) nodes can be merged
only once.
5) Compute the new node's centroid and generate
the separating hyperplane by training a linear-kernel
SVM over the node's instances (procedure merging,
lines 21-22). The resulting hyperplane guarantees the
separability of the training instances that arrive at this
node. Recall that the node's children are from two
different classes (for leaf nodes) or meta-classes (for
internal nodes). Hence, the SVM needs only to deal with
binary problems, regardless of the number of classes
in the original dataset, which means its application
is straightforward and does not rely on a majority
voting scheme between pairwise class subsets. This
is particularly interesting in order to preserve the
interpretability of the generated hyperplane as a rule.
Steps 4 and 5 are repeated recursively until a root node
is reached.

Algorithm 1: BUTIA's procedures.

procedure butia(Dataset X, Classes C)
input : A dataset X and a set of classes C
output: The oblique decision tree
 1 begin
 2   Leaves ← ∅
 3   foreach class ci ∈ C do
 4     Si ← {x | C(x) = ci}
 5     Partition ← result of EM over Si
 6     foreach cluster ∈ Partition do
 7       create new leaf node n
 8       n.instances ← instances from cluster
 9       n.centroid ← mean vector of n.instances
10       n.class ← ci
11       Leaves ← Leaves ∪ {n}
12   Tree ← merging(Leaves)
13   return Tree

procedure merging(L)
input : A set L of nodes
output: The oblique decision tree root node
 1 begin
 2   if |L| = 1 then
 3     return unique node in L
 4   else
 5     s1 ← 1
 6     s2 ← 1
 7     smallestDistance ← ∞
 8     distance ← 0
 9     foreach i, j ∈ L with i ≠ j do
10       if i.class ≠ j.class then
11         distance ← distance between i and j
12         if distance < smallestDistance then
13           smallestDistance ← distance
14           s1 ← i
15           s2 ← j
16     create new internal node t
17     t.leftChild ← s1
18     t.rightChild ← s2
19     t.instances ← s1.instances ∪ s2.instances
20     t.class ← new meta-class
21     t.centroid ← mean vector of t.instances
22     t.hyperplane ← SVM hyperplane for t.instances
23     L ← L ∪ {t}
24     L ← L \ {s1}
25     L ← L \ {s2}
26     return merging(L)

Figure 1 depicts the execution steps of BUTIA.

[Figure 1: the training data (binary attributes x1 ... xn with
class labels A, B and C) is divided by class (step 1), each class is
clustered into leaves A1 ... AkA, B1 ... BkB and C1 ... CkC (step 2),
and the closest leaves are merged into internal nodes such as N1
(steps 4 and 5).]
Figure 1. Diagram of BUTIA's execution steps.

We believe our approach presents the following
advantages over the top-down methods:
• It can handle imbalanced classes regardless of
any explicitly provided cost matrix. Unlike top-down
approaches, BUTIA always generates at least one leaf
node per class, which guarantees that rare classes are
represented in the predictive model.
• It can model problems in which one class is
described by more than one probability distribution.
By clustering each pure subset, we can identify
possibly distinct probability distributions, and thus
generate separating hyperplanes for the closest
inter-class boundaries. Note that this is not trivially
achieved by other methods, especially those whose
design is based on impurity measures.
In Figure 2, we detail how BUTIA would perform
in a synthetic binary-class dataset with m = 2. The
original data is presented in (a), and the clustering step
is performed in (b) within each class. The first class (x)
was clustered in two groups, as was the second (o),
resulting in 4 leaf nodes. Next, the centroids of each
node are calculated, and the closest nodes according
to the Euclidean distance are merged (remember that
nodes can only be merged when their (meta) classes
are distinct) generating a new node in the tree (c). Still
referring to (c), the new merged node has its centroid
computed, a meta-class is assigned to its instances,
and finally the corresponding hyperplane is generated.
Then, the algorithm merges the next two closest
centroids, repeating the same procedure of creating a
new node, assigning a new meta-class and generating
a new hyperplane (d and e). The algorithm terminates
when all nodes (but one, the root node) are merged,
resulting in the oblique partitions presented in (f).
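The walkthrough above can be traced with a small sketch of the leaf creation and merging steps. It is not the authors' implementation and relies on two loudly stated simplifications: the clusters of each class are given as input instead of being found by EM, and the hyperplane of each internal node is the perpendicular bisector between the two child centroids, used here only as a stand-in for the linear-kernel SVM. All names (`Node`, `make_leaves`, `bisector`, `merging`) are hypothetical.

```python
import itertools
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Tuple[float, ...]


def mean_vector(points: Sequence[Vector]) -> Vector:
    """Centroid (attribute-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


@dataclass
class Node:
    instances: List[Vector]
    label: str                      # class (leaf) or meta-class (internal node)
    centroid: Vector
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    w0: float = 0.0                 # hyperplane bias (internal nodes only)
    w: Vector = ()                  # hyperplane weights (internal nodes only)


def make_leaves(clusters: Dict[str, List[List[Vector]]]) -> List[Node]:
    """One leaf per (class, cluster). Assumption: clusters are given, not found by EM."""
    return [Node(list(c), label, mean_vector(c))
            for label, groups in clusters.items() for c in groups]


def bisector(a: Vector, b: Vector) -> Tuple[float, Vector]:
    """Stand-in for the SVM: hyperplane halfway between centroids a and b.

    Points closer to a satisfy w0 + w.x <= 0.
    """
    w = tuple(bi - ai for ai, bi in zip(a, b))
    mid = tuple((ai + bi) / 2 for ai, bi in zip(a, b))
    w0 = -sum(wi * mi for wi, mi in zip(w, mid))
    return w0, w


def merging(nodes: List[Node]) -> Node:
    """Merge the closest nodes of different (meta-)classes until one root is left.

    Written as a loop rather than the recursion of Algorithm 1.
    """
    nodes = list(nodes)
    meta = itertools.count(1)
    while len(nodes) > 1:
        pairs = [(math.dist(a.centroid, b.centroid), i, j)
                 for (i, a), (j, b) in itertools.combinations(enumerate(nodes), 2)
                 if a.label != b.label]
        if not pairs:
            raise ValueError("remaining nodes all share one class")
        _, i, j = min(pairs)
        s1, s2 = nodes[i], nodes[j]
        instances = s1.instances + s2.instances
        w0, w = bisector(s1.centroid, s2.centroid)
        t = Node(instances, f"meta-{next(meta)}", mean_vector(instances),
                 s1, s2, w0, w)
        # Each node is merged only once: drop both children, keep the parent.
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [t]
    return nodes[0]
```

With two clusters per class in a 2-d binary problem, as in Figure 2, `merging(make_leaves(clusters))` produces three internal nodes on top of the four leaves, the last one being the root.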

Figure 2. BUTIA's execution in a synthetic 2-d binary-class dataset, panels (a) to (f).

III. EXPERIMENTAL ANALYSIS

In order to evaluate our novel algorithm, we
consider a set of 35 publicly available datasets from
microarray gene expression data [15]. Microarray
technology enables expression level measurement for
thousands of genes in a parallel fashion, given a
biological tissue. Once combined, a fixed number
of microarray experiments comprise a gene
expression dataset.
The considered datasets are related to different types
or subtypes of cancer (e.g., prostate, lung and skin) and
cover the two flavors in which the technology
is generally available: single channel (21 datasets) and
double channel (14 datasets) [16]. Hereafter we refer
to single channel microarrays as Affymetrix (Affy)
and double channel microarrays as cDNA, since the
data were collected using either of these technologies
[15]. Our final task consists in classifying different
examples (instances) according to their gene (attribute)
expression levels. The main characteristics of the
datasets are summarized in Table I.

Table I
SUMMARY OF THE GENE EXPRESSION DATASETS.

Id  Dataset        Type  # Instances  # Classes  # Genes
 1  alizadeh-v1    cDNA           42          2     1095
 2  alizadeh-v2    cDNA           62          3     2093
 3  alizadeh-v3    cDNA           62          4     2093
 4  bittner        cDNA           38          2     2201
 5  bredel         cDNA           50          3     1739
 6  chen           cDNA          180          2       85
 7  garber         cDNA           66          4     4553
 8  khan           cDNA           83          4     1069
 9  lapointe-v1    cDNA           69          3     1625
10  lapointe-v2    cDNA          110          4     2496
11  liang          cDNA           37          3     1411
12  risinger       cDNA           42          4     1771
13  tomlins-v1     cDNA          104          5     2315
14  tomlins-v2     cDNA           92          4     1288
15  armstrong-v1   Affy           72          2     1081
16  armstrong-v2   Affy           72          3     2194
17  bhattacharjee  Affy          203          5     1543
18  chowdary       Affy          104          2      182
19  dyrskjot       Affy           40          3     1203
20  golub-v1       Affy           72          2     1877
21  golub-v2       Affy           72          3     1877
22  gordon         Affy          181          2     1626
23  laiho          Affy           37          2     2202
24  nutt-v1        Affy           50          4     1377
25  nutt-v2        Affy           28          2     1070
26  nutt-v3        Affy           22          2     1152
27  pomeroy-v1     Affy           34          2      857
28  pomeroy-v2     Affy           42          5     1379
29  ramaswamy      Affy          190         14     1363
30  shipp          Affy           77          2      798
31  singh          Affy          102          2      339
32  su             Affy          174         10     1571
33  west           Affy           49          2     1198
34  yeoh-v1        Affy          248          2     2526
35  yeoh-v2        Affy          248          6     2526

We compared BUTIA to five different classifiers.
Four of them are decision tree induction algorithms:
Oblique Classifier 1 (OC1) [9] (oblique trees),
Functional Trees (FT) [13] (logistic regression trees),
Classification and Regression Trees (CART) [7]
(univariate trees) and C4.5 [8] (univariate trees). The
fifth classifier, SMO [12], is an implementation of the
SVM method, which is the state-of-the-art algorithm
for cancer classification in gene expression data [17].


We make use of the following implementations: (i)
OC1 - publicly available code at the authors' website (footnote 1),
and (ii) FT, CART, C4.5 and SMO - publicly available
code implemented in Java and found within the Weka
toolkit [18]. The parameters used were the default
ones. Each of the six classifiers was evaluated by its
generalization capability (accuracy rates), which was
estimated using 10-fold cross-validation.
In order to provide some reassurance about the
validity and non-randomness of the obtained results,
we present the results of statistical tests by following
the approach proposed by Demsar [19]. In brief, this
approach seeks to compare multiple algorithms on
multiple datasets, and it is based on the use of the
Friedman test with a corresponding post-hoc test. The
Friedman test is a non-parametric counterpart of the
well-known ANOVA. If the null hypothesis, which
states that the classifiers under study present similar
performances, is rejected, then we proceed with the
Nemenyi post-hoc test for pairwise comparisons.
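The quantities behind this procedure can be sketched as follows, using the average-rank form of the Friedman statistic and the Nemenyi critical difference described in [19]. This is only an illustrative sketch with hypothetical function names; the critical value `q_alpha` must be read from the table in [19] and is passed in as a parameter rather than computed.

```python
import math
from typing import List, Sequence


def average_ranks(scores: Sequence[Sequence[float]]) -> List[float]:
    """scores[d][a] is the accuracy of algorithm a on dataset d.

    Rank 1 is the best accuracy; tied algorithms share the mean of their ranks.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda a: -row[a])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1      # mean of ranks pos+1 .. end+1
            for a in order[pos:end + 1]:
                totals[a] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(avg_ranks: Sequence[float], n_datasets: int) -> float:
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k: int, n_datasets: int, q_alpha: float) -> float:
    """Critical difference: two algorithms differ if their average ranks differ by more."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))
```

For six classifiers and 35 datasets, and assuming the critical value 2.850 for a 0.05 significance level (to be checked against the table in [19]), `nemenyi_cd(6, 35, 2.850)` is roughly 1.27, so two methods would be declared significantly different when their average ranks are more than about 1.27 apart.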
The experimental results are summarized in Table II.
The average accuracies obtained in the 10-fold
cross-validation procedure are presented, as well as the
corresponding standard deviations. Additionally, the
number of times each method was within the top-three
best accuracies is shown in the bottom part of the table.

1 http://www.cbcb.umd.edu/salzberg/announce-oc1.html

Table II
ACCURACY ANALYSIS OF THE 35 DATASETS (AVERAGE ± S.D.).

Id  BUTIA        OC1          FT           CART         J48          SMO
 1  0.94 ± 0.14  0.72 ± 0.19  0.89 ± 0.15  0.69 ± 0.26  0.69 ± 0.20  0.94 ± 0.14
 2  1.00 ± 0.00  0.87 ± 0.12  0.99 ± 0.05  0.90 ± 0.13  0.89 ± 0.14  1.00 ± 0.00
 3  0.94 ± 0.10  0.66 ± 0.20  0.82 ± 0.14  0.71 ± 0.10  0.70 ± 0.16  0.94 ± 0.08
 4  0.92 ± 0.14  0.56 ± 0.23  0.81 ± 0.18  0.58 ± 0.14  0.56 ± 0.23  0.88 ± 0.13
 5  0.86 ± 0.13  0.76 ± 0.08  0.78 ± 0.11  0.76 ± 0.13  0.84 ± 0.13  0.86 ± 0.10
 6  0.84 ± 0.08  0.82 ± 0.07  0.94 ± 0.04  0.85 ± 0.06  0.84 ± 0.06  0.94 ± 0.07
 7  0.85 ± 0.12  0.74 ± 0.13  0.82 ± 0.13  0.76 ± 0.12  0.80 ± 0.10  0.80 ± 0.14
 8  0.98 ± 0.05  0.79 ± 0.14  1.00 ± 0.00  0.83 ± 0.11  0.87 ± 0.09  0.99 ± 0.04
 9  0.88 ± 0.12  0.80 ± 0.07  0.76 ± 0.12  0.77 ± 0.07  0.72 ± 0.16  0.85 ± 0.15
10  0.86 ± 0.09  0.66 ± 0.19  0.84 ± 0.09  0.65 ± 0.08  0.63 ± 0.18  0.85 ± 0.08
11  1.00 ± 0.00  0.68 ± 0.28  0.93 ± 0.12  0.76 ± 0.25  0.79 ± 0.20  0.98 ± 0.08
12  0.74 ± 0.15  0.57 ± 0.18  0.78 ± 0.18  0.53 ± 0.25  0.45 ± 0.24  0.81 ± 0.16
13  0.90 ± 0.11  0.61 ± 0.11  0.87 ± 0.12  0.54 ± 0.15  0.55 ± 0.10  0.93 ± 0.11
14  0.90 ± 0.08  0.53 ± 0.22  0.82 ± 0.14  0.59 ± 0.13  0.56 ± 0.17  0.91 ± 0.07
15  0.99 ± 0.05  0.71 ± 0.38  0.96 ± 0.07  0.91 ± 0.09  0.90 ± 0.07  0.99 ± 0.05
16  0.97 ± 0.06  0.74 ± 0.11  0.90 ± 0.10  0.75 ± 0.15  0.76 ± 0.09  0.96 ± 0.07
17  0.97 ± 0.03  0.92 ± 0.06  0.97 ± 0.03  0.89 ± 0.09  0.91 ± 0.08  0.96 ± 0.05
18  0.92 ± 0.08  0.39 ± 0.51  0.95 ± 0.05  0.97 ± 0.05  0.93 ± 0.07  0.96 ± 0.05
19  0.85 ± 0.24  0.68 ± 0.17  0.65 ± 0.24  0.75 ± 0.17  0.73 ± 0.25  0.90 ± 0.17
20  0.93 ± 0.07  0.83 ± 0.13  0.89 ± 0.11  0.85 ± 0.14  0.86 ± 0.13  0.97 ± 0.06
21  0.94 ± 0.07  0.81 ± 0.09  0.94 ± 0.07  0.92 ± 0.07  0.96 ± 0.07  0.94 ± 0.07
22  0.99 ± 0.02  0.96 ± 0.02  0.98 ± 0.03  0.95 ± 0.02  0.95 ± 0.04  0.99 ± 0.02
23  0.88 ± 0.16  0.83 ± 0.22  0.80 ± 0.21  0.76 ± 0.18  0.89 ± 0.14  0.97 ± 0.11
24  0.56 ± 0.18  0.24 ± 0.31  0.66 ± 0.21  0.58 ± 0.18  0.56 ± 0.21  0.72 ± 0.25
25  0.57 ± 0.21  0.80 ± 0.22  0.37 ± 0.07  0.85 ± 0.20  0.82 ± 0.20  0.93 ± 0.14
26  0.73 ± 0.34  0.73 ± 0.24  0.68 ± 0.23  0.63 ± 0.20  0.60 ± 0.33  1.00 ± 0.00
27  0.89 ± 0.14  0.57 ± 0.50  0.79 ± 0.15  0.73 ± 0.19  0.73 ± 0.29  0.92 ± 0.14
28  0.77 ± 0.15  0.63 ± 0.18  0.79 ± 0.18  0.53 ± 0.17  0.59 ± 0.17  0.79 ± 0.21
29  0.62 ± 0.08  0.36 ± 0.27  0.65 ± 0.11  0.57 ± 0.08  0.62 ± 0.12  0.72 ± 0.08
30  0.89 ± 0.10  0.71 ± 0.10  0.80 ± 0.14  0.78 ± 0.11  0.81 ± 0.13  0.93 ± 0.09
31  0.77 ± 0.07  0.76 ± 0.11  0.89 ± 0.10  0.74 ± 0.10  0.82 ± 0.10  0.92 ± 0.06
32  0.93 ± 0.05  0.65 ± 0.15  0.89 ± 0.09  0.76 ± 0.12  0.81 ± 0.10  0.90 ± 0.05
33  0.84 ± 0.18  0.68 ± 0.36  0.90 ± 0.14  0.78 ± 0.22  0.86 ± 0.10  0.86 ± 0.16
34  0.98 ± 0.04  0.99 ± 0.03  0.99 ± 0.03  0.99 ± 0.03  0.99 ± 0.02  0.98 ± 0.03
35  0.87 ± 0.04  0.75 ± 0.10  0.85 ± 0.06  0.76 ± 0.10  0.70 ± 0.10  0.84 ± 0.08

#1  13           0            5            1            2            19
#2  9            2            12           2            2            12
#3  6            4            11           6            5            3

Results suggest that BUTIA and SMO achieved the
best overall performances. The ranking provided by
the Friedman test supports this assumption, showing
SMO as the best-ranked method, followed by BUTIA,
FT, C4.5, CART and OC1. The Friedman test also
indicates the rejection of the null hypothesis, i.e.,
there is a statistically significant difference among
the algorithms (p-value = $3.49 \times 10^{-21}$). Hence, we
have executed the Nemenyi post-hoc test for pairwise
comparison, as depicted in Table III. Notice that
SMO outperforms all algorithms but BUTIA with
statistical significance. BUTIA and FT, on the other
hand, outperform C4.5, CART and OC1 with statistical
significance, but one does not outperform the other.
Roughly speaking, BUTIA is a better option than FT
because it is not outperformed by SMO significantly.
Moreover, the trees generated by FT (whose nodes,
both internal and leaves, hold logistic regression
models) are harder to interpret than the oblique
trees provided by BUTIA, which is a clear advantage
of our method. BUTIA seems to be an interesting
alternative to SMO since it is more comprehensible
than standard SVM. Recall that, in an oblique tree,
one can follow the linear models through the internal
nodes until reaching a class-label in a leaf node. This
path from the root to the leaves can be seamlessly
transformed into a set of interpretable rules. In FT, both
internal and leaf nodes can hold logistic regression
models, and leaves can hold more than one model
for characterizing distinct classes. This collection of
models can considerably harm interpretability. SVM,
in turn, when trained with a linear kernel, is only
interpretable in binary-class problems (it becomes
equivalent to a regular linear model). But in problems
with more than two classes, SVM has to test the
results in pairwise class combinations, solving the final
classification after a majority voting system. This kind
of system also harms model comprehensibility, and it
is one of the reasons why SVM is considered by many
as a black-box technique [20].
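The way a path through an oblique tree turns into a rule can be sketched as below. The sketch assumes, purely for illustration, that an instance goes to the left child when the linear test evaluates to at most zero and to the right child otherwise; `TreeNode` and `classify` are hypothetical names, not part of BUTIA's code.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class TreeNode:
    label: Optional[str] = None            # class label, set on leaves only
    w0: float = 0.0                        # bias of the linear test
    w: Sequence[float] = ()                # coefficients of the linear test
    left: Optional["TreeNode"] = None      # taken when w0 + w.x <= 0 (assumed)
    right: Optional["TreeNode"] = None     # taken otherwise


def classify(root: TreeNode, x: Sequence[float]) -> Tuple[str, List[str]]:
    """Return the predicted class and the tests satisfied on the way down.

    The conjunction of the returned tests is the rule that covers x.
    """
    node, conditions = root, []
    while node.label is None:
        value = node.w0 + sum(wi * xi for wi, xi in zip(node.w, x))
        terms = " ".join(f"{wi:+g}*x{i + 1}" for i, wi in enumerate(node.w))
        if value <= 0:
            conditions.append(f"{node.w0:g} {terms} <= 0")
            node = node.left
        else:
            conditions.append(f"{node.w0:g} {terms} > 0")
            node = node.right
    return node.label, conditions
```

Each prediction is thus explained by one conjunction of linear tests, one per internal node on the path, with no voting among several pairwise models.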
In summary, BUTIA was shown to be competitive
with the state-of-the-art technique for gene expression
analysis, SVM, with the further advantage of being a
comprehensible model. The advantages of presenting
a comprehensible model to the domain specialist (in
this case, the biologist) are well-described in [21], and
they are easily generalizable for gene expression data.
Table III
NEMENYI PAIRWISE COMPARISON RESULTS.
[Table III: 6 x 6 grid whose rows and columns are BUTIA, C4.5, OC1,
SMO, FT and CART; the individual cell marks are not recoverable.]
N - The algorithm in the column outperforms the
one in the row with statistical significance at a
95% confidence level.

IV. RELATED WORK


There are many works that propose top-down
induction of oblique decision trees in the literature,
and we briefly review some of them as follows.
Classification and Regression Trees (CART) [7] is
one of the first systems that allowed multivariate
splits. It employs a hill-climbing strategy with
a backward attribute elimination for finding
good (albeit suboptimal) linear combinations
of attributes in non-terminal nodes. It is a
fully-deterministic algorithm with no built-in
mechanisms to escape local-optima.

Simulated Annealing of Decision Trees (SADT) [3]
is a system that employs simulated annealing (SA)
for finding good coefficient values for attributes in
non-terminal nodes of decision trees. First, it places a
hyperplane in a canonical location, and then iteratively
perturbs the coefficients in small random amounts
guided by the SA algorithm. Although SADT can
eventually escape from local-optima, its efficiency is
compromised since it may consider tens of thousands
of hyperplanes in a single node during annealing.
Oblique Classifier 1 (OC1) [9] is yet another
top-down oblique decision tree system. It is a thorough
extension of CART's oblique decision tree strategy.
OC1 presents the advantage of being more efficient
than the previously described systems. It searches for
the best univariate split as well as the best oblique
split, and it only employs the oblique split when
it improves over the univariate split. It uses both a
deterministic heuristic search (as employed in CART)
for finding local-optima and a non-deterministic search
(as employed in SADT - though not SA) for escaping
local-optima. Ittner [22] proposes using OC1 over
an augmented attribute space, generating non-linear
decision trees. The key idea involved is to build new
attributes by considering all possible pairwise products
and squares of the original set of n attributes.
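The attribute augmentation can be sketched in a few lines. The sketch assumes that the original attributes are kept alongside the new ones, which is how we read "augmented attribute space"; the function name `augment` is hypothetical and the code is not taken from [22].

```python
from itertools import combinations_with_replacement
from typing import List, Sequence


def augment(x: Sequence[float]) -> List[float]:
    """Original attributes followed by all products x_i * x_j with i <= j.

    Pairs with i == j give the squares, pairs with i < j the pairwise products.
    """
    products = [x[i] * x[j]
                for i, j in combinations_with_replacement(range(len(x)), 2)]
    return list(x) + products
```

For n original attributes this yields n + n(n + 1)/2 attributes, so a linear (oblique) split in the augmented space corresponds to a quadratic boundary in the original one, at the cost of a much larger attribute space.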
A more recent approach for top-down oblique trees
that also employs SVM for generating hyperplanes is
SVM-ODT [23]. It is an extension to the original C4.5
algorithm, allowing the use of either a univariate split
or an SVM-generated oblique split, according to the
impurity-measure values. Several methods are applied
to avoid model overfitting, such as reduced-error
pruning and MDL computation, resulting in an
overly complex algorithm. Model overfitting is directly
related to the very nature of the top-down strategy.
Another recently explored strategy for inducing
oblique decision trees is through evolutionary
algorithms (EAs). The interested reader can refer
to [24] for works that employ EAs for generating the
hyperplanes in top-down oblique tree inducers.
Bottom-up induction of decision trees is practically
not explored in the research community. The first
work to present the concepts of bottom-up induction
is Landeweerd et al. [10], where the authors propose
growing a decision tree from the leaves to the root,
assuming that each class is represented by a leaf
node, and that the most similar nodes are recursively
united into a parent node. This approach is too
simplistic, because it allows only a single leaf per class,
which means that binary-class problems will always be
modeled by a 3-node tree. We believe this deficiency is
the main reason why researchers have been discouraged
from further investigating the bottom-up induction of
decision trees.
The bottom-up strategy has only been employed
for pruning decision trees [25], or for evaluating
Alternating Decision Trees (ADTree) [26], topics that
are not investigated in this paper. Our approach,
presented in Section II, intends to expand the work
of Landeweerd et al. [10] in such a way that it can
be effectively applied to distinct problem domains.
Particularly, we solve the one-leaf-per-class problem by
performing clustering in each class of the problem, and
we solve the hyperplane generation task by employing
SVMs in each internal node.
V. CONCLUSIONS AND FUTURE WORK
In this work, we have presented a novel bottom-up
oblique decision tree induction algorithm, named
BUTIA, which makes use of well-known machine
learning algorithms, namely EM [14] and SMO [12].
Due to its bottom-up strategy for building the
tree, BUTIA presents some interesting advantages
over other top-down algorithms, such as robustness
to imbalanced data and to data overfitting. Indeed,
BUTIA does not require the further execution of a
pruning procedure, which is the case of virtually
every top-down decision tree algorithm. The use
of SVM-generated hyperplanes within internal nodes
guarantees that each hyperplane maximizes the
boundary margin between instances from different
classes, i.e., it guarantees convergence to the global
optimum solution per node division. This is not
true for other optimization techniques employed in
algorithms such as OC1 [9], CART [7] and SADT [3],
which do not guarantee convergence to global optima.
We have tested BUTIA in 35 gene expression
benchmarking datasets [15]. Experimental results
indicated that BUTIA outperformed traditional
algorithms, such as C4.5 [8], CART [7] and OC1 [9],
with statistical significance regarding accuracy,
according to the Friedman and Nemenyi tests, as
recommended in [19]. BUTIA is competitive with SMO
[12], since there was no significant difference between
both methods regarding accuracy. Since BUTIA is just
as interpretable as any oblique decision tree, it can be
seen as an interpretable alternative to standard SVM
(a technique known to be a black-box approach [20]),
which is currently considered to be the state-of-the-art
algorithm for classifying gene expression data [17].
This work has opened several avenues for future
research, as follows. We plan to investigate the
impact of using different clustering algorithms for
the generation of the leaves, as well as different
strategies for automatically defining the number of
clusters (leaves) for each class. In addition, we intend
to test different methods for generating the separating
hyperplanes, such as Fisher's linear discriminant
analysis (FLDA). The reason we did not execute
FLDA in the first place was its restriction of
dealing exclusively with linearly-separable problems.
Nevertheless, we believe that in certain applications,
in which a low computational cost is required,
FLDA may be a useful replacement of SVM. Finally,
it is interesting to investigate the performance
of BUTIA in more complex problems, such as
hierarchical multilabel classification [27].
ACKNOWLEDGEMENT
Our thanks to Brazilian research agencies CAPES,
CNPq and FAPESP for supporting this research.
REFERENCES
[1] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to
Data Mining. Addison-Wesley, 2005.
[3] D. Heath, S. Kasif, and S. Salzberg, "Induction of
oblique decision trees," J ARTIF INTELL RES, vol. 2,
pp. 1-32, 1993.
[4] S. K. Murthy, S. Kasif, and S. S. Salzberg, "A System for
Induction of Oblique Decision Trees," J ARTIF INTELL
RES, vol. 2, pp. 1-32, 1994.
[5] L. Rokach and O. Maimon, "Top-down induction of
decision trees classifiers - a survey," IEEE T SYST MAN
CY C, vol. 35, no. 4, pp. 476-487, 2005.
[6] P. Utgoff and C. Brodley, "An incremental method for
finding multivariate splits for decision trees," in 7th Int.
Conf. on Machine Learning, 1990, pp. 58-65.
[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone,
Classification and Regression Trees. Wadsworth, 1984.
[8] J. R. Quinlan, C4.5: programs for machine learning. San
Francisco, CA, USA: Morgan Kaufmann, 1993.
[9] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, "OC1:
A Randomized Induction of Oblique Decision Trees," in
AAAI, 1993, pp. 322-327.
[10] G. Landeweerd, T. Timmers, E. Gelsema, M. Bins,
and M. Halie, "Binary tree versus single level tree
classification of white blood cells," PATTERN RECOGN,
vol. 16, no. 6, pp. 571-577, 1983.
[11] C. Cortes and V. Vapnik, "Support-vector networks,"
MACH LEARN, vol. 20, pp. 273-297, 1995.
[12] J. C. Platt, "Fast training of support vector machines using
sequential minimal optimization." Cambridge, MA, USA:
MIT Press, 1999, pp. 185-208.
[13] J. Gama, "Functional trees," MACH LEARN, vol. 55, pp.
219-250, 2004.

[14] A. P. Dempster, N. M. Laird, and D. B. Rubin,
"Maximum Likelihood from Incomplete Data via the
EM Algorithm," J R STAT SOC, vol. 39, pp. 1-38, 1977.
[15] M. de Souto, I. Costa, D. de Araujo, T. Ludermir, and
A. Schliep, "Clustering cancer gene expression data: a
comparative study," BMC BIOINF, vol. 9, p. 497, 2008.
[16] A. L. Tarca, R. Romero, and S. Draghici, "Analysis of
microarray experiments of gene expression profiling,"
AM J OBSTET GYNECOL, vol. 195, pp. 373-388, 2006.
[17] A. Statnikov, L. Wang, and C. Aliferis, "A
comprehensive comparison of random forests and
support vector machines for microarray-based cancer
classification," BMC BIOINF, vol. 9, no. 1, p. 319, 2008.
[18] I. H. Witten and E. Frank, Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations.
Morgan Kaufmann, October 1999.
[19] J. Demsar, "Statistical comparisons of classifiers over
multiple data sets," J MACH LEARN RES, vol. 7, pp.
1-30, 2006.
[20] A. Navia-Vazquez and E. Parrado-Hernandez, "Support
vector machine interpretation," NEUROCOMP, vol. 69,
no. 13-15, pp. 1754-1759, 2006.
[21] A. A. Freitas, D. C. Wieser, and R. Apweiler, "On the
importance of comprehensible classification models for
protein function prediction," IEEE ACM T COMPUT BI,
vol. 7, pp. 172-182, January 2010.
[22] A. Ittner, "Non-linear decision trees-NDT," in 13th Int.
Conf. on Machine Learning, 1996, pp. 1-6.
[23] V. Menkovski, I. Christou, and S. Efremidis, "Oblique
decision trees using embedded support vector machines
in classifier ensembles," in 7th IEEE Int. Conf. on
Cybernetic Intelligent Systems, Sept. 2008, pp. 1-6.
[24] R. C. Barros, M. P. Basgalupp, A. C. P. L. F. Carvalho,
and A. A. Freitas, "A survey of evolutionary algorithms
for decision tree induction," To appear in IEEE T SYST
MAN CY C, pp. 1-21, 2011.
[25] F. Esposito, D. Malerba, and G. Semeraro, "A
Comparative Analysis of Methods for Pruning Decision
Trees," IEEE T PATTERN ANAL, vol. 19, no. 5, pp.
476-491, 1997.
[26] B. Yang, T. Wang, D. Yang, and L. Chang, "BOAI:
Fast Alternating Decision Tree Induction Based on
Bottom-Up Evaluation," in LNCS. Springer, 2008, pp.
405-416.
[27] R. Cerri and A. C. P. L. F. Carvalho, "Hierarchical
multilabel classification using top-down label
combination and artificial neural networks," in XI
SBRN, 2010, pp. 253-258.