
A Bottom-Up Oblique Decision Tree Induction Algorithm

Rodrigo C. Barros, Ricardo Cerri, Pablo A. Jaskowiak, André C. P. L. F. de Carvalho

Department of Computer Science, ICMC
University of São Paulo (USP)
São Carlos - SP, Brazil
{rcbarros,cerri,pablo,andre}@icmc.usp.br

Abstract-Decision tree induction algorithms are widely used in knowledge discovery and data mining, especially in scenarios where model comprehensibility is desired.

A variation of the traditional univariate approach is the

so-called oblique decision tree, which allows multivariate

tests in its non-terminal nodes. Oblique decision trees

can model decision boundaries that are oblique to the

attribute axes, whereas univariate trees can only perform

axis-parallel splits. The majority of the oblique and

univariate decision tree induction algorithms perform a

top-down strategy for growing the tree, relying on an

impurity-based measure for splitting nodes. In this paper,

we propose a novel bottom-up algorithm for inducing

oblique trees named BUTIA. It does not require an

impurity measure for dividing nodes, since we know a

priori the data resulting from each split. For generating the

splitting hyperplanes, our algorithm implements a support

vector machine solution, and a clustering algorithm is used

for generating the initial leaves. We compare BUTIA to

traditional univariate and oblique decision tree algorithms,

C4.5, CART, OC1 and FT, as well as to a standard SVM

implementation, using real gene expression benchmark

data. Experimental results show the effectiveness of the

proposed approach in several cases.

Keywords-oblique decision trees; bottom-up induction; clustering; SVM; hybrid intelligent systems

I. INTRODUCTION

Decision tree induction algorithms are widely used

in a variety of domains for knowledge discovery

and pattern recognition. The induced knowledge in

the form of hierarchical trees can be regarded as

a disjunction of conjunctions of constraints on the

attribute values [1]. Each path from the root to a leaf

is actually a conjunction of attribute tests, and the

tree itself allows the choice of different paths, i.e., a

disjunction of these conjunctions. Such a representation

is intuitive and easy to assimilate by humans, which

partially explains the large number of studies that

make use of these techniques. Another reason for their

popularity is their good predictive accuracy in several

application domains, such as medical diagnosis and

credit risk assessment [2].

A major issue in decision tree induction is which

attribute(s) to choose for splitting an internal node.

For axis-parallel decision trees (also known as univariate), the problem is to choose the attribute

that better discriminates the input data. A decision

rule based on such an attribute is thus generated, and

the input data is filtered according to the consequents

of this rule. For oblique decision trees (also known

as multivariate), the goal is to find a combination of

attributes with good discriminatory power.

Oblique decision trees are not as popular as the

univariate ones, mainly because they are harder

to interpret. Nevertheless, researchers argue that

multivariate splits can improve the performance of

the tree in several datasets, while generating smaller

trees [3]–[5]. Clearly, there is a tradeoff to consider in

allowing multivariate tests: simple tests may result in

large trees that are hard to understand, yet multivariate

tests may result in small trees with tests that are hard

to understand [6].

One of the advantages of oblique decision trees is

that they are able to produce polygonal (polyhedral)

partitions of the attribute space, i.e., hyperplanes

at an oblique orientation to the attribute axes.

Univariate trees, on the other hand, can only produce

hyper-rectangles parallel to the attribute axes. The tests

at each node of an oblique tree have the form $w_0 + \sum_{i=1}^{m} w_i x_{ji} \leq 0$, where $w_i$ is a real-valued coefficient associated with the $i$th attribute of a given instance $x_j$, and $w_0$ is the disturbance coefficient (bias) of the test.
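As a minimal illustration of evaluating such a test (the weights below are arbitrary toy values, not taken from the paper):

import numpy as np

def oblique_test(x, w, w0):
    # Oblique split: true when w0 + sum_i w_i * x_i <= 0.
    return w0 + np.dot(w, x) <= 0

w, w0 = np.array([1.0, -2.0]), 0.5
print(oblique_test(np.array([0.2, 0.9]), w, w0))   # True: 0.5 + 0.2 - 1.8 <= 0

A univariate (axis-parallel) split is the special case in which w has a single non-zero coefficient.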

For either the growth of oblique or axis-parallel

decision trees, there is a clear preference in the

literature for algorithms that rely on a greedy,

top-down, recursive partitioning strategy, i.e.,

top-down induction. The most well-known algorithms

for decision tree induction indeed implement this

strategy, e.g., CART [7], C4.5 [8] and OC1 [9]. These

algorithms make use of impurity-based criteria to

decide which attribute(s) will split the data in purer

subsets (a pure subset is one whose instances belong to

the same class). Since these algorithms are top-down,

it is not possible to know a priori which instances will

result in each subset of a partition. Thus, in top-down

induction, trees are usually grown until every leaf node is pure, and a pruning procedure is then applied to simplify the tree and avoid data overfitting.

Works that implement a bottom-up strategy are quite

rare in the literature. The key ideas behind bottom-up

induction were first presented by Landeweerd et

al. [10]. The authors propose growing a decision tree

from the leaves to the root, assuming that each class

is represented by a leaf node, and that the closest

nodes (according to the Mahalanobis distance) are

recursively merged into a parent node. Albeit simple,

their approach presents several deficiencies, e.g., it

allows only a single leaf per class, which means that

binary-class problems will always be modeled by a

3-node tree. This is quite problematic since there are

complex binary-class problems in which a 3-node

decision tree cannot accurately model the attribute

space. We believe this deficiency is one of the reasons

researchers have been discouraged from further investigating the

bottom-up induction of decision trees. Another reason

may be the extra computational effort required to

compute the costly Mahalanobis distance.
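For concreteness, a small sketch of the distance in question, on made-up data; the cost lies mainly in estimating and inverting the covariance matrix, which a bottom-up inducer in that style would redo as nodes are merged:

import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))           # toy instances of one class
B = rng.normal(loc=1.0, size=(60, 10))  # toy instances of another class

# Pooled covariance of both groups; inverting it is the costly part
# (cubic in the number of attributes m).
cov = np.cov(np.vstack([A, B]), rowvar=False)
VI = np.linalg.inv(cov)

d = mahalanobis(A.mean(axis=0), B.mean(axis=0), VI)
print(f"Mahalanobis distance between class centroids: {d:.3f}")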

In this paper, we propose alternatives to solve the

deficiencies of the typical bottom-up approach. For

instance, we propose the application of a well-known

clustering algorithm to allow each class to be

represented by more than one leaf node. In addition,

we incorporate in our algorithm a support vector

machine (SVM) [11] solution to build the hyperplane

that divides the data within each non-terminal node

of the oblique decision tree. We call our approach

BUTIA (Bottom-Up oblique Tree Induction Algorithm),

and we evaluate its performance in gene expression

benchmark datasets.

This paper is organized as follows. In Section II

we detail the proposed algorithm, which combines

clustering and SVM for generating oblique decision

trees. In Section III we conduct a comparison

between BUTIA and the traditional top-down decision tree

induction algorithms C4.5 [8], CART [7] and OC1 [9].

Additionally, we compare BUTIA to Sequential

Minimal Optimization (SMO) [12] and Functional

Trees [13]. Section IV presents studies related to our

approach, whereas in Section V we discuss the main

conclusions of this work.

II. BUTIA

We propose a new bottom-up oblique decision

tree induction algorithm, named BUTIA (Bottom-Up

oblique Tree Induction Algorithm). It employs two

machine learning algorithms in different steps of tree

growth, namely Expectation-Maximization (EM) [14]

and Sequential Minimal Optimization (SMO) [12]. Our

motivation for building bottom-up trees is twofold:

first, we have a priori information on which group of instances belongs to a

given node of the tree. It means we know the result of

each node split before even generating the separating

hyperplane. In fact, our algorithm uses this a priori

information for generating hyperplanes that maximize

the separation margin between instances of two

nodes. Hence, there is no need to rely on an impurity measure to evaluate the goodness of a split. Second, the top-down induction strategy usually

overgrows the decision tree until every leaf node

is pure, and then a pruning procedure is responsible

for simplifying the tree in order to avoid data

overfitting. In bottom-up induction, a pruning step

is not necessary because we are not overgrowing the

tree. Since we start growing the tree from the leaves

to the root, our approach reduces significantly the

chances of overfitting by clustering the data instances.

Given a set of instances $X = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^m$, and a set of classes $C = \{c_1, \ldots, c_l\}$, the

BUTIA algorithm (Algorithm 1) works according to the

following steps:

1) Divide the training data into pure subsets, i.e., instances belonging to the same subset have the same class label. Formally, for an $l$-class problem, generate $l$ subsets $S_i = \{x \mid C(x) = c_i\}$, where $C(x_j)$ returns the class label of instance $x_j$ (procedure butia, line 4).

2) Apply EM over each subset $S_i$, $1 \leq i \leq l$. To define the number of clusters for each subset, we perform a 10-fold cross-validation procedure, evaluating the partitions' stability according to the log-likelihood measure (procedure butia, line 5).

3) Compute the centroid (i.e., the cluster mean

vector) of each cluster generated in step 2. Each of

the previously generated clusters becomes a leaf node

(procedure butia, lines 6-11).

4) Generate a new internal node by merging the

two most similar nodes, i.e., those whose centroids are

closest according to the Euclidean distance (procedure

merging, lines 9-19). Once a node is created, its

instances (that came up from its children) have their

class label replaced by a new unique meta-class

(procedure merging, line 20). During the merging

procedure two constraints are imposed: (i) only nodes

comprised of instances from different classes are

allowed to be merged; and (ii) nodes can be merged

only once.

5) Compute the new node's centroid and generate the separating hyperplane by training a linear-kernel SVM over the node's instances (procedure merging,

lines 21-22). The resulting hyperplane guarantees the

separability of the training instances that arrive at this

node. Recall that the node's children hold instances from two different classes (for leaf nodes) or meta-classes (for internal nodes). Steps 4 and 5 are repeated recursively until a root node is reached.

Algorithm 1: BUTIA

procedure butia(Dataset X, Classes C)
input : a dataset X and a set of classes C
output: the oblique decision tree
1   begin
2       Leaves ← ∅
3       foreach class ci ∈ C do
4           Si ← {x | C(x) = ci}
5           Partition ← result of EM over Si
6           foreach cluster ∈ Partition do
7               create new leaf node n
8               n.instances ← instances from cluster
9               n.centroid ← mean vector of n.instances
10              n.class ← ci
11              Leaves ← Leaves ∪ {n}
12      Tree ← merging(Leaves)
13      return Tree

procedure merging(L)
input : a set L of nodes
output: the oblique decision tree root node
1   begin
2       if |L| = 1 then
3           return the unique node in L
4       else
5           s1 ← −1
6           s2 ← −1
7           smallestDistance ← ∞
8           distance ← 0
9           foreach i, j ∈ L with i ≠ j do
10              if i.class ≠ j.class then
11                  distance ← distance between i and j
12                  if distance < smallestDistance then
13                      smallestDistance ← distance
14                      s1 ← i
15                      s2 ← j
16          create new internal node t
17          t.leftChild ← s1
18          t.rightChild ← s2
19          t.instances ← s1.instances ∪ s2.instances
20          t.class ← new meta-class
21          t.centroid ← mean vector of t.instances
22          t.hyperplane ← SVM hyperplane for t.instances
23          L ← L ∪ {t}
24          L ← L \ {s1}
25          L ← L \ {s2}
26          return merging(L)
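A minimal Python sketch of Algorithm 1 follows. It is illustrative rather than the actual implementation: scikit-learn's GaussianMixture stands in for the EM implementation (with BIC replacing the 10-fold log-likelihood criterion for selecting the number of clusters), a linear-kernel sklearn.svm.SVC stands in for SMO, and the Node class and helper names are hypothetical.

import itertools
import numpy as np
from dataclasses import dataclass, field
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

@dataclass(eq=False)
class Node:
    instances: np.ndarray               # training instances routed to this node
    label: object                       # class label (leaf) or meta-class (internal)
    centroid: np.ndarray = field(init=False)
    left: "Node" = None
    right: "Node" = None
    svm: SVC = None                     # separating hyperplane (internal nodes only)
    def __post_init__(self):
        self.centroid = self.instances.mean(axis=0)

def make_leaves(X, y, max_clusters=3, seed=0):
    # Steps 1-3: split the data into pure subsets and cluster each one with EM;
    # BIC stands in here for the paper's 10-fold log-likelihood criterion.
    leaves = []
    for c in np.unique(y):
        S = X[y == c]
        models = [GaussianMixture(k, random_state=seed).fit(S)
                  for k in range(1, min(max_clusters, len(S)) + 1)]
        best = min(models, key=lambda g: g.bic(S))
        assign = best.predict(S)
        leaves += [Node(S[assign == k], c)
                   for k in range(best.n_components) if np.any(assign == k)]
    return leaves

def merging(nodes, meta=itertools.count()):
    # Steps 4-5: merge the two closest nodes holding distinct (meta-)classes
    # and fit a maximum-margin hyperplane between their instance sets.
    if len(nodes) == 1:
        return nodes[0]
    pairs = [(a, b) for a, b in itertools.combinations(nodes, 2)
             if a.label != b.label]
    s1, s2 = min(pairs, key=lambda p: np.linalg.norm(p[0].centroid - p[1].centroid))
    t = Node(np.vstack([s1.instances, s2.instances]), f"meta-{next(meta)}")
    t.left, t.right = s1, s2
    side = np.r_[np.zeros(len(s1.instances)), np.ones(len(s2.instances))]
    t.svm = SVC(kernel="linear").fit(t.instances, side)   # stands in for SMO
    return merging([n for n in nodes if n is not s1 and n is not s2] + [t], meta)

def predict(root, x):
    # Route an instance through the hyperplanes down to a leaf (class) node.
    while root.svm is not None:
        root = root.left if root.svm.predict(x.reshape(1, -1))[0] == 0 else root.right
    return root.label

# Toy usage on two Gaussian classes in the plane (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.r_[np.zeros(40), np.ones(40)]
root = merging(make_leaves(X, y))
print(predict(root, np.array([1.9, 2.1])))    # expected: 1.0

Note that, as in the pseudocode, a merged node receives a brand-new meta-class, so the constraint that only nodes of different (meta-)classes are merged is enforced by the label test inside merging.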

We believe our approach presents the following advantages over top-down methods:

It can handle imbalanced classes: unlike top-down approaches, BUTIA always generates at least one leaf node per class, which guarantees that rare classes are represented in the predictive model.

It can handle multiclass as well as binary problems, regardless of the number of classes in the original dataset, which means its application is straightforward and does not rely on a majority voting scheme between pairwise class subsets. This is particularly interesting in order to preserve the interpretability of the generated hyperplanes as rules.

It can model problems in which one class is described by more than one probability distribution. By clustering each pure subset, we can identify possibly distinct probability distributions, and thus generate separating hyperplanes for the closest inter-class boundaries. Note that this is not trivially achieved by other methods, especially those whose design is based on impurity measures.

Figure 1. Diagram of BUTIA's execution steps (the original figure shows a training dataset with binary attributes x1, ..., xn and classes A, B and C being clustered into leaf nodes A1...AkA, B1...BkB, C1...CkC and recursively merged into internal nodes up to the root).

In Figure 2, we detail how BUTIA would perform

in a synthetic binary-class dataset with m = 2. The

original data is presented in (a), and the clustering step

is performed in (b) within each class. The first class (x)

was clustered into two groups, as was the second (o),

resulting in 4 leaf nodes. Next, the centroids of each

node are calculated, and the closest nodes according

to the Euclidean distance are merged (remember that

nodes can only be merged when their (meta) classes

are distinct) generating a new node in the tree (c). Still

referring to (c), the new merged node has its centroid

computed, a meta-class is assigned to its instances,

and finally the corresponding hyperplane is generated.

The algorithm then merges the next two closest

centroids, repeating the same procedure of creating a

new node, assigning a new meta-class and generating

a new hyperplane (d and e). The algorithm terminates

when all nodes (but one, the root node) are merged,

resulting in the oblique partitions presented in (f).

Figure 2. BUTIA's step-by-step execution on a synthetic binary-class dataset: (a) original data; (b) clustering within each class; (c)-(e) successive merges with hyperplane generation; (f) resulting oblique partitions.

III. EXPERIMENTAL ANALYSIS

In order to evaluate our novel algorithm, we

consider a set of 35 publicly available datasets from

microarray gene expression data [15]. Microarray

technology enables expression level measurement for

thousands of genes in a parallel fashion, given a

biological tissue. Once combined, a fixed number

of microarray experiments are comprised in a gene

expression dataset.

The considered datasets are related to different types

or subtypes of cancer (e.g., prostate, lung and skin) and

comprise the two flavors in which the technology

is generally available: single channel (21 datasets) and

double channel (14 datasets) [16]. Hereafter we refer

to single channel microarrays as Affymetrix (Affy)

and double channel microarrays as cDNA, since the

data were collected using either of these technologies

[15]. Our final task consists in classifying different

examples (instances) according to their gene (attribute)

expression levels. The main characteristics of the

datasets are summarized in Table I.

We compared BUTIA to five different classifiers.

Four of them are decision tree induction algorithms:

Oblique Classifier 1 (OC1) [9] (oblique trees),

Functional Trees (FT) [13] (logistic regression trees),

Classification and Regression Trees (CART) [7]

(univariate trees) and C4.5 [8] (univariate trees). The

fifth classifier, SMO [12], is an implementation of the

SVM method, which is the state-of-the-art algorithm for classifying gene expression data [17].

Table I
SUMMARY OF THE GENE EXPRESSION DATASETS

Id  Dataset         Type  #Instances  #Classes  #Genes
1   alizadeh-v1     cDNA      42         2       1095
2   alizadeh-v2     cDNA      62         3       2093
3   alizadeh-v3     cDNA      62         4       2093
4   bittner         cDNA      38         2       2201
5   bredel          cDNA      50         3       1739
6   chen            cDNA     180         2         85
7   garber          cDNA      66         4       4553
8   khan            cDNA      83         4       1069
9   lapointe-v1     cDNA      69         3       1625
10  lapointe-v2     cDNA     110         4       2496
11  liang           cDNA      37         3       1411
12  risinger        cDNA      42         4       1771
13  tomlins-v1      cDNA     104         5       2315
14  tomlins-v2      cDNA      92         4       1288
15  armstrong-v1    Affy      72         2       1081
16  armstrong-v2    Affy      72         3       2194
17  bhattacharjee   Affy     203         5       1543
18  chowdary        Affy     104         2        182
19  dyrskjot        Affy      40         3       1203
20  golub-v1        Affy      72         2       1877
21  golub-v2        Affy      72         3       1877
22  gordon          Affy     181         2       1626
23  laiho           Affy      37         2       2202
24  nutt-v1         Affy      50         4       1377
25  nutt-v2         Affy      28         2       1070
26  nutt-v3         Affy      22         2       1152
27  pomeroy-v1      Affy      34         2        857
28  pomeroy-v2      Affy      42         5       1379
29  ramaswamy       Affy     190        14       1363
30  shipp           Affy      77         2        798
31  singh           Affy     102         2        339
32  su              Affy     174        10       1571
33  west            Affy      49         2       1198
34  yeoh-v1         Affy     248         2       2526
35  yeoh-v2         Affy     248         6       2526

We make use of the following implementations: (i) OC1 - publicly available code at the authors' website1, and (ii) FT, CART, C4.5 and SMO - publicly available Java implementations found within the Weka toolkit [18]. The parameters used were the default

ones. Each of the six classifiers was evaluated by its

generalization capability (accuracy rates), which was

estimated using 10-fold cross-validation.
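This protocol can be reproduced with off-the-shelf tools; a hedged sketch with scikit-learn stand-ins follows (SVC for SMO and DecisionTreeClassifier for a CART-style tree; OC1, FT and BUTIA itself have no scikit-learn counterparts, and the synthetic data merely mimics the many-attributes/few-instances shape of microarray sets):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for a gene expression dataset (few instances, many attributes).
X, y = make_classification(n_samples=60, n_features=500, n_informative=20,
                           n_classes=3, random_state=0)

classifiers = {
    "SVC (stand-in for SMO)": SVC(kernel="linear"),
    "CART-style tree": DecisionTreeClassifier(random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean():.2f} ± {scores.std():.2f}")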

In order to provide some reassurance about the

validity and non-randomness of the obtained results,

we present the results of statistical tests by following

the approach proposed by Demsar [19]. In brief, this

approach seeks to compare multiple algorithms on

multiple datasets, and it is based on the use of the

Friedman test with a corresponding post-hoc test. The

Friedman test is a non-parametric counterpart of the

well-known ANOVA. If the null hypothesis, which

states that the classifiers under study present similar

performances, is rejected, then we proceed with the

Nemenyi post-hoc test for pairwise comparisons.
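A sketch of this statistical protocol, using SciPy's friedmanchisquare and the Nemenyi critical difference from Demsar [19]; the accuracy matrix below is random toy data, not the values of Table II:

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# acc[i, j] = accuracy of classifier j on dataset i; toy stand-in for the
# 35 x 6 matrix of Table II.
rng = np.random.default_rng(1)
acc = rng.uniform(0.5, 1.0, size=(35, 6))

stat, p = friedmanchisquare(*acc.T)               # one sample per classifier
print(f"Friedman statistic = {stat:.2f}, p = {p:.3g}")

# Nemenyi post-hoc: two classifiers differ significantly when their average
# ranks differ by more than CD = q_alpha * sqrt(k (k + 1) / (6 N)).
N, k = acc.shape
avg_ranks = rankdata(-acc, axis=1).mean(axis=0)   # rank 1 = most accurate
q_alpha = 2.850                                   # q_0.05 for k = 6 (Demsar, 2006)
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print("average ranks:", np.round(avg_ranks, 2), "critical difference:", round(cd, 2))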

The experimental results are summarized in Table II.

The average accuracies obtained in the 10-fold cross-validation are presented along with the corresponding standard deviations. Additionally, the number of times each method was within the top-three best accuracies is shown in the bottom part of the table.

1 http://www.cbcb.umd.edu/salzberg/announce-oc1.html

Table II
ACCURACY ANALYSIS OF THE 35 DATASETS (AVERAGE ± S.D.)

Id  BUTIA       OC1         FT          CART        J48         SMO
1   0.94±0.14   0.72±0.19   0.89±0.15   0.69±0.26   0.69±0.20   0.94±0.14
2   1.00±0.00   0.87±0.12   0.99±0.05   0.90±0.13   0.89±0.14   1.00±0.00
3   0.94±0.10   0.66±0.20   0.82±0.14   0.71±0.10   0.70±0.16   0.94±0.08
4   0.92±0.14   0.56±0.23   0.81±0.18   0.58±0.14   0.56±0.23   0.88±0.13
5   0.86±0.13   0.76±0.08   0.78±0.11   0.76±0.13   0.84±0.13   0.86±0.10
6   0.84±0.08   0.82±0.07   0.94±0.04   0.85±0.06   0.84±0.06   0.94±0.07
7   0.85±0.12   0.74±0.13   0.82±0.13   0.76±0.12   0.80±0.10   0.80±0.14
8   0.98±0.05   0.79±0.14   1.00±0.00   0.83±0.11   0.87±0.09   0.99±0.04
9   0.88±0.12   0.80±0.07   0.76±0.12   0.77±0.07   0.72±0.16   0.85±0.15
10  0.86±0.09   0.66±0.19   0.84±0.09   0.65±0.08   0.63±0.18   0.85±0.08
11  1.00±0.00   0.68±0.28   0.93±0.12   0.76±0.25   0.79±0.20   0.98±0.08
12  0.74±0.15   0.57±0.18   0.78±0.18   0.53±0.25   0.45±0.24   0.81±0.16
13  0.90±0.11   0.61±0.11   0.87±0.12   0.54±0.15   0.55±0.10   0.93±0.11
14  0.90±0.08   0.53±0.22   0.82±0.14   0.59±0.13   0.56±0.17   0.91±0.07
15  0.99±0.05   0.71±0.38   0.96±0.07   0.91±0.09   0.90±0.07   0.99±0.05
16  0.97±0.06   0.74±0.11   0.90±0.10   0.75±0.15   0.76±0.09   0.96±0.07
17  0.97±0.03   0.92±0.06   0.97±0.03   0.89±0.09   0.91±0.08   0.96±0.05
18  0.92±0.08   0.39±0.51   0.95±0.05   0.97±0.05   0.93±0.07   0.96±0.05
19  0.85±0.24   0.68±0.17   0.65±0.24   0.75±0.17   0.73±0.25   0.90±0.17
20  0.93±0.07   0.83±0.13   0.89±0.11   0.85±0.14   0.86±0.13   0.97±0.06
21  0.94±0.07   0.81±0.09   0.94±0.07   0.92±0.07   0.96±0.07   0.94±0.07
22  0.99±0.02   0.96±0.02   0.98±0.03   0.95±0.02   0.95±0.04   0.99±0.02
23  0.88±0.16   0.83±0.22   0.80±0.21   0.76±0.18   0.89±0.14   0.97±0.11
24  0.56±0.18   0.24±0.31   0.66±0.21   0.58±0.18   0.56±0.21   0.72±0.25
25  0.57±0.21   0.80±0.22   0.37±0.07   0.85±0.20   0.82±0.20   0.93±0.14
26  0.73±0.34   0.73±0.24   0.68±0.23   0.63±0.20   0.60±0.33   1.00±0.00
27  0.89±0.14   0.57±0.50   0.79±0.15   0.73±0.19   0.73±0.29   0.92±0.14
28  0.77±0.15   0.63±0.18   0.79±0.18   0.53±0.17   0.59±0.17   0.79±0.21
29  0.62±0.08   0.36±0.27   0.65±0.11   0.57±0.08   0.62±0.12   0.72±0.08
30  0.89±0.10   0.71±0.10   0.80±0.14   0.78±0.11   0.81±0.13   0.93±0.09
31  0.77±0.07   0.76±0.11   0.89±0.10   0.74±0.10   0.82±0.10   0.92±0.06
32  0.93±0.05   0.65±0.15   0.89±0.09   0.76±0.12   0.81±0.10   0.90±0.05
33  0.84±0.18   0.68±0.36   0.90±0.14   0.78±0.22   0.86±0.10   0.86±0.16
34  0.98±0.04   0.99±0.03   0.99±0.03   0.99±0.03   0.99±0.02   0.98±0.03
35  0.87±0.04   0.75±0.10   0.85±0.06   0.76±0.10   0.70±0.10   0.84±0.08

#1      13          0           5           1           2          19
#2       9          2          12           2           2          12
#3       6          4          11           6           5           3

Observing Table II, BUTIA and SMO achieved the best overall performances. The ranking provided by

the Friedman test supports this assumption, showing

SMO as the best-ranked method, followed by BUTIA,

FT, C4.5, CART and OC1. The Friedman test also

indicates the rejection of the null-hypothesis, i.e.,

there is a statistically significant difference among

the algorithms (p-value = $3.49 \times 10^{-21}$). Hence, we

have executed the Nemenyi post-hoc test for pairwise

comparison, as depicted in Table III. Notice that

SMO outperforms all algorithms but BUTIA with

statistical significance. BUTIA and FT, on the other

hand, outperform C4.5, CART and OC1 with statistical

significance, but one does not outperform the other.

Roughly speaking, BUTIA is a better option than FT

because it is not outperformed by SMO significantly.

Moreover, the trees generated by FT (whose nodes, both internal and leaves, hold logistic regression models) are harder to interpret than the oblique trees provided by BUTIA, which is a clear advantage

of our method. BUTIA seems to be an interesting

alternative to SMO since it is more comprehensible

than standard SVM. Recall that, in an oblique tree,

one can follow the linear models through the internal

nodes until reaching a class-label in a leaf node. This

path from the root to the leaves can be seamlessly

transformed into a set of interpretable rules. In FT, both

internal and leaf nodes can hold logistic regression

models, and leaves can hold more than one model

for characterizing distinct classes. This collection of

models can considerably harm interpretability. SVM,

in turn, when trained with a linear kernel, is only

interpretable in binary-class problems (it becomes

equivalent to a regular linear model). But in problems

with more than two classes, SVM has to test the

results in pairwise class combinations, solving the final

classification after a majority voting system. This kind

of system also harms model comprehensibility, and it

is one of the reasons why SVM is considered by many

as a black-box technique [20].
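To illustrate, the sketch below (reusing the hypothetical Node/SVC structure and the root built in the Section II sketch) traces one instance down the tree, printing each internal test in the form w0 + sum_i wi*xi <= 0 followed by the leaf's class:

def path_to_rules(root, x):
    # Emit the conjunction of linear tests satisfied by instance x.
    rules, node = [], root
    while node.svm is not None:                       # internal node
        w, b = node.svm.coef_[0], node.svm.intercept_[0]
        side = node.svm.decision_function(x.reshape(1, -1))[0]
        test = " + ".join(f"({wi:.2f})*x{i}" for i, wi in enumerate(w))
        rules.append(f"{b:.2f} + {test} {'<= 0' if side <= 0 else '> 0'}")
        node = node.left if side <= 0 else node.right
    rules.append(f"=> class {node.label}")
    return rules

print("\n".join(path_to_rules(root, np.array([1.9, 2.1]))))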

In summary, BUTIA was shown to be competitive with the state-of-the-art technique for gene expression

analysis, SVM, with the further advantage of being a

comprehensible model. The advantages of presenting

a comprehensible model to the domain specialist (in

this case, the biologist) are well-described in [21], and

they are easily generalizable for gene expression data.

Table III
NEMENYI PAIRWISE COMPARISON RESULTS

        BUTIA  C4.5  OC1  SMO  FT  CART
BUTIA     -
C4.5      N     -         N    N
OC1       N           -   N    N
SMO                        -
FT                         N   -
CART      N                N   N     -

A mark (N) indicates that the method in the column outperforms the one in the row with statistical significance at a 95% confidence level.

IV. RELATED WORK

There are many works that propose top-down

induction of oblique decision trees in the literature,

and we briefly review some of them as follows.

Classification and Regression Trees (CART) [7] is

one of the first systems that allowed multivariate

splits. It employs a hill-climbing strategy with

a backward attribute elimination for finding

good (albeit suboptimal) linear combinations

of attributes in non-terminal nodes. It is a

fully-deterministic algorithm with no built-in

mechanisms to escape local-optima.

Simulated Annealing of Decision Trees (SADT) [3] is a system that employs simulated annealing (SA)

for finding good coefficient values for attributes in

non-terminal nodes of decision trees. First, it places a

hyperplane in a canonical location, and then iteratively

perturbs the coefficients in small random amounts

guided by the SA algorithm. Although SADT can

eventually escape from local-optima, its efficiency is

compromised since it may consider tens of thousands

of hyperplanes in a single node during annealing.

Oblique Classifier 1 (OC1) [9] is yet another

top-down oblique decision tree system. It is a thorough

extension of CART's oblique decision tree strategy.

OC1 presents the advantage of being more efficient

than the previously described systems. It searches for

the best univariate split as well as the best oblique

split, and it only employs the oblique split when

it improves over the univariate split. It uses both a

deterministic heuristic search (as employed in CART)

for finding local-optima and a non-deterministic search

(as employed in SADT - though not SA) for escaping

local-optima. Ittner [22] proposes using OC1 over

an augmented attribute space, generating non-linear

decision trees. The key idea involved is to build new

attributes by considering all possible pairwise products

and squares of the original set of n attributes.
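A sketch of that augmentation: scikit-learn's PolynomialFeatures with degree 2 produces exactly the original attributes plus all squares and pairwise products (the three-attribute instance below is a toy example):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0]])          # n = 3 original attributes
aug = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
# Columns: x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
print(aug)   # [[1. 2. 3. 1. 2. 3. 4. 6. 9.]]

An axis-parallel or oblique split in the augmented space then corresponds to a quadratic decision surface in the original attribute space.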

A more recent approach for top-down oblique trees

that also employs SVM for generating hyperplanes is

SVM-ODT [23]. It is an extension to the original C4.5

algorithm, allowing the use of either a univariate split or an SVM-generated oblique split, according to the impurity measure values. Several methods are applied

to avoid model overfitting, such as reduced-error

pruning and MDL computation, resulting in an

overly-complex algorithm. Model overfitting is directly

related to the very nature of the top-down strategy.

Another recently explored strategy for inducing

oblique decision trees is through evolutionary

algorithms (EAs). The interested reader can refer

to [24] for works that employ EAs for generating the

hyperplanes in top-down oblique tree inducers.

Bottom-up induction of decision trees remains largely unexplored in the research community. The first

work to present the concepts of bottom-up induction

is Landeweerd et al. [10], where the authors propose

growing a decision tree from the leaves to the root,

assuming that each class is represented by a leaf

node, and that the most similar nodes are recursively

united into a parent node. This approach is too

simplistic, because it allows only a single leaf per class,

which means that binary-class problems will always be

modeled by a 3-node tree. We believe this deficiency is

the main reason researchers have been discouraged from further investigating the bottom-up induction of decision trees. The bottom-up direction has also been explored for pruning decision trees [25], or for evaluating

Alternating Decision Trees (ADTree) [26], topics that

are not investigated in this paper. Our approach,

presented in Section II, intends to expand the work

of Landeweerd et al. [10] in such a way that it can

be effectively applied to distinct problem domains.

Particularly, we solve the one-leaf-per-class problem by

performing clustering in each class of the problem, and

we solve the hyperplane generation task by employing

SVMs in each internal node.

V. CONCLUSIONS AND FUTURE WORK

In this work, we have presented a novel bottom-up

oblique decision tree induction algorithm, named

BUTIA, which makes use of well-known machine

learning algorithms, namely EM [14] and SMO [12].

Due to its bottom-up strategy for building the

tree, BUTIA presents some interesting advantages

over other top-down algorithms, such as robustness

to imbalanced data and to data overfitting. Indeed,

BUTIA does not require the further execution of a

pruning procedure, unlike virtually

every top-down decision tree algorithm. The use

of SVM-generated hyperplanes within internal nodes

guarantees that each hyperplane maximizes the

boundary margin between instances from different

classes, i.e., it guarantees convergence to the global

optimum solution per node division. This is not

true for other optimization techniques employed in

algorithms such as OC1 [9], CART [7] and SADT [3],

which do not guarantee convergence to global optima.

We have tested BUTIA in 35 gene expression

benchmarking datasets [15]. Experimental results

indicated that BUTIA outperformed traditional

algorithms, such as C4.5 [8], CART [7] and OC1 [9],

with statistical significance regarding accuracy,

according to the Friedman and Nemenyi tests, as

recommended in [19]. BUTIA is competitive with SMO

[12], since there was no significant difference between

both methods regarding accuracy. Since BUTIA is just

as interpretable as any oblique decision tree, it can be

seen as an interpretable alternative to standard SVM

(a technique known to be a black-box approach [20]),

which is currently considered to be the state-of-the-art

algorithm for classifying gene expression data [17].

This work has opened several avenues for future

research, as follows. We plan to investigate the

impact of using different clustering algorithms for

the generation of the leaves, as well as different

strategies for automatically defining the number of

clusters (leaves) for each class. In addition, we intend

to test different methods for generating the separating

hyperplanes, such as Fisher's linear discriminant analysis (FLDA). The reason we did not employ FLDA in the first place was its restriction to dealing exclusively with linearly-separable problems.

Nevertheless, we believe that in certain applications,

in which a low computational cost is required,

FLDA may be a useful replacement for SVM.
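A sketch of that substitution, fitting both hyperplane generators on made-up two-class data (scikit-learn stand-ins; not part of BUTIA as evaluated here):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(2, 1, (30, 5))])
y = np.r_[np.zeros(30), np.ones(30)]

for model in (SVC(kernel="linear"), LinearDiscriminantAnalysis()):
    model.fit(X, y)
    # Both expose the hyperplane via coef_ / intercept_, so either could
    # generate a node's test; FLDA is fitted in closed form and is cheaper,
    # while the SVM maximizes the separation margin.
    print(type(model).__name__, "accuracy:", model.score(X, y))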

Finally, it is interesting to investigate the performance

of BUTIA in more complex problems, such as

hierarchical multilabel classification [27].

ACKNOWLEDGEMENT

Our thanks to Brazilian research agencies CAPES,

CNPq and FAPESP for supporting this research.

REFERENCES

[1] T. M. Mitchell, Machine Learning.

McGraw-Hill, 1997.

[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.

[3] D. Heath, S. Kasif, and S. Salzberg, Induction of

oblique decision trees, J ARTIF INTELL RES, vol. 2,

pp. 1–32, 1993.

[4] S. K. Murthy, S. Kasif, and S. S. Salzberg, A System for

Induction of Oblique Decision Trees, J ARTIF INTELL

RES, vol. 2, pp. 1–32, 1994.

[5] L. Rokach and O. Maimon, Top-down induction of

decision trees classifiers - a survey, IEEE T SYST MAN

CY C, vol. 35, no. 4, pp. 476–487, 2005.

[6] P. Utgoff and C. Brodley, An incremental method for

finding multivariate splits for decision trees, in 7th Int.

Conf. on Machine Learning, 1990, pp. 58–65.

[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone,

Classification and Regression Trees. Wadsworth, 1984.

[8] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann, 1993.

[9] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, OC1: A Randomized Induction of Oblique Decision Trees, in AAAI, 1993, pp. 322–327.

[10] G. Landeweerd, T. Timmers, E. Gelsema, M. Bins,

and M. Halie, Binary tree versus single level tree

classification of white blood cells, PATTERN RECOGN,

vol. 16, no. 6, pp. 571–577, 1983.

[11] C. Cortes and V. Vapnik, Support-vector networks,

MACH LEARN, vol. 20, pp. 273–297, 1995.

[12] J. C. Platt, Fast training of support vector machines using

sequential minimal optimization. Cambridge, MA, USA:

MIT Press, 1999, pp. 185–208.

[13] J. Gama, Functional trees, MACH LEARN, vol. 55, pp.

219–250, 2004.

[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, J R STAT SOC B, vol. 39, pp. 1–38, 1977.

[15] M. de Souto, I. Costa, D. de Araujo, T. Ludermir, and

A. Schliep, Clustering cancer gene expression data: a

comparative study, BMC BIOINF, vol. 9, p. 497, 2008.

[16] A. L. Tarca, R. Romero, and S. Draghici, Analysis of

microarray experiments of gene expression profiling,

AM J OBSTET GYNECOL, vol. 195, pp. 373–388, 2006.

[17] A. Statnikov, L. Wang, and C. Aliferis, A

comprehensive comparison of random forests and

support vector machines for microarray-based cancer

classification, BMC BIOINF, vol. 9, no. 1, p. 319, 2008.

[18] I. H. Witten and E. Frank, Data Mining: Practical Machine

Learning Tools and Techniques with Java Implementations.

Morgan Kaufmann, October 1999.

[19] J. Demsar, Statistical comparisons of classifiers over

multiple data sets, J MACH LEARN RES, vol. 7, pp.

1–30, 2006.

[20] A. Navia-Vazquez and E. Parrado-Hernandez, Support

vector machine interpretation, NEUROCOMP, vol. 69,

no. 13-15, pp. 1754–1759, 2006.

[21] A. A. Freitas, D. C. Wieser, and R. Apweiler, On the

importance of comprehensible classification models for

protein function prediction, IEEE ACM T COMPUT BI,

vol. 7, pp. 172–182, January 2010.

[22] A. Ittner, Non-linear decision trees-NDT, in 13th Int.

Conf. on Machine Learning, 1996, pp. 1–6.

[23] V. Menkovski, I. Christou, and S. Efremidis, Oblique

decision trees using embedded support vector machines

in classifier ensembles, in 7th IEEE Int. Conf. on

Cybernetic Intelligent Systems, Sept. 2008, pp. 1–6.

[24] R. C. Barros, M. P. Basgalupp, A. C. P. L. F. Carvalho,

and A. A. Freitas, A survey of evolutionary algorithms

for decision tree induction, To appear in IEEE T SYST

MAN CY C, pp. 1–21, 2011.

[25] F. Esposito, D. Malerba, and G. Semeraro, A

Comparative Analysis of Methods for Pruning Decision

Trees, IEEE T PATTERN ANAL, vol. 19, no. 5, pp.

476–491, 1997.

[26] B. Yang, T. Wang, D. Yang, and L. Chang, BOAI:

Fast Alternating Decision Tree Induction Based on

Bottom-Up Evaluation, in LNCS. Springer, 2008, pp.

405–416.

[27] R. Cerri and A. C. P. L. F. Carvalho, Hierarchical

multilabel classification using top-down label

combination and artificial neural networks, in XI

SBRN, 2010, pp. 253–258.
