
Pattern Recognition Letters 60–61 (2015) 57–64


Sparse alternating decision tree ✩


Hong Kuan Sok∗, Melanie Po-Leen Ooi, Ye Chow Kuang
Monash University Malaysia, Jalan Lagoon Selatan, Bandar Sunway 46150, Selangor D.E., Malaysia

Article history: Received 20 May 2014; available online 17 March 2015.

Keywords: Alternating decision tree; Decision tree; Boosting; Sparse discriminant analysis; Feature selection

Abstract

Alternating decision tree (ADTree) is a special decision tree representation that brings interpretability to boosting, a well-established ensemble algorithm, and it has found success in a wide range of applications. However, existing variants of ADTree implement univariate decision nodes, so potential interactions between features are ignored. To date, there has been no multivariate ADTree. We propose a sparse version of the multivariate ADTree such that it remains comprehensible. The proposed sparse ADTree is empirically tested on UCI datasets as well as spectral datasets from the University of Eastern Finland (UEF). We show that sparse ADTree is competitive against both univariate decision trees (original ADTree, C4.5, and CART) and multivariate decision trees (Fisher's decision tree and a single multivariate decision tree from oblique Random Forest). It achieves the best average rank in terms of prediction accuracy, second best in terms of decision tree size, and a faster induction time than the existing ADTree. In addition, it performs especially well on datasets with correlated features such as the UEF spectral datasets. Thus, the proposed sparse ADTree extends the applicability of ADTree to a wider variety of applications.

© 2015 Elsevier B.V. All rights reserved.

✩ This paper has been recommended for acceptance by F. Tortorella.
∗ Corresponding author. Tel.: +6 035 514 6238; fax: +6 035 514 6207.
E-mail address: sok.hong.kuan@monash.edu, sok1002@hotmail.com (H.K. Sok).
http://dx.doi.org/10.1016/j.patrec.2015.03.002

1. Introduction

Boosting is one of the most significant advances in machine learning research. Its underlying concept is to boost many "weak" base classifiers into an arbitrarily accurate classification model. Boosting initially weights every training sample equally. A weak learner is fitted to the training dataset based on this weight distribution. After each boosting cycle, the weights are updated such that correctly classified samples are weighted lower while incorrectly classified samples are weighted higher. The next boosting cycle uses the new weight distribution to train the weak classifier. A linear combination of these weak classifiers forms the output of the classifier.

Decision trees have been a popular choice of weak learner for boosting and have been shown to achieve good classification performance. Unfortunately, the resulting ensembles are often large, complex and hard to interpret [16]. This also applies to bagged decision trees such as Random Forest [4]. This issue with decision tree ensembles led to the invention of the alternating decision tree (ADTree), which brings interpretability to the boosting paradigm [16]. Rather than building a decision tree at each boosting cycle, a simple decision stump is used instead. This decision stump consists of a decision node and two prediction nodes. These one-level decision trees are then arranged in a special decision tree representation with subtle differences compared to classical decision trees such as C4.5 [30] and CART [5]. The ADTree has been successfully applied in various domains such as genetic disorders [19], corporate performance prediction [8], management systems [9], DNA microarrays [32], automated trading [10], and modeling of disease trait information [25].

Existing variants of ADTree implement univariate decision stumps such that the decision node splits according to the value of a single feature [11,16,23,24]. This is effectively an axis-parallel partitioning method that divides a selected input feature into two disjoint ranges for discriminating purposes. In this way, feature selection is performed implicitly as the "best" feature (axial direction) is selected. The use of univariate decision nodes simplifies interpretation since important features are encoded within the alternating decision tree hierarchy.

Clearly, the choice of univariate decision nodes can be limiting in some cases, and may result in a large decision tree that complicates its comprehensibility [6]. This is because univariate decision nodes ignore potentially important interactions between features. By replacing the decision node with a multivariate variant, better accuracy and a smaller tree size can be achieved [3]. However, there is no reported literature on a multivariate variant of ADTree.

For multivariate decision nodes, feature selection has to be incorporated explicitly. The number of features in a decision node is often a measure of decision node complexity, and hence multivariate decision trees are often criticized for a loss of interpretability. This is a particularly challenging problem for the ADTree, which is made up of several weak classifiers. In order to induce a multivariate variant of ADTree that remains comprehensible, we propose to find a sparse set of features in each decision node to discriminate between the different classes. We achieve this by applying sparse linear discriminant analysis (SLDA) [7]. However, sparse discriminant analysis cannot be applied directly under the boosting paradigm because it is unable to adapt to the reweighting of the training dataset. Therefore, we propose a simple modification of sparse discriminant analysis to allow for this.



Fig. 1. (a) General decision tree (e.g. C4.5 and CART); (b) univariate ADTree, where all dotted arrows under a prediction node are evaluated; (c) sparse ADTree with a multivariate decision node; (d) axis-parallel input space partitioning based on the univariate ADTree in (b); (e) input space partitioning based on the sparse ADTree in (c). The values inside the partitions in both (d) and (e) indicate the summed prediction values.

The proposed sparse ADTree algorithm is empirically tested on public datasets from the University of California, Irvine (UCI) Machine Learning Repository [15] as well as spectral datasets from the University of Eastern Finland (UEF) [34]. We benchmarked our algorithm against both univariate (existing ADTree, C4.5 and CART) and multivariate (Fisher's decision tree and a single multivariate oblique Random Forest) trees and SLDA. We show that the sparse ADTree achieves the best average rank in terms of prediction accuracy, second best in terms of decision tree size, and a faster induction time than the existing ADTree. We also show that it performs especially well on datasets with highly correlated features such as the spectral datasets from UEF.

The remainder of this paper is organized as follows. Section 2 provides a detailed background on general decision trees and ADTree. Section 3 describes the proposed sparse ADTree in full detail, including the modifications needed to fit into the boosting framework. Experimental results are presented with further analysis in Section 4, and the paper concludes in Section 5.

2. Background

In a typical supervised learning framework, a training dataset D consisting of N labeled samples is used for learning. The goal is to learn a classification model F(x) that generalizes to new unseen samples. Each sample x is a p-dimensional vector with label y of either −1 or +1 for the binary class problem. We address only binary class problems to achieve clarity in presentation. Multiclass problems with K classes can be handled by the mapping introduced by Allwein et al. [1] and Dietterich [13].

2.1. Decision trees

A decision tree is a classifier represented by a directed acyclic graph with nodes and edges. It starts with a root node at the top of the model (see Fig. 1(a)). Internal nodes are decision nodes and terminal nodes are leaf nodes. Each decision node implements a decision function f(x), and every edge branching out from it corresponds to a decision outcome. Each leaf node stores a class label. Top-down recursive partitioning is a common approach to split the input space populated by the training dataset, which results in a decision tree model.

Depending on the decision function, decision trees are generally divided into univariate and multivariate variants. Univariate decision trees split on an individual feature at each decision node, while multivariate decision trees split on a feature vector (a linear combination is a common approach). Classical univariate decision trees are ID3, C4.5 [30] and CART [5].

CART-LC, a multivariate version of CART, was also proposed in the pioneering work of Breiman et al. [5] in 1984 to induce oblique decision trees, where a deterministic hill-climbing heuristic with backward feature elimination is implemented to learn the multivariate decision node of the form (1), parameterized by β,

f(x) = xᵀβ > 0.    (1)

OC1 [29] was an improvement of CART-LC with two forms of randomization to avoid the local optimality of hill-climbing, and standard feature selection methods can be applied in an ad hoc manner. In [13], LMDT [6] was proposed to estimate the parameters in an iterative approach that minimizes the misclassification cost. Discriminant analysis has been a popular analytic approach in recent years to optimize the parameters. These works include LDT [36] and Fisher's decision tree [26]. Our work follows this discriminant approach.

2.2. Alternating decision tree

ADTree is a generalization of classical decision trees. It consists of alternating layers of decision nodes and prediction nodes (see Fig. 1(b)). Each prediction node contains a real-valued prediction value. For instance, the decision tree in Fig. 1(a) can be represented as the ADTree in Fig. 1(b), excluding the decision stump highlighted with a circle. The input space partitioning due to Fig. 1(b) is shown in Fig. 1(d). If a given test sample (x1, x2) is (0.4, 0), the decision tree in Fig. 1(a) will classify it as class 2 following the right-most path, and its counterpart ADTree (without the highlighted decision stump) returns sign(+0.2 − 0.3 − 0.4 = −0.5) = −1 (class 2) by summing all the traversed prediction values. The magnitude reflects the confidence margin of the prediction and the sign makes the class prediction. The evaluation simply becomes sign(+0.2 − 0.3 − 0.4 − 0.4 = −0.9) = −1 if the highlighted decision stump is taken into account as well. A higher confidence is achieved in this case (a magnitude of 0.9 instead of 0.5).

The unique representation of ADTree allows multiple decision stumps under the same prediction node, as illustrated in Fig. 1(b) where the additional decision stump highlighted with a circle can be added. This is how boosting is supported within the specifically designed ADTree representation to improve prediction accuracy.
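To make the evaluation rule concrete, the short Python sketch below sums the prediction values along all satisfied paths of a toy ADTree and returns the signed score. The node values and thresholds are hypothetical, chosen only so that the sample (0.4, 0) reproduces the traversed values +0.2, −0.3 and −0.4 of the worked example above; this is an illustration, not the authors' implementation.

```python
# Minimal sketch of ADTree evaluation: the score is the root prediction value
# plus, for every decision rule whose precondition holds, either alpha_plus or
# alpha_minus depending on the rule's condition ("else 0" when not reached).
# The class is the sign of the score; its magnitude is the confidence margin.

def evaluate_adtree(x, root_value, rules):
    """x: feature vector; rules: list of (precondition, condition, a_plus, a_minus)."""
    score = root_value
    for precondition, condition, a_plus, a_minus in rules:
        if precondition(x):                      # rule is reached
            score += a_plus if condition(x) else a_minus
    return score

# Hypothetical two-stump tree over features x = (x1, x2).
rules = [
    (lambda x: True,        lambda x: x[0] < 0.3, +0.1, -0.3),  # stump under the root
    (lambda x: x[0] >= 0.3, lambda x: x[1] > 0.5, +0.5, -0.4),  # stump deeper in the tree
]

score = evaluate_adtree((0.4, 0.0), root_value=+0.2, rules=rules)
print(score)   # -0.5 -> sign = -1 (class 2), margin = 0.5
```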

The ADTree model can be described mathematically as a sum of decision rules, as shown in (2). Each decision rule r(x) consists of a precondition, a condition and the real-valued prediction values α+, α− and 0, as shown in (3):

F(x) = Σ_{t=0}^{T} r_t(x),    (2)

if (precondition) then [if (condition) then α+ else α−] else 0.    (3)

Algorithm 1 shows AdaBoost, one of the earliest adaptive boosting implementations [17], used to induce an ADTree model from the training dataset.

Algorithm 1: ADTree learning with AdaBoost
Input: D and T
1. Initialization
   1.1. Set w_{i,t=0} = 1 for i = 1, ..., N and P_{t=1} = {true}
   1.2. Learn the first decision rule r_0(x):
        if (true) then [if (true) then α_0 = (1/2) ln(W_+(true)/W_−(true)) else 0] else 0
   1.3. Update w_{i,t=1} = w_{i,t=0} exp(−r_0(x_i) y_i)
2. Repeat for boosting cycle t = 1 : T
   2.1. For each precondition c_1 ∈ P_t and each condition c_2 ∈ C, evaluate
        Z(c_1, c_2) = 2(√(W_+(c_1 ∧ c_2) W_−(c_1 ∧ c_2)) + √(W_+(c_1 ∧ ¬c_2) W_−(c_1 ∧ ¬c_2))) + W(¬c_1)
   2.2. Calculate α_t^+ and α_t^− for the selected c_1* and c_2* that minimize Z, with
        α_t^+ = (1/2) ln(W_+(c_1* ∧ c_2*)/W_−(c_1* ∧ c_2*)),  α_t^− = (1/2) ln(W_+(c_1* ∧ ¬c_2*)/W_−(c_1* ∧ ¬c_2*))
   2.3. Update P_{t+1} : P_t ∪ {c_1* ∧ c_2*, c_1* ∧ ¬c_2*}
   2.4. Update w_{i,t+1} = w_{i,t} exp(−r_t(x_i) y_i)
Output: F(x) = Σ_{t=0}^{T} r_t(x)

In step 1.1, a uniform initial weight is assigned to all samples. Step 1.2 learns the root decision rule, which describes the root prediction node α_0 given the equally weighted training dataset. The total weights of the positive or negative samples that satisfy condition c are denoted as W_+(c) or W_−(c), respectively. After evaluating this root decision rule, the weight distribution is updated in step 1.3. The rest of the ADTree is grown in step 2, where T decision stumps are added sequentially, one per boosting cycle. Each decision rule explicitly describes where the decision stump is added to the ADTree as well as the decision stump itself.

A precondition set P is maintained throughout the induction to keep track of the partitioned input space, and this set expands to include two new partitions at the end of every boosting cycle. Each precondition corresponds to a prediction node. A base condition set C is required at the beginning of every boosting cycle, and one of its members will be selected to split one of the members of P into two new partitions. This implies that it is possible to have more than one decision stump under the same prediction node, which is a subtle difference compared to classical decision trees.

The splitting criterion Z (step 2.1) plays the role of finding the best combination of precondition and base condition to expand the ADTree. It measures the weighted error of the base condition c_2 under precondition c_1 (first term) and the total weight of the training samples that do not belong to c_1 (second term). The combination with the lowest error is selected and the corresponding prediction values are calculated (step 2.2). This is how a decision rule is obtained. The precondition set is updated to include the two new partitions (step 2.3), and the weight distribution is also updated prior to the next boosting cycle (step 2.4).

Fig. 2. Flow diagram of the sparse ADTree learning framework. (a) Our SLDA framework (modified LARSEN with GCV model selection) to learn a sparse multivariate decision node. (b) Standard ADTree induction based on AdaBoost (Algorithm 1). (c) ADTree growing by adding a new decision stump to one of the available prediction nodes in each boosting cycle.

3. Sparse alternating decision tree

We propose a sparse variant of the multivariate ADTree in this paper. Each multivariate decision node takes the form f(x) = xᵀβ > 0, parameterized by the discriminative vector β. This vector is sparse if many of its elements are exactly zero. Features with zero coefficients are essentially "pruned" out of the decision node and play no role in the classification. For illustration purposes, Fig. 1(c) shows an example of a sparse ADTree with a single decision stump, and the corresponding input space partitioning is shown in Fig. 1(e).

A flow diagram of the sparse ADTree learning framework is presented in Fig. 2. Given the weighted training dataset at the beginning of every boosting cycle, the proposed sparse linear discriminant analysis (SLDA) framework (Fig. 2(a)) is used to learn the optimal β, which results in a single multivariate base condition for the set C. This base condition set C is then passed to the standard ADTree induction (Algorithm 1), as illustrated in Fig. 2(b), for ADTree growing (Fig. 2(c)). For the rest of this section, we describe SLDA in Section 3.1 and our SLDA framework in Section 3.2 to achieve sparse ADTree growing.

3.1. SLDA

Discriminant analysis is a well-developed theory and remains an active research area due to its theoretical as well as practical importance [27]. There are three popular formulations of discriminant analysis, namely the Bayesian formulation, Fisher's discriminant and Optimal Scoring. Optimal Scoring turns the classification setting of Fisher's discriminant into a regression setting, and the β of optimal scoring is equivalent to the β of Fisher's discriminant up to a factor [21].

In classical Fisher's discriminant analysis, discrimination by projection is performed to achieve both classification and data visualization in a lower dimension. There are K − 1 discriminative directions upon which the projected samples have maximal between-class covariance with respect to within-class covariance. For the binary class problem, this results in a single discriminative vector β. However, Fisher's discriminant is susceptible to the singularity problem if p ≫ N, and this issue often occurs as tree depth increases. To avoid this problem, we use a regularized Optimal Scoring formulation, which accommodates a regularization term seamlessly.

We implement the SLDA of Clemmensen et al. [7], in which Optimal Scoring is regularized with the Elastic Net penalty. Regularization plays an important role in learning algorithms to balance classification accuracy against model complexity. Elastic Net is a convex combination of two classical penalties known as Ridge and Lasso. The Ridge penalty penalizes the l2 norm of β to perform continuous shrinkage on β to prevent overfitting, and it groups correlated features [22]. The Lasso, or l1 norm penalty, on β performs both continuous shrinkage and feature selection by forcing some elements of β to be exactly zero [33]. Elastic Net enjoys the advantages of both worlds: it performs continuous shrinkage, feature grouping and feature selection, and it works well in p ≫ N cases.
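As a hedged illustration of the Elastic Net penalty just described (not of the LARSEN solver used in this paper), a generic coordinate-descent solver can be used on a toy regression problem. Note that scikit-learn's ElasticNet uses the parameterization alpha·l1_ratio·‖β‖₁ + 0.5·alpha·(1−l1_ratio)·‖β‖₂², so the (λ₁, λ₂) of this paper map onto (alpha, l1_ratio) only up to rescaling; the data below are synthetic.

```python
# Sketch: Elastic Net = convex mix of Ridge (l2) and Lasso (l1) penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
N, p = 100, 30
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)            # hypothetical sparse ground truth
beta_true[:3] = [2.0, -1.5, 1.0]   # only 3 of 30 features matter
y = X @ beta_true + 0.1 * rng.standard_normal(N)

# l1_ratio controls the Lasso/Ridge mix; alpha controls the overall strength.
model = ElasticNet(alpha=0.1, l1_ratio=0.7, fit_intercept=False, max_iter=10_000)
model.fit(X, y)

print("non-zero coefficients:", np.count_nonzero(model.coef_))
# Many coefficients are driven exactly to zero (Lasso effect), while the
# surviving ones are shrunk and correlated features tend to enter together
# (Ridge/grouping effect).
```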

Sparse discriminant analysis as presented by Clemmensen et al. [7] has two sparse variants: a linear variant (SLDA) and a mixture variant (SMDA). This paper is based on the SLDA variant, in which the optimal scoring formulation (4) is regularized using the Elastic Net penalty to incorporate a sparseness criterion into the problem formulation and thereby achieve feature selection:

min_{θ,β} (1/N) ‖Yθ − Xβ‖²₂  subject to θᵀD_π θ = 1, where D_π = YᵀY/N.    (4)

X is the N × p design matrix in which each row represents one training sample. The N × 1 categorical class label y is represented as the N × K class indicator matrix Y, and the K × 1 score vector θ is used to re-express the categorical vector y as the real-valued output Yθ. By applying the Elastic Net penalty to (4), the optimization problem becomes (5):

min_{θ,β} (1/N) ‖Yθ − Xβ‖²₂ + λ₂‖β‖²₂ + λ₁‖β‖₁  subject to θᵀD_π θ = 1.    (5)

Clemmensen et al. [7] use a simple iterative algorithm (Algorithm 2) to solve (5). In step 1, the trivial solution vector θ_0 of all 1's is formed and an initial guess of θ is initialized. Step 2 iterates between solving for β and θ until convergence or until the maximum number of iterations is reached. Step 2.1 solves for β using the LARSEN [37] algorithm while holding θ fixed, and step 2.2 calculates the ordinary least squares (OLS) solution of θ for fixed β.

Algorithm 2: SLDA for the binary class problem
1. Set the trivial solution θ_0 = 1 and initialize θ = (I − θ_0 θ_0ᵀ D_π) θ∗, where θ∗ is a random vector and θ is normalized such that θᵀD_π θ = 1
2. Iterate until convergence or until reaching the maximum number of iterations:
   2.1. Solve the Elastic Net problem (5) for β with θ fixed
   2.2. Obtain the OLS solution θ ∝ D_π^{−1} Yᵀ X β for fixed β, and ortho-normalize θ to make it orthogonal to θ_0

3.2. Sparse ADTree learning

The ability to adjust to the errors made by previously boosted base classifiers is the core principle of boosting. SLDA is not originally designed to adapt to a weighted training dataset. We propose a simple modification to LARSEN for finding the discriminative direction β that best discriminates the hard-to-classify samples carrying higher weight values as boosting proceeds.

LARSEN is a direct extension of least angle regression (LARS) [14] to solve the Elastic Net problem (5). LARS adopts an ingenious geometrical view of solving the least squares problem in which features are selected one at a time to enter the estimated model instead of using the standard OLS solution. It is less greedy and, with a simple modification, it can be used to solve the Lasso problem faster than competing methods. The Elastic Net problem is reformulated as the Lasso problem (6) using the augmented training dataset (X∗, Y∗) described in (7) in order to take advantage of the LARS computation:

min_β (1/N) ‖Y∗θ − X∗β‖²₂ + λ₁‖β‖₁,    (6)

X∗_{(N+p)×p} = (1 + λ₂)^{−1/2} [ X ; √λ₂ I ],  Y∗_{(N+p)} = [ Y ; 0 ],    (7)

where the blocks are stacked vertically. The Lasso penalizes the sum of the absolute β coefficients, ‖β‖₁, in solving the least squares problem. The amount of penalty depends on λ₁, and different values of λ₁ result in different β solutions. It was discovered that β is a piecewise linear function of λ₁. The entire family of β solutions, from the null solution (λ₁ = ∞) to the OLS solution (λ₁ = 0), is known as the regularization path. LARS is capable of solving the entire Lasso regularization path efficiently by finding the piecewise linear breakpoints of β(λ₁) analytically.

Algorithm 3 describes the LARSEN algorithm. Let the vector of the current estimate of Y∗θ be μ(k) = X∗β(k), the vector of the current residual be ε(k) = Y∗θ − μ(k), and the vector of current correlations be c(μ(k)) = X∗ᵀε(k). Let the active set A be the set of indices corresponding to the "active" features in the estimate, and let s_j = sign(c_j) for j ∈ A.

The regularization path computation starts with the initialization in step 1: set the index k = 0, the solution β(0) = 0, the active set A = ∅ and the inactive set A^c = {1, ..., p}. Step 2 computes the rest of the regularization path starting from the null vector β(0) initialized in step 1. Step 2.1 updates the residual vector. Step 2.2 finds the feature with maximal correlation to enter the estimate. Step 2.3 computes the equiangular vector u_A such that the current estimate moves along the direction of u_A for some specific length γ, i.e. μ(k+1) = μ(k) + γ u_A; 1_A in step 2.3 is a vector of 1's of length |A|. The length γ is solved analytically in step 2.4 such that the estimate moves along u_A until one inactive feature becomes equally important in the sense of correlation with the residual. However, the LARS solution is identical to the Lasso solution only if the sign of β matches the sign of the correlation vector of the active features. Step 2.5 detects whether this sign restriction is violated. If it is, the regularization path is stopped earlier at length γ∗ and the corresponding feature is dropped from A. Otherwise, it proceeds as normal. LARSEN solves the inversion of the Gram matrix G_A with Cholesky factorization for fast computation.

Algorithm 3: LARSEN algorithm
Input: Y∗, X∗
1. Set k = 0, β(0) = 0, the active set A = ∅ and the inactive set A^c = {1, ..., p}
2. Repeat until the inactive set A^c is empty:
   2.1. Update the residual ε(k) = Y∗θ − μ(k), where μ(k) = X∗β(k)
   2.2. Find the maximal correlation r = max_j |c_j(μ(k))| and move the feature corresponding to r from A^c to A
   2.3. Calculate the following to compute the equiangular vector u_A = X∗_A w_A:
        X∗_A = (..., s_j X∗_j, ...)_{j∈A},  G_A = X∗_Aᵀ X∗_A,  a = (1_Aᵀ G_A^{−1} 1_A)^{−1/2},  w_A = a G_A^{−1} 1_A
   2.4. Compute the analytical step size in the direction of u_A:
        γ = min⁺_{j∈A^c} { (r − r_j)/(a − a_j), (r + r_j)/(a + a_j) } (minimum over positive values),
        where r, a, r_j and a_j are the correlations between X∗_A and Y∗, X∗_A and u_A, X∗_j and Y∗, and X∗_j and u_A, respectively
   2.5. Let d be the p-vector equal to s_j w_{A,j} for j ∈ A and 0 elsewhere; compute γ∗ = min⁺_j { −β_j(k)/d_j } with minimizing index j∗
   2.6. If γ∗ < γ, then update μ(k+1) = μ(k) + γ∗ u_A and remove j∗ from A. Otherwise, update μ(k+1) = μ(k) + γ u_A
Output: the series of coefficients β(k) (the regularization path)

The key computation step in the LARSEN algorithm is c(μ(k)): the feature with the highest correlation to the residual is selected to enter the model. In order to allow a weighted training dataset, the weight vector w_t of length N at the tth boosting cycle is incorporated into the c(μ(k)) computation (8), where the weighted residual w_t.ε(k) (the element-wise product of the weight vector and the residual) is computed:

c(μ(k)) = X∗ᵀ(w_t.ε(k)).    (8)
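A small NumPy sketch of the two ingredients above, under stated assumptions: the augmented dataset of (7) fed to an off-the-shelf LARS/Lasso path solver (standing in for LARSEN), followed by the weighted correlation of (8). This illustrates the formulas rather than the authors' implementation; the toy data and the weight vector w_t are hypothetical, and here the p augmentation rows are simply given weight 1.

```python
# Sketch of eq. (7) (Elastic Net as an augmented Lasso problem) and
# eq. (8) (weight-aware correlation used to pick the next feature).
import numpy as np
from sklearn.linear_model import lars_path  # generic LARS/Lasso path solver

rng = np.random.default_rng(1)
N, p = 50, 10
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)          # stands in for the scored response Y*theta
lam2 = 0.5                          # Ridge part of the Elastic Net

# Eq. (7): stack X on sqrt(lam2)*I, pad the response with p zeros.
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
y_aug = np.concatenate([y, np.zeros(p)])

# Whole Lasso regularization path of the augmented problem, cf. eq. (6).
alphas, active, coefs = lars_path(X_aug, y_aug, method="lasso")
print("number of breakpoints on the path:", len(alphas))

# Eq. (8): weighted correlation. w_t reweights the N training residuals.
w_t = rng.uniform(0.5, 2.0, size=N)          # hypothetical boosting weights
w_full = np.concatenate([w_t, np.ones(p)])
beta = coefs[:, len(alphas) // 2]            # any point on the path
residual = y_aug - X_aug @ beta
c = X_aug.T @ (w_full * residual)            # feature with max |c| enters next
print("next feature to enter:", int(np.argmax(np.abs(c))))
```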

Given the regularization path (the output of the modified LARSEN), we propose to select the optimal β solution based on Generalized Cross Validation (GCV) model selection (9):

‖Y∗θ − X∗β‖²₂ / (N − df)².    (9)

GCV returns a measure for each model that accounts for how well the model fits the training dataset, the number of training samples, and the complexity of the model in terms of its degrees of freedom, df. The df of an Elastic Net solution β is measured as the trace of X_A(X_AᵀX_A + λ₂I)^{−1}X_Aᵀ. The model with the lowest GCV measure is selected to form the multivariate decision node. This base condition is presented to Algorithm 1 to induce the sparse ADTree.

In short, sparse ADTree can be regarded as a hybrid of ADTree and SLDA that inherits merits from both algorithms: SLDA is boosted into a set of decision rules in which "local" linear discriminative projections are discovered for the partitioned input space.

4. Experiments

4.1. Experimental setup

The proposed sparse ADTree was compared against univariate decision trees (ADTree, C4.5 and CART), multivariate decision trees (Fisher's decision tree and oblique Random Forest) and SLDA. The following implementations are publicly available and were used in this experiment:

• C4.5 and CART: WEKA platform in Java [20]
• Fisher's decision tree (FDT): WEKA executable jar file [26]
• Oblique Random Forest (oRF): R package [28]
• SLDA: SpaSM Matlab toolbox [31]

Two variants of sparse ADTree were evaluated in the experiment. The first variant is the univariate sparse ADTree (sADT-uni); only one nonzero feature coefficient is allowed per decision node in this case. The second variant (sADT-gcv) uses GCV to select one of the solutions from the full regularization path. This illustrates the ability of the proposed algorithm to fine-tune the complexity of the decision node. For ADTree, the simple thresholding method of Kuang and Ooi [24] is used to generate the base classifiers. For a fair comparison, we implemented oRF with a single multivariate decision tree (oRF1).

The experiment was conducted using a set of real-world datasets from the UCI machine learning repository and spectral datasets from the UEF spectral color research group. The details of these datasets are shown in Table 1. All datasets were preprocessed to center each feature to mean zero and standard deviation one.

Table 1
Summary of UCI datasets and UEF datasets.

ID  Dataset name       No. of samples  No. of features
1   Breast cancer      569             30
2   Blood transfusion  748             4
3   Liver disorder     345             6
4   Vertebral          310             6
5   Pimaindian         768             8
6   Heart              267             44
7   MAGIC gamma        19,020          10
8   Parkinson          195             22
9   Haberman           306             3
10  ILPD               579             10
11  Woodchip (UEF)     10,000          26
12  Forest (UEF)       707             93
13  Paper (UEF)        180             31

A standard 10 × 10 fold stratified cross-validation was performed on each dataset for every classification model. For both the univariate and the sparse ADTree, an additional 10-fold stratified cross-validation is needed within each cross-validation fold to find the optimal number of boosting cycles. Average cross-validation prediction accuracy, decision tree size (total number of nodes) and induction time were used to judge the performance of each classification model. The classification models were ranked according to their average ranks across all datasets. The nonparametric Friedman test at 1% and 5% significance was used to assess the statistical significance of the rank differences [12]. If a statistically significant difference is detected by the Friedman test, a post-hoc Nemenyi test is conducted to identify the pair(s) of classification models with a statistically significant difference.

4.2. Results and discussions

Prediction accuracy, decision tree size and induction time (average ± standard deviation) of each classification model for all datasets are tabulated in Tables 2–4, respectively. The best value is shown in bold for each dataset. The last row of each table shows the average rank of each classification model; the lower the rank, the better.

In terms of prediction accuracy, sADT-gcv achieves the best average rank in comparison to the univariate and multivariate decision trees. ADTree, C4.5 and CART (univariate decision trees) are limited by their axis-parallel partitioning approach to approximating the decision boundary, which explains why their decision tree sizes are generally larger than that of sADT-gcv. This phenomenon is best illustrated by the MAGIC Gamma dataset, where C4.5 (726 nodes) and CART (209 nodes), reusing the same features many times, generate extremely complicated and incomprehensible trees. sADT-gcv generates 3 decision stumps for the MAGIC Gamma dataset, as shown in Fig. 3. In comparison to the original ADTree, sADT-gcv brings the performance level of ADTree to that of C4.5 and CART. Sparse ADTree is much more flexible because it can accommodate various model selection criteria to perform decision node complexity tuning. GCV has been selected to match the complexity of the decision node to the complexity of the training dataset. A more aggressive feature-pruning model can be used if the dataset is known to contain only a few strongly predictive features, or if sparsity of the resulting model has very high priority.

Fig. 3. sADT-gcv model for MAGIC Gamma, consisting of three multivariate decision stumps f1(x), f2(x) and f3(x). The components of the discriminative vector β for the MAGIC Gamma dataset are mostly comparable in magnitude, confirming that the optimal decision boundary is not axis parallel. Having a decision boundary that is not axis-parallel is equivalent to having probabilistically correlated features.
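A hedged sketch of the GCV-based model selection in (9): given candidate coefficient vectors along a regularization path, score each by its residual sum of squares and Elastic Net degrees of freedom, and keep the minimizer. Variable names and the generic (X, y) interface are illustrative; the paper applies this to the augmented pair (X∗, Y∗θ).

```python
# Sketch of selecting one beta from a regularization path via GCV, eq. (9):
#   GCV = ||y - X beta||^2 / (N - df)^2,
#   df  = trace( X_A (X_A^T X_A + lam2 I)^{-1} X_A^T )  over the active set A.
import numpy as np

def elastic_net_df(X, beta, lam2):
    """Degrees of freedom of an Elastic Net solution (trace formula)."""
    active = np.flatnonzero(beta)
    if active.size == 0:
        return 0.0
    XA = X[:, active]
    M = XA.T @ XA + lam2 * np.eye(active.size)
    return float(np.trace(XA @ np.linalg.solve(M, XA.T)))

def select_by_gcv(X, y, path, lam2):
    """path: array of shape (p, n_candidates); returns the best column and its score."""
    N = X.shape[0]
    best, best_gcv = None, np.inf
    for k in range(path.shape[1]):
        beta = path[:, k]
        rss = float(np.sum((y - X @ beta) ** 2))
        df = elastic_net_df(X, beta, lam2)
        gcv = rss / (N - df) ** 2
        if gcv < best_gcv:
            best, best_gcv = beta, gcv
    return best, best_gcv
```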

Table 2
Prediction accuracy (average ± standard deviation) in terms of percentage.

ID C4.5 CART ADTree sADT-uni sADT-gcv SLDA FDT oRF1

1 93.52 ± 0.78 93.04 ± 0.73 94.68 ± 0.80 93.68 ± 0.67 96.89 ± 0.37 96.36 ± 0.25 94.97 ± 0.46 94.04 ± 0.81
2 77.92 ± 0.62 78.28 ± 0.36 77.38 ± 0.41 76.09 ± 0.38 76.21 ± 0.00 66.00 ± 0.42 78.52 ± 0.57 70.75 ± 1.09
3 65.76 ± 2.14 66.38 ± 2.45 62.34 ± 1.37 62.35 ± 1.39 62.81 ± 0.81 62.55 ± 0.38 50.67 ± 4.61 61.17 ± 2.53
4 81.23 ± 1.00 80.81 ± 1.25 82.71 ± 1.24 79.03 ± 0.59 83.35 ± 0.57 79.65 ± 0.87 83.06 ± 1.24 78.55 ± 2.56
5 74.58 ± 0.90 74.15 ± 0.37 72.54 ± 0.91 73.32 ± 0.67 76.01 ± 0.41 76.14 ± 0.15 64.96 ± 4.86 68.21 ± 1.61
6 75.50 ± 1.88 78.31 ± 1.12 78.57 ± 1.66 78.87 ± 1.05 76.36 ± 1.77 68.04 ± 1.30 73.62 ± 1.95 73.69 ± 1.85
7 85.13 ± 0.15 85.37 ± 0.12 78.59 ± 0.10 75.12 ± 0.01 78.90 ± 0.03 79.45 ± 0.02 74.69 ± 4.12 79.75 ± 0.16
8 83.83 ± 2.01 86.88 ± 1.63 88.86 ± 1.57 80.09 ± 0.84 81.67 ± 1.01 82.02 ± 1.34 84.63 ± 1.57 83.06 ± 3.16
9 70.54 ± 1.07 72.21 ± 1.23 71.91 ± 1.59 72.24 ± 1.00 71.28 ± 0.81 74.04 ± 0.63 71.60 ± 1.00 63.15 ± 1.91
10 68.35 ± 1.75 71.11 ± 0.65 71.23 ± 0.52 71.51 ± 0.00 71.39 ± 0.26 63.40 ± 0.60 59.03 ± 3.52 64.92 ± 2.03
11 91.94 ± 0.18 91.64 ± 0.26 67.82 ± 0.16 63.81 ± 0.02 99.58 ± 0.01 99.61 ± 0.01 99.47 ± 0.05 95.92 ± 0.16
12 88.61 ± 0.70 88.30 ± 0.52 83.71 ± 0.38 77.47 ± 0.20 96.42 ± 0.36 96.29 ± 0.26 95.25 ± 0.46 91.74 ± 0.76
13 95.32 ± 1.39 96.99 ± 1.33 96.51 ± 1.67 80.08 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 92.51 ± 2.16
Rank 4.46 3.77 4.38 5.62 3.23 4.00 4.69 5.85

Table 3
Decision tree size (average ± standard deviation) in terms of total number of nodes.

ID C4.5 CART ADTree sADT-uni sADT-gcv FDT oRF1

1 21.44 ± 00.94 13.08 ± 01.62 71.65 ± 12.25 43.75 ± 09.41 11.59 ± 04.84 11.00 ± 0.78 25.38 ± 0.96
2 10.82 ± 01.06 13.62 ± 02.26 31.93 ± 07.48 5.92 ± 03.21 4.00 ± 00.00 6.12 ± 0.66 120.40 ± 4.53
3 49.18 ± 02.52 24.80 ± 04.97 33.28 ± 06.64 35.53 ± 06.30 5.62 ± 02.64 7.24 ± 1.32 91.78 ± 3.92
4 19.92 ± 00.89 13.96 ± 02.56 51.91 ± 14.21 7.66 ± 02.63 7.63 ± 02.33 14.50 ± 1.55 52.60 ± 3.50
5 39.02 ± 04.72 18.60 ± 06.76 25.75 ± 07.54 11.92 ± 02.67 13.24 ± 04.32 9.34 ± 2.64 152.24 ± 5.00
6 36.90 ± 00.84 3.88 ± 01.71 23.08 ± 05.33 20.29 ± 04.43 50.56 ± 12.02 11.58 ± 0.60 34.04 ± 1.48
7 726.48 ± 20.51 209.62 ± 18.26 137.60 ± 03.80 13.00 ± 00.00 15.94 ± 08.25 266.46 ± 53.76 2633.60 ± 14.66
8 19.04 ± 01.11 11.40 ± 01.96 90.25 ± 08.48 9.79 ± 04.03 16.30 ± 07.90 9.46 ± 0.60 23.88 ± 0.41
9 4.88 ± 01.04 4.42 ± 02.26 29.98 ± 10.19 16.84 ± 09.33 18.22 ± 05.43 1.90 ± 0.51 69.38 ± 2.10
10 55.64 ± 08.40 2.00 ± 01.61 8.05 ± 04.32 4.12 ± 00.38 15.88 ± 08.10 10.84 ± 2.83 131.30 ± 3.87
11 516.54 ± 03.97 313.08 ± 25.40 30.55 ± 00.32 4.00 ± 00.00 4.00 ± 00.00 18.22 ± 0.36 270.38 ± 11.03
12 44.64 ± 01.09 27.52 ± 03.96 86.62 ± 08.22 6.70 ± 04.48 10.30 ± 04.03 8.62 ± 0.55 35.10 ± 1.49
13 10.48 ± 00.30 10.02 ± 00.75 76.12 ± 07.46 4.00 ± 00.00 4.00 ± 00.00 3.00 ± 0.00 15.06 ± 0.67
Rank 5.31 3.31 5.31 2.62 2.92 2.31 6.23

Table 4
Induction time (average ± standard deviation) in terms of milliseconds. Normalization is performed to allow fairer comparison between different platforms.
Algorithms implemented in Java, Matlab and R are divided by a factor of 2.4, 10, and 480 respectively.

ID C4.5 CART ADTree sADT-uni sADT-gcv SLDA FDT oRF1

1 9.60 ± 0.24 68.01 ± 3.06 142.59 ± 39.40 12.17 ± 5.04 44.11 ± 22.33 9.58 ± 3.19 6.62 ± 0.49 0.22 ± 0.03
2 1.18 ± 0.04 18.81 ± 0.60 7.08 ± 3.50 1.19 ± 0.93 0.93 ± 0.15 0.60 ± 0.23 1.87 ± 0.28 0.73 ± 0.06
3 1.81 ± 0.06 13.14 ± 0.30 10.92 ± 4.60 4.84 ± 1.47 1.49 ± 0.55 0.77 ± 0.37 1.61 ± 0.25 0.34 ± 0.04
4 1.22 ± 0.04 11.20 ± 0.34 18.83 ± 9.11 1.21 ± 0.52 1.82 ± 0.51 0.83 ± 0.25 1.29 ± 0.17 0.18 ± 0.01
5 4.44 ± 0.11 37.05 ± 0.98 12.21 ± 7.15 1.79 ± 0.94 4.20 ± 1.53 0.99 ± 0.26 3.47 ± 0.72 0.67 ± 0.06
6 6.52 ± 0.26 30.77 ± 0.96 32.00 ± 13.95 5.54 ± 2.00 287.88 ± 67.76 13.73 ± 1.28 14.84 ± 0.76 0.15 ± 0.01
7 562.24 ± 15.37 3574.14 ± 55.50 1624.64 ± 78.57 21.01 ± 2.54 125.84 ± 76.99 18.81 ± 2.11 457.77 ± 12.51 15.75 ± 0.46
8 2.60 ± 0.27 16.21 ± 0.58 124.99 ± 16.49 1.63 ± 0.83 25.19 ± 14.59 3.67 ± 0.51 2.89 ± 0.38 0.10 ± 0.01
9 0.73 ± 0.83 9.00 ± 3.49 5.48 ± 2.55 2.41 ± 1.75 2.72 ± 1.11 0.48 ± 0.06 0.64 ± 0.10 0.34 ± 0.07
10 5.24 ± 0.55 33.49 ± 1.85 2.57 ± 3.28 0.85 ± 0.30 8.00 ± 4.89 1.23 ± 0.23 0.01 ± 0.00 0.43 ± 0.03
11 445.03 ± 51.87 3040.43 ± 218.49 125.46 ± 2.34 4.68 ± 0.04 39.31 ± 4.84 57.76 ± 2.65 124.38 ± 6.58 5.01 ± 0.24
12 47.14 ± 1.64 378.07 ± 19.26 623.06 ± 110.28 3.92 ± 3.10 1117.34 ± 534.64 96.91 ± 4.14 74.09 ± 3.54 0.35 ± 0.02
13 2.05 ± 0.22 15.73 ± 0.65 133.70 ± 26.42 0.77 ± 0.08 11.03 ± 2.07 7.70 ± 0.39 1.75 ± 0.37 0.08 ± 0.01
Rank 4.54 7.31 7.08 3.23 5.54 3.15 3.92 1.23
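The average ranks reported in the last rows of Tables 2–4 and the Friedman test described in Section 4.1 can be reproduced from a score matrix with a short script. The matrix below is a synthetic stand-in, not the paper's results, and the post-hoc Nemenyi test (not available in SciPy) is only indicated in a comment.

```python
# Sketch: per-dataset ranking of classifiers and the nonparametric Friedman test.
# Rows = datasets, columns = classifiers; higher accuracy is better, so ranks
# are computed on the negated scores (rank 1 = best).
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

accuracy = np.array([            # hypothetical 4 datasets x 3 classifiers
    [93.5, 94.7, 96.9],
    [77.9, 77.4, 76.2],
    [65.8, 62.3, 62.8],
    [81.2, 82.7, 83.4],
])

ranks = np.vstack([rankdata(-row) for row in accuracy])   # rank within each dataset
print("average ranks:", ranks.mean(axis=0))               # lower is better

stat, p_value = friedmanchisquare(*[accuracy[:, j] for j in range(accuracy.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.3f}")
# If p falls below the chosen significance level, a post-hoc Nemenyi test would
# then identify which pairs of classifiers differ.
```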

In comparison to multivariate decision trees, sADT-gcv is capable of achieving better prediction accuracy due to its structural flexibility of allowing multiple multivariate decision nodes under the same prediction node. FDT can be viewed as a multivariate extension of C4.5 in which Fisher's discriminant analysis is used to project the training samples onto a subspace spanned by β, and information gain is used to find the best split point on the artificial feature (a linear combination of features). oRF applies a Ridge penalty to search between Fisher's discriminant (no penalty effect) and Principal Component Analysis. For each bagged tree, training samples are resampled with replacement for induction, and mtry features are resampled without replacement for each decision node. The suggested mtry value is approximately the square root of the feature dimension, √p. sADT-gcv performs simultaneous feature selection, which is absent in FDT and in the Ridge solution of oRF; oRF instead relies on the random subspace spanned by the sampled mtry features for feature dimension reduction.

SLDA relies on a single discriminative direction to perform classification. When it is boosted under the ADTree representation, the resulting sADT-gcv achieves a better average rank than SLDA in terms of prediction accuracy. This improvement is due to the nonparametric ADTree hierarchy of discriminant analyses, which better matches the complexity of the training dataset. The proposed sparse ADTree model subsumes SLDA as a potentially suitable model, so users do not need to worry about whether a linear discriminant model is sufficient. A linearly separable dataset would induce an SLDA model and stop boosting after one cycle; no effort is required on the part of the user to choose the most suitable model. Take the Woodchip dataset for example: it suffices to partition it with a single linear decision boundary, as shown by sADT-gcv, which removes the need to try both SLDA and tree-based classifiers. On the other hand, there are examples like Heart and ILPD where a linear discriminant model is insufficient; the boosted classifier in the form of sparse ADTree ensures that the model is sufficiently descriptive to achieve performance comparable to C4.5 and CART.


Fig. 4. Histogram of 10 × 10 fold cross validation performance on prediction accuracy of C4.5 and CART against sADT-gcv for (a) Woodchip (UEF), (b) Forest (UEF) and (c) Paper
(UEF) datasets. The p-values of the pairwise t-test between C4.5 and sADT-gcv are 4.98 × 10−97 for Woodchip (UEF), 9.20 × 10−34 for Forest (UEF) and 2.70 × 10−13 for Paper (UEF).
The p-values of the pairwise t-test between CART and sADT-gcv are 2.80 × 10−97 for Woodchip (UEF), 4.87 × 10−37 for Forest (UEF) and 5.56 × 10−09 for Paper (UEF).
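The pairwise t-tests reported in the caption of Fig. 4 compare two classifiers' accuracies over the same 10 × 10 cross-validation folds. A minimal sketch is given below; the fold accuracies are synthetic placeholders, not the paper's results.

```python
# Sketch: paired t-test between two classifiers evaluated on the same
# 100 cross-validation folds (10 x 10 CV), as in the Fig. 4 caption.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
acc_c45 = 92.0 + rng.normal(0.0, 0.8, size=100)    # e.g. C4.5 fold accuracies
acc_sadt = 99.5 + rng.normal(0.0, 0.3, size=100)   # e.g. sADT-gcv fold accuracies

t_stat, p_value = ttest_rel(acc_sadt, acc_c45)     # paired: same folds
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
# A very small p-value indicates the per-fold accuracy difference is
# statistically significant, as observed for the UEF spectral datasets.
```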

In terms of the complexity of the multivariate splits, Table 5 records the total number of nonzero β coefficients (the sparsity level); average ranks are shown in the last row of the table. SLDA achieves the lowest average rank as it uses only a single decision boundary, whereas decision trees are a nonparametric approach that induces a suitable number of decision boundaries. Statistically, SLDA is better than FDT and oRF1. All multivariate functions considered in this paper involve only linear functions; therefore, the complexity of the split can be used as an approximate indicator of inference time. Despite trailing FDT in terms of decision tree size, the overall sparsity of sADT-gcv is lower, which demonstrates the ability of our algorithm to incorporate sparseness into a multivariate decision tree seamlessly.

In terms of decision tree size, it is well known that there is a trade-off between tree size and the complexity of the decision nodes. Our experiment validates that multivariate decision trees often induce smaller decision trees [3]. FDT has the best average rank, followed by both variants of sparse ADTree. On the other hand, oRF1 has the worst average rank due to its setting of growing a full multivariate decision tree.

In terms of induction time, normalization is performed to allow a fairer comparison, as Java is faster than Matlab and both are faster than R [2]. However, it is safe to draw the following conclusion, since SLDA, ADTree and sparse ADTree are implemented on the same platform in Matlab: boosting SLDA into an ADTree requires a longer induction time than a single SLDA, which is expected. sADT-uni grows faster than sADT-gcv due to the early stopping of the regularization path computation. Both variants of sparse ADTree are faster than ADTree. The quality of an ADTree depends very much on its base condition set. So far in the literature, ADTree uses an exhaustive approach to form base conditions from individual features, which can be computationally costly if the feature dimension is high. In our proposal, the regularization path can be terminated early to save computation, as shown in our experiment. A key property of decision trees, namely fast training, is not lost with our sparse ADTree.

According to the nonparametric Friedman test with post-hoc Nemenyi test, no statistically significant difference was detected for prediction accuracy. For decision tree size, FDT is statistically smaller than C4.5, ADTree and oRF1, and both variants of sparse ADTree are also smaller than oRF1. For induction time, oRF1, SLDA and sADT-uni are all statistically significantly faster than both ADTree and CART. In addition, oRF1 is also statistically faster than sADT-gcv, and FDT is found to be statistically faster than CART.

Comparison over multiple datasets [12] assumes that "independent" datasets are used to compare the scores of multiple classifiers. García and Herrera [18] argued that many datasets from all possible domains of application have to be employed to demonstrate statistically significant differences, owing to the low power of the Nemenyi test. Therefore the significance test offers very little insight into the problem at hand. According to the "no free lunch theorem" [35], there is no single classifier that can consistently outperform all other classifiers across every possible domain of application. It is therefore more sensible to determine the best domain of application for each classifier rather than comparing average performance across disparately different datasets. The following analysis examines the performance of the classification trees and tries to relate these performances to the characteristics of the datasets.

For datasets with high correlation among the features, especially those in the UEF spectral datasets, multivariate decision trees (SLDA, FDT, oRF1 and sADT-gcv) outperform univariate trees (C4.5 and CART). Fig. 4, a visualization of the pairwise t-tests, shows histograms of the prediction accuracy distributions over all 10 × 10 cross-validation fold results; sADT-gcv clearly outperforms C4.5 and CART on the spectral datasets.

Table 5
Complexity of multivariate split (average ± standard deviation) in terms of the total number
of nonzero coefficients.

ID SLDA sADT-gcv FDT oRF1

1 22.19 ± 1.87 86.58 ± 160.70 150.00 ± 023.74 58.45 ± 10.86


2 4.00 ± 0.00 4.00 ± 000.00 10.24 ± 005.52 118.40 ± 11.71
3 5.97 ± 0.17 9.20 ± 013.04 18.72 ± 019.00 89.78 ± 08.72
4 4.92 ± 0.31 13.25 ± 012.49 40.50 ± 017.40 50.60 ± 08.46
5 6.88 ± 0.38 26.09 ± 027.29 33.36 ± 031.31 150.24 ± 14.98
6 8.20 ± 2.13 270.66 ± 268.40 232.76 ± 056.39 96.12 ± 13.70
7 9.86 ± 0.40 40.35 ± 066.59 1327.30 ± 967.66 3974.40 ± 89.53
8 17.20 ± 2.19 99.13 ± 130.80 93.06 ± 018.19 43.76 ± 08.52
9 1.95 ± 0.33 10.70 ± 014.16 1.35 ± 002.43 67.38 ± 07.50
10 7.55 ± 0.72 43.63 ± 078.18 49.20 ± 049.05 193.95 ± 14.80
11 26.00 ± 0.00 26.00 ± 000.00 223.86 ± 039.27 670.95 ± 89.47
12 67.11 ± 6.18 276.10 ± 518.75 354.33 ± 088.37 148.95 ± 23.70
13 30.01 ± 1.31 28.83 ± 001.58 31.00 ± 000.00 32.65 ± 08.57
Rank 1.23 2.38 3 3.38
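For completeness, the split-complexity measure of Table 5 is simply the total count of nonzero β coefficients summed over a tree's multivariate decision nodes; a tiny helper (with a hypothetical list of node coefficient vectors) is shown below.

```python
# Sketch: the "complexity of multivariate split" in Table 5 is the total number
# of nonzero beta coefficients over all multivariate decision nodes of a tree.
import numpy as np

def split_complexity(node_betas):
    """node_betas: iterable of 1-D coefficient vectors, one per decision node."""
    return int(sum(np.count_nonzero(beta) for beta in node_betas))

# Hypothetical three-node sparse ADTree over 5 features:
print(split_complexity([np.array([0.0, 1.2, 0.0, -0.4, 0.0]),
                        np.array([0.7, 0.0, 0.0, 0.0, 0.0]),
                        np.array([0.0, 0.0, 0.3, 0.0, 0.1])]))   # -> 5
```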

The success of sADT-gcv over univariate decision trees on the spectral datasets can be attributed to the linear transformation of the original features into a new feature, through the sparse discriminative vector β, with higher discriminative power than any individual feature; this automates the feature selection mechanism as well as achieving better classification performance. In addition, owing to the convex combination of Ridge and Lasso in the Elastic Net penalty, the Ridge norm encourages the grouping of correlated features in forming the discriminative vector [37]. The Elastic Net penalty, which yields feature selection together with a grouping effect on correlated features, aids the interpretability of the decision nodes.

5. Conclusion

This paper presents a novel sparse ADTree learning algorithm. The ADTree has a unique tree representation that supports boosting. However, it is currently a univariate type that ignores potential interactions between features. We propose to boost SLDA into the ADTree model. This is a novel approach that induces a multivariate variant of the ADTree. We allow the complexity of the decision node to be fine-tuned easily such that the proposed sparse ADTree can generate either a univariate or a multivariate ADTree. The embedded Elastic Net criterion not only performs feature selection, but also encourages the grouping of correlated features at each decision node. The sparse ADTree was tested on real-world datasets from the UCI machine learning repository and the UEF spectral repository. It shows an improvement over SLDA and is competitive against univariate decision trees (ADTree, C4.5 and CART) and multivariate decision trees (Fisher's decision tree and a single oblique Random Forest). It performs especially well on spectral datasets, where the features are highly correlated. In short, sparse ADTree is a generalization of the existing univariate ADTree. It is a promising model for a wide range of applications, including those already implementing the ADTree and potentially new applications of the multivariate variety.

Acknowledgments

The authors thank the associate editor and the anonymous reviewers for constructive comments and suggestions. This work was supported by Monash University Malaysia through a Higher Degree Research scholarship and by the Malaysia Ministry of Higher Education Fundamental Research Grant Scheme FRGS/1/2013/TK02/MUSM/02/1.

References

[1] E.L. Allwein, R.E. Schapire, Y. Singer, Reducing multiclass to binary: a unifying approach for margin classifiers, J. Mach. Learn. Res. 1 (2001) 113–141.
[2] S. Aruoba, J. Fernández-Villaverde, A comparison of programming languages in economics, NBER Working Paper No. 20263, 2014.
[3] R.C. Barros, M.P. Basgalupp, A. de Carvalho, A.A. Freitas, A survey of evolutionary algorithms for decision-tree induction, IEEE Trans. Syst. Man Cybern. C Appl. Rev. 42 (2011) 291–312.
[4] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[5] L. Breiman, J.H. Friedman, R.A. Olshen, Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
[6] C.E. Brodley, P.E. Utgoff, Multivariate decision trees, Mach. Learn. 19 (1995) 45–77.
[7] L. Clemmensen, T. Hastie, D. Witten, B. Ersbøll, Sparse discriminant analysis, Technometrics 53 (2011) 406–413.
[8] G. Creamer, Y. Freund, Using boosting for financial analysis and performance prediction: application to S&P 500 companies, Latin American ADRs and banks, Comput. Econ. 36 (2010) 133–151.
[9] G. Creamer, Y. Freund, Learning a board balanced scorecard to improve corporate performance, Decis. Support Syst. 49 (2010) 365–385.
[10] G. Creamer, Y. Freund, Automated trading with boosting and expert weighting, Quant. Finance 10 (2010) 401–420.
[11] F. De Comité, R. Gilleron, M. Tommasi, Learning multi-label alternating decision trees from texts and data, in: Proceedings of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition, 2003, pp. 35–49.
[12] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[13] T. Dietterich, Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res. 2 (1995) 263–286.
[14] B. Efron, T. Hastie, Least angle regression, Ann. Statist. 32 (2004) 407–499.
[15] A. Frank, A. Asuncion, UCI Machine Learning Repository [WWW Document], n.d. URL http://archive.ics.uci.edu/ml.
[16] Y. Freund, L. Mason, The alternating decision tree learning algorithm, in: Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 124–133.
[17] Y. Freund, R. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.
[18] S. García, F. Herrera, An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons, J. Mach. Learn. Res. 9 (2008) 2677–2694.
[19] R. Guy, P. Santago, C. Langefeld, Bootstrap aggregating of alternating decision trees to detect sets of SNPs that associate with disease, Genet. Epidemiol. 106 (2012) 99–106.
[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explorations 11 (2009).
[21] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, Ann. Statist. 23 (1995) 73–102.
[22] A. Hoerl, R. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
[23] G. Holmes, B. Pfahringer, R. Kirkby, E. Frank, M. Hall, Multiclass alternating decision trees, in: Machine Learning: ECML 2002, Springer, 2002, pp. 161–172.
[24] Y.C. Kuang, M.P.L. Ooi, Complex feature alternating decision tree, Int. J. Intell. Syst. Technol. Appl. 9 (2010) 335–353.
[25] K.-Y.K. Liu, J. Lin, X. Zhou, S.T.C. Wong, Boosting alternating decision trees modeling of disease trait information, BMC Genet. 6 (2005) 1–6.
[26] A. López-Chau, J. Cervantes, L. López-García, F.G. Lamont, Fisher's decision tree, Expert Syst. Appl. 40 (2013) 6283–6291.
[27] Q. Mai, A review of discriminant analysis in high dimensions, WIREs Comput. Stat. 5 (2013) 190–197.
[28] B. Menze, B. Kelm, D. Splitthoff, On oblique random forests, in: European Conference on Machine Learning (ECML/PKDD), 2011, pp. 453–469.
[29] S.K. Murthy, S. Kasif, A system for induction of oblique decision trees, J. Artif. Intell. Res. 2 (1994) 1–32.
[30] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, 1993.
[31] K. Sjöstrand, L. Clemmensen, SpaSM: a Matlab toolbox for sparse statistical modeling. Available at http://www.imm.dtu.dk/projects/spasm/.
[32] G. Stiglic, M. Bajgot, P. Kokol, Gene set enrichment meta-learning analysis: next-generation sequencing versus microarrays, BMC Bioinform. 11 (2010) article 176.
[33] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Statist. Soc. B 58 (1996) 267–288.
[34] University of Eastern Finland, Spectral Color Research Group [WWW Document], n.d. URL https://www.uef.fi/spectral/spectral-database.
[35] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1997) 67–82.
[36] O.T. Yildiz, E. Alpaydin, Linear discriminant trees, Int. J. Pattern Recognit. Artif. Intell. 19 (2005) 323–353.
[37] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Statist. Soc. B 67 (2005) 301–320.
