Abstract—Application of machine learning to new and unfamiliar domains calls for increasing automation in choosing a learning algorithm suitable for the data arising from each domain. Meta-learning could address this need, since it has been widely used in recent years to support the recommendation of the most suitable algorithms for a new dataset. The use of complexity measures could improve the systematic understanding of the meta-models and also make it possible to differentiate the performance of a set of techniques taking into account the overlap between classes imposed by feature values and the separability and distribution of the data points. In this paper we compare the effectiveness of several standard regression models in predicting the accuracies of classifiers for classification problems from the OpenML repository. We show that the models can predict the classifiers' accuracies with low mean squared error and identify the best classifier for a problem, resulting in statistically significant improvements over a randomly chosen classifier or a fixed classifier believed to be good on average.

I. INTRODUCTION

The recent surge of interest in applying Machine Learning (ML) methods to many new and diverse domains has highlighted a need for further automation of the analysis process. In particular, new application domains open up the possibility that the data available for learning are of a different nature from those familiar to previous users of ML, such as text, image, and speech signals. With unfamiliar data and class distributions in a new context, a practitioner has little guidance in choosing a suitable learning method, as it is well known that different classifiers suit data with different geometries and distribution characteristics [1].

A decision task like identifying a good classifier for a dataset can be formulated as an ML problem if certain numerical or categorical features are available to describe the dataset, and some previous examples are available for training a predictor. Meta-Learning (MtL) with data complexity measures, where the learning task is to predict a classifier's performance for a dataset and the features used are statistical or geometrical characteristics of that dataset, carries a promise to serve this purpose. The practical question, however, as in all learning tasks, is to what extent such prediction is successful, and whether the known measures of data complexity and the known examples of classifier behavior are sufficient for training such a predictor. In this paper, we demonstrate success in this MtL task using a large collection of ML problems from the OpenML repository [2], several standard base classifiers and meta-predictors, and an extended set of data complexity measures implemented in publicly released software [3].

This paper is organized as follows: Section II presents related studies on the use of complexity measures in MtL. Section III describes the complexity measures employed in this work. Section IV presents the MtL methodology followed in this work. Section V presents the experiments performed along with their results. Section VI concludes the paper.

II. RELATED STUDIES

Complexity measures of classification problems allow one to estimate the expected difficulty of a classification problem by extracting descriptions of the overlap between classes imposed by feature values, the separability and distribution of the data points, and certain structural characteristics of the problem, based on the representation of the problem by the training datasets available for learning [4]. Most of the measure values are highly correlated with the predictive performance of classification models, as demonstrated in previous studies [5], [6]. Therefore, it is expected that they can play an important role in improving the recommendation of classification algorithms [7].

MtL can be used in ML algorithm recommendation by, for instance, estimating the expected predictive accuracy of classification techniques [7]. To do this, a meta-dataset must first be constructed in which each meta-example is associated with a dataset. Then, meta-features describing characteristics of each dataset are extracted. In this paper, the complexity measures of classification problems are used as meta-features. The target value for the prediction is the accuracy of a classification technique for the dataset associated with the meta-example.

Given a meta-dataset, the next step is the induction of a meta-model. The meta-model can be induced by ML techniques and can be used in a recommendation system to select the most suitable algorithm(s) for a new dataset. In this paper we want to predict the accuracy of the classification techniques, so we used a set of regression techniques and also some standard baselines to evaluate and validate the achieved performance [8]. The expectation is that with highly
TABLE I
COMPLEXITY MEASURES EMPLOYED.
The neighborhood measures work over a matrix that stores the distances between all pairs of points in T. The Gower distance [23] is employed, which is hybrid and supports both numerical and categorical features. N1 first builds a Minimum Spanning Tree (MST) from the data and computes the percentage of vertices incident to edges connecting examples of opposite classes in the MST. Higher N1 values indicate the need for more complex boundaries to separate the classes. N2 computes the ratio of two sums: intra-class and inter-class distances. The former corresponds to the sum of the distances between each example and its closest neighbor from the same class. The latter is the sum of the distances between each example and its closest neighbor from another class (nearest enemy). High N2 values are indicative of more complex problems. N3 corresponds to the error rate of a 1NN (one Nearest Neighbor) classifier, estimated using a leave-one-out procedure in T. High N3 values indicate a more complex problem. N4 is similar to L3, but uses the 1NN classifier instead of the linear SVM predictor. T1 builds hyperspheres centered at each example and grows their radii until a hypersphere reaches an example of another class. Smaller hyperspheres contained in larger hyperspheres are eliminated. T1 is defined as the ratio between the number of the remaining hyperspheres and the total number of examples in the dataset, so that lower values correspond to simpler datasets. In this paper the hyperspheres are grown until they reach another hypersphere from an opposite class. The Local-Set (LS) of an example xi ∈ T is the set of points from T whose distance to xi is smaller than the distance from xi to xi's nearest enemy [19]. LSCAvg is the average cardinality of the LS of all examples. Lower LSCAvg values are expected for more complex problems.

In the network measures, the dataset T is first modeled as a graph, in which each example corresponds to a vertex, whilst weighted edges connect pairs of examples. The ε-NN method is used to build the graph, in which pairs of nodes i and j are connected only if dist(i, j) < ε. As in [24] and [20], we adopted ε = 0.15d, where d is the smallest distance between all pairs of examples. Next, as in [20], a post-processing step is applied to the graph to prune edges between examples of opposite classes. Density is given by the number of edges in the graph divided by the maximum number of edges possible between the pairs of points. A low number of edges will be observed for datasets of low density or for which examples of opposite classes are near each other. Both cases indicate a higher classification complexity. ClsCoef averages the clustering tendency of the vertices, which, for each vertex vi, corresponds to the ratio of the number of existing edges between its neighbors to the total number of edges that could possibly exist between them. ClsCoef will be larger for simpler datasets, with dense connections among examples from the same class. Hubs takes the average hub score of the graph's vertices. The hub score of a node is given by the number of connections it has to other nodes, weighted by the number of connections these neighbors have. Simpler datasets show dense regions within the classes and higher hub scores.

T2 is given by the ratio between the number of examples n and the dimensionality m. Low T2 values indicate more sparsity and therefore a higher expected complexity. T3 is similar to T2, but uses the number of principal components needed to represent 95% of the data variability (m′), which can be regarded as an estimate of the intrinsic dataset dimensionality, as the base of the data sparsity assessment. T4 divides the number of PCA dimensions as defined for T3 (m′) by the original number of dimensions (m). The larger the T4 value, the more of the original features are needed to describe the data variability, indicating a more complex relationship among the input variables.
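To make the neighborhood measures above concrete, the following is a minimal pure-Python sketch of N1 and N3 under simplifying assumptions: plain Euclidean distance instead of the Gower distance used in the paper, and numeric features only.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def n3(X, y):
    """N3: leave-one-out error rate of a 1-nearest-neighbor classifier.
    Higher values indicate a more complex classification problem."""
    errors = 0
    for i, xi in enumerate(X):
        # nearest neighbor of xi among all other examples
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: dist(xi, X[k]))
        errors += (y[j] != y[i])
    return errors / len(X)

def n1(X, y):
    """N1: fraction of examples incident to MST edges that join
    opposite classes (Prim's algorithm on the full distance graph)."""
    n = len(X)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # cheapest edge leaving the current tree
        i, j = min(((a, b) for a in in_tree
                    for b in range(n) if b not in in_tree),
                   key=lambda e: dist(X[e[0]], X[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    boundary = {v for i, j in edges if y[i] != y[j] for v in (i, j)}
    return len(boundary) / n
```

For two well-separated clusters, e.g. X = [(0,0), (0,1), (1,0), (5,5), (5,6), (6,5)] with y = [0,0,0,1,1,1], N3 is 0 and N1 is 2/6, since every point's nearest neighbor shares its class and the single MST bridge between the clusters touches one example of each class.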
C1 computes an entropy estimate based on the proportions of examples per class in the dataset. It achieves its maximum value for balanced problems, which can be considered simpler according to the class balance aspect. C2 is a multiclass version of the imbalance ratio, in which larger values are obtained for imbalanced problems.

IV. METHODOLOGY

The data complexity measures are used in an MtL setup designed to predict the accuracy of some popular classification techniques for a given dataset. The objective is to determine whether the measures allow building accurate recommendation systems for a diverse set of classification techniques and classification problems.

To train and evaluate a meta-learner, we build meta-datasets using a collection of classification problems for each of which the performance of one or more classification techniques is known [7]. We use 141 datasets from the OpenML repository [2] in a process as shown in Figure 1. For each dataset we compute the complexity measure values plus the average cross-validated predictive accuracy achieved by four ML techniques: an ANN based on backpropagation (also called Multilayer Perceptron) [12]; SVM [13]; a DT induced by the C4.5 learning algorithm [15]; and kNN, a lazy learning technique [14]. As a result, we have four meta-datasets. All of them have 22 input features, namely the complexity measures described in Section III, and 141 examples (the datasets), and each example is labeled with the accuracy of one of the four classifiers.

Fig. 1. MtL experimental setup (diagram relating complexity measures, classification performance estimation, and the regression techniques).

[…] can be found at a GitHub site¹. The parameters used for the classifiers are: ANN with a learning rate of 0.3, momentum of 0.5 and one hidden layer; C4.5 with pruning; kNN with k = 3; and SVM with a radial basis kernel. The parameters used in the regression techniques are: DWNN with a Gaussian kernel, SVR with radial basis kernels and RF with 500 DTs. All other parameter values of the classification and regression techniques were kept as the defaults of the R implementations used in the experiments.

V. META-LEARNING RESULTS

First we present an overview of the classifiers' performance on all the 141 datasets used in this study. Figure 2(a) shows boxplots of the accuracies achieved by the ANN, C4.5, kNN and SVM classifiers. We notice that all techniques achieved a median accuracy above 0.75. Whilst the SVM classifier achieved the highest median accuracy, the kNN classifier showed the lowest. Figure 2(b) shows the number of times each classifier achieved the best accuracy on these datasets. SVM also achieved the best overall results, winning for about 43% of the datasets, whilst the kNN classifier was better only in 13% of the datasets. Because of these results, the SVM classifier is set as the default recommended technique. The dominance of one base classifier suggests that if we set up the recommendation problem as a multi-class classification problem, the meta-dataset is class-imbalanced. This motivated our use of regression on the accuracies instead for the meta-learning step.

Fig. 2. (a) Distribution of accuracies. (b) Winning classifiers.
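The recommendation step implied by this setup — one meta-regressor per base classifier, each predicting accuracy from the complexity meta-features, with the top-ranked classifier recommended — can be sketched as follows. This is an illustrative toy, not the paper's R implementation: a 1-NN meta-regressor stands in for DWNN/RF/SVR, and all names and values are hypothetical.

```python
from math import dist

class OneNNMetaRegressor:
    """Toy stand-in for DWNN/RF/SVR: predicts the accuracy of the
    closest training meta-example in complexity-measure space."""
    def fit(self, meta_X, accuracies):
        self.meta_X, self.accuracies = meta_X, accuracies
        return self

    def predict(self, meta_x):
        i = min(range(len(self.meta_X)),
                key=lambda k: dist(meta_x, self.meta_X[k]))
        return self.accuracies[i]

def recommend(meta_models, meta_x):
    """Recommend the base classifier whose meta-model predicts the
    highest accuracy for the new dataset's meta-features."""
    predicted = {name: model.predict(meta_x)
                 for name, model in meta_models.items()}
    return max(predicted, key=predicted.get)

# Two training meta-examples described by one meta-feature (say, N3):
models = {
    "SVM": OneNNMetaRegressor().fit([[0.1], [0.5]], [0.90, 0.60]),
    "kNN": OneNNMetaRegressor().fit([[0.1], [0.5]], [0.70, 0.80]),
}
recommend(models, [0.12])  # -> "SVM"
```

In the paper's setup the meta-feature vector would hold the 22 complexity measures and there would be four meta-models, one per base classifier (ANN, C4.5, kNN, SVM).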
Fig. 3. (Boxplots of MSE per meta-dataset — panels: ANN, C4.5, kNN, SVM; x-axis: DWNN, RF, SVR, Random, Default.)

Fig. 4. (Increase in accuracy, summed over all datasets, relative to the Random and Default baselines; x-axis: DWNN, RF, SVR.)

We notice that the DWNN, RF and SVR meta-regressors outperformed the baselines and are more accurate in all cases. In general, the meta-regressors also showed more stable behavior when compared to the baselines, which have elongated boxplots and larger standard deviation values. Among the meta-regressors, in most of the cases RF performed best, followed by SVR and DWNN. Using the Friedman statistical test with the Nemenyi post-test at the 95% confidence level [27], the DWNN, RF and SVR meta-regressors presented better predictive performance than the baselines, but performed similarly to each other.

A dataset-dependent recommender of classifiers is useful if the classifier it suggests for a dataset performs better than a random or a fixed choice. Figure 4 presents the increase in accuracy obtained by the classifiers on the datasets when they are the ones recommended by the meta-regressors instead of the Random and Default recommendations. The x-axis shows the meta-regressors and the y-axis represents the increase in accuracy (summed over all datasets) when compared to the corresponding baseline. The random baseline here assigns a randomly chosen classifier among the four. The default baseline is SVM, which obtained the most wins in the meta-dataset. Since we have four meta-regressors per regression technique (one per base classifier), the recommendation is obtained as follows: the accuracies predicted by each of the four meta-models are used to rank the base classifiers; next, the technique for which the expected accuracy is highest is recommended. Indeed, the meta-regressors were always able to improve the accuracy results and we did not notice any accuracy decrease. The accuracy increase was higher when compared to the random recommendation, which is not surprising. There was also an accuracy increase when compared to the default choice, even though that default choice is the best on average. Again the best performing meta-regression technique was RF.

To learn more about the complexity measures regarding their importance in predicting the best base classifiers to use, we show in Figure 5 the 5 top-ranked complexity measures selected by the RF in all runs. The x-axis represents the measures and the y-axis shows the average of their Gini index in the RF models. The complexity measures regarded as the most important are those based on neighborhood information (N3, N1, N2 and T1). There is also a representative from the structural (network) group (Density). But, since the structural measure is calculated from a proximity graph of the examples, it can also be regarded as employing neighborhood information. It should be noticed that these measures are highly correlated with each other and that a more sophisticated feature selection procedure could reveal other useful combinations of complexity measures.

It is also worth noting that some complexity measures have a strong bias towards the kNN classifier or the DWNN regressor (e.g., N3, N4). But this did not translate into advantages for these techniques, which showed worse results compared to the other classification and regression techniques employed.

One additional analysis is the trade-off between the runtime of the complexity measures and that of evaluating all classifiers in a cross-validation setup. Using a sequential execution for all the datasets on a cluster node with two Intel Xeon E5-2680v2 processors and 128 GB of RAM, the runtime of the complexity measures was lower than that of the classifier evaluations in 71% of the cases. For datasets with a high number of input features, there was an overhead in the complexity measures computation. One must also notice that the complexity
measures are coded in R and are not optimized, whilst the classification techniques are written in compiled languages.

Fig. 5. Top-ranked complexity measures selected by the RF meta-regressors (x-axis: N3, N1, N2, Density, T1; y-axis: average measure importance).

VI. CONCLUSION

This paper presented a study on the use of several measures of classification complexity in the recommendation of classification algorithms. A large and updated set of complexity measures was used to characterize a diverse set of classification problems from different domains. We were able to build accurate meta-models to predict the expected accuracies and choose the best classifier among four popular classification techniques: ANN, SVM, C4.5 and kNN.

The meta-regressors DWNN, RF and SVR, used to predict the classifier with the best expected performance for new datasets, were compared to two baseline recommenders. In the experiments, the random recommendation of one of the four classification techniques, a very naïve baseline, achieved the worst results. The best meta-model was the RF, and the most important complexity measures used by this model were based on neighborhood information and structure.

Future work shall look for the core complexity measures best suited to distinguishing between the performances of various classifiers. We would also like to: (i) optimize the implementation of the complexity measures; (ii) evaluate other MtL approaches, such as ranking the classifiers; (iii) investigate parameter tuning for the classification techniques; and (iv) study further in which cases the classifier recommendation outperforms the default classification technique.

ACKNOWLEDGEMENTS

The second author would like to thank the financial support of the foundations FAPESP (grant 2012/22608-8) and CNPq (grant 308858/2014-0).

REFERENCES

[1] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[2] J. N. van Rijn, B. Bischl, L. Torgo, B. Gao, V. Umaashankar, S. Fischer, P. Winter, B. Wiswedel, M. R. Berthold, and J. Vanschoren, "OpenML: A collaborative science platform," in ECML/PKDD, 2013, pp. 645–649.
[3] L. P. F. Garcia, A. C. Lorena, and J. Lehmann, "ECoL: Complexity measures for classification problems," 2018, https://CRAN.R-project.org/package=ECoL.
[4] T. K. Ho and M. Basu, "Complexity measures of supervised classification problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 289–300, 2002.
[5] R. A. Mollineda, J. S. Sánchez, and J. M. Sotoca, "A meta-learning framework for pattern classification by means of data complexity measures," Inteligencia Artificial, vol. 10, no. 29, pp. 31–38, 2006.
[6] G. D. C. Cavalcanti, T. I. Ren, and B. A. Vale, "Data complexity measures and nearest neighbor classifiers: a practical analysis for meta-learning," in 24th International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, 2012, pp. 1065–1069.
[7] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining. Springer, 2009.
[8] C. Soares, P. Brazdil, and P. Kuba, "A meta-learning method to select the kernel width in support vector regression," Machine Learning, vol. 54, no. 3, pp. 195–209, 2004.
[9] E. Hernández-Reyes, J. A. Carrasco-Ochoa, and J. F. Martínez-Trinidad, "Classifier selection based on data complexity measures," Lecture Notes in Computer Science, vol. 3773, p. 586, 2005.
[10] J. L. Olmo, C. Romero, E. Gibaja, and S. Ventura, "Improving meta-learning for algorithm selection by using multi-label classification: A case of study with educational data sets," International Journal of Computational Intelligence Systems, vol. 8, no. 6, pp. 1144–1164, 2015.
[11] A. Orriols-Puig, N. Macià, and T. K. Ho, "Documentation for the data complexity library in C++," La Salle - Universitat Ramon Llull, Tech. Rep., 2010.
[12] S. Haykin, Neural Networks – A Comprehensive Foundation. Prentice-Hall, 1999.
[13] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[14] T. M. Mitchell, Machine Learning, ser. McGraw Hill Series in Computer Science. McGraw Hill, 1997.
[15] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[16] T. K. Ho, M. Basu, and M. H. C. Law, "Measures of geometrical complexity in classification problems," in Data Complexity in Pattern Recognition, 2006, pp. 1–23.
[17] A. C. Lorena, A. C. P. L. F. de Carvalho, and J. M. P. Gama, "A review on the combination of binary classifiers in multiclass problems," Artificial Intelligence Review, vol. 30, no. 1, pp. 19–37, 2008.
[18] A. C. Lorena, I. G. Costa, N. Spolaôr, and M. C. P. Souto, "Analysis of complexity indices for classification problems: Cancer gene expression data," Neurocomputing, vol. 75, no. 1, pp. 33–42, 2012.
[19] E. Leyva, A. González, and R. Pérez, "A set of complexity measures designed for applying meta-learning to instance selection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 354–367, 2014.
[20] L. P. F. Garcia, A. C. P. L. F. de Carvalho, and A. C. Lorena, "Effect of label noise in the complexity of classification problems," Neurocomputing, vol. 160, pp. 108–119, 2015.
[21] A. K. Tanwani and M. Farooq, "Classification potential vs. classification accuracy: a comprehensive study of evolutionary algorithms with biomedical datasets," Learning Classifier Systems, vol. 6471, pp. 127–144, 2010.
[22] A. Hoekstra and R. P. W. Duin, "On the nonlinearity of pattern classifiers," in 13th ICPR, vol. 4, 1996, pp. 271–275.
[23] J. Gower, "A general coefficient of similarity and some of its properties," Biometrics, vol. 27, no. 4, pp. 857–871, 1971.
[24] G. Morais and R. C. Prati, "Complex network measures for data set characterization," in 2nd BRACIS, 2013, pp. 12–18.
[25] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[26] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[27] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.