Abstract—Application of machine learning to new and unfamiliar domains calls for increasing automation in choosing a learning algorithm suitable for the data arising from each domain. Meta-learning could address this need, since it has been widely used in recent years to support the recommendation of the most suitable algorithms for a new dataset. The use of complexity measures could improve the systematic understanding of the meta-models and also make it possible to differentiate the performance of a set of techniques taking into account the overlap between classes imposed by feature values and the separability and distribution of the data points. In this paper we compare the effectiveness of several standard regression models in predicting the accuracies of classifiers for classification problems from the OpenML repository. We show that the models can predict the classifiers' accuracies with low mean squared error and identify the best classifier for a problem, resulting in statistically significant improvements over a randomly chosen classifier or a fixed classifier believed to be good on average.

I. INTRODUCTION

The recent surge of interest in applying Machine Learning (ML) methods to many new and diverse domains has highlighted a need for further automation of the analysis process. In particular, new application domains open up the possibility that the data available for learning are of a different nature from those familiar to previous users of ML, such as text, image, and speech signals. With unfamiliar data and class distributions in a new context, a practitioner has little guidance in choosing a suitable learning method, as it is well known that different classifiers suit data with different geometries and distribution characteristics [1].

A decision task like identifying a good classifier for a dataset can be formulated as an ML problem if certain numerical or categorical features are available to describe the dataset, and some previous examples are available for training a predictor. Meta-Learning (MtL) with data complexity measures, where the learning task is to predict a classifier's performance for a dataset and the features used are statistical or geometrical characteristics of that dataset, carries a promise to serve this purpose. The practical question, however, as in all learning tasks, is to what extent such prediction is successful, and whether the known measures of data complexity and the known examples of classifier behavior are sufficient for training such a predictor. In this paper, we demonstrate success in this MtL task using a large collection of ML problems from the OpenML repository [2], several standard base classifiers and meta-predictors, and an extended set of data complexity measures implemented in publicly released software [3].

This paper is organized as follows: Section II presents related studies on the use of complexity measures in MtL. Section III describes the complexity measures employed in this work. Section IV presents the MtL methodology followed in this work. Section V presents the experiments performed along with their results. Section VI concludes the paper.

II. RELATED STUDIES

Complexity measures of classification problems allow one to estimate the expected difficulty of a classification problem by extracting descriptions of the overlap between classes imposed by feature values, the separability and distribution of the data points, and certain structural characteristics of the problem, based on the representation of the problem by the training datasets available for learning [4]. Most of the measure values are highly correlated with the predictive performance of classification models, as demonstrated in previous studies [5], [6]. Therefore, it is expected that they can play an important role in improving the recommendation of classification algorithms [7].

MtL can be used in ML algorithm recommendation by, for instance, estimating the expected predictive accuracy of classification techniques [7]. To do this, a meta-dataset must first be constructed in which each meta-example is associated with a dataset. Then, meta-features describing characteristics of each dataset are extracted. In this paper, the complexity measures of classification problems are used as meta-features. The target value for the prediction is the accuracy of a classification technique for the dataset associated with the meta-example.

Given a meta-dataset, the next step is the induction of a meta-model. The meta-model can be induced by ML techniques and can be used in a recommendation system to select the most suitable algorithm(s) for a new dataset. In this paper we want to predict the accuracy of the classification techniques, so we used a set of regression techniques and also some standard baselines to evaluate and validate the achieved performance [8]. The expectation is that with highly
TABLE I
COMPLEXITY MEASURES EMPLOYED.
The neighborhood measures work over a matrix that stores the distances between all pairs of points in T. The Gower distance [23] is employed, which is hybrid and supports both numerical and categorical features. N1 first builds a Minimum Spanning Tree (MST) from the data and computes the percentage of vertices incident to edges connecting examples of opposite classes in the MST. Higher N1 values indicate the need for more complex boundaries to separate the classes. N2 computes the ratio of two sums: intra-class and inter-class distances. The former corresponds to the sum of the distances between each example and its closest neighbor from the same class. The latter is the sum of the distances between each example and its closest neighbor from another class (nearest enemy). High N2 values are indicative of more complex problems. N3 corresponds to the error rate of a 1NN (one Nearest Neighbor) classifier, estimated using a leave-one-out procedure in T. High N3 values indicate a more complex problem. N4 is similar to L3, but uses the 1NN classifier instead of the linear SVM predictor. T1 builds hyperspheres centered at each example and grows their radii until a hypersphere reaches an example of another class. Smaller hyperspheres contained in larger hyperspheres are eliminated. T1 is defined as the ratio between the number of the remaining hyperspheres and the total number of examples in the dataset, so that lower values correspond to simpler datasets. In this paper the hyperspheres are grown until they reach another hypersphere from an opposite class. The Local-Set (LS) of an example xi ∈ T is the set of points from T whose distance to xi is smaller than the distance from xi to xi's nearest enemy [19]. LSCAvg is the average cardinality of the LS of all examples. Lower LSCAvg values are expected for more complex problems.

In the network measures, the dataset T is first modeled as a graph, in which each example corresponds to a vertex, whilst weighted edges connect pairs of examples. The ε-NN method is used to build the graph, in which pairs of nodes i and j are connected only if dist(i, j) < ε. As in [24] and [20], we adopted ε = 0.15d, where d is the smallest distance between all pairs of examples. Next, as in [20], a post-processing step is applied to the graph to prune edges between examples of opposite classes. Density is given by the number of edges in the graph divided by the maximum number of edges possible between the pairs of points. A low number of edges will be observed for datasets of low density or for which examples of opposite classes are near each other. Both cases indicate a higher classification complexity. ClsCoef averages the clustering tendency of the vertices, which, for each vertex vi, corresponds to the ratio of the number of existing edges between its neighbors to the total number of edges that could possibly exist between them. ClsCoef will be larger for simpler datasets, with dense connections among examples from the same class. Hubs takes the average hub score of the graph's vertices. The hub score of a node is given by the number of connections it has to other nodes, weighted by the number of connections these neighbors have. Simpler datasets show dense regions within the classes and higher hub scores.

T2 is given by the ratio between the number of examples n and the dimensionality m. Low T2 values indicate more sparsity and therefore a higher expected complexity. T3 is similar to T2, but uses the number of principal components needed to represent 95% of the data variability (m′), which can be regarded as an estimate of the intrinsic dataset dimensionality, as the base of the data sparsity assessment. T4 divides the number of PCA dimensions as defined for T3 (m′) by the original number of dimensions (m). The larger the T4 value, the more of the original features are needed to describe the data variability, indicating a more complex relationship among the input variables.
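To make the neighborhood measures above concrete, the following is a minimal pure-Python sketch of N1 and N3 under simplifying assumptions: plain Euclidean distance instead of the Gower distance used in the paper, and numeric features only.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def n3(X, y):
    """N3: leave-one-out error rate of a 1-nearest-neighbor classifier.
    Higher values indicate a more complex classification problem."""
    errors = 0
    for i, xi in enumerate(X):
        # nearest neighbor of xi among all other examples
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: dist(xi, X[k]))
        errors += (y[j] != y[i])
    return errors / len(X)

def n1(X, y):
    """N1: fraction of examples incident to MST edges that join
    opposite classes (Prim's algorithm on the full distance graph)."""
    n = len(X)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # cheapest edge leaving the current tree
        i, j = min(((a, b) for a in in_tree
                    for b in range(n) if b not in in_tree),
                   key=lambda e: dist(X[e[0]], X[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    boundary = {v for i, j in edges if y[i] != y[j] for v in (i, j)}
    return len(boundary) / n
```

For two well-separated clusters, e.g. X = [(0,0), (0,1), (1,0), (5,5), (5,6), (6,5)] with y = [0,0,0,1,1,1], N3 is 0 and N1 is 2/6, since every point's nearest neighbor shares its class and the single MST bridge between the clusters touches one example of each class.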
C1 computes an entropy estimate based on the proportions of examples per class in the dataset. It achieves its maximum value for balanced problems, which can be considered simpler according to the class balance aspect. C2 is a multiclass version of the imbalance ratio, in which larger values are obtained for imbalanced problems.

IV. METHODOLOGY

The data complexity measures are used in an MtL setup designed to predict the accuracy of some popular classification techniques for a given dataset. The objective is to determine whether the measures allow building accurate recommendation systems for a diverse set of classification techniques and classification problems.

To train and evaluate a meta-learner, we build meta-datasets using a collection of classification problems for each of which the performance of one or more classification techniques is known [7]. We use 141 datasets from the OpenML repository [2] in a process as shown in Figure 1. For each dataset we compute the complexity measure values plus the average cross-validated predictive accuracy achieved by four ML techniques: an ANN based on backpropagation (also called Multilayer Perceptron) [12]; SVM [13]; a DT induced by the C4.5 learning algorithm [15]; and kNN, a lazy learning technique [14]. As a result, we have four meta-datasets. All of them have 22 input features, namely the complexity measures described in Section III, and 141 examples (the datasets), and each example is labeled with the accuracy of one of the four classifiers.

Fig. 1. MtL experimental setup (diagram relating complexity measures, classification performance estimation, and the regression techniques).

[…] can be found at a GitHub site¹. The parameters used for the classifiers are: ANN with a learning rate of 0.3, momentum of 0.5 and one hidden layer; C4.5 with pruning; kNN with k = 3; and SVM with a radial basis kernel. The parameters used in the regression techniques are: DWNN with a Gaussian kernel, SVR with radial basis kernels and RF with 500 DTs. All other parameter values of the classification and regression techniques were kept as the defaults of the R implementations used in the experiments.

V. META-LEARNING RESULTS

First we present an overview of the classifiers' performance on all the 141 datasets used in this study. Figure 2(a) shows boxplots of the accuracies achieved by the ANN, C4.5, kNN and SVM classifiers. We notice that all techniques achieved a median accuracy above 0.75. Whilst the SVM classifier achieved the highest median accuracy, the kNN classifier showed the lowest. Figure 2(b) shows the number of times each classifier achieved the best accuracy on these datasets. SVM also achieved the best overall results, winning for about 43% of the datasets, whilst the kNN classifier was better only in 13% of the datasets. Because of these results, the SVM classifier is set as the default recommended technique. The dominance of one base classifier suggests that if we set up the recommendation problem as a multi-class classification problem, the meta-dataset is class-imbalanced. This motivated our use of regression on the accuracies instead for the meta-learning step.

Fig. 2. (a) Distribution of accuracies. (b) Winning classifiers.
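The recommendation step implied by this setup — one meta-regressor per base classifier, each predicting accuracy from the complexity meta-features, with the top-ranked classifier recommended — can be sketched as follows. This is an illustrative toy, not the paper's R implementation: a 1-NN meta-regressor stands in for DWNN/RF/SVR, and all names and values are hypothetical.

```python
from math import dist

class OneNNMetaRegressor:
    """Toy stand-in for DWNN/RF/SVR: predicts the accuracy of the
    closest training meta-example in complexity-measure space."""
    def fit(self, meta_X, accuracies):
        self.meta_X, self.accuracies = meta_X, accuracies
        return self

    def predict(self, meta_x):
        i = min(range(len(self.meta_X)),
                key=lambda k: dist(meta_x, self.meta_X[k]))
        return self.accuracies[i]

def recommend(meta_models, meta_x):
    """Recommend the base classifier whose meta-model predicts the
    highest accuracy for the new dataset's meta-features."""
    predicted = {name: model.predict(meta_x)
                 for name, model in meta_models.items()}
    return max(predicted, key=predicted.get)

# Two training meta-examples described by one meta-feature (say, N3):
models = {
    "SVM": OneNNMetaRegressor().fit([[0.1], [0.5]], [0.90, 0.60]),
    "kNN": OneNNMetaRegressor().fit([[0.1], [0.5]], [0.70, 0.80]),
}
recommend(models, [0.12])  # -> "SVM"
```

In the paper's setup the meta-feature vector would hold the 22 complexity measures and there would be four meta-models, one per base classifier (ANN, C4.5, kNN, SVM).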
Fig. 3. (Boxplots of MSE per meta-dataset — panels: ANN, C4.5, kNN, SVM; x-axis: DWNN, RF, SVR, Random, Default.)

Fig. 4. (Increase in accuracy, summed over all datasets, relative to the Random and Default baselines; x-axis: DWNN, RF, SVR.)

We notice that the DWNN, RF and SVR meta-regressors outperformed the baselines and are more accurate in all cases. In general, the meta-regressors also showed more stable behavior when compared to the baselines, which have elongated boxplots and larger standard deviation values. Among the meta-regressors, in most of the cases RF performed best, followed by SVR and DWNN. Using the Friedman statistical test with the Nemenyi post-test at the 95% confidence level [27], the DWNN, RF and SVR meta-regressors presented better predictive performance than the baselines, but performed similarly to each other.

A dataset-dependent recommender of classifiers is useful if the classifier it suggests for a dataset performs better than a random or a fixed choice. Figure 4 presents the increase in accuracy obtained by the classifiers on the datasets when they are the ones recommended by the meta-regressors instead of the Random and Default recommendations. The x-axis shows the meta-regressors and the y-axis represents the increase in accuracy (summed over all datasets) when compared to the corresponding baseline. The random baseline here assigns a randomly chosen classifier among the four. The default baseline is SVM, which obtained the most wins in the meta-dataset. Since we have four meta-regressors per regression technique (one per base classifier), the recommendation is obtained as follows: the accuracies predicted by each of the four meta-models are used to rank the base classifiers; next, the technique for which the expected accuracy is highest is recommended. Indeed, the meta-regressors were always able to improve the accuracy results and we did not notice any accuracy decrease. The accuracy increase was higher when compared to the random recommendation, which is not surprising. There was also an accuracy increase when compared to the default choice, even though that default choice is the best on average. Again the best performing meta-regression technique was RF.

To learn more about the complexity measures regarding their importance in predicting the best base classifiers to use, we show in Figure 5 the 5 top-ranked complexity measures selected by the RF in all runs. The x-axis represents the measures and the y-axis shows the average of their Gini index in the RF models. The complexity measures regarded as the most important are those based on neighborhood information (N3, N1, N2 and T1). There is also a representative from the structural (network) group (Density). But, since the structural measure is calculated from a proximity graph of the examples, it can also be regarded as employing neighborhood information. It should be noticed that these measures are highly correlated with each other and that a more sophisticated feature selection procedure could reveal other useful combinations of complexity measures.

It is also worth noting that some complexity measures have a strong bias towards the kNN classifier or the DWNN regressor (e.g., N3, N4). But this did not translate into advantages for these techniques, which showed worse results compared to the other classification and regression techniques employed.

One additional analysis is the trade-off between the runtime of the complexity measures and that of evaluating all classifiers in a cross-validation setup. Using a sequential execution for all the datasets on a cluster node with two Intel Xeon E5-2680v2 processors and 128 GB of RAM, the runtime of the complexity measures was lower than that of the classifier evaluations in 71% of the cases. For datasets with a high number of input features, there was an overhead in the complexity measures computation. One must also notice that the complexity
measures are coded in R and are not optimized, whilst the classification techniques are written in compiled languages.

Fig. 5. Top-ranked complexity measures selected by the RF meta-regressors (x-axis: N3, N1, N2, Density, T1; y-axis: average measure importance).

VI. CONCLUSION

This paper presented a study on the use of several measures of classification complexity in the recommendation of classification algorithms. A large and updated set of complexity measures was used to characterize a diverse set of classification problems from different domains. We were able to build accurate meta-models to predict the expected accuracies and choose the best classifier among four popular classification techniques: ANN, SVM, C4.5 and kNN.

The meta-regressors DWNN, RF and SVR, used to predict the classifier with the best expected performance for new datasets, were compared to two baseline recommenders. In the experiments, the random recommendation of one of the four classification techniques, a very naïve baseline, achieved the worst results. The best meta-model was the RF, and the most important complexity measures used by this model were based on neighborhood information and structure.

Future work shall look for the core complexity measures best suited to distinguishing between the performances of various classifiers. We would also like to: (i) optimize the implementation of the complexity measures; (ii) evaluate other MtL approaches, such as ranking the classifiers; (iii) investigate parameter tuning for the classification techniques; and (iv) study further in which cases the classifier recommendation outperforms the default classification technique.

ACKNOWLEDGEMENTS

The second author would like to thank the financial support of the foundations FAPESP (grant 2012/22608-8) and CNPq (grant 308858/2014-0).

REFERENCES

[1] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[2] J. N. van Rijn, B. Bischl, L. Torgo, B. Gao, V. Umaashankar, S. Fischer, P. Winter, B. Wiswedel, M. R. Berthold, and J. Vanschoren, "OpenML: A collaborative science platform," in ECML/PKDD, 2013, pp. 645–649.
[3] L. P. F. Garcia, A. C. Lorena, and J. Lehmann, "ECoL: Complexity measures for classification problems," 2018, https://CRAN.R-project.org/package=ECoL.
[4] T. K. Ho and M. Basu, "Complexity measures of supervised classification problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 289–300, 2002.
[5] R. A. Mollineda, J. S. Sánchez, and J. M. Sotoca, "A meta-learning framework for pattern classification by means of data complexity measures," Inteligencia Artificial, vol. 10, no. 29, pp. 31–38, 2006.
[6] G. D. C. Cavalcanti, T. I. Ren, and B. A. Vale, "Data complexity measures and nearest neighbor classifiers: a practical analysis for meta-learning," in 24th International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, 2012, pp. 1065–1069.
[7] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining. Springer, 2009.
[8] C. Soares, P. Brazdil, and P. Kuba, "A meta-learning method to select the kernel width in support vector regression," Machine Learning, vol. 54, no. 3, pp. 195–209, 2004.
[9] E. Hernández-Reyes, J. A. Carrasco-Ochoa, and J. F. Martínez-Trinidad, "Classifier selection based on data complexity measures," Lecture Notes in Computer Science, vol. 3773, p. 586, 2005.
[10] J. L. Olmo, C. Romero, E. Gibaja, and S. Ventura, "Improving meta-learning for algorithm selection by using multi-label classification: A case of study with educational data sets," International Journal of Computational Intelligence Systems, vol. 8, no. 6, pp. 1144–1164, 2015.
[11] A. Orriols-Puig, N. Macià, and T. K. Ho, "Documentation for the data complexity library in C++," La Salle - Universitat Ramon Llull, Tech. Rep., 2010.
[12] S. Haykin, Neural Networks – A Comprehensive Foundation. Prentice-Hall, 1999.
[13] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[14] T. M. Mitchell, Machine Learning, ser. McGraw Hill Series in Computer Science. McGraw Hill, 1997.
[15] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[16] T. K. Ho, M. Basu, and M. H. C. Law, "Measures of geometrical complexity in classification problems," in Data Complexity in Pattern Recognition, 2006, pp. 1–23.
[17] A. C. Lorena, A. C. P. L. F. de Carvalho, and J. M. P. Gama, "A review on the combination of binary classifiers in multiclass problems," Artificial Intelligence Review, vol. 30, no. 1, pp. 19–37, 2008.
[18] A. C. Lorena, I. G. Costa, N. Spolaôr, and M. C. P. Souto, "Analysis of complexity indices for classification problems: Cancer gene expression data," Neurocomputing, vol. 75, no. 1, pp. 33–42, 2012.
[19] E. Leyva, A. González, and R. Pérez, "A set of complexity measures designed for applying meta-learning to instance selection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 354–367, 2014.
[20] L. P. F. Garcia, A. C. P. L. F. de Carvalho, and A. C. Lorena, "Effect of label noise in the complexity of classification problems," Neurocomputing, vol. 160, pp. 108–119, 2015.
[21] A. K. Tanwani and M. Farooq, "Classification potential vs. classification accuracy: a comprehensive study of evolutionary algorithms with biomedical datasets," Learning Classifier Systems, vol. 6471, pp. 127–144, 2010.
[22] A. Hoekstra and R. P. W. Duin, "On the nonlinearity of pattern classifiers," in 13th ICPR, vol. 4, 1996, pp. 271–275.
[23] J. Gower, "A general coefficient of similarity and some of its properties," Biometrics, vol. 27, no. 4, pp. 857–871, 1971.
[24] G. Morais and R. C. Prati, "Complex network measures for data set characterization," in 2nd BRACIS, 2013, pp. 12–18.
[25] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[26] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[27] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.