Classification Datasets

Vinícius V. de Melo (UNIFESP-ICT, São José dos Campos, Brazil), Email: vinicius.melo@unifesp.br
Ana C. Lorena (UNIFESP-ICT and ITA, São José dos Campos, Brazil), Email: aclorena@gmail.br
Abstract—Machine Learning studies usually involve a large volume of experimental work. For instance, any new technique or solution to a classification problem has to be evaluated concerning the predictive performance achieved on many datasets. In order to evaluate the robustness of an algorithm in the face of different class distributions, it is desirable to choose a set of datasets that spans different levels of classification difficulty. In this paper, we present a method to generate synthetic classification datasets with varying complexity levels. The idea is to greedily exchange the labeling of a set of synthetically generated points in order to reach a given level of classification complexity, which is assessed by measures that estimate the difficulty of a classification problem based on the geometrical distribution of the data.

I. INTRODUCTION

Most Machine Learning (ML) studies include an experimental evaluation section, in which one or more designed techniques have their performance evaluated on some datasets. Although there are a few popular benchmark repositories, such as the UCI and OpenML repositories [1], [2], the available datasets often have quite simple structures or are already preprocessed, and may not present a real challenge to data analysis [3]. In other cases, one may be interested in investigating the effectiveness of a designed technique on a set of datasets with a known distribution. These aspects motivate the design of synthetic datasets [4].

Various strategies can be employed in the generation of synthetic datasets for classification problems. A common approach is to sample the data items according to specific distributions [5]. For instance, one may assume the examples are sampled from a normal distribution, with distinct means and variances for the different classes. By approximating the means of the classes or by increasing the variances, datasets in which the classes overlap more can be produced. Another interesting approach is to generate synthetic datasets by changing the geometrical structure of the data [6], which can be accomplished through the use of data complexity measures [7], [6], [8].

The complexity measures were introduced in the early 2000s [9] and have been used in numerous types of analysis in recent years [10], [11], [12], [3]. They allow estimating the complexity of a given classification problem by extracting simple geometrical and statistical descriptors from its learning dataset. Synthetic datasets spanning different levels of difficulty can be produced by optimizing such measures. Some previous works have employed Genetic Algorithms (single- and multi-objective) to produce datasets with different complexity levels [7], [6], [8]. The idea is to produce a new dataset [7], [6] or to sample an existing dataset [8] so as to reach a given complexity measure value (or to optimize the values of multiple complexity measures).

This paper employs an approach similar to that of [6] and generates new synthetic datasets with different target complexity values. Nonetheless, a low-cost greedy hill-climbing search strategy is employed instead of a GA. Given an initial dataset, the labels of pairs of examples are iteratively swapped. By doing so, one can change the dataset structure while preserving some of its initial characteristics, such as the numbers of examples, input features, and classes, and the data distribution within the classes. Despite the simplicity of the algorithm, our experiments demonstrate that it can generate datasets of different levels of target difficulty at a low computational cost.

This paper is structured as follows: Section II presents the complexity measures used in this work. Section III reviews related work on dataset generation based on data complexity descriptors. Section IV describes the proposed dataset generator. Section V presents an experimental evaluation of the algorithm, whilst Section VI concludes this work.

II. COMPLEXITY MEASURES

Ho and Basu [9] introduced complexity measures for estimating the difficulty of a classification problem. Such descriptors are extracted from the datasets available for learning, giving an indication of the size and shape of the boundary required to separate the classes.

These measures have been employed in various types of analysis in recent work, among them: (i) to characterize the domain of competence of different ML algorithms [10]; (ii) to develop new data-driven techniques [11]; (iii) to describe classification problems in meta-learning studies [12]; and (iv) to generate new classification datasets [3]. This paper uses a subset of the complexity measures, those that capture the overlapping of the classes, in the generation of synthetic datasets with different levels of complexity. In this view, if the distributions of the examples from the different classes present a high mixture or overlapping, the classification problem can be regarded as complex.
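The greedy label-swap search outlined in the introduction can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: it uses the N3 measure (leave-one-out error of a 1-nearest-neighbour classifier) as the objective and swaps the labels of random pairs of examples, keeping a swap only when it does not move the measure away from the target. The acceptance rule and the neighbour-generation scheme here are our assumptions.

```python
import math
import random

def n3(points, labels):
    """N3: leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, p in enumerate(points):
        # nearest neighbour of point i among all other points
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(p, points[j]))
        errors += labels[nearest] != labels[i]
    return errors / len(points)

def hill_climb(points, labels, target, measure=n3, max_iters=10000, seed=0):
    """Greedily swap label pairs to drive `measure` toward `target`.

    Swapping two labels preserves the class proportions, the number of
    examples, features, and classes, and the point positions.
    Returns the final labeling and its distance to the target value.
    """
    rng = random.Random(seed)
    labels = list(labels)
    best = abs(measure(points, labels) - target)
    for _ in range(max_iters):
        if best == 0:
            break
        i, j = rng.sample(range(len(labels)), 2)
        if labels[i] == labels[j]:
            continue  # swapping identical labels changes nothing
        labels[i], labels[j] = labels[j], labels[i]
        dist = abs(measure(points, labels) - target)
        if dist <= best:
            best = dist  # keep a non-worsening neighbour
        else:
            labels[i], labels[j] = labels[j], labels[i]  # revert
    return labels, best
```

Recomputing the measure from scratch after every swap, as above, is only for clarity; a practical implementation would build the nearest-neighbour structures once and reuse them across iterations.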
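Another overlap measure used in this work, N1, estimates the fraction of borderline points from a minimum spanning tree (MST) built over all examples: points incident to an MST edge that joins different classes are counted as lying on the class boundary [9]. The sketch below (Prim's algorithm) is an illustration of the measure, not the paper's code:

```python
import math

def n1(points, labels):
    """N1: fraction of borderline points, estimated via an MST (Prim)."""
    n = len(points)
    in_tree = [False] * n
    dist = [math.inf] * n   # cheapest connection of each point to the tree
    parent = [-1] * n
    dist[0] = 0.0
    borderline = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        # an MST edge joining different classes marks both endpoints
        if parent[u] != -1 and labels[u] != labels[parent[u]]:
            borderline.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < dist[v]:
                    dist[v] = d
                    parent[v] = u
    return len(borderline) / n
```

For two well-separated clusters with one label each, only the single bridging MST edge joins different classes, so N1 counts just its two endpoints.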
and N3, in the largest dataset (1000 × 1000), it took only 7.25 seconds to run the maximum number of 100,000 iterations, while the fastest run took approximately one millisecond. On the other hand, F1 is much more costly because it must perform calculations on the dataset features at each iteration of the algorithm. Our implementation can still be optimized, since we did not adopt an incremental computation of the mean values in F1. Nonetheless, the largest time to reach the target in F1 was still low (25.6 seconds).

One may also observe that the optimization processing times increase with n, because the search space is given by the number of examples to be labeled. For N1 and N3, nf impacts only the first iteration of the algorithm, in which the required data structures are built to be reused in the subsequent iterations. For this reason, the data structure creation time is not considered in the former analysis; it is shown in Table III for reference. For F1, there is also an impact of nf, since its computation depends on the number of features of the dataset. Moreover, the related work does not report any processing times that could be compared to those reported here.

Table III: Data structure creation time (averages and standard deviations, in parentheses).

n     nf    Measure  Time (s)
100   2     N1       0.035 (0.099)
100   2     N3       0.001 (0.000)
1000  2     N1       0.605 (0.146)
1000  2     N3       0.046 (0.018)
1000  1000  N1       3.257 (0.280)
1000  1000  N3       0.806 (0.119)
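To illustrate why F1 must touch every feature at each iteration, the following sketch computes F1 for the two-class case as the maximum Fisher's discriminant ratio over the features. It is a hypothetical illustration that, like our straightforward implementation, recomputes the per-class means and variances from scratch rather than incrementally:

```python
from statistics import mean, pvariance

def f1(points, labels):
    """F1: maximum Fisher's discriminant ratio over all features
    (two-class case): max_d (mu1_d - mu2_d)^2 / (var1_d + var2_d)."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles the two-class case"
    best = 0.0
    for d in range(len(points[0])):
        # full pass over feature d for both classes, at every call
        a = [p[d] for p, y in zip(points, labels) if y == classes[0]]
        b = [p[d] for p, y in zip(points, labels) if y == classes[1]]
        denom = pvariance(a) + pvariance(b)
        if denom > 0:
            best = max(best, (mean(a) - mean(b)) ** 2 / denom)
    return best
```

Each call costs O(n · nf), which is why F1-driven optimization grows with the number of features, whereas N1 and N3 pay for their neighbourhood structures only once.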
on one of the features. Thus, kNN is expected to perform well, but other methods should outperform it. This hypothesis was validated in the experiment, where NB was the best algorithm in most runs. These results are strong indications in support of algorithm selection based on the evaluated complexity measures, although this is not the objective of this work.

We chose not to run hypothesis tests to compare the accuracies because our intention is not to show which classifier is the best in each scenario, but to show that our generator can synthesize datasets of different complexity and difficulty levels, for either balanced or unbalanced cases.
Figure 3: Example of optimized solutions for five classes with n = 1000 and nf = 2 (panel labels: C1, C2, N1, N3).

Table V: Parameters of the classifiers for the grid search. Values were empirically chosen.

Classifier      Parameters
kNN             k = {1, 3, 5, 7}
SVM-RBF         sigma = {10^-4, 10^-2, 10}, C = {10^-1, 10^0, 10^1, 10^2}
Naïve Bayes     fL = {0, 0.5, 1}, usekernel = {FALSE, TRUE}, adjust = {0.01, 0.5, 1.0}
Random Forest   mtry = {3, 4, 5}

Note: We employed the R caret package to do the grid search and to test all the parameters available. The other parameters have default values and cannot be modified.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a hill-climbing local-search algorithm with multiple trials to optimize the label assignment of synthetic datasets in order to achieve specific target values of complexity measures. Several dataset configurations and target values were tested to assess the efficiency of the proposed algorithm.

Experimental results show that the proposed algorithm works as expected, successfully generating datasets with various levels of complexity in tests considering two scenarios: two-class balanced datasets and five-class unbalanced datasets. Some classifiers were tested to validate the difficulty levels achieved, showing that, indeed, the higher the complexity as assessed by the different measures, the harder the classification problem to be solved. Also, different classifiers performed better depending on the complexity measure, meaning that one could use the measures to select the most appropriate classifier in each case, or to generate datasets that challenge the classifiers.

Although it is a local-search algorithm developed in an interpreted language, our approach is efficient and runs in an acceptable time. The optimization processing time is not affected by the number of classes and, depending on the complexity measure adopted, the number of features does not influence the processing time either; there is, however, a start-up time for creating some initial data structures.

As future work, we intend to investigate heuristic selection mechanisms that increase the chances of generating better neighbors, in order to reduce the necessary number of iterations, and to add other popular complexity measures. Related work investigated the multi-objective optimization of complexity measures; this could be another interesting investigation topic for our approach.

VII. ACKNOWLEDGMENTS

The first author would like to thank FAPESP (grant 2017/20844-0) for the financial support. The second author would like to thank FAPESP (grant 2012/22608-8) and CNPq for the financial support.

REFERENCES

[1] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[2] J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo, "OpenML: networked science in machine learning," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 49-60, 2014.
[3] N. Macià and E. Bernadó-Mansilla, "Towards UCI+: A mindful repository design," Information Sciences, vol. 261, pp. 237-262, 2014.
[4] N. Macià, E. Bernadó-Mansilla, and A. Orriols-Puig, "Preliminary approach on synthetic data sets generation based on class separability measure," in 19th International Conference on Pattern Recognition (ICPR 2008). IEEE, 2008, pp. 1-4.
[5] D. R. Amancio, C. H. Comin, D. Casanova, G. Travieso, O. M. Bruno, F. A. Rodrigues, and L. da Fontoura Costa, "A systematic comparison of supervised classifiers," PLoS ONE, vol. 9, no. 4, p. e94137, 2014.
[6] N. Macià, A. Orriols-Puig, and E. Bernadó-Mansilla, "Beyond homemade artificial data sets," in International Conference on Hybrid Artificial Intelligence Systems. Springer, 2009, pp. 605-612.
[7] ——, "Genetic-based synthetic data sets for the analysis of classifiers behavior," in Eighth International Conference on Hybrid Intelligent Systems (HIS '08). IEEE, 2008, pp. 507-512.
[8] ——, "In search of targeted-complexity problems," in Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation. ACM, 2010, pp. 1055-1062.
[9] T. K. Ho and M. Basu, "Complexity measures of supervised classification problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 289-300, 2002.
[10] J. Luengo and F. Herrera, "An automatic extraction method of the domains of competence for learning classifiers using data complexity measures," Knowledge and Information Systems, vol. 42, no. 1, pp. 147-180, 2015.
[11] L. P. Garcia, A. C. de Carvalho, and A. C. Lorena, "Effect of label noise in the complexity of classification problems," Neurocomputing, vol. 160, pp. 108-119, 2015.
[12] ——, "Noise detection in the meta-learning level," Neurocomputing, vol. 176, pp. 14-25, 2016.
[13] R. A. Mollineda, J. S. Sánchez, and J. M. Sotoca, "Data characterization for effective prototype selection," in 2nd Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), 2005, pp. 27-34.
[14] N. Macià, E. Bernadó-Mansilla, and A. Orriols-Puig, "On the dimensions of data complexity through synthetic data sets," in CCIA, 2008, pp. 244-252.
[15] N. Macià, T. Ho, A. Orriols-Puig, and E. Bernadó-Mansilla, "The landscape contest at ICPR 2010," in Recognizing Patterns in Signals, Speech, Images and Videos, 2010, pp. 29-45.
[16] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov et al., "Design of the 2015 ChaLearn AutoML challenge," in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1-8.