Classification Datasets

Vinícius V. de Melo (UNIFESP-ICT, São José dos Campos, Brazil), Email: vinicius.melo@unifesp.br
Ana C. Lorena (UNIFESP-ICT and ITA, São José dos Campos, Brazil), Email: aclorena@gmail.br
Abstract—Machine Learning studies usually involve a large volume of experimental work. For instance, any new technique or solution to a classification problem has to be evaluated concerning the predictive performance achieved on many datasets. In order to evaluate the robustness of an algorithm in the face of different class distributions, it is desirable to choose a set of datasets that spans different levels of classification difficulty. In this paper, we present a method to generate synthetic classification datasets with varying complexity levels. The idea is to greedily exchange the labeling of a set of synthetically generated points in order to reach a given level of classification complexity, which is assessed by measures that estimate the difficulty of a classification problem based on the geometrical distribution of the data.

I. INTRODUCTION

Most Machine Learning (ML) studies include an experimental evaluation section, in which one or more designed techniques have their performance evaluated on some datasets. Although there are a few popular benchmark repositories, such as the UCI and OpenML repositories [1], [2], the available datasets often have quite simple structures or are already preprocessed, and may not present a real challenge to data analysis [3]. In other cases, one may be interested in investigating the effectiveness of a designed technique on a set of datasets with a known distribution. These aspects motivate the design of synthetic datasets [4].

Various strategies can be employed in the generation of synthetic datasets for classification problems. A common approach is to sample the data items according to specific distributions [5]. For instance, one may assume the examples are sampled from a normal distribution, with distinct means and variances for the different classes. By approximating the means of the classes or by increasing the variances, datasets in which the classes overlap more can be produced. Another interesting approach is to generate synthetic datasets by changing the geometrical structure of the data [6], which can be accomplished through the use of data complexity measures [7], [6], [8].

The complexity measures were introduced in the early 2000s [9] and have been used in numerous types of analysis in recent years [10], [11], [12], [3]. They allow estimating the complexity of a given classification problem by extracting simple geometrical and statistical descriptors from its learning dataset. Synthetic datasets spanning different levels of difficulty can be produced by optimizing such measures. Some previous works have employed Genetic Algorithms (single- and multi-objective) to produce datasets with different complexity levels [7], [6], [8]. The idea is to produce a new dataset [7], [6] or to sample an existing dataset [8] so as to reach a given complexity measure value (or to optimize the values of multiple complexity measures).

This paper employs an approach similar to that of [6] and generates new synthetic datasets with different target complexity values. Nonetheless, a low-cost greedy hill-climbing search strategy is employed instead of a GA. Given an initial dataset, the labels of pairs of examples are iteratively swapped. By doing so, one can change the dataset structure while preserving some of its initial characteristics, such as the numbers of examples, input features, and classes, and the data distribution within the classes. Despite the simplicity of the algorithm, our experiments demonstrate that it can generate datasets of different levels of target difficulty at a low computational cost.

This paper is structured as follows: Section II presents the complexity measures used in this work. Section III reviews related work on dataset generation based on data complexity descriptors. Section IV describes the proposed dataset generator. Section V presents an experimental evaluation of the algorithm, whilst Section VI concludes this work.

II. COMPLEXITY MEASURES

Ho and Basu [9] introduced complexity measures for estimating the difficulty of a classification problem. Such descriptors are extracted from the datasets available for learning, giving an indication of the size and shape of the boundary required to separate the classes.

These measures have been employed in various types of analysis in recent work, among them: (i) to characterize the domain of competence of different ML algorithms [10]; (ii) to develop new data-driven techniques [11]; (iii) to describe classification problems in meta-learning studies [12]; and (iv) to generate new classification datasets [3]. This paper uses a subset of the complexity measures, those that capture the overlapping of the classes, in the generation of synthetic datasets with different levels of complexity. In this view, if the distributions of the examples from the different classes present a high mixture or overlapping, the classification problem can be regarded as complex.
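The greedy label-swap search outlined in the introduction can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: it uses the N3 measure (leave-one-out error of a 1-nearest-neighbour classifier) as the objective and swaps the labels of random pairs of examples, keeping a swap only when it does not move the measure away from the target. The acceptance rule and the neighbour-generation scheme here are our assumptions.

```python
import math
import random

def n3(points, labels):
    """N3: leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, p in enumerate(points):
        # nearest neighbour of point i among all other points
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(p, points[j]))
        errors += labels[nearest] != labels[i]
    return errors / len(points)

def hill_climb(points, labels, target, measure=n3, max_iters=10000, seed=0):
    """Greedily swap label pairs to drive `measure` toward `target`.

    Swapping two labels preserves the class proportions, the number of
    examples, features, and classes, and the point positions.
    Returns the final labeling and its distance to the target value.
    """
    rng = random.Random(seed)
    labels = list(labels)
    best = abs(measure(points, labels) - target)
    for _ in range(max_iters):
        if best == 0:
            break
        i, j = rng.sample(range(len(labels)), 2)
        if labels[i] == labels[j]:
            continue  # swapping identical labels changes nothing
        labels[i], labels[j] = labels[j], labels[i]
        dist = abs(measure(points, labels) - target)
        if dist <= best:
            best = dist  # keep a non-worsening neighbour
        else:
            labels[i], labels[j] = labels[j], labels[i]  # revert
    return labels, best
```

Recomputing the measure from scratch after every swap, as above, is only for clarity; a practical implementation would build the nearest-neighbour structures once and reuse them across iterations.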
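Another overlap measure used in this work, N1, estimates the fraction of borderline points from a minimum spanning tree (MST) built over all examples: points incident to an MST edge that joins different classes are counted as lying on the class boundary [9]. The sketch below (Prim's algorithm) is an illustration of the measure, not the paper's code:

```python
import math

def n1(points, labels):
    """N1: fraction of borderline points, estimated via an MST (Prim)."""
    n = len(points)
    in_tree = [False] * n
    dist = [math.inf] * n   # cheapest connection of each point to the tree
    parent = [-1] * n
    dist[0] = 0.0
    borderline = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        # an MST edge joining different classes marks both endpoints
        if parent[u] != -1 and labels[u] != labels[parent[u]]:
            borderline.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < dist[v]:
                    dist[v] = d
                    parent[v] = u
    return len(borderline) / n
```

For two well-separated clusters with one label each, only the single bridging MST edge joins different classes, so N1 counts just its two endpoints.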
and N3, in the largest dataset (1000 × 1000), it took only 7.25 seconds to run the maximum number of 100,000 iterations, while the fastest run took approximately one millisecond. On the other hand, F1 is much more costly because it must perform calculations on the dataset features at each iteration of the algorithm. Our implementation can still be optimized, since we did not adopt an incremental computation of the mean values in F1. Nonetheless, the largest time to reach the target in F1 was still low (25.6 seconds).

One may also observe that the optimization processing times increase with n, because the search space is given by the number of examples to be labeled. For N1 and N3, nf impacts only the first iteration of the algorithm, in which the required data structures are built to be reused in the subsequent iterations. For this reason, the data structure creation time is not considered in the former analysis; it is shown in Table III for reference. For F1, there is also an impact of nf, since its computation depends on the number of features of the dataset. Moreover, the related work does not report any processing times that could be compared to those reported here.

Table III: Data structure creation time (averages and standard deviations, in parentheses).

n     nf    Measure  Time (s)
100   2     N1       0.035 (0.099)
100   2     N3       0.001 (0.000)
1000  2     N1       0.605 (0.146)
1000  2     N3       0.046 (0.018)
1000  1000  N1       3.257 (0.280)
1000  1000  N3       0.806 (0.119)
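To illustrate why F1 must touch every feature at each iteration, the following sketch computes F1 for the two-class case as the maximum Fisher's discriminant ratio over the features. It is a hypothetical illustration that, like our straightforward implementation, recomputes the per-class means and variances from scratch rather than incrementally:

```python
from statistics import mean, pvariance

def f1(points, labels):
    """F1: maximum Fisher's discriminant ratio over all features
    (two-class case): max_d (mu1_d - mu2_d)^2 / (var1_d + var2_d)."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles the two-class case"
    best = 0.0
    for d in range(len(points[0])):
        # full pass over feature d for both classes, at every call
        a = [p[d] for p, y in zip(points, labels) if y == classes[0]]
        b = [p[d] for p, y in zip(points, labels) if y == classes[1]]
        denom = pvariance(a) + pvariance(b)
        if denom > 0:
            best = max(best, (mean(a) - mean(b)) ** 2 / denom)
    return best
```

Each call costs O(n · nf), which is why F1-driven optimization grows with the number of features, whereas N1 and N3 pay for their neighbourhood structures only once.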
on one of the features. Thus, kNN is expected to perform well, but other methods should outperform it. This hypothesis was validated in the experiment, where NB was the best algorithm in most runs. These results are strong indications in support of algorithm selection based on the evaluated complexity measures, although this is not the objective of this work.

We chose not to run hypothesis tests to compare the accuracies because our intention is not to show which classifier is the best in each scenario, but to show that our generator can synthesize datasets of different complexity and difficulty levels, for either balanced or unbalanced cases.
Figure 3: Example of optimized solutions for five classes with n = 1000 and nf = 2 (panel labels: C1, C2, N1, N3).

Table V: Parameters of the classifiers for the grid search. Values were empirically chosen.

Classifier      Parameters
kNN             k = {1, 3, 5, 7}
SVM-RBF         sigma = {10^-4, 10^-2, 10}, C = {10^-1, 10^0, 10^1, 10^2}
Naïve Bayes     fL = {0, 0.5, 1}, usekernel = {FALSE, TRUE}, adjust = {0.01, 0.5, 1.0}
Random Forest   mtry = {3, 4, 5}

Note: We employed the R caret package to do the grid search and to test all the parameters available. The other parameters have default values and cannot be modified.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a hill-climbing local-search algorithm with multiple trials to optimize the label assignment of synthetic datasets in order to achieve specific target values of complexity measures. Several dataset configurations and target values were tested to assess the efficiency of the proposed algorithm.

Experimental results show that the proposed algorithm works as expected, successfully generating datasets with various levels of complexity in tests considering two scenarios: two-class balanced datasets and five-class unbalanced datasets. Some classifiers were tested to validate the difficulty levels achieved, showing that, indeed, the higher the complexity as assessed by the different measures, the harder the classification problem to be solved. Also, different classifiers performed better depending on the complexity measure, meaning that one could use the measures to select the most appropriate classifier in each case, or to generate datasets that challenge the classifiers.

Although it is a local-search algorithm developed in an interpreted language, our approach is efficient and runs in an acceptable time. The optimization processing time is not affected by the number of classes and, depending on the complexity measure adopted, the number of features does not influence the processing time either; there is, however, a start-up time for creating some initial data structures.

As future work, we intend to investigate heuristic selection mechanisms that increase the chances of generating better neighbors, in order to reduce the necessary number of iterations, and to add other popular complexity measures. Related work investigated the multi-objective optimization of complexity measures; this could be another interesting investigation topic for our approach.

VII. ACKNOWLEDGMENTS

The first author would like to thank FAPESP (grant 2017/20844-0) for the financial support. The second author would like to thank FAPESP (grant 2012/22608-8) and CNPq for the financial support.

REFERENCES

[1] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[2] J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo, "OpenML: networked science in machine learning," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 49-60, 2014.
[3] N. Macià and E. Bernadó-Mansilla, "Towards UCI+: A mindful repository design," Information Sciences, vol. 261, pp. 237-262, 2014.
[4] N. Macià, E. Bernadó-Mansilla, and A. Orriols-Puig, "Preliminary approach on synthetic data sets generation based on class separability measure," in 19th International Conference on Pattern Recognition (ICPR 2008). IEEE, 2008, pp. 1-4.
[5] D. R. Amancio, C. H. Comin, D. Casanova, G. Travieso, O. M. Bruno, F. A. Rodrigues, and L. da Fontoura Costa, "A systematic comparison of supervised classifiers," PLoS ONE, vol. 9, no. 4, p. e94137, 2014.
[6] N. Macià, A. Orriols-Puig, and E. Bernadó-Mansilla, "Beyond homemade artificial data sets," in International Conference on Hybrid Artificial Intelligence Systems. Springer, 2009, pp. 605-612.
[7] ——, "Genetic-based synthetic data sets for the analysis of classifiers behavior," in Eighth International Conference on Hybrid Intelligent Systems (HIS '08). IEEE, 2008, pp. 507-512.
[8] ——, "In search of targeted-complexity problems," in Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation. ACM, 2010, pp. 1055-1062.
[9] T. K. Ho and M. Basu, "Complexity measures of supervised classification problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 289-300, 2002.
[10] J. Luengo and F. Herrera, "An automatic extraction method of the domains of competence for learning classifiers using data complexity measures," Knowledge and Information Systems, vol. 42, no. 1, pp. 147-180, 2015.
[11] L. P. Garcia, A. C. de Carvalho, and A. C. Lorena, "Effect of label noise in the complexity of classification problems," Neurocomputing, vol. 160, pp. 108-119, 2015.
[12] ——, "Noise detection in the meta-learning level," Neurocomputing, vol. 176, pp. 14-25, 2016.
[13] R. A. Mollineda, J. S. Sánchez, and J. M. Sotoca, "Data characterization for effective prototype selection," in 2nd Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), 2005, pp. 27-34.
[14] N. Macià, E. Bernadó-Mansilla, and A. Orriols-Puig, "On the dimensions of data complexity through synthetic data sets," in CCIA, 2008, pp. 244-252.
[15] N. Macià, T. Ho, A. Orriols-Puig, and E. Bernadó-Mansilla, "The landscape contest at ICPR 2010," in Recognizing Patterns in Signals, Speech, Images and Videos, 2010, pp. 29-45.
[16] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov et al., "Design of the 2015 ChaLearn AutoML challenge," in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1-8.