Feature Selection For Classification Using Particle Swarm Optimization
Abstract—This paper proposes a method for the problem of processing high-dimensional data. When one has thousands of features (attributes) in a dataset, it is hard to achieve efficient feature selection. To cope with this problem, we propose the use of a binary particle swarm optimization algorithm combined with C4.5 as a classifier in the fitness function for the selection of informative attributes. The results obtained on 11 datasets were analyzed statistically and reveal that the proposed method, called BPSO+C4.5, outperforms known classifiers, i.e., C4.5, Naive Bayes, and SVM.

Keywords—swarm intelligence, particle swarm optimization, feature selection, classification

I. INTRODUCTION

We live in an era of big data, where data explosion has become ubiquitous. The reason for this is primarily ever-easier data acquisition with automated data collection tools and databases/repositories with practically no space limits. Because of that, researchers deal with data that is rich but information poor. Such data typically consists of a vast number of features (attributes). One might expect that the analysis of such high-dimensional data constitutes an advantage but, in practice, it turns out that this is far from the case. The manual identification and exhaustive search for an optimal subset of informative features in a given dataset is nearly impossible.

Due to the continuous growth of databases and the capacity of storage devices, data processing is struggling to keep up with data acquisition. Supervised machine learning is required in most real-world classification problems. The association between instances and classes is well known, but we do not have information about which attributes of the instances are necessary and which are not. Thus, for a better representation of a given domain, more attributes are used, which results in the presence of attributes that are irrelevant and redundant to the target concept. Therefore, their elimination is essential. An irrelevant attribute has no direct connection with the target concept but has an impact on the learning process of a classifier, and a redundant attribute does not add any value to the target concept. Because of the high dimensionality of the data, the construction of a suitable classifier based on the whole set of features is almost impossible.

The high-dimensional data problem is currently tackled with artificial intelligence methods. The proposed method BPSO+C4.5 falls within the scope of swarm intelligence and extends it with feature selection. The use of the latter has many advantages, primarily improved classification accuracy on high-dimensional data problems, decreased dimensionality of the data and, therefore, reduced computational time, and the elimination of unnecessary features.

This paper is divided into eight sections. The next section presents the curse of dimensionality, and Section III introduces related work. The proposed method is described in detail in Section IV. Section V comprises the experiment planning, execution, and presentation of the obtained results. The statistical analysis in Section VI is intended for the statistical processing of the results, in which we compare our method with established classification methods. In Section VII the results of the proposed method are compared with the results of other researchers who address the same problem. Finally, Section VIII summarizes this paper and outlines plans for further work.

II. CURSE OF DIMENSIONALITY

High-dimensional data is a serious problem for existing data mining and machine learning algorithms because of the so-called curse of dimensionality [1]. This refers to the known phenomenon that data in high-dimensional space becomes sparse. The presence of a vast number of features also affects learning models, which tend to overfit and, therefore, can cause performance degradation on unseen data. For example, in a large search space there is a total of $2^n$ feasible solutions (feature subsets), where $n$ is the number of features [2]. This means that a dataset with 10,000 features per instance has $2^{10000}$ candidate solutions; with each additional feature, the size of the search space doubles. To address the curse of dimensionality, researchers have begun studying techniques for dimensionality reduction.

Dimensionality reduction comprises two different techniques, i.e., feature selection and feature extraction. The result of the latter is a new feature space with low dimensionality, unlike the subset of relevant features produced by feature selection. The proposed method uses feature selection, since further analysis with feature extraction is problematic, as it builds a set of new features and thereby loses the physical meanings of the original features.

Feature selection from a set of features $X$ selects a subset $A \subseteq X$ via specific optimization techniques. It maintains the physical meanings of the features, which usually leads to better classification accuracy, lower complexity, and improved comprehensibility of the resulting classifier models.

The authors in [4], based on their review of feature relevance in related articles, classify features into three disjoint categories, i.e., strongly relevant, weakly relevant, and irrelevant features (Fig. 1).
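To make the scale of this search space concrete, the following Python sketch (an illustration, not part of the original study) counts candidate feature subsets and shows why exhaustive enumeration is only feasible for a handful of features:

from itertools import combinations

def n_subsets(n_features: int) -> int:
    # Size of the feature-subset search space: 2^n.
    return 2 ** n_features

print(n_subsets(10))   # 1024 -- still enumerable
print(n_subsets(35))   # 34359738368 -- already out of reach
# n_subsets(10000) has more than 3000 decimal digits.

# Exhaustive enumeration, feasible only for very small n:
features = ["f0", "f1", "f2"]
subsets = [set(c) for r in range(len(features) + 1)
           for c in combinations(features, r)]
assert len(subsets) == n_subsets(len(features))  # 8 subsets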
$$\mathrm{sigmoid}(v_{id}^{new}) = \frac{1}{1 + e^{-v_{id}^{new}}}, \qquad (4)$$

where $e$ is the base of the natural logarithm.

The value obtained is then compared with a uniformly distributed, randomly generated value between 0 and 1 (function $U(0,1)$). The decision on the new position of the particle $x_{id}$ is now probabilistic: the bigger $v_{id}^{new}$, the bigger the value of the sigmoid function and, therefore, the higher the probability that the value 1 is assigned to $x_{id}^{new}$. As $v_{id}^{new}$ increases, $\mathrm{sigmoid}(v_{id}^{new})$ approaches, but never reaches, 1. For example, if $v_{id}^{new} > 6$, the probability of $x_{id}^{new} = 1$ is almost 1, but not exactly 1. Thus, for $v_{id}^{new} = 6$ the probability of $x_{id}^{new} = 1$ is 0.998, and the probability of $x_{id}^{new} = 0$ is 0.002.

[Fig. 3: Sigmoid function, plotting $\mathrm{sigmoid}(v_{id}^{new})$ against $v_{id}^{new}$ over the interval $[-6, 6]$]

The proposed method proceeds as follows:

initialization of population()
while maximum iterations do
    for i = 1 to number of particles do
        f = fitness function of a particle
        fs = features selected in a particle
        if f(X_i) > f(pBest_i) then
            pBest_i = X_i                      // update personal best
        end if
        if f(X_i) > f(gBest) then
            gBest = X_i                        // update global best
        end if
        if f(X_i) = f(gBest) and fs(X_i) < fs(gBest) then
            gBest = X_i                        // update global best
        end if
    end for
    for i = 1 to number of particles do
        for d = 1 to number of features do
            v_id^new = ω · v_id^old
                     + c1 · r1 · (pBest_id^old − x_id^old)
                     + c2 · r2 · (gBest_d^old − x_id^old)
            if v_id^new > vmax then v_id^new = vmax end if
            if v_id^new < vmin then v_id^new = vmin end if
            if sigmoid(v_id^new) > U(0, 1) then
                x_id^new = 1
            else
                x_id^new = 0
            end if
        end for
    end for
end while
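The velocity and position update from the pseudocode can be sketched in Python as follows. This is a minimal sketch, assuming NumPy arrays for the particle state; the parameter defaults are the values reported in Section V, while the function names are ours, not the paper's.

import numpy as np

rng = np.random.default_rng(42)

def sigmoid(v):
    # Eq. (4): maps a velocity to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-v))

def bpso_update(x, v, p_best, g_best,
                w=0.8, c1=2.0, c2=2.0, v_min=-4.0, v_max=4.0):
    # One BPSO velocity/position update for a single particle;
    # x, v, p_best, g_best are arrays of length n_features.
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    v_new = np.clip(v_new, v_min, v_max)   # velocity bounds
    # Probabilistic binary position: bit d becomes 1 with
    # probability sigmoid(v_new[d]), cf. the U(0, 1) comparison.
    x_new = (sigmoid(v_new) > rng.random(x.shape)).astype(int)
    return x_new, v_new

# Example with 8 features:
x = rng.integers(0, 2, 8)
v = np.zeros(8)
x, v = bpso_update(x, v, p_best=x.copy(), g_best=rng.integers(0, 2, 8))
print(x)  # binary mask of selected features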
B. Fitness function

The goal of the fitness function is the assessment of each particle's quality. Fig. 2 presents a simplified diagram of the proposed method with the fitness function exposed. The input to the fitness function is a particle with selected features (features marked with 1). The classifier is then built based on the selected features. The output of the fitness function is the classification accuracy (5) of a classifier built with the C4.5 algorithm using the subset of features defined by the specific particle:

$$\text{accuracy} = \frac{\text{number of correctly classified instances}}{\text{number of instances}}. \qquad (5)$$

Classification accuracy is one of the most often used metrics for the assessment of classification methods. The C4.5 classification method was selected because it is a "white box" method and, therefore, easy to interpret. The output of the C4.5 classification method is a decision tree that is easily understandable by the expert who validates the obtained results.

In the case where the fitness function value of the i-th particle equals the fitness function value of gBest, a second evaluation takes place. In this evaluation, we compare the number of selected features in the i-th particle with the number of selected features in gBest. If the i-th particle has fewer selected features than gBest, then gBest is replaced with the i-th particle.
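A minimal sketch of such a fitness function is given below. It substitutes scikit-learn's DecisionTreeClassifier (a CART-style tree) for C4.5, since C4.5 itself is not available in scikit-learn; that substitution and the helper names are ours, not the paper's.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def fitness(particle, X_train, y_train, X_test, y_test):
    # Accuracy (Eq. 5) of a tree trained only on the features the
    # particle marks with 1. CART stands in for C4.5 here.
    mask = particle.astype(bool)
    if not mask.any():          # no features selected
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[:, mask], y_train)
    predictions = clf.predict(X_test[:, mask])
    # Eq. (5): correctly classified instances / all instances.
    return accuracy_score(y_test, predictions)

def n_selected(particle):
    # Secondary criterion: on a fitness tie with gBest, the particle
    # with fewer selected features wins.
    return int(particle.sum())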
V. EXPERIMENT

The experimental approach was used in order to evaluate the performance of the proposed method. We did not apply any preprocessing to the selected datasets, in order to demonstrate the power of the proposed method.

In the research model, five latent variables were defined, of which three are exogenous (independent), namely Classification techniques, Evolutionary techniques, and Evolutionary parameters, and two are endogenous (dependent), namely Performance of classification and Complexity of the model. Indicators were defined for each of them, since a latent variable is an abstract idea that cannot be measured directly.

The first latent exogenous variable, called Classification techniques, covers the classification methods C4.5, Naive Bayes, and SVM, and the proposed method BPSO+C4.5. Evolutionary techniques are the second latent exogenous variable, of which only BPSO is used. The last exogenous variable is Evolutionary parameters. In the fitness function, we used the classification method C4.5.

The endogenous variable Performance of classification has the indicators AUC, F-measure, and accuracy. All three indicators take values in the interval [0, 1], but accuracy is usually expressed as a percentage. The last latent endogenous variable is Complexity of the model, which measures the number of features.

A. Used datasets and proposed method settings

We tested our method using 11 different datasets, as shown in Table II. The meaning of the columns in Table II is as follows: sequence number, name, and domain of the dataset, and the number of features, classes, and samples. All datasets are freely available on the Internet. We downloaded them from the UCI Machine Learning Repository [22] and the web page of Plymouth University [23].

Before running the proposed method, a cross-validation process was carried out on each dataset. The separation of the proposed method and the cross-validation process was used to obtain as much autonomy of the proposed method as possible. Firstly, 5-fold cross-validation was used to divide the initial data into 5 equal stratified parts. The classification model was then built on the training sets (4 parts), in which only the selected features are used, and then tested on the testing set (1 part).

In the experiment, the following settings were used. The social and cognitive coefficients c1 and c2 were both set to 2. The lower (vmin) and upper (vmax) bounds of the velocity were set to −4 and 4, respectively. The value 0.8 was assigned to the inertia weight (ω). The number of iterations and the number of particles in the swarm were set to 100 and 200, respectively.

TABLE II: Used datasets

     Datasets         Domain     Features  Classes  Samples
  1  Primary Tumor    Medicine         18       22      339
  2  Ionosphere       Physics          35        2      351
  3  Soybean          Biology          36       19      683
  4  Movement-libras  Medicine         91       15      360
  5  SRBCT            Medicine       2309        4       83
  6  Leukemia1        Medicine       5328        3       72
  7  DLBCL            Medicine       5470        2       77
  8  CNS              Medicine       7130        2       60
  9  Brain Tumor2     Medicine      10368        4       50
 10  Prostate Tumor   Medicine      10510        2      102
 11  Leukemia2        Medicine      11226        3       72

B. Results

From Table I it is evident that the proposed method BPSO+C4.5 is superior to all the compared classification methods. The most significant classification accuracy improvement over the C4.5 classification method was achieved on the CNS dataset, i.e., 51.00%. On two datasets (SRBCT and Leukemia1) the proposed method managed to improve the classification accuracy to 100%.

In all cases, on average, fewer than half of the features were selected from the original datasets. The most significant feature elimination was achieved on the Ionosphere dataset, i.e., 70.71%.

The proposed method was also compared to all the classification methods using F-measure and AUC (Table III). The F-measure results of our proposed method are better on all 11 datasets. The AUC results of our proposed method are better on eight datasets; on the remaining three, the NB classification method is slightly better (avg. 0.029). The best possible result for both metrics, i.e., 1, was obtained with the proposed method on three datasets: SRBCT, Leukemia1, and DLBCL.
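For reference, the three indicators can be computed with scikit-learn as sketched below. The averaging scheme for the multi-class datasets is our assumption; the paper does not state which one was used.

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def performance_indicators(y_true, y_pred, y_score):
    # y_score: per-class probabilities, shape (n_samples, n_classes);
    # for binary problems pass y_score[:, 1] to roc_auc_score instead.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred, average="weighted"),
        "auc": roc_auc_score(y_true, y_score,
                             multi_class="ovr", average="weighted"),
    }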
TABLE III: Experiment results – F-measure and AUC of BPSO+C4.5

     Datasets         F-measure    AUC
  1  Primary Tumor        0.454  0.788
  2  Ionosphere           0.980  0.982
  3  Soybean              0.969  0.991
  4  Movement-libras      0.823  0.917
  5  SRBCT                1      1
  6  Leukemia1            1      1
  7  DLBCL                1      1
  8  CNS                  0.949  0.938
  9  Brain Tumor2         0.938  0.969
 10  Prostate Tumor       0.990  0.990
 11  Leukemia2            0.973  0.981

VI. STATISTICAL ANALYSIS

The purpose of the statistical analysis was to determine whether the proposed method is statistically significantly better than the other classification methods. The results of Friedman's Two-way Analysis of Variance by Ranks show that the proposed method BPSO+C4.5 (mean rank = 3.86) attains the highest rank, followed by the classification methods Naive Bayes (mean rank = 2.45), C4.5 (mean rank = 2.15), and SVM (mean rank = 1.55).
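The ranking analysis can be reproduced with SciPy's implementation of the Friedman test, sketched here on illustrative numbers (the real input would be the 11 per-dataset scores of the four classifiers):

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows: datasets; columns: BPSO+C4.5, Naive Bayes, C4.5, SVM.
# Illustrative values only -- the paper uses its 11 datasets.
scores = np.array([
    [0.98, 0.90, 0.88, 0.85],
    [1.00, 0.93, 0.86, 0.80],
    [0.95, 0.91, 0.92, 0.84],
])

# Friedman's two-way analysis of variance by ranks.
stat, p_value = friedmanchisquare(*scores.T)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")

# Mean rank per classifier (higher score -> higher rank);
# cf. the reported ranks 3.86, 2.45, 2.15, 1.55.
ranks = rankdata(scores, axis=1)
print(ranks.mean(axis=0))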