A Framework For Cost-Based Feature Selection
Pattern Recognition
Article info

Article history:
Received 19 July 2012
Received in revised form 15 November 2013
Accepted 21 January 2014
Available online 28 January 2014

Abstract
Over the last few years, the dimensionality of datasets involved in data mining applications has increased dramatically. In this situation, feature selection becomes indispensable as it allows for dimensionality reduction and relevance detection. The research proposed in this paper broadens the scope of feature selection by taking into consideration not only the relevance of the features but also their associated costs. A new general framework is proposed, which consists of adding a new term to the evaluation function of a filter feature selection method so that the cost is taken into account. Although the proposed methodology could be applied to any feature selection filter, in this paper the approach is applied to two representative filter methods, Correlation-based Feature Selection (CFS) and Minimal-Redundancy-Maximal-Relevance (mRMR), as an example of use. The behavior of the proposed framework is tested on 17 heterogeneous classification datasets, employing a Support Vector Machine (SVM) as a classifier. The results of the experimental study show that the approach is sound and that it allows the user to reduce the cost without compromising the classification error.
© 2014 Elsevier Ltd. All rights reserved.
Keywords:
Cost-based feature selection
Machine learning
Filter methods
1. Introduction
The proliferation of high-dimensional data has become a trend in the last few years. Datasets with a dimensionality over the tens of thousands are constantly appearing in applications such as medical image and text retrieval or genetic data analysis. In fact, analyzing the dimensionality of the datasets posted in the UCI Machine Learning Repository [1] over the last decades, one can observe that in the 1980s the maximum dimensionality of the data was about 100; it increased to more than 1500 in the 1990s; and finally, in the 2000s, it further increased to about 3 million [2].
The high dimensionality of data has an important impact on learning algorithms, since their performance degrades when a large number of irrelevant and redundant features are present. This phenomenon is known as the curse of dimensionality [3], because unnecessary features increase the size of the search space and make generalization more difficult. To overcome this major obstacle in machine learning, researchers usually employ dimensionality reduction techniques. In this manner, the set of features required to describe the problem is reduced, often along with an improvement in the performance of the models.
Feature selection is arguably the best-known dimensionality reduction technique.
* Corresponding author at: Department of Computer Science, Facultade de Informática, Campus de Elviña s/n, University of A Coruña, 15071 A Coruña, Spain. Tel.: +34 981 167 000x1305; fax: +34 981 167 160.
E-mail addresses: vbolon@udc.es (V. Bolón-Canedo), iporto@udc.es (I. Porto-Díaz), nsanchez@udc.es (N. Sánchez-Maroño), ciamparo@udc.es (A. Alonso-Betanzos).
The most popular filter metrics for classification problems are correlation and mutual information, although other common filter metrics include error probability, probabilistic distance, entropy or consistency [5].
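As a rough illustration of a univariate mutual-information filter (this snippet is not part of the original paper; the dataset and variable names are placeholders), the following sketch ranks the features of a dataset by their estimated mutual information with the class using scikit-learn:

# Minimal sketch (illustrative only): ranking features with a
# mutual-information filter.  The dataset is a placeholder.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Estimate I(x_i; c) for each feature and rank them (higher = more relevant).
relevance = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(relevance)[::-1]
print("Top 5 features by mutual information:", ranking[:5])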
There are some situations in which a user is interested not only in maximizing the merit of a subset of features, but also in reducing the costs that may be associated with them. For example, in medical diagnosis, symptoms observed with the naked eye are costless, but each diagnostic value extracted by a clinical test is associated with its own cost and risk. In other fields, such as image analysis, the computational expense of features refers to the time and space complexities of the feature acquisition process [6]. This is a critical issue, especially in real-time applications, where the computational time required to extract one feature or another is crucial, and also in the medical domain, where it is important to save economic costs and to improve the comfort of the patient by avoiding risky or unpleasant clinical tests (factors that can also be treated as costs).
The goal of this research is to obtain a trade-off between a filter metric and the cost associated with the selected features, in order to select relevant features with a low associated cost. A general framework to be applied together with the filter approach is introduced. In this manner, any filter metric can be modified to take into account the cost associated with the input features. In this paper, and for the sake of brevity, two implementations of this framework are presented as examples of use, choosing two representative and widely used filters: Correlation-based Feature Selection (CFS) and Minimal-Redundancy-Maximal-Relevance (mRMR). The results obtained with these two filters are promising, showing that the approach is sound.
The rest of the paper is organized as follows: Section 2 summarizes previous research on the subject; Section 3 describes the proposed method in detail; Sections 4 and 5 describe the experimental study performed and the results obtained, respectively; and finally, Section 6 presents the conclusions and future work.
2. Background
Feature selection has been an active and effective tool in numerous fields such as DNA microarray analysis [7,8], intrusion detection [9,10], medical diagnosis [11] or text categorization [12]. New feature selection methods are constantly appearing; however, the great majority of them focus only on removing irrelevant and redundant features, not on the cost of obtaining the input features.
The cost associated with a feature can be related to different concepts. For example, in medical diagnosis, a pattern consists of observable symptoms (such as age and sex) along with the results of some diagnostic tests. Contrary to observable symptoms, which have no cost, diagnostic tests have associated costs and risks. For example, an invasive exploratory surgery is much more expensive and risky than a blood test [13]. Another example of the risk of extracting a feature can be found in [14], where evaluating the merits of beef cattle as meat producers requires carrying out zoometry on living animals.
On the other hand, the cost can also be related to computational issues. In the medical imaging field, extracting a feature from a medical image can have a high computational cost. For example, in the texture analysis technique known as co-occurrence features [15], the computational cost of extracting each feature is not the same, which implies different computational times. In other cases, such as real-time applications, the space complexity is negligible, but the time complexity is very important [6].
One existing approach selects the features whose cost value satisfies a given condition, and another one just selects the k attributes with the lowest cost. Therefore, the general framework for cost-based feature selection proposed in this paper aims to fill this gap.
3. The proposed method

The idea of the framework is to add a cost term to the evaluation function of the filter. For CFS, the cost-based evaluation function is

MC_S = \frac{k\,\overline{r}_{ci}}{\sqrt{k + k(k-1)\,\overline{r}_{ii}}} - \lambda\,\frac{\sum_{i=1}^{k} C_i}{k}

where MC_S is the merit of the subset S of k features affected by the cost of the features, \overline{r}_{ci} is the average feature-class correlation, \overline{r}_{ii} is the average feature-feature inter-correlation, C_i is the cost of the feature i, and λ is a parameter introduced to weight the influence of the cost in the evaluation function.

The parameter λ is a positive real number. If λ is 0, the cost is ignored and the method works as the regular CFS. If λ is between 0 and 1, the influence of the cost is smaller than that of the other term. If λ = 1, both terms have the same influence, and if λ > 1, the influence of the cost is greater than that of the other term.
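A minimal sketch of this evaluation (an assumption for illustration, not the authors' code; the function and variable names are hypothetical) is given below. It simply subtracts the weighted average cost of the selected subset from the standard CFS merit:

# Minimal sketch (not the authors' implementation): cost-modified CFS
# evaluation MC_S = merit(S) - lambda * mean(C_i for i in S).
import numpy as np

def cfs_merit(k, r_ci_mean, r_ii_mean):
    """Standard CFS merit of a subset of k features, given the mean
    feature-class correlation and the mean feature-feature correlation."""
    return (k * r_ci_mean) / np.sqrt(k + k * (k - 1) * r_ii_mean)

def cost_based_merit(k, r_ci_mean, r_ii_mean, costs, lam):
    """Cost-based evaluation: subtract the weighted average cost of the subset."""
    return cfs_merit(k, r_ci_mean, r_ii_mean) - lam * np.mean(costs)

# Example with arbitrary illustrative numbers: 3 selected features,
# mean correlations, and unit-interval costs.
print(cost_based_merit(3, 0.45, 0.20, costs=[0.1, 0.8, 0.3], lam=0.75))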
For mRMR, relevance and redundancy are expressed in terms of mutual information. The relevance of a subset S of features with respect to the class c is

D(S, c) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)    (3)

and the redundancy among the features of S is

R(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)    (5)

The evaluation function to be maximized combines the two constraints (3) and (5). It is called Minimal-Redundancy-Maximal-Relevance (mRMR) and has the expression shown in the following equation:

\max \Phi(D, R), \quad \Phi = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) - \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) = D(S, c) - R(S)

As in the CFS case, the cost term λ (\sum_{i=1}^{k} C_i)/k is subtracted from this evaluation function to obtain the cost-based version of mRMR.
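A possible greedy realization of this cost-based mRMR criterion is sketched below (an illustrative assumption, not the authors' implementation; mutual information between pairs of features is approximated here with scikit-learn estimators, and all names are hypothetical):

# Minimal sketch (assumptions, not the authors' code): greedy cost-based mRMR.
# At each step we add the feature maximizing
#   I(x_i; c) - mean_{x_j in S} I(x_i; x_j) - lambda * C_i
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def cost_mrmr(X, y, costs, n_select, lam=1.0, seed=0):
    n_feat = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=seed)   # I(x_i; c)
    selected, remaining = [], list(range(n_feat))
    while len(selected) < n_select and remaining:
        best, best_score = None, -np.inf
        for i in remaining:
            if selected:
                # Mean mutual information between candidate i and the selected features.
                redundancy = np.mean([
                    mutual_info_regression(X[:, [i]], X[:, j], random_state=seed)[0]
                    for j in selected])
            else:
                redundancy = 0.0
            score = relevance[i] - redundancy - lam * costs[i]
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

With lam=0 the sketch reduces to plain greedy mRMR, mirroring the behavior described above for λ = 0.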
4. Experimental study
The experiments are performed over three blocks of datasets (Table 2). The datasets in the first and second blocks are available at the UCI Machine Learning Repository [1]. The datasets in the third block are DNA microarray datasets and are available at http://datam.i2r.a-star.edu.sg/datasets/krbd and http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. The main feature of the first block of datasets is that they have an intrinsic cost associated with the input features. For the second and third blocks, as these datasets do not have an intrinsic cost, random costs for their input features have been generated. This decision was taken because, to the best of the authors' knowledge, no publicly available datasets with costs exist other than the four in the first block. For each feature, the cost was generated as a random number between 0 and 1. For instance, Table 1 displays the costs for each feature of the Yeast dataset.
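The cost-assignment step just described can be reproduced with a few lines of code (a sketch under the stated assumption of uniform random costs; the seed and variable names are arbitrary):

# Sketch of the cost-assignment step: one uniform random cost in [0, 1]
# per input feature (not the authors' script).
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the costs are reproducible
n_features = 8                    # e.g. the Yeast dataset has 8 features
costs = rng.uniform(0.0, 1.0, size=n_features)
print(np.round(costs, 4))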
Overall, the chosen classification datasets are very heterogeneous. They present a variable number of classes, ranging from two to twenty-six. The numbers of samples and features range from single digits to the tens of thousands. Notice that datasets in the first and second blocks have a larger number of samples than features, whilst datasets in the third block have a much larger number of features than samples.
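The overall evaluation loop suggested by this setup can be sketched as follows (a hedged illustration only: it reuses the hypothetical cost_mrmr selector sketched in Section 3 and a cross-validated SVM; none of this is the authors' code, and the λ grid is taken from the values shown in the figures):

# Illustrative sketch of the evaluation loop: for each lambda, select features
# with a cost-aware filter, then record SVM error and total cost of the subset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate_lambdas(X, y, costs, lambdas=(0, 0.5, 0.75, 1, 2, 5, 10)):
    # costs: NumPy array with one cost per feature
    results = {}
    for lam in lambdas:
        subset = cost_mrmr(X, y, costs, n_select=10, lam=lam)   # hypothetical selector
        acc = cross_val_score(SVC(kernel="rbf"), X[:, subset], y, cv=5).mean()
        results[lam] = {"error": 1.0 - acc, "cost": float(np.sum(costs[subset]))}
    return results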
5. Experimental results
Figs. 1, 3 and 6 show the average cost and error for several values of λ. The solid line with "x" markers represents the error (referenced on the left Y axis) and the dashed line with "o" markers represents the cost (referenced on the right Y axis). Notice that when λ = 0 the cost has no influence on the behavior of the method and it behaves as the non-cost version.
Fig. 1 plots the error/cost of the four datasets with associated cost found at the UCI repository (see Table 2). The expected behavior when applying cost-based feature selection is that the higher the λ, the lower the cost and the higher the error. The results obtained for the first block of datasets show that the cost indeed behaves as expected (although its magnitude does not change much, because these datasets have few features and the set of selected ones is often very similar). The error, however, remains constant in most of the cases. This may happen because these datasets are quite simple and the same set of features is often chosen. The Kruskal–Wallis statistical test run on the results showed that the errors are not significantly different, except for the Pima dataset. This can be explained by the fact that this dataset has very few expensive features (which are often associated with a higher predictive power), as can be seen in Table 3. Therefore, removing them has a greater effect on the classification accuracy.
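The significance check mentioned above can be reproduced with SciPy (a minimal sketch; the per-λ error samples below are arbitrary placeholder numbers, not results from the paper):

# Sketch: Kruskal-Wallis test on the error values obtained for different
# lambda settings.  The numbers are illustrative placeholders only.
from scipy import stats

errors_by_lambda = {
    0.0:  [0.24, 0.26, 0.25, 0.27],
    1.0:  [0.25, 0.27, 0.26, 0.28],
    10.0: [0.31, 0.33, 0.30, 0.34],
}
stat, p_value = stats.kruskal(*errors_by_lambda.values())
print(f"H = {stat:.3f}, p = {p_value:.4f}")  # p < 0.05 suggests significant differences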
Table 1
Random costs of the features of the Yeast dataset.

Feature   1       2       3       4       5       6       7       8
Cost      0.5093  0.1090  0.5890  0.2183  0.8112  0.6391  0.2741  0.1762
Fig. 1. Error/cost plots of the first block of datasets for cost feature selection with CFS and mRMR. (a) Hepatitis CFS, (b) Liver CFS, (c) Pima CFS, (d) Thyroid CFS, (e) Hepatitis mRMR, (f) Liver mRMR, (g) Pima mRMR and (h) Thyroid mRMR.
Table 2
Description of the datasets.

Dataset        No. features   No. samples   No. classes
Hepatitis      19             155           2
Liver          6              345           2
Pima           8              768           2
Thyroid        20             3772          3

Letter         16             20,000        26
Magic04        10             19,020        2
Optdigits      64             5620          10
Pendigits      16             7494          10
Sat            36             4435          6
Segmentation   19             2310          7
Waveform       21             5000          3
Yeast          8              1033          10

Brain          12,625         21            2
CNS            7129           60            2
Colon          2000           62            2
DLBCL          4026           47            2
Leukemia       7129           72            2
Table 3
Costs of the features of the Pima dataset (normalized to 1).

Feature   1       2       3       4       5       6       7       8
Cost      0.0100  0.7574  0.0100  0.0100  0.9900  0.0100  0.0100  0.0100
Fig. 2. Kruskal–Wallis statistical test results of the Pima dataset. (a) ANOVA table (Cost CFS), (b) graph of multiple comparison (Cost CFS), (c) ANOVA table (Cost mRMR) and (d) graph of multiple comparison (Cost mRMR).
Fig. 3. Error/cost plots of the second block of datasets for cost feature selection with CFS and mRMR. (a) Letter CFS, (b) Magic04 CFS, (c) Optdigits CFS, (d) Pendigits CFS, (e) Letter mRMR, (f) Magic04 mRMR, (g) Optdigits mRMR, (h) Pendigits mRMR, (i) Sat CFS, (j) Segment CFS, (k) Waveform CFS, (l) Yeast CFS, (m) Sat mRMR, (n) Segment mRMR, (o) Waveform mRMR and (p) Yeast mRMR.
Fig. 4. Kruskal–Wallis error statistical test of the Sat dataset with Cost CFS. (a) ANOVA table and (b) graph of multiple comparison.
Fig. 6(c) or (f)). The reason why the error is not rising can be two-fold:
Fig. 5. Kruskal–Wallis cost statistical test results of the Sat dataset with Cost CFS. (a) ANOVA table and (b) graph of multiple comparison.
Fig. 6. Error/cost plots of the third block of datasets for cost feature selection with CFS and mRMR. (a) Brain CFS, (b) CNS CFS, (c) Colon CFS, (d) DLBCL CFS, (e) Brain mRMR, (f) CNS mRMR, (g) Colon mRMR, (h) DLBCL mRMR, (i) Leukemia CFS and (j) Leukemia mRMR.
Fig. 7. Kruskal–Wallis error statistical test of the DLBCL dataset with Cost mRMR. (a) ANOVA table and (b) graph of multiple comparison.
Fig. 8. Kruskal–Wallis cost statistical test of the DLBCL dataset with Cost mRMR. (a) ANOVA table and (b) graph of multiple comparison.
Conflict of interest
None declared.
Acknowledgments
This work was supported by the Secretaría de Estado de Investigación of the Spanish Government under project TIN 2009-02402, and by the Consellería de Industria of the Xunta de Galicia through the research project CN2011/007, both of them partially supported by the European Union ERDF. V. Bolón-Canedo and I. Porto-Díaz acknowledge the support of the Xunta de Galicia and the Universidade da Coruña under their grant programs.
References
[1] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://mlearn.ics.uci.edu/MLRepository.html, last accessed: April 2012.
[2] Z.A. Zhao, H. Liu, Spectral Feature Selection for Data Mining, Chapman & Hall/CRC, London, UK, 2012.
[3] A. Jain, D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. Pattern Anal. Mach. Intell. 19 (2) (1997) 153–158.
[4] I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh, Feature Extraction. Foundations and Applications, Springer, New York, USA, 2006.
[5] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, A review of feature selection methods on synthetic data, Knowl. Inf. Syst. 34 (3) (2013) 483–519.
[6] J.T. Feddema, C.S.G. Lee, O.R. Mitchell, Weighted selection of image features for resolved rate visual feedback control, IEEE Trans. Robot. Autom. 7 (1) (1991) 31–47.
[7] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, in: Proceedings of the 2003 IEEE Bioinformatics Conference (CSB 2003), IEEE, 2003, pp. 523–528.
[8] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, An ensemble of filters and classifiers for microarray data classification, Pattern Recognit. 45 (1) (2012) 531–539.
[9] S. Mukkamala, A.H. Sung, Feature selection for intrusion detection with neural networks and support vector machines, Transp. Res. Rec. J. Transp. Res. Board 1822 (1) (2003) 33–39.
[10] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, Feature selection and classification in multiple class datasets: an application to KDD Cup 99 dataset, Expert Syst. Appl. 38 (5) (2011) 5947–5957.
[11] M.F. Akay, Support vector machines combined with feature selection for breast cancer diagnosis, Expert Syst. Appl. 36 (2) (2009) 3240–3247.
[12] G. Forman, An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res. 3 (2003) 1289–1305.
[13] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intell. Syst. Appl. 13 (2) (1998) 44–49.
[14] A. Bahamonde, G.F. Bayón, J. Díez, J.R. Quevedo, O. Luaces, J.J. Del Coz, J. Alonso, F. Goyache, Feature subset selection for learning preferences: a case study, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, pp. 49–56.
[15] R.M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification, IEEE Trans. Syst. Man Cybern. 3 (6) (1973) 610–621.
[16] J.H. Friedman, Regularized discriminant analysis, J. Am. Stat. Assoc. 84 (405) (1989) 165–175.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[23] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, T. Euler, YALE: rapid prototyping for complex data mining tasks, in: L. Ungar, M. Craven, D. Gunopulos, T. Eliassi-Rad (Eds.), KDD '06: Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2006, pp. 935–940.
[24] E. Rich, K. Knight, Artificial Intelligence, McGraw-Hill, New York, 1991.
[25] Y. Hochberg, A.C. Tamhane, Multiple Comparison Procedures, John Wiley & Sons, New Jersey, USA, 1987.
[26] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1–2) (1997) 273–324.
Verónica Bolón-Canedo received her B.S. degree in Computer Science from the University of A Coruña, Spain, in 2008. She received her M.S. degree in 2010 and is currently a Ph.D. student in the Department of Computer Science at the same university. Her research interests include machine learning and feature selection.

Iago Porto-Díaz received his B.S. degree in Computer Science from the University of A Coruña, Spain, in 2008. He received his M.S. degree in 2010 and is currently a Ph.D. student in the Department of Computer Science at the same university. His research interests include machine learning and feature selection.

Noelia Sánchez-Maroño received the Ph.D. degree for her work in the area of functional and neural networks in 2005 at the University of A Coruña. She is currently teaching at the Department of Computer Science of the same university. Her current research areas include agent-based modeling, machine learning and feature selection.

Amparo Alonso-Betanzos received the Ph.D. degree for her work in the area of medical expert systems in 1988 at the University of Santiago de Compostela. Later, she was a postdoctoral fellow at the Medical College of Georgia, Augusta. She is currently a Full Professor in the Department of Computer Science, University of A Coruña. Her main current areas are intelligent systems, machine learning and feature selection.