
International Journal of Advanced Computer Science, Vol. 3, No. 6, Pp. 318-321, Jun., 2013.


Manuscript received: 21 Aug. 2012; revised: 27 Sep. 2012; accepted: 1 May 2013; published: 15 May 2013

Keywords: feature selection; parameter selection; Gaussian radial basis function; cosine similarity





Abstract: Recently, Li et al. proposed a parameter selection method for the Gaussian radial basis function (GRBF) in support vector machines (SVM). In their paper, cosine similarity is calculated between two vectors based on the properties of the GRBF kernel function. Li's method can determine an optimal sigma for SVM and thus efficiently improve its performance, yet it is limited to a fixed original feature space and may suffer if that space contains irrelevant and redundant features, especially when it is high-dimensional. In this paper, Li's method is extended to a flexible feature space so that feature selection and parameter selection are conducted at the same time. A feature subset and sigma are determined by minimizing an objective function that considers both within-class and between-class cosine similarities. Our experimental results demonstrate that the proposed method achieves higher classification accuracy than Li's method and the traditional SVM.

1. Introduction
In machine learning, data preprocessing is important for data mining algorithms. Dimension reduction is one of the usual objectives of preprocessing; it aims to alleviate the curse of dimensionality and the Hughes phenomenon [1], focus on relevant features only, and improve the performance of data mining algorithms, e.g. in terms of classification accuracy and computational time. Theoretical analyses and experimental studies both confirm that many algorithms scale poorly when there are a large number of irrelevant and redundant features [2]. Dimension reduction in the preprocessing step is therefore necessary. There are two representative categories of dimension reduction: feature extraction and feature selection. Feature extraction generates a set of new features from the original feature space through some functional projection [3]. Principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA) are three representative statistical methods in this category. Several new methods, such as nonparametric weighted feature extraction (NWFE) [4] and LDA-based clustering feature extraction [5], have been proposed recently.


Zhiliang Liu and Hongbing Xu are from School of Automation
Engineering, University of Electronic Science and Technology of China.
{zhilianng_liu, hbbxxu}@ueestc.edu.cn
On the other hand, feature selection attempts to find a subset of $M$ features from the original feature space of $N$ features ($M \le N$) so that the space is optimally reduced according to a certain criterion [6], such as classification accuracy, mutual information [7], the Pearson correlation coefficient, the $\chi^2$-statistic, the t-statistic, or ReliefF [8]. Qu et al. used the norm of the weight vector ($\|\mathbf{w}\|$) of SVM as the performance measure and applied their feature selection method to damage degree classification of planet gears [9]. Compared to feature extraction, feature selection has the advantage that the selected features can be interpreted intuitively because the original features are not changed.
Nowadays, researchers pay much attention to feature selection and apply it to various areas including text processing of internet documents, gene expression array analysis [8], combinatorial chemistry, and fault diagnosis [9, 10]. Feature selection methods are normally grouped into filters, wrappers, and embedded methods. Filters perform feature selection before classification and are thus independent of the chosen classifier. Wrappers use the classifier of interest to evaluate feature subsets according to their predictive power, such as classification accuracy. Embedded methods perform feature selection in the process of training and are usually specific to a given classifier [10]. This paper focuses on filters because they are classifier-independent and time-saving. Sequential search and exhaustive search are two search methods for filters. The former usually evaluates features one by one according to one of the criteria mentioned above. The sequential strategy has a drawback: Guyon and Elisseeff reported that a feature that is completely useless by itself can greatly improve performance when used together with others [10]. Exhaustive search avoids this problem because it looks at features as groups rather than as individuals. It may, however, be time consuming and cause over-fitting.
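The drawback of ranking features one at a time can be illustrated with a small sketch. It assumes a toy dataset whose label is the XOR of the first two features, scores single features by the absolute Pearson correlation, and scores feature groups by the training accuracy of a majority-vote lookup table. The data and the helper names `pearson` and `subset_score` are hypothetical and chosen only for illustration; they are not taken from [10] or from the method proposed here.

```python
from collections import Counter, defaultdict
from itertools import combinations, product
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def subset_score(X, y, subset):
    """Training accuracy of a majority-vote lookup table built on `subset`."""
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        groups[tuple(row[f] for f in subset)][label] += 1
    return sum(max(c.values()) for c in groups.values()) / len(y)


# Hypothetical toy data: three binary features, label = XOR of features 0 and 1.
X = list(product([0, 1], repeat=3))
y = [f0 ^ f1 for f0, f1, _ in X]

# Sequential view: score each feature on its own.
single_scores = {f: abs(pearson([row[f] for row in X], y)) for f in range(3)}

# Exhaustive view: score every non-empty group of features, smallest groups first.
subsets = [c for size in range(1, 4) for c in combinations(range(3), size)]
best_subset = max(subsets, key=lambda c: subset_score(X, y, c))

print(single_scores)  # correlation of each single feature with the label
print(best_subset)    # first (smallest) subset that reaches the highest group score
```

In this toy case every single feature is uncorrelated with the label, while the group search finds that features 0 and 1 together determine it. The number of subsets grows as 2^N, so full enumeration is practical only for small N.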
In 2010, a selection method for the GRBF parameter sigma was proposed by Li et al. [11]. It employs $\cos\theta$ (cosine similarity) as a similarity measure of samples, where $\theta$ is the angle between each pair of vectors in the GRBF kernel space. Li's method uses gradient descent, applied here as a one-dimensional optimization, to find the optimal sigma by minimizing their objective function. The method effectively improves the performance of GRBF-based classifiers, such as SVM. Yet Li's method deals with the complete original feature space and may suffer if that space contains irrelevant and redundant features, especially when it is high-dimensional. That is, the optimal sigma may be trapped in a local minimizer. Considering this issue, Li's method is extended to a flexible feature space so that a feature subset and sigma
Hybrid Feature Selection and Parameter
Selection for Support Vector Machine Classification
Zhiliang Liu & Hongbing Xu
Liu et al.: Hybrid Feature Selection and Parameter Selection for Support Vector Machine Classification.
International Journal Publishers Group (IJPG)


can be optimized at the same time. The proposed method also takes $\cos\theta$ as a similarity measure of samples and introduces a variable related to feature subsets into the objective function, which considers both within-class and between-class similarities. The problem thus becomes an optimization problem in two variables. An optimization algorithm, e.g. the genetic algorithm (GA), particle swarm optimization (PSO), the Nelder-Mead simplex method, or simulated annealing (SA) [12], can be used to solve it.
The remainder of this paper is organized as follows. Section 2 provides a brief introduction to the GRBF kernel function and its properties, and then presents our feature selection method. Section 3 describes how the experiments are implemented and analyzes the results on three benchmark datasets. Section 4 summarizes and concludes the paper.
2. The Proposed Method
A. Gaussian Radial Basis Function
The Gaussian radial basis function (GRBF), also called the Gaussian kernel function, is one of the most commonly used kernel functions. It can be expressed as
$$k(\mathbf{x}, \mathbf{z}, \sigma) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$$ (Equ. 1)
where $\mathbf{x}, \mathbf{z} \in \mathbb{R}^n$ are two samples that are n-dimensional column vectors in the original feature space, $\|\mathbf{x} - \mathbf{z}\|$ denotes the Euclidean distance between the two samples, and $\sigma \in (0, \infty)$ is called the width of features. The GRBF kernel function has the following properties.
a. The norm of each sample in the kernel space is one: $k(\mathbf{x}, \mathbf{x}, \sigma) = 1$. That is to say, all samples lie on a circle if the kernel space is two-dimensional and on a spherical surface if the kernel space is three-dimensional.
b. The cosine of the angle between two samples in the kernel space is equal to their kernel function value: $\cos\theta = k(\mathbf{x}, \mathbf{z}, \sigma)$, where $\theta$ is the angle between the two samples and has a range of $0 \le \theta \le \pi$. Because the kernel function values of GRBF have a range of $0 < k(\mathbf{x}, \mathbf{z}, \sigma) \le 1$, it follows that $0 < \cos\theta \le 1$, and hence $0 \le \theta < \pi/2$.
The property $0 \le \theta < \pi/2$ always holds for any two samples in the kernel space, and it implies that the angle between any two samples lies within a range of $\pi/2$ rad. For example, all samples in the kernel space lie in a quarter of a circle if the kernel space is two-dimensional, as shown in Fig. 1 (i), and in an eighth of a spherical surface if the kernel space is three-dimensional, as shown in Fig. 1 (ii). Based on property (a), it is reasonable to measure the similarity of two samples by the angle between them. If the angle is smaller, the two samples are more alike, e.g. A1 and A2 in Fig. 1 (i); otherwise, they are less similar to each other, e.g. A1 and C1 in Fig. 1 (i). Because $\cos\theta$ is monotonically decreasing with respect to $\theta$ when $0 \le \theta < \pi/2$, $\cos\theta$, which is equal to $k(\mathbf{x}, \mathbf{z}, \sigma)$, can be taken to measure the similarity of samples. That is to say, $k(\mathbf{x}, \mathbf{z}, \sigma)$ becomes large when two samples are close and small when they are far apart.
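The kernel in Equ. 1 and properties (a) and (b) can be checked numerically. The following is a minimal sketch in plain Python; the sample points `a1`, `a2`, `c1` and the helper names `grbf` and `kernel_angle` are hypothetical and serve only as an illustration.

```python
import math


def grbf(x, z, sigma):
    """Gaussian RBF kernel k(x, z, sigma) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_angle(x, z, sigma):
    """Angle (rad) between x and z in the kernel space: theta = arccos(k)."""
    return math.acos(min(1.0, grbf(x, z, sigma)))


# Hypothetical samples: a1 and a2 are close, c1 is far from both.
a1, a2, c1 = (0.0, 0.0), (0.1, 0.0), (3.0, 4.0)
sigma = 1.0

print(grbf(a1, a1, sigma))          # property (a): k(x, x, sigma) is one
print(kernel_angle(a1, a2, sigma))  # small angle for similar samples
print(kernel_angle(a1, c1, sigma))  # larger angle, still below pi/2
```

In floating-point arithmetic the kernel value of two very distant samples can underflow to zero, in which case the computed angle equals $\pi/2$ rather than staying strictly below it.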

Fig. 1 Schematic diagram of samples in the kernel space: (i),
two-dimensional GRBF kernel space; (ii), three-dimensional GRBF kernel
space
B. GRBF-based Feature Selection Algorithm
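When selecting a proper feature subset, two principles have to be considered for the proposed GRBF-based feature selection algorithm: (i) samples from the same class have large GRBF values; (ii) samples from different classes have small GRBF values. We first define the within-class cosine $C_w$ and the between-class cosine $C_b$ as
$$C_w(\mathbf{s}, \sigma) = \frac{1}{\sum_{i=1}^{L} l_i^2} \sum_{i=1}^{L} \sum_{\mathbf{x} \in \Omega_i} \sum_{\mathbf{z} \in \Omega_i} k(\mathbf{x} \circ \mathbf{s}, \mathbf{z} \circ \mathbf{s}, \sigma)$$ (Equ. 2)
$$C_b(\mathbf{s}, \sigma) = \frac{1}{\sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} l_i l_j} \sum_{i=1}^{L} \sum_{j=1, j \ne i}^{L} \sum_{\mathbf{x} \in \Omega_i} \sum_{\mathbf{z} \in \Omega_j} k(\mathbf{x} \circ \mathbf{s}, \mathbf{z} \circ \mathbf{s}, \sigma)$$ (Equ. 3)
where $L$ is the number of classes; $l_i$ is the number of samples in class $i$; $\Omega_i$ denotes the set of samples in class $i$; $i$ and $j$ are indices of classes; $\mathbf{s} \in \{0, 1\}^n$ is an n-dimensional binary column vector; and $\mathbf{x} \circ \mathbf{s}$ is the element-wise product of $\mathbf{x}$ and $\mathbf{s}$. Through this product, the binary vector $\mathbf{s}$ enables or disables a feature by setting the corresponding value to one or zero, respectively.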
$C_w$, with $0 < C_w \le 1$, is the average cosine value of samples in the same class, and $C_b$, with $0 < C_b \le 1$, is the average cosine value of samples in different classes. Both $C_w$ and $C_b$ are scalars. Ideally, $C_w = 1$ because all samples in the same class overlap completely, and $C_b \to 0$ because samples from different classes tend to be orthogonal to each other.
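As a rough illustration of Equ. 2 and Equ. 3, the sketch below computes the within-class and between-class cosines for a given feature mask. The two-class toy data, in which the first feature separates the classes and the second is noise, and the helper names `mask`, `within_class_cosine`, and `between_class_cosine` are assumptions made for this example.

```python
import math


def grbf(x, z, sigma):
    """GRBF kernel value exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mask(x, s):
    """Element-wise product of x and s: feature k is kept when s[k] == 1."""
    return [xi * si for xi, si in zip(x, s)]


def within_class_cosine(classes, s, sigma):
    """C_w: mean kernel value over all ordered pairs (x, z) of the same class.

    Pairs with x == z are included, matching the l_i^2 normalization.
    """
    total, pairs = 0.0, 0
    for samples in classes:
        masked = [mask(x, s) for x in samples]
        total += sum(grbf(x, z, sigma) for x in masked for z in masked)
        pairs += len(samples) ** 2
    return total / pairs


def between_class_cosine(classes, s, sigma):
    """C_b: mean kernel value over all ordered pairs from different classes."""
    masked = [[mask(x, s) for x in samples] for samples in classes]
    total, pairs = 0.0, 0
    for i, class_i in enumerate(masked):
        for j, class_j in enumerate(masked):
            if i != j:
                total += sum(grbf(x, z, sigma) for x in class_i for z in class_j)
                pairs += len(class_i) * len(class_j)
    return total / pairs


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
classes = [[(0.0, 5.0), (0.1, -5.0)], [(3.0, 5.0), (3.1, -5.0)]]
sigma = 1.0

for s in ([1, 1], [1, 0]):
    c_w = within_class_cosine(classes, s, sigma)
    c_b = between_class_cosine(classes, s, sigma)
    print(s, round(c_w, 4), round(c_b, 4))
```

With the noisy feature disabled, the within-class cosine moves toward one while the between-class cosine stays small, which is the behavior the two principles ask for.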
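According to the above two principles, the objective function is defined as
$$J(\mathbf{s}, \sigma) = \begin{bmatrix} \omega_w & \omega_b \end{bmatrix} \begin{bmatrix} 1 - C_w \\ C_b \end{bmatrix} = \boldsymbol{\omega}^T \mathbf{C}$$ (Equ. 4)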



where $(\cdot)^T$ is the transpose of a vector or a matrix, $\boldsymbol{\omega} = [\omega_w, \omega_b]^T$ is a two-dimensional weighting vector for $C_w$ and $C_b$, and $\omega_w + \omega_b = 1$. The feature selection problem becomes a constrained optimization problem:
$$[\mathbf{s}^*, \sigma^*] = \underset{\mathbf{s} \in \{0,1\}^n,\ \sigma \in (0, \infty)}{\arg\min}\ J(\mathbf{s}, \sigma)$$ (Equ. 5)
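To make the objective in Equ. 4 and the minimization in Equ. 5 concrete, here is a minimal sketch under stated assumptions. It replaces the optimizer with plain enumeration over all non-empty feature masks and a small grid of sigma values, which is feasible only for a handful of features. The weights 0.4 and 0.6 follow the weighting vector reported in TABLE 2; the toy data, the sigma grid, and the function names are hypothetical.

```python
import math
from itertools import product


def grbf(x, z, sigma):
    """GRBF kernel value exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def objective(classes, s, sigma, w_within=0.4, w_between=0.6):
    """J(s, sigma) = w_within * (1 - C_w) + w_between * C_b."""
    masked = [[[xi * si for xi, si in zip(x, s)] for x in c] for c in classes]
    within_sum = between_sum = 0.0
    within_pairs = between_pairs = 0
    for i, class_i in enumerate(masked):
        for j, class_j in enumerate(masked):
            pair_sum = sum(grbf(x, z, sigma) for x in class_i for z in class_j)
            if i == j:
                within_sum += pair_sum
                within_pairs += len(class_i) * len(class_j)
            else:
                between_sum += pair_sum
                between_pairs += len(class_i) * len(class_j)
    c_w = within_sum / within_pairs
    c_b = between_sum / between_pairs
    return w_within * (1.0 - c_w) + w_between * c_b


def brute_force_search(classes, n_features, sigmas):
    """Return the (mask, sigma) pair with the smallest J over a finite grid."""
    candidates = [
        (s, sigma)
        for s in product([0, 1], repeat=n_features)
        if any(s)  # skip the empty feature subset
        for sigma in sigmas
    ]
    return min(candidates, key=lambda c: objective(classes, c[0], c[1]))


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
classes = [[(0.0, 5.0), (0.1, -5.0)], [(3.0, 5.0), (3.1, -5.0)]]
s_star, sigma_star = brute_force_search(classes, 2, [0.5, 1.0, 2.0])
print(s_star, sigma_star)
```

Enumeration is used here only because it is easy to verify by hand; for realistic feature counts a heuristic optimizer is needed, since the number of masks grows as 2^n.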
We use a genetic algorithm to solve the problem, and the optimal $\mathbf{s}^*$ and $\sigma^*$ can then be determined. For example, suppose $\mathbf{s}^* = [1, 0, 1, 0, 0]^T$ and $\sigma^* = 1$; this means that Feature 1 (F1) and Feature 3 (F3) are selected to form the new feature subset. The subset {F1, F3} and $\sigma^*$ are then used for GRBF-based classifiers as shown in Fig. 2.

Fig. 2 An example of how to use the proposed method for classification
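The example mask $[1, 0, 1, 0, 0]^T$ can be decoded as in the short sketch below. The sample vectors `x` and `z` and the helper names are hypothetical. The sketch also shows that dropping the disabled features gives the same kernel value as zeroing them, because zeroed entries contribute nothing to the distance.

```python
import math


def grbf(x, z, sigma):
    """GRBF kernel value exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def selected_features(s):
    """1-based feature labels (F1, F2, ...) enabled by the binary mask s."""
    return [f"F{k}" for k, bit in enumerate(s, start=1) if bit == 1]


def reduce_sample(x, s):
    """Keep only the entries of x whose mask bit is one."""
    return [xi for xi, bit in zip(x, s) if bit == 1]


s_star = [1, 0, 1, 0, 0]
sigma_star = 1.0

# Hypothetical samples with five features each.
x = [0.2, 0.9, 0.4, 0.7, 0.1]
z = [0.3, 0.1, 0.5, 0.6, 0.8]

print(selected_features(s_star))  # ['F1', 'F3']
print(reduce_sample(x, s_star))   # [0.2, 0.4]

# Zeroing the disabled features and dropping them give the same kernel value.
k_masked = grbf([a * b for a, b in zip(x, s_star)],
                [a * b for a, b in zip(z, s_star)], sigma_star)
k_reduced = grbf(reduce_sample(x, s_star), reduce_sample(z, s_star), sigma_star)
print(math.isclose(k_masked, k_reduced))
```

Passing the reduced samples to the classifier therefore preserves the kernel values while lowering the dimension of the data.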
3. Experiments and Results
In the experiments, we choose SVM as the classifier to evaluate the performance of the proposed method. SVM is implemented by svmtrain and svmclassify from the Bioinformatics Toolbox, and GA is implemented by ga from the Global Optimization Toolbox of MATLAB. In the following, the proposed method (method 1) is compared with two other relevant methods: SVM with sigma selection (method 2) and the traditional SVM (method 3). Classification accuracy is used as the performance measure of these methods. It is defined as Nc/(Nc+Nf) × 100%, where Nc is the number of samples that are correctly classified, and Nf is the number of those falsely classified. The GRBF is used as the kernel function of SVM in all three methods, and default values are used for the other parameters. There is no feature selection in method 2 or method 3, so they use all features in the original feature space. We take three benchmark datasets, the Parkinson, Ionosphere, and Sonar datasets from the University of California Irvine (UCI) repository [14], to test the three methods. Their profiles are summarized in TABLE 1.
TABLE 1
SUMMARY OF BENCHMARK DATASETS
Dataset Classes Instances Features
Parkinson 2 195 22
Ionosphere 2 351 34
Sonar 2 208 60

Normalization is first conducted to prevent features with greater numeric ranges from dominating those with smaller numeric ranges. K-fold cross-validation is then employed for data partition (K=3 in the experiment). The experiment is repeated for five runs; in each run, training and testing are executed K times, once per fold, and the results are averaged over these K folds. The same procedure is repeated for all three datasets. The feature subsets and sigma values obtained with the three methods are summarized in TABLE 2, and classification accuracy is shown in Figs. 3-5.
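The evaluation protocol described above can be sketched as follows. This is only an outline under several assumptions: min-max scaling stands in for the unspecified normalization, a 1-nearest-neighbor rule stands in for the GRBF-kernel SVM so that the example needs no external library, and the toy data and all function names are hypothetical.

```python
import random


def min_max_normalize(X):
    """Scale every feature column to [0, 1] (assumed normalization scheme)."""
    cols = list(zip(*X))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in X]


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def accuracy(y_true, y_pred):
    """Classification accuracy Nc / (Nc + Nf) * 100%."""
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * n_correct / len(y_true)


def nearest_neighbor_predict(X_train, y_train, X_test):
    """Stand-in classifier (1-NN); the paper uses a GRBF-kernel SVM instead."""
    def predict_one(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in X_train]
        return y_train[dists.index(min(dists))]
    return [predict_one(x) for x in X_test]


def repeated_cv(X, y, k=3, runs=5, seed=0):
    """Return one fold-averaged accuracy (%) per run."""
    rng = random.Random(seed)
    Xn = min_max_normalize(X)  # normalization precedes partitioning, as in the text
    run_scores = []
    for _ in range(runs):
        fold_scores = []
        for test_idx in k_fold_indices(len(Xn), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(Xn)) if i not in held_out]
            y_pred = nearest_neighbor_predict(
                [Xn[i] for i in train_idx],
                [y[i] for i in train_idx],
                [Xn[i] for i in test_idx],
            )
            fold_scores.append(accuracy([y[i] for i in test_idx], y_pred))
        run_scores.append(sum(fold_scores) / k)
    return run_scores


# Hypothetical, well-separated two-class data with 12 samples and 2 features.
X = [[0.1 * i, 5.0 + i] for i in range(6)] + [[10.0 + 0.1 * i, 50.0 + i] for i in range(6)]
y = [0] * 6 + [1] * 6
print(repeated_cv(X, y))
```

Scaling the whole dataset before partitioning follows the order given in the text; a stricter protocol would fit the scaling on each training fold only.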
[Figure: classification accuracy (%) versus run (#) over runs 1-5 for method 1, method 2, and method 3]
Fig. 3 Classification accuracy of the Parkinson dataset
[Figure: classification accuracy (%) versus run (#) over runs 1-5 for method 1, method 2, and method 3]
Fig. 4 Classification accuracy of the Ionosphere dataset
[Figure: classification accuracy (%) versus run (#) over runs 1-5 for method 1, method 2, and method 3]
Fig. 5 Classification accuracy of the Sonar dataset
From the results presented, the proposed method uses fewer features and achieves higher classification accuracy than the other two methods. In the Parkinson dataset, our method selects five of the 22 features; that is, it is more accurate while using only 23% of the features in the original space. Dimensions are reduced to 50% and 45% in the Ionosphere dataset and the Sonar dataset, respectively. Computational time in testing is thus saved with a reduced feature space, and only important features are retained in the optimal feature subset. Sigma selection of GRBF greatly affects the performance of SVM in the Sonar dataset since the traditional SVM has quite low
classification accuracy in Fig. 5. This indicates that a proper sigma is critical to the performance of SVM. As expected, method 2 shows its effectiveness in Figs. 3-5 in comparison with method 3. The proposed method further improves performance over method 2.
4. Conclusions
In this paper, we proposed a GRBF-based feature selection algorithm that tackles feature selection and parameter selection of the GRBF kernel function simultaneously. By utilizing the properties of GRBF, the proposed method employs $\cos\theta$ as a measure of similarity of classes, where $\theta$ is the angle between each pair of vectors in the GRBF kernel space. The objective function, which considers both within-class and between-class similarities, is minimized to obtain an optimal combination of a feature subset and sigma. Three benchmark datasets from the UCI repository are used to evaluate the proposed method (method 1), Li's method (method 2), and the traditional SVM (method 3). Results demonstrate that the proposed method achieves higher classification accuracy with a reduced feature subset. The proposed method could be applied to measurement systems to determine the most significant variables for condition monitoring. The cost of measurement can be reduced with a reduced feature subset. In the future, we will validate the generalization of the proposed method with other GRBF-based classifiers.
References
[1] P.-H. Hsu, "Feature extraction of hyperspectral images using wavelet and matching pursuit," (2007) ISPRS Journal of Photogrammetry and Remote Sensing, vol. 62, no. 2, pp. 78-92.
[2] P. Langley, Elements of Machine Learning. San Francisco: Morgan Kaufmann Inc., 1996.
[3] N. Wyse, R. Dubes, & A.K. Jain, "A critical evaluation of intrinsic dimensionality algorithms," (1980) Pattern Recognition in Practice, pp. 514-524.
[4] B.-C. Kuo & D.A. Landgrebe, "Nonparametric weighted feature extraction for classification," (2004) IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 5, pp. 1096-1105.
[5] C.-H. Li, B.-C. Kuo, & C.-T. Lin, "LDA-based clustering algorithm and its application to an unsupervised feature extraction," (2011) IEEE Transactions on Fuzzy Systems, vol. 19, no. 1, pp. 152-163.
[6] M. Dash & H. Liu, "Feature selection methods for classification," (1997) Intelligent Data Analysis: An International Journal, vol. 1, no. 3, pp. 131-156.
[7] X. Zhao, M.J. Zuo, & T. Patel, "EMD, ranking mutual information and PCA based condition monitoring," (2010) ASME 2010 International Design Engineering Technical Conferences, Montreal, Canada.
[8] P. Yang, B. Zhou, Z. Zhang, & A.Y. Zomaya, "A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data," (2010) BMC Bioinformatics, 11 (suppl. 1): S5.
[9] J. Qu, Z. Liu, M.J. Zuo, & H.-Z. Huang, "Feature selection for damage degree classification of planetary gearboxes using support vector machine," (2011) Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 255, no. 9, pp. 2250-2264.
[10] I. Guyon & A. Elisseeff, "An introduction to variable and feature selection," (2003) Journal of Machine Learning Research, vol. 3, pp. 1157-1182.
[11] C.-H. Li, C.-T. Lin, B.-C. Kuo, & H.-S. Chu, "An automatic method for selecting the parameter of the RBF kernel function to support vector machines," (2010) Proceedings of IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 836-839.
[12] E.K.P. Chong & S.H. Żak, An Introduction to Optimization, 3rd Edition. Hoboken: John Wiley & Sons Inc., 2008.
[13] J. Shawe-Taylor & N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge: Cambridge Univ. Press, 2004.
[14] A. Frank & A. Asuncion, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2010.

Zhiliang Liu received the B.S. degree from the School of Electrical and Information Engineering, Southwest University for Nationalities, Chengdu, in 2006. He was a visiting scholar at the University of Alberta from 2009 to 2011. He is currently a Ph.D. candidate in the School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu. His research interests mainly include pattern recognition, data mining, fault diagnosis and prognosis.



TABLE 2
SUMMARY OF EXPERIMENTAL RESULTS
Dataset | Method 1: $\sigma^*$ | Method 1: $\mathbf{s}^*$ | Method 1: $\boldsymbol{\omega}$ | Method 2: $\sigma^*$ | Method 2: $\mathbf{s}$ | Method 3: $\sigma$ | Method 3: $\mathbf{s}$
Parkinson | 0.6545 | {F1,F3,F12,F18,F19} | [0.4 0.6]^T | 3.6125 | {F1-F22} | 1 | {F1-F22}
Ionosphere | 2.1635 | {F1,F3,F5-F9,F11,F14,F15,F20-F25,F28,F31} | [0.4 0.6]^T | 3.9915 | {F1-F34} | 1 | {F1-F34}
Sonar | 1.8212 | {F8,F10-F16,F19,F22,F23,F26,F27,F32,F34-F37,F46-F51,F55,F59,F60} | [0.4 0.6]^T | 5.8804 | {F1-F60} | 1 | {F1-F60}
Note: F# is the feature # in the corresponding dataset; star (*) denotes the optimal value.