
A NOVEL HYBRID FILTER FEATURE SELECTION METHOD FOR DATA MINING

B.M.Vidyavathi,
Department of Computer Science
Bellary Engineering College
Bellary, Karnataka State, India
Vidyabm1@yahoo.co.in

Dr.C.N.Ravikumar
Department of Computer Science
Sri Jayachamarajendra College of Engineering
Mysore, Karnataka State, India
kumarcnr@yahoo.com

ABSTRACT
In recent years, many data mining applications deal with high-dimensional data (a very
large number of features), which imposes a high computational cost as well as the risk
of overfitting. In such cases, it is common practice to adopt a feature selection method
to improve generalization accuracy, and feature selection has become a focus of research
in data mining wherever high-dimensional data arise. In this paper we propose a novel
feature selection method based on a two-stage analysis of the Fisher Ratio and Mutual
Information. The two-stage analysis is carried out in the feature domain to reject noisy
feature indexes and to select the most informative combination from the remaining ones.
Within this approach we develop two practical solutions that avoid the difficulties of
using high-dimensional Mutual Information in practice: clustering of the feature indexes
using the cross Mutual Information, and estimation of that Mutual Information from
conditional empirical PDFs. The effectiveness of the proposed method is evaluated with
an SVM classifier on datasets from the UCI Machine Learning Repository. Experimental
results show that the proposed method outperforms several classical feature selection
methods and achieves higher prediction accuracy with a small number of features. The
results are highly promising.

Keywords: Pattern recognition, feature selection, data mining, Fisher ratio, mutual
information.

1 INTRODUCTION

Feature selection, a process of choosing a subset of features from the original ones, is frequently used as a preprocessing technique in data mining [6, 7]. It has proven effective in reducing dimensionality, improving mining efficiency, increasing mining accuracy, and enhancing result comprehensibility [4, 5]. Feature selection methods broadly fall into the wrapper model and the filter model [5]. The wrapper model uses the predictive accuracy of a predetermined mining algorithm to determine the goodness of a selected subset; it is computationally expensive for data with a large number of features [4, 5]. The filter model separates feature selection from classifier learning and relies on general characteristics of the training data to select feature subsets that are independent of any mining algorithm. By reducing the number of features, one can both reduce overfitting of learning methods and increase the computation speed of prediction (Guyon and Elisseeff, 2003) [1]. In this paper we focus on the selection of a few features among many in a classification context. Our main interest is to design an efficient filter, both from a statistical and from a computational point of view.



The most standard filters rank features according to their individual predictive power, which can be estimated by various means such as the Fisher score (Furey et al., 2000), the Kolmogorov-Smirnov test, the Pearson correlation (Miyahara and Pazzani, 2000), or mutual information (Battiti, 1994; Bonnlander and Weigend, 1996; Torkkola, 2003) [3]. Selection based on such a ranking does not ensure weak dependency among features and can lead to redundant, and thus less informative, selected families. To capture dependencies between features, a criterion based on decision trees has been proposed recently (Ratanamahatana and Gunopulos, 2003) [2]. Features which appear in binary trees built with the standard C4.5 algorithm are likely to be either individually informative (those at the top) or conditionally informative (those deeper in the trees). The drawbacks of such a method are its computational cost and its sensitivity to overfitting.

The combination of Fisher Ratio and Mutual Information is the principal merit of the proposed method. From a theoretical point of view, Fisher Ratio analysis is able to select the most discriminative (i.e. least noisy) feature components, but it cannot provide the "best" combination of the selected components because there may be cross-correlations among them. On the other hand, Mutual Information without Fisher Ratio analysis would provide maximum independence between feature components, but it might mistakenly select noisy components that degrade the system. In this work we show that the combination of Fisher Ratio and Mutual Information greatly improves classifier performance.

In this approach we develop two flexible and practical solutions that avoid the difficulties of applying high-dimensional Mutual Information: clustering of the feature indexes using the cross Mutual Information, and estimation of that Mutual Information from conditional empirical PDFs. In particular, to avoid the typical difficulty of estimating high-dimensional mutual information, we use the two-dimensional Mutual Information as a distance measure to cluster the pre-selected components into groups; the best subset is then obtained by picking the component with the best Fisher Ratio score in each group.

The remainder of this paper is organized as follows. In Section 2 we briefly describe Fisher Ratio analysis, Mutual Information maximization, the proposed feature selection method, and the Support Vector Machine (SVM) classifier. Section 3 contains experimental results, and Section 4 concludes this work.

2 FEATURE SELECTION BASED ON FISHER RATIO AND MUTUAL INFORMATION

In this section we discuss the following issues of the proposed method: 1) the pre-selection of feature components based on their Fisher Ratio (FR) scores, 2) the final selection by Mutual Information, and 3) the SVM classifier.

2.1 Fisher Ratio Analysis
Suppose that the two classes under investigation have, on a feature component domain, mean vectors µ1, µ2 and covariances Σ1, Σ2, respectively. The Fisher Ratio is defined as the ratio of the between-class variance to the within-class variance; the maximum class separation (discriminative level) is obtained when this ratio is maximized. Components with high FR scores are therefore pre-selected.
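The paper's Fisher Ratio formula is not reproduced in this text. As an illustration only, the sketch below scores each feature component with the common two-class form FR = (µ1 − µ2)² / (σ1² + σ2²) and keeps the top-scoring components; the function names, the top_k parameter, and the use of NumPy are assumptions, not part of the original description.

```python
import numpy as np

def fisher_ratio_scores(X, y):
    """Score each feature with the common two-class Fisher Ratio
    (between-class variance over within-class variance).
    X: (n_samples, n_features) array, y: binary labels (0/1)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    var0, var1 = X0.var(axis=0), X1.var(axis=0)
    return (mu0 - mu1) ** 2 / (var0 + var1 + 1e-12)  # small constant avoids division by zero

def pre_select(X, y, top_k):
    """Keep the indexes of the top_k components with the highest FR scores."""
    scores = fisher_ratio_scores(X, y)
    return np.argsort(scores)[::-1][:top_k]
```

For a multi-class dataset one could, for instance, average or take the maximum of this score over one-vs-rest splits; the paper does not specify this detail.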
2.2 Feature Selection By Mutual Information Maximization
Mutual Information maximization is a natural idea for subset selection, since it can provide a maximally informative combination of the pre-selected components. A major problem, however, is that estimating high-dimensional mutual information accurately requires a very large number of observations, which is often not available.
To solve this problem we develop a flexible solution that uses only the two-dimensional cross Mutual Information. This measurement is used as a distance to cluster the feature components into groups, and the best component of each group is then selected by means of the Fisher Ratio.

2.2.1 Clustering the Pre-Selected Components for the Selection
The Mutual Information based clustering is an iterative procedure, similar to vector quantization, and may therefore be sensitive to the initial setting. However, since the largest contribution is expected to come from the best pre-selected component, the initialization and iteration are set as follows.
1. Fix the feature component with the best Fisher Ratio score; compute the cross Mutual Information from every other component to it; sort the estimated sequence.
2. Set the initial cluster centers uniformly over the sorted sequence.



3. Assign each component to a cluster according to the lowest cross mutual information to the centers.
4. Recompute the center of each group.
5. Repeat steps 3 and 4 until the centers no longer change.
6. Select from each group the component with the highest FR score (a sketch of this procedure is given below).
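As an illustration of steps 1-6, the following Python sketch implements the clustering loop. It assumes a helper cross_mi(x, y) that returns a two-dimensional MI estimate (a simple version is sketched in Section 2.2.2 below), and it fills in details the paper leaves open, notably how a group center is recomputed (here, the group medoid under the pairwise-MI criterion); these choices are assumptions rather than the authors' exact procedure.

```python
import numpy as np

def cluster_by_cross_mi(X_pre, fr_scores, n_clusters, cross_mi, max_iter=50):
    """Cluster pre-selected components by pairwise cross MI (steps 1-5) and keep
    the component with the highest FR score in each group (step 6).
    X_pre: (n_samples, n_pre) pre-selected columns; fr_scores: their FR scores."""
    n_pre = X_pre.shape[1]
    # step 1: start from the best-FR component and sort all components by MI to it
    best = int(np.argmax(fr_scores))
    mi_to_best = np.array([cross_mi(X_pre[:, j], X_pre[:, best]) for j in range(n_pre)])
    order = np.argsort(mi_to_best)
    # step 2: initial centers spread uniformly over the sorted sequence
    centers = order[np.linspace(0, n_pre - 1, n_clusters, dtype=int)]
    # pairwise MI matrix, used as the "distance" between components
    mi = np.array([[cross_mi(X_pre[:, i], X_pre[:, j]) for j in range(n_pre)]
                   for i in range(n_pre)])
    labels = np.argmin(mi[:, centers], axis=1)
    for _ in range(max_iter):
        labels = np.argmin(mi[:, centers], axis=1)      # step 3: lowest MI to a center
        new_centers = centers.copy()
        for c in range(n_clusters):                     # step 4: recompute each center
            members = np.where(labels == c)[0]
            if len(members):                            # medoid under MI (assumption)
                new_centers[c] = members[np.argmin(mi[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_centers, centers):        # step 5: stop when centers are stable
            break
        centers = new_centers
    selected = []                                       # step 6: best FR component per group
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if len(members):
            selected.append(int(members[np.argmax(fr_scores[members])]))
    return selected
```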

2.2.2 Two-Dimensional Mutual Information Estimation
We now turn to the estimation of the cross Mutual Information. The conventional method estimates the two-dimensional Mutual Information through the marginal and joint histograms of the observations x and y of the random variables X and Y (Eq. 3); in our case, X and Y are a pair of feature components. The typical problems of the estimation in Eq. (3) are its complexity and the possible presence of null bins in the conventional joint histogram. In this work we develop an estimation method based on empirical conditional distributions. We first rewrite Eq. (3) in terms of conditional densities (Eq. 4), and Eq. (4) is then simplified by clustering the values of y and estimating the conditional density within each cluster, where i is the cluster index and ki is the number of observations in cluster i.
For clustering y we apply a fast method based on order statistics, briefly described as follows. Given an observed sequence Y = {y1, y2, ..., yn} and a number M, a set of M order statistics is defined over the sorted sequence of Y, i = 1, 2, ..., M (Eq. 5), and the clustering of the sequence Y into group i is obtained by matching each sample against the inequality between consecutive order statistics. The conditional density p(y|i) of each group is estimated by a "piecewise" density function using the order statistics calculated from the clustered samples in each group; for example, the "piecewise" pdf of the sequence Y is written in terms of the order statistics of Eq. (5). The selection of the parameter M is determined by a bias-variance tradeoff; typically, M = √N.
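Equations (3)-(5) are not reproduced in this text. As a hedged illustration, the sketch below estimates the two-dimensional MI with the standard plug-in formula I(X;Y) = Σ p(x,y) log[p(x,y)/(p(x)p(y))], clustering y by order statistics (quantile boundaries) with M ≈ √N as described; representing the per-cluster "piecewise" density by a simple conditional histogram is a simplification of the authors' estimator.

```python
import numpy as np

def cross_mi(x, y, n_x_bins=16, M=None):
    """Plug-in estimate of the two-dimensional MI I(X;Y) between two feature components.
    y is grouped into M clusters by its order statistics (quantile boundaries), with the
    typical choice M = sqrt(N); x is discretized by an ordinary histogram."""
    n = len(y)
    if M is None:
        M = max(2, int(np.sqrt(n)))                 # typical choice M = sqrt(N)
    # cluster y by order statistics: cluster i is delimited by consecutive sample quantiles
    y_edges = np.quantile(y, np.linspace(0.0, 1.0, M + 1))
    y_cluster = np.clip(np.searchsorted(y_edges, y, side="right") - 1, 0, M - 1)
    # discretize x with equal-width bins
    x_edges = np.histogram_bin_edges(x, bins=n_x_bins)
    x_bin = np.clip(np.searchsorted(x_edges, x, side="right") - 1, 0, n_x_bins - 1)
    # empirical joint and marginal probabilities
    joint = np.zeros((n_x_bins, M))
    for xb, yc in zip(x_bin, y_cluster):
        joint[xb, yc] += 1.0
    joint /= n
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                                  # skip empty cells (the "null bin" problem)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))
```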
2.3 Support Vector Machine (SVM) Classifier
The Support Vector Machine (SVM) is based on the statistical learning theory developed by Vapnik [9, 10] and has been used extensively for classification. SVM was originally designed for binary classification, and how to extend it effectively to multi-class classification is still an ongoing research issue [11]. The most common way to build a k-class SVM is to construct and combine several binary classifiers [12]. The representative ensemble schemes are One-Against-All and One-Versus-One. In One-Against-All, k binary classifiers are trained, each of which separates one class from the other k-1 classes; given a test sample x, the binary classifier with the largest output determines the class label of x. One-Versus-One constructs k(k-1)/2 binary classifiers whose outputs are aggregated to make the final decision. The decision tree formulation is a variant of One-Against-All based on a decision tree, and the error-correcting output code scheme is a general representation of the One-Against-All and One-Versus-One formulations that uses error-correcting codes to encode the outputs [8]. The One-Against-All approach, in combination with SVM, provides better classification accuracy than the others [11]; consequently, we applied the One-Against-All approach in our experiments.
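For illustration, a minimal One-Against-All wrapper is sketched below with scikit-learn's SVC. The paper implements its multi-class SVM in MATLAB, so the library, the class name OneAgainstAllSVM, and the mapping of the paper's "control parameter" to the RBF gamma are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

class OneAgainstAllSVM:
    """Train k binary RBF-SVMs, one per class, and predict by the largest decision value."""
    def __init__(self, C=100.0, gamma=0.01):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:                      # one binary problem per class
            clf = SVC(kernel="rbf", C=self.C, gamma=self.gamma)
            clf.fit(X, (y == c).astype(int))
            self.models_.append(clf)
        return self

    def predict(self, X):
        # pick the class whose binary classifier gives the largest output
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]
```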
3 EXPERIMENTAL RESULTS

Experiments are carried out on well-known data sets from the UCI Machine Learning Repository. The original partition of a dataset into training and test sets is used whenever information about the data split is available; in the absence of a separate test set, 10-fold cross validation is used to calculate the classification accuracy. The multi-class SVM classifier (One-Against-All) is implemented using MATLAB. The kernel chosen for the multi-class SVM classifier is the RBF kernel; the control parameter is taken as 0.01 and the regularization parameter C is fixed at 100.
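A possible evaluation harness matching this setup (10-fold cross validation, RBF kernel, C = 100) is sketched below. It reuses the illustrative helpers from the earlier sketches (fisher_ratio_scores, cluster_by_cross_mi, cross_mi, OneAgainstAllSVM); the one-vs-rest proxy used to obtain multi-class Fisher Ratios is an assumption, since the paper does not state how FR is computed for more than two classes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def multiclass_fr(X, y):
    """One-vs-rest proxy for a multi-class Fisher Ratio score (assumption; see lead-in)."""
    return np.max([fisher_ratio_scores(X, (y == c).astype(int)) for c in np.unique(y)], axis=0)

def evaluate_10fold(X, y, n_pre=50, n_final=10):
    """10-fold CV accuracy of FR pre-selection + MI clustering + One-Against-All RBF-SVM.
    Feature selection is redone inside each fold to avoid information leakage."""
    accs = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        fr = multiclass_fr(X[tr], y[tr])
        pre = np.argsort(fr)[::-1][:n_pre]                 # FR pre-selection
        kept = cluster_by_cross_mi(X[tr][:, pre], fr[pre], n_final, cross_mi)
        cols = pre[kept]                                   # back to original feature indexes
        clf = OneAgainstAllSVM(C=100.0, gamma=0.01).fit(X[tr][:, cols], y[tr])
        accs.append(accuracy_score(y[te], clf.predict(X[te][:, cols])))
    return float(np.mean(accs))
```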
In this section we address the performance of our proposed approach in terms of classification accuracy. We compare two feature selection methods: the proposed method and MIM. Features selected by Mutual Information Maximization (MIM) alone are not guaranteed to be weakly dependent, which can lead to redundant and poorly informative families of features.



These feature selection methods are evaluated according to the classification accuracy achieved by the SVM classifier.
A comparison of the classification accuracy achieved by the two feature selection methods is made in Table 1 for each data collection. Real-life, public-domain data sets from the UCI Machine Learning Repository are used in our experiments. Our proposed approach achieves better classification accuracy than MIM on every data set.

Table 1: Comparison of feature selection methods

Data set                    Feature selection method    Classification Accuracy (%)
Iris (D=4)                  MIM                         95.12
                            Proposed                    98.16
Ionosphere (D=32)           MIM                         92.0
                            Proposed                    94.18
Isolet (D=617)              MIM                         85.63
                            Proposed                    87.24
Multiple Features (D=649)   MIM                         82.46
                            Proposed                    84.67
Arrhythmia (D=195)          MIM                         90.23
                            Proposed                    93.46

4 CONCLUSION

We have presented a simple and very efficient scheme for feature selection in a classification context. The proposed method selects a small subset of features that carries as much information as possible. The hybrid feature selection method described in this paper has been shown both to reduce the number of selected features and to increase the classification rate. The method does not select a feature similar to the already picked ones, even if it is individually powerful, because such a feature does not carry additional information about the class to predict; this criterion therefore ensures a good tradeoff between independence and discrimination. The experiments we have conducted show that the proposed method outperforms the MIM baseline, and, combined with the SVM classifier, the scores we obtained are comparable to or better than those of state-of-the-art techniques.

5 REFERENCES

[1] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, Vol. 3, pp. 1157-1182, 2003.
[2] C. A. Ratanamahatana and D. Gunopulos. Feature selection for the naive Bayesian classifier using decision trees. Applied Artificial Intelligence, Vol. 17(5-6), pp. 475-487, 2003.
[3] K. Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, Vol. 3, pp. 1415-1438, 2003.
[4] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, Vol. 97(1-2), pp. 245-271, 1997.
[5] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, Vol. 97(1-2), pp. 273-324, 1997.
[6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis: An International Journal, Vol. 1(3), pp. 131-156, 1997.
[7] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic Publishers, 1998.
[8] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, Vol. 2, pp. 263-286, 1995.
[9] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, Vol. 20, pp. 273-297, 1995.
[10] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, Berlin Heidelberg New York, 1995.
[11] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, Vol. 5, pp. 101-141, 2004.
[12] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, Vol. 13(2), pp. 415-425, 2002.

