
A Comparative Study on Feature Selection Techniques

Ms.Shailaja V. Pede Lecturer, Pimpri Chinchwad college of engineering, Nigdi, pune Contact no. +919422428065 Email :

Prof.K.Rajeswari (Asst.Prof & Head of Comp Engg. Dept), Pimpri Chinchwad college of engineering, Nigdi, Pune Contact No. : +919766909526 Email :

Data mining techniques become complex and expensive as the number of features grows. Selecting features from a given dataset is therefore one of the major components of data mining. Feature selection algorithms reduce the dimensionality of a given dataset without degrading the performance or the information content of the domain. Supervised feature selection is used as a pre-processing method for other machine learning techniques such as clustering, classification or association rule mining, reducing the dimensionality of the domain space without much loss of information content. In this paper we discuss some existing feature selection algorithms, namely LVF (Las Vegas algorithm for Filter feature selection), Relief, DTM (Decision Tree Method), Branch & Bound and MDLM (Minimum Description Length Method), and compare them theoretically with some newer techniques: feature selection using frequency count, feature reduction using association rule mining and multi-objective feature reduction. This paper is a comparative study of different techniques for reducing the number of features in a dataset. It is found that the recent techniques are more useful for reducing the number of features, thereby improving performance on real-life datasets. Our next paper will propose a new algorithm based on these techniques to improve the performance of data classification.
Keywords- Data Mining, Feature Selection, Feature Reduction, Preprocessing.

Feature selection has been widely researched in the pattern recognition, machine learning, statistics and data mining communities. The main idea of feature selection is to choose a subset of the original features by eliminating redundant features and those with little or no predictive information. If we extract as much information as possible from a given dataset while using the smallest number of features, we can save a large amount of computation time and often build a model that generalizes better to unseen points. By identifying insignificant information, feature selection also saves the cost incurred during data collection, communication and maintenance. It also improves the comprehensibility of the resulting classifier models and clarifies the relationships among features.

Another case is to find the correct subset of predictive features. For example, a physician can decide, based on the selected features, whether a potentially dangerous surgery is necessary for treatment. Feature reduction attempts to remove irrelevant features according to two basic criteria: (i) the accuracy does not significantly decrease and (ii) the resulting concept, given only the values of the selected features, is as close as possible to the original concept given all the features. Feature reduction methods find the best feature subset, in terms of some evaluation function, among the possible subsets.

Statement of the Problem

To compare some existing techniques of feature selection with new techniques of feature selection such as feature selection using frequency count, feature selection using association rule mining and multiobjective feature selection and arrive at a new solution.

Existing Techniques
Feature selection problems can be broadly classified into three distinct categories: hard problems, where the data is defined in a space of hundreds or thousands of coordinates and drastic dimensionality reduction is necessary; soft problems, where the requirement for reduction is milder; and visualization problems, where high-dimensional data is mapped to a few dimensions so that its structure becomes perceivable by humans. In this section, some of the popular feature selection algorithms are reproduced.

LVF: LVF is a Las Vegas algorithm for filter feature selection that uses a consistency measure to evaluate candidate subsets generated at random. It performs a random search over the subset space and computes the inconsistency measure for each subset. It then checks the inconsistency value against a threshold and rejects the subset if the value is greater than the threshold. LVF is consistency driven and handles noisy domains. LVF reduces the number of features quickly during the initial stages; after that it still searches blindly, i.e. in the same way, so computing resources are spent on generating many subsets that are not good. The algorithm is reproduced in figure 1.

LVF (D, S, MaxTries, γ)
1. T = S
2. For i = 1 to MaxTries
3. Randomly choose a subset of features, Sj
4. if card(Sj) ≤ card(T)
5. if inConCal(Sj, D) ≤ γ then T = Sj and output Sj
6. else append Sj to T
7. output Sj as another solution
8. endfor
9. return T
Figure 1 : LVF
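
The random search in figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the inconsistency rate is taken to be the fraction of instances that disagree with the majority class among instances sharing the same values on the candidate subset, the threshold is passed as `gamma`, and only strictly smaller consistent subsets replace the current best (the "another solution" branch for equally sized subsets is omitted). The function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(data, labels, subset):
    """Fraction of instances that disagree with the majority class
    among instances sharing the same values on `subset`."""
    groups = defaultdict(Counter)
    for row, label in zip(data, labels):
        pattern = tuple(row[i] for i in subset)
        groups[pattern][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(data)

def lvf(data, labels, max_tries, gamma, seed=0):
    """Las Vegas Filter: keep the smallest random subset whose
    inconsistency rate does not exceed gamma."""
    rng = random.Random(seed)
    n_features = len(data[0])
    best = list(range(n_features))  # T = S, the full feature set
    for _ in range(max_tries):
        size = rng.randint(1, n_features)
        candidate = sorted(rng.sample(range(n_features), size))
        # Only subsets smaller than the current best are checked for consistency.
        if len(candidate) < len(best) and inconsistency_rate(data, labels, candidate) <= gamma:
            best = candidate
    return best
```

Because the subsets are drawn at random, the quality of the result depends on `max_tries`; the sketch makes the blind-search behaviour described above easy to observe by printing each accepted candidate.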

Branch and Bound:

B&B was proposed by Narendra and Fukunaga. It is an exponential search algorithm that uses a monotonic evaluation function. The algorithm takes the required number of features (M) as input and attempts to find the best subset of that size. Branch and Bound feature selection starts with the full set of features and removes one feature at a time, so the search is considered non-exhaustive; accuracy is the other major disadvantage of this algorithm. The algorithm is given in figure 2.

B&B (D, S, M)
1. if card(S) ≠ M then /* subset generation */
2. j = 0
3. for all features f in S begin
4. Sj = S - f /* remove one feature at a time */
5. if (Sj is legitimate) and if isbetter(Sj, T) then T = Sj
6. B&B(Sj, M) /* recursion */
7. j++
8. return T
Figure 2: Branch & Bound
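
The recursion in figure 2 can be illustrated with the following Python sketch. It is an interpretation rather than the original algorithm: "legitimate" is read here as "the criterion value still exceeds the best complete subset found so far", which is a valid pruning rule only when the criterion is monotonic (removing a feature never increases it). The `criterion` callback and the function name are hypothetical.

```python
def branch_and_bound(features, m, criterion):
    """Find the size-m subset maximizing a monotonic criterion,
    removing one feature at a time and pruning hopeless branches."""
    best = {"subset": None, "score": float("-inf")}

    def search(subset, start):
        score = criterion(subset)
        # Bound: with a monotonic criterion, no subset of this node can
        # score higher than the node itself, so the branch is cut.
        if score <= best["score"]:
            return
        if len(subset) == m:
            best["subset"], best["score"] = subset, score
            return
        # Remove one feature at a time; `start` keeps removals in
        # increasing position order so no subset is visited twice.
        for i in range(start, len(subset)):
            search(subset[:i] + subset[i + 1:], i)

    search(tuple(features), 0)
    return best["subset"], best["score"]
```

A simple monotonic criterion for experimentation is the sum of non-negative per-feature scores, for example `lambda s: sum(scores[f] for f in s)` with a hypothetical `scores` dictionary.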

Relief: Relief selects the relevant features using a statistical method. It is a feature-weight-based algorithm inspired by instance-based learning. It chooses a sample of instances at random from the set of training instances and, for each instance in the sample, finds the NearHit and NearMiss instances based on the Euclidean distance measure. The NearHit of an instance is the instance with minimum Euclidean distance among all instances of the same class. The NearMiss of an instance is the instance with minimum Euclidean distance among all instances of a different class. The algorithm computes the weights of the features from the sample of instances and chooses the features whose weight is greater than or equal to a threshold. The Relief family of algorithms can be used to select the features that are most predictive of an outcome. Relief can efficiently detect higher-order interaction effects that may go undetected by other methods, and it works well for noisy and correlated features. However, it cannot handle redundant features and hence generates non-optimal feature subsets if the dataset contains redundant features. It works only with binary classes. Another problem is choosing a proper value of NoSample. (LVF, by contrast, is efficient because only subsets with fewer features than the current best subset are checked for inconsistency, and it finds the optimal subset for most datasets.) Relief is feature selection based on attribute estimation: it assigns a grade of relevance to each feature and selects those features whose grade reaches the threshold given by the user. The main problem is that the original Relief handles only Boolean concept problems. The steps of the Relief algorithm are described in figure 3.

Relief (D, S, NoSample, Threshold)
1. T = ∅
2. Initialize all weights, Wj, to zero
3. For i = 1 to NoSample
4. Randomly choose an instance x in D
5. Find its nearHit and nearMiss
6. For j = 1 to N /* N is the number of features */
7. Wj = Wj - diff(xj, nearHitj)^2 + diff(xj, nearMissj)^2
8. For j = 1 to N
9. If Wj ≥ Threshold, append feature fj to T
10. Return T
Figure 3: Relief
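
The weight update of figure 3 can be sketched as follows. This is a hedged illustration, assuming numeric features on comparable scales, a two-class problem, and `diff` taken as the plain difference between feature values; like step 7 of the figure, it does not normalize the weights by the sample size. The function name and the `seed` parameter are hypothetical.

```python
import random

def relief(data, labels, no_sample, threshold, seed=0):
    """Return (selected feature indices, feature weights) following
    the Relief weight update for a two-class problem."""
    rng = random.Random(seed)
    n_features = len(data[0])
    weights = [0.0] * n_features

    def sq_dist(a, b):
        # Squared Euclidean distance; the ordering matches Euclidean distance.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    for _ in range(no_sample):
        idx = rng.randrange(len(data))
        x, y = data[idx], labels[idx]
        hits = [r for k, r in enumerate(data) if labels[k] == y and k != idx]
        misses = [r for k, r in enumerate(data) if labels[k] != y]
        near_hit = min(hits, key=lambda r: sq_dist(x, r))
        near_miss = min(misses, key=lambda r: sq_dist(x, r))
        for j in range(n_features):
            # Penalize features that differ on the nearest hit,
            # reward features that differ on the nearest miss.
            weights[j] += -(x[j] - near_hit[j]) ** 2 + (x[j] - near_miss[j]) ** 2

    selected = [j for j, w in enumerate(weights) if w >= threshold]
    return selected, weights
```

Because the weights are not normalized, a sensible `threshold` grows with `no_sample`; dividing each update by `no_sample` would make the threshold independent of the sample size.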

DTM: The Decision Tree Method applies feature selection in a Natural Language Processing application. To select the features, a decision tree learner is run over a training set, and all features that appear in the pruned decision tree are selected. In other words, the selected subset is the union of the subsets of features appearing on the path to any leaf node in the pruned tree. The idea of decision tree algorithms is to analyze a set of samples, each assigned a class label. The decision tree system splits these samples into subsets so that the data in each descendant subset is purer than the data in the parent superset. As a classifier, the decision tree can then be used to classify an unknown sample, i.e. a sample with no class information, according to the previously analyzed training set. From a more formal point of view, a decision tree method is a supervised learning technique that builds a decision tree from a set of training samples during the machine learning process. DTM is useful for improving case-based learning; however, if the given dataset is noisy, DTM cannot work successfully.

MDLM: If a feature subset X can be expressed as a fixed non-class-dependent function F of the features in another subset Y, then once the values of the features in subset Y are known, the features in subset X become useless. Based on this concept, the Minimum Description Length Method [1] eliminates all irrelevant and redundant features. The Minimum Description Length Criterion (MDLC) is used for this purpose. The algorithm exhaustively searches all possible subsets and returns the subset satisfying MDLC. MDLM is a complete generation process of feature selection. This method can find all the useful features when the observations are Gaussian; for non-Gaussian cases it may not be able to find the useful subset of features.
Symbol           Description
D                The dataset
S                Original set of features
M                Number of features to be selected
card(X)          Function to find the cardinality of the set X
isbetter(X, Y)   A function to check if the set X is better than the set Y
NoSample         The sample size
Threshold        Lower limit of a feature's weight to become relevant
Wj               Weight of the jth feature
MaxTries         Number of iterations
inConCal         Function to calculate inconsistency
γ                Upper level of inconsistency
Minsup           Minimum support



Table 1 : Symbols used in above algorithms

Recent Techniques
In this section, two feature selection and extraction techniques that support classical single-objective association rule mining are studied. Classical association rule mining works on a market basket dataset and tries to optimize one objective, i.e. confidence.

Feature Selection using frequency count

This algorithm not only reduces the features of a market basket dataset based on frequency count, but also reduces the dimensionality of the dataset by selecting only the relevant attributes from the original continuous-valued dataset. When the dataset is converted to market basket form, all the sub-ranges of the selected attributes have to be considered. The algorithm finds the relevant sub-ranges of the attributes for the given dataset and produces a market basket dataset containing only the selected attributes, i.e. fewer attributes. The required inputs are the dataset and the maximum number of sub-ranges needed, MaxAtt. Table 2 describes the symbols used in the algorithm. It scans the dataset only once and finds the frequency count of every sub-range of all attributes. Using these frequency counts, it removes irrelevant sub-ranges until the desired number remains. Finally it outputs the selected features, i.e. some sub-ranges. The number of such sub-ranges will be less than or equal to MaxAtt. In this way the algorithm reduces the features of the given dataset.

FSUFC (D, MaxAtt)
1. S = ∅
2. for all attributes Ai ∈ A
3. for all subranges Pi,j ∈ Ai
4. S = S ∪ Pi,j
5. for all s ∈ S
6. find the frequency count, SUPs
7. minfreq = 1
8. S1 = ∅
9. for all s ∈ S
10. if (s ∈ Ai) and ((SUPs * |Ai|) ≥ (minfreq * max(|A|)))
11. S1 = S1 ∪ s
12. if |S1| ≤ MaxAtt go to step 16
13. minfreq = minfreq + 1
14. S = S1
15. go to step 8
16. return S1
Figure 4: FSUFC

SYMBOL       DESCRIPTION
A            Attributes of original dataset
|Ai|         Number of sub-ranges of ith attribute
Pi,j         jth sub-range of ith attribute
max(|A|)     Maximum of |Ai|
S            Set of sub-ranges of A
S1           Set of selected sub-ranges
SUPs         Frequency of sub-range s
Minfreq      Current value of support count to declare as frequent
MaxAtt       Maximum no of sub-ranges to be selected

Table 2: Symbols Used in FSUFC
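
The steps of figure 4 can be sketched in Python as follows. This is an illustrative reading, not the authors' code: each record is assumed to be a tuple of already discretized values, a sub-range is represented as a hypothetical `(attribute_index, value)` pair, and |Ai| is taken as the number of distinct values observed for attribute i.

```python
from collections import Counter

def fsufc(rows, max_att):
    """Select at most max_att sub-ranges by raising minfreq until
    few enough sub-ranges pass the scaled frequency test."""
    # One pass over the dataset: frequency count SUPs of every sub-range.
    counts = Counter((i, v) for row in rows for i, v in enumerate(row))
    sizes = Counter(i for i, _ in counts)  # |Ai|: sub-ranges per attribute
    max_size = max(sizes.values())         # max(|A|)

    selected = set(counts)
    minfreq = 1
    while True:
        # Step 10: keep s in Ai if SUPs * |Ai| >= minfreq * max(|A|).
        selected = {s for s in selected
                    if counts[s] * sizes[s[0]] >= minfreq * max_size}
        if len(selected) <= max_att:
            return sorted(selected)
        minfreq += 1
```

Scaling the count by |Ai| compensates for attributes with many sub-ranges, whose individual sub-ranges are naturally less frequent; the loop always terminates because an ever larger `minfreq` eventually empties the candidate set.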

This algorithm works on the original continuous-valued dataset, where the number of attributes is very small, and hence requires little memory for its execution. For every attribute some sub-ranges are considered. These sub-ranges would become attributes in the market basket dataset, but this method restricts some of them from becoming attributes. The frequency of every sub-range of all the attributes is calculated by reading the dataset once. Afterwards, only those frequency counts are used to reduce the dimensionality of the market basket dataset to the level desired by the user, who has to provide the desired number of attributes as input to the algorithm. After the frequency counts are calculated, those sub-ranges whose frequency count is less than a factor of the minimum frequency, minfreq, are eliminated. This factor is different for the sub-ranges of different attributes. If the dataset has been reduced to the desired level, the algorithm outputs the sub-ranges found to be relevant. Otherwise it eliminates more sub-ranges by incrementing the minimum frequency count, minfreq. This process continues until the number of relevant sub-ranges becomes less than or equal to the number of attributes desired by the user.

Feature Reduction using association Rule Mining

This algorithm, FRARM, uses one user parameter, minimum support, to eliminate the attributes that are irrelevant to the association rule mining problem. The algorithm outputs the sub-ranges of the attributes whose support count is greater than or equal to the support count equivalent to the minimum support.

FRARM (D, Minsup)
1. S = ∅
2. for all attributes Ai ∈ A
3. for all subranges Pi,j ∈ Ai
4. S = S ∪ Pi,j
5. for all s ∈ S
6. find the frequency count, SUPs
7. T = |D| /* total number of records in D */
8. S1 = ∅
9. for all s ∈ S
10. if (SUPs ≥ T * Minsup)
11. S1 = S1 ∪ s
12. return S1
Figure 5 : FRARM

If only these sub-ranges are used to form the market basket dataset used by the rule generation algorithms, those algorithms will work faster. Feature reduction will not lead to fewer passes over the data, but it will save time because the dataset is smaller. The steps of this methodology are given in Figure 5.
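
The support filter of Figure 5 is short enough to sketch directly. As an assumption for illustration, each record is a tuple of already discretized values, a sub-range is a hypothetical `(attribute_index, value)` pair, and `minsup` is given as a fraction between 0 and 1.

```python
from collections import Counter

def frarm(rows, minsup):
    """Keep the sub-ranges whose frequency count reaches minsup * |D|."""
    # One pass over the dataset: frequency count SUPs of every sub-range.
    counts = Counter((i, v) for row in rows for i, v in enumerate(row))
    total = len(rows)  # T = |D|
    return sorted(s for s, c in counts.items() if c >= total * minsup)
```

Since a sub-range below the minimum support cannot appear in any frequent item-set, dropping it before rule generation shrinks the market basket dataset without changing the frequent item-sets found at that support or higher.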

Experimental Results and Comparative Study

The algorithm was tested on several synthetic datasets as well as some benchmark datasets downloaded from the UCI machine learning data repository. Monks-1 and Monks-3 are two such datasets used in the experiments. There are 124 and 122 instances in Monks-1 and Monks-3 respectively. Both have 8 attributes; the first is the class number and the last is the sample number. The remaining six attributes are numeric values spanning different ranges. The minimum and maximum values of these attributes are A1(1,3), A2(1,3), A3(1,2), A4(1,3), A5(1,4) and A6(1,2). If these datasets are converted to market basket form with unique values, there will be 17 attributes in total. From the following results it can be observed that the method selects the sub-ranges of those attributes declared as relevant by the existing algorithms. Table 3 presents the comparison of some of the above-mentioned algorithms with the proposed method.
Method Selected Relief attributes A2,A5 always & one or both of B&B A3,A4 A1,A3,A4 Monks3 MBs Dimension 9 or 10 or 12 8 NA 10

Monks1 Selected attributes A1,A2,A5 MBs dimension 10

DTM LVF MDLM FSUFT Reduced to 4 FSUFT Reduced to 6 FSUFT Reduced to 8 FRARM with 25% support

A2,A5 A2,A4,A5 A2,A3,A5 A1-1,A3-1,A43,A5-1 -

7 10 9 4 -

NA NA NA A1-1,A2-3,A5-4,A6-2

A1-1, A2-3, A3-1, A4- 6 3, A5-4, A6-2 -

A1-1, A2-2, A3-1, 8 A4-3, A5-1, A5-2, A5-4, A6-2 A1-1, A2-2, A3-1, 8 A4-3, A5-1, A5-2, A5-4, A6-2

A1-1, A1-2, A2-2, A2- 10 3, A3-1, A4-1, A4-3, A52, A5-4, A6-2 A1-1, A2-3, A3-1, A4- 6 3, A5-4, A6-2

DRARM with 26% support

A1-1, A3-1, A4-3, 4 A5-1

Table 3: Comparative Results
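
The figure of 17 market basket attributes quoted for the Monks datasets follows from the attribute ranges given in the text: 3 + 3 + 2 + 3 + 4 + 2 = 17. The short sketch below reproduces the count and shows one possible conversion of a record into a transaction; the item naming `A1-1` follows the notation used in the results, while the function names are hypothetical.

```python
# Value ranges (min, max) quoted in the text for the six Monks attributes.
MONKS_RANGES = {"A1": (1, 3), "A2": (1, 3), "A3": (1, 2),
                "A4": (1, 3), "A5": (1, 4), "A6": (1, 2)}

def market_basket_items(ranges):
    """One binary item per (attribute, value) pair, named like 'A1-1'."""
    return [f"{name}-{value}"
            for name, (low, high) in ranges.items()
            for value in range(low, high + 1)]

def to_transaction(record, names):
    """Convert one record of attribute values into its set of items."""
    return {f"{name}-{value}" for name, value in zip(names, record)}

items = market_basket_items(MONKS_RANGES)
print(len(items))  # 17
```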

Using the frequency count of the attributes, this algorithm reduces the features of a dataset based on a user given minimum support. Features of the target market basket dataset can be reduced to a desired level by providing a suitable value of minimum support. Since the same parameter is used in the rule mining process also, there will be less difficulty for the user to provide the appropriate value for that parameter. The algorithms presented in the previous subsections can help the classical single objective association rule mining problem.

Summary of Findings and Conclusion

With the same support, the same frequent item-sets were generated from the original and the reduced dataset, provided the reduction was done with the same or a smaller support. The reduced dataset saves a significant amount of time during frequent item-set generation compared with the original dataset. Reduction does not cause any information loss during frequent item-set generation with a higher support. Selecting a support for reduction that is higher than the minimum support used in frequent item-set generation may lead to loss of information. Based on the experimental results, the algorithm FSUFC is found to be good enough for reducing the dimension of a market basket dataset. The execution time of the algorithm is controlled only by the desired number of attributes, as it is the only user parameter. The main limitation of this technique is that the user must have clear knowledge of the dataset before reducing it. Using FRARM, the features of the target market basket dataset can be reduced to a desired level by providing a suitable value of minimum support. Since the same parameter is also used in the rule mining process, the user will have less difficulty providing an appropriate value for it.

Scope for further research

The techniques studied above support the classical single-objective association rule mining problem. Using these techniques, a new technique can be proposed that supports multi-objective association rule mining. Further, the same algorithms can be applied to improve performance on real-life databases such as medical or agricultural datasets, so that fewer features are required for decision making.

B. Nath, D. K. Bhattacharyya, A. Ghosh. Dimensionality Reduction Using Association Rule Mining. International Journal of Intelligent Information Processing, Volume 2, Number 1, March 2011.
M. Carreira. A review of dimension reduction techniques. Technical report, University of Stanford, 1997.
K. M. Faraoun, A. Rabhi. Data dimensionality reduction based on genetic selection of feature subsets. EEDIS UDL University, SBA, 2002.
Cheng-Hong Yang, Chung-Jui Tu, Jun-Yang Chang, Hsiou-Hsiang Liu, Po-Chang Ko. Dimensionality Reduction using GA-PSO. 2001.
Pádraig Cunningham. Dimension Reduction. University College Dublin Technical Report UCD-CSI-2007-7, August 8th, 2007.
H. Liu and R. Setiono. A probabilistic approach to feature selection: a filter solution. In Proceedings of International Conference on Machine Learning, pages 319-327, 1996.
P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature selection. IEEE Transactions on Computers, C-26(9):917-922, September 1977.
K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proc. of 10th National Conference on AI, pages 129-134. MIT Press, 1992.
C. Cardie. Using decision trees to improve case based learning. In Proc. of Tenth International Conference on Machine Learning, pages 25-32, Amherst, MA, 1993. Morgan Kaufmann.