Hybrid Sequential Feature Selection -II

09/07/2009


# HYBRID SEQUENTIAL FEATURE SELECTION

Assignment No. 3

Submitted by: Riaz Ahmad (Registration No. L1R08MSCS0006)

University of Central Punjab, Department of Information & Technology, Lahore.

HYBRID SEQUENTIAL FEATURE SELECTION
Faculty of Information and Technology University of Central Punjab, Lahore, Pakistan. {riaz.ahmad@ucp.edu.pk}

• Regression

Let us discuss these types of data mining algorithms.

1.4.1 Association Rule Mining
This technique is used to find interesting relationships among data items. The most famous application of association rule mining is market-basket analysis, in which we try to observe the different buying trends of customers. It helps the stakeholder to offer promotions as well as to decide the placement of items; for example, items such as eggs and butter that are purchased together can be kept together. Association rule mining is unsupervised.

1.4.2 Classification
Classification is used to divide data items on the basis of a class attribute. The most famous example of classification is to divide given sales data into the class of customers who buy and the class of those who do not buy. Classification is supervised, as the class attribute on the basis of which the data is divided into classes is already known. The best-known example of classification is decision trees.

1.4.3 Clustering
Clustering is used to group items having the same characteristics. An example of clustering is grouping the students of a university into different groups according to the similarities found among them, as we usually divide students into stronger and weaker groups. Clustering, unlike classification, is unsupervised: no class attribute is given.

1.4.4 Prediction
Prediction means telling about the future; mostly it is used in predicting product sales. This technique observes the previous history of the data items. For example, one might want to predict whether it will rain next week; the climate conditions of the previous days, and of the same days last year, will be observed to make the prediction. Similarly, one might want to estimate the passing percentage of the students next year.

1.5 Applications of Data Mining
The applications of data mining are found in almost every field of life.
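A minimal sketch of the support and confidence computations behind market-basket analysis may make this concrete; the transactions and item names below are illustrative, not drawn from any dataset in this paper.

```python
# Toy market-basket data: each transaction is the set of items bought together.
transactions = [
    {"egg", "butter", "bread"},
    {"egg", "butter"},
    {"bread", "milk"},
    {"egg", "butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule {egg} -> {butter}: eggs and butter co-occur in 3 of 5 transactions,
# and every transaction containing egg also contains butter.
print(support({"egg", "butter"}, transactions))       # → 0.6
print(confidence({"egg"}, {"butter"}, transactions))  # → 1.0
```

A rule is typically reported only when both values exceed chosen thresholds (minimum support and minimum confidence).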
The main areas of application are:
• Finance (loan application processing, credit card analysis)
• Insurance (claims, fraud analysis)
• Telecommunications (call history analysis, fraud detection, promotions)
• Transport
• Marketing & Sales
• Electricity supply forecasting
• Medical

Let us discuss examples from each area of application. In finance/banking, data mining is used in loan application processing. While processing a loan application, data mining helps to decide whether to accept or reject the application by analyzing different attributes of the applicant. An algorithm can use attributes such as region, race, salary, years of service with the current employer, years of living at the current address, family background, and previous loan history, helping the decision to be made accurately; for example, the analysis might reveal that applicants from a particular region or group have historically defaulted more often. The same is the case with credit card fraud detection, where the customer's history is analyzed to detect fraud. For example, a credit card holder's history may show that he mostly uses the card at the very start of the month and never makes a transaction of a huge amount; if someone else tries to use the card illegally to make an unusual transaction, it can be detected easily.

Insurance companies use data mining for claims and fraud analysis. For example, it might help an insurance company decide whether or not to insure a particular person, again by analyzing different attributes from the customer's profile. Besides the data provided by the customer, data about the customer from external sources, that is, from different companies, can also be used.

In telecommunications, a company might want to offer different packages; data mining helps it decide which promotions to offer, for which age groups, during which hours, and in which regions, by analyzing the traffic. A major use of data mining in telecommunications is also fraud detection, by observing the behavior of customers (voucher recharging, call durations, etc.).

In transport, a bus or airline company can apply data mining to make a decision about a new route by analyzing the travelling trends of passengers and their likes or dislikes during the journey, for example whether to increase or decrease the number of buses or airplanes on a specific route. All this is possible through the application of data mining.

From the marketing and sales point of view, as discussed earlier under market-basket analysis, a shopkeeper might offer promotions on different items based on an analysis of the buying trends of consumers at different places. Similarly, the placement of items is decided with the help of data mining results.

Power supply forecasting is also a main area of application of data mining. Power supply companies use data mining for power usage analysis of domestic and commercial areas during different hours and months, to forecast the power requirements for the next month, quarter or year; in Europe, for instance, electricity is charged at different rates during different hours of the day.
Such decisions are only possible with the use of data mining. There are many other areas of application as well; in short, data mining has become a need of industry. In Europe data mining has been used for years; in Pakistan, industries have only recently started using data mining after realizing its importance.

Section 2 discusses the FSS and BSS feature selection methods from the material studied for supervised learning. Section 3 introduces and explains a selection criterion added to FSS and BSS for choosing among candidate features. Section 4 discusses the newly proposed hybrid technique. Section 5 concludes this work with key findings.

2. MATERIAL AND METHOD

Data for mining is collected from different sources, and cleansing and pre-processing are performed on it. During pre-processing, some features/attributes are added and some are removed. Before adding an attribute it must be verified whether the attribute has worth or not; similarly, while removing an attribute or feature, the accuracy rate of the classifier must not be compromised. The purpose of removing attributes is to simplify the data as much as possible, so that the efficiency of the algorithm is increased and the model becomes simpler [5]. Attributes can be divided into two types, relevant and irrelevant [6]. Feature selection has been an active field of research and development for decades in statistical pattern recognition [10], machine learning [11,12], data mining [13,14] and statistics [15]. It has proven, in both theory and practice, effective in enhancing learning efficiency, increasing accuracy, and simplifying learned results.

2.1 Feature Selection

'Let G be some subset of F and fG be the value vector of G. In general, the goal of feature selection can be formalized as selecting a minimum subset G such that P(C | G = fG) is equal or as close as possible to P(C | F = f), where P(C | G = fG) is the probability distribution of different classes given the feature values in G and P(C | F = f) is the original distribution given the feature values in F' [5].

The method of selecting only the relevant attributes is called feature selection. Feature selection attempts to select the smallest-sized subset of features in view of the following criteria:
• the classification accuracy does not significantly decrease;
• the results are as close as possible to the original class distribution, given all features [16].

The following figure shows the general feature selection process [5].

Fig. 1. Framework of feature selection (stages: original set → subset generation → candidate subset → subset evaluation → current best subset → stop condition → selected subset).

The diagram in Figure 1 exhibits a traditional feature selection framework. Subset generation produces the possible subsets; each candidate subset is then evaluated at the second stage and compared with the previous best subset with respect to some measuring criterion. If the newly evaluated subset is better than the previous one, it replaces it. This process is repeated until the given stop condition is reached. In this traditional approach only the relevancy of the attributes is considered. [9] proposed a new framework of feature selection, claiming that relevance among the attributes alone is not enough and that feature redundancy is another necessary metric.

Fig. 2. New framework of feature selection (stages: original set → relevance analysis → relevant subset → redundancy analysis → selected subset).

The diagram in Figure 2 is an enhancement of the traditional feature selection framework, composed of two steps: the first step removes the irrelevant features, and the second step removes the redundant features. Its advantage over the traditional framework is that it yields a more nearly optimal subset.

Feature selection algorithms are typically composed of the following three components [17]:
• A search algorithm, which searches the space of feature subsets; this space has size 2^d, where d is the number of features.
• An evaluation function, which takes a feature subset as input and outputs a numeric evaluation. The search algorithm's goal is to maximize this function.
• Performance, i.e. classification.

The most common sequential search algorithms for feature selection are forward sequential selection (FSS) and backward sequential selection (BSS). Let us discuss these techniques one by one in depth.

2.2 Forward Sequential Selection

Forward Sequential Selection (FSS) starts with an empty set, evaluates all feature subsets with exactly one feature, and selects the one with the best performance. It then adds to this subset the feature yielding the best performance. This process repeats until adding further attributes brings no improvement. FSS thus selects the locally optimal feature at each step, which constitutes one of its drawbacks: the algorithm cannot correct a previously added feature [16,17].

2.3 Backward Sequential Selection

Backward Sequential Selection (BSS) is the reciprocal of FSS. It begins with all features and removes a feature if its removal increases the performance of the classifier. This cycle continues until the performance of the classifier remains the same or begins to decrease. BSS frequently outperforms FSS, perhaps because BSS evaluates the contribution of a given feature in the context of all other features; in contrast, FSS can evaluate the utility of a single feature only in the limited context of the previously selected features.
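As a sketch (not the exact procedure of any cited work), the two greedy searches can be written as follows; the `accuracy` callback stands in for whatever classifier evaluation is used in practice, and `toy_accuracy`, with its assumed relevant features F2 and F3, is purely illustrative.

```python
def forward_sequential_selection(features, accuracy):
    """Greedy FSS: start empty; repeatedly add the single feature that
    most improves accuracy; stop when no addition improves it."""
    selected, best = [], accuracy([])
    remaining = list(features)
    while remaining:
        scores = {f: accuracy(selected + [f]) for f in remaining}
        f_add = max(scores, key=scores.get)
        if scores[f_add] <= best:
            break  # locally optimal: FSS cannot revisit earlier choices
        selected.append(f_add)
        remaining.remove(f_add)
        best = scores[f_add]
    return selected, best

def backward_sequential_selection(features, accuracy):
    """Greedy BSS: start with all features; repeatedly remove the feature
    whose removal most increases accuracy; stop when no removal helps."""
    selected, best = list(features), accuracy(list(features))
    while len(selected) > 1:
        scores = {f: accuracy([g for g in selected if g != f]) for f in selected}
        f_drop = max(scores, key=scores.get)
        if scores[f_drop] <= best:
            break  # performance would stay the same or decrease
        selected.remove(f_drop)
        best = scores[f_drop]
    return selected, best

# Illustrative evaluation: rewards the (assumed) relevant features F2 and F3
# and slightly penalizes every irrelevant feature kept in the subset.
RELEVANT = {"F2", "F3"}
def toy_accuracy(subset):
    return 50 + 10 * len(set(subset) & RELEVANT) - len(set(subset) - RELEVANT)

print(forward_sequential_selection(["F1", "F2", "F3", "F4"], toy_accuracy))
# → (['F2', 'F3'], 70)
print(backward_sequential_selection(["F1", "F2", "F3", "F4"], toy_accuracy))
# → (['F2', 'F3'], 70)
```

On this easy toy function both searches agree; their results diverge precisely when features interact, which is the situation discussed below.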
[16,17] note this problem with FSS, but their results favor neither algorithm. Based on these observations, it is not clear whether FSS will outperform BSS on a given dataset with an unknown number of feature interactions.

2.4 Plus l Take Away r Method

The plus l take away r method, shortly (l, r), makes use of both forward sequential selection and backward sequential selection: for every l iterations of forward sequential search, r iterations of backward sequential search are performed. This cycle is repeated until the required number of features is reached [10]. This method overcomes the problem stated above for FSS and BSS, but only partially; a further problem arises, namely what the optimal values of l and r are for a moderate computational cost.

2.5 Sequential Floating Forward Selection

Sequential Floating Forward Selection (SFFS) also makes use of both the forward and backward sequential selection techniques. Each forward sequential selection step is followed by a number of backward sequential selection steps until the optimal subset of features is selected. Backtracking thus becomes possible in this algorithm, and it requires no parameter, unlike the plus l take away r method [16].

3. PROPOSED CRITERION

One drawback common to both FSS and BSS is that neither handles the case of a tie between candidate features, whether for addition in FSS or for deletion in BSS. Let us elaborate the problem in detail. Suppose that in FSS we start with the empty set S = {}. We then try the remaining features one by one and calculate the accuracy rate obtained by adding each feature. Suppose feature F1 yields the highest accuracy; we add this feature to S, so S becomes S = {F1}. Next we repeat the process for the remaining features, and it happens that features F3 and F4 yield the same accuracy rate. Which of them should be added? Here I propose that a history be maintained for each feature during each cycle. When such a tie occurs, the decision is taken with the help of the maintained history: the candidate attribute having the greater accumulated average accuracy is selected. The same criterion can be added to BSS.

3.1 Illustration

Let us take an example to elaborate the proposed criterion for selecting the best feature among the candidate features. Suppose we have a set F of features and we want to obtain the optimal subset of these features using Forward Sequential Selection (FSS).

Iteration 1. S = {}

Table 1
Feature | Accuracy Rate | Acc. Avg. Rate
F1 | 65 |
F2 | 70 |
F3 | 87 |
F4 | 85 |
F5 | 80 |
F6 | 75 |

As feature F3 in Table 1 has the highest accuracy rate, it is selected, so S becomes S = {F3}.

Iteration 2. S = {F3}

Table 2
Feature | Accuracy Rate | Acc. Avg. Rate
F1 | 88 | 77
F2 | 88 | 79
F4 | 85 |
F5 | 80 |
F6 | 75 |

As features F1 and F2 in Table 2 are tied, we compute their accumulated averages and select the feature with the larger value. So our set becomes S = {F3, F2}.

Iteration 3. S = {F3, F2}

Table 3
Feature | Accuracy Rate | Acc. Avg. Rate
F1 | 88 |
F4 | 86 |
F5 | 89 |
F6 | 78 |

As feature F5 in Table 3 raises the accuracy rate, we add F5 and the set becomes S = {F3, F2, F5}.

Iteration 4. S = {F3, F2, F5}

Table 4
Feature | Accuracy Rate | Acc. Avg. Rate
F1 | 89.5 |
F4 | 85 |
F6 | 75 |

As feature F1 in Table 4 raises the accuracy rate, we add F1 and the set becomes S = {F3, F2, F5, F1}. Iteration 5 causes no increase in the accuracy rate, so we stop. The final optimal subset is S = {F3, F2, F5, F1}. Note that we have calculated the accumulated averages of only those attributes involved in a tie, thus saving computational cost.

4. HYBRID NATURE

Here we propose another technique that makes use of the most common and popular sequential feature selection algorithms (FSS, BSS). Since it is not clear whether FSS will outperform BSS on a given dataset with an unknown amount of feature interaction, we propose taking the union of the feature subsets returned by both algorithms; then, whatever the nature of the dataset, our selection of features will be better than performing only one of these algorithms. Suppose we have a dataset S with ten features, S = {F1, F2, F3, ..., F10}. When we perform both algorithms on this dataset, the resulting subsets are S1 = {F2, F4, F5, F7} and S2 = {F2, F3, F4, F5, F7}. Taking the union of both resulting subsets gives the final result S3 = S1 ∪ S2 = {F2, F3, F4, F5, F7}. In this way we retain all the relevant features, reducing the chance of error that arises because, as noted above, the result depends on the choice of algorithm. By taking the union of the resulting feature subsets, we obtain all the features best for the classifier.
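Both proposals of Sections 3 and 4 can be sketched in Python. The function name `fss_with_tiebreak` and the table-driven `accuracy` below are hypothetical: the accuracy lookup is built from the values of Tables 1–4 so the run reproduces the illustration, whereas in practice it would be a classifier evaluation.

```python
def fss_with_tiebreak(features, accuracy):
    """FSS with the proposed criterion: keep a per-feature history of the
    accuracies observed in every iteration; when candidates tie on the
    current accuracy, pick the one with the greater accumulated average."""
    history = {f: [] for f in features}
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        scores = {f: accuracy(selected + [f]) for f in remaining}
        for f, s in scores.items():
            history[f].append(s)
        top = max(scores.values())
        if top <= best:
            break  # no candidate improves the accuracy: stop
        # Break ties on accumulated average accuracy, per Section 3.
        tied = [f for f in remaining if scores[f] == top]
        f_add = max(tied, key=lambda f: sum(history[f]) / len(history[f]))
        selected.append(f_add)
        remaining.remove(f_add)
        best = top
    return selected

# Table-driven accuracy reproducing Tables 1-4 of the illustration.
fs = lambda *xs: frozenset(xs)
ACC = {
    fs("F1"): 65, fs("F2"): 70, fs("F3"): 87, fs("F4"): 85, fs("F5"): 80, fs("F6"): 75,
    fs("F3", "F1"): 88, fs("F3", "F2"): 88, fs("F3", "F4"): 85, fs("F3", "F5"): 80, fs("F3", "F6"): 75,
    fs("F3", "F2", "F1"): 88, fs("F3", "F2", "F4"): 86, fs("F3", "F2", "F5"): 89, fs("F3", "F2", "F6"): 78,
    fs("F3", "F2", "F5", "F1"): 89.5, fs("F3", "F2", "F5", "F4"): 85, fs("F3", "F2", "F5", "F6"): 75,
    # Iteration 5: no candidate improves on 89.5, so the search stops.
    fs("F3", "F2", "F5", "F1", "F4"): 89.5, fs("F3", "F2", "F5", "F1", "F6"): 89.5,
}
result = fss_with_tiebreak(["F1", "F2", "F3", "F4", "F5", "F6"], lambda s: ACC[frozenset(s)])
print(result)  # → ['F3', 'F2', 'F5', 'F1'], as in the illustration

# Section 4 (hybrid): union of the subsets returned by FSS and BSS.
s1 = {"F2", "F4", "F5", "F7"}        # e.g. the FSS result
s2 = {"F2", "F3", "F4", "F5", "F7"}  # e.g. the BSS result
s3 = s1 | s2                         # = {"F2", "F3", "F4", "F5", "F7"}
```

At the iteration-2 tie, F1's accumulated average is (65 + 88) / 2 = 76.5 against F2's (70 + 88) / 2 = 79, so F2 is chosen, matching Table 2.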
5. CONCLUSIONS

In this paper it was identified that there is a need for a selection criterion for candidate attributes when a tie occurs during sequential selection algorithms; the criterion selects the feature having the greater accumulated average, ensuring that this attribute is the best among the candidates. Similarly, we proposed another way of selecting an optimal subset of features, by taking the union of the subsets produced by Forward Sequential Selection and Backward Sequential Selection, which ensures that, whatever the nature of the dataset, the resulting subset is the optimal subset, missing no feature that can play a role in the construction of the final classifier.

REFERENCES

[1]. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 2005.
[2]. P. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, 2nd edition, 2006.

[3]. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2006.
[4]. M. Hegland, Data Mining – Challenges, Models, Methods and Algorithms, 2003.
[5]. M. Dash and H. Liu, Feature Selection for Classification, Intelligent Data Analysis 1 (1997) 131–156.
[6]. Almuallim, H., and Dietterich, T.G., Learning with many irrelevant features. In: Proceedings of the Ninth National Conference on Artificial Intelligence, MIT Press, Cambridge, Massachusetts, 547–552, 1991.
[7]. Almuallim, H., and Dietterich, T.G., Learning Boolean Concepts in the Presence of Many Irrelevant Features. Artificial Intelligence, 69(1–2):279–305, November 1994.
[8]. Siedlecki, W. and Sklansky, J., On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2:197–220, 1988.
[9]. Lei Yu and Huan Liu, Efficient Feature Selection via Analysis of Relevance and Redundancy, Journal of Machine Learning Research 5 (2004) 1205–1224.
[10]. P. Mitra, C. A. Murthy, and S. K. Pal. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):301–312, 2002.
[11]. H. Liu, H. Motoda, and L. Yu. Feature selection with selective sampling. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 395–402, 2002.
[12]. M. Robnik-Sikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003.
[13]. Y. Kim, W. Street, and F. Menczer. Feature selection for unsupervised learning via evolutionary search. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 365–369, 2000.
[14]. M. Dash, K. Choi, P. Scheuermann, and H. Liu. Feature selection for clustering – a filter solution. In Proceedings of the Second International Conference on Data Mining, pages 115–122, 2002.
[15].
[16]. A. Miller. Subset Selection in Regression. Chapman & Hall/CRC, 2nd edition, 2002.
[17]. D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 284–292, 1996.
[18]. David W. Aha and Richard L. Bankert, A Comparative Evaluation of Sequential Feature Selection Algorithms, Springer-Verlag, 1996.
[19]. G. John, R. Kohavi, and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," in Proceedings of the Eleventh International Conference on Machine Learning, Rutgers, NJ, July 1994.
[20]. S. Salzberg, "Improving Classification Methods via Feature Selection," Johns Hopkins TR, 1992.
[21]. Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for SVMs. In Advances in Neural Information Processing Systems 13. Edited by Solla SA, Leen TK, Müller K-R. Cambridge, MA: MIT Press, 2001.
