Hybrid Sequential Feature Selection -II

Published by riazahmad82 on Jul 24, 2009

Assignment No. 3

Submitted by: Riaz Ahmad (Registration No. L1R08MSCS0006)

University of Central Punjab, Department of Information & Technology, Lahore.

Riaz Ahmad
Faculty of Information and Technology University of Central Punjab, Lahore, Pakistan. {riaz.ahmad@ucp.edu.pk}

ABSTRACT: Data collected for data mining often contains many irrelevant as well as redundant features. These features need to be removed, because they do not contribute to the target concept [1], and in many applications they must be removed before learning can work well. Removing them improves the efficiency of the learning algorithm and makes the model simpler and more general. Many algorithms have been introduced for feature selection, e.g. [5,8,9,16,17,19,20]. The focus of this paper is the Forward Sequential Selection (FSS) and Backward Sequential Selection (BSS) feature selection techniques. In these techniques a feature is added or removed by comparing accuracy rates, but there is no selection criterion for the case in which two candidate features have the same accuracy rate. This work adds a selection criterion for breaking such ties, and proposes a hybrid technique, BF Sequential Feature Selection, which incorporates this criterion into both techniques and shows better results than either technique alone. Let F be the set of features, S1 = {F1, F3, F4} the result of applying FSS, and S2 = {F1, F3, F4, F5} the result of applying BSS with the incorporated selection criterion. The final result of the proposed hybrid technique is the union of both, S = {F1, F3, F4, F5}.

1. INTRODUCTION

Data Mining is an emerging technology used to find hidden, interesting and previously unknown patterns in data [1]. The data to be mined can be in any form, e.g. relations, text, images, etc. The question arises: why do we need dedicated data mining algorithms and tools, or, in simple words, can we use the powerful features provided by SQL to achieve this goal?
The answer is simply 'no', because SQL queries merely present the data in different forms, either detailed or summarized, and mostly return patterns that we already know. Data mining algorithms give us patterns that we do not already know: unusual, but of interest to us. To elaborate, just as coal miners dig through soil to reach the coal that is their real objective, in data mining our objective is to dig into the data and come up with interesting, previously unknown results for our stakeholders. A finding such as 'all pregnant patients are female' is not an unknown result; there is no need to present such results, because they are already known or of no interest to the stakeholder. Data mining is also known as Knowledge Discovery. Nowadays, amid constant new technology developments, companies hold massive data in the form of databases, data marts and data warehouses to run their day-to-day business tasks as well as for decision making. Data mining, as discussed above, helps to uncover the hidden patterns lying under this data. In this era data mining is practical because computing power is affordable and data mining tools are readily available in the market [2].

1.1 Prerequisites for Data Mining

Data mining draws on several fields of science, including machine learning, neural networks, statistics, database systems and data warehousing. To learn about data mining we must have some knowledge of these areas [1].

1.2 Importance/Need of Data Mining

Organizations have been storing their day-to-day data for years; they have terabytes of it, and they bear the cost of keeping it. They were forced to think about what to do with this data. The concept of warehousing then arose for efficient reporting, but organizations still could not use their data in an efficient way.
Data warehousing helps them to view data efficiently and, to some extent, supports decision making. Then the concept of data mining came into existence. Data mining helps to make decisions in a validated way. With its help we can find relationships or associations between products: for example, when a customer buys eggs, what other product is he likely to buy as well, and vice versa [2]. The stakeholder can then keep the most commonly co-purchased products together, or offer promotions on the associated products. In the same way, data mining can help the stakeholder with forecasting, that is, prediction. Data mining is used both to reduce cost and to increase revenue.

1.3 Data Mining Process Model

The following steps are involved in data mining [5]:

• Data cleaning (removes or transforms noise and inconsistent data)
• Data integration (combines data from multiple data sources)
• Data selection (data relevant to the analysis task are retrieved from the database)
• Data transformation (data are transformed or consolidated into forms appropriate for mining)
• Data mining (model construction; algorithms are applied to uncover patterns)
• Pattern evaluation (checking results)
• Knowledge presentation (using visualization and knowledge representation techniques)

1.4 Types of Data Mining

The types of data mining fall into two categories. The first concerns the type of data to be mined: text mining, web mining, graph mining, etc. The second comprises the different ways of mining the data [3,4], which include the following:

• Association Rule Mining
• Classification
• Clustering
• Prediction

• Regression

Let us discuss these types of data mining algorithms.

1.4.1 Association Rule Mining

This technique is used to find interesting relationships among data items. The most famous application of association rule mining is market-basket analysis, in which we observe the different buying trends of customers. It helps the stakeholder to offer promotions and to decide the placement of items; for example, eggs and butter that are frequently purchased together can be kept together. Association rule mining is unsupervised.

1.4.2 Classification

Classification is used to divide data items into classes on the basis of a class attribute. A well-known example is to divide given sales data into the class of customers who buy and the class of those who do not. Classification is supervised, since the class attribute on whose basis the data is divided is already known. The best-known example of classification is the decision tree.

1.4.3 Clustering

Clustering is used to group items having similar characteristics. An example is grouping the students of a university into different groups according to the similarities found among them, for instance into stronger and weaker students. Clustering is unsupervised, unlike classification, in which the class attribute is given.

1.4.4 Prediction

Prediction means telling about the future; it is most often used in forecasting product sales. The technique examines the previous history of the data. For example, to predict whether it will rain next week, one would examine the climate conditions of the previous days and of the same days last year. Similarly, one might want to predict the passing percentage of students next year.

1.5 Applications of Data Mining

Data mining has applications in almost every field of life.
The main areas of application are:

• Finance (loan application processing, credit card analysis)
• Insurance (claims, fraud analysis)
• Telecommunications (call history analysis, fraud detection, promotions)
• Transport
• Marketing & Sales
• Electricity supply forecasting
• Medical

Let us discuss an example from each area. In finance/banking, data mining is used in loan application processing: it helps decide whether to accept or reject a loan application by analyzing different attributes of the applicant, such as region, salary, years of service with the current employer, years of living at the current address, family background, and previous loan history. This supports more accurate decisions; for example, the analysis might reveal that applicants from a particular region have historically defaulted more often. The same applies to credit card fraud detection, where the customer's history is analyzed. For example, if a cardholder's history shows that he uses the card mostly at the very start of the month and never makes very large transactions, then an unusual transaction made by someone using the card illegally can be detected easily.

Insurance companies use data mining for claims and fraud analysis. For example, it can help an insurance company decide whether or not to insure a particular person by analyzing different attributes from the customer's profile. Data about the customer from external sources, that is from other companies, can also be used besides the data the customer provides.

In telecommunications, a company that wants to offer different packages can use data mining, by analyzing traffic, to decide which promotions to offer, for which age groups, during which hours, and in which regions. A major use of data mining in telecommunications is also fraud detection, by observing customer behaviour (voucher recharging, call durations, etc.).

In transport, a bus or airline company can apply data mining to decide about a new route, by analyzing passengers' travelling trends and their likes or dislikes during the journey, or to decide whether to increase or decrease the number of buses or airplanes on a specific route. All this is possible only through the application of data mining.

In marketing & sales, as discussed earlier under market-basket analysis, a shopkeeper might offer promotions on different items based on the buying trends of consumers at different places. Similarly, the placement of items can be guided by data mining results.

Power supply forecasting is also a major application area. Power supply companies use data mining to analyze the power usage of domestic and commercial areas during different hours and months, and to forecast the power requirements for the next month, quarter or year. In Europe, for instance, electricity is charged at different rates during different hours of the day.
Such decisions are only possible with the use of data mining. There are many other application areas as well; in short, data mining has become a necessity for industry. In Europe data mining has been in use for years; in Pakistan, industries have only recently adopted it after recognizing its importance.

Section 2 discusses the FSS and BSS feature selection methods from the material studied for supervised learning. Section 3 introduces and explains the criterion added to FSS and BSS for selecting a feature among candidate features. Section 4 discusses the newly proposed hybrid technique. Section 5 concludes this work with the key findings.

2. MATERIAL AND METHOD

Data for mining is collected from different sources, and cleansing and pre-processing are performed on it. During pre-processing some features/attributes are added and some are removed. Before adding an attribute, it must be established whether the attribute has worth; similarly, removing an attribute or feature must not compromise the accuracy rate of the classifier. The purpose of removing attributes is to simplify the data as much as possible, so that the efficiency of the algorithm is increased and the model is simpler [5]. Attributes can be divided into two types, relevant and irrelevant [6]. Feature selection has been an active field of research and development for decades in statistical pattern recognition [10], machine learning [11,12], data mining [13,14] and statistics [15]. It has proven effective, in both theory and practice, in enhancing learning efficiency, increasing accuracy, and simplifying learned results.

2.1 Feature Selection

'Let G be some subset of F and fG be the value vector of G. In general, the goal of feature selection can be formalized as selecting a minimum subset G such that P(C | G = fG) is equal or as close as possible to P(C | F = f), where P(C | G = fG) is the probability distribution of different classes given the feature values in G and P(C | F = f) is the original distribution given the feature values in F' [5]. The method of selecting only the relevant attributes is called feature selection. Feature selection attempts to select the smallest subset of features such that:

• the classification accuracy does not significantly decrease; and
• the resulting class distribution is as close as possible to the original class distribution given all features [16].

The following figure shows the general feature selection process [5].
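To make the wrapper-style evaluation implicit in this definition concrete, the sketch below scores a candidate subset G by the leave-one-out accuracy of a classifier restricted to G. The 1-nearest-neighbour scorer and the toy data are illustrative assumptions, not part of the original paper:

```python
# Minimal sketch: score a feature subset G by classifier accuracy,
# approximating how well P(C | G) preserves P(C | F).
# The 1-NN scorer and the toy data are illustrative assumptions.

def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    using only the features whose indices are in `subset`."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for j in range(len(X)):
            if i == j:
                continue
            d = sum((X[i][k] - X[j][k]) ** 2 for k in subset)
            if d < best_dist:
                best_dist, best_label = d, y[j]
        correct += (best_label == y[i])
    return correct / len(X)

# Toy data: 4 features; only features 0 and 1 separate the classes.
X = [[0, 0, 5, 1], [0, 1, 3, 7], [1, 0, 4, 2],
     [9, 9, 5, 1], [9, 8, 3, 7], [8, 9, 4, 2]]
y = [0, 0, 0, 1, 1, 1]

print(loo_1nn_accuracy(X, y, {0, 1}))   # -> 1.0 (relevant features)
print(loo_1nn_accuracy(X, y, {2, 3}))   # -> 0.0 (irrelevant features)
```

The same scorer can serve as the evaluation function inside any of the search strategies discussed below.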

[Figure: Original Set -> Subset Generation -> Candidate Subset -> Subset Evaluation -> Current Best Subset -> Stop Condition (No: repeat; Yes: Selected Subset)]

Fig. 1. Framework of feature selection.

The diagram in Figure 1 exhibits the traditional feature selection framework. Subset generation produces the possible candidate subsets; at the second stage, each candidate subset is evaluated and compared with the previous one with respect to some measuring criterion. If the newly evaluated subset is better than the previous one, it replaces it. This process is repeated until the given stop condition is reached. In this traditional approach only the relevancy of the attributes is considered. [9] proposed a new framework of feature selection, claiming that relevance among the attributes alone is not enough and that feature redundancy is another necessary metric.
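The two-step relevance/redundancy idea of [9] can be sketched as a simple filter. The Pearson-correlation measure and both thresholds below are illustrative choices for this sketch, not the measures actually used in [9]:

```python
# Sketch of a two-step filter: first keep features relevant to the
# class, then drop features redundant with ones already kept.
# Correlation measure and thresholds are illustrative assumptions.

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def relevance_redundancy_filter(X, y, rel_thresh=0.5, red_thresh=0.9):
    cols = list(zip(*X))                       # column-major view
    # Step 1: relevance -- correlation with the class label.
    relevant = [i for i, c in enumerate(cols)
                if abs(pearson(c, y)) >= rel_thresh]
    # Step 2: redundancy -- drop a feature that is highly
    # correlated with a feature already selected.
    selected = []
    for i in relevant:
        if all(abs(pearson(cols[i], cols[j])) < red_thresh
               for j in selected):
            selected.append(i)
    return selected

# Toy data: feature 1 is an exact multiple of feature 0 (redundant).
X = [[1, 2, 10], [2, 4, 9], [3, 6, 11], [4, 8, 10], [5, 10, 12]]
y = [0, 0, 1, 1, 1]
print(relevance_redundancy_filter(X, y))   # -> [0, 2]
```

Here feature 1 survives the relevance step but is removed in the redundancy step because it duplicates feature 0, mirroring the two stages of Figure 2 below.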

[Figure: Original Set -> Relevance Analysis -> Relevant Subset -> Redundancy Analysis -> Selected Subset]

Fig. 2. New framework of feature selection.

The diagram in Figure 2 is an enhancement of the traditional feature selection framework, composed of two steps: the first step removes the irrelevant features, and the second removes the redundant ones. Its advantage over the traditional framework is that it yields a subset closer to the optimal one. Feature selection algorithms are typically composed of the following three components [17]:

• a search algorithm, which searches the space of feature subsets, which has size 2^d, where d is the number of features;
• an evaluation function, which takes a feature subset as input and outputs a numeric evaluation; the search algorithm's goal is to maximize this function;
• performance, i.e. classification.

The most common sequential search algorithms for feature selection are forward sequential selection (FSS) and backward sequential selection (BSS). Let us discuss these techniques one by one in depth.

2.2 Forward Sequential Selection

Forward Sequential Selection (FSS) starts with an empty set, evaluates all feature subsets with exactly one feature, and selects the one with the best performance. It then adds to this subset the feature yielding the best performance. This process repeats until adding further attributes brings no improvement. FSS thus selects, at each step, the locally best feature, which constitutes one of its drawbacks: the algorithm cannot revisit a previously added feature [16,17].

2.3 Backward Sequential Selection

Backward Sequential Selection (BSS) is the reciprocal of FSS. It begins with all features and removes a feature if its removal increases the performance of the classifier. This cycle continues until the performance of the classifier remains the same or begins to decrease. BSS has frequently outperformed FSS, perhaps because BSS evaluates the contribution of a given feature in the context of all other features, whereas FSS can evaluate the utility of a single feature only in the limited context of the previously selected features.
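The greedy FSS loop described above can be sketched as follows. The scoring function here is a toy stand-in for a real classifier's accuracy, included only so the sketch is runnable:

```python
# Greedy forward sequential selection (FSS): start from the empty
# set, repeatedly add the feature that most improves the score,
# and stop when no remaining feature improves it.
# `score` is any subset-evaluation function (e.g. classifier accuracy).

def forward_sequential_selection(n_features, score):
    selected = []
    best_score = score(selected)
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension of the current subset.
        gains = [(score(selected + [f]), f) for f in candidates]
        cand_score, cand_feat = max(gains)
        if cand_score <= best_score:        # no improvement: stop
            break
        selected.append(cand_feat)
        best_score = cand_score
    return selected, best_score

# Illustrative scorer: pretend features 2 and 0 each add accuracy
# and everything else adds nothing (a stand-in for a classifier).
def toy_score(subset):
    return 0.5 + 0.3 * (2 in subset) + 0.1 * (0 in subset)

selected, best = forward_sequential_selection(5, toy_score)
print(selected)   # -> [2, 0]
```

BSS is the mirror image: start with all features, and at each step remove the feature whose removal most improves (or least degrades) the same scoring function.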
[16,17] note this problem with FSS, but their results favour neither algorithm. Based on these observations, it is not clear whether FSS will outperform BSS on a given dataset with an unknown number of feature interactions.

2.4 Plus l take away r Method

The plus l take away r method, shortly (l, r), makes use of both forward sequential selection and backward sequential selection: l iterations of forward sequential search are followed by r iterations of backward sequential search, and this cycle is repeated until the required number of features is reached [10]. This method overcomes the problem stated above for FSS and BSS, but only partially, and it raises a further problem: what are the optimal values of l and r for a moderate computational cost?

2.5 Sequential Floating Forward Selection

Sequential Floating Forward Selection (SFFS) also makes use of both forward and backward sequential selection. Each forward selection step is followed by a number of backward selection steps, until the optimal subset of features is selected. Backtracking thus becomes possible in this algorithm, and it




requires no parameter, unlike the plus l take away r method [16].

3. PROPOSED CRITERION

A drawback common to both FSS and BSS is that neither handles a tie between candidate features, whether for addition (FSS) or deletion (BSS). Let me discuss this in detail to elaborate the problem. Suppose that in FSS we start with the empty set S = {} and check the remaining features one by one, calculating the accuracy rate obtained by adding each. Suppose feature F1 yields the highest accuracy; we add it to S, so S = {F1}. Next we repeat the process for the remaining features, and it happens that features F3 and F4 yield the same accuracy rate. Which of them should be added? Here I propose that a history be maintained for each feature during each cycle. When such a tie occurs, the decision is taken with the help of this history: among the tied candidates, the attribute with the greater accumulated average accuracy is selected. The same criterion can be added to BSS.

3.1 Illustration

Let us take an example to elaborate the proposed selection criterion for selecting the best feature among the candidate features. Suppose we have a set F of features and we want to obtain the optimal subset among them by using Forward Sequential Selection (FSS).

Iteration 1. S = {}

Table 1
Feature | Accuracy Rate | Acc. Avg. Rate
F1      | 65            |
F2      | 70            |
F3      | 87            |
F4      | 85            |
F5      | 80            |
F6      | 75            |

Feature F3 in Table 1 has the highest accuracy rate, so it is selected and S becomes S = {F3}.

Iteration 2. S = {F3}

Table 2
Feature | Accuracy Rate | Acc. Avg. Rate
F1      | 88            | 76.5
F2      | 88            | 79
F4      | 85            |
F5      | 80            |
F6      | 75            |

Features F1 and F2 in Table 2 are tied, so we compute their accumulated averages and select the feature with the larger value. Our set becomes S = {F3, F2}.

Iteration 3. S = {F3, F2}

Table 3
Feature | Accuracy Rate | Acc. Avg. Rate
F1      | 88            |
F4      | 86            |
F5      | 89            |
F6      | 78            |

Feature F5 in Table 3 raises the accuracy rate, so we add F5 and the set becomes S = {F3, F2, F5}.

Iteration 4. S = {F3, F2, F5}

Table 4
Feature | Accuracy Rate | Acc. Avg. Rate
F1      | 89.5          |
F4      | 85            |
F6      | 75            |

Feature F1 in Table 4 raises the accuracy rate, so we add F1 and the set becomes S = {F3, F2, F5, F1}. Iteration 5 brings no increase in accuracy rate, so we stop the process. The final optimal subset is S = {F3, F2, F5, F1}. Note that we calculated the accumulated averages only for the attributes involved in a tie, thereby saving computational cost.

4. HYBRID NATURE

Here we propose another technique that makes use of the most common and popular sequential feature selection algorithms, FSS and BSS. Since it is not clear whether FSS will outperform BSS on a given dataset with an unknown amount of feature interactions, we propose taking the union of the feature sets returned by both algorithms; then, whatever the nature of the dataset, our selection of features will be better than performing only one of the algorithms. Suppose we have a dataset with ten features, S = {F1, F2, F3, ..., F10}, and performing both algorithms on this dataset yields the subsets S1 = {F2, F4, F5, F7} and S2 = {F2, F3, F4, F5, F7}. Taking the union of both resulting subsets gives the final result S3 = S1 U S2 = {F2, F3, F4, F5, F7}. In this way we retain all the relevant features while reducing the chance of error arising, as noted above, from the dependence of the results on the choice of algorithm. By taking the union of the resulting subsets, we obtain all the features best for the classifier.

5.
CONCLUSIONS

In this paper it was identified that a selection criterion is needed for the case in which a tie occurs between candidate attributes during sequential selection algorithms; the criterion selects the feature with the greater accumulated average, ensuring that this attribute is the best among the candidates. We also proposed another way of selecting the optimal subset of features, by taking the union of the subsets produced by Forward Sequential Selection and Backward Sequential Selection, which ensures that, whatever the nature of the dataset, the resulting subset is optimal and misses no feature that can play a role in the construction of the final classifier.

REFERENCES

[1]. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 2005.
[2]. P. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, 2nd edition, 2006.

[3]. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2006.
[4]. M. Hegland, Data Mining: Challenges, Models, Methods and Algorithms, 2003.
[5]. M. Dash and H. Liu, Feature Selection for Classification, Intelligent Data Analysis 1 (1997) 131–156.
[6]. H. Almuallim and T. G. Dietterich, Learning with many irrelevant features. In: Proceedings of the Ninth National Conference on Artificial Intelligence, MIT Press, Cambridge, Massachusetts, 547–552, 1992.
[7]. H. Almuallim and T. G. Dietterich, Learning Boolean Concepts in the Presence of Many Irrelevant Features. Artificial Intelligence, 69(1–2):279–305, November 1994.
[8]. W. Siedlecki and J. Sklansky, On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2:197–220, 1988.
[9]. Lei Yu and Huan Liu, Efficient Feature Selection via Analysis of Relevance and Redundancy, Journal of Machine Learning Research 5 (2004) 1205–1224.
[10]. P. Mitra, C. A. Murthy, and S. K. Pal, Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):301–312, 2002.
[11]. H. Liu, H. Motoda, and L. Yu, Feature selection with selective sampling. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 395–402, 2002.
[12]. M. Robnik-Sikonja and I. Kononenko, Theoretical and empirical analysis of Relief and ReliefF. Machine Learning, 53:23–69, 2003.
[13]. Y. Kim, W. Street, and F. Menczer, Feature selection for unsupervised learning via evolutionary search. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 365–369, 2000.
[14]. M. Dash, K. Choi, P. Scheuermann, and H. Liu, Feature selection for clustering: a filter solution. In Proceedings of the Second International Conference on Data Mining, pages 115–122, 2002.
[15].
[16]. A. Miller, Subset Selection in Regression. Chapman & Hall/CRC, 2nd edition, 2002.
[17]. D. Koller and M. Sahami, Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 284–292, 1996.
[18]. D. W. Aha and R. L. Bankert, A Comparative Evaluation of Sequential Feature Selection Algorithms, Springer-Verlag, 1996.
[19]. G. John, R. Kohavi, and K. Pfleger, Irrelevant Features and the Subset Selection Problem. In Proceedings of the Eleventh International Conference on Machine Learning, Rutgers, NJ, July 1994.
[20]. S. Salzberg, Improving Classification Methods via Feature Selection, Johns Hopkins TR, 1992.
[21]. J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, edited by S. A. Solla, T. K. Leen, and K.-R. Müller, Cambridge, MA: MIT Press, 2001.
