(IJCSIS) International Journal of Computer Science and Information Security,Vol.
An Unsupervised Feature Selection Method Based on Genetic Algorithm
Nasrin Sheikhi, Amirmasoud Rahmani, Reza Veisisheikhrobat
Department of Computer Engineering, Islamic Azad University of Iran, Research and Science Branch, Ahvaz, Iran
National Iranian South Oil Company (NISOC), Ahvaz, Iran
In this paper we describe a new unsupervised feature selection method for text clustering. The method introduces a new kind of feature that we call a multi term feature. A multi term feature is a combination of terms, where combinations may differ in length. We design a genetic algorithm to find the multi term features that have maximum discriminating power.
Keywords-multi term feature; discriminating power; genetic algorithm; fitness function
Reducing the dimensionality of a problem is, in many real-world problems, an essential step before any analysis of the data. The general criterion for reducing the dimensionality is the desire to preserve most of the relevant information of the original data according to some optimality criteria. Dimensionality reduction or feature selection has been an active research area in the pattern recognition, statistics, and data mining communities. The main idea of feature selection is to choose a subset of input features by eliminating features with little or no predictive information. In particular, feature selection removes irrelevant features, increases the efficiency of learning tasks, improves learning performance, and enhances the comprehensibility of learned results.

Depending on whether class label information is required, feature selection can be either unsupervised or supervised. Feature selection has been well studied in supervised classification. However, it is a fairly recent research topic and a challenging problem for clustering analysis, for two reasons. First, it is not easy to define a good criterion for evaluating the quality of a candidate feature subset, because accurate labels for the items are absent. Second, optimizing the defined criterion requires an exponentially increasing number of feature subset evaluations, which is impractical if the data set has a large number of features.

Some methods for unsupervised feature selection have been proposed in the literature, such as document frequency (DF), term contribution (TC), Term Variance Quality (TVQ), and Term Variance (TV).
In most of these methods, a criterion is defined to evaluate the relevance of a single term for clustering the documents and, depending on how much dimensionality reduction is required, a number of the most relevant features are selected.

In this paper we propose a novel feature selection method that evaluates the discriminating power of sets of terms, instead of raw terms, as features. The main idea of this method is that a feature that is irrelevant by itself may become relevant when used with other features. We therefore describe a new kind of feature named the Multi Term Feature (MTF), which is a feature made from a combination of terms.

We use a genetic algorithm to search the large space of different multi term features and find the most relevant of them. To achieve this goal, we designed a fitness function that estimates the discriminating power of MTFs.

The rest of this paper is organized as follows: the next section describes two methods for evaluating the relevance of MTFs. Section III explains how the genetic algorithm is used to find the best MTFs. Experimental results are presented in Section IV, and a conclusion is given in Section V.

II.
Because in many cases a single term cannot determine the subject of a document very well, we use MTFs to find the best terms for determining the clusters of documents. We must therefore define criteria for evaluating the relevance of MTFs. First, we must determine when an MTF appears in a document.

We define an appearance threshold to determine the presence of an MTF in a document: it is the minimum number of the MTF's terms that must appear in a document for the MTF itself to be considered to appear in that document.

We defined two criteria for evaluating the discriminating power of MTFs, described next.
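The appearance threshold can be illustrated with a minimal sketch. The function name `mtf_present`, the representation of an MTF as a set of terms, and the whitespace tokenization are assumptions made for illustration; the paper does not specify an implementation.

```python
def mtf_present(mtf_terms, document_terms, appearance_threshold):
    """Return True if the MTF is considered to appear in the document.

    Assumption: an MTF is a set of distinct terms, and it appears in a
    document when at least `appearance_threshold` of its terms occur there.
    """
    matched = len(set(mtf_terms) & set(document_terms))
    return matched >= appearance_threshold


# Hypothetical MTF and document, tokenized by whitespace for simplicity.
mtf = {"genetic", "algorithm", "fitness"}
doc = "a genetic algorithm evolves a population of candidate solutions".split()

print(mtf_present(mtf, doc, appearance_threshold=2))  # True: two MTF terms occur
print(mtf_present(mtf, doc, appearance_threshold=3))  # False: "fitness" is absent
```

With a threshold equal to the MTF length, every term must occur; a lower threshold lets an MTF match documents that contain only some of its terms.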
Modified Term Variance
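Term variance (TV) is one of the methods used to evaluate the quality of a term in a dataset for clustering documents. It is defined as follows:

$$v(t_i) = \sum_{j=1}^{N} \left( f_{ij} - \bar{f}_i \right)^2 \qquad (1)$$

where $f_{ij}$ is the frequency of term $t_i$ in document $d_j$, $\bar{f}_i$ is the mean frequency of $t_i$ over the documents, and $N$ is the number of documents. With this method, terms that have high frequency but are not uniformly distributed over the documents have a high TV value. We modified the TV method for use with MTFs: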