Published by: ijcsis on Feb 15, 2011
Copyright: Attribution Non-commercial
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011
An Unsupervised Feature Selection Method Based On Genetic Algorithm
Nasrin Sheikhi, Amirmasoud Rahmani
Department of Computer Engineering, Islamic Azad University of Iran, Research and Science Branch, Ahvaz, Iran

Reza Veisisheikhrobat
National Iranian South Oil Company (NISOC), Ahvaz, Iran
Abstract— In this paper we describe a new unsupervised feature selection method for text clustering. In this method we introduce a new kind of feature that we call the multi-term feature. A multi-term feature is a combination of terms, and multi-term features can have different lengths. We design a genetic algorithm to find the multi-term features that have the maximum discriminating power.
Keywords— multi-term feature; discriminating power; genetic algorithm; fitness function
I. INTRODUCTION
Reducing the dimensionality of a problem is, in many real-world problems, an essential step before any analysis of the data. The general criterion for reducing the dimensionality is the desire to preserve most of the relevant information of the original data according to some optimality criteria. Dimensionality reduction, or feature selection, has been an active research area in the pattern recognition, statistics and data mining communities. The main idea of feature selection is to choose a subset of input features by eliminating features with little or no predictive information. In particular, feature selection removes irrelevant features, increases the efficiency of learning tasks, improves learning performance and enhances the comprehensibility of learned results [2].

Depending on whether the class label information is required, feature selection can be either unsupervised or supervised. Feature selection has been well studied in supervised classification [3]. However, it is a quite recent research topic, and also a challenging problem, for clustering analysis, for two reasons: first, it is not easy to define a good criterion for evaluating the quality of a candidate feature subset in the absence of accurate labels for the items. Second, optimizing the defined criterion requires an exponentially increasing number of feature subset evaluations, which is in fact impractical if the dataset has a large number of features.

Some methods for unsupervised feature selection have been proposed in the literature, such as Document Frequency (DF), Term Contribution (TC), Term Variance Quality (TVQ) and Term Variance (TV).
In most of these methods a criterion is defined to evaluate the relevance of one term of the documents for clustering, and depending on how much dimensionality reduction is required, the corresponding number of most relevant features is selected.

In this paper we propose a novel feature selection method that evaluates the discriminating power of sets of terms instead of raw terms as features. The main idea of this method is that a feature that is irrelevant by itself may become relevant when used with other features. So we describe a new kind of feature, named the Multi-Term Feature (MTF), which is a feature made from a combination of terms. We use a genetic algorithm to search the large space of different multi-term features to find the most relevant of them. To achieve this goal we designed a fitness function to estimate the discriminating power of MTFs.

The rest of this paper is organized as follows: the next section describes two methods for evaluating the relevance of MTFs. Section III explains using the genetic algorithm to find the best MTFs. Experimental results are presented in Section IV, and a conclusion is given in Section V.

II. EVALUATING THE RELEVANCE OF MULTI-TERM FEATURES
Because in many cases one term cannot determine the subject of a document very well, we use MTFs to find the best sets of terms that can determine the clusters of documents. So we must define criteria for evaluating the relevance of MTFs. First, we must determine when an MTF appears in a document. We defined an appearance threshold to determine the presence of an MTF in a document: the minimum number of the MTF's terms that must appear in a document for the MTF to be considered present in that document too. The two criteria that we defined for evaluating the discriminating power of MTFs are as follows:
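The appearance-threshold rule just defined can be sketched as a small predicate; a minimal sketch in Python, where the tokenized document and the MTF's term list are hypothetical examples:

```python
def mtf_appears(mtf_terms, doc_tokens, threshold):
    """An MTF is present in a document when at least `threshold`
    of its terms occur there (the appearance-threshold rule)."""
    vocab = set(doc_tokens)
    present = sum(1 for term in mtf_terms if term in vocab)
    return present >= threshold

# Example: an MTF of three terms with appearance threshold 2.
doc = "genetic algorithms select features for text clustering".split()
print(mtf_appears(["genetic", "feature", "clustering"], doc, 2))  # → True
```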
A. Modified Term Variance
Term variance is one of the methods used to evaluate the quality of a term in a dataset for clustering the documents. The equation of this method is as follows:
v(t_i) = \sum_{j=1}^{N} [f_{ij} - \bar{f}_i]^2    (1)

where f_{ij} is the frequency of term i in document j and \bar{f}_i is its average frequency over the N documents. In this method, terms that have a high frequency but a non-uniform distribution over the documents will have a high TV value. We modified the TV method to work with MTFs:
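Equation (1) translates directly into Python; a minimal sketch, with a hypothetical corpus given as per-document term-frequency dictionaries:

```python
def term_variance(term, docs):
    """Term variance: the sum over documents of the squared deviation
    of the term's frequency from its mean frequency."""
    freqs = [doc.get(term, 0) for doc in docs]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)

# "gene" is frequent but unevenly distributed, so its TV value is high.
docs = [{"gene": 3, "cell": 1}, {"gene": 1}, {"market": 4}]
print(term_variance("gene", docs))
```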
116 | http://sites.google.com/site/ijcsis/ | ISSN 1947-5500
 
 
v(MTF_i, th) = \sum_{j=1}^{N} [vf_{ij,th} - \bar{vf}_{i,th}]^2    (2)

In this relation vf_{ij,th} is the frequency of the ith MTF in document j with appearance threshold th, and \bar{vf}_{i,th} is the average frequency of the ith MTF over all documents. The frequency of an MTF is measured as follows:

vf_{ij,th} = contains(MTF_i, d_j, th) \cdot \frac{|\{k \le m : t_k \in d_j\}|}{length(MTF_i)}    (3)

In this relation t_k is the kth term of the MTF, m is the number of terms of the MTF, d_j is the jth document in the dataset, |\{k \le m : t_k \in d_j\}| is the number of the MTF's distinct terms that appear in d_j, and length(MTF_i) is the length of the MTF. contains(MTF_i, d_j, th) is the logical function that returns TRUE if d_j contains the MTF (at appearance threshold th) and FALSE otherwise.

\bar{vf}_{i,th} is measured as follows:

\bar{vf}_{i,th} = \frac{1}{N} \sum_{j=1}^{N} vf_{ij,th}    (4)
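A minimal Python sketch of the modified term variance, under one assumed reading: the MTF frequency in a document is the fraction of the MTF's terms present there, counted only when the appearance threshold is met. The helper names are illustrative, not the authors' code:

```python
def mtf_frequency(mtf_terms, doc_tokens, threshold):
    """Assumed MTF frequency: the fraction of the MTF's terms that
    appear in the document, or 0 when fewer than `threshold` appear."""
    vocab = set(doc_tokens)
    present = sum(1 for t in set(mtf_terms) if t in vocab)
    return present / len(mtf_terms) if present >= threshold else 0.0

def modified_term_variance(mtf_terms, corpus, threshold):
    """Equation (2): variance of the MTF frequency over all documents,
    around the average MTF frequency of equation (4)."""
    freqs = [mtf_frequency(mtf_terms, doc, threshold) for doc in corpus]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)

corpus = [["gene", "cell"], ["gene", "cell", "clustering"], ["market"]]
print(modified_term_variance(["gene", "cell"], corpus, 2))
```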
B. Dependency Between Terms

Another criterion that we define to evaluate the relevance of MTFs is the dependency between the terms of an MTF, which is measured by this equation:

dep(MTF_i) = \frac{\sum_{j=1}^{N} vf_{ij,th}}{\sum_{j=1}^{N} \sum_{k=1}^{m} f_{kj}}    (5)

where f_{kj} is the frequency of the kth term of the MTF in document j. Our goal is to find the MTFs that have high discriminating power, so we look for MTFs whose terms belong to the same subject and appear, most of the time, in the documents of that subject. The dependency between the terms of an MTF is the ratio of the sum of the MTF's frequency over all documents to the sum of the frequencies of the MTF's terms. This value shows whether the terms of the MTF mostly appear in documents together or separately.
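The dependency ratio can be sketched the same way; again the per-document MTF frequency is an assumed reading (fraction of terms present, gated by the appearance threshold), so this is illustrative rather than the authors' exact formula:

```python
def dependency(mtf_terms, corpus, threshold):
    """Assumed reading of the dependency criterion: the MTF's total
    frequency over all documents divided by the total frequency of
    its individual terms."""
    mtf_total = 0.0
    term_total = 0
    for doc in corpus:
        vocab = set(doc)
        present = sum(1 for t in mtf_terms if t in vocab)
        if present >= threshold:              # appearance-threshold rule
            mtf_total += present / len(mtf_terms)
        term_total += sum(doc.count(t) for t in mtf_terms)
    return mtf_total / term_total if term_total else 0.0

# The two terms co-occur in only one of the three documents.
corpus = [["gene", "cell"], ["gene"], ["cell"]]
print(dependency(["gene", "cell"], corpus, 2))  # → 0.25
```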
 
III. USING GENETIC ALGORITHM
As we already mentioned, our goal is to find the best MTFs that can determine the clusters of documents. Because of the large number of MTFs that can be extracted from documents, a search algorithm that can search a huge amount of data is necessary. The genetic algorithm is one of the best algorithms for finding the best solutions to a problem among the large number of solutions in the problem's search space. So we use a genetic algorithm as our search strategy to find the most discriminating MTFs that exist in the search space.
A. Chromosomes

Each chromosome in this method is an MTF, and chromosomes can have different lengths. Each gene of a chromosome is a term of the MTF. So a chromosome is represented as a set of terms, not as a binary code.
B. Initial population

The initial population is a set of a specific number of chromosomes.
C. Crossover and Mutation

The genetic algorithm generates new solutions by recombining the genes of the current best solutions. This is accomplished through the crossover and mutation operators. In a one-point crossover, the crossing point is selected at random and the genes from one side of the chromosomes are exchanged. In our model, because of the different lengths of the chromosomes, the crossover method is different too. In this method a crossing point is selected at random on each of the parent chromosomes. Then one side of each chromosome is exchanged, so the two chromosomes resulting from this kind of crossover do not have equal lengths. The mutation operator selects one gene position in the chromosome at random, and then exchanges it with a term selected randomly from the documents.
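The two operators can be sketched as follows; the vocabulary pool passed to the mutation operator is a hypothetical list of terms drawn from the documents:

```python
import random

def crossover(parent_a, parent_b):
    """One-point crossover with an independent cut point on each parent,
    so the two children generally do not inherit the parents' lengths."""
    cut_a = random.randint(1, len(parent_a) - 1)
    cut_b = random.randint(1, len(parent_b) - 1)
    return (parent_a[:cut_a] + parent_b[cut_b:],
            parent_b[:cut_b] + parent_a[cut_a:])

def mutate(chromosome, vocabulary):
    """Replace one randomly chosen gene (term) with a term selected
    at random from the documents' vocabulary."""
    child = list(chromosome)
    child[random.randrange(len(child))] = random.choice(vocabulary)
    return child

a, b = ["genetic", "algorithm"], ["text", "document", "clustering"]
child1, child2 = crossover(a, b)
# The total number of genes is conserved, though each child's length may differ.
print(len(child1) + len(child2) == len(a) + len(b))  # → True
```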
D. Fitness function

The objective function is the cornerstone of the genetic process. We designed the following fitness function to explore the space of solutions:

fitness(c_i) = v(c_i, th) \cdot dep(c_i) \cdot length(c_i)    (6)

In this function c_i is the ith chromosome, fitness(c_i) is the fitness value of c_i, v(c_i, th) is the modified term variance value of c_i with appearance threshold th, dep(c_i) is the value of the dependency between the terms of c_i, and length(c_i) is the length of c_i. We described the modified term variance and the dependency between the terms of an MTF in Section II. The remaining part of our fitness function is length(c_i).
During the genetic process the chromosomes' lengths increase because of the crossover operating on the population. As the length increases, the probability that a chromosome is present in a document decreases, and consequently the values of the modified term variance and dependency criteria for that chromosome decrease. The fitness of longer chromosomes would therefore decrease, the algorithm would tend to select shorter chromosomes, and the method would degenerate to the usual single-term methods. By including the chromosome's length in the fitness function, we give the longer chromosomes more chance to be selected as relevant chromosomes.
E. The proposed algorithm

The designed algorithm is as follows:

1. An initial set of solutions is established at random. This population contains chromosomes, i.e., MTFs made of terms selected randomly from the documents.
2. The fitness value of each chromosome is measured by the fitness function, and the stopping criteria are tested. As a general criterion, the genetic process is stopped when the maximum fitness does not increase over a few iterations.
3. Selection, mutation and crossover operate on the population.
4. The new population is generated and the iterative process loops back to step 2.
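The four steps can be sketched end to end. Everything below is a toy illustration, not the authors' implementation: the product-form fitness, the keep-the-best-half selection scheme, the fixed generation count, and all parameter values are assumptions:

```python
import random

def mtf_freq(mtf, doc, th):
    """Fraction of the MTF's terms present in the document, counted
    only when the appearance threshold th is met (assumed reading)."""
    vocab = set(doc)
    present = sum(1 for t in mtf if t in vocab)
    return present / len(mtf) if present >= th else 0.0

def fitness(mtf, corpus, th):
    """Toy fitness: variance of the MTF frequency across documents,
    scaled by the MTF's length so longer MTFs are not unduly penalized."""
    freqs = [mtf_freq(mtf, doc, th) for doc in corpus]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs) * len(mtf)

def run_ga(corpus, pop_size=20, generations=30, th=2, seed=0):
    rng = random.Random(seed)
    vocab = sorted({t for doc in corpus for t in doc})
    # Step 1: random initial population of 2- and 3-term MTFs.
    pop = [rng.sample(vocab, rng.choice([2, 3])) for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: evaluate every chromosome and rank by fitness.
        pop.sort(key=lambda c: fitness(c, corpus, th), reverse=True)
        # Step 3: selection keeps the best half; crossover and mutation refill.
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut_a = rng.randint(1, len(a) - 1)
            cut_b = rng.randint(1, len(b) - 1)
            child = a[:cut_a] + b[cut_b:]              # variable-length crossover
            child[rng.randrange(len(child))] = rng.choice(vocab)  # mutation
            children.append(child)
        # Step 4: the next generation, and the loop returns to step 2.
        pop = survivors + children
    return max(pop, key=lambda c: fitness(c, corpus, th))
```

A fixed generation count stands in here for the max-fitness stopping criterion described in step 2.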
 
IV. EXPERIMENTAL RESULTS

The following experiments compare the proposed genetic model with the term variance method. We chose 3 datasets from Reuters-21578, each containing 1000 documents. We chose K-means as the clustering algorithm. Since the K-means clustering algorithm is easily influenced by the selection of initial centroids, we randomly produced 10 sets of initial centroids for each dataset and averaged the performance of the 10 runs as the final clustering performance. We use Average Accuracy (AA) and F1-Measure (F1), as defined in [3], as clustering validity criteria to evaluate the accuracy of the clustering results. The results on the reut2-001, reut2-002 and reut2-003 datasets are shown in Fig. 1 to Fig. 6. From these figures, we can see that the proposed algorithm improves the clustering accuracy in most of the experimental results.
Figure 1. Precision comparison on reut2-001 (AA)
Figure 2. Precision comparison on reut2-001 (F1)
Figure 3. Precision comparison on reut2-002 (AA)
[Each figure plots Accuracy (y-axis) against the number of selected terms (x-axis: 1%, 5%, 10%, 20%, 25%, 30%) for the GA and TV methods.]