**An Unsupervised Feature Selection Method Based On Genetic Algorithm**

Nasrin Sheikhi, Amirmasoud Rahmani
Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Ahvaz, Iran

Reza Veisisheikhrobat
National Iranian South Oil Company (NISOC), Ahvaz, Iran

Abstract— In this paper we describe a new unsupervised feature selection method for text clustering. We introduce a new kind of feature that we call the multi-term feature (MTF): a combination of terms of varying length. We then design a genetic algorithm to find the multi-term features with the maximum discriminating power.

Keywords— multi-term feature; discriminating power; genetic algorithm; fitness function

I. INTRODUCTION

In many real-world problems, reducing the dimensionality of the data is an essential step before any analysis. The general criterion for dimensionality reduction is to preserve most of the relevant information of the original data according to some optimality criterion. Dimensionality reduction, or feature selection, has been an active research area in the pattern recognition, statistics and data mining communities. The main idea of feature selection is to choose a subset of the input features by eliminating features with little or no predictive information. In particular, feature selection removes irrelevant features, increases the efficiency of learning tasks, improves learning performance and enhances the comprehensibility of learned results [2]. Depending on whether class label information is required, feature selection can be either supervised or unsupervised. Feature selection has been well studied in supervised classification [3]. However, it is a fairly recent and challenging research topic for clustering analysis, for two reasons. First, it is not easy to define a good criterion for evaluating the quality of a candidate feature subset in the absence of accurate item labels. Second, optimizing the defined criterion requires an exponentially increasing number of feature subset evaluations, which is impractical when the data set has a large number of features. Several methods for unsupervised feature selection have been proposed in the literature, such as document frequency (DF), term contribution (TC), term variance quality (TVQ) and term variance (TV). Most of these methods define a criterion that evaluates the relevance of a single term for clustering; depending on how much dimensionality reduction is required, the most relevant features are then selected.

In this paper we propose a novel feature selection method that evaluates the discriminating power of sets of terms, rather than raw terms, as features. The main idea is that a feature that is irrelevant by itself may become relevant when used together with other features. We therefore describe a new kind of feature, the multi-term feature (MTF), built from a combination of terms. We use a genetic algorithm to search the large space of possible multi-term features for the most relevant ones; to this end, we design a fitness function that estimates the discriminating power of MTFs. The rest of this paper is organized as follows: the next section describes two criteria for evaluating the relevance of MTFs. Section III explains how the genetic algorithm finds the best MTFs. Experimental results are presented in Section IV, and a conclusion is given in Section V.

II. EVALUATING THE RELEVANCE OF MULTI-TERM FEATURES

Because a single term often cannot determine the subject of a document very well, we use MTFs to find the term sets that best determine the clusters of documents. We must therefore define criteria for evaluating the relevance of MTFs, and first determine when an MTF appears in a document. We define an appearance threshold: the minimum number of an MTF's terms that must appear in a document for the MTF to be considered present in that document. The two criteria we define for evaluating the discriminating power of MTFs are as follows.

A. Modified Term Variance

Term variance is one of the methods used to evaluate the quality of a term in a dataset for clustering documents. It is computed as follows:

v(t_i) = \sum_{j=1}^{N} [f_{ij} - \bar{f}_i]^2    (1)
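As a minimal illustration of equation (1) (the helper name and toy numbers are ours, not the paper's):

```python
def term_variance(freqs):
    """Term variance of a single term: freqs[j] is the term's
    frequency f_ij in document j; the score is the sum of squared
    deviations from the term's mean frequency."""
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)

# A term concentrated in one document scores higher than a
# uniformly spread one:
term_variance([4, 0, 0, 0])  # → 12.0
term_variance([1, 1, 1, 1])  # → 0.0
```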

In this method, terms that have a high frequency but not a uniform distribution over the documents receive a high TV value. We modified the TV method for use with MTFs:


http://sites.google.com/site/ijcsis/ ISSN 1947-5500

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011

v(MTF_i, th) = \sum_{j=1}^{N} [vf_{ij,th} - \bar{vf}_{i,th}]^2    (2)

In this relation vf_{ij,th} is the frequency of the ith MTF in document j with appearance threshold th, and \bar{vf}_{i,th} is the average frequency of the ith MTF over all documents. The frequency of an MTF is measured as follows:

vf_{ij,th} = (m_{ij} / length_i) \cdot \varphi(MTF_i, d_j, th)    (3)

In this relation t_k is the kth term of MTF_i, d_j is the jth document in the dataset, m_{ij} is the number of distinct terms of MTF_i that appear in d_j, length_i is the length of MTF_i, and \varphi is the logical function that returns 1 if d_j contains MTF_i (that is, at least th of its terms appear in d_j) and 0 otherwise.

\bar{vf}_{i,th} is measured as follows:

\bar{vf}_{i,th} = \frac{1}{N} \sum_{j=1}^{N} vf_{ij,th}    (4)

B. Dependency Between Terms

Another criterion we define to evaluate the relevance of MTFs is the dependency between the terms of an MTF, measured by:

dep(MTF_i) = \sum_{j=1}^{N} vf_{ij,th} \Big/ \sum_{k=1}^{length_i} \sum_{j=1}^{N} f_{kj}    (5)

Our goal is to find MTFs with high discriminating power, so we look for MTFs whose terms belong to the same subject and mostly appear in the documents of that subject. The dependency between the terms of an MTF is the ratio of the sum of the MTF's frequency over all documents to the sum of its individual terms' frequencies; this value shows whether the terms of the MTF tend to appear in documents together or separately.

III. USING GENETIC ALGORITHM

As already mentioned, our goal is to find the best MTFs, those that can determine the clusters of documents. Because of the large number of MTFs that can be extracted from documents, a search algorithm that can explore a huge space is necessary. A genetic algorithm is one of the best algorithms for finding good solutions among the very many candidates in a problem's search space, so we use a genetic algorithm as our search strategy to find the most discriminating MTFs.

A. Chromosomes

Each chromosome in this method is an MTF, and chromosomes can have different lengths. Each gene of a chromosome is one term of the MTF, so a chromosome is represented as a set of terms rather than as a binary code.

B. Initial Population

The initial population is a set of a specified number of chromosomes.

C. Crossover and Mutation

The genetic algorithm generates new solutions by recombining the genes of the current best solutions; this is accomplished through the crossover and mutation operators. In a one-point crossover, the crossing point is selected at random and the genes on one side of the chromosomes are exchanged. In our model, because chromosomes have different lengths, the crossover is different too: a crossing point is selected at random on each parent chromosome, then one side of each chromosome is exchanged, so the two offspring of such a crossover generally do not have equal length. The mutation operator selects one gene position in a chromosome at random and exchanges it with a term selected randomly from the documents.

D. Fitness Function

The objective function is the cornerstone of the genetic process. We designed the following fitness function to explore the space of solutions:

f(c_i) = v(c_i, th) \cdot dep(c_i) \cdot len(c_i)    (6)

In this function c_i is the ith chromosome, f(c_i) is its fitness value, v(c_i, th) is the modified term variance of c_i with appearance threshold th, dep(c_i) is the dependency between the terms of c_i, and len(c_i) is the length of c_i. We described the modified term variance and the dependency between the terms of an MTF in Section II; the remaining part of the fitness function is len(c_i).
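Taken together, the MTF criteria and the fitness function can be sketched in Python. The exact formulas are partly illegible in this copy, so the normalization inside the MTF frequency and the multiplicative combination in the fitness are our assumptions, and documents are modeled as sets of terms for brevity:

```python
def mtf_freq(mtf, doc, th):
    """Assumed form of Eq. (3): the fraction of the MTF's distinct
    terms present in the document, counted only when at least th of
    them appear (the appearance threshold), else 0."""
    m = sum(1 for t in set(mtf) if t in doc)
    return m / len(mtf) if m >= th else 0.0

def modified_tv(mtf, docs, th):
    """Eq. (2): variance of the MTF's per-document frequencies."""
    vf = [mtf_freq(mtf, d, th) for d in docs]
    mean = sum(vf) / len(docs)          # Eq. (4): average MTF frequency
    return sum((v - mean) ** 2 for v in vf)

def dependency(mtf, docs, th):
    """Eq. (5): total MTF frequency over the total frequency of its
    individual terms -- larger when the terms mostly occur together."""
    mtf_total = sum(mtf_freq(mtf, d, th) for d in docs)
    term_total = sum(1 for d in docs for t in set(mtf) if t in d)
    return mtf_total / term_total if term_total else 0.0

def fitness(mtf, docs, th):
    """Assumed multiplicative form of Eq. (6): the length factor
    offsets the natural drop in variance and dependency for longer
    chromosomes."""
    return modified_tv(mtf, docs, th) * dependency(mtf, docs, th) * len(mtf)

docs = [{"oil", "price"}, {"oil", "price"}, {"sport", "ball"}, {"sport"}]
fitness(["oil", "price"], docs, th=2)  # → 1.0
```

An MTF whose terms always co-occur, as above, keeps a high dependency; scattering its terms across unrelated documents lowers both criteria and hence the fitness.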

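The genetic search of Section III can be sketched as a small loop. The selection scheme (truncation), population size, initial chromosome lengths and stopping patience below are our placeholders rather than values from the paper, and any fitness function with the signature fit(chromosome, docs, th) can be plugged in:

```python
import random

def crossover(p1, p2):
    """Variable-length one-point crossover: an independent cut point
    on each parent, so the two children generally differ in length
    from their parents (both stay at least 2 genes long)."""
    i = random.randint(1, len(p1) - 1)
    j = random.randint(1, len(p2) - 1)
    return p1[:i] + p2[j:], p2[:j] + p1[i:]

def mutate(chrom, vocab):
    """Replace one randomly chosen gene with a random document term."""
    new = chrom[:]
    new[random.randrange(len(new))] = random.choice(vocab)
    return new

def run_ga(docs, vocab, fit, pop_size=20, th=2, patience=5):
    # 1. Random initial population of MTFs (length >= 2 so the
    #    crossover cut points above are always valid).
    pop = [random.sample(vocab, random.randint(2, 4)) for _ in range(pop_size)]
    best_val, best_mtf, stale = float("-inf"), None, 0
    while stale < patience:
        # 2. Evaluate fitness; stop when the maximum fitness has not
        #    increased for `patience` iterations.
        pop.sort(key=lambda c: fit(c, docs, th), reverse=True)
        top = fit(pop[0], docs, th)
        if top > best_val:
            best_val, best_mtf, stale = top, pop[0], 0
        else:
            stale += 1
        # 3. Selection (truncation), crossover and mutation.
        parents = pop[: max(2, pop_size // 2)]
        nxt = parents[:]
        while len(nxt) < pop_size:
            a, b = random.sample(parents, 2)
            c1, c2 = crossover(a, b)
            nxt += [mutate(c1, vocab), mutate(c2, vocab)]
        # 4. The new population replaces the old; back to step 2.
        pop = nxt[:pop_size]
    return best_mtf, best_val
```

For example, with a toy fitness that rewards MTFs whose terms co-occur in a document, `run_ga(docs, vocab, toy)` returns the best MTF found and its score.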

During the genetic process the chromosomes' lengths increase because of the crossover operating on the population. As the length increases, the probability that a chromosome is present in a document decreases, and consequently the modified term variance and dependency values of the chromosome decrease. The fitness of longer chromosomes would therefore decrease, and the algorithm would tend to select smaller chromosomes, degenerating into the usual single-term methods. By adding the length factor we give longer chromosomes more chance of being selected as relevant.

E. The Proposed Algorithm

The designed algorithm is as follows:
1. An initial set of solutions is established at random. This population contains chromosomes, i.e. MTFs made of terms selected randomly from the documents.
2. The fitness value of each chromosome is measured by the fitness function, and the stopping criteria are tested. As a general criterion, the genetic process is stopped when the maximum fitness does not increase over a few iterations.
3. Selection, mutation and crossover operate on the population.
4. The new population is generated and the iterative process resumes from step 2.

IV. EXPERIMENTAL RESULTS

Figure 1. Precision comparison on reut2-001 (AA): accuracy vs. number of selected terms (1%-30%), GA vs. TV.

Figure 2. Precision comparison on reut2-001 (F1): accuracy vs. number of selected terms (1%-30%), GA vs. TV.

The following experiments compare the proposed genetic model with the term variance method. We chose 3 datasets from Reuters-21578, each containing 1000 documents, and K-means as the clustering algorithm. Since K-means is easily influenced by the selection of the initial centroids, we randomly produced 10 sets of initial centroids for each dataset and averaged the performance over the 10 runs as the final clustering performance. We use Average Accuracy (AA) and F1-Measure (F1), as defined in [3], as clustering validity criteria to evaluate the accuracy of the clustering results. The results on the reut2-001, reut2-002 and reut2-003 datasets are shown in Fig. 1 to Fig. 6. From these figures we can see that the proposed algorithm improves the clustering accuracy in most of the experiments.
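For reference, Average Accuracy is typically computed by mapping cluster ids to class labels and taking the best such mapping; the brute-force sketch below is our reading of the criterion, not the paper's exact code, and is only suitable for a handful of clusters:

```python
from itertools import permutations

def average_accuracy(true_labels, cluster_ids):
    """Best clustering accuracy over all one-to-one assignments of
    cluster ids to class labels (brute force over permutations)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits / len(true_labels))
    return best

average_accuracy(["a", "a", "b", "b"], [0, 0, 1, 1])  # → 1.0
average_accuracy(["a", "a", "b", "b"], [0, 1, 0, 1])  # → 0.5
```

In the protocol above, this score would be computed for each of the 10 random K-means initializations and then averaged.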

Figure 3. Precision comparison on reut2-002 (AA): accuracy vs. number of selected terms (1%-30%), GA vs. TV.


Figure 4. Precision comparison on reut2-002 (F1): accuracy vs. number of selected terms (1%-30%), GA vs. TV.

Figure 5. Precision comparison on reut2-003 (AA): accuracy vs. number of selected terms (1%-30%), GA vs. TV.

Figure 6. Precision comparison on reut2-003 (F1): accuracy vs. number of selected terms (1%-30%), GA vs. TV.

V. CONCLUSION

In this paper we described a new feature selection method based on a genetic algorithm. We used a new kind of feature, the MTF, which is a set of terms, and defined criteria for evaluating the relevance of these features. The experimental results show that the proposed method can improve the accuracy of clustering.

ACKNOWLEDGMENT

The authors thank the National Iranian Oil Company (NIOC) and the National Iranian South Oil Company (NISOC) for their help and financial support.

REFERENCES

[1] Sullivan, D., Document Warehousing and Text Mining, John Wiley, New York, 2001.
[2] Miller, T., Data and Text Mining: A Business Applications Approach, Prentice Hall, New York, 2005.
[3] Liu, L., Kang, J., Yu, J. and Wang, Z., "A comparative study on unsupervised feature selection methods for text clustering", Proceedings of NLP-KE, pp. 597-601, 2005.
[4] Aliane, H., "An ontology based approach to multilingual information retrieval", IEEE Information and Communication Technologies, Vol. 1, pp. 1732-1737, 2006.
[5] Yu Lee, L. and Soo, V., "Ontology-based information retrieval and extraction", International Conference on Information Technology: Research and Education, Vol. 8, pp. 265-269, 2005.
[6] Nayyeri, A. and Oroumchian, F., "FuFaIR: a fuzzy Farsi information retrieval system", IEEE International Conference on Computer Systems and Applications, Vol. 3, pp. 1126-1130, 2006.
[7] Desjardins, G., Proulx, R. and Godin, R., "An auto-associative neural network for information retrieval", IEEE International Joint Conference on Neural Networks, Vol. 9, pp. 3492-3498, 2006.
[8] Tian, Q., "A foundational perspective for visual information retrieval", IEEE Multimedia, Vol. 13, pp. 90-92, 2006.
[9] Brunner, J., Naudet, Y. and Latour, T., "Information retrieval in multimedia: exploiting MPEG-7 metadata by the use of ontologies and fuzzy thematic spaces", Proceedings of the Sixth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'05), Vol. 7, pp. 1-6, 2005.
[10] Dong, A. and Li, H., "Multi-ontology based multimedia annotation for domain-specific information retrieval", Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC'06), Vol. 9, pp. 1-8, 2006.
[11] Heo, S., Motoyuki, S., Ito, A. and Makino, S., "An effective music information retrieval method using three-dimensional continuous DP", IEEE Transactions on Multimedia, Vol. 8, No. 3, pp. 633-639, 2006.
[12] Chang, C. and Kayed, M., "A survey of web information extraction systems", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 10, pp. 1411-1428, 2006.
[13] Kim, J. and Moldovan, D., "Acquisition of linguistic patterns for knowledge-based information extraction", IEEE Transactions on Knowledge and Data Engineering, Vol. 7, No. 5, pp. 713-724, 1995.
[14] Ramshaw, A. and Weischeldel, M., "Information extraction", IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 7, pp. 969-972, 2005.
[15] Lam, M. and Gong, Z., "Web information extraction", Proceedings of the 2005 IEEE International Conference on Information Acquisition, Vol. 1, pp. 569-601, 2005.
[16] Yang, S., Wu, X., Deng, Z., Zhang, M. and Yang, D., "Relative term-frequency based feature selection for text categorization", Proceedings of the IEEE First International Conference on Machine Learning and Cybernetics, Vol. 4, pp. 1432-1436, 2002.
[17] Yang, Y. and Pedersen, J., "A comparative study on feature selection in text categorization", Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412-420, 1997.
[18] Prabowo, R. and Thelwall, M., "A comparison of feature selection methods for an evolving RSS feed corpus", Information Processing and Management, Vol. 42, pp. 1491-1512, 2006.
[19] How, B. and Narayanan, K., "An empirical study of feature selection for text categorization based on term weightage", Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'04), Vol. 2, pp. 1-4, 2004.
[20] Li, S. and Zong, C., "A new approach to feature selection for text categorization", IEEE Proceedings of NLP-KE'05, Vol. 9, pp. 626-630, 2005.
[21] Mitchell, T., Machine Learning, McGraw-Hill, Washington, 1997.
[22] Dong, Y. and Han, K., "A comparison of several ensemble methods for text categorization", Proceedings of the 2004 IEEE International Conference on Services Computing (SCC'04), Vol. 4, pp. 1-4, 2004.
[23] Soucy, P. and Mineau, G., "A simple KNN algorithm for text categorization", Vol. 8, pp. 647-648, 2001.
[24] Namburu, S., Tu, H., Luo, J. and Pattipati, R., "Experiments on supervised learning algorithms for text categorization", IEEE Aerospace Conference, pp. 1-8, 2005.
[25] Goldberg, J.L., "CDM: An approach to learning in text categorization", Proceedings of Seventh International Conference on Tools with Artificial Intelligence, Vol. 9, pp. 258-265, 1995.
