Survey on Text Document Clustering

Published by: ijcsis on Aug 13, 2010
Copyright:Attribution Non-commercial

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010
M. Thangamani
Computer Technology, Kongu Engineering College, Perundurai, Tamilnadu, India
Vetha_narayana@yahoo.co.in
Dr. P. Thangaraj
Dean, School of Computer Technology and Applications, Kongu Engineering College, Perundurai, Tamilnadu, India
ctptr@yahoo.co.in
 Abstract
Document clustering is also referred to as text clustering, and its concept is essentially the same as data clustering. It is difficult to find selected information in a collection of 'N' items of information, which is why document clustering came into the picture. A cluster is a group of similar data; document clustering means segregating data into different groups of similar data. Clustering can belong to the mathematical, statistical or numerical domain. Clustering is a fundamental data analysis technique used in various fields such as biology, psychology, control and signal processing, information theory and mining technologies. From a theoretical or machine learning perspective, clusters represent hidden patterns, the search for them is done by unsupervised learning, and the result is called a data concept. From a practical perspective, clustering plays a vital role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, cybernetics and genetics. In this survey we mainly concentrate on text mining and data mining. The process of extracting interesting information and knowledge from unstructured text is referred to as text mining. Data mining is sorting through data to identify patterns and establish relationships. There are many algorithms based on text and data mining.
 Keywords-Text Mining, Information Retrieval and Text Mining, Spatial Database Applications, Web Analysis.
I. INTRODUCTION
Document clustering is the task of automatically organizing text documents into meaningful clusters or groups, such that the documents in the same cluster are similar to each other and dissimilar from those in other clusters. It is one of the most important tasks in text mining. A number of techniques have been introduced for clustering documents. With the rapid growth of the Internet and computational technologies, the field of text mining has grown abruptly, moving from simple document clustering to more demanding tasks such as the production of granular taxonomies, sentiment analysis, and document summarization, with the aim of deriving higher-quality information from text. These tasks involve multiple interrelated types of objects. Co-clustering means that document similarity and word similarity are defined in a mutually reinforcing manner.

Computer networks are the backbone of science and technology. Nearly 85% of business information is in the form of text, which is fuzzy and therefore difficult to capture with logic-based programming. To solve this problem, we need to cope with the large number of words and structures in natural language on the one hand, and to handle vagueness, uncertainty and fuzziness on the other. Text mining is a knowledge-intensive process in which the user interacts with a document collection by using analysis tools. It is analogous to data mining, except that it extracts useful information from unstructured text sources. Text mining identifies a simplified subset of document features that can be used to represent a particular document as a whole. This set of features is called the representational model. Each document in a collection is made up of a large number of features, which affects the approach, performance and design of the system.

A text mining system provides a presentation layer for browsing that is both dynamic and content-based. Text mining systems commonly use visualization tools to navigate and explore concept patterns and to express complex data relationships. A text mining system also offers a graphical user interface (GUI), which makes it friendlier to interact with the graphical representation of concept patterns. The presentation layer of a text mining system serves as a front end for the execution of the system's core discovery algorithms. These algorithms are user friendly and powerful, but very complex.

The dynamic partitioning of texts ranks at the top of the priority list for all business intelligence systems. However, current text clustering approaches still suffer from major problems that greatly limit their practical applicability. First, text clustering is usually regarded as an objective method that delivers one clearly defined result, which should be optimal. This is contrary to the fact that different people need different views when clustering texts, since the same text can be viewed in terms of business as well as technical aspects. Second, text clustering works in a high-dimensional space [12][13], where each word is seen as a potential attribute. Clustering in a high-dimensional space is very difficult, since each data point tends to have the same distance from all other points. Several techniques are used to overcome this problem [14]. Third, text clustering on its own is of little use unless it is combined with an explanation of why particular texts were categorized into a particular cluster, and this explanation suffers from the high number of features chosen for computing clusters. Although there are many difficulties in high-dimensional clustering, they can be mitigated by means of several algorithms; the ultimate aim is to derive clustering results in all spaces.

http://sites.google.com/site/ijcsis/ ISSN 1947-5500

The remainder of this paper is organized as follows. Section II discusses some of the earlier proposed research work on text document clustering. Section III provides the fundamental idea on which future research work focuses. Section IV concludes the paper with a brief discussion.
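The second problem named in this section, that in a high-dimensional space each data point tends to have the same distance from all other points, can be made concrete with a small experiment. The sketch below is only an illustration under stated assumptions: points are drawn uniformly at random from the unit hypercube, distances are Euclidean, and the function name `relative_contrast` and its parameters are hypothetical, not taken from the surveyed papers. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for random points in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast shrinks as the dimensionality grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

With real term vectors the effect depends on the weighting and the distance measure used, so this should be read as an intuition aid rather than a measurement on text data.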
II. RELATED WORK
Barry de Ville et al. [1] proposed data mining classification and predictive modeling algorithms that are based on bootstrapping techniques. These reuse the source data repeatedly, which can render a holographic view of the modeled data. This holographic approach is applied in an industrial setting: text mining of warranty claims at a major international car, truck, and heavy equipment manufacturer. The paper shows how the algorithms work and how they perform in text mining as applied to warranty claims. The main goal is to obtain better-than-human performance.

Mine T. et al. [2] proposed a text mining system that obtains the relationships between the topics of international conferences. The paper describes not only the relations between topics and conferences, but also the relationships between information entities that users are interested in. The work in [3] deals with arbitrary relations between concepts of a molecular biology ontology for the purpose of supporting text mining and manual ontology building. [4] and [5] give insights into work done on the WWW corpus for text mining based on ontological systems. An ontology is defined as a specification of a conceptualization, and the term also refers to the study of existence. A Java-based ontology and knowledge base framework provides a plug-and-play environment that makes it flexible for rapid prototyping.

Qiaozhu Mei et al. proposed a new general probabilistic model for contextual text mining that covers several existing models as special cases. It extends the Probabilistic Latent Semantic Analysis (PLSA) model with context variables that model the context of a document. The proposed mixture model, called the contextual probabilistic latent semantic analysis (CPLSA) model, can be applied to many interesting mining tasks, such as temporal text mining. In PLSA [7], a document is treated as a mixture of aspects, where each aspect is represented by a multinomial distribution. To avoid overfitting in PLSA, Blei and co-authors proposed a generative aspect model called Latent Dirichlet Allocation (LDA), which can also extract themes from documents.

Miha Grcar et al. [8] addressed the lack of software mining techniques, where software mining is the process of extracting knowledge out of source code. They approach the software mining task with a combination of text mining and link analysis techniques. This mainly deals with links between one instance and another. There are mainly two approaches to building tools for software component retrieval: information retrieval (IR) and knowledge-based approaches. The first approach uses the natural language documentation of the software components. With this approach no interpretation of the documentation is made; information is extracted via statistical analysis of the word distribution. The knowledge-based approach, on the other hand, relies on pre-encoded, manually provided information. Knowledge-based systems can be "smarter" than IR systems, but they suffer from scalability issues. The authors recently started developing an ontology-learning framework named LATINO, which stands for Link-analysis and text-mining toolbox [8]. LATINO is intended to be an open source general purpose data mining platform providing text mining, link analysis, machine learning, and data visualization capabilities.

Ingo Feinerer et al. [9] give a survey of text mining facilities in R and explain how typical application tasks can be carried out using their framework. They present techniques for count-based analysis methods, text clustering, text classification and string kernels [10]. The authors introduce a new framework for text mining applications in R via the tm package. It offers functionality for managing text documents, and abstracts the process of document handling and the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is also implemented for collections of text documents, to ease the use of large, metadata-enriched document sets. tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or conversion between file formats.

Alan Marwick introduced the Unstructured Information Management Architecture (UIMA) [10]. There is a lot of opportunity in using information technology to get more value from unstructured information within organizations. The UIMA framework, recently introduced by IBM, makes it easier to develop and deploy systems that analyze unstructured media objects, such as documents, in order to provide functions such as semantic search and text mining. Text mining is data mining applied to information extracted from text, and the article explains how it can be combined with structured databases and data mining. The article is mainly for people who are interested in learning how the worlds of unstructured and structured information can be brought together.
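The preprocessing operations mentioned earlier in this section for the tm package (whitespace removal and stemming, followed by count-based analysis) can be illustrated in a few lines. The following is a minimal sketch in Python, not the tm API: all function names are hypothetical, and `crude_stem` is a toy suffix stripper assumed purely for illustration, not the Porter stemmer or any stemmer that tm ships with.

```python
import re
from collections import Counter


def strip_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def crude_stem(token):
    """Toy stemmer (assumption): strip one common English suffix, if any."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words stay untouched.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize whitespace, lowercase, tokenize on letters, then stem."""
    cleaned = strip_whitespace(text).lower()
    tokens = re.findall(r"[a-z]+", cleaned)
    return [crude_stem(token) for token in tokens]


def term_counts(docs):
    """Count-based representation: one term-frequency Counter per document."""
    return [Counter(preprocess(doc)) for doc in docs]


if __name__ == "__main__":
    docs = ["  Clustering   documents\n clustered ", "Mining text\tdocuments"]
    for counts in term_counts(docs):
        print(dict(counts))
```

A real pipeline would normally replace the toy stemmer with an established one and add stop-word removal; those choices are left out here to keep the sketch short.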
A. Hotho et al. [11] note that text clustering typically takes place in a high-dimensional space, which makes it difficult in all kinds of settings. They present a new approach for applying background knowledge during preprocessing in order to improve clustering results and allow for selection between results. To overcome the difficulty, they compute multiple clustering results using k-Means. The results may be distinguished and explained by the corresponding selection of concepts in the ontology. The problem of clustering high-dimensional data sets has been researched by Agrawal et al. [12], who present a clustering algorithm called CLIQUE that identifies dense clusters in subspaces of maximum dimensionality. Hinneburg and Keim [13] show how projections improve the effectiveness and efficiency of the clustering process. Their work shows that projections are important for improving the performance of clustering algorithms.

Hossein M. Shirazi et al. observe that extracting information from the World Wide Web has become very common. An information extraction system is defined as a system that "automatically identifies predefined set of related items" [14]. A lot of Web data is found in HTML pages. For HTML, the extraction process requires fetching a Web document, cleaning it up using a syntactic normalization algorithm, and then locating "objects of interest" in the Web page. This is done by first locating the minimal object-rich subtree. Finally, the set of objects is refined to eliminate irrelevant objects. Crescenzi and colleagues [15] present a system that automatically extracts data from large data-intensive Web sites. Their "data grabber" explores a large Web site and infers a model for it, describing it as a directed graph with nodes describing classes of structurally similar pages and arcs representing links between these pages. After pinpointing classes of interest, a library of wrappers can be generated, one per class, with the help of an external wrapper generator, and the appropriate data can be extracted.

Embley and others [16] proposed a method for extracting information automatically from HTML tables. The information is extracted in a stepwise manner. In the first step, an extraction ontology is formulated. An extraction ontology is a "conceptual model instance" that serves as a wrapper for a narrow domain of interest [16]. In the second step, expected attribute names and values are identified from the ontology. In the third step, attribute-value pairs are formed and adjusted so that they are more meaningful. In the fourth step, the extraction patterns are analyzed to refine the extracted information further. Finally, given the input from the earlier four steps, a mapping can be inferred from the source to the target. There are several other works, such as 'road runner', 'hidden Markov model' and 'cluster', that study the clustering of HTML documents.

Pallav Roxy and Durga Toshniwal implemented several approaches, which can be classified into two major categories: the similarity-based approach and the model-based approach. The similarity-based approach performs document clustering on pairwise similarities, aiming to maximize the average similarity within clusters and minimize the average similarity between clusters. Model-based approaches, on the other hand, attempt to learn generative models from the documents, with each model representing one particular document group. Several approaches have been proposed for document clustering since the mid-nineties, including techniques such as the self-organizing map [18], mixture of Gaussians [19], spherical k-means [20], bisecting k-means [21], and mixture of multinomials [22, 23]. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The main idea is to define k centroids, one for each cluster. These centroids should be placed far away from each other. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. Then k new centroids are recalculated, and a new binding is made between the same data points and the nearest new centroid. This produces a loop, during which the k centroids change their location step by step until no more changes occur. The algorithm aims at minimizing an objective function.

Shady Shehata discusses the Vector Space Model (VSM) [24], which is widely used for representing data in text classification and clustering. It represents each document as a feature vector. Each feature vector contains term weights. Selecting and weighting these features accurately affects the result of the clustering algorithm substantially [25] [26]. Incorporating semantic features from the WordNet [27] lexical database is one of the best approaches that have been tried to improve the accuracy of text clustering techniques. In this paper he also introduced a new semantic-based model [28]. The proposed model captures the semantic structure of each term within a sentence rather than its frequency. Each sentence in a document is labeled by a semantic role labeler; a labeled term can be either a word or a phrase, depending on the semantic structure of the sentence [29]. Based on this analysis, each term is assigned a weight. The terms that have maximum weights are extracted as top terms. Synonyms of each word are added to the term vector. These concepts are analyzed at the sentence level, and the top terms are used in text document clustering. When a new document is introduced to the system, the proposed model can detect a concept match from this document to all the previously processed documents in the data set by scanning the new document and extracting the matching concepts.

There are several methods for exploiting correlations between terms in document clustering. The method calculates similarity between documents based on the statistical correlations between their terms and then uses
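The vector space representation and the k-means loop described in this section can be sketched together in plain Python. This is a minimal illustration under several assumptions that do not come from the surveyed papers: documents are tokenized on whitespace, terms are weighted as raw term frequency times log(N / df), distances are squared Euclidean, and the initial centroids are passed in by the caller instead of being chosen "far away from each other" automatically. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Represent each document as a feature vector of term weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    # Document frequency: the number of documents in which a term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Assumed weighting scheme: tf * log(N / df).
        vectors.append(
            [term_freq[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: bind every point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def objective(vectors, labels, centroids):
    """Sum of squared distances of points to their assigned centroids."""
    return sum(squared_distance(v, centroids[lab])
               for v, lab in zip(vectors, labels))


if __name__ == "__main__":
    docs = [
        "apple banana fruit",
        "banana fruit apple apple",
        "engine wheel car",
        "car engine wheel wheel",
    ]
    vocab, vectors = tfidf_vectors(docs)
    labels, centroids = kmeans(vectors, [vectors[0], vectors[2]])
    print(labels, objective(vectors, labels, centroids))
```

Seeding the centroids with one document from each topic keeps the example deterministic; with random or poorly separated seeds, k-means may converge to a different local minimum of the same objective.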