(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010
Survey on Text Document Clustering
Computer Technology, Kongu Engineering College, Perundurai, Tamilnadu, India, Vetha_narayana@yahoo.co.in
Dean, School of Computer Technology and Applications, Kongu Engineering College, Perundurai, Tamilnadu
Document clustering, also referred to as text clustering, is conceptually the same as data clustering. It is hard to find selected information in a collection of N documents, which is why document clustering came into the picture. A cluster is a group of similar data, and document clustering means segregating the data into different groups of similar data. Clustering can be of mathematical, statistical or numerical domain. It is a fundamental data analysis technique used in fields such as biology, psychology, control and signal processing, information theory and mining technologies. From a theoretical or machine learning perspective, clusters represent hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays a vital role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, cybernetics and genetics. In this survey we mainly concentrate on text mining and data mining. Text mining is the process of extracting interesting information and knowledge from unstructured text. Data mining is sorting through data to identify patterns and establish relationships. There are many algorithms based on text and data mining.
Keywords: Text Mining, Information Retrieval and Text Mining, Spatial Database Applications, Web Analysis.
Document clustering is the task of automatically organizing text documents into meaningful clusters or groups, such that the documents in the same cluster are similar to one another and dissimilar from those in other clusters. It is one of the most important tasks in text mining. A number of techniques have been proposed for clustering documents. With the rapid growth of the internet and computational technologies, the field of text mining has grown abruptly, moving from simple document clustering to more demanding tasks such as the production of granular taxonomies, sentiment analysis, and document summarization, with the aim of deriving higher quality information from text. These tasks involve multiple interrelated types of objects. In co-clustering, document similarity and word similarity are defined in a mutually reinforcing manner.

Computer networks are the backbone of science and technology. Roughly 85% of business information is in the form of text, and logic-based programming has difficulty capturing the fuzzy relations in text. To solve this problem, a method must on the one hand cope with the large number of words and structures in natural language, and on the other hand allow handling of vagueness, uncertainty and fuzziness. Text mining is a knowledge-intensive process in which the user interacts with a document collection by using analysis tools. It is analogous to data mining: it extracts useful information from data sources that consist of unstructured text. Text mining identifies a simplified subset of document features that can be used to represent a particular document as a whole. This set of features is called the representational model. Each document in a collection is made up of a large number of features, which affects the approach, performance and design of the system.

The text mining system supports a presentation layer for browsing that is both dynamic and content-based. Text mining systems use visualization tools to navigate and explore concept patterns and to express complex data relationships.
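As a small illustration of the representational model described above, the following Python sketch reduces each document to a vector of word counts and compares documents by the cosine of the angle between those vectors. The bag-of-words representation, the cosine measure, the function names `bag_of_words` and `cosine_similarity`, and the sample documents are all assumptions chosen for illustration; the text does not prescribe a particular feature set or similarity measure.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as a mapping term -> count (assumed feature model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about finance, one about biology.
docs = [
    "stock market prices fall",
    "market prices rise as stock demand grows",
    "gene expression in cancer cells",
]
vectors = [bag_of_words(d) for d in docs]

sim_finance = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(sim_finance > sim_mixed)
```

Under this sketch, the two finance documents share several terms and score higher than the finance and biology pair, which share none. A clustering method would place documents in the same group when such a similarity is high. Note that every distinct word becomes a feature, so the vectors grow with the vocabulary of the collection.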
Text mining systems provide a graphical user interface (GUI) that makes it friendlier to interact with graphical representations of concept patterns. The presentation layer of a text mining system serves as a front end for the execution of the system's core discovery algorithms. Such a front end is user friendly and powerful, but very complex.

The dynamic partitioning of texts ranks at the top of the priority list for all business intelligence systems. However, current text clustering approaches still suffer from major problems that greatly limit their practical applicability. First, text clustering is usually seen as an objective method that delivers one clearly defined result, which should be optimal. This contradicts the fact that different people need different clusterings of the same text, since the same text can be viewed from a business as well as a technical perspective. Second, text clustering works in a high dimensional space, where each word is seen as a potential attribute. But clustering in a high dimensional space is very difficult, since each data point tends to have the same distance from all other