A MINI PROJECT ON CLUSTERING OF WEB DOCUMENTS USING SUFFIX TREE ALGORITHM

BY: V. NagaSivaChaitanya [l09it140], P. GovindaRao [y08it100], R. Ashok [l09it138]

ABSTRACT:
With the growth of information on the World Wide Web, it has become difficult to find the desired information through search engines. Clustering techniques are now being used to give more meaningful search results on the web. This paper gives an overview of web document clustering based on the suffix tree algorithm. Clustering, a widely used technique in data mining, identifies groups of related records that can serve as a starting point for exploring further relationships within the data. Most search engines return a long ranked list of documents, many of which are irrelevant. The low precision of web search engines, coupled with the long ranked-list presentation, makes it hard for users to find the information they are looking for: typical queries retrieve hundreds of documents, most of which have no relation to what the user wanted. Most document clustering techniques rely on single-term analysis of the document set. More informative features, including phrases and their weights, are particularly important for achieving more accurate document clustering. Document clustering is useful in many applications, such as automatic categorization of documents and grouping of search engine results. Clustering algorithms attempt to group documents by their similarities, so that documents relating to a certain topic will ideally be placed in a single cluster. This can help users both to locate interesting documents more easily and to get an overview of the retrieved document set.

Document clustering
Document clustering is the automatic organization of documents into groups or clusters. "Document clustering" differs from other techniques (classification, taxonomy building, tagging, etc.) in that it is fully automated: further human intervention is not needed, although in many applications the Clustering Engine installation can benefit from specific domain expertise, when it is available. The biggest challenge for document clustering has been to quickly find meaningful groups that are concisely described.

Software Requirements: Java and Oracle 10g
Hardware: RAM: 512 MB, Hard Disk: 60 GB
Platform: NetBeans

Suffix Tree Applications:
Suffix trees can be used to solve a large number of string problems that occur in text editing, free-text search, computational biology, and other application areas. Some examples are given below.

String Search
Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).

Longest Repeated Substring
Add a special "end of string" character, e.g. '$', to txt[1..n] and build a suffix tree. The longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root, e.g. 'issi' in the case of 'mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.
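The string search and longest repeated substring ideas can be illustrated with a minimal Java sketch. It is not the linear-time suffix tree construction: for brevity it builds an uncompressed suffix trie by inserting every suffix, which takes O(n^2) time and space. The class name `SuffixTrieSketch` and its method names are hypothetical, and the sketch assumes that '$' does not occur in the text.

```java
import java.util.Map;
import java.util.TreeMap;

public class SuffixTrieSketch {
    // One trie node; children are keyed by the next character.
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
    }

    private final Node root = new Node();

    // Insert every suffix of txt + '$'. This naive build is O(n^2),
    // unlike a true suffix tree, which can be built in O(n).
    public SuffixTrieSketch(String txt) {
        String s = txt + "$"; // assumption: '$' does not occur in txt
        for (int i = 0; i < s.length(); i++) {
            Node node = root;
            for (int j = i; j < s.length(); j++) {
                node = node.children.computeIfAbsent(s.charAt(j), c -> new Node());
            }
        }
    }

    // String search: follow pat from the root, one step per character (O(m)).
    public boolean contains(String pat) {
        Node node = root;
        for (int i = 0; i < pat.length(); i++) {
            node = node.children.get(pat.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return true;
    }

    // Longest repeated substring: the path label of the deepest fork node,
    // where depth is the number of characters from the root.
    public String longestRepeatedSubstring() {
        StringBuilder best = new StringBuilder();
        dfs(root, new StringBuilder(), best);
        return best.toString();
    }

    private static void dfs(Node node, StringBuilder path, StringBuilder best) {
        // A fork node has at least two children, so its path label
        // is followed by two different characters, i.e. it occurs twice.
        if (node.children.size() >= 2 && path.length() > best.length()) {
            best.setLength(0);
            best.append(path);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            dfs(e.getValue(), path, best);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        SuffixTrieSketch trie = new SuffixTrieSketch("mississippi");
        System.out.println(trie.contains("ssip"));            // true
        System.out.println(trie.longestRepeatedSubstring());  // issi
    }
}
```

A trie was chosen over a compressed suffix tree only to keep the example short; the fork nodes are the same in both structures, so the deepest-fork rule carries over unchanged.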

Longest Common Substring
The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 and txt2: each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1 and txt2 represents the longest common substring. Equivalently, one can build a (basic) suffix tree for the string txt1$txt2#, where '$' is a special terminator for txt1 and '#' is a special terminator for txt2. The longest common substring is indicated by the deepest fork node that has both '...$...' and '...#...' (no $) beneath it.

Note that the "longest common substring problem" is different from the "longest common subsequence problem", which is closely related to the "edit-distance problem": an instance of a subsequence can have gaps where it appears in txt1 and in txt2, but an instance of a substring cannot have gaps.

Palindromes
A palindrome is a string, P, such that P=reverse(P), e.g. 'abba'=reverse('abba'); 'ississi' is the longest palindrome in 'mississippi'. The longest palindrome of txt[1..n] can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)# or by building the generalized suffix tree for txt and reverse(txt).
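The node-marking approach to the longest common substring described above can be sketched in Java as follows. This is an illustration under stated assumptions, not a linear-time implementation: it uses an uncompressed generalized suffix trie (O(n^2) time and space) and stores an explicit mark on each node instead of relying on the '$' and '#' terminators. The class name `CommonSubstringSketch` and its method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class CommonSubstringSketch {
    private static final int IN_TXT1 = 1;
    private static final int IN_TXT2 = 2;

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int marks; // bit set of IN_TXT1 and/or IN_TXT2
    }

    // Insert all suffixes of s, marking every visited node with `mark`.
    private static void insertSuffixes(Node root, String s, int mark) {
        for (int i = 0; i < s.length(); i++) {
            Node node = root;
            for (int j = i; j < s.length(); j++) {
                node = node.children.computeIfAbsent(s.charAt(j), c -> new Node());
                node.marks |= mark;
            }
        }
    }

    // The deepest node marked for both strings spells the longest common substring.
    public static String longestCommonSubstring(String txt1, String txt2) {
        Node root = new Node();
        insertSuffixes(root, txt1, IN_TXT1);
        insertSuffixes(root, txt2, IN_TXT2);
        StringBuilder best = new StringBuilder();
        dfs(root, new StringBuilder(), best);
        return best.toString();
    }

    private static void dfs(Node node, StringBuilder path, StringBuilder best) {
        if (path.length() > best.length()) {
            best.setLength(0);
            best.append(path);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            Node child = e.getValue();
            // Skip nodes that are not shared: none of their descendants can be shared.
            if (child.marks != (IN_TXT1 | IN_TXT2)) {
                continue;
            }
            path.append(e.getKey());
            dfs(child, path, best);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(longestCommonSubstring("mississippi", "missouri")); // miss
        String reversed = new StringBuilder("mississippi").reverse().toString();
        System.out.println(longestCommonSubstring("mississippi", reversed));   // ississi
    }
}
```

For 'mississippi' and its reverse the routine returns 'ississi', which agrees with the palindrome example. In general, however, a common substring of txt and reverse(txt) is a palindrome only if its two occurrences line up at mirrored positions, so a full palindrome finder needs that extra check.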

Advantages:
Suffix-tree-based searching is fast and efficient: once the suffix tree for a text has been built in O(n) time, a pattern of length m can be found in O(m) time. It addresses a different problem from Google's PageRank algorithm, which ranks pages by their link structure rather than searching strings.

Conclusion
Clustering is not a brand-new technique; it has been used in statistics for the last five decades. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on most major search engines. Industry analysts predict that Google and other major search engines will need to make use of clustering technology to stay competitive.
