
International Journal of Computing

Journal homepage: www.ifrsa.org

Comprehensive Document Clustering for Information Retrieval on the Web


1Dr. M. Hanumanthappa, 2B R Prakash, 3Manish Kumar
1Reader, 2,3Research Scholar
Department of Computer Science & Applications, Bangalore University, Bangalore
ABSTRACT
Document clustering is useful in many information retrieval tasks, such as document browsing and the organization and viewing of retrieval results; web search results clustering is therefore an increasingly popular technique for grouping search results, or snippets, into useful clusters. The Lingo algorithm uses frequent phrases to identify candidate cluster labels and then assigns snippets to these labels. This paper presents an extended Lingo algorithm that adds semantic recognition to the frequent phrase extraction phase. This is achieved by finding the synonyms of frequent words in the WordNet database and adding the synonyms to the pool of frequent terms from which cluster label candidates are drawn. The paper also presents a brief introduction to the CBC algorithm, which could be deployed alongside this approach to form tight clusters (committees) from the semantically enriched clusters.
Key Words: Document clustering, Lingo/Modified Algorithm, CBC, Query Categories.
I. Introduction
Searching for information in digital repositories poses multiple challenges for information seekers: the retrieved sets of documents rarely match their information needs. Information retrieval (IR) is the research area devoted to searching for information within documents, for the text documents themselves, or for metadata, and it includes search in hyper-textually networked information resources such as the WWW. If an IR system could discover the meanings of the terms in a user-defined query, it would be able to retrieve more relevant documents; otherwise precision remains low. This is why providing the system with knowledge of as many word senses as possible is highly important.
Document clustering was initially proposed for improving the precision and recall of information retrieval systems; because clustering is often too slow for large corpora and can have indifferent performance, it has more recently been used for document browsing. During the retrieval process, users may need to go through a long list of irrelevant documents to find what they are looking for. The ranking mechanism employed by search engines might also place a useful result low in the ranked list of document snippets, in which case the user would not notice it. Search results clustering attempts to solve this problem by identifying and labeling groups of similar search results and presenting this grouped output to the user as clusters. Search results clustering is becoming increasingly popular; examples include commercial systems such as Vivisimo (http://www.vivisimo.com) and IBoogie (http://iboogie.tv/), and research frameworks such as Carrot2.
Common characteristics of document clustering include:
• There are a large number of documents to be clustered;
• The number of output clusters may be large;
• Each document has a large number of features; e.g., the features may include all the terms in the document, and the feature space, the union of the features of all documents, is even larger.
1.1. Categorization of Queries:


Searching can be viewed as beginning from a user-supplied query. It seems best not to take too unified a view of the notion of a query: there is more than one type of query, and each may require different handling techniques.
Consider the following types of queries:
• Specific queries.
• Broad-topic queries.
• Similar-page queries.
Concentrating on just the first two types of queries for now, it is observed that they present very different
sorts of obstacles. The difficulty in handling specific queries is centered, roughly, around what could be called
the Scarcity Problem: there are very few pages that contain the required information, and it is often difficult to
determine the identity of these pages.
For broad-topic queries, on the other hand, one expects to find many thousands of relevant pages on the WWW; such a set of pages might be generated by variants of term matching or by more sophisticated means. Thus, there is not an issue of scarcity here. Instead, the fundamental difficulty lies in what could be called the Abundance Problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. To provide effective search methods under these conditions, one needs a way to filter, from among a huge collection of relevant pages, a small set of the most “authoritative” or “definitive” ones.
This paper builds on the Lingo algorithm, which identifies frequent phrases as candidate cluster labels and then assigns snippets to these labels. The paper's contribution is the addition of semantic recognition, enabling synonyms in snippets to be recognised and thus improving the quality of the generated clusters. The
semantic recognition is achieved using the WordNet database, which is a lexical database for the English
language, in which “Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms
(synsets), each expressing a distinct concept”. In WordNet, each word can be a part of various synsets, each
synset relating the word to other words with the same conceptual meaning. For example, the word “doctor” is
part of a synset that relates it to the word “physician”, for the meaning of “a licensed medical practitioner”;
for this meaning, the words “doctor” and “physician” are synonymous.
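For illustration, this synonym lookup can be performed programmatically. The sketch below uses the NLTK interface to WordNet (an assumption made for this example; the paper does not prescribe a particular API) to collect the synonyms of a single term such as “doctor”:

# Minimal sketch: listing WordNet synonyms for a term.
# Assumes NLTK is installed and the WordNet corpus has been downloaded
# (nltk.download('wordnet')); this interface is an illustrative assumption.
from nltk.corpus import wordnet as wn

def synonyms(term):
    """Collect lemma names from every synset the term belongs to."""
    result = set()
    for synset in wn.synsets(term):            # one synset per sense of the term
        for lemma in synset.lemma_names():     # words sharing that sense
            result.add(lemma.replace('_', ' ').lower())
    result.discard(term.lower())               # skip the original term itself
    return result

print(synonyms('doctor'))   # typically includes 'physician' and 'doc'

Running synonyms('doctor') typically yields terms such as “physician”, which is exactly the relation exploited in the frequent phrase extraction phase described later.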
This paper also introduces the clustering algorithm CBC (Clustering By Committee), which produces higher-quality clusters in document clustering tasks than several well-known clustering algorithms.
II. The CBC Algorithm
CBC can handle a large number of documents, a large number of output clusters, and a large sparse feature
space. It discovers clusters using well-scattered tight clusters called committees. In experiments on document clustering, CBC has been shown to outperform several well-known hierarchical, partitional, and hybrid clustering algorithms in cluster quality. CBC may also be applied to other clustering tasks, such as word clustering, since many words have multiple senses.
2.1. Algorithm
The CBC algorithm consists of three phases. In Phase I, compute each element’s top-k similar elements. In
Phase II, construct a collection of tight clusters, where the elements of each cluster form a committee.
The algorithm tries to form as many committees as possible on the condition that each newly formed
committee is not very similar to any existing committee. If the condition is violated, the committee is
simply discarded. In the final phase of the algorithm, each element is assigned to its most similar
cluster.
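The acceptance test applied to newly formed committees in Phase II can be sketched as follows. Cosine similarity over dense feature vectors and the threshold value are illustrative assumptions, and the construction of the candidate committees themselves is omitted; this is a sketch, not the authors' exact procedure.

# Sketch of the Phase II acceptance test: a new committee is kept only if its
# centroid is not too similar to the centroid of any already-accepted committee.
# Cosine similarity and the threshold value are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def centroid(vectors):
    return np.mean(vectors, axis=0)

def accept_committees(candidates, threshold=0.35):
    """candidates: list of candidate committees, each a list of feature vectors."""
    committees = []
    for cand in candidates:
        c = centroid(cand)
        # discard the candidate if it is too close to any accepted committee
        if all(cosine(c, centroid(existing)) < threshold for existing in committees):
            committees.append(cand)
    return committees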
For an element e, CBC finds its most similar cluster and assigns e to it. It then removes from e those features that are shared with the centroid of that cluster, recursively finds e’s next most similar cluster, and repeats the feature removal. This process continues until e’s similarity to its most similar cluster falls below a threshold, or until the total mutual information of the residual features of e falls below a fraction of the total mutual information of its original features. For a polysemous word, CBC can thus potentially discover clusters that correspond to its different senses. Preliminary experiments on clustering words using the TREC collection (3 GB) and a proprietary collection (2 GB) of grade school readings from Educational Testing Service gave the following automatically discovered word senses for the word bass:


(clarinet, saxophone, cello, trombone)
(Allied-Lyons, Grand Metropolitan, United Biscuits, Cadbury Schweppes)
(contralto, baritone, mezzo, soprano)
(Steinbach, Gallego, Felder, Uribe)
(halibut, mackerel, sea bass, whitefish)
(Kohlberg Kravis, Kohlberg, Bass Group, American Home)

and for the word China:
(Russia, chain, Soviet Union, Japan)
(earthenware, pewter, terra cotta, porcelain)
The word senses are represented by four committee members of the cluster.
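The assignment step that uncovers such senses can be sketched as below. The representation (sparse term-to-weight dictionaries, with weights standing in for mutual information scores) and the two cut-off values are assumptions made purely for illustration.

# Sketch of CBC's Phase III soft assignment: an element is assigned to its most
# similar committee centroid, the features it shares with that centroid are
# removed, and the process repeats on the residue.  Dict-of-weights vectors and
# the two cut-off values are illustrative assumptions.

def similarity(element, centroid):
    """Dot product over the shared features of two sparse term->weight dicts."""
    return sum(w * centroid.get(t, 0.0) for t, w in element.items())

def assign_senses(element, centroids, sim_threshold=0.1, residue_fraction=0.2):
    e = dict(element)                             # work on a copy of the element
    original_weight = sum(e.values())
    senses = []
    while e and sum(e.values()) > residue_fraction * original_weight:
        best = max(centroids, key=lambda name: similarity(e, centroids[name]))
        if similarity(e, centroids[best]) < sim_threshold:
            break
        senses.append(best)
        # remove the features that e shares with the chosen committee centroid
        e = {t: w for t, w in e.items() if t not in centroids[best]}
    return senses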
Evaluating cluster quality has always been a difficult task. A new evaluation methodology that is based
on the editing distance between output clusters and manually constructed classes (the answer key) is used
in CBC. This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
2.2. Justifying CBC
Using a single representative from a cluster may be problematic too because each individual element
has its own idiosyncrasies that may not be shared by other members of the cluster.
CBC constructs the centroid of a cluster by averaging the feature vectors of a subset of the cluster members.
The subset is viewed as a committee that determines which other elements belong to the cluster. By
carefully choosing committee members, the features of the centroid tend to be the more typical features of the
target class.
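A minimal sketch of this centroid construction, assuming sparse term-to-weight dictionaries as feature vectors, is the following:

# Sketch: the cluster centroid is the average of the committee members'
# feature vectors (sparse term->weight dictionaries), so idiosyncratic
# features of any single member are diluted.
from collections import defaultdict

def committee_centroid(committee):
    sums = defaultdict(float)
    for vector in committee:
        for term, weight in vector.items():
            sums[term] += weight
    return {term: total / len(committee) for term, total in sums.items()}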
III. Lingo Algorithm and Modifications
3.1. Lingo Algorithm Phases
The following is a brief overview of the phases of the Lingo algorithm.
• Preprocessing: This phase includes common operations, such as stemming and stop word removal, that improve the quality of the input snippets.
• Frequent Phrase Extraction: The modification carried out for this paper adds an extra step that finds the synonyms of the frequent terms and phrases.
• Cluster Label Induction: A term-document matrix of the frequent terms is constructed, and the Singular Value Decomposition (SVD) method is applied to the matrix to help identify the abstract semantic concepts that link the documents together.
• Cluster Content Discovery: In this phase, the input snippets are assigned to the clusters with the labels selected in the previous phase.
• Final Cluster Formation: Clusters are sorted according to a scoring function, and the top-scoring clusters are displayed.
3.2. Semantic Clustering using WordNet
The modification to the Lingo algorithm was carried out in the frequent phrase extraction phase described
above. Frequent phrase extraction is carried out in two phases: Extract Single Terms, which extracts frequent
single terms, and Extract Phrase Terms, which extracts frequent phrase terms (frequent terms including more
than one word). Both of these phases make use of two important data structures that keep track of the frequency of the terms in each document:

Term-Document Array: this array is populated for each term, and includes for each document index the
number of occurrences of the term in the document.
Feature: a feature is an encapsulation of a frequent term that includes the term text (actual term), its total
frequency in all documents, the document indices of the documents in which the term was found, a reference
to the term-document array for the current term, as well as other useful information.
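These two structures might be represented as in the following sketch; the field names are illustrative assumptions rather than the actual Lingo/Carrot2 classes.

# Illustrative representation of the two data structures; the field names are
# assumptions, not the actual Lingo/Carrot2 implementation.
from dataclasses import dataclass, field
from typing import Dict, List

# Term-Document Array: for one term, the number of occurrences per document index.
TermDocumentArray = Dict[int, int]

@dataclass
class Feature:
    term: str                                              # the actual term text
    total_frequency: int = 0                               # occurrences across all documents
    document_indices: List[int] = field(default_factory=list)
    term_document_array: TermDocumentArray = field(default_factory=dict)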


Algorithm 1 shows the modified Extract Single Terms algorithm, including the changes carried out for adding synonyms using WordNet. Special attention was given to the following potential scenarios that may arise when finding synonyms; the handling of these scenarios can also be seen in Algorithm 1:
1. The synonym found is the same as the original term. For example, when finding the synonyms for the term “doctor”, one of the synonyms would be the term “doctor” itself; in this case the synonym is skipped.
2. The synonym found is already one of the frequent single terms. This situation arises, for example, when adding the synonym “physician” for the original term “doctor” while the term “physician” was already included for another document. In this case, the term “physician” should not be added again to the frequent terms list, as it is already included. Accordingly, the Term-Document Array for the term “physician” is found, its term frequency for the current document is incremented, and the total term frequency for “physician” is also incremented. Using this approach, the current document, which contains the word “doctor”, is intuitively linked to the term “physician”.
By doing so, we are ensuring that the document will be included in any cluster whose label contains the word
“physician”, which satisfies our purpose of matching documents to clusters using the semantic meaning of
words, not only the specific words themselves.

1: D input documents (or snippets)


2: T single terms
3: F list of Features, empty
4: TD list of Term-Document Arrays, empty
5: for each document d in D
6: for each frequent term t in T
7: if t does not exist in F then
8: td new Term-Document Array
9: increase term frequency in td for document d by 1
10: add td to TD
11: fnew Feature
12: increase total term frequency for term t by 1
13: add t, td to f
14: add f to F
15: else // t exists in F
16: td Term-Document Array for term t
17: increase term frequency in td for document d by 1
18: f Feature that contains term t
19: increase total term frequency for term t by 1
20: end if
21: end for
22: end for
23: // end of original Extract Single Terms
24: // start adding synonyms
25: SF list of synonym Features, empty
26: STD list of synonym Term-Document Arrays,empty
27: for each document d in D
28: for each term t in F
29: find word for term in WordNet
30: if word is found
31: for each synset syn for word
32: for each synonym in syn
33: if synonym is the same as t
34: continue to next synonym
35: end if
36: if synonym is found in F
37: td Term-Document Array for synonym

38: increase term frequency in td for document d by 1
39: f Feature that contains term synonym
40: increase total term frequency for term synonym by 1
41: else // synonym not found in F
42: if synonym is found in SF
43: std Term-Document Array for synonym
44: increase term frequency in std for document d by 1
45: sf Feature that contains term synonym
46: increase total term frequency for term synonym by 1
47: else // synonym not found in F or SF
48: td new Term-Document Array
49: increase term frequency in td for document d by 1
50: add td to STD
51: fnew Feature
52: increase total term frequency for term synonym by 1
53: add synonym, td to f
54: add f to SF
55: end if
56: end if
57: end for
58: end for
59: end if
60: end for
61: end for
62: append SF to F
63: append STD to TD

Algorithm 1. Pseudo-code of modified Extract Single Terms

3. The synonym found is the same as a synonym that was added previously. This is the same as scenario 2, but it compares the synonym being added, for example “physician”, to a synonym that was added previously rather than to one of the original frequent terms.
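A compact Python rendering of Algorithm 1 might look like the sketch below. It assumes the NLTK WordNet interface, represents each Term-Document Array as a dictionary from document index to frequency (the total term frequency is then simply the sum of these counts), and only processes terms that actually occur in the current document, which matches the intent described above; frequency thresholds and stemming are omitted for brevity.

# Hedged sketch of Algorithm 1 (modified Extract Single Terms), not the
# reference implementation.  Assumes NLTK's WordNet corpus is available.
from nltk.corpus import wordnet as wn

def count(features, term, doc_index):
    """Record one occurrence of term in its Term-Document Array."""
    td = features.setdefault(term, {})           # term -> {document index: frequency}
    td[doc_index] = td.get(doc_index, 0) + 1

def extract_single_terms(documents, frequent_terms):
    """documents: list of token lists; frequent_terms: set of frequent single terms."""
    features = {}        # F: original frequent terms
    syn_features = {}    # SF: terms introduced as WordNet synonyms
    # original Extract Single Terms
    for doc_index, tokens in enumerate(documents):
        for term in tokens:
            if term in frequent_terms:
                count(features, term, doc_index)
    # synonym pass (scenarios 1-3 from the text)
    for doc_index, tokens in enumerate(documents):
        for term in list(features):
            if term not in tokens:
                continue                          # only link documents containing the term
            for synset in wn.synsets(term):
                for synonym in synset.lemma_names():
                    synonym = synonym.replace('_', ' ').lower()
                    if synonym == term:
                        continue                  # scenario 1: same as the original term
                    if synonym in features:
                        count(features, synonym, doc_index)       # scenario 2
                    else:
                        count(syn_features, synonym, doc_index)   # scenario 3 or new synonym
    features.update(syn_features)                 # append SF to F
    return features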
The modification to the Extract Phrase Terms algorithm follows the same logic as the Extract Single
Terms algorithm above. The main difference between the two algorithms comes from the fact that phrase
terms contain more than one word, which would mean that synonyms would need to be found for each word
in the phrase. The approach used for this was to find the synonyms for each word in the phrase, and replace
the word in-place inside the phrase with its synonym to generate a new “phrase term”. Obviously, this does
not consider all possible permutations of the original phrase; however, as the final assignment of documents
to clusters matches the document to a cluster label based on the matching of any word in the label, not
necessarily the complete label, this approach satisfies the requirement to include synonyms for phrase terms.
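The word-by-word replacement can be sketched as follows, again assuming the NLTK WordNet interface; one substitution is made at a time, and combinations of several substitutions are deliberately not generated.

# Sketch of synonym expansion for phrase terms: each word is replaced in place
# by one of its WordNet synonyms, producing a new candidate phrase; combinations
# of several substitutions are not generated.
from nltk.corpus import wordnet as wn

def phrase_variants(phrase):
    words = phrase.split()
    variants = set()
    for i, word in enumerate(words):
        for synset in wn.synsets(word):
            for synonym in synset.lemma_names():
                synonym = synonym.replace('_', ' ').lower()
                if synonym != word:
                    variants.add(' '.join(words[:i] + [synonym] + words[i + 1:]))
    return variants

# e.g. phrase_variants('information retrieval') may include 'data retrieval'
# and 'information recovery', consistent with the cluster labels shown later.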
After the frequent phrase extraction phase is complete, the combined frequent phrases from Extract Single
Terms and Extract Phrase Terms are used to build the term-document matrix, which is then used in the cluster
label induction phase. The candidate cluster labels now include both frequent terms from the original documents and frequent terms that contain synonyms generated from WordNet. The cluster label
induction phase then carries out the pruning necessary to arrive at the best cluster labels from both the original
terms and synonym terms.
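The input to the cluster label induction phase can be sketched as follows: a term-document matrix is built over the combined frequent terms and decomposed with SVD, whose leading left singular vectors approximate the abstract concepts against which candidate labels are matched. Raw term counts are used here purely for illustration, in place of the term weighting a full implementation would apply.

# Sketch of the label induction input: term-document matrix over the combined
# frequent terms (originals plus WordNet synonyms), decomposed with SVD.
import numpy as np

def term_document_matrix(terms, documents):
    """terms: list of frequent terms; documents: list of token lists."""
    A = np.zeros((len(terms), len(documents)))
    for j, doc in enumerate(documents):
        for i, term in enumerate(terms):
            A[i, j] = doc.count(term)             # raw counts for illustration only
    return A

terms = ['information', 'retrieval', 'data', 'recovery']
docs = [['information', 'retrieval', 'data'], ['data', 'recovery'], ['information', 'data']]
A = term_document_matrix(terms, docs)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
# the columns of U associated with the largest singular values stand for
# the abstract concepts used to select cluster labels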
IV. Experimental Evaluation
The comparison below shows a number of differences between the clusters generated by the semantic Lingo algorithm and those generated by the original Lingo algorithm. The differences can be understood by noting that the following groups of words are synonyms in WordNet:

1. “Recovery” and “Retrieval”


2. “Data” and “Information”


3. “Dr.”, “Doctor”, and “Physician”.


Original Lingo Algorithm

Cluster 1: Information Retrieval (4 documents)
1: Introduction to Modern Information Retrieval
2: Information Retrieval and Data Mining
3: Linear Algebra for Intelligent Data Retrieval
4: Automatic Information Organization

Cluster 2: Physician Directory (4 documents)
1: Physician Directory
2: Emergency Physician Directory Group, PC
3: AMA Physician Select
4: Doctor Directory

Semantic Lingo Algorithm

Cluster 1: Information Recovery (4 documents)
1: Linear Algebra for Intelligent Data Retrieval
2: Data Recovery Techniques
3: Information Retrieval and Data Mining
4: Introduction to Modern Information Retrieval

Cluster 2: Dr. Directory (5 documents)
1: Physician Directory
2: Emergency Physician Directory Group, PC
3: Doctor Directory
4: Think Like A Doctor
5: AMA Physician Select

Comparison of clusters generated by the original vs. the semantic Lingo algorithm

V. Conclusion
The inclusion of semantic recognition using synonyms from WordNet has provided a clear improvement over the original Lingo algorithm, mainly in the assignment of documents to clusters, where matching is improved by the knowledge of relations between words. However, further experimentation and analysis needs to be carried out to assess the performance impact of adding semantic recognition, as well as to improve the clusters generated by the proposed semantic Lingo algorithm. Areas that could be further developed include the addition of word stemming and the identification of parts of speech. In addition, the proposed algorithm will require benchmarking against a large number of input documents in order to measure the overhead incurred by the additional processing needed to retrieve synonyms from WordNet.
CBC can handle a large number of documents, a large number of output clusters, and a large sparse feature space; it discovers clusters using well-scattered tight clusters called committees. This analysis can be carried further by combining the features of CBC with the modified Lingo algorithm, which should yield a more efficient document clustering algorithm and thus more effective document retrieval from repositories.
VI. References

[1] Geraci, Filippo, Marco Pellegrini, Marco Maggini, and Fabrizio Sebastiani. "Cluster Generation and Cluster
Labelling for Web Snippets." F. Crestani, P. Ferragina, and M. Sanderson (Eds.): SPIRE 2006, LNCS 4209. Berlin
Heidelberg: Springer-Verlag, 2006. 25-36.
[2] Oikonomakou, Nora, and Michalis Vazirgiannis. "A Review of Web Document Clustering Approaches." Data Mining and Knowledge Discovery Handbook.
[3] Osinski, Stanislaw, and Dawid Weiss. "A Concept-Driven Algorithm for Clustering Search Results." IEEE
Intelligent Systems 20 (2005): 48-54.
[4] Osinski, Stanislaw, and Dawid Weiss. "Conceptual Clustering Using Lingo Algorithm: Evaluation on Open
Directory Project Data." Advances in Soft Computing, Intelligent Information
[5] Osinski, Stanislaw. "Improving Quality of Search Results Clustering with Approximate Matrix Factorisations." 28th
European Conference on IR Research (ECIR 2006), 10 Apr. 2006, London, UK. Springer Lecture Notes in
Computer Science. Vol. 3936. 2006. 167-78.
[6] Osinski, Stanislaw, Jerzy Stefanowski, and Dawid Weiss. "Lingo: Search Results Clustering Algorithm Based on
Singular Value Decomposition." Advances in Soft Computing, Intelligent Information Processing and Web Mining,
Proceedings of the International IIS: IIPWM´04 Conference, 2004, Zakopane, Poland. 359-68.
[7] Sameh, Ahmed. Prince Sultan University, Department of Computer Science & Information Systems.
[8] Wei, ZHANG, XU Baowen, ZHANG Weifeng, and XU Junling. "ISTC: A New Method for Clustering Search
Results." Wuhan University Journal of Natural Sciences 13 (2008): 501-04.
[9] WordNet - Princeton University Cognitive Science Laboratory. 02 Jan. 2009 <http://wordnet.princeton.edu/>.

[10] Zeng, Hua-Jun, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and Jinwen Ma. "Learning to Cluster Web Search Results."
SIGIR '04, July 25–29, 2004, Sheffield, South Yorkshire, UK.
[11] Sedding, Julian, and Dimitar Kazakov. "WordNet-based Text Document Clustering."
[12] Cutting, D. R.; Karger, D.; Pedersen, J.; and Tukey, J. W. 1992. Scatter/Gather: A cluster-based approach to
browsing large document collections. In Proceedings of SIGIR-92, pp. 318–329. Copenhagen, Denmark.
[13] Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill.
[14] R. Barrett, P. Maglio, D. Kellem, "How to personalize the Web," Proc. Conf. on Human Factors in Computing Systems, 1997.

