
Web Clustering Engines

SEMINAR REPORT
2009-2011
In partial fulfillment of the requirements for the
Degree of Master of Technology
in
COMPUTER & INFORMATION SCIENCE
Submitted by
DEEPTHI THERESA K.K.
DEPARTMENT OF COMPUTER SCIENCE
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI - 682 022
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI - 682 022
DEPARTMENT OF COMPUTER SCIENCE
CERTIFICATE
This is to certify that the seminar report entitled "Web Clustering Engines", being
submitted by Deepthi Theresa K.K. in partial fulfillment of the requirements for the award of
M.Tech in Computer & Information Science, is a bonafide record of the seminar presented by
her during the academic year 2010.
Mr. G. Santhosh Kumar          Prof. Dr. K. Poulose Jacob
Lecturer                       Director
Dept. of Computer Science      Dept. of Computer Science
ACKNOWLEDGEMENT
First of all, let me thank our Director, Prof. Dr. K. Poulose Jacob, Dept. of
Computer Science, CUSAT, who provided the necessary facilities and advice. I am also
thankful to Mr. G. Santhosh Kumar, Lecturer, Dept. of Computer Science, CUSAT, for his
valuable suggestions and support for the completion of this seminar. With great pleasure I
remember Dr. Sumam Mary Idicula, Reader, Dept. of Computer Science, CUSAT, for her
sincere guidance. I am also thankful to all the teaching and non-teaching staff in the
department and to my friends for extending their warm kindness and help.
I would like to thank my parents, without whose blessings and support I would not have
been able to accomplish my goal. I also extend my thanks to all my well-wishers. Finally, I
thank the Almighty for giving guidance and blessings.
ABSTRACT
Web clustering engines are an emerging trend in the field of information retrieval.
They organize search results by topic, thus offering a complementary view to the flat ranked
list returned by conventional search engines. The search results returned by traditional
search engines on different subtopics or meanings of a query are mixed together in the list,
so the user may have to sift through a large number of irrelevant items to locate those of
interest. Web clustering engines categorize the search results into different hierarchical
groups (clusters) and display the cluster labels. Hence the user can locate the desired
document very quickly.
In this seminar we discuss the different phases in the implementation of web clustering
engines in detail and also cover some web clustering algorithms, their advantages
and their issues. We also introduce some web clustering engines currently in use. Some future
research directions are also presented.
Additional Key Words and Phrases: Web clustering engines, information retrieval, meta
search engines, search results clustering, search results acquisition, preprocessing, cluster
construction and labeling, vector space model, data-centric clustering algorithms,
description-aware algorithms
Contents
1. Introduction
1.1 Motivation
1.2 Goal of web clustering engines
1.3 Issues in the implementation of clusters
2. Architecture and techniques of web clustering engines
2.1 Architecture of web clustering engines
2.1.1 Search results acquisition
2.1.2 Preprocessing of search results
2.1.3 Cluster construction and labeling
2.1.3.1 Data centric clustering algorithms
2.1.3.2 Description aware algorithms
2.1.4 Visualization of clustered results
3. Efficiency and future works
3.1 Search results clustering efficiency factors
3.2 Improve efficiency of clustering
3.3 Performance evaluation
3.4 Research directions and future works
4. Conclusion
5. References
6. Appendix
Seminar Report 2010          Web Clustering Engine
Dept. of Computer Science, CUSAT
1. INTRODUCTION
1.1 MOTIVATION
Search engines are an invaluable tool for retrieving information from the Web. In
response to a user query, they return a list of results ranked in order of relevance to the
query. The user starts at the top of the list and follows it down, examining one result at a
time, until the sought information has been found.
Nowadays efficient search engines such as Google and Yahoo! are available. Although
they are definitely good for navigational and transactional searching,
they are much less effective for ambiguous queries.
An ambiguous query is one that has multiple meanings in different contexts. The
search results returned by conventional search engines on different subtopics or meanings
of a query are mixed together in the list, so the user may have to sift through a
large number of irrelevant items to locate those of interest. In this context, clustering of
search results comes into the picture.
Clustering is the act of grouping similar objects into sets. The distance between
objects in the same cluster (intra-cluster variation) should be minimal, and the distance
between objects in different clusters (inter-cluster variation) should be maximal. In the
web search context, clustering means organizing web pages (search results) into groups so
that different groups correspond to different user needs.
In 1979 Van Rijsbergen introduced the Cluster Hypothesis in the field of
information retrieval. It states that "closely related documents tend to be relevant to the
same requests."
Web Clustering Engines are systems that perform clustering of web search
results. These systems group the results returned by a search engine into a hierarchy of
labeled clusters (also called categories).
To illustrate, Figure 1 in the appendix shows the clustered results returned for the
query "tiger". This result was produced by a very popular web clustering engine called
Vivisimo (as of March 5, 2010). Like many queries on the Web, "tiger" has multiple
meanings: the feline, the Mac OS X computer operating system, the golf champion
and so on. These different meanings are well represented in Figure 1. By contrast, if we
submit the query "tiger" to Google or Yahoo! (Figure 2), we can see that each meaning's
items are scattered in the ranked list of search results, often across a large number of
result pages.
The first commercial clustering engine was Northern Light, at the end of the
1990s. It was based on a predefined set of categories to which the search results were
assigned. A major breakthrough was then made by Vivisimo, whose clusters and cluster
labels were dynamically generated from the search results. Some other available
clustering engines are Clusty, Grokker, KartOO, Lingo3G and CREDO.
1.2 GOAL OF WEB CLUSTERING ENGINES
Web Clustering Engines organize search results by topic, thus offering a
complementary view to the flat ranked list returned by conventional search engines.
The main advantages of the cluster hierarchy are:
- It makes for shortcuts to the items that relate to the same meaning. Since Web
Clustering Engines group the search results having the same meaning within the
same cluster, it is very easy for the user to find similar documents. Hence the
search time is reduced.
- It allows better topic understanding. Since Web Clustering Engines give a
high-level view of the query, they are useful for informational searches in
unknown or dynamic domains.
- It favors systematic exploration of search results. A clustering engine
summarizes the content of many search results in one single view on the first
result page, so the user may review hundreds of potentially relevant results
without the need to download and scroll to subsequent pages.
A clustering engine tries to address the limitations of current search engines by
providing clustered results as an added feature to their standard user interface.
1.3 ISSUES IN THE IMPLEMENTATION OF CLUSTERS
Unlike conventional document clustering, Web search results clustering deals with
billions of constantly changing pages. The data are mainly unstructured and heterogeneous,
and there is additional information to consider (e.g., links, click-through data).
This dynamic nature of the data, together with the interactive use of clustered
results, poses new requirements and challenges to clustering technology:
- Short input data description. For computational reasons, the data available
to the clustering algorithm for each search result are usually limited to a URL,
an optional title, and a short excerpt of the document's text (the snippet).
- Meaningful labels. Each cluster label should indicate the contents of the
items within that cluster.
- Selection of similarity measure. Many methods are known for measuring
the dissimilarity or similarity between two items, such as Euclidean
distance and Manhattan distance.
- Grouping of objects into clusters. Many approaches are available for
grouping the objects, such as agglomerative clustering, suffix tree clustering and
k-means clustering.
- Computational efficiency. Search results clustering is performed online, within
an application that requires overall subsecond response times. The critical step
is the acquisition of search results, whereas the efficiency of the cluster
construction algorithm is less important due to the low number of input
results.
- Overlapping clusters. Since the same result may apply to different themes,
we may allow overlapping clusters. Handling overlapping clusters in a
dynamic environment is an open issue.
- Unknown number of clusters. In search results clustering, both the number and
the size of clusters cannot be predetermined because they vary with the query.
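The two distance measures named in the list above can be sketched as follows. This is only an illustration, assuming each item has already been converted to a numeric feature vector of equal length; the function names and the sample vectors are made up for the example.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Two hypothetical term-count vectors over the same three features.
u = [1, 0, 2]
v = [0, 1, 0]
print(euclidean(u, v))  # square root of 6
print(manhattan(u, v))  # 4
```

A smaller distance means the two items are more similar under the chosen measure, so either function can be plugged into a clustering algorithm that needs a dissimilarity value.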
2. ARCHITECTURE AND TECHNIQUES OF WEB
CLUSTERING ENGINES
2.1 ARCHITECTURE OF WEB CLUSTERING ENGINES
Practical implementations of Web search clustering engines usually consist of
four general components: search results acquisition, input preprocessing, cluster
construction, and visualization of clustered results, all arranged in a processing pipeline.
2.1.1 SEARCH RESULTS ACQUISITION
The task of the search results acquisition component is to provide input for the
rest of the system. Based on the query, the acquisition component must deliver 50 to 500
results, each of which should contain a title, a contextual snippet, and the URL pointing
to the full text being referred to.
The source of search results can be any public search engine, such as Google or
Yahoo!. Clustering is applied to this smaller set of documents returned by the
conventional search engine in response to the query. The most elegant way of fetching
results from such search engines is to use the application programming interfaces (APIs)
these engines provide.
2.1.2 PREPROCESSING OF SEARCH RESULTS
Input preprocessing is a step that is common to all search results clustering
systems. Its primary aim is to convert the contents of search results (output by the
acquisition component) into a sequence of features used by the actual clustering
algorithm.
The steps for feature extraction are language identification, tokenization, stemming
and selection of features.
Clustering engines that support multilingual content must perform initial
language recognition on each search result in the input.
During the tokenization step, the text of each search result is split into a
sequence of basic independent units called tokens, which usually represent single
words, numbers, symbols and so on. Tokenization becomes much more complex for
languages where white spaces are not present (such as Chinese) or where the text may
switch direction (such as an Arabic text within which English phrases are quoted).
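A minimal tokenizer for whitespace-delimited text might look like the sketch below. It assumes English-like input and deliberately ignores the harder cases mentioned above (text without white space, mixed text direction); the pattern and the function name are illustrative choices, not taken from any particular clustering engine.

```python
import re

# A token is either a run of word characters (letters, digits, underscore)
# or a single non-space symbol such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into lowercase word, number and symbol tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Tiger Woods wins, again!"))
# ['tiger', 'woods', 'wins', ',', 'again', '!']
```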
The aim of stemming is to remove the inflectional prefixes and suffixes of each
word and thus reduce different grammatical forms of the word to a common base form
called a stem. For example, the words connected, connecting and interconnection would
be transformed to the word connect. Here connect is the stem.
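The idea can be illustrated with a toy affix stripper. This is a deliberately naive sketch: the affix lists below are assumptions chosen only so that the three example words reduce to the same stem, and production systems rely on a full stemming algorithm instead.

```python
# Hypothetical, intentionally tiny affix lists for illustration only.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ed", "ion", "s")


def toy_stem(word, min_stem=3):
    """Strip at most one known prefix and one known suffix from a word."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


for w in ("connected", "connecting", "interconnection"):
    print(w, "->", toy_stem(w))  # each line ends with "connect"
```

The `min_stem` guard keeps very short words from being stripped down to nothing.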
Last but not least, the preprocessing step needs to extract features for each search
result present in the input. Features are atomic entities by which we can describe an
object and represent its most important characteristics to an algorithm. When looking at
text, the most intuitive set of features would be simply the words of a given language. But
this is not the only possibility. The features can vary from single words and fixed-length
tuples of words (n-grams) to frequent phrases (variable-length sequences of words) and
very algorithm-specific data structures, such as approximate sentences.
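As a small illustration of fixed-length word tuples, the sketch below extracts word n-grams from a token list. The function name is hypothetical and the input is assumed to be already tokenized.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["mouse", "ate", "cheese", "too"]
print(word_ngrams(tokens, 2))
# [('mouse', 'ate'), ('ate', 'cheese'), ('cheese', 'too')]
```

With n = 1 the result is simply the list of single words, so the same routine covers both of the simplest feature types.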
One method for representing a text is the Vector Space Model (VSM). A document d
is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$, where
$t_0, t_1, \ldots, t_n$ is a global set of words (features) and $w_{t_i}$ expresses the
weight (importance) of feature $t_i$ to document d.
Weights in a document vector typically reflect the distribution of occurrences of features
in that document. For example, a term vector for the phrase "Polly had a dog and the dog
had Polly" could appear as follows (weights are simply counts of words; articles are
rarely specific to any document and normally would be omitted):

a     and   dog   had   Polly   the
1     1     2     2     2       1
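The count-based term vector for the example phrase can be computed with a few lines of code. This is a sketch only: it assumes whitespace tokenization and lowercasing, uses raw counts as weights, and keeps the articles so that every word of the phrase is visible.

```python
from collections import Counter


def term_vector(text, vocabulary=None):
    """Return (vocabulary, weights), where each weight is a raw word count."""
    counts = Counter(text.lower().split())
    if vocabulary is None:
        vocabulary = sorted(counts)
    return vocabulary, [counts[term] for term in vocabulary]


vocab, weights = term_vector("Polly had a dog and the dog had Polly")
print(vocab)    # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(weights)  # [1, 1, 2, 2, 2, 1]
```

Passing the same `vocabulary` for every document yields vectors of equal length, which is what distance-based clustering algorithms expect.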
2.1.3 CLUSTER CONSTRUCTION AND LABELLING
The set of search results, along with the features extracted in the preprocessing
step, is given as input to the clustering algorithm, which is responsible for building the
clusters and labeling them. A number of algorithms are available for clustering. We
can classify them into two categories: data-centric and description-aware.
In search results clustering, users are the ultimate consumers of the clusters. Hence the
created clusters should be aptly labeled. The labels should be unique, unambiguous,
comprehensive and faithful to the content. A poorly labeled cluster is useless
even though it contains closely related, relevant documents.
2.1.3.1 DATA CENTRIC CLUSTERING ALGORITHMS
Representatives of this group are conventional data clustering
algorithms such as Agglomerative Hierarchical Clustering (AHC) and K-means.
Scatter/Gather, developed in 1992 at Xerox PARC, is a landmark example of a
data-centric system and is commonly perceived as a predecessor and conceptual
parent of all clustering systems that appeared later. This system uses the VSM for text
representation, and the clustering technique used is agglomerative hierarchical clustering
(AHC) with an average-link merge criterion. It first clusters a collection of
documents into a set of k clusters (scattering). At query time the user selects clusters of
interest (gathering) and the system re-clusters the documents in them. This process repeats
until a small cluster with relevant documents is found. The following figure depicts the
operation of a Scatter/Gather system.
Agglomerative Hierarchical Clustering (AHC) is a typical example of data-centric
clustering algorithms. It is a bottom-up approach. Initially each document is in its own
cluster. Build a distance matrix (dissimilarity matrix) for every pair of clusters. Merge the
two closest clusters and build the new distance matrix by replacing the merged clusters with
one cluster. Continue this process until the desired number of clusters, k, is reached. The
distance matrix alone requires O(n^2) space and time to build, where n is the number of
documents; a naive implementation that rescans the matrix at every merge runs in O(n^3)
time.
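The bottom-up procedure described above can be sketched as follows. Assumptions made for the sketch: documents are numeric vectors, distance is Euclidean, and the merge criterion is average link (the criterion mentioned for Scatter/Gather). For brevity the code recomputes cluster distances on every pass rather than maintaining an explicit distance matrix, so it illustrates the logic and not an efficient implementation.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_link(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # one cluster per document
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_link(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


docs = [[0, 0], [0, 1], [5, 5], [5, 6]]
print(agglomerative(docs, 2))  # [[0, 1], [2, 3]]
```

The returned lists hold document indices, so the caller can map each cluster back to the original search results.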
Another data-centric algorithm is K-means clustering. K is a predefined
number of clusters, and the centroid of each cluster is the mean (average) of its
members; hence the name. First choose the number of clusters k. Randomly pick k
initial cluster representatives (centroids). Calculate the distance between each
centroid and each document. Assign each document to the nearest cluster centroid.
Recompute the cluster centroids. Repeat these steps until some convergence criterion is met.
The complexity is O(knT), where k is the number of clusters, n is the number of
documents and T is the number of iterations needed to reach a stable
state (one in which document memberships no longer change).
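The steps above translate into the following sketch. It assumes numeric document vectors, squared Euclidean distance, initial centroids drawn at random from the documents themselves, and "no membership change" as the convergence criterion; the `seed` and `max_iter` parameters are additions made for the example.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(vectors, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every document to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:  # memberships stable: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


docs = [[0, 0], [0, 1], [5, 5], [5, 6]]
labels, centers = k_means(docs, 2)
print(labels)   # the first two documents share one label, the last two the other
print(centers)
```

Each pass over the data costs on the order of k times n distance computations, which is where the O(knT) bound comes from.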
Data-centric algorithms borrow their strengths from well-known and proven
techniques targeted at clustering numeric data. Even though they use simple keyword-based
features, they are still powerful methods.
However, this set of algorithms has some difficulties. None of these algorithms is
incremental in nature. "Incremental" means that as each document arrives from the Web,
we "clean" it and add it to the existing model. All of the above algorithms lack this
incremental property.
Another difficulty with data-centric approaches concerns meaningful labels.
In these algorithms cluster labels are created by selecting frequent keywords from the set
of documents in the cluster. This keyword-based representation is often insufficient from
the user's perspective. Once a text is converted to a document vector we can hardly speak
of the text's meaning, because the vector is basically a collection of unrelated terms.
When the extracted features are used in a keyword-based approach, the description of the
cluster content is not very readable.
To justify this argument, refer to Figure 3 in the appendix.
The query used there is:
Retrieve the top 250 documents that contain the word "star".
We ask Scatter/Gather to place the 250 documents into 5 groups. The figure
contains only the first scattered clusters. Shown are the clusters' sizes (how many
documents they contain), a list of topical terms, and a list of document titles.
One can see from the topical terms of Cluster 1 that this cluster contains
documents that involve stars as symbols, as in military rank and patriotic songs. Cluster 2
has 68 documents that appear mainly to be about movie and TV stars. Cluster 3 contains
97 documents that have to do with aspects of astrophysics. Cluster 4 contains 67
documents also about astronomy and astrophysics. This cluster contains many articles
about people who are astronomers. Cluster 5 contains all the articles that discuss animals
or plants and that happen to contain the word star, for example, starfish.
But by looking only at the keyword lists of these clusters we can hardly arrive at these
descriptions of the cluster contents. To get more descriptive cluster labels we can use
description-aware algorithms.
2.1.3.2 DESCRIPTION AWARE ALGORITHMS
Description-aware algorithms are aware of this labeling problem and try to ensure
that the construction of cluster descriptions is feasible and that it yields results
interpretable to a human. One way to achieve this goal is to use a monothetic clustering
algorithm (i.e., one in which objects are assigned to clusters based on a single feature)
and carefully select the features so that they are immediately recognizable to the user as
something meaningful. If features are meaningful and precise, then they can be used to
describe the output clusters accurately and sufficiently. The algorithm that first
implemented this idea was Suffix Tree Clustering (STC), described in a few seminal
papers by Zamir and Etzioni in 1998 and 1999, and implemented in a system called Grouper.
In practice, STC was very much a breakthrough in search results clustering.
Suffix Tree Clustering (STC) uses a data structure called a suffix tree. It uses
phrases (ordered sequences of words) as its atomic features rather than keywords. Suffix
tree clustering is performed in three steps: data cleaning, identifying base
clusters and combining base clusters. We define a base cluster to be a set of documents
that share a common phrase.
A suffix tree - Definition
1. A suffix tree of a string S is a compact trie containing all suffixes of S.
2. It is a rooted tree.
3. Each internal node has at least two children.
4. Each edge is labeled with a non-empty substring of S. The label of a node is the
concatenation of the edge labels on the path from the root to that node.
5. No two edges out of the same node can have edge labels that begin with the same word.
For example, the suffixes of the sentence "mouse ate cheese too" are:
Suffix no.   Suffix
1.           mouse ate cheese too
2.           ate cheese too
3.           cheese too
4.           too
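The word-level suffixes listed above can be generated mechanically. The sketch below assumes whitespace-separated words; the function name is illustrative.

```python
def word_suffixes(sentence):
    """Return every word-level suffix of a sentence, longest first."""
    words = sentence.split()
    return [" ".join(words[i:]) for i in range(len(words))]


for number, suffix in enumerate(word_suffixes("mouse ate cheese too"), start=1):
    print(number, suffix)
# 1 mouse ate cheese too
# 2 ate cheese too
# 3 cheese too
# 4 too
```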
A Generalized Suffix Tree (GST) is a suffix tree that contains all the suffixes of two
or more sentences.
Step 1 - Data Cleaning
In this step, the string of text representing each document is transformed using a
light stemming algorithm (deleting word prefixes and suffixes and reducing plural to
singular). Sentence boundaries (identified via punctuation and HTML tags) are marked,
and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped.
Step 2 - Identifying Base Clusters
The following picture is an example of a Generalized Suffix Tree for a set of strings:
1) "cat ate cheese", 2) "mouse ate cheese too" and 3) "cat ate mouse too". The nodes of the
suffix tree are drawn as circles. Each suffix-node has one or more boxes attached to it
designating the string(s) it originated from. The first number in each box designates the
string of origin (1-3 in our example, in the order the strings appear above); the second
number designates which suffix of that string labels that suffix-node.
Each node of the suffix tree represents a group of documents and a phrase that is
common to all of them. The label of the node represents the common phrase; the set of
documents tagging the suffix-nodes that are descendants of the node makes up the
document group. Therefore, each node represents a base cluster.
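The base clusters of the three example strings can be found without building the tree at all. The sketch below swaps the suffix tree for a brute-force enumeration of every word sequence, which is quadratic per document and only suitable for tiny inputs. A phrase is kept when it occurs in at least two documents and is followed by at least two different continuations, which mirrors the rule that an internal suffix-tree node has at least two children. The function name and the end-of-document marker are assumptions of the sketch.

```python
from collections import defaultdict


def base_clusters(docs):
    """Map each shared, branching phrase to the sorted ids of its documents."""
    occurrences = defaultdict(set)  # phrase -> ids of documents containing it
    followers = defaultdict(set)    # phrase -> distinct continuations seen
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = tuple(words[i:j])
                occurrences[phrase].add(doc_id)
                # A per-document end marker plays the role of the unique
                # terminator appended to each string in a generalized tree.
                nxt = words[j] if j < len(words) else ("<end>", doc_id)
                followers[phrase].add(nxt)
    return {
        " ".join(phrase): sorted(ids)
        for phrase, ids in occurrences.items()
        if len(ids) >= 2 and len(followers[phrase]) >= 2
    }


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, ids in sorted(base_clusters(docs).items()):
    print(phrase, ids)
# ate [1, 2, 3]
# ate cheese [1, 2]
# cat ate [1, 3]
# cheese [1, 2]
# mouse [2, 3]
# too [2, 3]
```

Note that "cat" alone is not reported: it is always followed by "ate", so in the compact tree it is absorbed into the node labeled "cat ate".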
The following table lists the six marked nodes (a-f) from the example shown above
and their corresponding base clusters:

Node   Phrase       Documents
a      cat ate      1, 3
b      ate          1, 2, 3
c      cheese       1, 2
d      mouse        2, 3
e      too          2, 3
f      ate cheese   1, 2

Each base cluster is assigned a score that is a function of the number of
documents it contains and the words that make up its phrase. The score s(B) of base
cluster B with phrase P is given by:
$s(B) = |B| \cdot f(|P|)$
where $|B|$ is the number of documents in base cluster B, $|P|$ is the number of words in
P that have a non-zero score (i.e., the effective length of the phrase), and f is a function
of that effective length.
Step 3 - Combining Base Clusters
This step of the algorithm merges base clusters that have a high overlap in their
document sets. To do this we use a base cluster graph. The nodes in this graph
are base clusters, and base clusters are combined according to a similarity measure.
The following figure is a base cluster graph of the previous example.
We define a binary similarity measure. Given two base clusters $B_m$ and $B_n$ with
sizes $|B_m|$ and $|B_n|$ respectively, $|B_m \cap B_n|$ is the number of documents common
to both base clusters. We define the similarity between $B_m$ and $B_n$ to be 1 iff:
$|B_m \cap B_n| / |B_m| > 0.5$ and
$|B_m \cap B_n| / |B_n| > 0.5$
Otherwise the similarity is equal to 0.
If the similarity between two base clusters is equal to 1, draw an edge connecting
those base clusters. A cluster is defined as a connected component in the base
cluster graph. Each cluster contains the union of the documents of all its base clusters. In
the above base cluster example there is one connected component and therefore one cluster.
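The merging step can be sketched as a connected-components search over the base cluster graph. The sketch assumes each base cluster is given as a phrase mapped to a set of document numbers and uses the six base clusters obtainable from the three example strings; the function names and the depth-first traversal are illustrative choices.

```python
def similar(docs_m, docs_n, threshold=0.5):
    """Binary similarity: the overlap must exceed the threshold on both sides."""
    common = len(docs_m & docs_n)
    return common / len(docs_m) > threshold and common / len(docs_n) > threshold


def merge_base_clusters(base):
    """Return one (phrases, documents) pair per connected component."""
    unvisited = set(base)
    clusters = []
    while unvisited:
        start = unvisited.pop()
        component, stack = [start], [start]
        while stack:  # depth-first walk along similarity edges
            current = stack.pop()
            for other in list(unvisited):
                if similar(base[current], base[other]):
                    unvisited.remove(other)
                    component.append(other)
                    stack.append(other)
        documents = set().union(*(base[p] for p in component))
        clusters.append((sorted(component), sorted(documents)))
    return clusters


base = {
    "cat ate": {1, 3},
    "ate": {1, 2, 3},
    "cheese": {1, 2},
    "mouse": {2, 3},
    "too": {2, 3},
    "ate cheese": {1, 2},
}
clusters = merge_base_clusters(base)
print(len(clusters))   # 1
print(clusters[0][1])  # [1, 2, 3]
```

Every two-document base cluster overlaps "ate" by more than half on both sides, which is why the whole graph collapses into a single component here.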
The advantages of STC over data-centric algorithms are as follows. The suffix tree
used by STC can be constructed in linear time. STC is incremental in nature. The method
focuses attention on cluster label descriptiveness, so the cluster labels are more
effective. STC supports overlapping clusters.
The following picture gives an overview of the clusters created by the Suffix
Tree Clustering method:
The query used here is "salsa". Only the first 5 clusters are shown here. The
words in bold are the shared phrases found in the clusters. Note the descriptive power of
phrases such as "Puerto Rico", "Latin Music" and "York Salsa Dancers".
2.1.4. VISUALIZATION OF CLUSTERED RESULTS
Powerful visualizations are now available for Web Clustering Engines. One
prominent approach is based on hierarchical folders. Web Clustering Engines such as
Clusty, CREDO and Lingo3G use the hierarchical folder visualization approach. A
famous clustering engine called Grokker uses a nesting and zooming approach. Some
search engines have also used graph-based interfaces. KartOO is such a system.
Some Clustering Engines and their visualizations are mentioned below:
Clusty
Clusty is a clustering engine developed by the company Vivisimo. Vivisimo won
the "best meta-search engine" award assigned by SearchEngineWatch.com from 2001 to
2003. Vivisimo means lively, bright, or clever in Spanish. Vivisimo's founders picked the
name to express their vision of optimizing and giving life to our information. Clusty is a
meta search engine, meaning it combines results from a variety of different sources. It
uses an algorithm to cluster content based on textual similarity. For every search,
Clusty pulls together the data from other engines such as Ask, MSN and Wisenut. It then
organizes the search results in a way that helps us navigate away from ambiguity towards
a specific cluster of results.
Clusty uses a hierarchical folder approach. It is a very simple method and familiar
to everyone. Figure 1 in the appendix is a screenshot (taken on March 5, 2010) of Clusty.
The hierarchical folders are listed on the left side of the screen so that the user can
choose any cluster he may need in no time.
CREDO
CREDO (Conceptual REorganization of DOcuments) has been developed at
Fondazione Ugo Bordoni by Claudio Carpineto and Gianni Romano. CREDO groups the
results of a web search (currently Yahoo API search results) in a lattice of conceptual
clusters that highlight the contents of the retrieved documents. CREDO is based on a
mathematical data representation termed a concept lattice. Compared to other systems for
clustering Web results, the clusters produced by CREDO are more justifiable, are easier
to navigate because they are organized in a lattice rather than a strict hierarchy, and allow
discovery of causal associations between the words contained in the results. CREDO is
an interesting example of a system that attempts to build the taxonomy of topics and their
descriptions simultaneously. Even though CREDO does not follow a strict hierarchical
organization, it can still use a tree-based visualization. Refer to Figure 4 (taken on
March 6, 2010) in the appendix for the visualization of CREDO.
Versions of CREDO for PDAs (Credino) and for cellular phones (SmartCREDO)
have been developed in collaboration with Stefano Mizzaro and Andrea Della Pietra
(University of Udine).
Grokker
Grokker was developed by a company called Groxis. Groxis was a tech company
based in San Francisco, California. The name Grokker is inspired by the 1961 Robert A.
Heinlein science fiction classic Stranger in a Strange Land, in which "grok" is a Martian
word meaning literally "to drink" and metaphorically "to be one with." To grok something
is to understand something so well that it is fully absorbed into oneself. It is to look at
every problem, opportunity, action, and point of view from any and all perspectives.
Grokker sits on top of multiple sources. After Grokker retrieves the information, it
"federates" it, meaning it meshes it all together. Finally, it clusters the returns into
categories. End users most frequently look at fewer than three screens of the thousands
of returned search results. Using Grokker, users immediately see the cluster(s) of greatest
relevance and drill down only within the cluster(s) that matter to them.
Grokker uses a nesting and zooming approach. A screenshot of Grokker
is shown in Figure 5 in the appendix. This Map View is a visual representation of the
returned hits. When the user clicks on one of the circles, its subcategories are shown. By
clicking on Search Options the user can change the number of hits returned. The
user can also choose which sites to search: Yahoo, Wikipedia and/or Amazon.
Simultaneous searching of different sites is also permitted. Finally, the user can limit the
results by using the tools on the left side of the screen.
Some universities have used Grokker as their searching tool. Stanford University
was one of the first customers of Grokker. The platform provided faculty and
students with a single point of access to multiple resources, including library catalogs,
proprietary subscription databases, and the Web. It helped Stanford users to be more
efficient in their research and navigation among the numerous available resources. The
desktop version of Stanford Grokker is no longer supported and is not available for
download. In March 2009, Groxis ceased operations.
KartOO
KartOO was a meta search engine which displayed a visual interface. It operated
from 2001 to early 2010. KartOO had an advanced Adobe Flash GUI, as opposed to a
text-based list of results. It used a graph-based approach. Its color scheme was to a degree
reminiscent of Apple Computer's Aqua interface. Search results were presented as a
"map", with blob-like masses of varying color connecting each item. The shape of each
blob depended on the relevance of the keyword corresponding to that blob
to the query. If one began a search with a general topic, KartOO sometimes
helped to narrow it down. Every "blob" clicked added another word to the search query.
The map would often succeed in presenting keywords or subtopics that defined the topic
one was searching on. Refer to Figure 6 in the appendix for the visualization of KartOO.
It was co-founded in France by two cousins, Laurent and Nicholas Baleydier, and the
project was launched in 2001. In 2004, KartOO launched a new version called
UJIKO. In January 2010 KartOO closed down, removing all content from the KartOO
and UJIKO websites but leaving a small message in French thanking its users for their
support.
3. EFFICIENCY AND FUTURE WORKS
3.1 SEARCH RESULTS CLUSTERING EFFICIENCY FACTORS
The most critical tasks involve the first three components presented, namely
search results acquisition, preprocessing, and clustering. The visualization component is
not likely to affect the overall system efficiency in a significant manner.
Search Results Acquisition
The number of search results required for clustering usually cannot be fetched in one
remote request. The Yahoo! API allows up to 50 search results to be retrieved in one
request, while the Google SOAP API returns a mere 10 results per remote call. The
acquisition time obviously depends on network congestion, on the capability of the local
equipment used, and also on the specific server processing the request on the search engine side.
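The effect of the per-request limits is easy to quantify. The sketch below assumes a target of 200 results, an arbitrary value inside the 50 to 500 range mentioned for the acquisition component, and uses the page sizes quoted above; the function name is illustrative.

```python
import math


def requests_needed(total_results, page_size):
    """Number of remote calls required to fetch total_results items."""
    return math.ceil(total_results / page_size)


print(requests_needed(200, 50))  # 4 calls at 50 results per request
print(requests_needed(200, 10))  # 20 calls at 10 results per request
```

Unless the calls are issued in parallel, the total acquisition time grows roughly in proportion to the number of remote calls.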
Preprocessing
The performance of tokenization is a critical concern in the preprocessing of
search results. Tokenizers have different performance characteristics
depending on whether they were hand-written or automatically generated.
Tokenization becomes much more complex for languages where white spaces are not
present (such as Chinese) or where the text may switch direction (such as an Arabic text
within which English phrases are quoted).
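The difficulty with languages that lack white space can be seen with a minimal
hand-written tokenizer. The sketch below is an assumption for illustration only (the
function name tokenize and the sample strings are hypothetical); it splits on runs of
word characters, which works for English but returns an unsegmented Chinese phrase as
one single token.

```python
import re
from typing import List

# In Python 3, \w matches Unicode letters and digits, so a run of characters
# with no white space or punctuation between them is returned as one token.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(snippet: str) -> List[str]:
    """Lowercase a snippet and split it into word tokens."""
    return TOKEN_PATTERN.findall(snippet.lower())


english_tokens = tokenize("Web clustering engines organize results.")
# A Chinese phrase written without spaces: the words are not separated.
chinese_tokens = tokenize("网络聚类引擎")
```

A real system would need a language-specific word segmenter for the second case.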
Clustering
Depending on the specific algorithm used, the clustering phase can significantly
contribute to the overall processing time. Search results clustering systems must be
optimized to handle small problem instances and process them as fast as possible.
3.2 IMPROVE EFFICIENCY OF CLUSTERING
There are a number of techniques that can be used to improve the computational
performance of a search results clustering engine.
Client side processing
The majority of currently available search clustering engines perform all
processing on the server side. One possible problem with this approach is
that during periods of high query rate the response times can increase significantly and
thus degrade the user experience. To avoid this, some of the processing can be moved to
the client and run on client-side resources. In this way, scalability issues and the
resulting problems could be avoided.
Incremental processing
One desirable feature of search results clustering would be incremental
processing: as each document arrives from the web, we "clean" it and add it to the
available model.
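One way to picture incremental processing is a single-pass clusterer that handles each
snippet as soon as it arrives. The sketch below is an assumption, not the method of any
engine described in this report: the class IncrementalClusterer, the Jaccard similarity
measure, and the threshold of 0.3 are all chosen only for illustration.

```python
import re
from typing import List, Set


def clean(snippet: str) -> Set[str]:
    """'Clean' a snippet: lowercase it and keep its set of word tokens."""
    return set(re.findall(r"\w+", snippet.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class IncrementalClusterer:
    """Assigns each arriving snippet to the most similar cluster so far."""

    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold
        self.vocab: List[Set[str]] = []     # union of tokens per cluster
        self.members: List[List[str]] = []  # snippets per cluster

    def add(self, snippet: str) -> int:
        """Clean one snippet, update the model, return its cluster index."""
        tokens = clean(snippet)
        best, best_sim = -1, 0.0
        for i, vocab in enumerate(self.vocab):
            sim = jaccard(tokens, vocab)
            if sim > best_sim:
                best, best_sim = i, sim
        if best == -1 or best_sim < self.threshold:
            # Nothing similar enough yet: open a new cluster.
            self.vocab.append(set(tokens))
            self.members.append([snippet])
            return len(self.vocab) - 1
        self.vocab[best] |= tokens
        self.members[best].append(snippet)
        return best


clusterer = IncrementalClusterer()
assignments = [clusterer.add(s) for s in (
    "jaguar car dealer",
    "jaguar cat habitat",
    "used jaguar car prices",
    "jaguar cat diet",
)]
```

The model is usable after every call to add, so partial clusters could be shown to the
user while later results are still being fetched.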
Pretokenized documents
The input to the Web Clustering Engine is the set of search results returned by
conventional search engines. These search engines already apply some preprocessing
to their results before they are returned. If the clustering engine could reuse the
resulting tokens for its own work, it would be an added advantage.
3.3 PERFORMANCE EVALUATION
Clustering engines are designed to overcome the limitations of plain search
engines. We therefore need to evaluate whether the use of clustered results does yield a
gain in retrieval performance over flat ranked lists. Some methods are explained below.
The first method relies on the conventional notions of recall and precision.
To apply them, the retrieved results must form a linear list rather than a clustered
structure. One obvious way to perform such a linearization of the clustering would be to
preserve the order in which clusters are presented and just expand their content, but this
would amount to ignoring the role played by the user in choosing which clusters to
expand. One of the earliest and simplest linearization techniques is to assume that the
user can choose the cluster with the highest density of relevant documents, and to consider
only the documents contained in it, ranked in order of relevance.
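The "best cluster" linearization described above can be sketched as follows. The
document identifiers, cluster labels, and relevance judgments are hypothetical data
invented for the example, and the function names are illustrative.

```python
from typing import Dict, List, Set


def precision(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the listed documents that are relevant."""
    if not ranked:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(ranked)


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that appear in the list."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def best_cluster(clusters: Dict[str, List[str]], relevant: Set[str]) -> str:
    """Label of the cluster with the highest density of relevant documents."""
    return max(clusters, key=lambda label: precision(clusters[label], relevant))


# Hypothetical query result: six documents, three of them relevant.
flat_list = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d2", "d5", "d6"}
clusters = {
    "animal": ["d2", "d5", "d6", "d1"],  # kept in ranked order
    "car": ["d3", "d4"],
}

chosen = best_cluster(clusters, relevant)
flat_precision = precision(flat_list, relevant)
cluster_precision = precision(clusters[chosen], relevant)
cluster_recall = recall(clusters[chosen], relevant)
```

Comparing flat_precision with cluster_precision shows the gain obtained under the
assumption that the user always picks the best cluster, which is an optimistic one.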
A more analytic approach is based on the reach time: a model of the time
taken to locate a relevant document in the hierarchy.
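The report does not give a formula for the reach time, so the sketch below rests on an
assumed cost model: at each level the user reads the cluster labels from the top down to
the one chosen, and inside the final cluster reads the snippets down to the relevant
document. The Node class and the sample hierarchy are hypothetical.

```python
from typing import List, Optional


class Node:
    """A cluster with a label, its own documents, and optional subclusters."""

    def __init__(self, label: str, docs: Optional[List[str]] = None,
                 children: Optional[List["Node"]] = None) -> None:
        self.label = label
        self.docs = docs or []
        self.children = children or []


def reach_time(node: Node, doc: str) -> Optional[int]:
    """Cheapest cost of reaching doc below node, or None if it is absent."""
    best = None
    if doc in node.docs:
        best = node.docs.index(doc) + 1  # snippets read inside this cluster
    for position, child in enumerate(node.children, start=1):
        below = reach_time(child, doc)
        if below is not None:
            cost = position + below      # labels read at this level, then the rest
            if best is None or cost < best:
                best = cost
    return best


root = Node("results", children=[
    Node("cars", docs=["d1", "d2"]),
    Node("animals", children=[
        Node("habitat", docs=["d3", "d4"]),
        Node("diet", docs=["d5"]),
    ]),
])

cost_d1 = reach_time(root, "d1")  # 1 label + 1 snippet
cost_d4 = reach_time(root, "d4")  # 2 labels + 1 label + 2 snippets
```

Averaging this cost over all relevant documents gives a single figure that can be
compared with the average rank of the same documents in the flat list.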
Another method is to analyze user logs: the search engine logs are compared with the
clustering engine logs by computing several metrics, such as the number of documents
followed, the time spent, and the click distance. The interpretation of user logs is,
however, difficult.
To date, the evaluation issue has probably not received sufficient attention, and it
remains an open issue. Nevertheless, some experimental findings suggest that
Web Clustering Engines may be more effective than plain search engines. For lack
of an effective method for evaluating their performance, clustering engines have
still not attracted much attention from users.
3.4 RESEARCH DIRECTIONS AND FUTURE WORKS
The most important research issue is how to improve the quality and
usability of the output hierarchies. To improve clustering efficiency, more powerful
features should be extracted. Developers should also adopt methods for generating more
expressive and effective descriptions of clusters.
Finding optimal cluster representatives is another approach to increasing the
efficiency of the clustering phase. If we can find a better cluster representative, fewer
iterations are needed to reach a stable clustering, which means a shorter response time.
A combination of existing clustering algorithms can also be used to obtain better clusters.
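The report does not say how a cluster representative should be chosen. One common
choice, assumed here purely for illustration, is the medoid: the member that is most
similar on average to the other members of its cluster. The Jaccard measure and the
sample token sets are likewise assumptions.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def medoid(cluster: List[Set[str]]) -> int:
    """Index of the member with the highest total similarity to the others."""
    if not cluster:
        raise ValueError("cluster must not be empty")

    def total_similarity(i: int) -> float:
        return sum(jaccard(cluster[i], other)
                   for j, other in enumerate(cluster) if j != i)

    return max(range(len(cluster)), key=total_similarity)


# Hypothetical cluster of three snippets, each reduced to its token set.
cluster = [
    {"jaguar", "car"},
    {"jaguar", "car", "dealer"},
    {"jaguar", "dealer", "price"},
]
representative = medoid(cluster)
```

Starting an iterative algorithm from representatives of this kind, rather than from
random members, is one way the number of iterations might be reduced.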
One advanced concept is personalized clustering. Since the clustering
process then depends not only on the search results but also on user
characteristics, we speak of personalization. Instead of optimizing
the construction of the hierarchy, one can try to reorganize a given structure
based on user actions. The proposed techniques exploit user feedback to filter out parts
of the hierarchy that are presumably of no interest to the user.
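Filtering a given hierarchy on the basis of user feedback can be sketched as a simple
pruning step. The representation below (nested dictionaries with document lists at the
leaves), the function name prune, and the idea of recording feedback as a set of
rejected labels are all assumptions made for this example.

```python
from typing import Any, Dict, Set


def prune(hierarchy: Dict[str, Any], rejected: Set[str]) -> Dict[str, Any]:
    """Return a copy of the hierarchy without the subtrees the user rejected."""
    kept: Dict[str, Any] = {}
    for label, subtree in hierarchy.items():
        if label in rejected:
            continue  # drop this cluster and everything below it
        if isinstance(subtree, dict):
            kept[label] = prune(subtree, rejected)
        else:
            kept[label] = list(subtree)  # leaf: list of document ids
    return kept


# Hypothetical hierarchy and feedback: the user has dismissed "car".
hierarchy = {
    "jaguar": {"car": ["d1", "d2"], "animal": ["d3"]},
    "apple": ["d4"],
}
personalized = prune(hierarchy, rejected={"car"})
```

The original hierarchy is left untouched, so the feedback can be revised later without
clustering the results again.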
One of the recent topics in the field of search result clustering is the growing
market of mobile search. Two mobile versions of CREDO have been built, suitable for
personal digital assistants and cellular phones. The systems, termed Credino (small
CREDO, in Italian) and SmartCREDO, are exclusively based on the search results and are
freely available online. Screenshots of Credino are available in the appendix as
Figure 7, Figure 8, and Figure 9 (taken on March 6, 2010).
The Semantic Web is a recent research topic. In the Semantic Web the meaning
(semantics) of information on the web is defined, making it possible for machines to
process it. Google has provided a good example of Semantic Web technology with its
"rich snippets". Swoogle is a Semantic Web search engine. In future, clustering could
also be applied to Semantic Web search engines.
4. CONCLUSION
Web clustering engines organize search results by topic, thus offering a
complementary view to the flat ranked list returned by conventional search engines. Web
Clustering Engines have reached a level of maturity at which a body of research has been
developed and commercial systems are being deployed. A number of advances must still
be made: better cluster labels, more coherent cluster structures, performance
evaluation studies, and advanced visualization techniques. Only then can Web Clustering
Engines entirely fulfill the promise of being the PageRank of the future.
5. REFERENCES
Journal/Paper:
Claudio Carpineto, Stanisław Osiński, Giovanni Romano and Dawid Weiss, A Survey
of Web Clustering Engines, ACM Computing Surveys, Vol. 41, No. 3, Article 17, July
2009.
Oren Zamir and Oren Etzioni, Web Document Clustering: A Feasibility
Demonstration, In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and
Development in Information Retrieval, pp. 46-54, 1998.
Books:
C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval,
Addison Wesley Longman Publishing Co. Inc., 1999.
Websites:
http://clusty.com/ March 5, 2010
http://credo.fub.it March 8, 2010
http://www2.parc.com/istl/projects/ia/sg-example1.html March 4, 2010
http://credino.dimi.uniud.it/ March 10, 2010
http://smartcredo.dimi.uniud.it March 10, 2010
6. APPENDIX
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9