
CMR TECHNICAL CAMPUS

UGC AUTONOMOUS
Accredited by NBA & NAAC with A Grade
Approved by AICTE, New Delhi and Affiliated to JNTU, Hyderabad

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

TECHNICAL SEMINAR
ON
Web Clustering Engines.
PRESENTED BY
STUDENT NAME : Kalyani Yash
ROLL NUMBER : 187R1A05F4
CLASS : IV – CSE - C
Abstract
 Web clustering engines are an emerging trend in the field of
information retrieval. They organize search results by topic,
thus offering a complementary view to the flat ranked list
returned by conventional search engines.
 Traditional search engines mix results for different subtopics
or meanings of a query in a single list, so the user may have to
sift through a large number of irrelevant items to locate those
of interest. Web clustering engines categorize the search results
into hierarchical groups (clusters) and display the cluster
labels.
 Hence the user can locate the desired document very fast. In this
seminar we discuss the different phases in the implementation of
web clustering engines in detail, along with some web clustering
algorithms, their advantages, and their issues. We also introduce
some web clustering engines currently in use and present some
future research directions.

 Clustering is the act of grouping similar objects into sets. The
distance between objects in the same cluster (intra-cluster
variation) should be minimal, and the distance between objects
in different clusters (inter-cluster variation) should be maximal.
In the web search context, organizing web pages (search results)
into groups,
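The clustering criterion described above (small distances within a cluster, large distances between clusters) can be checked numerically. The following is a minimal sketch under stated assumptions: the 2-D points and both helper functions are made up for illustration and are not part of any particular clustering engine.

```python
from itertools import combinations
from math import dist


def mean_within_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    distances = [
        dist(a, b)
        for points in clusters
        for a, b in combinations(points, 2)
    ]
    return sum(distances) / len(distances)


def mean_between_cluster_distance(clusters):
    """Average distance between pairs of points taken from different clusters."""
    distances = [
        dist(a, b)
        for first, second in combinations(clusters, 2)
        for a in first
        for b in second
    ]
    return sum(distances) / len(distances)


# Two hypothetical clusters of 2-D points (illustrative data only).
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
within = mean_within_cluster_distance(clusters)
between = mean_between_cluster_distance(clusters)
print(within < between)  # a good grouping keeps `within` well below `between`
```

In a search setting the points would be feature vectors built from result snippets rather than coordinates, but the comparison stays the same.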
Presentation Outline
 Introduction
 Objective
 Applications
 Functionality
 Architecture / Design
 Advantages and Disadvantages
 Conclusion
 Future Enhancements
Introduction
 Web Clustering Engines organize search results by topic,
thus offering a complementary view to the flat ranked
list returned by the conventional search engines.
 A clustering engine tries to address the limitations of
current search engines by providing clustered results as
an added feature to their standard user interface.
 Web clustering engines group search results that share
the same meaning into the same cluster, so it is easy for
the user to find similar documents. Hence the search
time is reduced.
 Web clustering engines give a high-level view of the
query, which is useful for informational searches in
unknown or dynamic domains.

Objectives

 The main aim of input preprocessing is to convert the
contents of search results (output by the acquisition
component) into a sequence of features used by the actual
clustering algorithm.
 The steps for feature extraction are language
identification, tokenization, stemming, and selection of
features.
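The four feature extraction steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word lists are tiny and hypothetical, the stemmer strips just two English suffixes (far cruder than a real stemmer such as Porter's), and feature selection is reduced to a document-frequency cutoff.

```python
import re
from collections import Counter

# Tiny illustrative stop-word lists (assumption); real systems use fuller resources.
STOP_WORDS = {
    "en": {"the", "of", "and", "is", "a", "in", "to", "for"},
    "de": {"der", "die", "das", "und", "ist", "ein", "zu"},
}


def identify_language(text):
    """Step 1: guess the language by counting stop-word hits (crude heuristic)."""
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    return max(STOP_WORDS, key=lambda lang: len(words & STOP_WORDS[lang]))


def tokenize(text):
    """Step 2: split text into lowercase tokens made of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stem(token):
    """Step 3: strip two common English suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_features(snippets, min_doc_freq=2):
    """Step 4: keep only terms that occur in at least `min_doc_freq` snippets."""
    stemmed_docs = []
    for snippet in snippets:
        stops = STOP_WORDS[identify_language(snippet)]
        stemmed_docs.append([stem(t) for t in tokenize(snippet) if t not in stops])
    doc_freq = Counter(term for doc in stemmed_docs for term in set(doc))
    selected = {term for term, freq in doc_freq.items() if freq >= min_doc_freq}
    return [[t for t in doc if t in selected] for doc in stemmed_docs]


snippets = [
    "Clustering engines group the search results",
    "A clustering engine labels clusters of results",
    "The jaguar is a big cat",
]
features = extract_features(snippets)
print(features)
```

The third snippet ends up with no features because none of its terms appear in another snippet, which hints at how sparse short snippets can be.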
Applications
 Web clustering engines are an emerging trend in the field
of information retrieval.
 The Web clustering engines categorize the search
results into different hierarchical groups/clusters and
display those cluster labels.
 Web clustering engines group search results that share
the same meaning into the same cluster, so it is easy for
the user to find similar documents. Hence the search
time is reduced.
 A clustering engine summarizes the content of many
search results in a single view on the first result
page, so the user can review hundreds of potentially
relevant results without needing to download and
scroll through subsequent pages.
Flat Ranked Search Engine.
• Examples: Google, Bing, Yahoo.
Web clustering engine.
 Examples: Vivisimo, Carrot2, KartOO, and DuckDuckGo.
Architecture / Design
 Practical implementations of Web search clustering engines
will usually consist of four general components: search results
acquisition, input preprocessing, cluster construction, and
visualization of clustered results, all arranged in a processing
pipeline.
 The task of the search results acquisition component is to
provide input for the rest of the system. Based on the query,
the acquisition component must deliver 50 to 500 results,
each of which should contain a title, a contextual snippet, and
the URL pointing to the full text being referred to.
 The source of search results can be any public search
engine, such as Google or Yahoo. Clustering is applied to the
smaller set of documents returned by the conventional search
engine in response to the query. The most elegant way of
fetching results from such search engines is to use the
application programming interfaces (APIs) they provide.
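The acquisition component described above can be sketched as follows. This is a hedged illustration, not the API of any real search engine: `SearchResult`, `acquire_results`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever API call a real engine would offer. Only the three fields (title, snippet, URL) and the 50 to 500 range come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

MIN_RESULTS, MAX_RESULTS = 50, 500  # range quoted in the text above


@dataclass(frozen=True)
class SearchResult:
    title: str
    snippet: str
    url: str


def acquire_results(
    query: str,
    fetch: Callable[[str, int], Iterable[SearchResult]],
    wanted: int = 100,
) -> List[SearchResult]:
    """Fetch results for `query` and keep only those with all three fields."""
    wanted = max(MIN_RESULTS, min(MAX_RESULTS, wanted))  # clamp to 50..500
    complete = [r for r in fetch(query, wanted) if r.title and r.snippet and r.url]
    return complete[:wanted]


def fake_fetch(query: str, count: int) -> List[SearchResult]:
    """Hypothetical stand-in for a real search API; returns synthetic results."""
    return [
        SearchResult(
            title=f"{query} result {i}",
            snippet=f"Snippet {i} mentioning {query}",
            url=f"https://example.com/{query}/{i}",
        )
        for i in range(count)
    ]


results = acquire_results("jaguar", fake_fetch, wanted=10)
print(len(results), results[0].url)
```

Passing the fetcher in as a parameter keeps the rest of the pipeline independent of which search engine supplies the results.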
• The set of search results, along with the features extracted in
the preprocessing step, is given as input to the clustering
algorithm, which is responsible for building the clusters and
labeling them. A number of clustering algorithms are available,
and they can be classified into two categories: data-centric and
description-aware. In search results clustering, users are the
ultimate consumers of the clusters, so the clusters should be
aptly labeled. The labels should be unique, unambiguous,
comprehensive, and meaningful with respect to the content. A
poorly labeled cluster is useless even though it contains
closely related, relevant documents.
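As a rough illustration of cluster construction followed by labeling, here is a minimal data-centric sketch. It is an assumption-laden toy, not a published algorithm from the engines named in this seminar: it uses a greedy single-pass grouping with Jaccard similarity, and it labels each cluster with its most frequent term that has not already been used, so labels stay unique. The function names, threshold, and feature lists are all hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_results(feature_lists, threshold=0.4):
    """Greedy single pass: join the most similar cluster or start a new one."""
    clusters = []  # each cluster is a list of document indices
    for index, features in enumerate(feature_lists):
        terms = set(features)
        best, best_score = None, threshold
        for cluster in clusters:
            cluster_terms = set().union(*(feature_lists[i] for i in cluster))
            score = jaccard(terms, cluster_terms)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([index])
        else:
            best.append(index)
    return clusters


def label_clusters(clusters, feature_lists):
    """Label each cluster with its most frequent term not yet used as a label."""
    labels, used = [], set()
    for cluster in clusters:
        counts = Counter(t for i in cluster for t in set(feature_lists[i]))
        # Sort by descending count, then alphabetically, for a stable choice.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        label = next((term for term, _ in ranked if term not in used), "other")
        used.add(label)
        labels.append(label)
    return labels


# Hypothetical preprocessed features for the ambiguous query "jaguar".
feature_lists = [
    ["jaguar", "car", "price"],
    ["jaguar", "car", "dealer"],
    ["jaguar", "cat", "habitat"],
    ["jaguar", "cat", "prey"],
]
clusters = cluster_results(feature_lists)
labels = label_clusters(clusters, feature_lists)
print(clusters, labels)
```

A description-aware algorithm would work in the opposite order, first choosing good label candidates and then assigning results to them.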
Advantages
 Web clustering engines create shortcuts to the items
that relate to the same meaning.
 It allows better topic understanding.
 It favours systematic exploration of search results.
Disadvantages

 Short input descriptions (snippets).
 Meaningful cluster descriptions.
 Selection of similarity measure.
 Grouping of objects into clusters.
 Computational efficiency.
 Unknown number of clusters.
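To see why the choice of similarity measure matters on short inputs, the sketch below compares two common measures on the same pair of snippets. Cosine and Jaccard are picked here only as familiar examples (an assumption, since the slides do not name specific measures), and the two token lists are made up.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bags of words (term-frequency vectors)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity between the sets of distinct terms."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Two hypothetical tokenized snippets about the same topic.
first = "jaguar car car price".split()
second = "jaguar car dealer".split()

cos_score = cosine(first, second)
jac_score = jaccard(first, second)
print(round(cos_score, 3), round(jac_score, 3))
```

With a merge threshold of 0.6, for instance, these two snippets would be grouped together under cosine similarity but kept apart under Jaccard, so the measure and the threshold have to be chosen together.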
Conclusions
 Web clustering engines organize search results by topic,
thus offering a complementary view to the flat-ranked
list returned by conventional search engines. A number
of advances are still needed in cluster labels, coherence
of the cluster structure, performance evaluation studies,
and advanced visualization techniques.
 Due to the lack of an efficient method for evaluating
the performance of clustering engines, they have not
yet attracted wide attention.
Thank You.
