You are on page 1of 4

International Journal of Computer Applications (0975 – 8887)

Volume 76– No.1, August 2013

Web Information Retrieval

Vikas Thada Vivek Jaglan, Ph.D


Research Scholar Asst.Prof(CSE),ASET
Dr.K.N.M., Amity University
Newai, India Gurgaon, India

ABSTRACT Web Information Retrieval a truly efficient and faster retrieval


method.
Information retrieval (IR) is the process of searching within a
document collection for a particular information need (called 2.1 Huge size
a query) [4]. Web IR can be defined as the application of
theories and methodologies from IR to the World Wide Web. The Web is indeed huge! In fact, it's so big that it's hard to get
It is concerned with addressing the technological challenges an accurate count of its size. By July 13, 2013, according to
facing Information. The utility and ubiquity of web search is worldwidewebsize.com it was estimated that the Web
making Web Information Retrieval (IR) an increasingly contained at least 4.12 billion pages. with an average page
popular research topic. The incredible success of Google is a size of 500KB. Because of this there is an ever increasing
striking example of how important it is for academia and volume of data and information published in numerous Web
industry to foster innovation in the field of Web IR. The pages.
unabated growth of the Web and the increasing expectation
placed by the user on the search engine to anticipate and infer
2.2 Dynamic
his/her information needs and provide relevant results has The web is dynamic. Anyone can publish any type of material
fostered the development of the field of Web Information on to the web at any time. Thus information onto the web is
Retrieval (Web IR). This paper takes a deeper dive into the constantly changing and being updated. This clearly contrast
Web IR process, a variant of classical Information Retrieval, with traditional static document collections for the two
by clearly explaining its core concepts , the components, reasons. First, once a document is added to a traditional
model categories, tools, tasks and the performance measures collection, it does not change. Second, for the most part, the
that quantify the quality of retrieval results. size of a traditional document collection is relatively static.

General Terms 2.3 Self-Organized


Growth, retrieval, crawl, engine On the Web, anyone can post a webpage and link away at
will. There are no standards and no gatekeepers policing
Keywords content, structure, and format. The data are volatile; there are
Web, information, IR, crawl, search, domain rapid updates, broken links, and file disappearances. In
contrast traditional document collections are usually collected
1. INTRODUCTION and categorized by trained (and often highly paid) specialists.

Tim Berners-Lee and his World Wide Web entered the 2.4 Heterogeneity
information retrieval world in 1989 .This event caused a
branch that focused specifically on search within this new The data is heterogeneous, existing in multiple formats,
document collection to break away from traditional languages, and alphabets with duplicity of this data at various
information retrieval. This branch is called web information places. In addition, there is no editorial review process, which
retrieval [4]. Web Information retrieval is the process of means errors, falsehoods, and invalid statements abound.
searching within a huge World Wide Web document
collection for a particular information need (called a query).
2.5 Duplication
By all measures, the Web is enormous and growing at a Several studies indicate that nearly 30% of the web's content
staggering rate, which has made it increasingly intricate and is duplicated, mainly due to mirroring[7].
crucial for both people and programs to have quick and
accurate access to Web information and services. Thus, it is 2.6 Hyperlinked
imperative to provide users with tools for efficient and
The web is hyperlinked. Hyperlinks make focused, effective
effective resource and knowledge discovery. Web IR can be
defined as the application of theories and methodologies from searching a reality. The hyperlink functionality alone—that is,
IR to the World Wide Web. It is concerned with addressing the hyperlink to Web page B that is contained in Web page
A—is not directly useful in information retrieval. However,
the technological challenges facing Information Retrieval (IR)
the way Web page authors use hyperlinks can give them
in the setting of WWW [10].
valuable information content. [4][10].
2. CHARACTERISTICS OF WEB IR These characteristics of Web make the task of retrieving
The web is a unique document collection. Some of the information from it quite different from the traditional
distinguished characteristics of the Web make the nature of information retrieval.

29
International Journal of Computer Applications (0975 – 8887)
Volume 76– No.1, August 2013

3. WEB IR COMPONENTS 3.4 Query Module


To address the challenges found in Web IR, Web search The task of the query module is to converts a user's natural
systems need very specialized architectures. The one is shown language query into a language that the search system can
in the figure 1. The four main components of Web Search are understand (usually numbers), and consults the various
‗Crawling‘ ‗Indexing‘ ‗Querying‘ & ‗Ranking‘ . During the indexes in order to answer the query. This results into the set
web search the first two are performed offline and last two are of relevant pages which are passed to the ranking module.
done online. Crawling and Indexing is query independent and
Querying and Ranking is query dependent. The components 3.5 Ranking Module
are described in brief in the following sections
The ranking module takes the set of relevant pages from queru
3.1 Crawler module and ranks them according to some criterion. The top
pages are have higher ranking and are much more relevant to
The crawling software creates virtual robots or spiders, which the given query.
constantly scour the Web gathering new information and web
pages and returning to store them in a central repository. An .
example is Googlebot used by Google.

3.2 Page Repository


The spiders return with new web pages, which are temporarily
stored as full, complete web pages in the page repository. The
new pages remain in the repository until they are sent to the
indexing module, where their vital information is stripped to
create a compressed version of the page. Popular pages that
are repeatedly used to serve queries are stored here longer,
perhaps indefinitely.

3.3 Indexing Module


The indexing module takes each new uncompressed page and
extracts only the vital descriptors, creating a compressed
description of the page that is stored in various indexes.

Fig 1: Web IR Components

30
International Journal of Computer Applications (0975 – 8887)
Volume 76– No.1, August 2013

4. WEB IR MODELS learning ranking rather than classification, preferences etc


[10]
A retrieval model can be a description of either the
computational process (documents are chosen for retrieval 5. WEB IR TASKS
based on some query) or the human process of retrieval
(information needs are first articulated and then refined). The Web IR tasks are typically organized in tasks with
specific goals to be achieved. In the following sections we
The model of Web IR has two components. First is a set of discuss following tasks: ad hoc retrieval, known item search,
premises and second is an algorithm to be used for ranking interactive retrieval, filtering, browsing, clustering, mining,
documents retrieved in response to a user query. Formally An gathering and crawling.
information retrieval model is a quadruple [D,Q, F,R(qi, dj )]
where D is a set composed of logical views for the documents 5.1 Ad Hoc Retrieval
in the collection;Q is a set composed of logical views for the Characteristics of an ad hoc retrieval task are arbitrary subject
user queries; F is a framework for modeling document of the search and a short duration. A retrieval system returns
representations, queries, and their relationships; R(qi, dj) is a generally a list of documents ranked by decreasing similarity
ranking function which associates a real number(ranking) in response to the query. The internet search engines are
with a query qi document dj examples of information retrieval objects from which one can
Retrieval models can describe the Computational process, for perform ad hoc search.
example, how the documents are ranked and note that how
documents or indexes are stored is implementation. The 5.2 Known Item Search
Retrieval models can also attempt to describe the User In this type of search the target of the search is a particular
process, for example, the information need and interaction document or a small set of documents that the searcher knows
level. to exist in the collection and wants to find it. For example, in
the library environment, a researcher would like to retrieve all
4.1 First Dimension: The Mathematical articles by an author.
Basis
5.3 Filtering
According to the first dimension the models can be put into
three categories: set theoretic, algebraic and probabilistic Select documents using a fixed query in a dynamic
models. All three models are also termed as classical models. collection. For example, ―Retrieve all documents related to
We discuss each of the models in brief. ‗Research in India‘ from a continuous feed‖.

4.1.1 Set theoretic models 5.4 Homepage Finding


Set-theoretic models are based on set theory and Boolean Find the URL of a named entity. For example, ―Find the URL
algebra. Documents and queries are represented as a set of of the Income Tax India homepage‖
index terms (sets of words or phrases). Query definition is
done by Boolean expressions. Similarities are usually derived 5.5 Categorization/Clustering
from set-theoretic operations on those sets. Common models
It is the automatic recognition and the generation of categories
under this class are: Standard Boolean Model, the Extended
of entities that can be text documents. It is usually based on
Boolean Model and the Fuzzy Model[3].
some similarity measure between documents, as well as an
4.1.2. Algebraic models explicit or implicit definition of some unique characteristic
possessed by the groups of documents. Because the search can
In an algebraic model, extracted index terms of a document be restricted on a set of interested category, It is generally
are represented by vectors. It is possible to calculate the used to improve the retrieval process.
similarity between a query and documents in a document
collection. The result of the calculation can be ranked. The 5.6 Adversarial Web IR
similarity of the query vector and document vector is
This addresses tasks such as gathering, indexing, filtering,
represented as a scalar value. Common models under this
retrieving and ranking information from collections wherein a
category are Vector space model, Generalized vector space
model, (Enhanced) Topic-based Vector Space Model and subset has been manipulated maliciously. Adversarial IR
Extended Boolean model. includes the study of methods to detect, isolate, and defeat
such manipulation.
4.1.3. Probabilistic models
5.7 Interactive Retrieval
This model uses probabilistic theory to compute the relevance
of documents depending on the user query. Similarities are During the interactive task the system attempts to perceive
computed as probabilities that a document is relevant for a how the user interacts with it and based on yes/no judgment of
given query. Probabilistic theorems like the Bayes' theorem documents relevance the current search strategy can be
are often used in these models[1]. modified. The system uses these judgments to expand and/or
reweigh the query.
4.2 Second Dimension: User Process
6. CONCLUSION
Another dimension of defining different categories of Web IR
models can be based on their applications. They are Query The paper has discussed one of the rapidly growing and
languages, Indexing ( Boolean), Introducing ranking and becoming extremely important field in recent years: Web
weighting (Vector Space), Probabilistic models of documents, Information Retrieval. This is all because of intriguing
queries, topics (Language Modeling) ―Learning to Rank‖, challenges presented in tapping the Internet and the Web as a
huge source of information. The superlative success of Web

31
International Journal of Computer Applications (0975 – 8887)
Volume 76– No.1, August 2013

search engines especially Google is a testimony to this fact. [5] S. Spat ―About Web IR models.pdf‖, pp 31-35, March
The paper further discussed the characteristics and models of 2007
Web IR.
[6] H.Monica, ―The Past, Present, and Future of Web Search
7. REFERENCES Engines‖, Proceedings of 31st International Colloquium,
ICALP 2004,Finland, July 12-16, 2004.
[1] M. Kobayashi & K. Takeda, ―Information Retrieval on the
Web‖, ACM Computing Surveys, Vol. 32, No.2, June [7] N. Sérgio, ―State of the Art in Web Information
2000. Retrieval‖, Technical Report, FEUP, 2006.

[2] N. Fuhr, ―Models in Information Retrieval‖, ESSIR 2000: [8] H.Monica, ―Information Retrieval on the Web‖, 39th
21-50. Annual Symposium on Foundations of Computer
Science (FOCS'98), Palo Alto, CA
[3] S.Amit, "Modern Information Retrieval: A Brief
Overview", Bulletin of the IEEE Computer Society [9] R. Baeza-Yates, B. Ribeiro-Neto, ―Modern Information
Technical Committee on Data Engineering 24 (4): 35– Retrieval‖. Addison Wesley, New York, 1999.
43, 2001. [10] B.MPS,A.Kumar ― A Primar on the Web Information
[4] N. Langville and D. Meyer ―Science of search engine Retrieval Paradigm‖, JATIT, March 2007.
rankings‖, Chapter 1 ,Princeton Pubs,2006

32
IJCATM : www.ijcaonline.org

You might also like