
Challenges in Web Information Retrieval
Monika Arora1, Uma Kanjilal2, and Dinesh Varshney3,4
1 Department of IT, Apeejay School of Management, Dwarka Institutional Area, New Delhi, India
2 Department of Library and Information Science, Indira Gandhi Open University, Maidan Garhi, New Delhi-110 068, India
3 School of Physics, Devi Ahilya University, Khandwa Road Campus, Indore, M. P., India
4 Multimedia Regional Centre, Madhya Pradesh Bhoj (Open) University, Khandwa Road Campus, Indore-452001, M. P., India
T. Sobh, K. Elleithy (eds.), Innovations in Computing Sciences and Software Engineering,
DOI 10.1007/978-90-481-9112-3_24, © Springer Science+Business Media B.V. 2010

Abstract: The major challenge in information access is the wealth of data
available for retrieval, which has driven the development of principled
approaches and strategies for searching. Search has become the leading paradigm
for finding information on the World Wide Web. Building a successful web search
engine model raises a number of challenges at different levels, where techniques
such as Usenet and support vector machines are employed to significant effect.
The present investigation explores a number of problems, identified at each
level, that relate to finding information on the web. This paper examines these
issues by applying different methods, such as web graph analysis, the retrieval
and analysis of newsgroup postings, and statistical methods for inferring
meaning in text. We also discuss how the vast amounts of data on the web can be
brought under control by addressing these problems in innovative ways that can
substantially improve on standard approaches. The proposed model thus assists
users in finding the data they need. The developed information retrieval model
provides access to information available in various modes and media formats, and
helps users retrieve relevant and comprehensive information efficiently and
effectively according to their requirements. This paper also discusses the
parameters that are responsible for efficient searching. These parameters can be
classified as more or less important based on the available inputs. The
important parameters can be taken into account in the future extension or
development of search engines.

Key words: Information Retrieval, Web Information Retrieval, Search Engine,
Usenet, Support Vector Machine

I. INTRODUCTION

Search engines are extremely important in helping users find relevant
information on the World Wide Web. To best serve the needs of users, a search
engine must find and filter the most relevant information matching a user's
query, and then present that information in a manner that makes it most readily
accessible to the user. Moreover, the task of information retrieval and
presentation must be done in a scalable fashion to serve the hundreds of
millions of user queries that are issued every day to popular web search
engines [1]. In addressing the problem of Information Retrieval (IR) on the web,
researchers face a number of challenges. This paper discusses some of these
challenges and identifies additional problems that may motivate future work in
the IR research community. It also describes some work in these areas that has
been conducted at various search engines. It begins by briefly outlining some of
the issues or factors that arise in web information retrieval.

The user interacts with the system directly for information retrieval, as shown
in Figure 1. It is easy to compare fields with well-defined semantics to queries
in order to find matches. For example, records are easy to find for a bank
database query. The semantics of the keywords sent through the interface also
play an important role. The system includes the interface of the search engine
servers, the databases, and the indexing mechanism, which includes stemming
techniques. The user defines the search strategy and also specifies the
requirements for searching. Subject indexing, ranking, and clustering are
applied to the documents available on the WWW [2]. The relevant matches are
easily found by comparison with field values of records. This is simple for a
database in terms of maintenance and retrieval of records, but it is difficult
for unstructured documents, where we use text.

Figure 1: IR System Components

II. INFORMATION RETRIEVAL ON THE WEB SEARCHES

Certain search criteria give better matches and therefore better results. The
dimensions of IR have grown vast because of different media, different types of
search applications, and different tasks; IR is no longer concerned only with
text, and web search has become central. Whether IR approaches to search and
evaluation are appropriate in all media is an emerging issue in IR. Information
retrieval involves
the following tasks and subtasks:
1) Ad-hoc search generalizes the criteria and searches all the records to find
   all the relevant documents for an arbitrary text query.
2) Filtering identifies the relevant user profiles for a new document. A profile
   is maintained for each user, and the relevant documents are categorized and
   displayed accordingly.
3) Classification identifies the relevant labels for documents from a list of
   classes.
4) Question answering refines the classification by automatically framing
   relevant questions that capture the focus of the individual user.
The tasks are described in Figure 2.

Figure 2: Proposed Model of Search

The field of IR deals with relevance and evaluation, and interacts with the user
to provide results according to their needs or query. IR is concerned with
effective ranking and testing, and with measuring the data available for
retrieval. A relevant document contains the information that a person was
looking for when they submitted a query to the search engine. Many factors
influence a person's decision about relevance, including task, context, novelty,
and style. Topical relevance (same topic) and user relevance (everything else)
are the dimensions that help in IR modeling. Retrieval models define a view of
relevance. The user provides information that the system can use to modify its
next search or next display. Relevance feedback reflects how well the system
understands the user's need, as well as the concepts and terms related to that
information need.

These phases use different techniques. Web pages contain links to other pages,
and by analyzing this web graph structure it is possible to determine a more
global notion of page quality. Notable successes in this area include the
PageRank algorithm [1], which globally analyzes the entire web graph and
provided the original basis for ranking in various search engines, and
Kleinberg's hyperlink algorithm (HITS) [2,3], which analyzes a local
neighborhood of the web graph containing an initial set of web pages matching
the user's query. Since then, several other link-based methods for ranking web
pages have been proposed, including variants of both PageRank and HITS [3,4],
and this remains an active research area in which there is still much fertile
ground to be explored.

This refers to recent work on hubs and authorities, which identifies a form of
equilibrium for WWW sources on a common theme or topic, and which explicitly
builds into the model the diversity of roles among different types of pages [2].
Some pages, the prominent sources of primary content, are considered to be the
authorities on the topic; other pages, equally essential to the structure,
assemble high-quality guides and resource lists that act as focused hubs,
directing users to recommended authorities. The nature of the linkage in this
framework is highly asymmetric. Hubs link heavily to authorities, may themselves
have very few incoming links, and authorities may well not link to other
authorities. This pattern, as the suggested model [2] notes, is completely
natural: relatively anonymous individuals are creating many good hubs on the
Web. A formal notion of equilibrium consistent with this model can be defined by
assigning each page two numbers, a hub weight and an authority weight, in such a
way that a page's authority weight is proportional to the sum of the hub weights
of the pages that link to it, and a page's hub weight is proportional to the sum
of the authority weights of the pages that it links to.

Adversarial classification [5] deals with spam on the Web. One particularly
interesting problem in web IR arises from the attempt by some commercial
interests to unduly raise the ranking of their web pages by engaging in various
forms of spamming [4]. Spam methods can be effective against traditional IR
ranking schemes that do not make use of link structure, but have more limited
utility in the context of global link analysis. Realizing this, spammers now
also use link spam, creating large numbers of web pages that contain links to
other pages whose rankings they wish to raise. Techniques to counter it must be
applied continually through automatic filters. Spam filtering in email [7] is
very popular. This technique is applied concurrently with the indexing of
documents.

III. AN APPROACH TO RETRIEVAL IN USENET ARCHIVE

The UseNet archive is one of the less visible document collections in the
context of general-purpose search engines; it is conservatively estimated to
contain at least 800 million documents. The UseNet archive, which includes the
20 newsgroups data set used in text classification tasks, is extremely
interesting. UseNet started as a loosely structured collection of groups that
people could post to. Over the years, it evolved into a large hierarchy of over
50,000 groups with topics ranging across many different dimensions. IR in the
context of UseNet articles raises some very interesting issues. One previously
explored possibility is to address retrieval in UseNet as a two-stage IR
problem: (1) find the most relevant newsgroup, and (2) find the most relevant
document within that newsgroup. This 20-year archive of
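
The two-stage view of UseNet retrieval described above (first select the most relevant newsgroup, then the most relevant document within it) can be illustrated with a minimal sketch. The paper does not specify a scoring function, so the term-overlap score used here is an assumption made purely for illustration, and the function names and the toy archive are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def overlap_score(query_terms, text):
    """Count occurrences of the query terms in the text (assumed scoring rule)."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in query_terms)


def two_stage_search(query, newsgroups):
    """Stage 1 picks the best-matching newsgroup, stage 2 the best post in it.

    `newsgroups` maps a group name to a list of post strings (hypothetical data).
    """
    terms = tokenize(query)
    # Stage 1: score each newsgroup by the overlap summed over all of its posts.
    best_group = max(
        newsgroups,
        key=lambda g: sum(overlap_score(terms, post) for post in newsgroups[g]),
    )
    # Stage 2: rank only the posts inside the selected newsgroup.
    best_post = max(
        newsgroups[best_group], key=lambda post: overlap_score(terms, post)
    )
    return best_group, best_post


# Toy, made-up archive used only to exercise the function.
archive = {
    "rec.autos": ["my car engine stalls in cold weather", "best tires for snow"],
    "comp.graphics": ["rendering engine for 3d graphics", "image file formats"],
}
group, post = two_stage_search("car engine", archive)
```

Restricting the second stage to a single newsgroup keeps the document ranking cheap, at the cost of missing relevant posts in other groups; a real system would likely use a stronger scoring model than raw term overlap.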
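
The hub and authority weighting described in Section II lends itself to a short worked example. The sketch below assumes the usual iterative scheme: repeatedly set each authority weight to the sum of the hub weights of pages linking to it, set each hub weight to the sum of the authority weights of pages it links to, and normalize. The fixed iteration count, the Euclidean normalization, and the toy graph are assumptions for illustration and are not taken from the paper.

```python
import math


def hits(links, iterations=50):
    """Iterate hub and authority weights on a directed link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of authority weights of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors; only the proportions between pages matter.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical toy graph: two hub-like pages pointing at two content pages.
toy_links = {
    "hub1": ["pageA", "pageB"],
    "hub2": ["pageA"],
    "pageA": [],
    "pageB": [],
}
hub_w, auth_w = hits(toy_links)
```

In this toy graph the page linked from both hubs ends up with the larger authority weight, and the hub that points to more authorities ends up with the larger hub weight, which matches the asymmetric linkage the text describes.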
