INFORMATION RETRIEVAL ON THE WORLD WIDE WEB

Venkat N. Gudivada, Dow Jones Markets
Vijay V. Raghavan, University of Southwestern Louisiana
William I. Grosky, Wayne State University
Rajesh Kasanagottu, University of Missouri

Effective search and retrieval are enabling technologies for realizing the full potential of the Web. The authors examine relevant issues, including methods for representing document content. They also compare available search tools and suggest methods for improving retrieval effectiveness.
The World Wide Web is a very large distributed digital information space. From its origins in 1991 as an organization-wide collaborative environment at CERN for sharing research documents in nuclear physics, the Web has grown to encompass diverse information resources: personal home pages; online digital libraries; virtual museums; product and service catalogs; government information for public dissemination; research publications; and Gopher, FTP, Usenet news, and mail servers. Some estimates suggest that the Web currently includes about 150 million pages and that this number doubles every four months.

The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. With powerful workstations and parallel processing technology, efficiency is not a bottleneck. In fact, some existing search tools sift through gigabyte-size precompiled Web indexes in a fraction of a second. But retrieval effectiveness is a different matter. Current search tools retrieve too many documents, of which only a small fraction are relevant to the user query. Furthermore, the most relevant documents do not necessarily appear at the top of the query output order.

Few details concerning system architectures, retrieval models, and query-execution strategies are available for commercial search tools. The drive to preserve proprietary information has promoted the view that developing Web search tools is esoteric rather than rational. In this article, we hope to promote innovative research and development in this area by offering a systematic perspective on the progress and challenges in searching the Web.
We begin with a brief discussion of navigation strategies for searching the Web, followed by a review of methods for representing the information content of Web documents and models for retrieving it. We then classify, describe, and compare current search tools and services, and conclude by examining some techniques for improving their retrieval effectiveness.
TRAVERSING THE WEB
One way to find relevant documents on the Web is to launch a Web robot (also called a wanderer, worm, walker, spider, or knowbot). These software programs receive a user query, then systematically explore the Web to locate documents, evaluate their relevance, and return a rank-ordered list of documents to the user. The vastness and exponential growth of the Web make this approach impractical for every user query.

An alternative is to search a precompiled index built and updated periodically by Web robots. The index is a searchable archive that gives reference pointers to Web documents. This is obviously more practical, and many existing search tools are based on this approach.

Generating a comprehensive index requires systematic traversal of the Web to locate all documents. The Web's structure is similar to that of a directed graph, so it can be traversed using graph-traversal algorithms. Because Web servers and clients use the client-server paradigm to communicate, it is possible for a robot executing on a single computer to traverse the entire Web. There are currently three traversal methods:
• Providing the robot a "seed URL" to initiate exploration. The robot indexes the seed document, extracts URLs pointing to other documents, then examines each of these URLs recursively in a breadth-first or depth-first fashion. (A minimal robot along these lines is sketched after this list.)

• Starting with a set of URLs determined on the basis of a Web site's popularity and searching recursively. Intuitively, we can expect a popular site's home page to contain URLs that point to the most frequently sought information on the local and other Web servers.

• Partitioning the Web space based on Internet names or country codes and assigning one or more robots to explore the space exhaustively. This method is more widely used than the first two.

The frequency of Web traversal is another design variable for Web robots, with important implications for the currency and completeness of the index.
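To make the first method concrete, here is a minimal breadth-first robot. It assumes the third-party Python packages requests and beautifulsoup4; the function and parameter names (index_document, max_pages) are illustrative stand-ins, not part of any tool described in this article.

```python
# A minimal sketch of the seed-URL traversal method: breadth-first
# exploration starting from a single document.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def index_document(url, text):
    print("indexed", url)  # placeholder for the real indexing step

def crawl(seed_url, max_pages=100):
    visited = set()
    frontier = deque([seed_url])  # FIFO queue gives breadth-first order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable documents
        index_document(url, page.text)
        # extract URLs pointing to other documents and enqueue them
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            frontier.append(urljoin(url, anchor["href"]))

crawl("https://example.com/")
```

Replacing popleft() with pop() turns the queue into a stack, giving the depth-first variant mentioned above.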
INDEXING WEB DOCUMENTS
We can view effective Web searches as an information retrieval problem [1,2]. IR problems are characterized by a collection of documents and a set of users who perform queries on the collection to find a particular subset of it. This differs from database problems, for example, where the search and retrieval terms are precisely structured. In the IR context, indexing is the process of developing a document representation by assigning content descriptors or terms to the document. These terms are used in assessing the relevance of a document to a user query, and they contribute directly to the retrieval effectiveness of an IR system.

IR systems include two types of terms: objective and nonobjective. Objective terms are extrinsic to semantic content, and there is generally no disagreement about how to assign them. Examples include author name, document URL, and date of publication. Nonobjective terms, on the other hand, are intended to reflect the information manifested in the document, and there is no agreement about the choice or degree of applicability of these terms; thus, they are also known as content terms. Indexing in general is concerned with assigning nonobjective terms to documents. The assignment may optionally include a weight indicating the extent to which the term represents or reflects the information content.

The effectiveness of an indexing system is controlled by two main parameters. Indexing exhaustivity reflects the degree to which all the subject matter manifested in a document is actually recognized by the indexing system. When the indexing system is exhaustive, it generates a large number of terms to reflect all aspects of the subject matter present in the document; when it is nonexhaustive, it generates fewer terms, corresponding to the major subjects in the document. Term specificity refers to the breadth of the terms used for indexing [2]. Broad terms retrieve many useful documents along with a significant number of irrelevant ones; narrow terms retrieve fewer documents and may miss some relevant items.

The effect of indexing exhaustivity and term specificity on retrieval effectiveness can be explained by two parameters used for many years in IR problems:
• Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection.

• Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved.

Ideally, you would like to achieve both high recall and high precision. In reality, you must strike a compromise. Indexing terms that are specific yields higher precision at the expense of recall; indexing terms that are broad yields higher recall at the cost of precision. For this reason, an IR system's effectiveness is measured by the precision parameter at various recall levels. (These two measures are illustrated in the sketch below.)

Indexing can be performed either manually or automatically. The sheer size of the Web, together with the diversity of its subject matter, makes manual indexing impractical. Automatic indexing does not require the tightly controlled vocabularies that manual indexers use, and it offers the potential to represent many more aspects of a document than manual indexing can. However, it also remains at a primitive level of development, despite many years of study. (For details on current ways to automatically assign content terms to documents, see the sidebar, "Automatic Indexing Methods.")
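As a concrete illustration of the two effectiveness measures defined above, the following sketch computes recall and precision for one query result; the document identifiers are invented for the example.

```python
# Recall and precision over sets of document identifiers (illustrative data).
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}   # documents returned for a query
relevant = {"d2", "d4", "d7"}          # documents actually relevant
print(recall(retrieved, relevant))     # 2/3: one relevant document was missed
print(precision(retrieved, relevant))  # 2/4: half the output is irrelevant
```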
INFORMATION RETRIEVAL MODELS
An IR model is characterized by four parameters:

• representations for documents and queries,

• matching strategies for assessing the relevance of documents to a user query (one instance is sketched after this list),

• methods for ranking query output, and

• mechanisms for acquiring user-relevance feedback.

IR models can be classified into four types: set-theoretic, algebraic, probabilistic, and hybrid models. In the following sections, we describe instances of each type in the context of the IR model parameters.
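As a concrete instance of the matching-strategy parameter, the sketch below computes cosine similarity between term-weight vectors, the canonical matching strategy of the vector-space family of algebraic models. The weight values and document names are assumed for illustration; the article does not prescribe this particular scheme.

```python
import math

def cosine(query, document):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * document.get(t, 0.0) for t, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Rank documents by similarity to the query, highest first.
docs = {"d1": {"web": 2.0, "robot": 1.0}, "d2": {"library": 1.5, "web": 0.5}}
query = {"web": 1.0, "robot": 1.0}
print(sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True))
```

The same function serves two of the four parameters at once: it assesses relevance, and sorting on its score produces the query output ranking.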
AUTOMATIC INDEXING METHODS

The automatic assignment of content terms to documents can be based on single or multiple terms.
Single-Term Indexing
The term set of the document includes its set of words and their frequency. Words that perform strictly grammatical functions are compiled into a stop list and removed. The term set can also be refined by stemming to remove word suffixes.

Approaches to assigning weights for single terms may be grouped into the following categories: statistical, information-theoretic, and probabilistic. While the first two categories use only document and collection properties, the probabilistic approaches require user input in the form of relevance judgments.
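A minimal sketch of term-set construction with stop-list removal and suffix stripping follows; the stop list is a tiny illustrative sample, and the stemmer is a crude stand-in for a real one such as Porter's.

```python
# Build a term set (term -> frequency) from raw text.
from collections import Counter

STOP_LIST = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def stem(word):
    # naive suffix removal; a real system would use a proper stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_set(text):
    words = [w for w in text.lower().split() if w.isalpha()]
    return Counter(stem(w) for w in words if w not in STOP_LIST)

print(term_set("The robots are indexing the indexed documents"))
```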
Statistical methods. Assume that we have $N$ documents in a collection. Let $tf_{ij}$ denote the term frequency, which is a function of the frequency of the term $j$ in document $i$.

Indexing based on term frequency fulfills one indexing aim, namely, recall. However, terms that are concentrated in a few documents of a collection can be used to improve precision by distinguishing documents in which they occur from those in which they do not. Let $df_j$ denote the document frequency of the term $j$ in a collection of $N$ documents, which is the number of documents in which the term occurs. Then the inverse document frequency, given by $\log(N / df_j)$, is an appropriate indicator of $j$ as a document discriminator.

The term-frequency and inverse-document-frequency components can be combined into a single frequency-based indexing model [1,2], where the weight of a term $j$ in document $i$, denoted $w_{ij}$, is given by

$$w_{ij} = tf_{ij} \cdot \log(N / df_j)$$

Another statistical approach to indexing is based on term discrimination. This approach views each document as a point in the document space. As the term sets for two documents become more similar, the corresponding points in the document space become closer (that is, the density of the document space increases) and vice versa.

Under this scheme, we can approximate the value of a term as a document discriminator based on the type of change that occurs in the document space when a term is introduced to the collection. We can quantify this change according to the increase or decrease in the average distance between the documents. A term has a good discrimination value if it increases the average distance between the documents; in other words, terms with good discrimination value decrease the density of the document space. The term-discrimination value of a term $j$, denoted $dv_j$, is then computed as the difference of the document-space densities before and after the term $j$ is introduced. The net effect is that high-frequency terms have negative discrimination values, medium-frequency terms have positive discrimination values, and low-frequency terms tend to have discrimination values close to zero [1]. A term-weighting scheme such as $w_{ij} = tf_{ij} \cdot dv_j$ is used to combine term frequency and discrimination values.
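The frequency-based model above translates directly into code. The following sketch computes $w_{ij} = tf_{ij} \cdot \log(N / df_j)$ for a toy collection; the tokenized documents are invented for illustration.

```python
# Frequency-based indexing: tf * idf weights for each term of each document.
import math
from collections import Counter

def tfidf_weights(documents):
    """documents: list of term lists; returns one {term: weight} dict per doc."""
    N = len(documents)
    df = Counter()
    for terms in documents:
        df.update(set(terms))  # document frequency counts each doc once
    weights = []
    for terms in documents:
        tf = Counter(terms)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["web", "search", "robot"], ["web", "index"], ["robot", "index", "web"]]
print(tfidf_weights(docs))  # "web" occurs everywhere, so its weight is zero
```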
Information-theoretic methods. In information theory, the least-predictable terms carry the greatest information value [3]. Least-predictable terms are those that occur with the smallest probabilities. Information-theory concepts have been used to derive a measure of term usefulness for indexing, called the signal-noise ratio. This method favors terms that are concentrated in particular documents; its properties are therefore similar to those of inverse document frequency.
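The article does not spell out the signal-noise formula, so the sketch below uses one common formulation attributed to Salton, in which a term's noise measures how evenly its occurrences spread over the collection and its signal is the logarithm of its total frequency minus that noise; treat the exact form as an assumption.

```python
import math

def signal(term, documents):
    """Signal value of a term over tokenized documents (assumed formulation).

    noise = sum over docs of (tf/F) * log2(F/tf); signal = log2(F) - noise,
    where F is the term's total collection frequency. Terms concentrated in
    a few documents score high; evenly spread terms score near zero.
    """
    freqs = [doc.count(term) for doc in documents]
    F = sum(freqs)
    if F == 0:
        return 0.0  # term absent from the collection
    noise = sum((f / F) * math.log2(F / f) for f in freqs if f > 0)
    return math.log2(F) - noise
```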
Probabilistic methods. Probabilistic approaches require a training set of documents obtained by asking users to provide relevance judgments with respect to query results [4]. The training set is used to compute term weights by estimating conditional probabilities that a term occurs given that a document is relevant (or irrelevant). Assume a collection of $N$ documents, of which $R$ are relevant to the user query, $r$ of the relevant documents contain term $t$, and $t$ occurs in $f$ documents in all. Two conditional probabilities are estimated for each term as follows:

Pr[$t$ in document | document is relevant] = $r / R$

Pr[$t$ in document | document is irrelevant] = $(f - r) / (N - R)$

From these estimates, Bayes' theorem is used, under certain assumptions, to derive the weight of term $t$.
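The closing formula is cut off in this copy of the article. Under the usual term-independence assumptions, the Bayes-derived weight takes the standard log-odds form shown below; treat it as a reconstruction rather than the authors' exact expression.

```python
import math

def term_weight(N, R, r, f):
    """Log-odds weight of a term (reconstructed standard form).

    N: documents in the collection; R: relevant documents;
    r: relevant documents containing the term; f: documents containing it.
    """
    p = r / R                # Pr[term present | relevant]
    q = (f - r) / (N - R)    # Pr[term present | irrelevant]
    return math.log((p * (1 - q)) / (q * (1 - p)))

print(term_weight(N=1000, R=20, r=15, f=120))  # positive: term signals relevance
```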