
Information retrieval is a complex process because there is no infallible way to
provide a direct connection between a user's query for information and
documents that contain the desired information. Information retrieval is based
on a match between the words used to formulate the query and the words used
to express concepts or ideas in a document. A search may fail because the
user does not correctly guess the words that a useful document would contain,
so important material is missed. Or, the user's search terms may appear in
retrieved documents that pertain to a subject other than the one intended by the
user, so material is retrieved which is not useful. Research in information
retrieval has aimed at developing systems which minimize these two types of
failures.

History of Information Retrieval

Almost as soon as computers were developed, information scientists suggested
that the new machines had the potential to perform text processing as well as
arithmetical operations. By representing text as ASCII characters, queries
formulated as character strings could be matched against the character strings
in documents. The first computer-based IR systems, which appeared in the
1950s, were based on punched cards. These were followed in the 1960s by
systems based on storage of the database on magnetic tape.
These first systems were hampered by the limited processing power of early
computers, and the limited capacity for and high cost of storage. They operated
offline, in a batch processing mode. It was not until the 1970s that IR
systems made it possible for users to submit their queries and obtain an
immediate response, allowing them to view the results and modify their queries
as needed. The development of magnetic disk storage and improvements in
telecommunications networks at this time made it possible to provide access to
IR systems nationwide.
At first very little textual information was available in electronic form, though
printed indexing and abstracting services for manual searching had been
available for many years. Over time, however, a significant back file of a
number of databases was created, making it realistic to do a retrospective
search for literature on a given topic.
One of the best known commercial information systems is DIALOG, which
currently has hundreds of databases containing many types of information—
newspapers, encyclopedias, statistical profiles, directories, and full-text and
bibliographic databases in the sciences, humanities, and business. Another
well-known commercial system is LEXIS-NEXIS, which is widely used for its
full-text collection in business and particularly law, since it provides computer
searching of statutes and case law.
Much early work in information retrieval was conducted at U.S. government
institutions such as the National Aeronautics and Space Administration (NASA)
and the National Library of Medicine (NLM), and included the forerunners of
today's systems. Versions of the DIALOG system were first operated by NASA
and the Atomic Energy Commission; it later became a commercial system. The
MEDLINE system operated by NLM today originated in an experimental system
for searching their medical database, MEDLARS.

Boolean Information Retrieval

For many years, the standard method of retrieval from commercially available
databases was Boolean retrieval. In Boolean retrieval, queries are constructed
by combining search terms with the Boolean operators AND, OR, and NOT.
The system returns those documents which exactly match the search terms and
the logical constraints.
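The exact-match behavior described above can be illustrated with a minimal sketch. It assumes a tiny invented collection and a simple index that maps each term to the set of documents containing it; the names DOCS, build_index, and term are hypothetical, and operational systems are far more elaborate. With such an index, AND, OR, and NOT correspond to set intersection, union, and difference.

```python
# Toy collection; the document IDs and texts are invented for illustration.
DOCS = {
    1: "grand canyon hiking trails",
    2: "grand piano tuning guide",
    3: "canyon river rafting",
    4: "the grand canyon by river",
}

def build_index(docs):
    """Map each term to the set of IDs of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

INDEX = build_index(DOCS)

def term(t):
    """Documents containing the term t (empty set if t never occurs)."""
    return INDEX.get(t.lower(), set())

# grand AND canyon: documents containing both terms
both = term("grand") & term("canyon")
# grand OR river: documents containing at least one of the terms
either = term("grand") | term("river")
# canyon NOT river: documents containing "canyon" but not "river"
canyon_not_river = term("canyon") - term("river")
```

A document either satisfies the query or it does not; nothing in the result sets indicates which matching document is the better one.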
In addition to the basic AND, OR, NOT operators, most operational Boolean
systems offer proximity operators so that searchers can specify that terms must
be adjacent or within a fixed distance of one another. This allows the
specification of a phrase as a search term, for example "grand ADJ canyon,"
meaning "grand" must be adjacent to "canyon" in retrieved documents. Many
other functions are commonly available, such as the ability to search specific
parts of a document, to search many databases simultaneously, or to remove
duplicates. However, the basic functionality in commercial systems remains the
standard Boolean search.
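One way a proximity operator such as ADJ might be supported is by recording word positions in the index. The sketch below is an assumption for illustration, not a description of any particular commercial system; the function names and the max_gap parameter are hypothetical.

```python
# Toy collection; texts are invented for illustration.
DOCS = {
    1: "the grand canyon at sunrise",
    2: "a grand tour of bryce canyon",
    3: "canyon views are grand",
}

def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that document]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def within(index, first, second, max_gap=1):
    """IDs of documents where `second` occurs 1..max_gap words after `first`.

    max_gap=1 behaves like an ADJ-style operator (adjacent, in order).
    """
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both terms can match.
    for doc_id in first_postings.keys() & second_postings.keys():
        second_positions = set(second_postings[doc_id])
        for p in first_postings[doc_id]:
            if any(p + gap in second_positions for gap in range(1, max_gap + 1)):
                hits.add(doc_id)
                break
    return hits

INDEX = build_positional_index(DOCS)
adjacent = within(INDEX, "grand", "canyon")           # grand ADJ canyon
nearby = within(INDEX, "grand", "canyon", max_gap=4)  # within four words
```

Here only document 1 contains the phrase "grand canyon", while document 2 also matches once the allowed distance is widened to four words.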

Problems with Boolean Retrieval

Boolean searching has been criticized because it requires searchers to
understand and apply basic Boolean logic in constructing their search
strategies, rather than posing their queries in natural language. Another
criticism is that Boolean searching requires that terms in the retrieved document
exactly match the query terms, so potentially useful information may be missed
because a document does not contain the specific term the searcher thought to
use. A Boolean search essentially divides a database into two parts: documents
that match and those that do not match the query. The number of documents
retrieved may be zero, if the query was very specific, or it could be tens of
thousands if very common terms were used. All documents retrieved are
treated equally, so the system cannot make recommendations about the order in
which they should be viewed. Because of its complexity, Boolean searching has
often been carried out by information professionals such as librarians who act
as research intermediaries for their patrons.
Boolean retrieval has also been criticized on the basis of performance. The
standard measures of performance for IR systems are precision and recall.
Precision is a measure of the ability of a system to retrieve only relevant
documents (those which match the subject of the user's query); it is the
fraction of the retrieved documents that are relevant. Recall is a
measure of the ability of the system to retrieve all the relevant documents in the
system; it is the fraction of the relevant documents that are retrieved.
Using these measures, the performance of Boolean systems has been
criticized as inadequate, leading to the continuing search for other ways to
retrieve information electronically.
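Both measures can be computed from two sets: the documents a search retrieved and the documents judged relevant. The following is a minimal sketch; the document IDs are invented, and in practice the full set of relevant documents is rarely known and must be estimated.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: 4 documents retrieved, 5 relevant, 2 in common.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
p, r = precision_recall(retrieved, relevant)
```

In this example half of what was retrieved is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).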

Alternatives to Boolean Retrieval


Since the 1960s and 1970s, IR researchers have explored ways to improve the
performance of information retrieval systems. Gerard Salton (1927–1995), a
professor at Cornell University, was a key figure in this research. For more than
thirty years, he and his students worked on the SMART system, a research
environment that allowed them to explore the impact of varying parameters in
the retrieval system. Using measures such as precision and recall, he and other
researchers found that performance improvements can be made by
implementing systems with features such as term weighting, ranked output
based on the calculation of query-document similarity, and relevance feedback.
In these systems, documents are represented by the terms they contain. The
list of terms is often referred to as a document vector and is used to position the
document in N-dimensional space (where N is the number of unique terms in
the entire collection of documents). This approach to IR is referred to as the
"vector space model."
For each term, a weight is calculated using the statistics of term frequency,
which represents the importance of the term in the document. A common
method is to calculate the tf×idf value (term frequency × inverse document
frequency). In this model the weight of a term in a document is proportional to
the frequency of occurrence of the term in the document, and inversely
proportional to the frequency with which the term occurs in the entire document
collection. In other words, a good index term is one that occurs frequently in a
particular document but infrequently in the database as a whole.
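The weighting idea can be made concrete with a short sketch. It assumes one common variant, weight = tf × log(N / df), where N is the number of documents and df is the number of documents containing the term; many other variants exist, and the text above does not commit to a specific formula. The collection and the function name tf_idf are invented for illustration.

```python
import math
from collections import Counter

# Toy collection; texts are invented for illustration.
DOCS = {
    "d1": "canyon canyon river",
    "d2": "river boat",
    "d3": "piano music",
}

def tf_idf(docs):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    weights = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[d] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

WEIGHTS = tf_idf(DOCS)
```

In document d1, "canyon" occurs twice and appears nowhere else, so it receives a higher weight than "river", which occurs once and is shared with d2.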
The query is also considered as a vector in N-dimensional space, and the
closeness of a document vector to the query vector is an indication of the
similarity, or degree of match, between them. This closeness is quantified with
a similarity function, commonly the cosine measure (the cosine of the angle
between the two vectors). The
results are sorted by similarity value and displayed in order, best match first.
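Ranking by the cosine measure can be sketched as follows. Vectors are stored sparsely as dictionaries from term to weight; the weights, document IDs, and function names are invented for illustration, and a real system would derive the weights from collection statistics.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return [(doc_id, score)] sorted by score, best match first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    # Sort by descending score; break ties by document ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored]

# Invented weights for three documents and one query.
DOC_VECS = {
    "d1": {"grand": 1.0, "canyon": 2.0},
    "d2": {"grand": 1.0, "piano": 2.0},
    "d3": {"river": 1.0, "boat": 1.0},
}
QUERY = {"grand": 1.0, "canyon": 1.0}
RANKING = rank(QUERY, DOC_VECS)
```

Unlike a Boolean match, every document receives a score: d1 shares both query terms and ranks first, d2 shares one and ranks second, and d3 shares none and scores zero.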
The relevance feedback feature allows the user to examine documents and
make some judgments about their relevance. This information is used to
recalculate the weights and rerank the documents, improving the usefulness of
the document display.
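One well-known way to use such judgments is Rocchio's method, which moves the query vector toward documents judged relevant and away from those judged nonrelevant. The sketch below uses that technique as an illustration; the text above does not specify which feedback method a given system uses, and the parameter values alpha, beta, and gamma are illustrative choices only.

```python
def mean_weight(term, vectors):
    """Average weight of `term` across a list of sparse vectors."""
    if not vectors:
        return 0.0
    return sum(v.get(term, 0.0) for v in vectors) / len(vectors)

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a revised query: alpha*q + beta*mean(rel) - gamma*mean(nonrel).

    Terms whose revised weight is not positive are dropped.
    """
    terms = set(query)
    for vec in relevant + nonrelevant:
        terms.update(vec)
    new_query = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * mean_weight(t, relevant)
             - gamma * mean_weight(t, nonrelevant))
        if w > 0:
            new_query[t] = w
    return new_query

# Invented example: the user marks one document relevant, one nonrelevant.
QUERY = {"grand": 1.0, "canyon": 1.0}
RELEVANT = [{"grand": 1.0, "canyon": 2.0, "hiking": 1.0}]
NONRELEVANT = [{"grand": 1.0, "piano": 2.0}]
UPDATED = rocchio(QUERY, RELEVANT, NONRELEVANT)
```

The revised query gives more weight to "canyon", gains the new term "hiking" from the relevant document, and does not pick up "piano". Reranking the collection against the revised query then reflects the user's judgments.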
These systems allow the user to state an information need in natural language,
rather than constructing a formal query as required by Boolean systems. The
ranked output also imposes an order on the documents retrieved, so that the
first documents to be viewed are most likely to be relevant. The search is
modified automatically based on the user's feedback to the system.
More recently, information retrieval systems have been developed to search the
World Wide Web. These search engines use software programs called crawlers
to locate pages on the web, which are then indexed on a centralized server. The
index is used to answer queries submitted to the web search engine. The
matching algorithms used to match queries with web pages are based on the
Boolean or vector space model.
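The crawl-then-index arrangement can be sketched without any network access by standing in for the web with a small in-memory table of pages and links. Everything here is hypothetical: the URLs, page texts, and function names are invented, and real crawlers must also handle fetching, politeness rules, and scale.

```python
from collections import deque

# Stand-in for the web: url -> (page text, outgoing links). All invented.
WEB = {
    "a.example/home": ("information retrieval basics",
                       ["a.example/boolean", "a.example/vector"]),
    "a.example/boolean": ("boolean retrieval operators", ["a.example/home"]),
    "a.example/vector": ("vector space retrieval model", []),
    "a.example/orphan": ("unlinked page", []),
}

def crawl(seed, web):
    """Follow links breadth-first from `seed`; return {url: text} for pages found."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Central index: term -> set of URLs whose text contains the term."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

PAGES = crawl("a.example/home", WEB)
INDEX = build_index(PAGES)
RESULTS = INDEX.get("retrieval", set())  # answer a one-term query from the index
```

The page that no other page links to is never discovered, which illustrates why different engines, starting from different places, end up indexing different parts of the web.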
Individual search engines vary in terms of the information on the web page that
they index, the factors used in assigning term weights, and the ranking
algorithm used. Some search engines index information extracted from
hyperlinks as well as from the text itself. Because information on the search
engine is usually proprietary, details of the algorithms are not readily
available. Comparisons of retrieval performance are also difficult because the
systems index different parts of the web and because they undergo constant
change. Recall is impossible to measure because the potential number of
pages relevant to a query is so large.

The Future of Information Retrieval

Researchers continue to improve the performance of information retrieval
systems. An ongoing series of experiments called TREC (Text REtrieval
Conference) is conducted annually by the National Institute of
Standards and Technology to encourage research in information retrieval and
its use in real-world systems.
One long-term goal is to develop systems that do more than simply identify
useful documents. By considering a database as a knowledge base rather than
simply a collection of documents, it may be possible to design retrieval systems
that can interpret documents and use the knowledge they contain to answer
questions. This will require developments in artificial intelligence (AI), natural
language processing, expert systems, and related fields. Research so far has
concentrated primarily on relatively narrow subject areas, but the goal is to
create systems that can understand and respond to questions in broad subject
areas.
see also Boolean Algebra; E-commerce; Search Engines; World Wide Web.
Edie M. Rasmussen

Bibliography

Bourne, Charles P. "On-line Systems: History, Technology and Economics."
Journal of the American Society for Information Science 31 (1980): 155–160.
Hahn, Trudi Bellardo. "Pioneers of the Online Age." Information Processing and
Management 32 (1996): 33–48.
Korfhage, Robert R. Information Storage and Retrieval. New York, NY: John
Wiley and Sons, 1997.
Lancaster, F. Wilfred, and Amy J. Warner. Information Retrieval Today.
Arlington, VA: Information Resources Press, 1993.
Meadow, Charles T. Text Information Retrieval Systems. San Diego, CA:
Academic Press, 1992.
Salton, Gerard. Automatic Text Processing: The Transformation, Analysis, and
Retrieval of Information by Computer. Reading, MA: Addison-Wesley Publishing
Company, 1989.
