RETRIEVAL SYSTEM
PRIMARY IR PROBLEMS
The difference between how a user expresses what information they are looking for and the
way the author of the item expressed the information being presented. In other words, the
challenge is the mismatch between the language of the user and the language of the author.
The inability to accurately create a good query. In addition to the complexities in generating a
query, quite often the user is not an expert in the area that is being searched and lacks domain
specific vocabulary unique to that particular subject area. The user starts the search process
with a general concept of the information required, but does not have a focused definition of
exactly what is needed.
How to effectively represent the possible items of interest identified by the system so the user
can focus on the ones most likely to be of value.
Notes
The term “item” shall be used to define a specific information object. This could be a textual
document, a news item from an RSS feed, an image, a video program or an audio program.
A user will have an information need and will translate the semantics of that need into the
vocabulary they normally use, which they present as a query.
OBJECTIVES OF IRS
The general objective of an Information Retrieval System is to minimize the time it takes for a
user to locate the information they need. The goal is to provide the information needed to
satisfy the user’s question.
The times that are candidates to be minimized in an Information Retrieval System are the time to
create the query, the time to execute the query, the time to select which of the items returned by
the query the user wants to review in detail, and the time to determine if a returned item is of
value.
In information retrieval the term “relevant” is used to represent an item containing the needed
information. In reality the definition of relevance is not a binary classification but a continuous
function. Items can exactly match the information need or partially match the information need.
From a user’s perspective “relevant” and “needed” are synonymous. From a system perspective,
information could be relevant to a search statement (i.e., matching the criteria of the search
statement) even though it is not needed/relevant to user (e.g., the user already knew the
information or just read it in the previous item reviewed).
Relevant documents are those that contain some information that helps answer the user’s
information need. Non-relevant documents do not contain any useful information. Using these
definitions the two primary metrics used in evaluating information retrieval systems can be
defined. They are Precision and Recall:
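With those definitions, precision is the fraction of retrieved items that are relevant, and recall is the fraction of the collection's relevant items that are retrieved. A minimal sketch in Python (the item identifiers are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    Precision = |retrieved AND relevant| / |retrieved|
    Recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned items are relevant; 3 of the 6 relevant items were found.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
```

Note the trade-off the two metrics capture: returning more items tends to raise recall while lowering precision.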
The semantic concept is that the closer two terms are found in a text the more likely they are related in
the description of a particular concept.
Notes
Token -- (LINGUISTICS) an individual occurrence of a linguistic unit in speech or writing,
as contrasted with the type or class of linguistic unit of which it is an instance.
-- (COMPUTING) a sequence of bits passed continuously between nodes in a fixed order
and enabling a node to transmit information.
The masking may be in the front, at the end, at both front and end, or imbedded.
The first three of these cases are called suffix search, prefix search and imbedded character
string search, respectively.
An imbedded variable length don't care is seldom used.
The symbol “*” represents a variable length don't care.
“*COMPUTER” Suffix Search
“COMPUTER*” Prefix Search
“*COMPUTER*” Imbedded String Search
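The three masking styles above map directly onto shell-style wildcard patterns. A sketch using Python's standard fnmatch module (the term list is illustrative):

```python
from fnmatch import fnmatch

terms = ["COMPUTER", "COMPUTERS", "MINICOMPUTER", "MICROCOMPUTERS", "COMPILER"]

# Masking in front: matches terms that END with the fixed string.
suffix = [t for t in terms if fnmatch(t, "*COMPUTER")]

# Masking at the end: matches terms that START with the fixed string.
prefix = [t for t in terms if fnmatch(t, "COMPUTER*")]

# Masking at both front and end: matches terms that CONTAIN the string.
embedded = [t for t in terms if fnmatch(t, "*COMPUTER*")]
```

In a real system the patterns would be applied against the dictionary of processing tokens rather than a small list, since scanning every term is expensive.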
NUMERIC AND DATE RANGES
Term masking is useful when applied to words, but does not work for finding ranges of
numbers or numeric dates.
To find numbers larger than “125,” using a term “125*” will not find any number except those
that begin with the digits “125”.
Systems, as part of their normalization process, characterize words as numbers or dates.
This allows for specialized numeric or date range processing against those words.
A user could enter anything from inclusive ranges (e.g., “125–425” or “4/2/93–5/2/95” for
numbers and dates) to open-ended ranges (“>125,” “<=233,” representing “greater than 125” or
“less than or equal to 233”) as part of a query.
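Once tokens are normalized as numbers, range operators can be evaluated with ordinary comparisons. A minimal sketch for numeric ranges (the mini-grammar and operator set are assumptions; date ranges would normalize dates to comparable values first, and negative numbers would need a smarter range parser):

```python
import operator

# Map query operators to comparison functions (illustrative mini-grammar).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def match_range(value, expr):
    """True if a numeric field value satisfies a range expression
    such as '125-425', '>125' or '<=233'."""
    for op in ("<=", ">=", "<", ">"):      # check two-character operators first
        if expr.startswith(op):
            return OPS[op](value, float(expr[len(op):]))
    lo, _, hi = expr.partition("-")        # inclusive range "lo-hi"
    return float(lo) <= value <= float(hi)

hits = [v for v in (100, 125, 233, 300, 500) if match_range(v, "125-425")]
```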
VOCABULARY BROWSE
A capability used first in databases in the 1980s.
The concept was to assist the user in creating a query by providing the user with an
alphabetical sorted list of terms in a field along with the number of database records the term
was found in.
This helped the user in two different ways.
The first was that, by looking at the list surrounding the word they were interested in, users
could discover misspellings they wanted to include in their query.
It also showed the number of records each term was found in, allowing users to add
additional search terms if there were going to be too many hits.
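A vocabulary browse is essentially a window over the sorted dictionary, with record counts attached. A small sketch (the terms and counts are made-up illustrative data):

```python
from collections import Counter
import bisect

# Toy dictionary: term -> number of records containing it.
term_counts = Counter({
    "computation": 35, "compute": 12, "computer": 87,
    "computor": 2,     # a likely misspelling the user may want to include
    "concept": 9, "copper": 4,
})

def browse(term, width=2):
    """Return the alphabetically sorted terms surrounding `term`,
    each paired with its record count."""
    vocab = sorted(term_counts)
    i = bisect.bisect_left(vocab, term)
    window = vocab[max(0, i - width): i + width + 1]
    return [(t, term_counts[t]) for t in window]
```

Browsing around "computer" surfaces the rare variant "computor" with its low count, exactly the misspelling-discovery behavior described above.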
This concept has recently been carried over to Information Retrieval Systems with the query
expansion capabilities provided by Google.
In this case the system is not trying to show misspellings or the number of items a search term
is found in.
Instead the system is trying to help the user determine additional modifiers (additional terms)
they can add to their query to make it more precise based upon data in the database and what
other users search on.
DATA STRUCTURES
There are usually two major data structures in any information system.
One structure stores and manages the received items in their normalized form and is the
version that is displayed to the user --- “document manager”
The other major data structure contains the processing tokens and associated data (e.g., index)
to support search --- “document search manager”
The results of a search are references to the items that satisfy the search statement, which are
passed to the document manager for retrieval.
The most common data structure encountered in both database and information systems is the
inverted file system.
It minimizes secondary storage access when multiple search terms are applied across the total
database.
All commercial and most academic systems use inversion as the searchable data structure.
VARIANTS OF THE SEARCHABLE DATA STRUCTURE
The first is the N-gram structure, which breaks processing tokens into smaller string units and
uses the token fragments for search. N-grams have demonstrated improved efficiencies and
conceptual manipulations over full word inversion.
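Generating character n-grams is a one-liner; the sketch below also shows why fragment overlap makes n-gram search tolerant of spelling variation in a way full-word inversion is not:

```python
def ngrams(token, n=3):
    """Break a processing token into overlapping n-character fragments."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# Spelling variants still share most of their fragments, so an n-gram
# index can match them even though the full words differ.
shared = set(ngrams("colour")) & set(ngrams("color"))
```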
PAT trees and arrays view the text of an item as a single long stream versus a juxtaposition of
words. Around this paradigm search algorithms are defined based upon text strings. The name
PAT is short for PATRICIA Trees (PATRICIA stands for Practical Algorithm To Retrieve
Information Coded In Alphanumerics)
Signature files are based upon the idea of fast elimination of non-relevant items reducing the
searchable items to a manageable subset. The subset can be returned to the user for review or
other search algorithms may be applied to it to eliminate any false hits that passed the signature
filter.
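One common realization of a signature file is superimposed coding: each word hashes to a few bit positions in a fixed-width signature, the per-word signatures are OR-ed together per item, and a query can match an item only if all of the query's bits are set in the item's signature. The width and hash scheme below are an illustrative sketch, not the specific method from the text:

```python
def signature(words, bits=16, k=2):
    """Superimposed coding: OR together k hashed bit positions per word.
    (The 16-bit width and hash choice are illustrative assumptions.)"""
    sig = 0
    for w in words:
        for seed in range(k):
            sig |= 1 << (hash((w, seed)) % bits)
    return sig

def maybe_contains(item_sig, query_sig):
    """A query can match only if all its bits appear in the item's
    signature; a True result may still be a false hit that a later
    full-text check must eliminate."""
    return query_sig & item_sig == query_sig
```

The test is cheap bitwise arithmetic, which is what makes the fast elimination of non-relevant items possible; the cost is the occasional false hit that passes the filter.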
The XML data structure is the most common structure used in sharing information between
systems and frequently how it is stored within a system. It is how items are received by the
ingest process and it is typically used if items are exported to other applications and systems.
The hypertext data structure is the basis behind URL references on the internet. More
important, however, is the logical expansion of the definition of an item when hypertext
references are used, and its potential impact on searches. The latest internet search systems
have started to make use of hypertext links to expand the information indexed in association
with items.
Most commonly it is used when indexing multimedia objects but there is a natural extension to
textual items.
INVERTED FILE STRUCTURE
The most common data structure used in both database management and IRS is the inverted
file structure.
Inverted file structures are composed of three basic files: the document file, the inversion lists
(sometimes called posting files) and the dictionary.
The name “inverted file” comes from its underlying methodology of storing an inversion of
the documents: inversion of the documents from the perspective that instead of having a set of
documents with words in them, you create a set of words that has the list of documents they
are found in. Each document in the system is given a unique numerical identifier. It is that
identifier that is stored in the inversion list.
The way to locate the inversion list for a particular word is via the dictionary.
The dictionary is typically a sorted list of all unique words (processing tokens) in the system
and a pointer to the location of its inversion list.
Dictionaries can also store other information used in query optimization such as the length of
inversion lists.
Additional information may be used from the item to increase precision and provide a more
optimum inversion list file structure.
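The dictionary and inversion (posting) lists described above can be sketched in a few lines. The sample documents are illustrative; here the dictionary also records the length of each inversion list, the query-optimization statistic mentioned above:

```python
def build_inverted_file(docs):
    """Build the dictionary and inversion (posting) lists for a set of
    documents keyed by their unique numeric identifiers."""
    postings = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):   # unique words per document
            postings.setdefault(word, []).append(doc_id)
    # Dictionary entry: the word, the length of its inversion list
    # (useful for query optimization), and the list itself.
    return {w: {"df": len(ids), "postings": sorted(ids)}
            for w, ids in postings.items()}

index = build_inverted_file({
    1: "information retrieval systems",
    2: "database systems",
    3: "information systems design",
})
```

A query then looks each term up in the dictionary, fetches its inversion list, and combines lists (e.g., intersecting them for AND), which is why inversion minimizes secondary storage access for multi-term searches.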
MATHEMATICAL ALGORITHMS
There are a number of mathematical concepts that form the basis behind a lot of the weighted
indexing techniques used in creating the indices for IRS.
The two most important theories are Bayesian theory and Shannon's information theory.
Bayesian theory is a conditional probability model that estimates the probability of one event
given that another event takes place.
This directly maps into the probability that a document is relevant given a specific query.
It additionally can be used to define clustering relationships used in automatic creation of
taxonomies associated with search results and item databases.
Shannon’s information model describes the “information value” given the frequency of
occurrence of an event.
In this case it can be related to how many items contain a particular word and how that affects
its importance (if a word is found in every item in the database it does not have much search
value).
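This idea underlies the inverse document frequency (IDF) weight used in many indexing schemes; a minimal sketch (the counts are illustrative, and log bases and smoothing vary between systems):

```python
import math

def idf(total_items, items_with_word):
    """Inverse document frequency: a Shannon-style information value
    for a word, based on how rare it is across the collection."""
    return math.log(total_items / items_with_word)

# A word found in every item carries no search value at all,
# while a rare word carries a large amount of information.
common = idf(1000, 1000)
rare = idf(1000, 10)
```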
BAYESIAN THEORY
The earliest mathematical foundation for information retrieval dates back to the 1700s, when
Thomas Bayes developed a theorem that relates the conditional and marginal probabilities of
two random events, called Bayes' Theorem.
It can be used to compute the posterior probability (probability assigned “after” relevant
evidence is considered) of random events.
For example, it allows one to consider the symptoms of a patient and use that information to
determine the probability of what is causing the illness.
Bayes’ theorem relates the conditional and marginal probabilities of events A and B, where
P(B) is not zero:

P(A|B) = P(B|A) × P(A) / P(B)
P(A) is called the prior probability of A. It is called “prior” because it does not take into
account any information about B.
P(A|B) is the conditional probability of A, given B. It is sometimes named the posterior
probability because it is computed after the evidence B is taken into account.
P(B|A) is the conditional probability of B given A.
P(B) is the marginal probability of B, and normalizes the result.
Putting the terms into words given our example helps in understanding the formula:
The probability of a patient having the flu, given that the patient has a high temperature, is
equal to the probability that you have a high temperature if you have the flu, times the prior
probability that you have the flu. This is then normalized by dividing by the probability that
you have a high temperature.
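The flu example can be worked through numerically. The probabilities below are made up for illustration; only the structure of the computation comes from the theorem:

```python
# Worked version of the flu example (numbers are illustrative assumptions).
p_flu = 0.05                 # P(A): prior probability of having the flu
p_temp_given_flu = 0.90      # P(B|A): probability of a high temperature given flu
p_temp = 0.15                # P(B): marginal probability of a high temperature

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_flu_given_temp = p_temp_given_flu * p_flu / p_temp
```

With these numbers the posterior probability of flu rises from the 5% prior to 30% once the high temperature is observed, showing how the evidence updates the prior.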
To relate Bayesian Theory to IR you need only to consider the search process.
A user provides a query, consisting of words, which represent the user’s preconceived attempt
to describe the semantics needed in an item to be retrieved for it to be relevant.
Since each user submits these terms to reflect their own idea of what is important, they imply a
preference ordering (ranking) among all of the documents in the database.
Applying this to Bayes’ Theorem we have: