
INFORMATION RETRIEVAL SYSTEM
PRIMARY IR PROBLEMS
 The difference between how a user expresses what information they are looking for and the
way the author of the item expressed the information he/she is presenting. In other words, the
challenge is the mismatch between the language of the user and the language of the author.
 The inability to accurately create a good query. In addition to the complexities in generating a
query, quite often the user is not an expert in the area that is being searched and lacks domain
specific vocabulary unique to that particular subject area. The user starts the search process
with a general concept of the information required, but does not have a focused definition of
exactly what is needed.
 How to effectively represent the possible items of interest identified by the system so the user
can focus in on the ones of most likely value.

Notes
 The term “item” shall be used to define a specific information object. This could be a textual
document, a news item from an RSS feed, an image, a video program or an audio program.
 A user will have an information need and will translate the semantics of their information need
into the vocabulary they normally use which they present as a query.
OBJECTIVES OF IRS
 The general objective of an Information Retrieval System is to minimize the time it takes for a
user to locate the information they need. The goal is to provide the information needed to
satisfy the user’s question.
The times that are candidates to be minimized in an Information Retrieval System are the time to
create the query, the time to execute the query, the time to select what items returned from the
query the user wants to review in detail and the time to determine if the returned item is of
value.
 In information retrieval the term “relevant” is used to represent an item containing the needed
information. In reality the definition of relevance is not a binary classification but a continuous
function. Items can exactly match the information need or partially match the information need.
From a user’s perspective “relevant” and “needed” are synonymous. From a system perspective,
information could be relevant to a search statement (i.e., matching the criteria of the search
statement) even though it is not needed/relevant to user (e.g., the user already knew the
information or just read it in the previous item reviewed).
 Relevant documents are those that contain some information that helps answer the user’s
information need. Non-relevant documents do not contain any useful information. Using these
definitions the two primary metrics used in evaluating information retrieval systems can be
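defined. They are Precision and Recall:

$$\text{Precision} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Total\_Retrieved}}$$

$$\text{Recall} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Possible\_Relevant}}$$

 Number_Possible_Relevant is the number of relevant items in the database,
Number_Total_Retrieved is the total number of items retrieved from the query, and
Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s
search need.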
 When a user executes a search and has 80% precision it means that 4 out of 5 items that are
retrieved are of interest to the user.
 From a user perspective, the lower the precision the more likely the user is wasting their
resource (time) looking at non-relevant items.
 In reality most users will only look at the first few pages of hit (search) results before deciding
to change their query strategy.
 Thus what is of more value in commercial systems is not the total precision but the precision
across the first 20–50 hits.
 But when comparing search systems the total precision is used.
 Recall is a very useful concept in comparing systems.
 It measures how well a search system is capable of retrieving all possible hits that exist in the
database.
 Unfortunately it is impossible to calculate except in very controlled environments.
 It requires in the denominator the total number of relevant items in the database.
 If the system could determine that number, then the system could return them.
 There have been some attempts to estimate the total relevant items in a database, but there are
no techniques that provide accurate enough results to be used for a specific search request.
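The two metrics, plus the "precision across the first hits" idea, can be sketched as follows. This is a minimal illustration, not a standard library API: the function names and the sample hit list and relevance judgments are hypothetical, and it assumes the full set of relevant items is known, which (as noted above) is only realistic in controlled test collections.

```python
def precision(retrieved, relevant):
    """Number_Retrieved_Relevant / Number_Total_Retrieved."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved)


def recall(retrieved, relevant):
    """Number_Retrieved_Relevant / Number_Possible_Relevant."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Precision computed over only the first k hits of a ranked list."""
    return precision(ranked[:k], relevant)


# Hypothetical ranked hit list and relevance judgments.
ranked_hits = ["d1", "d2", "d3", "d4", "d5"]
relevant_items = {"d1", "d2", "d4", "d5", "d8", "d9", "d10", "d11"}

print(precision(ranked_hits, relevant_items))          # 4 of 5 retrieved are relevant
print(recall(ranked_hits, relevant_items))             # 4 of 8 relevant were retrieved
print(precision_at_k(ranked_hits, relevant_items, 2))  # precision over the first 2 hits
```

In this made-up case the search has 80% precision but only 50% recall, which shows why the two numbers have to be read together.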
FUNCTIONAL OVERVIEW OF IRS
 A functional overview will help to better place the technologies in perspective and provide
additional insight into what an information system needs to achieve.
 An information retrieval system starts with the ingestion of information.
 There are multiple functions that are applied to the information once it has been ingested.
 The most obvious function is to store the item in its original format in an item database and
create a searchable index to allow for later ad hoc searching and retrieval of an item.
 Another operation that can occur on the item as it’s being received is “Selective Dissemination
of Information” (SDI).
 This function allows users to specify search statements of interest (called “Profiles”) and
whenever an incoming item satisfies the search specification, the item is stored in a user’s
“mail” box for later review.
 This is a dynamic filtering of the input stream for each user for the subset they want to look at
on a daily basis.
 Since it’s a dynamic process the mail box is constantly getting new items of possible interest.
 Associated with the Selective Dissemination of Information process is the “Alert” process.
 The alert process will attempt to notify the user whenever any new item meets the user’s
criteria for immediate action on an item.
 This helps the user in multitasking: they can do their normal daily tasks while being made
aware when there is something that requires immediate attention.
 Finally, the system automatically adds metadata and organizes the items into a logical view
based on a structured taxonomy.
 The user can then navigate the taxonomy to find items of interest.
 The indexing assigns additional descriptive, citational, and semantic metadata to an item.
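The Selective Dissemination of Information step described above can be sketched as a filter applied to each incoming item. This is only an illustration under simplifying assumptions: a profile is reduced to a set of terms that must all appear in the item, and the names `disseminate`, `profiles`, and `mailboxes` are hypothetical.

```python
from collections import defaultdict


def disseminate(item_text, profiles, mailboxes):
    """Store the item in the mail box of every user whose profile it satisfies."""
    tokens = set(item_text.lower().split())
    for user, required_terms in profiles.items():
        # Simplified profile: every term in the profile must occur in the item.
        if required_terms <= tokens:
            mailboxes[user].append(item_text)


# Hypothetical user profiles and incoming item stream.
profiles = {
    "alice": {"computer", "design"},
    "bob": {"retrieval"},
}
mailboxes = defaultdict(list)

incoming = [
    "New computer design announced",
    "Advances in information retrieval",
    "Weather report",
]
for item in incoming:
    disseminate(item, profiles, mailboxes)

print(dict(mailboxes))
```

A real system would evaluate full search statements rather than term sets, and an alert process would notify the user immediately instead of only filing the item.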
UNDERSTANDING SEARCH FUNCTIONS
 Boolean Logic
 Proximity
 Contiguous Word Phrases
 Fuzzy Searches
 Term Masking
 Numeric and Date Ranges
 Vocabulary Browse
BOOLEAN LOGIC
 Allows a user to logically relate multiple concepts together to define what information is
needed.
 Typically the Boolean functions apply to processing tokens identified anywhere within an
item.
 The typical Boolean operators are AND, OR, and NOT.
 Parentheses placed around portions of the search statement are used to overtly specify the
order of Boolean operations.
 If parentheses are not used, the system follows a default precedence ordering of operations
(e.g., typically NOT then AND then OR).
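The default precedence (NOT, then AND, then OR) and the effect of parentheses can be sketched with set operations over a term-to-items mapping. This is a minimal illustration, not how any particular system parses queries: the function `evaluate`, the tiny index, and the item identifiers are hypothetical, and NOT is treated as the complement against all items.

```python
import re


def evaluate(query, index, all_docs):
    """Evaluate a Boolean query; precedence is NOT, then AND, then OR."""
    tokens = re.findall(r"\(|\)|\w+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():            # lowest precedence
        result = parse_and()
        while peek() == "OR":
            take()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "AND":
            take()
            result = result & parse_not()
        return result

    def parse_not():           # highest precedence
        if peek() == "NOT":
            take()
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            result = parse_or()
            take()             # consume the closing ")"
            return result
        return set(index.get(tok.lower(), set()))

    return parse_or()


# Hypothetical index: processing token -> identifiers of items containing it.
index = {"computer": {1, 2, 3}, "design": {2, 3, 4}, "hardware": {3, 5}}
all_docs = {1, 2, 3, 4, 5}

print(evaluate("computer OR design AND NOT hardware", index, all_docs))
print(evaluate("(computer OR design) AND NOT hardware", index, all_docs))
```

The two queries differ only in parentheses, yet the first keeps item 3 (it matches "computer" on its own) while the second excludes it.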
PROXIMITY
 Restricts the distance allowed within an item between two search terms.

The semantic concept is that the closer two terms are found in a text the more likely they are related in
the description of a particular concept.

 Proximity is used to increase the precision of a search.


 If the terms COMPUTER and DESIGN are found within a few words of each other then the
item is more likely to be discussing the design of computers than if the words are paragraphs
apart.
 The typical format for proximity is:
TERM1 within “m” “units” of TERM2
where the distance operator “m” is an integer number and units are in Characters, Words,
Sentences, or Paragraphs.
 A special case of the Proximity operator is the Adjacent (ADJ) operator that normally has a
distance operator of one and a forward only direction.
 Another special case is where the distance is set to zero meaning within the same semantic
unit.
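A word-level proximity test can be sketched as below. It is an illustration only: the function name `within` and its `ordered` flag are hypothetical, the unit is fixed to words, and the item is assumed to be already split into lowercase tokens. The Adjacent operator then corresponds to a distance of one with forward-only direction.

```python
def within(tokens, term1, term2, m, ordered=False):
    """True if term1 occurs within m words of term2 in the token list.

    With ordered=True, term2 must follow term1 (forward-only direction).
    """
    positions1 = [i for i, tok in enumerate(tokens) if tok == term1]
    positions2 = [i for i, tok in enumerate(tokens) if tok == term2]
    for i in positions1:
        for j in positions2:
            distance = j - i
            if ordered:
                if 0 < distance <= m:
                    return True
            elif 0 < abs(distance) <= m:
                return True
    return False


# Hypothetical item text, already tokenized into words.
tokens = "the design of a computer requires careful design".split()

print(within(tokens, "computer", "design", 3))              # either direction
print(within(tokens, "computer", "design", 2))              # too far apart
print(within(tokens, "careful", "design", 1, ordered=True)) # ADJ-style match
```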
CONTIGUOUS WORD PHRASES
 A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a special
search operator.
 A CWP is two or more words that are treated as a single semantic unit.
 An example of a CWP is “United States of America”. It is four words that specify a search
term representing a single specific semantic concept (a country) that can be used with other
operators. Thus a query could specify “manufacturing” AND “United States of America”
which returns any item that contains the word “manufacturing” and the contiguous words
“United States of America”.
 A contiguous word phrase also acts like a special search operator that is similar to the
proximity (Adjacency) operator but allows for additional specificity.
 If only two terms are specified, the contiguous word phrase is identical to the proximity
operator with a directional distance of one word, that is, to the Adjacent operator.
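The example query above ("manufacturing" AND "United States of America") can be sketched as a phrase match combined with a single-term match. This is a minimal illustration with a hypothetical `contains_phrase` helper and a made-up item; it assumes simple whitespace tokenization and case-insensitive matching.

```python
def contains_phrase(tokens, phrase):
    """True if the words of the phrase occur contiguously, in order."""
    words = phrase.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


# Hypothetical item text.
item = "Manufacturing output in the United States of America rose last year"
tokens = item.lower().split()

# "manufacturing" AND "United States of America"
matches = "manufacturing" in tokens and contains_phrase(tokens, "United States of America")
print(matches)

# The same words in a different order do not form the phrase.
print(contains_phrase(tokens, "America of States United"))
```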
FUZZY SEARCHES
 Provides the capability to locate spellings of words that are similar to the entered search term.
 Primarily used to compensate for errors in the spelling of words.
 Fuzzy searching increases recall at the expense of decreasing precision (i.e., it can erroneously
identify terms as the search term).
 In the process of expanding a query term fuzzy searching includes other terms that have
similar spellings, giving more weight (in systems that rank output) to words in the database
that have similar word lengths and position of the characters as the entered term.
 A Fuzzy Search on the term “computer” would automatically include the following words
from the information database: “computer,” “compiter,” “conputer,” “computter,” “compute”.
 An additional enhancement may look up the proposed alternative spelling and, if it is a valid
word with a different meaning, include it in the search with a low ranking or not include it at
all (e.g., “commuter”).
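One common way to measure "similar spelling" is edit distance (the number of single-character insertions, deletions, or substitutions). The text does not say which similarity measure a given system uses, so the sketch below swaps in Levenshtein distance purely as an illustration; `fuzzy_expand` and the vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_expand(term, vocabulary, max_distance=1):
    """Return (distance, word) pairs for words close to the term, closest first."""
    scored = ((edit_distance(term, word), word) for word in vocabulary)
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# Hypothetical vocabulary taken from the examples in the text.
vocabulary = ["computer", "compiter", "conputer", "computter",
              "compute", "commuter", "company"]

for distance, word in fuzzy_expand("computer", vocabulary):
    print(distance, word)
```

Note that "commuter" is also only one edit away from "computer", which is exactly the kind of valid but unrelated word the enhancement above would rank low or drop.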
TERM MASKING
 Term masking is the ability to expand a query term by masking a portion of the term and
accepting as valid any processing token that maps to the unmasked portion of the term.
 There are two types of search term masking: fixed length and variable length. Sometimes they
are called fixed and variable length “don’t care” functions.
 Variable length “don’t care” allows masking of any number of characters within a processing
token.

Notes
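 Token -- (LINGUISTICS) an individual occurrence of a linguistic unit in speech or writing,
as contrasted with the type or class of linguistic unit of which it is an instance.
-- (COMPUTING, networking) a sequence of bits passed continuously between nodes in a fixed
order and enabling a node to transmit information.
-- (INFORMATION RETRIEVAL) a processing token is the unit of text, typically a word, that is
extracted from an item and used for indexing and search. This is the sense used here.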
 The masking may be in the front, at the end, at both front and end, or imbedded.
 The first three of these cases are called suffix search, prefix search and imbedded character
string search, respectively.
 An imbedded variable length don’t care is seldom used.
 The symbol “*” represents a variable length don’t care:
“*COMPUTER” Suffix Search
“COMPUTER*” Prefix Search
“*COMPUTER*” Imbedded String Search
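The three masking cases can be tried out with Python's standard `fnmatch` module, whose "*" wildcard also matches any number of characters. This is only a sketch: `mask_search` and the vocabulary are hypothetical, and `fnmatch` additionally supports "?" for a single character, which is not the notation used in the text.

```python
from fnmatch import fnmatchcase


def mask_search(pattern, vocabulary):
    """Return the processing tokens that match a masked search term."""
    return [token for token in vocabulary if fnmatchcase(token, pattern)]


# Hypothetical processing tokens found in the database.
vocabulary = ["COMPUTER", "COMPUTERS", "COMPUTING",
              "MINICOMPUTER", "MICROCOMPUTERS"]

print(mask_search("*COMPUTER", vocabulary))    # suffix search
print(mask_search("COMPUTER*", vocabulary))    # prefix search
print(mask_search("*COMPUTER*", vocabulary))   # imbedded string search
```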
NUMERIC AND DATE RANGES
 Term masking is useful when applied to words, but does not work for finding ranges of
numbers or numeric dates.
 To find numbers larger than “125,” using a term “125*” will not find any number except those
that begin with the digits “125”.
 Systems, as part of their normalization process, characterize words as numbers or dates.
 This allows for specialized numeric or date range processing against those words.
 A user could enter anything from inclusive ranges (e.g., “125–425” for numbers or
“4/2/93–5/2/95” for dates) to infinite ranges (“>125” or “<=233,” representing “Greater Than”
or “Less Than or Equal”) as part of a query.
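Once a token has been recognized as a number, a range expression can be turned into a simple test. The sketch below handles numbers only (dates would need their own parsing) and assumes the notations shown above; `parse_range` is a hypothetical helper, not part of any real query language.

```python
import operator
import re

_NUMBER = r"\d+(?:\.\d+)?"
_COMPARISONS = {">": operator.gt, ">=": operator.ge,
                "<": operator.lt, "<=": operator.le}


def parse_range(expression):
    """Return a predicate for '125-425', '>125', '<=233' style expressions."""
    expression = expression.strip().replace("–", "-")  # accept an en dash too

    match = re.fullmatch(rf"(>=|<=|>|<)\s*({_NUMBER})", expression)
    if match:
        compare = _COMPARISONS[match.group(1)]
        bound = float(match.group(2))
        return lambda value: compare(value, bound)

    match = re.fullmatch(rf"({_NUMBER})\s*-\s*({_NUMBER})", expression)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return lambda value: low <= value <= high   # inclusive range

    raise ValueError(f"unrecognized range expression: {expression!r}")


# Hypothetical numeric tokens extracted from items.
numbers = [12, 125, 126, 233, 425, 1250]

print([n for n in numbers if parse_range("125-425")(n)])
print([n for n in numbers if parse_range(">125")(n)])
print([n for n in numbers if parse_range("<=233")(n)])
```

Unlike the mask “125*”, the “>125” test finds 126, 233, and 425 as well as 1250.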
VOCABULARY BROWSE
 Vocabulary browse is a capability first used in databases in the 1980s.
 The concept was to assist the user in creating a query by providing the user with an
alphabetical sorted list of terms in a field along with the number of database records the term
was found in.
 This helped the user in two different ways.
 The first was by looking at the list surrounding the word the user was interested in, they could
discover misspellings they wanted to include in their query.
 It also would show them the number of records the term was found in allowing them to add
additional search terms if there were going to be too many hits.
 This concept has been carried over to IRS recently with the expansion capabilities provided by
Google.
 In this case the system is not trying to show misspellings or the number of items a search term
is found in.
 Instead the system is trying to help the user determine additional modifiers (additional terms)
they can add to their query to make it more precise based upon data in the database and what
other users search on.
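The original database-style vocabulary browse (an alphabetically sorted term list with record counts, viewed around the term of interest) can be sketched as follows. The function `browse`, the window size, and the counts are hypothetical values chosen for illustration.

```python
from bisect import bisect_left


def browse(term_counts, term, window=2):
    """Return (term, record count) pairs alphabetically surrounding a term."""
    terms = sorted(term_counts)
    i = bisect_left(terms, term)          # position of the term in sorted order
    low = max(0, i - window)
    high = min(len(terms), i + window + 1)
    return [(t, term_counts[t]) for t in terms[low:high]]


# Hypothetical field vocabulary: term -> number of records containing it.
term_counts = {
    "compiter": 1, "compute": 40, "computer": 312, "computers": 205,
    "computing": 97, "conputer": 2, "design": 150,
}

for term, count in browse(term_counts, "computer"):
    print(f"{term:<10} {count}")
```

The listing exposes the misspelling "compiter" next to the intended word, and the counts hint at whether more search terms are needed to cut down the number of hits.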
DATA STRUCTURES
 There are usually two major data structures in any information system.
 One structure stores and manages the received items in their normalized form and is the
version that is displayed to the user. This is the “document manager”.
 The other major data structure contains the processing tokens and associated data (e.g., index)
to support search. This is the “document search manager”.
 The results of a search are references to the items that satisfy the search statement, which are
passed to the document manager for retrieval.
 The most common data structure encountered in both data base and information systems is the
inverted file system.
 It minimizes secondary storage access when multiple search terms are applied across the total
database.
 All commercial and most academic systems use inversion as the searchable data structure.
VARIANTS OF THE SEARCHABLE DATA STRUCTURE
 The N-gram structure breaks processing tokens into smaller string units and uses the token
fragments for search. N-grams have demonstrated improved efficiencies and conceptual
manipulations over full word inversion.
 PAT trees and arrays view the text of an item as a single long stream versus a juxtaposition of
words. Around this paradigm search algorithms are defined based upon text strings. The name
PAT is short for PATRICIA Trees (PATRICIA stands for Practical Algorithm To Retrieve
Information Coded In Alphanumerics)
 Signature files are based upon the idea of fast elimination of non-relevant items reducing the
searchable items to a manageable subset. The subset can be returned to the user for review or
other search algorithms may be applied to it to eliminate any false hits that passed the signature
filter.
 The XML data structure is the most common structure used in sharing information between
systems and frequently how it is stored within a system. It is how items are received by the
ingest process and it is typically used if items are exported to other applications and systems.
 The hypertext data structure is the basis behind URL references on the internet. More
important is the logical expansion of the definition of an item when hypertext references are
used, and its potential impact on searches. The latest internet search systems have started to
make use of hypertext links to expand what information is indexed in association with items.
Most commonly it is used when indexing multimedia objects but there is a natural extension to
textual items.
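Of the variants above, the N-gram structure is the easiest to sketch: each processing token is broken into overlapping fragments of n characters, and the fragments (not the whole words) are what gets indexed. The example below uses trigrams (n = 3) without any word-boundary padding; both choices, and the helper names, are assumptions made for illustration.

```python
def ngrams(token, n=3):
    """Break a processing token into overlapping n-character fragments."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def build_ngram_index(vocabulary, n=3):
    """Map each n-gram to the set of tokens that contain it."""
    index = {}
    for token in vocabulary:
        for gram in ngrams(token, n):
            index.setdefault(gram, set()).add(token)
    return index


# Hypothetical processing tokens.
vocabulary = ["computer", "compute", "input"]
index = build_ngram_index(vocabulary)

print(ngrams("computer"))
print(sorted(index["put"]))   # every token sharing the fragment "put"
```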
INVERTED FILE STRUCTURE
 The most common data structure used in both database management and IRS is the inverted
file structure.
 Inverted file structures are composed of three basic files: the document file, the inversion lists
(sometimes called posting files) and the dictionary.
 The name “inverted file” comes from its underlying methodology of storing an inversion of
the documents: inversion of the documents from the perspective that instead of having a set of
documents with words in them, you create a set of words that has the list of documents they
are found in. Each document in the system is given a unique numerical identifier. It is that
identifier that is stored in the inversion list.
 The way to locate the inversion list for a particular word is via the dictionary.
 The dictionary is typically a sorted list of all unique words (processing tokens) in the system
and a pointer to the location of its inversion list.
 Dictionaries can also store other information used in query optimization such as the length of
inversion lists.
 Additional information may be used from the item to increase precision and provide a more
optimum inversion list file structure.
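The three files described above can be sketched with ordinary Python containers. This is an in-memory illustration only: the function names are hypothetical, tokenization is plain whitespace splitting, and the "pointer" is just a list position rather than a location in secondary storage.

```python
def build_inverted_file(documents):
    """Build the document file, the dictionary, and the inversion lists."""
    # Document file: each document gets a unique numerical identifier.
    document_file = dict(enumerate(documents, start=1))

    # Collect, for each processing token, the identifiers of documents with it.
    postings = {}
    for doc_id, text in document_file.items():
        for token in set(text.lower().split()):
            postings.setdefault(token, []).append(doc_id)

    # Dictionary: sorted tokens, each with a pointer to its inversion list
    # and the list length (useful for query optimization).
    inversion_lists = []
    dictionary = {}
    for token in sorted(postings):
        dictionary[token] = {"pointer": len(inversion_lists),
                             "length": len(postings[token])}
        inversion_lists.append(postings[token])
    return document_file, dictionary, inversion_lists


def lookup(term, dictionary, inversion_lists):
    """Locate a term's inversion list via the dictionary."""
    entry = dictionary.get(term.lower())
    return inversion_lists[entry["pointer"]] if entry else []


# Hypothetical documents.
documents = ["computer design", "design of a bit", "computer memory bit"]
document_file, dictionary, inversion_lists = build_inverted_file(documents)

print(lookup("computer", dictionary, inversion_lists))
print(lookup("design", dictionary, inversion_lists))
print(dictionary["bit"])
```

The search returns document identifiers only; the document file is consulted afterwards to retrieve the items themselves.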
MATHEMATICAL ALGORITHMS
 There are a number of mathematical concepts that form the basis behind a lot of the weighted
indexing techniques used in creating the indices for IRS.
 The two most important theories are Bayesian theory and Shannon’s information theory.
 Bayesian theory is a conditional model associated with probabilities that estimates the
probability of one event given another event takes place.
 This directly maps into the probability that a document is relevant given a specific query.
 It additionally can be used to define clustering relationships used in automatic creation of
taxonomies associated with search results and item databases.
 Shannon’s information model describes the “information value” given the frequency of
occurrence of an event.
 In this case it can be related to how many items contain a particular word and how that affects
its importance (if a word is found in every item in the database it does not have much search
value).
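Shannon's measure assigns an event with probability p an information value of -log2(p) bits. A rough sketch of how that relates to word importance is shown below; treating p as the fraction of items containing the word is an illustrative simplification, and the function name and item counts are hypothetical.

```python
import math


def information_value(items_with_word, total_items):
    """Information value in bits, -log2(p), with p = items_with_word / total_items."""
    if not 0 < items_with_word <= total_items:
        raise ValueError("need 0 < items_with_word <= total_items")
    # log2(total / count) is the same quantity as -log2(count / total).
    return math.log2(total_items / items_with_word)


# Hypothetical database of 1024 items.
print(information_value(1024, 1024))  # word found in every item
print(information_value(512, 1024))   # word found in half of the items
print(information_value(1, 1024))     # word found in a single item
```

A word that appears in every item carries zero bits, matching the remark that it has little search value, while a rare word carries many.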
BAYESIAN THEORY
 The earliest mathematical foundation for information retrieval dates back to the 1700s,
when Thomas Bayes developed a theorem that relates the conditional and marginal
probabilities of two random events, called Bayes’ Theorem (published posthumously in 1763).
 It can be used to compute the posterior probability (probability assigned “after” relevant
evidence is considered) of random events.
 For example, it allows one to consider the symptoms of a patient and use that information to
determine the probability of what is causing the illness.
 Bayes’ theorem relates the conditional and marginal probabilities of events A and B, where
P(B) cannot equal zero:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
 P(A) is called the prior probability of A. It is called “prior” because it does not take into
account any information about B.
 P(A|B) is the conditional probability of A, given B. It is sometimes named the posterior
probability because it is derived from, and so depends upon, the specified value of B.
 P(B|A) is the conditional probability of B given A.
 P(B) is the marginal probability of B, and normalizes the result.
 Putting the terms into words given our example helps in understanding the formula:

The probability of a patient having the flu given the patient has a high temperature is equal to
the probability that if you have the flu you have a high temperature, times the probability you
will have the flu. This is then normalized by dividing by the probability that you have a high
temperature.
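The flu example can be worked through numerically. The probabilities below are made-up values chosen only to show the arithmetic, and `posterior` is a hypothetical helper name; here A is "has the flu" and B is "has a high temperature".

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b


# Hypothetical probabilities (not real medical data).
p_flu = 0.05             # P(A): prior probability of having the flu
p_temp_given_flu = 0.90  # P(B|A): high temperature given the flu
p_temp = 0.10            # P(B): marginal probability of a high temperature

p_flu_given_temp = posterior(p_temp_given_flu, p_flu, p_temp)
print(round(p_flu_given_temp, 2))
```

With these numbers, observing a high temperature raises the probability of flu from the prior of 5% to roughly 45%.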
 To relate Bayesian Theory to IR you need only to consider the search process.
 A user provides a query, consisting of words, which represent the user’s preconceived attempt
to describe the semantics needed in an item to be retrieved for it to be relevant.
 Since each user submits these terms to reflect their own idea of what is important, they imply a
preference ordering (ranking) among all of the documents in the database.
 Applying this to Bayes’s Theorem we have:
