You are on page 1of 61

UNIT-1

INTRODUCTION TO INFORMATION RETRIEVAL SYSTEMS

SYEDA SAMEERA NAYYAR


ASSISTANT PROFESSOR
CSE DEPT
1.1 Definition of Information Retrieval System

• An IR System is a system capable of storage, retrieval, and maintenance of


information.
• Information:
Text, image, audio, video, and other multimedia objects Focus on
textual information here.
• Item:
The smallest complete textual unit processed and manipulated by an
IR system
• Depend on how a specific source treats information
• Objectives of an IR System: Minimize the overhead for finding information
1.1 Definition of Information Retrieval System
Overhead:
• The time a user spends in all of the steps leading to reading an item
containing needed information, excluding the time for actually
reading the relevant data
• Query generation
• Search composition
• Search execution
• Scanning results of query to select items to read
1.1 Definition of Information Retrieval System

• An Information Retrieval System consists of a software program that


facilitates a user in finding the information the user needs.

• The system may use standard computer hardware or specialized


hardware to support the search sub function and to convert non-
textual sources to a searchable media.
1.2 Objectives of Information Retrieval Systems

• The general objective of an IR system is ,

1. To minimize the overhead of a user locating needed information

2. The two major measures commonly associated with information


systems are “precision” and “recall”

3. Support of user search generation

4. How to present the search results in a format that facilitate the user in
determining relevant items
1.2 Objectives of Information Retrieval Systems

• The two major measures commonly associated with information systems


are precision and recall.

• The database is logically divided into four

• Relevant items are those documents that contain information that helps
the searcher in answering his question.

• Non-relevant items are those items that do not provide any directly useful
information.

• There are two possibilities with respect to each item:


• it can be retrieved or not retrieved by the user’s query.
1.2 Objectives of Information Retrieval Systems
1.2 Objectives of Information Retrieval Systems
1.2 Objectives of Information Retrieval Systems

• Number_Possible_Relevant: are the number of relevant items in the


database.
• Number_Total_Retieved:is the total number of items retrieved from
the query.
• Number_Retrieved_Relevant: is the number of items retrieved that
are relevant to the user’s search need.
1.3 Functional Overview :
• A total Information Storage and Retrieval System is composed of four
major functional processes:

1. Item normalization,
2. Selective dissemination of information (i.e., “mail”),
3. Archival document database search,
4. An index database search along with the Automatic file build
process that supports index files.
1.3 Functional Overview :
1.3.1 Item Normalization:

1. Normalize incoming items to a standard format


2. Logical restructuring – zoning
3. Create a searchable data structure (Indexing)
• Identification of processing tokens
• Characterization of the tokens – single words, or phrase
• Stemming of the tokens
1.3.1 Item Normalization
• 1.3.1.1 Standardize Input:
• Standardizing the input takes the different external format of input data
and performs the translation to the formats acceptable to the system.
• One example of standardization could be translation of foreign languages
into Unicode.
• Every language has a different internal binary encoding for the characters
in the language.
• One standard encoding that covers English, French, Spanish, etc. is ISO-
Latin.
• Translate multi-media input into a standard format
❑Video: MPEG-2, MPEG-1, AVI, Real Video…
❑Audio: WAV, Real Audio
❑Image: GIF, JPEG, BMP…
1.3.1 Item Normalization
• 1.3.1.2 Logical Subsetting (Zoning)
• Parse the item into logical sub-divisions that have meaning to user This
process, called "Zoning,“.

• A typical item is sub-divided into zones, Title, Author, Abstract, Main Text,
Conclusion, References.

• A major limitation to the user is the size of the display screen which
constrains the number of items that are visible for review.

• Quite often the user will only display zones such as the Title or Title mid
Abstract. This allows multiple items to be displayed per screen.

• The user can expand those items of potential interest to see the complete
text.
1.3.1 Item Normalization
• 1.3.1.3 Identify Processing Tokens :
• Identify the information that are used in the search process –
Processing Tokens (Better than Words)
• The first step in identification of a processing token consists of
determining a word.
• Systems determine words by dividing input symbols into three classes
• Valid word symbols: Alphabetic characters, numbers.
• Inter-word symbols: Blanks, periods, semicolons (non searchable)
• Special processing symbols: hyphen (-)
• A word is defined as a contiguous set of word symbols bounded by
inter-word symbols.
1.3.1 Item Normalization
1.3.1.4 Stop Algorithm:
• a Stop List/Algorithm is applied to the list of potential processing tokens.
• The objective of the Stop function is to save system resources by
eliminating from the set of searchable processing tokens those that have
little value to the system
• Eliminating these words saves on storage and access structure complexities
• Any word found in almost every item
• Any word only found once or twice in the database
• The rank-frequency law of Ziph is
• Frequency * Rank = Constant
1.3.1 Item Normalization
1.3.1 Item Normalization
1.3.1.5 Characterize Tokens
• The next step in finalizing on processing tokens is identification of any
• specific word characteristics.
• The characteristic is used in systems to assist in disambiguation of a
particular word.
• Analysis of the processing token's part of speech is included here.
• Uppercase – proper names, acronyms, and organization Numbers and
dates
1.3.1 Item Normalization
1.3.1.6 Stemming Algorithm
• Normalize the token to a standard semantic representation
• For example, the system must keep singular, plural, past tense,
possessive, etc. as separate searchable tokens and potentially expand
a term at search time to all its possible representations
• Computer, Compute, Computers, Computing
• Comput
• Reduce the number of unique words the system has to contain.
1.3.1 Item Normalization
• 1.3.1.7 Create Searchable Data Structure:
• Based upon the stemmmg algorithm, they are used as updates to the
searchable data structure.
• The searchable data structure is the internal representation
• This structure contains
1. The semantic concepts that represent the items in the database
2. limits what a user can find as a result of their search.
1.3.2 Selective Dissemination of Information
• Provides the capability to dynamically compare newly received items
in the information system against standing statements of interest of
users and deliver the item to those users whose statement of interest
matches the contents of the items.
• It Consist of ,
1. Search process
2. User statements of interest (Profile)
3. User mail file.
• As each item is received, it is processed against every user’s profile
When the search statement is satisfied, the item is placed in the mail
file.
• A profile contains a typically broad search statement along with a list
of user mail files that will receive the document if the search
statement in the profile is satisfied
1.3.3 Document Database Search :
• Provides the capability for a query to search against all items received
by the system.
• Composed of the
1. Search process,
2. User entered queries
3. Document database
• Document database contains all items that have been received,
processed and store by the system.
• Usually items in the Document DB do not change May be partitioned
by time and allow for archiving by the Time partitions.
• The Document Database can be very large, hundreds of millions of
items or more.
1.3.4 Index Database Search:
• When an item is determined to be of interest, a user may want to
save it (file it) for future reference Accomplished via the index
process.
• In the index process, the user can logically store an item in a file along
with additional index terms and descriptive text the user wants to
associate with the item.
• The Index Database Search Process provides the capability to create
indexes and search them
• The user may search the index and retrieve the index and/or the
document it references
1.3.4 Index Database Search:
Two classes of index files:
1. Public
2. Private index files
• Every user can have one or more private index files leading to a very
large number of files
• Each private index file references only a small subset of the total
number of items in the Document database
• Public index files are maintained by professional library services and
typically index every item in the Document database
• Private Index files typically have very limited access lists.
1.3.5 Multimedia Database Search

• The multi-media data is not logically its own data structure, but an
augmentation to the existing structures in the information retrieval
system.
• The correlation between the multi-media and the textual domains
will be either via time or positional synchronization
• Time synchronization is the example of transcribed text from audio or
composite video sources.
• Positional synchronization is where the multi-media is localized by a
hyperlink in a textual item
1.4 Relationship to Database Management Systems :
• There are two major categories of systems available to process items:
1. Information Retrieval Systems
2. Data Base Management Systems (DBMS).
• An Information Retrieval System is software that has the features and
functions required to manipulate “information” items versus a DBMS that
is optimized to handle “structured” data.
• With structured data a user enters a specific request and the results
returned provide the user with the desired information
• The results are frequently tabulated and presented in a report format for
ease of use.
• In contrast, a search of “information” items has a high probability of not
finding all the items a user is looking for.
1.4 Relationship to Database Management Systems :
• The user has to refine his search to locate additional items of interest.
This process is called “iterative search “.
• The integration of DBMS’s and Information Retrieval Systems is very
important.
• Commercial database companies have already integrated the two
types of systems.
• One of the first commercial databases to integrate the two systems
into a single view is the INQUIRE DBMS .
1.5 Digital Libraries and Data Warehouses

• Two other systems frequently described in the context of information


retrieval are,
1. Digital Libraries and
2. Data Warehouses.
• All three systems are repositories of information and their primary
goal is to “satisfy user information needs” .
• As the quantities of information grew exponentially, libraries were
forced to make maximum use of electronic tools to facilitate the
storage and retrieval process.
• With the worldwide interneting of libraries and information via the
Internet, more focus has been on the concept of an electronic library.
1.5 Digital Libraries and Data Warehouses

• Indexing is one of the critical disciplines in library science and


significant effort has gone into the establishment of indexing and
cataloguing standards.
• Migration of many of the library products to a digital format
introduces both opportunities and challenges
• Information Storage and Retrieval technology has addressed a small
subset of the issues associated with Digital Libraries
1.6 Information Retrieval Systems Capabilities :

1. Search Capabilities
2. Browse Capabilities
3. Miscellaneous Capabilities
• Search and browse capabilities are crucial to assist the user in locating
relevant items.
• The search capabilities address both Boolean and Natural Language
queries.
• The majority of existing commercial systems are based upon Boolean
query and search capabilities
1.6 Information Retrieval Systems Capabilities :
• 2.1 Search Capabilities:

• The objective of the search capability is to allow for a mapping


between a user's specified need and the items in the information
database that will answer that need.

• The search statement may apply to the complete item or contain


additional parameters limiting it to a logical division of the item.
2.1 Search Capabilities

• 2.1.1 Boolean Logic


• 2.1.2 Proximity
• 2.1.3 Contiguous Word Phrases
• 2.1.4 Fuzzy Searches
• 2.1.5 Term Masking
• 2.1.6 Numeric and Date Ranges
• 2.1.7 Concept and Thesaurus Expansions
• 2.1.8 Natural Language Queries
2.1 Search Capabilities

• 2.1.1 Boolean Logic


• Boolean logic allows a user to logically relate multiple concepts together to
define what information is needed.
• Typically the Boolean functions apply to processing tokens identified
anywhere within an item.
• Operators : AND, OR, NOT (sometimes XOR).
• Corresponding set operations : intersection, union, difference. Operate on
the sets of documents that contain the query terms.
• Precedence : NOT, AND, OR; use parentheses to override;
• Process left-to- right among operators with same precedence.
• Weighting : A weight is associated with each term
2.1.1 Boolean Logic
2.1 Search Capabilities
• 2.1.2 Proximity
• Proximity is used to restrict the distance allowed within an item
between two search terms.
• Proximity is used to increase the precision of a search.
• If the terms COMPUTER and DESIGN are found within a few words of
each other then the item is more likely to be discussing the design of
computers.
• A special case of the Proximity operator is the Adjacent (ADJ)operator
that normally has a distance operator of one(m=1) and a forward only
direction.
• The typical format for proximity is: TERM1 within "m . . . . units" of
TERM2
2.1.2 Proximity
2.1 Search Capabilities
• 2.1.3 Contiguous Word Phrases:
• A Contiguous Word Phrase (CWP) is both a way of specifying a query
term and a special search operator.
• A Contiguous Word Phrase is two or more words that are treated as a
single semantic unit.
• An example of a CWP is "United States of America.“
• Thus a query could specify "manufacturing" AND "United States of
America" which returns any item that contains the word
"manufacturing" and the contiguous words "United States of
America."
2.1.3 Contiguous Word Phrases:

• Proximity and Boolean operators are binary operators but contiguous


word phrases are an "N"ary operator where "N" is the number of
words in the CWP.
• Multiple Adjacency (ADJ) operators are used to define a Literal String
(e.g., "United" ADJ "States" ADJ "of' ADJ "America").
2.1 Search Capabilities

2.1.4 Fuzzy Searches:


• Fuzzy Searches provide the capability to locate spellings of words that
are similar to the entered search term.
• Increased recall (more documents qualify because new terms may be
matched) at the expense of deceased precision (erroneous matches
may introduce nonrelevant documents).
• A Fuzzy Search on the term "computer" would automatically include
the following words from the information database: "computer,“
"compiter," "conputer," "computter," "compute."
2.1 Search Capabilities
2.1.5 Term Masking:
• Single position mask: accept any term that will match the query term,
once the character in a certain position is disregarded
• Example: the term MULTI$NATIONAL will be matched by “multi-national”
or “multinational” (but not by “multi national” since it is a sequence of two
terms! )
• Variable length mask: accept any term that will match the query term,
once a sequence of any number of characters in a certain position is
disregarded.
• Suffix: *WARE will match terms that end with “ware”.
• Prefix: WARE* will match terms that begin with “ware”. The most common
mask ( sometimesapplied by default).
• Imbedded: *WARE* will match terms that contain “ware
2.1.5 Term Masking:
2.1 Search Capabilities

• 2.1.6 Numeric and Date Ranges:


• Match numeric or date terms that are in the range of the query term.
• Numeric query terms: >125 (matches all numbers greater that 125) or
125-425 (matched all numbers between 125 and 425).
• Date query terms: 9/1/97 - 8/31/98 ( matches all dates between 1
September 1997 and 31 August 1998).
2.1 Search Capabilities
2.1.7 Concept/Thesaurus Expansion
• A Thesaurus is typically a one-level or two-level expansion of a term
to other terms that are similar in meaning.
• A Concept Class is a tree structure that expands each meaning of a
word into potential concepts that are related to the initial term
• A semantic thesaurus is a listing of words and then other words that
are semantically similar.
• A concept based database shows associations that are not normally
found in a language based thesaurus.
2.1.7 Concept/Thesaurus Expansion
2.1.7 Concept/Thesaurus Expansion
2.1 Search Capabilities

2.1.8 Natural Language Queries


• Rather than having the user enter a specific Boolean query by
specifying search terms and the logic between them, Natural
Language Queries allow a user to enter a prose statement that
describes the information that the user wants to find.
• The longer the prose, the more accurate file results returned
• The system searches and finds those items most like the query
statement entered.
2.1 Search Capabilities
• An example of a Natural Language Query is:
• Find for me all the items that discuss oil reserves and current
attempts to find new oil reserves. Include ally items that discuss the
international financial aspects of the oil production process. Do not
include items about the oil industry in the United States.
2.2 Browse Capabilities
• Browse capabilities provide the user with the capability to determine
which items are of interest and select those to be displayed.
• From these summary displays, the user can select the specific items
mid zones within the items for display.
• 2.2.1 Ranking
• 2.2.2 Zoning
• 2.2.3 Highlighting
2.2 Browse Capabilities
2.2.1 Ranking
• With the introduction of ranking based upon predicted relevance
values, the status summary displays the relevance score associated
with the item along with a brief descriptor of the item.
• The relevance score is an estimate of the search system on how
closely the item satisfies the search statement.
• Typically relevance scores are normalized to a value between 0.0
and 1.0.
• The highest value of 1.0 is interpreted that the system is sure that the
item is relevant to the search statement.
2.2 Browse Capabilities
2.2.1 Ranking
• This zone is frequently the Title and provides the user with additional
information with the relevance weight to avoid selecting non relevant
items for review.
• Presenting the actual relevance number seems to be more confusing
to the user than presenting a category that the number falls in.
• For example, some systems create relevance categories and indicate,
by displaying items in different colors, which category an item belongs
to.
• Other systems uses a nomenclature such as High, Medium High,
Medium, Low, and Non-relevant.
2.2 Browse Capabilities
• 2.2.2 Zoning
• The user wants to see the minimum information needed to determine if
the item is relevant.
• Once the determination is made an item is possibly relevant, the user
wants to display the complete item for detailed review.
• Limited display screen sizes require selectability of what portions of an
item a user needs to see to make the relevance determination.
• For example, display of the Title and Abstract may be sufficient information
for a user to predict the potential relevance of an item.
• Limiting the display of each item to these two zones allows multiple items
to be displayed on a single display screen.
• This makes maximum understanding the potential relevance of the
multiple items on the screen.
2.2 Browse Capabilities

• 2.2.3 Highlighting
• Highlighting has always been useful in Boolean systems to indicate the
cause of the retrieval.
• This is because of the direct mapping between the terms in the search and
the terms in the item
• In a ranking system different terms can contribute to different degrees to
the decision to retrieve an item.
• The highlighting may vary by introducing colors and intensities to indicate
the relative importance of a particular word in the item in the decision to
retrieve the item.
• Information visualization appears to be a better display process to assist in
helping the user formulate his query than highlights in items.
2.3 Miscellaneous Capabilities
• 2.3.1 Vocabulary Browse
• 2.3.2 lterative Search and Search History Log
• 2.3.3 Canned Query
2.3.1 Vocabulary Browse
• Vocabulary Browse provides the capability to display in alphabetical
sorted order words from the document database.
• Logically, all unique words in the database are kept in sorted order
along with a count of the number of unique items in which the word
is found.
• The user can enter a word or word fragment and the system will
begin to display file dictionary around the entered text
• The user can determine that entering the search term "compul*" in
effect is searching for "compulsion" or "compulsive" or“ compulsory.“
• The system indicates what word fragment the user entered and then
alphabetically displays other words found in the database in collating
sequence on either side of file entered term
2.3.1 Vocabulary Browse
2.3.2 Iterative Search and Search History Log
• Rather than typing in a complete new query, the results of the
previous search can be used as a constraining list to create a new
query that is applied against it.
• This has the same effect as taking the original query and adding
additional search statement against it in an AND condition.
• This process of refining the results of a previous search to focus on
relevant items is called iterative search.
• To facilitate locating previous searches as starting points for new
searches, search history logs are available.
• The search history log is the capability to display all the previous
searches that were executed during the current session.
2.3.3 Canned Query
• The capability to name a query and store it to be retrieved and
executed during a later user session is called canned or stored queries
• A canned query allows a user to create and refine a search that
focuses on the user's general area of interest one time and then
retrieve it to add additional search criteria to retrieve data that is
currently needed.
• Queries that start with a canned query are significantly larger than ad
hoc queries.
• Canned query features also allow for variables to be inserted into the
query and bound to specific values at execution time.
2.4 Z39.50 and WAIS Standards
• NISO is the only organization accredited by ANSI to approve and
maintain standards for information services, libraries and publishers.
• This standard is one of many standards they generated to support
interconnection of computer systems.
• American National Standards Institute/National Information
Standards Organization (ANSI/NISO) in generating its Z39.50,
Information Retrieval Application Service Definition and Protocol
Specification.
• In addition to the formal standard, a second de facto standard is the
WAIS standard based upon its usage in the INTERNET.
Z39.50
• The Z39.50 standard does not specify an implementation, but the
capabilities within an application (Application Service) and the protocol
used to communicate between applications
• Its objective is to overcome different system incompatibilities associated
with multiple database searching
• The client is identified as the "Origin“ and performs the communications
functions relating to initiating a search, translation of the query into a
standardized format, sending a query, and requesting return records.
• The server is identified as the "Target" and interfaces to the database at
the remote responding to requests from the Origin.
• This makes the dissimilarities of different database systems transparent to
the user and facilitates issuing one query against multiple databases at
different sites returning to the user a single integrated Hit file.
WAIS
• WAIS was developed by a project started in 1989 by three commercial
companies (Apple, Thinking Machines, andDow Jones).
• Wide Area Information Service (WAIS) is the de facto standard for many
search environments on the INTERNET.
• The original idea was to create a program that would act as a personal
librarian.
• It would act like a personal agent keeping track of significant amounts of
data and filtering it for the information most relevant to the user.
• The interface concept was user entered natural language statements of
topics the user had interest.
• In addition it provided the capability to communicate to the computer that
a particular item was of interest and have the computer automatically find
similar items

You might also like