
Introduction to Information Storage and Retrieval Systems

By Research Scholar

1. Introduction
Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. Automated information retrieval systems are used to reduce what has been called "information overload".
Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.

The problem of IR
Goal = find documents relevant to an information need from a large document set


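[Figure: an information need is expressed as a query; the IR system performs retrieval against the document collection and returns an answer list. Example: Google as the IR system, the Web as the document collection.]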

Information Retrieval Applications


Digital libraries and archives
Media search: blog search, image retrieval, music retrieval, news search, speech retrieval, video retrieval
Search engines: desktop search, enterprise search, mobile search, social search, web search

Domain specific applications of information retrieval

Geographic information retrieval
Information retrieval for chemical structures
Information retrieval in software engineering
Legal information retrieval

Differences between DBMS and IR


                    Databases                            IR
What is retrieved   Structured data, clear semantics     Unstructured, free text
Queries             Unambiguous, formal, mathematical    Natural-language based; vague and imprecise needs
Outcomes            Exact and correct                    Vague, imprecise list; generally not relevant
Interaction         One shot                             Continuous refinement

2. Conceptual Models of IR
An IR conceptual model is a general approach to IR systems. Faloutsos (1985) gives three basic approaches: text pattern search, inverted file search, and signature search. Belkin and Croft (1987) categorize IR conceptual models differently. They divide retrieval techniques first into exact match and inexact match.

The exact match category contains text pattern search and Boolean search techniques. Text pattern search queries are strings or regular expressions. Text pattern systems are more common for searching small collections, such as personal collections of files. In a Boolean IR system, documents are represented by sets of keywords, usually stored in an inverted file.

The inexact match category contains such techniques as probabilistic, vector space, and clustering, among others. It is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance. It is also possible to group (cluster) documents based on the terms that they contain and to retrieve from these groups using a ranking methodology.

Important considerations for an IR system


File Structure
Query Operations
Term Operations
Document Operations
Hardware for IR

3. File Structures
A fundamental decision in the design of IR systems is which type of file structure to use for the underlying document database.

The file structures used in IR systems are flat files, inverted files, signature files, PAT trees, and graphs.

Though it is possible to keep file structures in main memory, in practice IR databases are usually stored on disk because of their size.

3.1 Flat and Inverted files


Flat File: One or more documents are stored in a file. Flat file search is usually done via pattern matching.

Inverted File: It is a kind of indexed file.
Structure of an indexed file entry: <keyword, document-ID, field-ID>
keyword: indexing term that describes the document
document-ID: unique identifier for a document
field-ID: unique name that indicates from which field in the document the keyword came
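The contrast between the two structures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two documents, their field names ("title", "body") and the helper names are hypothetical, and the inverted file is held in a dictionary rather than on disk.

```python
import re
from collections import defaultdict

# Hypothetical two-document collection; IDs and field names are illustrative.
documents = {
    "doc1": {"title": "information retrieval", "body": "retrieval of text documents"},
    "doc2": {"title": "database systems", "body": "structured data retrieval"},
}

def flat_file_search(docs, pattern):
    """Flat file search: scan every document's text with a pattern."""
    regex = re.compile(pattern)
    return sorted(
        doc_id
        for doc_id, fields in docs.items()
        if any(regex.search(text) for text in fields.values())
    )

def build_inverted_file(docs):
    """Inverted file: keyword -> sorted list of (document-ID, field-ID)."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_id, text in fields.items():
            for keyword in text.lower().split():
                index[keyword].add((doc_id, field_id))
    return {keyword: sorted(entries) for keyword, entries in index.items()}

inverted_file = build_inverted_file(documents)
print(flat_file_search(documents, r"retriev\w*"))
print(inverted_file["retrieval"])
```

The flat file search reads all of the text for every query, while the inverted file answers a keyword lookup directly from the stored <keyword, document-ID, field-ID> entries.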

3.2 Signature Files


Signature Files: A signature file contains signatures (bit patterns) that represent documents. Signature method: documents are split into logical blocks, each containing a fixed number of distinct significant (that is, non-stoplist) words. Each word in the block is hashed to give a signature, a bit pattern with some of the bits set to 1. The signatures of the words in a block are ORed together to create a block signature. The block signatures are then concatenated to produce the document signature. Searching is done by comparing the signatures of queries with document signatures.
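A minimal sketch of the signature method might look as follows. The signature width, the number of bits set per word, the block size, the tiny stoplist and the use of SHA-256 as the hash are all assumptions made for illustration; the source does not specify any of them.

```python
import hashlib

SIGNATURE_BITS = 64    # assumed width of a word/block signature
BITS_PER_WORD = 3      # assumed number of bit positions chosen per word
WORDS_PER_BLOCK = 4    # assumed number of significant words per block
STOPLIST = {"the", "of", "a", "is", "and", "in"}  # tiny illustrative stoplist

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set to 1."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def document_signature(text):
    """Return the block signatures; their concatenation is the document signature."""
    # Keep distinct non-stoplist words, in order of first appearance.
    words = list(dict.fromkeys(w for w in text.lower().split() if w not in STOPLIST))
    block_signatures = []
    for start in range(0, len(words), WORDS_PER_BLOCK):
        block_signature = 0
        for word in words[start:start + WORDS_PER_BLOCK]:
            block_signature |= word_signature(word)  # OR the word signatures
        block_signatures.append(block_signature)
    return block_signatures

def may_contain(doc_signature, query_word):
    """True if some block has every bit of the query word's signature set."""
    query_signature = word_signature(query_word)
    return any((block & query_signature) == query_signature for block in doc_signature)

doc_sig = document_signature("the retrieval of information is signature based and hashed in blocks")
print(len(doc_sig), may_contain(doc_sig, "signature"))
```

Because several words share the bits of one block signature, a match only means the word may be present; a word that is present always matches, but an absent word can occasionally match as well, so candidate documents still need to be checked.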

Graphs, or networks, are ordered collections of nodes connected by arcs. For example, a kind of graph called a semantic net can be used to represent the semantic relationships in text that are often lost in the indexing systems above. Graph-based techniques for IR are currently impractical because of the amount of manual effort that would be needed to represent a large document collection in this form.

4. Query Operations
Queries are formal statements of information needs put to the IR system by users. The operations on queries are a function of the type of query and of the capabilities of the IR system. One common query operation is parsing, that is, breaking the query into its constituent elements. Boolean queries, for example, must be parsed into their constituent terms and operators. The set of document identifiers associated with each query term is retrieved, and the sets are then combined according to the Boolean operators.

For example, consider Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model.

Suppose we record for each document (here, a play of Shakespeare's) whether it contains each word out of all the words Shakespeare used (about 32,000 different words). The result is a binary term-document incidence matrix.

Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:

110100 AND 110111 AND 101111 = 100100

The answers for this query are thus Antony and Cleopatra and Hamlet.
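The bitwise evaluation above can be reproduced with Python integers used as bit vectors. The three term vectors are taken from the worked query (Calpurnia's vector is the complement of 101111); the order of the six plays is an assumption based on the common textbook form of this example, since only the two answers are named here.

```python
# Play order is assumed; only the first and fourth plays are named as answers.
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One bit per play, leftmost bit = first play in PLAYS.
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

MASK = (1 << len(PLAYS)) - 1  # keeps the complement to six bits

def plays_for(vector):
    """Translate a bit vector back into the list of matching plays."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if vector & (1 << (n - 1 - i))]

# Brutus AND Caesar AND NOT Calpurnia
result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & (~INCIDENCE["Calpurnia"] & MASK)
print(format(result, "06b"))  # 100100
print(plays_for(result))      # ['Antony and Cleopatra', 'Hamlet']
```

The mask is needed because Python integers are unbounded, so a bare complement would set bits beyond the six plays.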

5. Term Operations
Operations on terms in an IR system include stemming, truncation, weighting, and stop list and thesaurus operations.

Stemming is the automated conflation (fusing or combining) of related words, usually by reducing the words to a common root form. Take, taken, taking would result in take. Walk, walking, walked would result in walk. Computation, computing would result in compute.

Truncation is manual conflation of terms by using wildcard characters in the word, so that the truncated term will match multiple words. Truncation allows you to search for various word endings and spellings simultaneously, retrieving results with all the different endings of that root word.
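As a rough illustration, truncation can be treated as wildcard matching against the index vocabulary. The sketch below uses Python's standard fnmatch module; the choice of "*" as the wildcard character and the sample vocabulary are assumptions, since real systems differ in the symbols they accept.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary.
vocabulary = [
    "compute", "computer", "computing", "computation",
    "commute", "color", "colour", "walk", "walked",
]

def expand_truncation(pattern, terms):
    """Return every vocabulary term matched by a truncated query term."""
    return [term for term in terms if fnmatchcase(term, pattern)]

print(expand_truncation("comput*", vocabulary))  # different endings of one root
print(expand_truncation("colo*r", vocabulary))   # alternative spellings
```

The expanded terms would then be searched as if the user had typed each of them.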

Another way of conflating related terms is with a thesaurus, which lists synonymous terms and sometimes the relationships among them.

A stoplist is a list of words considered to have no indexing value, used to eliminate potential indexing terms. Each potential indexing term is checked against the stoplist and eliminated if found there.

Examples of stop list words are:

and, another, any, are, around, as, at, be, became, because, do, does, doesn't, doing, don't, during, each, else, every, it's, its, itself, just, know, most, name, need, rather, said, same, there, under, using, very

6. Document Operations
Documents are the primary objects in IR systems and there are many operations for them.

In many types of IR systems, documents added to a database must be given unique identifiers, parsed into their constituent fields, and those fields broken into field identifiers and terms.

Using the information about terms in the document, it is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance.

Term distribution information can also be used to cluster similar documents in a document space.

7. Hardware for IR
Hardware affects the design of IR systems because it determines, in part, the operating speed of an IR system, a crucial factor in interactive information systems.

Along with the need for greater speed, there is the need for storage media capable of compactly holding the huge document databases that have proliferated.

Functional View of Paradigm IR System

When building the database, documents are taken one by one, and their text is broken into words.

The words from the documents are compared against a stop list, a list of words thought to have no indexing value.

Words from the document not found in the stop list may next be stemmed.

Words may then also be counted, since the frequencies of words in documents and in the database as a whole are often used for ranking retrieved documents.

Finally, the words and associated information, such as the documents, fields within the documents, and counts, are put into the database.

The database then might consist of pairs of document identifiers and keywords as follows.

keyword1 - document1-Field_2
keyword2 - document1-Field_2, 5
keyword2 - document3-Field_1, 2
keyword3 - document3-Field_3, 4
keyword-n - document-n-Field_i, j

Such a structure is the inverted file described earlier. In an IR system, each document must have a unique identifier, and its fields, if field operations are supported, must have unique field names.
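The database-building steps of this functional view (break text into words, drop stop list words, stem, count, store) can be sketched as a small pipeline. The stop list is a subset of the example words given earlier; the `stem` function is a deliberately crude placeholder rather than a real stemming algorithm, and the sample documents and field names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Subset of the example stop list words.
STOP_LIST = {"and", "another", "any", "are", "as", "at", "be", "do",
             "does", "each", "its", "just", "most", "same", "there", "very"}

def stem(word):
    """Placeholder stemmer: strips a few suffixes. Not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_documents(docs):
    """Build keyword -> {(document-ID, field-ID): count}."""
    database = defaultdict(dict)
    for doc_id, fields in docs.items():
        for field_id, text in fields.items():
            words = re.findall(r"[a-z']+", text.lower())       # break into words
            terms = [stem(w) for w in words if w not in STOP_LIST]
            for term, count in Counter(terms).items():         # count occurrences
                database[term][(doc_id, field_id)] = count
    return dict(database)

docs = {
    "document1": {"Field_2": "walking and walked are just words"},
    "document3": {"Field_1": "each walk counts"},
}
database = index_documents(docs)
print(database["walk"])
```

Keeping the counts alongside each entry is what later allows retrieved documents to be ranked by term frequency.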

IR Evaluation Criteria
Effectiveness

Efficiency

Usability

IR Effectiveness Evaluation
User-centered strategy
  Given several users, and at least 2 retrieval systems
  Have each user try the same task on both systems
  Measure which system works the best

System-centered strategy
  Given documents, queries, and relevance judgments
  Try several variations on the retrieval system
  Measure which ranks more good docs near the top

Which is the Best Rank Order?

[Figure: several candidate rankings, labelled A to F and a to h, with the relevant documents (R) marked in each.]

Defining Relevance
Relevance relates a topic and a document
  Duplicates are equally relevant by definition
  Constant over time and across users
  Relevance may include concerns such as timeliness, authority or novelty of the result

Pertinence relates a task and a document
  Accounts for quality, complexity, language, ...

Utility relates a user and a document

Another View
[Figure: Venn diagram of the space of all documents, showing the Relevant set, the Retrieved set, their overlap (Relevant + Retrieved), and the remainder (Not Relevant + Not Retrieved).]

Set-Based Effectiveness Measures

Precision: How much of what was found is relevant?
  Often of interest, particularly for interactive searching

Recall: How much of what is relevant was found?
  Particularly important for law, patents, and medicine

Effectiveness Measures
Action \ Doc     Relevant              Not relevant
Retrieved        Relevant Retrieved    False Alarm
Not Retrieved    Miss                  Irrelevant Rejected

User-oriented measures:

Precision = |Relevant Retrieved| / |Retrieved|

Recall = |Relevant Retrieved| / |Relevant|
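The four cells of the table and the two ratios map directly onto set operations. The sketch below is illustrative only; the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the source prescribes.

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall; 0.0 is returned when a denominator is empty."""
    relevant_retrieved = relevant & retrieved
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments and system output.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d4", "d7"}

false_alarms = retrieved - relevant   # retrieved but not relevant
misses = relevant - retrieved         # relevant but not retrieved

precision, recall = precision_recall(relevant, retrieved)
print(precision, recall, sorted(false_alarms), sorted(misses))
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were found.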

Measuring Precision and Recall


Assume there are a total of 14 relevant documents.

Hit    Precision   Recall
1*     1/1         1/14
2      1/2         1/14
3      1/3         1/14
4      1/4         1/14
5*     2/5         2/14
6*     3/6         3/14
7      3/7         3/14
8*     4/8         4/14
9      4/9         4/14
10     4/10        4/14
11*    5/11        5/14
12     5/12        5/14
13     5/13        5/14
14     5/14        5/14
15     5/15        5/14
16*    6/16        6/14
17     6/17        6/14
18     6/18        6/14
19     6/19        6/14
20     6/20        6/14

* = relevant document
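The precision and recall columns can be recomputed from the ranks of the relevant hits. In this sketch the relevant ranks (1, 5, 6, 8, 11, 16) are read off the rows where the precision numerator increases, and the total of 14 relevant documents comes from the stated assumption; the function name is hypothetical.

```python
TOTAL_RELEVANT = 14
RELEVANT_RANKS = {1, 5, 6, 8, 11, 16}  # ranks at which a relevant document appears

def precision_recall_at_ranks(relevant_ranks, total_relevant, depth):
    """Return (rank, relevant found so far, precision, recall) for each rank."""
    rows = []
    found = 0
    for rank in range(1, depth + 1):
        if rank in relevant_ranks:
            found += 1
        rows.append((rank, found, found / rank, found / total_relevant))
    return rows

rows = precision_recall_at_ranks(RELEVANT_RANKS, TOTAL_RELEVANT, 20)
for rank, found, precision, recall in rows:
    print(f"{rank:2d}  {found}/{rank:<2d}  {found}/{TOTAL_RELEVANT}  "
          f"P={precision:.2f}  R={recall:.2f}")
```

Recall can only stay level or rise as more hits are examined, whereas precision drops at every non-relevant hit.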

General form of precision/recall


[Figure: precision (up to 1.0) plotted against recall (up to 1.0), showing how precision changes with respect to recall.]

FAR, FRR
The false acceptance rate, or FAR, is the measure of the likelihood that the system will incorrectly retrieve an irrelevant document. A system's FAR is typically stated as the ratio of the number of false retrievals to the number of retrievals made.

The false rejection rate, or FRR, is the measure of the likelihood that the system will incorrectly fail to retrieve a relevant document. A system's FRR is typically stated as the ratio of the number of false rejections to the number of retrievals made.


Thank You.
