
An Introduction to Information Retrieval Systems
Intelligent Systems
March 18, 2004
Ramashis Das
Definition
 We discuss Automatic Information
Retrieval
 Automatic – as against ‘manual’.
 Information – as against ‘data’.
 Defn : An information retrieval system does not
inform (i.e. change the knowledge of) the user on
the subject of his inquiry. It merely informs on the
existence (or non-existence) and whereabouts of
documents relating to his request.
IR Vs Data Retrieval

                     Data Retrieval    Info Retrieval
Matching             Exact match       Partial match, best match
Inference            Deduction         Induction
Model                Deterministic     Probabilistic
Classification       Monothetic        Polythetic
Query language       Artificial        Natural
Query specification  Complete          Incomplete
Items wanted         Matching          Relevant
Error response       Sensitive         Insensitive
Classification
 Monothetic classification is one with classes
defined by objects possessing attributes both
necessary and sufficient to belong to a class.
 Polythetic classification is one where each
individual in a class will possess only a
proportion of all the attributes possessed by all
the members of that class.
 Hence no attribute is either necessary or
sufficient for membership of a class.
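The two definitions above can be contrasted in a small sketch. This is only an illustration: the function names, the sample attributes and the `min_share` threshold are assumptions made here, and a fixed threshold is a simplification of what "a proportion of all the attributes" means in practice.

```python
def monothetic_member(item: set[str], defining: set[str]) -> bool:
    """Member if and only if the item has every defining attribute."""
    return defining <= item


def polythetic_member(item: set[str], class_attrs: set[str],
                      min_share: float = 0.5) -> bool:
    """Member if the item has at least `min_share` of the class attributes.

    No single attribute is required, so none is necessary or sufficient.
    The threshold is a hypothetical choice for this sketch.
    """
    return len(item & class_attrs) / len(class_attrs) >= min_share


attrs = {"a", "b", "c", "d"}
x = {"a", "b", "c"}   # lacks "d"
y = {"b", "c", "d"}   # lacks "a"

x_mono = monothetic_member(x, attrs)        # x is missing a defining attribute
x_poly = polythetic_member(x, attrs, 0.6)   # x has 3 of 4 attributes
y_poly = polythetic_member(y, attrs, 0.6)   # y has 3 of 4, but a different 3
```

Here `x` and `y` both qualify under the polythetic test even though neither holds the full attribute set and they do not share all of their attributes.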
Experimental Vs Operational IR Systems

 Many Automatic Information Retrieval
Systems are Experimental. Experimental
IR is mainly carried on in a ‘Laboratory’
situation.
 The other kind are Operational Systems (or
‘Real World’ IR Systems): Commercial
Systems that charge for the service they
provide.
Why IR? – A Simple Example
 Suppose there is a store of documents and a
person (user of the store) formulates a
question (request or query) to which the
answer is a set of documents satisfying the
information need expressed by his question.
 Solution : User can read all the documents in
the store, retain the relevant documents and
discard all the others – Perfect Retrieval…
NOT POSSIBLE !!!
 Alternative : Use a High Speed Computer to
read entire document collection and extract
the relevant documents.
Black Box Model

[Figure: black-box model of an IR system. Labels: Queries, Documents, INPUT, PROCESSOR, OUTPUT, FEEDBACK.]
INPUT
 The main problem here is to obtain a
Representation of each Document and Query
suitable for a computer to use.
 Most Computer-Based Retrieval Systems
store only a representation of the Document
(or Query)
 This implies that the actual text is lost and an
artificial language is used instead.
 The user needs to be taught to express his
information need in that language.
Feedback and PROCESSOR

 Feedback – an on-line change to the request
during a search session, made in the light of a
sample retrieval, in the hope of improving the
subsequent retrieval run.
 PROCESSOR – Retrieval Process.
 Structuring Information in appropriate way.
 Actual Retrieval Function – Search Strategy
in response to a Query.
OUTPUT

 Set of Citations or Document Numbers.


 For Experimental Systems, proper
Evaluation technique follows.
Historical Development
 Three main areas of Research:
 Content Analysis : Describing the contents
of documents in a form suitable for
computer processing;
 Information Structures : Exploiting
relationships between documents to
improve the efficiency and effectiveness of
retrieval strategies;
 Evaluation : the measurement of the
effectiveness of retrieval.
Information Representation
 Luhn’s approach : frequency count of words in
the Document.
 List of Keywords or Terms.
 Freq. of occurrence of Keyword in body of
Document indicates its significance.
 Statistical Association between Keywords -
exploited by Maron and Kuhns and Stiles
 Sparck Jones - measures of association
between keywords based on their frequency of
co-occurrence.
Information Structure

 Fairly Recent, Slow Development - owners of
large databases are loath to try out new
organization techniques for faster and better
retrieval.
 Serial File Organization
 Inverted File (?)
 Clustering – Good, Fairthorne; Doyle;
Rocchio
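The difference between the first two organizations listed above can be sketched as follows. The toy collection and the function names are assumptions for illustration only; clustering is not covered by this sketch.

```python
from collections import defaultdict

# Toy collection: document number -> index terms (illustrative data only).
docs = {
    1: ["information", "retrieval", "systems"],
    2: ["data", "retrieval"],
    3: ["information", "structures"],
}


def serial_search(term):
    """Serial file: examine every document in turn."""
    return [doc_id for doc_id, terms in docs.items() if term in terms]


def build_inverted_file(collection):
    """Inverted file: map each term to the documents that contain it."""
    inverted = defaultdict(list)
    for doc_id, terms in collection.items():
        for term in set(terms):
            inverted[term].append(doc_id)
    return inverted


inverted = build_inverted_file(docs)
by_scan = serial_search("retrieval")   # touches all three documents
by_lookup = inverted["retrieval"]      # a single dictionary lookup
```

Both searches return the same document numbers; the inverted file trades extra storage and build time for a lookup that does not scan the whole collection.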
Evaluation of Retrieval Systems
 Extremely Difficult
 Dichotomous Scale : Relevant and Non-Relevant.
 Precision - the ratio of the number of relevant
documents retrieved to the total number of
documents retrieved
 Recall - ratio of the number of relevant
documents retrieved to the total number of
relevant documents (both retrieved and not
retrieved).
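As a worked illustration of the two ratios defined above, here is a minimal sketch. The function name and the sample document numbers are invented for the example, and returning 0.0 for an empty set is a convention assumed here, not something the slides specify.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = {1, 2, 3, 4}       # documents the system returned
relevant = {2, 4, 5, 6, 7}     # documents judged relevant
p, r = precision_recall(retrieved, relevant)
# 2 of the 4 retrieved documents are relevant; 2 of the 5 relevant ones were found.
```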
Steps…
1. Generation of Machine Representations for the
Information.
2. Explanation of the Logical Structures that may be
arrived at by Clustering.
3. Representing these Structures in the Computer, or in
other words, choice of File Structures to Represent
the Logical Structure.
4. Search Strategies.
5. Probabilistic Retrieval, i.e. to create a Formal Model
for certain kinds of Search Strategies.
6. Ways of Evaluating the Effectiveness of Retrieval.
AUTOMATIC TEXT ANALYSIS

 Storing Information
 Original : In form of Documents
 Document Representation is stored

 Emphasis is on statistical rather than
linguistic approaches.
 We start with the original ideas of Luhn
Luhn’s Ideas

 Frequency of word occurrence in an
article furnishes a useful measurement
of word significance.
 The relative position within a sentence of
words having given values of
significance furnishes a useful
measurement for determining the
significance of sentences.
Demonstration

 f – Frequency of occurrence of words


 r – Rank Order
 Zipf’s Law - the product of the frequency
of use of words and the rank order is
approximately constant.
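 Luhn used the above law to define two
cut-offs, an upper and a lower: words above
the upper cut-off are too common, and words
below the lower cut-off too rare, to be
significant.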
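The rank-frequency table and the cut-offs can be sketched as below. The sample text and the `UPPER` and `LOWER` thresholds are arbitrary values chosen for illustration; the slides give no numbers, and a text this short will not show the f × r product settling to a constant.

```python
from collections import Counter

text = ("the cat sat on the mat the dog sat on the log "
        "the cat saw the dog")

counts = Counter(text.split())
ranked = counts.most_common()  # (word, f) pairs, most frequent first

# One row per word: (r, word, f, f * r)
table = [(rank, word, freq, freq * rank)
         for rank, (word, freq) in enumerate(ranked, start=1)]

# Hypothetical cut-offs expressed as frequencies.
UPPER = 4  # words at or above this frequency are treated as too common
LOWER = 2  # words below this frequency are treated as too rare

significant = sorted(w for w, f in counts.items() if LOWER <= f < UPPER)
```

With these thresholds, "the" falls above the upper cut-off and the words that occur once fall below the lower one.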
Generating Document Representatives - conflation
 Text Processing System
 Input text – full text, abstract or title
 Output – a doc representative adequate for use in
an automatic retrieval system
 The document representative consists of a list
of class names, each name representing a
class of words occurring in the total input text.
 A document will be indexed by a name if one
of its significant words occurs as a member of
that class.
Text Processing System
 Such a system will consist of three parts:
 Removal of high frequency words
 Suffix stripping
 Detecting equivalent stems
 Removal of High Freq words :
 One way of implementing Luhn’s upper cut-off.
 Maintain a ‘stop list’ of high-frequency words; compare each input word against it and remove matches
 This reduces document size by 30 to 50%
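Stop-list removal can be sketched in a few lines. The stop list here is a tiny illustrative sample and the function name is an assumption; a real list would be much longer.

```python
# Illustrative stop list only; real lists contain hundreds of words.
STOP_LIST = {"the", "of", "and", "a", "in", "to", "is"}


def remove_stop_words(text: str) -> list[str]:
    """Compare each word against the stop list and drop the matches."""
    return [word for word in text.lower().split() if word not in STOP_LIST]


kept = remove_stop_words("The frequency of a word is a measure of significance")
```

In this sentence six of the ten words are on the stop list, which shows how removing them shrinks the text.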
 Suffix stripping – more involved
 Complete list of suffixes; match and remove the longest
possible one.
 Context-free removal leads to errors: removing ‘UAL’ is right for
FACTUAL (giving FACT) but wrong for EQUAL (giving EQ)
 Solution : apply context rules that restrict when a suffix may be removed
 Equivalent Stems :
 Usually, equivalent stems map to the same morphological form on removal of suffixes.
 Other kinds do not match on mere removal of suffixes
(e.g. ABSORB- and ABSORPT-).
 For these, a list of equivalent stem-endings is maintained
(e.g. ‘B’ and ‘PT’ are equivalent stem-endings).
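A minimal sketch of suffix stripping with one context rule, followed by equivalent stem-endings. The suffix list, the `MIN_STEM` rule and the function names are assumptions made for illustration, not the rules of any published stemmer, and rewriting every stem that ends in ‘PT’ is a deliberate simplification.

```python
# Tiny illustrative suffix list, tried longest first.
SUFFIXES = sorted(["UAL", "AL", "ING", "ION", "S"], key=len, reverse=True)

# Hypothetical context rule: the remaining stem needs at least this many letters.
MIN_STEM = 3

# Equivalent stem-endings, e.g. ABSORPT- is treated as ABSORB-.
EQUIVALENT_ENDINGS = {"PT": "B"}


def strip_suffix(word: str) -> str:
    """Remove the longest matching suffix, but only if the context rule allows it."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if len(stem) >= MIN_STEM else word
    return word


def conflate(word: str) -> str:
    """Strip the suffix, then rewrite an equivalent stem-ending to a single form."""
    stem = strip_suffix(word.upper())
    for ending, canonical in EQUIVALENT_ENDINGS.items():
        if stem.endswith(ending):
            return stem[:-len(ending)] + canonical
    return stem


factual = conflate("factual")        # UAL removed, stem FACT is long enough
equal = conflate("equal")            # UAL matches, but EQ is too short, so kept whole
absorption = conflate("absorption")  # ION removed, then PT rewritten as B
absorbing = conflate("absorbing")    # ING removed
```

The minimum-stem-length rule is only one possible context rule; it happens to separate FACTUAL from EQUAL, and a realistic system would need further rules.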
 The final output from a conflation algorithm is
a set of classes, one for each stem detected.
 A class name is assigned to a document if and
only if one of its members occurs as a
significant word in the text of the document.
 A document representative then becomes a
list of class names. These are often referred to
as the document’s index terms or keywords.
 Queries : Queries are handled in the same
way.
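The assignment of class names can be sketched as below, assuming the conflation classes have already been built. The classes, their members and the function name are invented for the example.

```python
# Hypothetical conflation classes: class name -> member words.
CLASSES = {
    "ABSORB": {"absorb", "absorbing", "absorption"},
    "RETRIEV": {"retrieve", "retrieval", "retrieving"},
    "INDEX": {"index", "indexing", "indexes"},
}


def representative(significant_words):
    """Assign a class name if and only if one of its members occurs in the text."""
    words = set(significant_words)
    return sorted(name for name, members in CLASSES.items() if members & words)


doc_terms = representative(["retrieval", "indexing", "speed"])
query_terms = representative(["retrieving", "absorption"])  # queries use the same routine

# Document and query are now comparable through their shared class names.
shared = set(doc_terms) & set(query_terms)
```

The word "speed" belongs to no class in this toy example, so it contributes nothing to the representative.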
Indexing

 An index language is the language used to
describe documents and requests.
 The elements of the index language are
index terms, which may be derived from
the text of the document to be described,
or may be arrived at independently.
Some distinctions
 Index Languages can be described as :
 Pre-coordinate : terms are coordinated at the time
of indexing
 Post-coordinate : terms are coordinated at the time of searching.
 Vocabulary of Index Language :
 Controlled : list of approved index terms that an
indexer may use. One may put other kinds of
syntactic controls (e.g. certain terms used only as
adjectives)
 Uncontrolled
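The pre-coordinate versus post-coordinate distinction above can be sketched with two toy indexes. The index contents and function names are assumptions for illustration; real pre-coordinate languages also impose rules on how compound headings are formed.

```python
# Post-coordinate: single terms are stored and combined only at search time.
post_index = {
    "information": {1, 2, 3},
    "retrieval": {1, 2},
    "systems": {1, 4},
}

# Pre-coordinate: the indexer has already combined terms into compound headings.
pre_index = {
    "information retrieval": {1, 2},
    "information retrieval systems": {1},
}


def post_coordinate_search(*terms):
    """Coordinate terms at search time by intersecting their posting sets."""
    postings = [post_index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def pre_coordinate_search(heading):
    """Look up a heading that was coordinated at indexing time."""
    return pre_index.get(heading, set())


found_post = post_coordinate_search("information", "retrieval")
found_pre = pre_coordinate_search("information retrieval")
```

The post-coordinate index can answer any combination of its terms, while the pre-coordinate index answers only the combinations the indexer anticipated.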
