
Information Retrieval

Course Outline
- The Information Retrieval Process
- Query Modification and Effectiveness
- Representation of Documents
- IR Models: Boolean and Vector Space Models
- The Retrieval Process: Full Text Scanning, Inverted Files, Signature Files, Clustered Files

Useful Texts
- Information Retrieval: Data Structures & Algorithms, W. B. Frakes and R. Baeza-Yates
- Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto
- Automatic Text Processing, G. Salton
- Information Retrieval, C. J. van Rijsbergen (available at www.dcs.gla.ac.uk/Keith/Preface.html)

The Information Retrieval Process
The purpose of an automatic retrieval strategy is to retrieve all the relevant documents whilst at the same time retrieving as few of the non-relevant ones as possible. The process involves a certain element of feedback and is best illustrated using the diagram from Belkin and Croft's paper (CACM, Dec 1992).

There are three main ingredients to the IR process: texts (or documents), queries, and the process of evaluation.

For texts, the main problem is to obtain a representation of the text in a form which is amenable to automatic processing. This is achieved by creating an abbreviated form of the text, known as a text surrogate. A typical surrogate would consist of a set of index terms, keywords or descriptors.

For queries, the query has arisen as a result of an information need on the part of a user. The query is then a representation of the information need and must be expressed in a language understood by the system. Due to the inherent difficulty of accurately representing this information need, the query in an IR system is always regarded as approximate and imperfect.

The evaluation process involves a comparison of the texts actually retrieved with those the user expected to retrieve. This often leads to some modification, typically of the query, though possibly of the information need or even of the surrogates. The extent to which modification is required is closely linked with the process of measuring the effectiveness of the retrieval operations.

Measures of Effectiveness
The most commonly used measures of retrieval effectiveness are recall and precision.
- Recall is the ratio of the number of relevant documents retrieved for a given query to the total number of relevant documents for that query in the database.
- Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved.

In general, the average user will want to achieve both high recall and high precision: a large proportion of the useful documents should be retrieved and, at the same time, a large proportion of the non-relevant documents should be rejected. In practice, a compromise must be reached, as simultaneously optimising recall and precision is not normally achievable.
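As an illustration, here is a minimal Python sketch of how recall and precision might be computed for a single query; the document identifiers and relevance judgements are invented for the example.

```python
# Minimal sketch: recall and precision for one query.
# The document ids and relevance judgements below are made up for illustration.

retrieved = {"d1", "d3", "d5", "d8"}   # documents returned by the system
relevant  = {"d1", "d2", "d3", "d4"}   # documents judged relevant for the query

hits = retrieved & relevant            # relevant documents actually retrieved

recall = len(hits) / len(relevant)     # fraction of the relevant docs that were retrieved
precision = len(hits) / len(retrieved) # fraction of the retrieved docs that are relevant

print(f"recall = {recall:.2f}, precision = {precision:.2f}")
# recall = 0.50, precision = 0.50
```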

A combined measure of recall and precision, E, has been developed by van Rijsbergen

E = 1 - [(1 + b^2)PR] / [b^2 P + R]

where R = Recall, P = Precision, and b indicates that the user is attaching b times as much importance to recall as to precision:
- b = 0 for a user who attaches no importance to recall
- b → ∞ for a user who attaches no importance to precision

Recall and precision are based on the assumption that the set of relevant documents for a query is the same, no matter who the user is. Different users might have a different interpretation as to which document is relevant and which is not. As a result, various user-oriented measures have been proposed.
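A minimal sketch of the E measure as defined above; the values of P, R and b in the example are illustrative only.

```python
def e_measure(precision: float, recall: float, b: float) -> float:
    """van Rijsbergen's E measure: E = 1 - (1 + b^2)PR / (b^2 P + R)."""
    if precision == 0 and recall == 0:
        return 1.0                      # worst possible effectiveness
    return 1 - ((1 + b**2) * precision * recall) / (b**2 * precision + recall)

# b = 1 weights recall and precision equally; larger b favours recall.
print(e_measure(precision=0.25, recall=0.75, b=1.0))   # 0.625
print(e_measure(precision=0.25, recall=0.75, b=0.0))   # 0.75, i.e. reduces to 1 - P
```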

User-Oriented Measures
U = no. of relevant documents known to the user
Rk = no. of relevant documents known to the user that were retrieved
Ru = no. of relevant documents previously unknown to the user that were retrieved

Coverage Ratio = Rk/U (the fraction of the documents known to the user to be relevant that have been retrieved)
Novelty Ratio = Ru/(Ru + Rk) (the fraction of the relevant documents retrieved that were previously unknown to the user)
Relative Recall = (Ru + Rk)/U (the ratio of the number of relevant documents found by the system to the number of relevant documents the user expected to find)
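A short sketch computing the three user-oriented measures from the counts defined above; the counts are invented for the example.

```python
# Hypothetical counts for one query, following the definitions above.
U  = 10   # relevant documents known to the user
Rk = 6    # relevant documents known to the user that were retrieved
Ru = 3    # previously unknown relevant documents that were retrieved

coverage_ratio  = Rk / U          # how much of what the user already knew was found
novelty_ratio   = Ru / (Ru + Rk)  # how much of the retrieved relevant material was new
relative_recall = (Ru + Rk) / U   # relevant docs found vs. docs the user expected to find

print(coverage_ratio, novelty_ratio, relative_recall)   # 0.6 0.333... 0.9
```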

Document Representation
We need to be able to extract from the document those words or terms that best capture the meaning of the document. The first problem that we need to tackle is what constitutes a 'word' in our indexing process; for example, we need to consider how to deal with digits and hyphens.

Secondly, we would want to eliminate frequently occurring words, as their discrimination value is low. A list of such words is known as a stoplist or negative dictionary. The stoplist should take into account the application for which the list is being constructed.

The next step is that of stemming. Stemming is the automated conflation of related words, usually by reducing them to a common root form (see Porter's algorithm). The process of stemming helps reduce the size of index files: compression factors of over 50% can be achieved.
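A rough Python sketch of these first indexing steps: splitting text into words, removing stopwords, and applying a very crude suffix-stripping stemmer. The stemmer is only a stand-in for Porter's algorithm, which has many more rules, and the stoplist and example sentence are invented.

```python
import re

STOPLIST = {"the", "a", "of", "and", "in", "to", "is"}   # tiny illustrative stoplist

def crude_stem(word: str) -> str:
    """Very rough suffix stripping; Porter's algorithm is far more careful."""
    for suffix in ("ational", "ing", "ers", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text: str) -> list[str]:
    # One simple answer to 'what constitutes a word': runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPLIST]

print(index_terms("The retrieval of indexed documents and index terms"))
# ['retrieval', 'index', 'document', 'index', 'term']
```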

Term Frequency Considerations
A further step in the process of representing a document is determining the relative importance of each of the remaining words in the context of the document. H. P. Luhn, one of the earliest researchers into IR, proposed a measure whereby 'the frequency of word occurrence in an article furnishes a useful measurement of word significance'. However, a high-frequency term will be acceptable for indexing purposes only if its occurrence frequency is not equally high in all documents of the collection. The precision measure will best be served by having terms that occur frequently in an individual document, but rarely in the document collection.

In order to determine the importance of a term we will need a measure that combines term frequency (the no. of times a given term occurs in a given document) with document frequency (the no. of documents in the collection in which the given term occurs). One such measure is given as:

wij = tfij x log(N/dfj)

where
wij indicates the importance of term j in document i
tfij gives the no. of occurrences of term j in document i
dfj gives the no. of documents in which term j occurs

N gives the no. of documents in the collection under consideration. log(N/dfj) is known as the inverse document frequency factor. Each document Di can now be represented by a vector containing the index terms together with their respective weights:

Di = (T1, wi1; T2, wi2; ...; Tn, win)

A further step in the process of document representation is using a thesaurus to replace overly-specific terms with less-specific, medium-frequency terms. This will have the effect of enhancing recall. We would need to ensure that the thesaurus we use is compatible with the document collection under consideration.
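A minimal sketch of the weighting scheme above, building a weighted vector for each document from term frequency and inverse document frequency; the tiny collection is invented for the example.

```python
import math
from collections import Counter

# A tiny invented collection; each document is already reduced to its index terms.
docs = {
    "D1": ["information", "retrieval", "retrieval", "system"],
    "D2": ["database", "system", "query"],
    "D3": ["information", "query", "retrieval"],
}

N = len(docs)                                                          # documents in the collection
df = Counter(term for terms in docs.values() for term in set(terms))  # document frequency per term

def weights(terms: list[str]) -> dict[str, float]:
    """w_ij = tf_ij * log(N / df_j) for every term j in document i."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for doc_id, terms in docs.items():
    print(doc_id, weights(terms))
```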

Query Processing
There are several possible retrieval strategies; we shall concentrate on two of them: the Boolean Model and the Vector Space Model. In the Boolean Model the query terms are connected by AND/OR/NOT. In the case of AND the retrieval operation will only be successful if we have an 'exact match', ie all the terms included in the query are found in the document. We have seen that we can represent a document Di by a vector that includes weights indicating the importance of the relevant terms. We can use the weights to rank the documents in order of likely importance, so that if a query consists of (Ti, Tj, Tk) the retrieval weight of document n = wni + wnj + wnk.

With the Boolean Model, determining document retrieval weights is somewhat more complex due to the alternative ways of expressing queries, eg T1 AND (T2 OR T3) is equivalent to (T1 AND T2) OR (T1 AND T3). One possibility is to transform each query into disjunctive normal form (or sum-of-products form). With the Boolean Model, if one of the query terms is misspelled or is not present in the document, the document will not be retrieved. The Vector Space Model is based on the 'best match' principle and ranks documents on the basis of those query terms that are present.
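A small Python sketch contrasting the two models on weighted document vectors: Boolean AND requires every query term to be present, while the 'best match' score simply sums the weights of whichever query terms occur. The document weights and the query are invented for the example.

```python
# Invented weighted document vectors: term -> weight, per document.
doc_vectors = {
    "D1": {"information": 1.2, "retrieval": 2.4, "system": 0.4},
    "D2": {"database": 1.1, "system": 0.4, "query": 0.8},
    "D3": {"information": 0.6, "query": 0.8, "retrieval": 1.2},
}

query = ["information", "retrieval", "query"]

def boolean_and(doc: dict[str, float], terms: list[str]) -> bool:
    """Exact match: every query term must appear in the document."""
    return all(t in doc for t in terms)

def best_match_score(doc: dict[str, float], terms: list[str]) -> float:
    """Vector-space style score: sum the weights of the query terms that are present."""
    return sum(doc.get(t, 0.0) for t in terms)

for doc_id, vec in doc_vectors.items():
    print(doc_id, boolean_and(vec, query), best_match_score(vec, query))
# D1 is rejected by the AND query (it lacks 'query') yet still scores highly.
```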

Full Text Scanning
The most straightforward, and most laborious, method of locating the documents that contain a given term is to search each document, character by character, for the given term. This method is known as the brute-force method and requires of the order of m*n comparisons (with m characters in the term and n characters in the document). A faster method, known as the KMP method (suggested by Knuth, Morris and Pratt), involves scanning the characters in a left-to-right mode; when a mismatch occurs between the characters of the document and the characters of the query term, an optimum shift is carried out instead of an automatic shift of just one character position.

The KMP method requires of the order of m + n comparisons
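As an illustration of the two scanning approaches, here is a minimal Python sketch: a brute-force scan that shifts one position at a time, and a KMP scan that uses a precomputed failure table to decide how far to shift after a mismatch. The example text and pattern are arbitrary.

```python
def brute_force_find(text: str, pattern: str) -> int:
    """Naive scan: of the order of m*n comparisons in the worst case."""
    m, n = len(pattern), len(text)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

def kmp_find(text: str, pattern: str) -> int:
    """Knuth-Morris-Pratt scan: of the order of m + n comparisons."""
    # failure[i] = length of the longest proper prefix of pattern[:i+1] that is
    # also a suffix of it; it determines the shift after a mismatch.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]            # optimum shift instead of moving one position
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1              # match found at this offset
    return -1

print(brute_force_find("information retrieval", "retrieval"))   # 12
print(kmp_find("information retrieval", "retrieval"))           # 12
```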

Inverted Files
An alternative method of structuring the information for query processing purposes is to use an inverted file structure. An inverted file is a sorted list of index terms, with each index term having links to the documents containing that term. This structure has the advantage that a query can be constructed that can retrieve any desired subset of documents (using AND/OR/NOT) and that the query logic can be evaluated entirely in the inverted file, without having to gain access to the main document file (except to retrieve the relevant documents).

We could extend the inverted indexes to include:
- term-location information
- word numbers within sentences
- term weights
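A minimal sketch of an inverted file built from document surrogates, and of answering an AND query entirely within the index; the small collection is invented for the example.

```python
from collections import defaultdict

# Invented document surrogates: document id -> index terms.
docs = {
    "D1": ["information", "retrieval", "system"],
    "D2": ["database", "system", "query"],
    "D3": ["information", "query", "retrieval"],
}

# Inverted file: index terms, each linked to the documents containing the term.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

def and_query(*terms: str) -> set[str]:
    """Evaluate an AND query using only the inverted file."""
    postings = [inverted[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(inverted))                          # the sorted list of index terms
print(and_query("information", "retrieval"))     # {'D1', 'D3'}
print(and_query("information", "database"))      # set()
```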

Signature Files
Documents can be encoded using relatively small signatures. This allows for faster query processing, as the signatures can be scanned rather than the complete documents. Signatures do not uniquely represent a document (ie we could have several documents with the same signature); we would therefore implement retrieval in two phases:
- scan all signatures and identify 'possible hits'
- scan all documents in the 'possible hits' list to ensure that they are correct matches
Thus it is quite possible for a 'false match' to arise with this method. A Boolean signature cannot store proximity information or information regarding the weight of a term in the context of a given document.
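A rough sketch of one common way to build such signatures (superimposed coding): each term is hashed to a few bit positions, the document signature is the OR of its term signatures, and retrieval follows the two phases described above. The signature width, the number of bits per term and the hashing scheme are arbitrary choices made for the sketch.

```python
import hashlib

SIG_BITS = 64          # width of each signature (arbitrary for the sketch)
BITS_PER_TERM = 3      # bit positions set per term (also arbitrary)

def term_signature(term: str) -> int:
    """Hash a term onto a few bit positions of the signature."""
    sig = 0
    for i in range(BITS_PER_TERM):
        h = hashlib.md5(f"{term}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % SIG_BITS)
    return sig

def doc_signature(terms: list[str]) -> int:
    """Superimposed coding: OR together the signatures of all terms."""
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig

docs = {
    "D1": ["information", "retrieval", "system"],
    "D2": ["database", "system", "query"],
}
signatures = {d: doc_signature(t) for d, t in docs.items()}

def search(term: str) -> list[str]:
    q = term_signature(term)
    # Phase 1: scan signatures only; a document is a 'possible hit' if all query bits are set.
    possible = [d for d, s in signatures.items() if s & q == q]
    # Phase 2: check the actual documents to eliminate false matches.
    return [d for d in possible if term in docs[d]]

print(search("retrieval"))   # ['D1']
```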

Clustered File Organisation
Inverted file organisation has the disadvantage that information pertaining to any one document may be scattered among many inverted lists. Further, different documents containing similar terms are not likely to be in close proximity in the file system that stores the documents. If browsing is to be permitted in the document collection, then documents containing related items should appear close together. Browsing becomes possible with a clustered file organisation, whereby documents that are judged to be similar are grouped into common areas. A clustered file organisation could be maintained in addition to a set of inverted files, so that searches are carried out using the inverted files and browsing is carried out using the clustered files. Alternatively, we could use the clustered files for both searching and browsing.

In a clustered file, documents and queries are represented by term vectors of the form:

Di = (ai1, ai2, ..., ait)
Qj = (qj1, qj2, ..., qjt)

where the coefficients represent the values of the relevant terms in document i and query j respectively. Typically, aik (or qjk) is set to 1 if term k appears in Di (or Qj), and is set to 0 if term k is absent. Alternatively, the coefficients could represent the weighting factors for the relevant terms. In a clustered file, each cluster is represented by a special term vector, known as a centroid. The centroid would often be the average vector of all the documents appearing in the cluster, though it could be identical to a particular document included in the cluster.
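A small sketch of the centroid as the average of the document vectors in a cluster; the binary vectors over a fixed term list are invented for the example.

```python
# Invented binary term vectors over a fixed vocabulary of t terms.
# A 1 means the term appears in the document, 0 means it is absent.
cluster = [
    [1, 0, 1, 1, 0],   # D1
    [1, 1, 1, 0, 0],   # D2
    [0, 0, 1, 1, 0],   # D3
]

# The centroid is the average vector of all documents in the cluster.
t = len(cluster[0])
centroid = [sum(doc[k] for doc in cluster) / len(cluster) for k in range(t)]

print(centroid)   # [0.67, 0.33, 1.0, 0.67, 0.0] (approximately)
```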

Cluster Generation
The cluster generation process would normally be carried out only once, with cluster maintenance being carried out at relatively infrequent intervals. Whatever cluster generation procedure is adopted, it should ideally meet the two goals of theoretical soundness and efficiency. By theoretical soundness we mean that:
- the method should be stable under growth, ie the partitioning should not change drastically with the insertion of new documents
- small errors in the description of the documents should lead to small changes in the partitioning
- the method should be independent of the initial ordering of the documents
The main criterion for efficiency is the time required for clustering.

Heuristic Cluster Generation Methods
Heuristic methods produce rough cluster arrangements rapidly and at relatively little expense. The simplest such method is the single-pass method, whereby the items to be clustered are taken one at a time, in arbitrary order, requiring no advance knowledge of item similarities. Under this procedure, the first item is placed in a cluster of its own; each subsequent item to be entered is then compared against all existing clusters and is placed in a cluster whenever it is 'similar enough' to that cluster. Determining similarity involves comparing the incoming item with all existing centroids using some similarity measure; a similarity threshold will have to be decided by the user. Where an item is added to an existing cluster, the corresponding centroid must be appropriately updated. If an incoming item is not sufficiently similar to any existing cluster, the new item forms a cluster on its own. A common similarity measure is the Cosine Coefficient measure.
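A sketch of the single-pass method using the cosine coefficient between an incoming item and each cluster centroid; the similarity threshold and the item vectors are made up for the example, and the sketch assigns an item to the most similar cluster above the threshold.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine coefficient between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass(items: list[list[float]], threshold: float) -> list[list[int]]:
    """Assign each item to the most similar cluster above the threshold, else start a new one."""
    clusters: list[list[int]] = []        # lists of item indices
    centroids: list[list[float]] = []     # running average vector per cluster
    for idx, item in enumerate(items):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(item, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([idx])                   # item forms a cluster of its own
            centroids.append(list(item))
        else:
            clusters[best].append(idx)               # add the item and update the centroid
            n = len(clusters[best])
            centroids[best] = [(old * (n - 1) + new) / n
                               for old, new in zip(centroids[best], item)]
    return clusters

items = [[1, 1, 0], [1, 0.8, 0], [0, 0, 1], [0, 0.1, 1]]
print(single_pass(items, threshold=0.8))   # [[0, 1], [2, 3]]
```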

The single-pass method has the advantage of simplicity but tends to produce large clusters early in the clustering process whilst other clusters may consist of single items

Cluster size can be controlled by specifying a desirable cluster size and splitting clusters when the maximum permitted size is exceeded. The single-pass method is fast and efficient; however, the final classification depends on the order in which the items are processed, and the results of errors in the document descriptions are unpredictable, with the result that the conditions for soundness are not entirely satisfied. Cluster searching can be carried out in either a top-down or bottom-up manner.
