
Information Retrieval

Yogesh Kakde

29th January, 2011


What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR

Outline

1 What is Information Retrieval

2 Basic Components in a Web-IR system

3 Theoretical Models Of IR
Boolean Model
Vector Model
Probabilistic Model

What is Information Retrieval

Information Retrieval (IR) means searching for relevant documents and information within the contents of a specific data set, such as the World Wide Web.


Basic Components in a Web-IR system

Crawling: The system browses the document collection and fetches documents.
Indexing: The system builds an index of the documents fetched during crawling.
Ranking: The system retrieves from the index the documents that are relevant to the query and displays them to the user.
Relevance Feedback: The initial results returned for a given query may be used to refine the query itself.

Terminologies Used

Inverted Index: An index that maps back from terms to the documents where they occur.
D1: The GDP increased 2 percent this quarter.
D2: The Spring economic slowdown continued to spring downwards
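The mapping from terms to documents can be sketched in a few lines of Python. This is a minimal illustration for the two example documents, not the construction used by any particular system; the function name `build_inverted_index`, the lowercasing, and the crude punctuation stripping are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;:!?")  # crude punctuation stripping (assumption)
            if token:
                index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "D1": "The GDP increased 2 percent this quarter.",
    "D2": "The Spring economic slowdown continued to spring downwards",
}
index = build_inverted_index(docs)
```

Looking up `index["spring"]` returns only D2, while `index["the"]` lists both documents.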


A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
Example: The GDP increased 2 percent this quarter.
Tokens: The, GDP, increased, 2, percent, this, quarter
A type is the class of all tokens containing the same character sequence.
Example: To be or not to be.
Types (after case folding): to, be, or, not
A term is a type that is included in the IR system's dictionary.
Example: The GDP increased 2 percent this quarter.
Terms: GDP, increased, 2, percent, quarter
Stop words: extremely common words that are of little value in selecting documents matching a user need.
Example: The GDP increased 2 percent this quarter.
Stop words: The, this
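The distinction between tokens, types, terms, and stop words can be made concrete with a small sketch. The regular expression, the helper names, and the two-word stop list (taken from the example above) are assumptions for illustration only.

```python
import re

STOP_WORDS = {"the", "this"}  # assumed stop list, taken from the example

def tokenize(text):
    """Split text into tokens: maximal runs of letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def types_of(tokens):
    """Distinct character sequences after case folding, in first-seen order."""
    seen = []
    for tok in tokens:
        folded = tok.lower()
        if folded not in seen:
            seen.append(folded)
    return seen

def terms_of(tokens):
    """Tokens kept for the dictionary: everything that is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = tokenize("The GDP increased 2 percent this quarter.")
types = types_of(tokenize("To be or not to be."))
terms = terms_of(tokens)
```

The sentence "To be or not to be." yields six tokens but only four types, because repeated character sequences collapse into one type.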

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form. Stemmers can be categorised as:
Rule-based stemmers
Co-occurrence-based stemmers
Lemma: the canonical form, dictionary form, or citation form of a set of words.
Lemmatization is the algorithmic process of determining the lemma for a given word with the use of a vocabulary and morphological analysis.
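A rule-based stemmer can be pictured as an ordered list of suffix rewrites. The sketch below is a toy with a handful of made-up rules; it is not the Porter stemmer or any other published rule set, and the rule list and the `min_stem_len` guard are assumptions chosen for illustration.

```python
# Hypothetical suffix rules, checked in order; not a published rule set.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Apply the first suffix rule that leaves a stem of at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word  # no rule applied
```

Note that the output of a stemmer such as `toy_stem("increased")` is a stem ("increas"), not necessarily a dictionary word; producing the dictionary form is the job of lemmatization.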

Logical View Of Document

[Figure: logical view of a document (image not recovered)]

Measures Of Accuracy

[Figure: measures of accuracy (image not recovered)]
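The content of this slide is not recoverable from the text. As an assumption about what it showed, the sketch below computes precision and recall, two measures commonly used to evaluate IR systems: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The function name and the example document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```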

A Formal Characterization of IR Models

An information retrieval model is a quadruple

$$\{D, Q, F, R(q_i, d_j)\}$$

where
$D$ is a set composed of logical views for the documents in the collection.
$Q$ is a set composed of logical views for the user information needs.
$F$ is a framework for modeling document representations, queries, and their relationships.
$R(q_i, d_j)$ is a ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$. Such a ranking defines an ordering among the documents with regard to the query $q_i$.

Boolean Model

The model is based on Boolean logic and classical set theory.
Documents and the query are conceived as sets of terms.
Retrieval is based on whether or not the documents contain the query terms.
The queries are specified as Boolean expressions, which have precise semantics.
Advantages:
Formalism and simplicity
Very precise
Disadvantages:
No notion of partial match
Boolean expressions have precise semantics (not simple from the user's point of view)
Results are not ranked
The model does not use term weights or term frequency
Given a large set of documents, the Boolean model either retrieves too many documents or too few.
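Because documents and queries are sets of terms, Boolean retrieval reduces to set operations. The following minimal sketch uses the two example documents D1 and D2 (lowercased, punctuation removed); the helper name `term_docs` and the three queries are assumptions for illustration.

```python
def term_docs(doc_terms, term):
    """Set of document IDs whose term set contains `term`."""
    return {doc_id for doc_id, terms in doc_terms.items() if term in terms}

docs = {
    "D1": "the gdp increased 2 percent this quarter",
    "D2": "the spring economic slowdown continued to spring downwards",
}
doc_terms = {doc_id: set(text.split()) for doc_id, text in docs.items()}
all_docs = set(doc_terms)

# Query: economic AND NOT gdp
result_and_not = term_docs(doc_terms, "economic") & (all_docs - term_docs(doc_terms, "gdp"))
# Query: gdp OR slowdown
result_or = term_docs(doc_terms, "gdp") | term_docs(doc_terms, "slowdown")
# Query: gdp AND slowdown
result_and = term_docs(doc_terms, "gdp") & term_docs(doc_terms, "slowdown")
```

The last query returns the empty set even though each document matches one of the two terms, which illustrates the lack of partial matching; the results are also plain sets, with no order among the matching documents.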

Vector Model

The use of binary weights is too limiting, so the vector model assigns non-binary weights to index terms in queries and documents.

Document vector: $$\bar{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$$

Query vector: $$\bar{q}_i = (w_{1i}, w_{2i}, \ldots, w_{ti})$$

where $t$ is the total number of index terms in the collection of documents.
Term Frequency (tf): The raw frequency $freq_{i,j}$ is simply the count of term $i$ in document $j$. It is normalized by the maximum frequency of any term in the document, which brings the term frequency into the range 0 to 1:

$$tf_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$
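The normalization can be checked on document D2, where "spring" occurs twice (after lowercasing) and every other term once. This is a minimal sketch; the function name `normalized_tf` and the whitespace tokenization are assumptions.

```python
from collections import Counter

def normalized_tf(tokens):
    """Term frequency divided by the largest term frequency in the document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

tf = normalized_tf("the spring economic slowdown continued to spring downwards".split())
```

Here the most frequent term, "spring", gets tf 1.0, and every term that occurs once gets 0.5.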

Inverse Document Frequency: This factor is motivated by the fact that if a term appears in all documents in a set, it loses its distinguishing power. The inverse document frequency (idf) of a term $i$ is given by:

$$idf_i = \log \frac{N}{n_i}$$

where
$n_i$ is the number of documents in which the term $i$ occurs
$N$ is the total number of documents
The term weights can be approximated by the product of term frequency (tf) and inverse document frequency (idf). This is called the tf-idf measure:

$$w_{i,j} = tf_{i,j} \cdot idf_i$$
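A small sketch of the tf-idf weighting on the two example documents follows. The base of the logarithm is not stated in the slides; the natural logarithm is assumed here, and the function names are hypothetical.

```python
import math
from collections import Counter

def idf(docs):
    """docs: dict doc_id -> list of tokens. Returns term -> log(N / n_i)."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    return {term: math.log(n_docs / n_i) for term, n_i in doc_freq.items()}

def tf_idf(tokens, idf_table):
    """Weight of each term in one document: normalized tf times idf."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_freq = max(counts.values())
    return {t: (f / max_freq) * idf_table.get(t, 0.0) for t, f in counts.items()}

docs = {
    "D1": "the gdp increased 2 percent this quarter".split(),
    "D2": "the spring economic slowdown continued to spring downwards".split(),
}
idf_table = idf(docs)
weights_d2 = tf_idf(docs["D2"], idf_table)
```

The term "the" occurs in both documents, so its idf is log(2/2) = 0 and its weight vanishes, which matches the motivation given above.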

Similarity measure between two vectors

Cosine Similarity: The cosine similarity between two vectors $\bar{d}_j$ (the document vector) and $\bar{q}_i$ (the query vector) is given by:

$$\operatorname{similarity}(\bar{d}_j, \bar{q}_i) = \cos\theta = \frac{\bar{d}_j \cdot \bar{q}_i}{|\bar{d}_j|\,|\bar{q}_i|}$$
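The cosine formula can be sketched for sparse vectors stored as dictionaries from term to weight. The representation and the choice to return 0.0 for a zero-length vector are assumptions made for this example.

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Vectors are dicts term -> weight; missing terms count as weight 0."""
    dot = sum(w * query_vec.get(term, 0.0) for term, w in doc_vec.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (doc_norm * query_norm)

partial = cosine_similarity({"a": 1.0, "b": 1.0}, {"a": 1.0})
```

A document that shares only some terms with the query still gets a score strictly between 0 and 1, so documents can be sorted by this value.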


Advantages
The cosine similarity measure returns a value in the range 0 to 1 (the term weights are non-negative). Hence, partial matching is possible.
Ranking of the retrieved results according to the cosine similarity score is possible.
Disadvantages
Index terms are considered to be mutually independent. Thus, this model does not capture the semantics of the query or the document.
It cannot express the “clear logic view” that the Boolean model offers.

Probabilistic Model

A document is retrieved according to the probability of the document being relevant to the query. Mathematically, the scoring function is given by

$$P(R = 1 \mid \bar{d}, \bar{q})$$

A document is termed relevant if its probability of being relevant is greater than its probability of being non-relevant:

$$P(R = 1 \mid \bar{d}, \bar{q}) > P(R = 0 \mid \bar{d}, \bar{q})$$

Binary Independence Model

Assumptions
Each index term is Boolean in nature (1 if present, 0 if absent).
There is no association between index terms.
The relevance of each document is calculated independently of the relevance of other documents.
Documents and queries are both represented as binary term incidence vectors.

$$\bar{d}_j = (x_{1j}, x_{2j}, \ldots, x_{Mj})$$

[$x_{tj}$ is 1 if index term $t$ is present in document $d_j$; otherwise it is 0.]

$$\bar{q} = (q_1, q_2, \ldots, q_M)$$

The odds of relevance are used as the ranking function: they are monotonic with respect to the probability of relevance, and they reduce the computation.

$$\text{odds of relevance} = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$$

Ranking function $rf()$:

$$rf() = \frac{P(R \mid \bar{d}_j)}{P(\bar{R} \mid \bar{d}_j)}$$

where
$P(R \mid \bar{d}_j)$ is the probability that document $d_j$ is relevant to $q$.
$P(\bar{R} \mid \bar{d}_j)$ is the probability that document $d_j$ is not relevant to $q$.

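By using Bayes' rule we get

$$rf() = \frac{P(\bar{d}_j \mid R)\, P(R)}{P(\bar{d}_j \mid \bar{R})\, P(\bar{R})}$$

Since $P(R)$ and $P(\bar{R})$ are the same for each document, eliminating them from the above equation won't change the ranking of the documents.

$$
\begin{aligned}
rf() &= \frac{P(\bar{d}_j \mid R)}{P(\bar{d}_j \mid \bar{R})} \\
&= \frac{P(x_1 \mid R)\, P(x_2 \mid x_1, R) \cdots P(x_M \mid x_1 \ldots x_{M-1}, R)}{P(x_1 \mid \bar{R})\, P(x_2 \mid x_1, \bar{R}) \cdots P(x_M \mid x_1 \ldots x_{M-1}, \bar{R})} \\
&= \prod_{t=1}^{M} \frac{P(x_t \mid x_1 \ldots x_{t-1}, R)}{P(x_t \mid x_1 \ldots x_{t-1}, \bar{R})}
\end{aligned}
\tag{1}
$$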

Applying the Naive Bayes assumption, which states that all index terms $x_t$ are independent, we get

$$rf() = \prod_{t=1}^{M} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})}$$

Since each $x_t$ is either 0 or 1 under the Binary Independence Model, we can separate the terms to give

$$rf() = \prod_{t:\, x_t = 1} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})} \cdot \prod_{t:\, x_t = 0} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})}$$

Now let $p_t = P(x_t = 1 \mid R)$ be the probability of a term $x_t$ appearing in a relevant document and $u_t = P(x_t = 1 \mid \bar{R})$ be the probability that a term $x_t$ appears in a non-relevant document.

$$rf() = \prod_{t:\, x_t = 1} \frac{p_t}{u_t} \cdot \prod_{t:\, x_t = 0} \frac{1 - p_t}{1 - u_t}$$

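Multiply the above equation by

$$\prod_{t:\, x_t = 1} \frac{1 - p_t}{1 - u_t}$$

and its reciprocal

$$\prod_{t:\, x_t = 1} \frac{1 - u_t}{1 - p_t}$$

Now the term

$$\prod_{t:\, x_t = 1 \vee 0} \frac{1 - p_t}{1 - u_t}$$

runs over all $t = 1, \ldots, M$ and is the same for every document, as it is independent of the values of the index terms $x_t$. Hence it can be ignored and we get

$$rf() = \prod_{t:\, x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$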

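Taking the logarithm of the equation we get

$$
\begin{aligned}
rf() &= \log \prod_{t:\, x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \\
&= \sum_{t:\, x_t = 1} \left( \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t} \right) \\
&= \sum_{t:\, x_t = 1} c_t
\end{aligned}
\tag{2}
$$

$c_t$ is the log odds ratio for the terms in the query:

$$c_t = \log \frac{\text{odds of the term being present in a relevant document}}{\text{odds of the term being present in a non-relevant document}}$$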
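The scoring function can be sketched as a sum of per-term log odds ratios. In this sketch the sum is restricted to query terms that are present in the document, which is an assumption based on the remark that $c_t$ is defined for the terms in the query; the function names and the probability values are hypothetical.

```python
import math

def log_odds_ratio(p_t, u_t):
    """c_t = log(p_t / (1 - p_t)) + log((1 - u_t) / u_t), for 0 < p_t, u_t < 1."""
    return math.log(p_t / (1.0 - p_t)) + math.log((1.0 - u_t) / u_t)

def score(doc_terms, query_terms, p, u):
    """Sum c_t over the query terms that are present in the document."""
    return sum(log_odds_ratio(p[t], u[t]) for t in query_terms if t in doc_terms)

# Hypothetical estimates for two query terms.
p = {"gdp": 0.5, "quarter": 0.5}
u = {"gdp": 0.1, "quarter": 0.5}
doc_score = score({"the", "gdp", "quarter"}, ["gdp", "quarter"], p, u)
```

A term with $p_t = u_t$ contributes nothing to the score, since it is equally likely in relevant and non-relevant documents.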

Equation (2) gives the formal scoring function of the probabilistic information retrieval model. Now $c_t$ needs to be approximated.

Advantages
Documents are ranked in decreasing order of their probability of being relevant.
Disadvantages
The need to guess the initial separation of documents into relevant and non-relevant sets.
All weights are binary.
Index terms are assumed to be independent.

Estimation of ct

When there are no retrieved documents, we can assume:
$p_t$ is constant for all index terms $x_t$.
The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection.
Thus,

$$p_t = 0.5, \qquad u_t = \frac{n_t}{N}$$

where $n_t$ is the number of documents which contain the index term $x_t$ and $N$ is the total number of documents in the collection.
Putting in these values we get

$$c_t = \log \frac{0.5}{1 - 0.5} + \log \frac{1 - \frac{n_t}{N}}{\frac{n_t}{N}} = \log \frac{N - n_t}{n_t} \tag{3}$$
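The initial estimate depends only on collection statistics, so it can be computed before any document has been retrieved. The sketch below assumes the natural logarithm and uses made-up document counts; the function name `initial_weight` is hypothetical.

```python
import math

def initial_weight(n_t, n_docs):
    """c_t = log((N - n_t) / n_t); requires 0 < n_t < N."""
    return math.log((n_docs - n_t) / n_t)

# Hypothetical collection of 1000 documents.
rare = initial_weight(10, 1000)     # term in 10 documents
common = initial_weight(100, 1000)  # term in 100 documents
half = initial_weight(500, 1000)    # term in exactly half of the documents
```

Rare terms receive larger weights, much like idf in the vector model. A term that occurs in more than half of the documents gets a negative weight under this estimate.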

This gives the initial ranking. An improved ranking can be obtained by the following procedure. We assume:
We can approximate $p_t$ by the distribution of the index term $x_t$ among the documents initially retrieved.
We can approximate $u_t$ by considering that the non-retrieved documents are not relevant.
We can write,

$$p_t = \frac{V_t}{V}, \qquad u_t = \frac{n_t - V_t}{N - V} \tag{4}$$

where
$V$ is the subset of documents initially retrieved and ranked by the probabilistic model.
$V_t$ is the subset of $V$ containing index term $x_t$.
In the formulas, $V$ and $V_t$ also denote the number of documents in the respective sets.

For small values of $V_t$ and $V$ this creates problems, so we add an adjustment factor:

$$p_t = \frac{V_t + \frac{n_t}{N}}{V + 1}, \qquad u_t = \frac{n_t - V_t + \frac{n_t}{N}}{N - V + 1} \tag{5}$$
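The adjusted estimates in (5) can be sketched as follows. The counts in the example are made up, the natural logarithm is assumed, and the function names are hypothetical.

```python
import math

def feedback_estimates(v_t, v, n_t, n_docs):
    """Adjusted p_t and u_t as in (5); v and v_t are set sizes (document counts)."""
    adjust = n_t / n_docs  # adjustment factor n_t / N
    p_t = (v_t + adjust) / (v + 1)
    u_t = (n_t - v_t + adjust) / (n_docs - v + 1)
    return p_t, u_t

def feedback_weight(v_t, v, n_t, n_docs):
    """c_t computed from the adjusted estimates."""
    p_t, u_t = feedback_estimates(v_t, v, n_t, n_docs)
    return math.log(p_t / (1 - p_t)) + math.log((1 - u_t) / u_t)

# Hypothetical counts: 10 documents retrieved, 8 of them contain the term,
# and the term occurs in 10 of the 1000 documents in the collection.
p_t, u_t = feedback_estimates(8, 10, 10, 1000)
weight = feedback_weight(8, 10, 10, 1000)
```

With $V = 0$ the unadjusted ratio $V_t / V$ is undefined, whereas the adjusted form gives $p_t = u_t = n_t / N$ and therefore a weight of zero for the term.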
