IR Basics Lec28 Oct 3 2011

Information Retrieval
Yogesh Kakde
29th January, 2011

Outline
1 What is Information Retrieval
2 Basic Components in a Web-IR system
3 Theoretical Models Of IR
Boolean Model
Vector Model
Probabilistic Model
What is Information Retrieval
Information Retrieval (IR) means searching for relevant documents
and information within the contents of a specific data set such as
the World Wide Web.
Basic Components in a Web-IR system
Crawling: The system browses the document collection and
fetches documents.
Indexing: The system builds an index of the documents
fetched during crawling.
Ranking: The system retrieves the documents that are
relevant to the query from the index and displays them to the
user.
Relevance Feedback: The initial results returned from a
given query may be used to refine the query itself.
Terminologies Used
Inverted Index: An index that maps back from terms to the
documents where they occur.
D1: The GDP increased 2 percent this quarter.
D2: The Spring economic slowdown continued to spring downwards
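As an illustration, the inverted index for D1 and D2 can be sketched in Python as below. This is a minimal sketch, not the construction used by any particular system: the function name build_inverted_index, the lowercasing, and the whitespace tokenization are assumptions made for the example.

```python
from collections import defaultdict

# The two example documents from the slide, keyed by document id.
docs = {
    1: "The GDP increased 2 percent this quarter .",
    2: "The Spring economic slowdown continued to spring downwards",
}

def build_inverted_index(documents):
    """Map each lowercased token to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token.isalnum():  # drop bare punctuation such as "."
                postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(docs)
print(index["the"])     # [1, 2]
print(index["spring"])  # [2]
print(index["gdp"])     # [1]
```

Each entry maps a term back to the documents where it occurs, so answering a query only requires looking up the query terms instead of scanning every document.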
Terminologies Used
A token is an instance of a sequence of characters in some
particular document that are grouped together as a
useful semantic unit for processing.
Example: The GDP increased 2 percent this quarter.
Tokens: The, GDP, increased, 2, percent, this, quarter
A type is the class of all tokens containing the same character
sequence.
Example: To be or not to be.
Types: to, be, or, not
A term is a type that is included in the IR system's dictionary.
Terms: GDP, increased, 2, percent, quarter
Stop words: extremely common words which are of
little value in helping select documents matching a user need.
Stop words: The, this
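The distinction between tokens, types, and terms can be reproduced on the slide's own examples. The sketch below is only illustrative: the helper names (tokenize, types_of, terms_of), the regular expression, and the case folding used for types are assumptions, and the stop-word list contains just the two words given above.

```python
import re

STOP_WORDS = {"the", "this"}  # the stop words listed on the slide

def tokenize(text):
    """Split text into tokens: maximal runs of letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def types_of(tokens):
    """Distinct case-folded character sequences, in order of first occurrence."""
    seen = []
    for tok in tokens:
        low = tok.lower()
        if low not in seen:
            seen.append(low)
    return seen

def terms_of(tokens):
    """Tokens that are not stop words (original case preserved)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = tokenize("The GDP increased 2 percent this quarter.")
print(tokens)    # ['The', 'GDP', 'increased', '2', 'percent', 'this', 'quarter']
print(types_of(tokenize("To be or not to be.")))  # ['to', 'be', 'or', 'not']
print(terms_of(tokens))  # ['GDP', 'increased', '2', 'percent', 'quarter']
```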
Terminologies Used
Stemming is the process of reducing inflected (or sometimes
derived) words to their stem, base or root form. Stemmers
can be categorised as
Rule-based stemmers
Co-occurrence based stemmers
Lemma: the canonical form, dictionary form, or citation
form of a set of words.
Lemmatization is the algorithmic process of determining the
lemma for a given word with the use of a vocabulary and
morphological analysis.
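To give a feel for what a rule-based stemmer does, here is a deliberately tiny suffix-stripping sketch. The rule list and the minimum stem length are hypothetical choices made for this example; this is not the Porter stemmer or any other published algorithm.

```python
# Hypothetical suffix rules, tried in order.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Apply the first rule that matches and leaves a stem of at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word  # no rule applied

print(toy_stem("increased"))  # increas
print(toy_stem("documents"))  # document
print(toy_stem("is"))         # is
```

The output "increas" is a stem, not a dictionary word. A lemmatizer, which uses a vocabulary and morphological analysis, would aim for the lemma "increase" instead.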
Logical View Of Document
Measures Of Accuracy
A Formal Characterization of IR Models
An information retrieval model is a quadruple
$$\{D, Q, F, R(q_i, d_j)\}$$
where
$D$ is a set composed of logical views for the documents in
the collection.
$Q$ is a set composed of logical views for the user
information needs.
$F$ is a framework for modeling document representations,
queries, and their relationships.
$R(q_i, d_j)$ is a ranking function which associates a real
number with a query $q_i \in Q$ and a document representation
$d_j \in D$. Such a ranking defines an ordering among the
documents with regard to the query $q_i$.
Boolean Model
The model is based on Boolean logic and classical set theory.
Documents and the query are conceived as sets of terms.
Retrieval is based on whether or not the documents contain
the query terms.
The queries are specified as Boolean expressions, which
have precise semantics.
Advantages:
Formalism and simplicity
Very precise
Disadvantages:
No notion of partial match
Boolean expressions have precise semantics (not simple
from the user's point of view)
Results are not ranked
The model does not use term weights or term frequency
Given a large set of documents, the Boolean model either
retrieves too many documents or too few
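Because documents and queries are sets of terms, Boolean retrieval maps directly onto set operations. The following is a minimal sketch on a hypothetical three-document collection; the document ids, their terms, and the helper name containing are assumptions for illustration.

```python
# Documents as sets of terms (hypothetical toy collection).
docs = {
    "d1": {"gdp", "increased", "percent", "quarter"},
    "d2": {"spring", "economic", "slowdown", "continued"},
    "d3": {"gdp", "economic", "slowdown"},
}

def containing(term):
    """Set of ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_docs = set(docs)

and_result = sorted(containing("gdp") & containing("economic"))           # gdp AND economic
or_result = sorted(containing("gdp") | containing("spring"))              # gdp OR spring
not_result = sorted(containing("economic") & (all_docs - containing("gdp")))  # economic AND NOT gdp

print(and_result)  # ['d3']
print(or_result)   # ['d1', 'd2', 'd3']
print(not_result)  # ['d2']
```

Every document either satisfies the expression or does not, so the result is an unordered set. This is the "no partial match" and "not ranked" behaviour listed under the disadvantages.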
Vector Model
The use of binary weights is too limiting, so the vector model assigns
non-binary weights to index terms in queries and documents.
Document vector $\bar{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
Query vector $\bar{q}_i = (w_{1i}, w_{2i}, \ldots, w_{ti})$
where $t$ is the total number of index terms in the collection
of documents.
Term Frequency (tf): The raw term frequency $\mathrm{freq}_{i,j}$ is simply the count
of the term $i$ in document $j$. It is normalized by the maximum
frequency of any term in the document, which brings the term
frequency into the range 0 to 1:
$$tf_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l \mathrm{freq}_{l,j}}$$
Vector Model
Inverse Document Frequency: This factor is motivated by the
fact that if a term appears in all documents in a set, then it
loses its distinguishing power. The inverse document frequency
(idf) of a term $i$ is given by:
$$idf_i = \log \frac{N}{n_i}$$
where
$n_i$ is the number of documents in which the term $i$ occurs
$N$ is the total number of documents
The term weights can be approximated by the product of term
frequency (tf) and inverse document frequency (idf). This is
called the tf-idf measure.
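The tf and idf definitions above can be tried out on a small example. The sketch below assumes a hypothetical, already tokenized three-document collection and the natural logarithm (the slides do not fix the base); the function names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
docs = [
    ["gdp", "increased", "gdp", "quarter"],
    ["economic", "slowdown", "quarter"],
    ["economic", "growth"],
]
N = len(docs)

def tf(doc):
    """Raw counts divided by the largest count of any term in the document."""
    counts = Counter(doc)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

def idf(term):
    """log(N / n_i); assumes the term occurs in at least one document."""
    n_i = sum(1 for doc in docs if term in doc)
    return math.log(N / n_i)

def tf_idf(doc):
    """Weight of each term in the document: tf multiplied by idf."""
    return {term: weight * idf(term) for term, weight in tf(doc).items()}

weights = tf_idf(docs[0])
for term, w in weights.items():
    print(term, round(w, 3))
```

In the first document "gdp" has the maximum frequency, so its tf is 1, and it occurs in only one of the three documents, so its idf is log 3. "quarter" is both less frequent in the document and more widespread in the collection, so its weight is lower.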
Vector Model
Similarity measure between two vectors
Cosine Similarity: The cosine similarity between two vectors $\bar{d}_j$
(the document vector) and $\bar{q}_i$ (the query vector) is given by:
$$similarity(\bar{d}_j, \bar{q}_i) = \cos\theta = \frac{\bar{d}_j \cdot \bar{q}_i}{|\bar{d}_j| \, |\bar{q}_i|}$$
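A direct transcription of the cosine formula is sketched below. The vectors are hypothetical term weights, and returning 0.0 for an all-zero vector is a convention chosen for this example, not something the slides specify.

```python
import math

def cosine_similarity(d, q):
    """Dot product of d and q divided by the product of their Euclidean norms."""
    dot = sum(dw * qw for dw, qw in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_d * norm_q)

doc_vec = [1.0, 0.0, 2.0]    # hypothetical document weights
query_vec = [1.0, 1.0, 0.0]  # hypothetical query weights
sim = cosine_similarity(doc_vec, query_vec)
print(round(sim, 4))  # 0.3162
```

The document shares only one of the query's two terms, yet it still receives a non-zero score. This is the partial matching that the Boolean model lacks.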
Vector Model
Advantages
The cosine similarity measure returns a value in the range 0 to
1. Hence, partial matching is
possible.
Ranking of the retrieved results according to the
cosine similarity score is possible.
Disadvantages
Index terms are considered to be mutually independent. Thus,
this model does not capture the semantics of the query or the
document.
It cannot express the “clear logic view” that the Boolean
model offers.
Probabilistic Model
The document is retrieved according to the probability of the
document being relevant to the query. Mathematically, the
scoring function is given by
$$P(R = 1 \mid \bar{d}, \bar{q})$$
The document is termed relevant if its probability of being
relevant is greater than its probability of being non-relevant:
$$P(R = 1 \mid \bar{d}, \bar{q}) > P(R = 0 \mid \bar{d}, \bar{q})$$
Probabilistic Model
Binary Independence Model
Assumptions
Each index term is Boolean in nature (1 if present, 0
if absent).
There is no association between index terms.
The relevance of each document is calculated independently of
the relevance of other documents.
Documents and queries are both represented as binary term
incidence vectors.
$\bar{d}_j = (x_{1j}, x_{2j}, \ldots, x_{Mj})$
[$x_{tj}$ is 1 if index term $t$ is present in document $d_j$; otherwise it is 0.]
$\bar{q} = (q_1, q_2, \ldots, q_M)$
Probabilistic Model
Odds of relevance is used as the ranking function because
it is monotonic with respect to the probability of
relevance and it reduces the computation.
$$\text{odds of relevance} = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$$
Ranking function $rf()$:
$$rf() = \frac{P(R \mid \bar{d}_j)}{P(\bar{R} \mid \bar{d}_j)}$$
where
$P(R \mid \bar{d}_j)$ is the probability that document $d_j$ is relevant to $q$.
$P(\bar{R} \mid \bar{d}_j)$ is the probability that document $d_j$ is not relevant to
$q$.
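The monotonicity claim can be checked numerically: since the probability of non-relevance is one minus the probability of relevance, the odds are p / (1 - p), which grows with p. The sketch below uses hypothetical probabilities for three documents and shows that sorting by odds gives the same order as sorting by probability.

```python
def odds(p):
    """Odds p / (1 - p) of an event with probability p, for 0 <= p < 1."""
    return p / (1.0 - p)

# Hypothetical relevance probabilities for three documents.
prob_relevant = {"d1": 0.2, "d2": 0.7, "d3": 0.5}

by_probability = sorted(prob_relevant, key=prob_relevant.get, reverse=True)
by_odds = sorted(prob_relevant, key=lambda d: odds(prob_relevant[d]), reverse=True)

print(by_probability)  # ['d2', 'd3', 'd1']
print(by_odds)         # ['d2', 'd3', 'd1']
```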
Probabilistic Model
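By using Bayes' rule we get
$$rf() = \frac{P(\bar{d}_j \mid R) \cdot P(R)}{P(\bar{d}_j \mid \bar{R}) \cdot P(\bar{R})}$$
Since $P(R)$ and $P(\bar{R})$ are the same for each document, eliminating
them from the above equation won't change the ranking of the
documents.
$$\begin{aligned}
rf() &= \frac{P(\bar{d}_j \mid R)}{P(\bar{d}_j \mid \bar{R})} \\
&= \frac{P(x_1 \mid R) \cdot P(x_2 \mid x_1, R) \cdots P(x_M \mid x_1 \ldots x_{M-1}, R)}{P(x_1 \mid \bar{R}) \cdot P(x_2 \mid x_1, \bar{R}) \cdots P(x_M \mid x_1 \ldots x_{M-1}, \bar{R})} \qquad (1) \\
&= \prod_{t=1}^{M} \frac{P(x_t \mid x_1 \ldots x_{t-1}, R)}{P(x_t \mid x_1 \ldots x_{t-1}, \bar{R})}
\end{aligned}$$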
Probabilistic Model
Applying the Naive Bayes assumption, which states that all index
terms $x_t$ are independent, we get
$$rf() = \prod_{t=1}^{M} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})}$$
Since each $x_t$ is either 0 or 1 under the Binary
Independence Model,
we can separate the terms to give
$$rf() = \prod_{t: x_t = 1} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})} \cdot \prod_{t: x_t = 0} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})}$$
Now let $p_t = P(x_t = 1 \mid R)$ be the probability of a term $x_t$ appearing in
a relevant document and $u_t = P(x_t = 1 \mid \bar{R})$ be the probability that a
term $x_t$ appears in a non-relevant document.
$$rf() = \prod_{t: x_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: x_t = 0} \frac{1 - p_t}{1 - u_t}$$
Probabilistic Model
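Multiply the above equation by
$$\prod_{t: x_t = 1} \frac{1 - p_t}{1 - u_t}$$
and its reciprocal
$$\prod_{t: x_t = 1} \frac{1 - u_t}{1 - p_t}$$
which gives
$$rf() = \prod_{t: x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \cdot \prod_{t: x_t = 1 \lor 0} \frac{1 - p_t}{1 - u_t}$$
Now the term
$$\prod_{t: x_t = 1 \lor 0} \frac{1 - p_t}{1 - u_t}$$
runs over all index terms, so it is independent of the values $x_t$
and is the same for every document. Hence it can be
ignored and we get,
$$rf() = \prod_{t: x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$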
Probabilistic Model
Taking the logarithm of the equation we get
$$\begin{aligned}
rf() &= \sum_{t: x_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \\
&= \sum_{t: x_t = 1} \left( \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t} \right) \qquad (2) \\
&= \sum_{t: x_t = 1} c_t
\end{aligned}$$
$c_t$ is the log odds ratio for the terms in the query.
$$c_t = \log \frac{\text{odds of the term being present in a relevant document}}{\text{odds of the term being present in a non-relevant document}}$$
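Once the per-term probabilities are known, scoring a document is a sum over the terms it contains. The sketch below assumes hypothetical values of p_t and u_t for M = 3 index terms and the natural logarithm; the function names are illustrative.

```python
import math

def log_odds_ratio(p_t, u_t):
    """c_t = log(p_t / (1 - p_t)) + log((1 - u_t) / u_t), for 0 < p_t, u_t < 1."""
    return math.log(p_t / (1.0 - p_t)) + math.log((1.0 - u_t) / u_t)

def bim_score(doc_vector, p, u):
    """Sum c_t over the terms present in the document (x_t = 1)."""
    return sum(
        log_odds_ratio(p_t, u_t)
        for x_t, p_t, u_t in zip(doc_vector, p, u)
        if x_t == 1
    )

# Hypothetical probabilities for M = 3 index terms.
p = [0.8, 0.5, 0.3]  # P(term present | relevant)
u = [0.2, 0.5, 0.6]  # P(term present | non-relevant)

score = bim_score([1, 1, 0], p, u)
print(round(score, 4))  # 2.7726
```

The second term has the same probability in relevant and non-relevant documents, so its c_t is 0 and it does not affect the score. The first term contributes log 4 + log 4 = log 16.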
Probabilistic Model
Equation (2) gives the formal scoring function of the probabilistic
information retrieval model. Now $c_t$ needs to be approximated.
Advantages
Documents are ranked in decreasing order of their
probability of being relevant
Disadvantages
The need to guess the initial separation of documents into
relevant and non-relevant sets.
All weights are binary
Index terms are assumed to be independent
Probabilistic Model
Estimation of $c_t$
When there are no retrieved documents we can assume
$p_t$ is constant for all index terms $x_t$.
The distribution of index terms among the non-relevant
documents can be approximated by the distribution of
index terms among all the documents in the collection.
Thus,
$$p_t = 0.5, \qquad u_t = \frac{n_t}{N}$$
where $n_t$ is the number of documents which contain the index
term $x_t$ and $N$ is the total number of documents in the
collection.
Putting in these values we get
$$\begin{aligned}
c_t &= \log \frac{0.5}{1 - 0.5} + \log \frac{1 - \frac{n_t}{N}}{\frac{n_t}{N}} \qquad (3) \\
&= \log \frac{N - n_t}{n_t}
\end{aligned}$$
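The initial estimate depends only on how many documents contain the term. The sketch below evaluates it for a hypothetical collection of 1000 documents, assuming the natural logarithm; initial_c_t is an illustrative name.

```python
import math

def initial_c_t(n_t, N):
    """c_t = log((N - n_t) / n_t) under p_t = 0.5 and u_t = n_t / N; needs 0 < n_t < N."""
    return math.log((N - n_t) / n_t)

N = 1000  # hypothetical collection size

rare = initial_c_t(10, N)     # term in 1% of the documents
half = initial_c_t(500, N)    # term in half of the documents
common = initial_c_t(900, N)  # term in 90% of the documents

print(round(rare, 3), half, round(common, 3))
```

A rare term gets a large positive weight, a term that occurs in half of the collection gets weight 0, and a term that occurs in most documents gets a negative weight. The initial estimate therefore behaves much like an inverse document frequency.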
Probabilistic Model
This gives the initial ranking. An improved ranking can be obtained
by the following procedure. We assume:
We can approximate $p_t$ by the distribution of the index term
$x_t$ among the documents initially retrieved.
We can approximate $u_t$ by considering that the non-retrieved
documents are not relevant.
We can write,
$$p_t = \frac{V_t}{V}, \qquad u_t = \frac{n_t - V_t}{N - V} \qquad (4)$$
where,
$V$ is the subset of the documents initially retrieved and ranked
by the probabilistic model.
$V_t$ is the subset of $V$ containing index term $x_t$.
In the formulas, $V$ and $V_t$ denote the number of documents in these subsets.
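Equation (4) can be evaluated directly from four counts. The sketch below uses hypothetical counts; the function name and the keyword arguments are illustrative, with V and V_t standing for the sizes of the two subsets.

```python
def improved_estimates(V_t, V, n_t, N):
    """Return (p_t, u_t) from Equation (4).

    V   : number of documents initially retrieved
    V_t : number of those documents that contain the term
    n_t : number of documents in the collection that contain the term
    N   : total number of documents in the collection
    """
    p_t = V_t / V
    u_t = (n_t - V_t) / (N - V)
    return p_t, u_t

# Hypothetical counts: 8 of the 10 retrieved documents contain the term.
p_t, u_t = improved_estimates(V_t=8, V=10, n_t=50, N=1000)
print(p_t, round(u_t, 4))  # 0.8 0.0424
```

If none or all of the retrieved documents contain the term, p_t becomes 0 or 1 and the log odds ratio is undefined, which is why small values of V and V_t are troublesome.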
Probabilistic Model
For small values of $V_t$ and $V$ this creates a problem, so we add an
adjustment factor.
$$p_t = \frac{V_t + \frac{n_t}{N}}{V + 1}, \qquad u_t = \frac{n_t - V_t + \frac{n_t}{N}}{N - V + 1} \qquad (5)$$
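The effect of the adjustment is easiest to see in the case where no retrieved document contains the term. The sketch below uses hypothetical counts and an illustrative function name.

```python
def adjusted_estimates(V_t, V, n_t, N):
    """Return (p_t, u_t) from Equation (5): n_t / N is added to each numerator and 1 to each denominator."""
    p_t = (V_t + n_t / N) / (V + 1)
    u_t = (n_t - V_t + n_t / N) / (N - V + 1)
    return p_t, u_t

# Hypothetical case: none of the 10 retrieved documents contains the term.
p_t, u_t = adjusted_estimates(V_t=0, V=10, n_t=50, N=1000)
print(round(p_t, 5), round(u_t, 5))
```

With the adjustment, p_t is small but strictly positive, so the log odds ratio stays finite even when V_t is 0.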
