Yogesh Kakde
Outline
3 Theoretical Models Of IR
Boolean Model
Vector Model
Probabilistic Model
What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR
Crawling: The system browses the document collection and
fetches documents.
Indexing: The system builds an index of the documents
fetched during crawling.
Ranking: The system retrieves the documents in the index that
are relevant to the query, orders them, and displays them to the
user.
Relevance Feedback: The initial results returned from a
given query may be used to refine the query itself.
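The indexing and ranking steps above can be sketched in a few lines. This is only an illustrative toy, not the system described in these slides: the names `build_index` and `retrieve`, the sample documents, and the scoring rule (count of distinct query terms matched) are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def retrieve(index, query):
    """Ranking: order documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical documents standing in for the crawled collection.
docs = {
    "d1": "information retrieval models",
    "d2": "vector space model",
    "d3": "probabilistic retrieval",
}
index = build_index(docs)
ranked = retrieve(index, "retrieval models")
print(ranked)
```

Documents that share no term with the query never enter `scores`, so they are left out of the result list.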
Terminologies Used
Measures Of Accuracy
An IR model can be characterized as a quadruple
\[ \{D, Q, F, R(q_i, d_j)\} \]
where
D is a set composed of logical views for the documents in
the collection.
Q is a set composed of logical views for the user
information needs.
F is a framework for modeling document representations,
queries, and their relationships.
R(q_i, d_j) is a ranking function which associates a real
number with a query q_i belonging to Q and a document
representation d_j belonging to D. Such a ranking defines an
ordering among the documents with regard to the query q_i.
Boolean Model
Vector Model
\[ \mathrm{similarity}(\bar{d}_j, \bar{q}_i) = \cos\theta = \frac{\bar{d}_j \cdot \bar{q}_i}{|\bar{d}_j|\,|\bar{q}_i|} \]
Advantage
The cosine similarity measure returns a value in the range 0 to
1 (index term weights are non-negative). Hence, partial matching
is possible.
Ranking of the retrieved results according to the
cosine similarity score is possible.
Disadvantage
Index terms are considered to be mutually independent. Thus,
this model does not capture the semantics of the query or the
document.
It cannot express the “clear logic view” of a query the way the
Boolean model does.
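As a minimal sketch of how the cosine similarity score produces a ranking with partial matches, the following computes it for a few hand-made term-weight vectors. The function name `cosine_similarity` and the example vectors are assumptions for illustration; real systems would typically use tf-idf weights, which these slides do not specify here.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)


# Hypothetical weights over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
documents = {
    "d1": [1.0, 1.0, 0.0],  # contains both query terms
    "d2": [1.0, 0.0, 1.0],  # contains one query term (partial match)
    "d3": [0.0, 0.0, 1.0],  # contains no query term
}

ranking = sorted(
    documents,
    key=lambda name: cosine_similarity(documents[name], query),
    reverse=True,
)
print(ranking)
```

Here `d2` matches only one of the two query terms, yet it still receives a non-zero score and is ranked between the full match and the non-match.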
Probabilistic Model
\[ P(R = 1 \mid d, q) \]
Assumptions
Each index term is Boolean in nature (1 if present, 0
if absent).
There is no association between index terms.
The relevance of each document is calculated independent of
the relevance of other documents.
Documents and queries are both represented as binary term
incidence vectors.
\[ \bar{d}_j = (x_{1j}, x_{2j}, \ldots, x_{Mj}) \]
[x_{tj} is 1 if index term t is present in document d_j; otherwise it is 0.]
\[ \bar{q} = (q_1, q_2, \ldots, q_M) \]
Ranking Function rf()
\[ rf() = \frac{P(R \mid \bar{d}_j)}{P(\bar{R} \mid \bar{d}_j)} \]
where
$P(R \mid \bar{d}_j)$ is the probability that document d_j is relevant to q.
$P(\bar{R} \mid \bar{d}_j)$ is the probability that document d_j is not relevant to q.
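Applying Bayes' rule to the numerator and the denominator:
\[ rf() = \frac{P(\bar{d}_j \mid R)\, P(R)}{P(\bar{d}_j \mid \bar{R})\, P(\bar{R})} \]
Since $P(R)$ and $P(\bar{R})$ are the same for every document, eliminating
them from the above equation won't change the ranking of the
documents.
\[ rf() = \frac{P(\bar{d}_j \mid R)}{P(\bar{d}_j \mid \bar{R})}
       = \frac{P(x_1 \mid R)\, P(x_2 \mid x_1, R) \cdots P(x_M \mid x_1 \ldots x_{M-1}, R)}
              {P(x_1 \mid \bar{R})\, P(x_2 \mid x_1, \bar{R}) \cdots P(x_M \mid x_1 \ldots x_{M-1}, \bar{R})} \tag{1} \]
\[ = \prod_{t=1}^{M} \frac{P(x_t \mid x_1 \ldots x_{t-1}, R)}{P(x_t \mid x_1 \ldots x_{t-1}, \bar{R})} \]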
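With $p_t = P(x_t = 1 \mid R)$ and $u_t = P(x_t = 1 \mid \bar{R})$:
\[ rf() = \log \prod_{t:\, x_t = 1} \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)} \]
\[ = \sum_{t:\, x_t = 1} \left( \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t} \right) \tag{2} \]
\[ = \sum_{t:\, x_t = 1} c_t \]
c_t is the log odds ratio for the terms in the query.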
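To make the sum of log odds ratios concrete, here is a small sketch that scores a document from per-term probabilities. The helper names `log_odds_ratio` and `rf`, and the probability values, are hypothetical; the values are chosen by hand for illustration, not estimated from data.

```python
import math


def log_odds_ratio(p_t, u_t):
    """c_t = log(p_t / (1 - p_t)) + log((1 - u_t) / u_t)."""
    return math.log(p_t / (1 - p_t)) + math.log((1 - u_t) / u_t)


def rf(doc_terms, p, u):
    """Sum c_t over the terms present in the document (x_t = 1).

    Only terms with known probabilities, i.e. query terms, contribute.
    """
    return sum(log_odds_ratio(p[t], u[t]) for t in doc_terms if t in p)


# Hypothetical per-term probabilities for a two-term query.
p = {"retrieval": 0.8, "model": 0.5}  # chance of the term in relevant docs
u = {"retrieval": 0.2, "model": 0.5}  # chance of the term in non-relevant docs

score_both = rf({"retrieval", "model", "web"}, p, u)
score_model_only = rf({"model"}, p, u)
print(score_both, score_model_only)
```

A term that is equally likely in relevant and non-relevant documents (here `model`) has c_t = 0 and does not affect the ranking.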
Advantages
Documents are ranked in decreasing order of their
probability of being relevant.
Disadvantages
The need to guess the initial separation of documents into
relevant and non-relevant sets.
All weights are binary.
Index terms are assumed to be independent
Estimation of ct
\[ p_t = \frac{V_t + \frac{n_t}{N}}{V + 1}, \qquad
   u_t = \frac{n_t - V_t + \frac{n_t}{N}}{N - V + 1} \tag{5} \]
where
V is the subset of the documents initially retrieved and ranked
by the probabilistic model.
V_t is the subset of V containing index term x_t.
In the formulas, V and V_t stand for the sizes of these sets, N is
the total number of documents in the collection, and n_t is the
number of documents that contain index term x_t.
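The smoothed estimates above can be checked numerically with a short sketch. The function name `estimate_p_u`, its parameter names, and the counts used are assumptions for illustration only.

```python
def estimate_p_u(rel_with_term, rel_total, docs_with_term, docs_total):
    """Smoothed estimates of p_t and u_t.

    rel_with_term  : size of the retrieved subset that contains the term
    rel_total      : size of the retrieved subset assumed relevant
    docs_with_term : number of documents in the collection with the term
    docs_total     : number of documents in the collection
    """
    prior = docs_with_term / docs_total  # smoothing term n/N
    p_t = (rel_with_term + prior) / (rel_total + 1)
    u_t = (docs_with_term - rel_with_term + prior) / (docs_total - rel_total + 1)
    return p_t, u_t


# Hypothetical counts: 100 documents, 20 contain the term,
# 10 top-ranked documents are assumed relevant, 8 of them contain the term.
p_t, u_t = estimate_p_u(rel_with_term=8, rel_total=10,
                        docs_with_term=20, docs_total=100)

# Before any document has been retrieved, both estimates fall back to n/N.
p_0, u_0 = estimate_p_u(rel_with_term=0, rel_total=0,
                        docs_with_term=20, docs_total=100)
print(p_t, u_t, p_0, u_0)
```

The added n/N term keeps both estimates strictly between 0 and 1 in this example, so the log odds ratio stays finite even when the retrieved subset is empty or every retrieved document contains the term.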