
2015 International Conference on Computational Intelligence and Communication Networks

Analysis of Probabilistic Model for Document Retrieval in Information Retrieval

Astha Tamrakar, Santosh K. Vishwakarma


Department of CSE
GGITS, Jabalpur, India
Email : tamrakar.astha@gmail.com, santoshscholar@gmail.com

Abstract - Information Retrieval (IR) is the activity of finding documents of an unstructured nature that satisfy a user's information need. The term "IR" refers to the retrieval of unstructured records, that is, records consisting of free-form natural language text. There are various models for weighting the terms of corpus documents and query terms. The probabilistic model captures the IR problem in a probabilistic framework: it estimates the probability that a document is relevant to a user query. In this setting we have a collection of user queries, and for each query there is an ideal answer set. First, an initial set of documents is retrieved from the corpus or collection of documents. The user inspects these documents to identify the relevant ones, and the IR system then uses this information to refine the description of the ideal answer set. This work analyzes and evaluates the retrieval effectiveness of various probabilistic models on a new data set, FIRE 2011. The experiments were performed with different variants of the probabilistic model. Terrier 3.5, an open-source search engine, was used for all experiments and evaluation. Our results show that the IFB2 model gives the highest precision values on the news corpus data set.

General Terms - Information Retrieval, Variants of Probabilistic Models, Retrieval Effectiveness, Precision.

1. INTRODUCTION

The goal of Information Retrieval is to find and rank the documents of an unstructured nature that match the user's information need [1]. When a user enters a query into the system, the query does not uniquely identify a single object in the collection; instead, several objects may match the query with different degrees of relevancy. To interpret the user's query, an information retrieval system uses one of many models, assigns ranks to all the documents accordingly, and brings back the top relevant documents from the data set. In this paper we use a static data set to evaluate the results of the different models. Information retrieval is the process of searching a collection of documents to identify those documents which deal with a specific subject, and its success is determined by the accuracy with which data is retrieved [3]. It is recall and precision which attempt to measure this effectiveness.

Fig. 1 Information Retrieval System

The figure above shows an information retrieval system. A collection of documents goes to the indexer, which performs the indexing process; at the same time the user enters a query into the system, and the indexer retrieves the top-k relevant documents for that query. For this task some training data is given, learning algorithms and a ranking model are applied to the training data to match the user query, and the result page is obtained. There are some pre-processing steps which an information retrieval system applies before assigning ranks to the documents; these are discussed in the next section.

2. PRE-PROCESSING

Pre-processing is the process of incorporating a new document into an information retrieval system [6]. Pre-processing is used for storing the document and for processing retrieval requests, that is, for meeting space and time requirements. Its stages, such as tokenization, stemming, stop-word elimination and inverted-index construction, are applied to each document to determine how well documents satisfy information needs.

Fig. 2 Pre-processing

The figure above shows the various stages of pre-processing; each step is explained below.
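Taken together, these stages reduce a raw document to the list of index terms that the system actually stores. The following sketch illustrates the pipeline in Python (NLTK's PorterStemmer and the small stop-word list are assumptions made for illustration only; Terrier applies its own tokeniser, stop-word list and stemmer internally):

import re
from nltk.stem import PorterStemmer   # assumed stemmer, used only for this illustration

STOPWORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "but", "if"}   # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(text):
    # 1. Tokenization: chop the character sequence into word tokens
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 2. Stop-word elimination: drop terms with little retrieval value
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Stemming: map every variant onto its root form
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Friends, Romans, lend me your ears"))
# e.g. ['friend', 'roman', 'lend', 'me', 'your', 'ear']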

2.1 Tokenization
In tokenization, given a character sequence and a defined document unit, the task is to chop the sequence up into pieces called tokens; that is, the document is broken into words, symbols, phrases, numbers, etc. [8].

Input : Friends,Romans,lend me

Output : Friend Romans lend me
2.2 Stemming
Stemming is the process of replacing all the variants of a word with the single stem, or root, of the word [7]. Variants include plurals, gerund forms (-ing forms), third-person suffixes, past-tense suffixes, etc. A stem word can represent many words that might have different meanings. Stemming therefore collapses many of the original words onto a single root word, which reduces the size of the dictionary containing all the words of the document collection. For example:
- Connect: connects, connected, connecting, connection
- Car: car, cars, car's, cars'
2.3 Stop-words
The objective of stop-word elimination is to filter out words that occur in most of the documents. Such words have no value for retrieval purposes; they are called stopwords [8]. A typical stopword list contains several hundred words. From the user's perspective not every word is equally important, which is why these words are eliminated from the vocabulary entirely. They include:
- Articles (a, an, the, …)
- Prepositions (in, on, of, …)
- Conjunctions (and, or, but, if, …)
- Pronouns (I, you, them, it, …)
- Possibly some verbs and nouns.

2.4 Inverted Index
The purpose of an inverted index is to allow fast text searches; the cost of processing is paid when a document is added to the database. The inverted index is the list of words together with the documents in which they appear [7]. In the web-search example, you provide the list of words (your search query) and Google produces the documents.

Director     6    9   30   51   59   62   68   78
Bollywood   14   28   31   44   64   86   91   97

Fig 3. Inverted Index

Generally we use stemmed words in the index, such as "compute" for computer, computation, computations and many other similar words, and we also do not include stop words in the index; the reason behind eliminating stop words is simply to save space in the system. In the figure above, the collection of all searchable words is called the dictionary, and the lists of document IDs are called the postings.
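A minimal sketch of how such a dictionary-and-postings structure can be built from pre-processed documents (plain Python dictionaries; the integer document IDs mirror Fig. 3 and are purely illustrative):

from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping a document ID to its list of (pre-processed) index terms."""
    index = defaultdict(set)                     # dictionary: term -> postings (set of document IDs)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(postings) for term, postings in index.items()}

index = build_inverted_index({6: ["director"], 9: ["director"], 14: ["bollywood"]})
# index["director"] -> [6, 9]; a query term is then answered by a single dictionary lookup.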

3. REVIEW OF PROBABILISTIC MODELS

Information retrieval models define the way the document text, as well as the query, is represented. In the probabilistic model, for a given query, if we know some documents that are relevant, then terms that occur in these documents should be given greater weighting in comparison with other relevant documents. The model assigns ranks to documents according to their relevancy. We use different variants of the probabilistic model, namely BM25, BB2, IFB2, In_expB2, In_expC2, InL2, DFR_BM25, DFI0 and PL2.

3.1 BM25 Model [9][13]

BM25 is a probabilistic model developed by Stephen E. Robertson, Karen Spärck Jones and others; BM stands for "best match". The BM25 model does not use a single function but a set of functions.

BM25 = Σ_{t ∈ q∩d} ( tf / (k1 · nb + tf) ) · log( (N − dft + 0.5) / (dft + 0.5) ) · qtf     (1)

with:
- tf: frequency of occurrences of the term t in the document,
- N: total number of documents in the collection,
- dft: number of documents containing the term t,
- qtf: frequency of occurrences of the term t in the query,
- k1: parameter influencing the effect of the term frequency, set to 1.2 by default,
- nb: normalization factor, calculated as follows:

nb = (1 − b) + b · (tl / tlavg)     (2)

with:
- tl: number of terms in the document (document length),
- tlavg: average number of terms in a document,
- b: parameter controlling the strength of the length normalization (0.75 is a common default).
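Eq. (1) and (2) can be read directly as code. The sketch below scores one document against a query, assuming the collection statistics (N, dft, average document length) are already available; it illustrates the formula as reconstructed above and is not Terrier's implementation:

import math

def bm25_score(query_terms, doc_terms, N, dft, avg_dl, k1=1.2, b=0.75):
    """query_terms, doc_terms: lists of tokens; dft: dict mapping a term to its document frequency."""
    tl = len(doc_terms)
    nb = (1 - b) + b * (tl / avg_dl)                  # Eq. (2): document-length normalization
    score = 0.0
    for t in set(query_terms) & set(doc_terms):       # sum over t in q ∩ d
        tf = doc_terms.count(t)
        qtf = query_terms.count(t)
        idf = math.log((N - dft[t] + 0.5) / (dft[t] + 0.5))
        score += (tf / (k1 * nb + tf)) * idf * qtf    # Eq. (1)
    return score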
3.2 BB2 Model [13]

The BB2 model uses the Bose-Einstein model for randomness, the ratio of two Bernoulli processes for the first normalization, and Normalization 2 for term-frequency normalization.
w(t,d) = (F+1) / (nt · (tfn+1)) · ( −log₂(N−1) − log₂(e) + f(N+F−1, N+F−tfn−2) − f(F, F−tfn) )     (3)

where:
- w(t, d) is the within-document term weight of the term t in the document d,
- tf is the within-document frequency of the term t in the document d,
- F is the term frequency of the term t in the whole collection,
- N is the number of documents in the collection,
- nt is the document frequency of the term t,
- λ is given by λ = F/N,
- tfn is the normalized term frequency, given by Normalization 2:

tfn = tf · log₂(1 + c · avg_l / l)     (4)

where c is a parameter, and l and avg_l are the document length of the document d and the average document length in the collection, respectively.

3.3 IFB2 Model [13]

The Inverse Term Frequency model with Bernoulli after-effect and Normalization 2.

w(t,d) = (F+1) / (nt · (tfn+1)) · tfn · log₂( (N+1) / (nt + 0.5) )     (5)

3.4 In_expB2 Model [13]

The Inverse Expected Document Frequency model with Bernoulli after-effect and Normalization 2. The logarithms are base 2. This model can be used for classic ad-hoc tasks.

w(t,d) = (F+1) / (nt · (tfn+1)) · tfn · log₂( (N+1) / (ne + 0.5) )     (6)
3.5 In_expC2 Model [13]

The Inverse Expected Document Frequency model with Bernoulli after-effect and Normalization 2. The logarithms are base e. This model can be used for classic ad-hoc tasks. The I(n_exp)C2 model comes from Amati & Van Rijsbergen's DFR framework and is considered an effective and robust model on the collections used in the Robust Track. Its formula is:

w(t,d) = (F+1) / (ne · (tfne+1)) · tfne · log₂( (N+1) / (ne + 0.5) )     (7)

where:
- w(t, d) is the weight of the term t in the document d,
- F is the term frequency of the term t in the whole collection,
- nt is the document frequency of the term t,
- N is the number of documents in the collection,
- ne is given by:

ne = N · ( 1 − (1 − nt/N)^F )     (8)

- tfne is the normalized term frequency, given by a modified version of Normalization 2:

tfne = tf · loge(1 + c · avg_l / l)     (9)

3.6 PL2 Model [13]

The Poisson model with Laplace after-effect and Normalization 2. This model can be used for tasks that require early precision.

w(t,d) = 1/(tfn+1) · ( tfn · log₂(tfn/λ) + (λ + 1/(12 · tfn) − tfn) · log₂(e) + 0.5 · log₂(2π · tfn) )     (10)

where:
- w(t, d) is the within-document term weight of the term t in the document d,
- tfn = tf · log₂(1 + c · avg_l / l), where c is a parameter, and l and avg_l are the document length of the document d and the average document length in the collection, respectively.

3.7 InL2 Model [13]

The Inverse Document Frequency model with Laplace after-effect and Normalization 2. This model can be used for tasks that require early precision.

w(t,d) = 1/(tfn+1) · tfn · log₂( (N+1) / (nt + 0.5) )     (11)
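The DFR weights above all share the same ingredients (collection frequency F, document frequency nt, Normalization 2), so they can be transcribed uniformly in a few lines of Python. The functions below mirror Eq. (4)-(5) and (8)-(11) exactly as reconstructed here; they are only an illustrative sketch, and Terrier's own implementations should be used for actual experiments:

import math

def tfn(tf, dl, avg_dl, c=1.0):
    return tf * math.log2(1 + c * avg_dl / dl)               # Normalization 2, Eq. (4)

def n_exp(N, nt, F):
    return N * (1 - (1 - nt / N) ** F)                       # expected document frequency, Eq. (8)

def ifb2(tf, F, nt, N, dl, avg_dl, c=1.0):
    t = tfn(tf, dl, avg_dl, c)
    return (F + 1) / (nt * (t + 1)) * t * math.log2((N + 1) / (nt + 0.5))       # Eq. (5)

def pl2(tf, F, N, dl, avg_dl, c=1.0):
    t, lam = tfn(tf, dl, avg_dl, c), F / N                   # Poisson mean λ = F / N
    return (1 / (t + 1)) * (t * math.log2(t / lam)
                            + (lam + 1 / (12 * t) - t) * math.log2(math.e)
                            + 0.5 * math.log2(2 * math.pi * t))                 # Eq. (10)

def inl2(tf, nt, N, dl, avg_dl, c=1.0):
    t = tfn(tf, dl, avg_dl, c)
    return (1 / (t + 1)) * t * math.log2((N + 1) / (nt + 0.5))                  # Eq. (11)

A document's score for a query is then, as in Eq. (1), the sum of w(t, d) over the query terms, optionally multiplied by the query term frequency qtf.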
3.8 BM25 Model [13]

BM25 is a probabilistic model developed by Stephen E. Robertson, Karen Spärck Jones and others; BM stands for "best match". The BM25 model does not use a single function but a set of functions.

BM25 = Σ_{t ∈ q∩d} ( tf / (k1 · nb + tf) ) · log( (N − dft + 0.5) / (dft + 0.5) ) · qtf     (12)

with:
- tf: frequency of occurrences of the term t in the document,
- N: total number of documents in the collection,
- dft: number of documents containing the term t,
- qtf: frequency of occurrences of the term t in the query,
- k1: parameter influencing the effect of the term frequency, set to 1.2 by default,
- nb: the normalization factor of Eq. (2).

4. EXPERIMENTAL EVALUATION

Evaluation is an important part of research in every area, and it is equally important in information retrieval. In simple terms, evaluation measures how effectively a system performs and how accurate and valuable its results are.

4.1 Evaluation Measures

In Information Retrieval, to measure the effectiveness of the system, that is, to estimate how well a system can perform, we require a data set, a set of queries and some function to check the relevance between documents and queries. Generally, queries are imperfect in two respects: first, they retrieve some irrelevant documents; second, they do not retrieve all the relevant documents. A simple IR system just fetches the documents that are most relevant to the query and assigns ranks to them. Two measures are commonly used to evaluate the effectiveness of a retrieval method: precision and recall. In this paper we use these standard measurements, precision and recall, which are discussed in the next section.
4.1.1 Precision [1][4]

Precision is the proportion of retrieved documents that are actually relevant. If searchers want to raise precision, they have to narrow their queries.

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

4.1.2 Recall [1][4]

Recall is the proportion of all relevant documents that are actually retrieved. If searchers want to raise recall, they have to broaden their queries. There is an inverse relationship between precision and recall.

Recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection)
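Given the documents returned by a run and the documents judged relevant for a query, both measures are one-line computations; the sketch below assumes the two collections are available as Python lists or sets of document IDs (the IDs shown are hypothetical):

def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

# precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})  -> 2/4 = 0.50
# recall   (["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})  -> 2/3 ≈ 0.67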
4.1.3 Mean Average Precision [4]

In our paper we use MAP to evaluate our results. Mean average precision is a standard measure accepted by the TREC community for evaluation. Mean average precision for a set of queries is the mean of the average precision scores for each query.

MAP = ( Σ_{q=1}^{Q} AveP(q) ) / Q     (13)

where Q is the number of queries.
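Eq. (13) can be computed from ranked result lists and relevance judgments as sketched below; runs and qrels are hypothetical dictionaries keyed by query ID, and average precision is taken over the relevant documents of each query:

def average_precision(ranked_docs, relevant):
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank            # precision at each relevant document
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """runs: query_id -> ranked list of doc IDs; qrels: query_id -> set of relevant doc IDs."""
    return sum(average_precision(runs[q], qrels[q]) for q in runs) / len(runs)    # Eq. (13)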

4.2 Description of Data Set

The experiments have been carried out on the FIRE 2011 data set [http://www.isical.ac.in/~fire/] for English. The data set consists of a qrels file, which is also known as the topic file, a result file and an evaluation file. The data set contains various documents from newspapers and websites; we took a sample from this data set for performing the experiments. The corpus was created to support experiments for research purposes in the information retrieval domain.

4.2.1 FIRE and Document Format

FIRE stands for the Forum for Information Retrieval Evaluation. It is an India-based organization for research on information retrieval, and it works on the languages of South Asian countries. The Forum for Information Retrieval Evaluation (FIRE) has the following aims:
- Explore new Information Retrieval / Access tasks that arise as our information needs evolve and new needs emerge.
- Provide a common evaluation infrastructure for comparing the performance of different IR systems.

The document format used in the FIRE collection follows the standard representation of the TREC collection. Documents contain tags such as DOC, DOCNO and TEXT. DOCNO is a unique number for every document in the data set, and the TEXT field contains the actual news article in plain text. An example of a document is shown below.

<DOC>
<DOCNO>doc_54/0054</DOCNO>
<TEXT>
Jurassic Park is a 1993 American science fiction adventure film directed by Steven Spielberg. It is the first installment of the Jurassic Park film series. It is based on the 1990 novel of the same name by Michael Crichton, with a screenplay written by Crichton and David Koepp. The film centers on the fictional Isla Nublar, an islet located off Central America's Pacific Coast, near Costa Rica, where a billionaire philanthropist and a small team of genetic scientists have created a wildlife park of cloned dinosaurs.
</TEXT>
</DOC>

Fig 4. Document Format

4.2.2 Topic File [4]

The topic file contains pre-defined queries for the data set; these queries cover almost every document within the data set. From our sampled FIRE data collection we take 17 queries. An example of our topic file is shown in Figure 5. The topic file format contains tags such as top, num and title; the title is the query, and a number is assigned to every topic.
<topics>
<top>
<num>13</num>
<title>jurassic park</title>
</top>
<top>
<num>14</num>
<title>far from the madding crowd</title>
</top>

Fig. 5 Topic File
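Both the document markup of Fig. 4 and the topic markup of Fig. 5 are simple TREC-style formats, so a few regular expressions are enough to load them. The sketch below assumes well-formed, non-nested tags and that a whole file has been read into the string raw:

import re

DOC_RE   = re.compile(r"<DOC>.*?<DOCNO>(.*?)</DOCNO>.*?<TEXT>(.*?)</TEXT>.*?</DOC>", re.S)
TOPIC_RE = re.compile(r"<top>.*?<num>(.*?)</num>.*?<title>(.*?)</title>.*?</top>", re.S)

def parse_docs(raw):
    # (document number, article text) pairs, as in Fig. 4
    return [(d.strip(), t.strip()) for d, t in DOC_RE.findall(raw)]

def parse_topics(raw):
    # (query number, query title) pairs, as in Fig. 5
    return [(n.strip(), t.strip()) for n, t in TOPIC_RE.findall(raw)]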

4.2.3 Qrels File [4]

The qrels file describes, for each query, the presence or absence of the query topic in a document, i.e. its relevance. The format of the qrels file is shown in Figure 6: the first column shows the query ID, which corresponds to the topic file; the second column shows the iteration; the third column shows the document ID, as given in the document format; and the last column shows the presence or absence of that query in the document as 1 or 0.

9 Q0 doc_17/0017 1
9 Q0 doc_18/001 0
9 Q0 doc_19/0019 0
9 Q0 doc_20/0020 0
9 Q0 doc_21/0021 0
9 Q0 doc_22/0022 0
9 Q0 doc_23/0023 0
9 Q0 doc_24/0024 0
9 Q0 doc_25/0025 1
9 Q0 doc_26/0026 0

Fig. 6 View of qrel file
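In this TREC-style format the judgments can be loaded into a dictionary with a few lines of Python (a sketch; the iteration column is ignored, and the file name is hypothetical):

def load_qrels(path):
    qrels = {}                                   # query_id -> set of relevant document IDs
    with open(path) as f:
        for line in f:
            qid, _iteration, docno, rel = line.split()
            if rel == "1":
                qrels.setdefault(qid, set()).add(docno)
    return qrels

# load_qrels("fire.qrels").get("9") -> {"doc_17/0017", "doc_25/0025"} for the sample above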


5. RESULT AND ANALYSIS

We performed our experiments in Terrier 3.5, which is an open-source search engine. It has all the necessary code to support experiments on the FIRE data set; we only made some changes in Terrier's properties file. Many information retrieval models are already supported by Terrier 3.5. We first show the results of all the variants of the probabilistic model. The models we use are BM25, BB2, IFB2, In_expB2, In_expC2, InL2, PL2, DFI0 and DFR_BM25. We use two measures for all models, MAP and R-Precision. Figure 7 shows an example of the eval file that is generated for every model by Terrier 3.5; it reports information about the retrieved and relevant documents.

Number of queries = 17
Retrieved = 184
Relevant = 87
Relevant retrieved = 72
____________________________________
Average Precision: 0.7796
R Precision: 0.7616

Fig. 7 Eval File

We plot the precision values of all the implemented models in Figures 8.1, 8.2 and 8.3, shown below.

[Bar chart: MAP of each model; x-axis: BM25, BB2, IFB2, In_expB2, In_expC2, InL2, PL2, DFI0, DFR_BM25; y-axis from 0.772 to 0.786]

Fig. 8.1

In Figure 8.1 the x-axis lists the different probabilistic models used in this paper and the y-axis plots the MAP values obtained in our results; the results show that the IFB2 model has the highest MAP value.

[Bar chart: R-Precision of each model; same models on the x-axis; y-axis from 0.76 to 0.775]

Fig. 8.2

Similarly, in Figure 8.2 the x-axis lists the different probabilistic models used in this paper and the y-axis plots the R-Precision values obtained in our results.

[Bar chart: P@10 of each model; same models on the x-axis; y-axis from 0.41 to 0.426]

Fig. 8.3

In Figure 8.3 the x-axis lists the different probabilistic models and the y-axis plots the P@10 values obtained in our results. As the graph clearly shows, for P@10 the In_expC2 model has the highest precision value of all the models. We applied the various probabilistic models to our data set and compared the results; Table 1 illustrates the comparison. IFB2 gives a MAP value of 0.7846, which is the highest in the class of all its variants.
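P@10 denotes precision at a cutoff of ten retrieved documents, computed per query and then averaged; it is a one-line variant of the precision function given earlier:

def precision_at_k(ranked_docs, relevant, k=10):
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k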
Table 1: MAP, R-Precision and P@10 values for different IR Models

Models       BM25    BB2     IFB2    In_expB2  In_expC2  InL2    PL2     DFI0    DFR_BM25
MAP          0.7828  0.7796  0.7846  0.7820    0.7791    0.7828  0.7779  0.7801  0.7828
R-Precision  0.7734  0.7616  0.7734  0.7616    0.7616    0.7734  0.7616  0.7616  0.7734
P@10         0.4176  0.4118  0.4176  0.4176    0.4235    0.4176  0.4176  0.4176  0.4176
MAP, R-Precision and P@10 values for the different IR models are shown in Table 1 above. The table shows that the IFB2 model has the highest MAP value; similarly, the BM25, IFB2 and InL2 models have the highest R-Precision values, and the In_expC2 model is the highest at P@10 compared to the other models.

6. CONCLUSION AND FUTURE WORK

This work has been carried out to analyze the performance of various Information Retrieval models on the FIRE data set, which contains a corpus of various newspapers. We implemented the probabilistic model and its variants and compared the results of each probabilistic model. Based on our results we conclude that IFB2 produces the best results over all the topic files. The results were evaluated and successfully compared with Terrier, the open-source search engine.

In future work:
- we plan to work with larger data sets,
- we will use other regional languages like Hindi,
- we plan to use various other tools.

REFERENCES
[1] Saif Rababah. Modern Information Retrieval System, Chapter 7: Text Processing. web2.aabu.edu.jo/tool/course_file/lec_notes/902333_chapter07.
[2] Amati, Giambattista. Probability Models for Information Retrieval Based on Divergence from Randomness. PhD thesis, University of Glasgow, 2003.
[3] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. An Introduction to Information Retrieval.
[4] Santosh K. Vishwakarma, Kamaljit I. Lakhtaria and Chandra Shekhar Jangid. Ad-hoc Retrieval on FIRE Data Set with TF-IDF and Probabilistic Models.
[5] Modern Information Retrieval, Chapter 3: Modeling.
[6] Vikram Singh and Balwinder Saini. An Effective Pre-processing Algorithm for Information Retrieval Systems.
[7] nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.
[8] www.cs.stu.ca/CourseCentral/slides/Tokenization.
[9] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond.
[10] Stephen E. Robertson, Steve Walker and Micheline Hancock-Beaulieu (November 1998). Okapi at TREC-7. Proceedings of the Seventh Text REtrieval Conference, Gaithersburg, USA.
[11] Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36.
[12] Ben He and Iadh Ounis. A Query-based Pre-retrieval Model Selection Approach to Information Retrieval.
[13] Karim Gasmi, Mouna Torjmen-Khemakhem and Maher Ben Jemaa. Word Indexing Versus Conceptual Indexing in Medical Image Retrieval (ReDCAD participation at ImageCLEF Medical Image Retrieval 2012).
[14] Victor Lavrenko. Introduction to Probabilistic Models for Information Retrieval. University of Edinburgh.
[15] Ben He and Iadh Ounis. University of Glasgow at the Robust Track - A Query-based Model Selection Approach for the Poorly-performing Queries.
[16] Saptaditya Maiti, Deba P. Mandal and Pabitra Mitra. Sentence Ranking for Document Indexing.
[17] Frakes, William B. "Stemming Algorithms." (1992): 131-160.
[18] Information Retrieval Methods in Libraries and Information Centres. An International Multidisciplinary Journal, Ethiopia.
[19] airccse.org/journal/ijdms/papers/6614ijdms02.pdf.
[20] Silberschatz, Korth and Sudarshan. Database System Concepts, Chapter 19: Information Retrieval.