Output: Friend Romans lend me

2.2 Stemming
Stemming replaces all the variants of a word with its single stem, or root [7]. Variants include plurals, gerund forms (ing-form), third person suffixes, past tense suffixes, etc. One stem can represent many words that might have different meanings. Stemming therefore keeps a single root word in place of many original words, which reduces the size of the dictionary that contains all the words of the document collection. For example:
- Connect: connects, connected, connecting, connection
- Car: car, cars, car's, cars'

2.3 Stop-words
The objective of stop-word elimination is to filter out words that occur in most of the documents. Such words have no value for retrieval purposes and are called stopwords [8]. A typical stopword list contains several hundred words. From the user's perspective not every word is equally important, which is why these words are eliminated from the vocabulary entirely. They include:
- Articles (a, an, the, …)
- Prepositions (in, on, of, …)
- Conjunctions (and, or, but, if, …)
- Pronouns (I, you, them, it, …)
- Possibly some verbs and nouns.

Inverted Index
The purpose of an inverted index is to allow fast text searches, at the cost of increased processing when a document is added to the database. The inverted index is a list of words together with the documents in which they appear [7]. In the web search example, you provide the list of words (your search query), and Google produces the documents.

Director 6 9 30 51 59 62 68 78
Bollywood 14 28 31 44 64 86 91 97

Information models define the way to represent the document text as well as the query. In a probabilistic model, for a given query, if we know some documents that are relevant, then terms that occur in these documents should be given greater weighting in searching for other relevant documents. The model assigns ranks to documents according to their relevancy. We are using different variants of probabilistic models: BM25, BB2, IFB2, In_expB2, In_expC2, InL2, DFR_BM25, DFI0 and PL2.

3.1 BM25 Model [9][13]

BM25 is a probabilistic model developed by Stephen E. Robertson, Karen Spärck Jones, and others. BM stands for 'best match'. The BM25 model does not use a single function; it uses a set of functions.

$$\mathrm{BM25} = \sum_{t \in q \cap d} \left( \frac{tf}{k_1 \cdot nb + tf} \cdot \log\left(\frac{N - df_t + 0.5}{df_t + 0.5}\right) \cdot qtf \right) \qquad (1)$$

with:
- tf: frequency of occurrences of the term t in the document,
- N: total number of documents in the collection,
- df_t: number of documents containing a term t,
- qtf: frequency of occurrences of a term t in the query,
- k1: parameter influencing the frequency of terms, set to 1.2 by default,
- nb: normalization factor, calculated as follows:

$$nb = (1 - b) + b \cdot \frac{tl}{tl_{avg}} \qquad (2)$$

with:
- tl: number of terms in the document (document length),
- tl_avg: average number of words in a document.
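The preprocessing and weighting steps described above can be tied together in a minimal sketch. It assumes that equation (1) has the form tf / (k1 · nb + tf) · log((N − df_t + 0.5) / (df_t + 0.5)) · qtf, that the logarithm is natural, and that b = 0.75; none of these choices is confirmed by the text. The stop-word list, the four sample documents and all function names are hypothetical and serve only as illustration.

```python
import math
from collections import Counter, defaultdict

# Illustrative stop-word list (assumption); a real list has several hundred entries.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "but", "if",
              "i", "you", "them", "it"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Return an inverted index {term: {doc_id: tf}} and the document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        terms = tokenize(text)
        lengths[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25(query, doc_id, index, lengths, k1=1.2, b=0.75):
    """Score one document with the assumed form of equations (1) and (2)."""
    n_docs = len(lengths)
    tl_avg = sum(lengths.values()) / n_docs
    nb = (1 - b) + b * lengths[doc_id] / tl_avg      # equation (2)
    score = 0.0
    for term, qtf in Counter(tokenize(query)).items():
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf == 0:
            continue                                  # sum runs over terms in q and d
        df_t = len(postings)
        idf = math.log((n_docs - df_t + 0.5) / (df_t + 0.5))
        score += tf / (k1 * nb + tf) * idf * qtf      # equation (1)
    return score


# Hypothetical toy collection.
docs = {
    1: "the director of the film",
    2: "bollywood film music",
    3: "cricket match report",
    4: "election results in the city",
}
index, lengths = build_index(docs)
ranking = sorted(docs,
                 key=lambda d: bm25("bollywood director", d, index, lengths),
                 reverse=True)
```

In this toy collection documents 1 and 2 each match one query term, and the shorter document 1 is ranked first because its normalization factor nb is smaller. Note that the logarithm in this form becomes negative for a term that occurs in more than half of the documents.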
$$w(t,d) = \frac{F+1}{n_t \cdot (tfn+1)} \left( -\log_2(N-1) - \log_2(e) + f(N+F-1,\; N+F-tfn-2) - f(F,\; F-tfn) \right) \qquad (3)$$

where-
- w(t, d) is the within-document term weight of the term t in the document d.
- $tfn = tf \cdot \log_2\left(1 + c \cdot \frac{avg\_l}{l}\right)$ (4)
where c is a parameter. l and avg_l are the document length of the document d and the average document length in the collection respectively.
- F is the term frequency of the term t in the whole collection.
- n_t is the document frequency of the term t.
- N is the number of documents in the collection.
- n_e is given by:

$$n_e = N \cdot \left(1 - \left(1 - \frac{n_t}{N}\right)^{F}\right) \qquad (8)$$

2. This model can be used for tasks that require early precision.

$$w(t,d) = \frac{1}{tfn+1} \left( tfn \cdot \log_2\frac{tfn}{\lambda} + \left(\lambda + \frac{1}{12 \cdot tfn} - tfn\right) \cdot \log_2 e + 0.5 \cdot \log_2(2\pi \cdot tfn) \right) \qquad (10)$$
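As a minimal sketch, the two auxiliary quantities used by these models can be computed as follows. It assumes that equation (4) is the usual term-frequency normalisation tfn = tf · log2(1 + c · avg_l / l) and that equation (8) is n_e = N · (1 − (1 − n_t / N)^F). The function names and the default c = 1.0 are illustrative assumptions, not values taken from the paper.

```python
import math


def tfn(tf, doc_len, avg_doc_len, c=1.0):
    """Normalised term frequency, assumed form of equation (4)."""
    return tf * math.log2(1 + c * avg_doc_len / doc_len)


def expected_doc_freq(n_docs, doc_freq, coll_freq):
    """Expected document frequency n_e, assumed form of equation (8).

    n_docs    -- N, number of documents in the collection
    doc_freq  -- n_t, document frequency of the term
    coll_freq -- F, frequency of the term in the whole collection
    """
    return n_docs * (1 - (1 - doc_freq / n_docs) ** coll_freq)


# A document of average length keeps its raw frequency when c = 1,
# while a document twice as long is discounted.
avg_case = tfn(tf=3, doc_len=100, avg_doc_len=100)
long_case = tfn(tf=3, doc_len=200, avg_doc_len=100)
```

The normalisation lowers the effective term frequency of long documents, so a term repeated in a long document counts for less than the same number of occurrences in a short one.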
- nb: nb is the normalization factor.

4. EXPERIMENTAL EVALUATION

Evaluation is an important part of research in every area, and information retrieval is no exception. In simple terms, evaluation measures how effectively a system performs and how accurate its results are.

4.1 Evaluation Measures

In Information Retrieval, measuring the effectiveness of a system, that is, estimating how well it can perform, requires a data set, a set of queries, and some function that checks the relevance between documents and queries. Generally, queries are imperfect in two respects: first, they retrieve some irrelevant documents; second, they do not retrieve all the relevant documents. A simple IR system just fetches the documents most relevant to the query and assigns ranks to them. Two standard measures of the effectiveness of a retrieval method are precision and recall. In this paper we use these two measures, which are discussed in the next section.

4.1.1 Precision [1][4]
In our paper we used MAP to evaluate our results. Mean average precision is a standard measure accepted by the TREC community for evaluation. Mean average precision for a set of queries is the mean of the average precision scores for each query.

$$\mathrm{MAP} = \frac{\sum_{q=1}^{Q} AP(q)}{Q} \qquad (13)$$

where Q is the number of queries and AP(q) is the average precision of query q.

4.2 Description of Data Set
The experiments have been carried out on the English data set of FIRE 2011 [http://www.isical.ac.in/~fire/]. The data set is accompanied by a topic file, a qrel file, a result file and an evaluation file. It contains various documents from newspapers and websites. We took only a sample of this data set for performing the experiments. The corpus was created to support experiments for research purposes in the information retrieval domain.

4.2.1 FIRE and Document Format
FIRE stands for Forum for Information Retrieval Evaluation. It is an India-based organization for research on information retrieval, and it works on the languages of South Asian countries. The Forum for Information Retrieval Evaluation (FIRE) has the following aims:
- Explore new Information Retrieval / Access tasks that arise as our information needs evolve, and new needs emerge.
- Provide a common evaluation infrastructure for comparing the performance of different IR systems.
The document format used in the FIRE collection follows the standard representation of the TREC collection. Documents contain tags such as DOC, DOCNO and TEXT. DOCNO represents a unique number for every document in the data set. The TEXT field contains the actual news article in plain text. The example of a text file is shown below.

4.2.2 Topic File [4]

The topic file contains some pre-fixed queries for the data set; these queries cover almost every document within the data set. For our sample of the FIRE data collection we take 17 queries. An example of our topic file is shown in Figure 5. The topic file format contains tags such as top, num
and title. The title is the query, and a number is assigned to every topic.

<topics>
<top>
<num>13</num>
<title>jurassic park</title>
</top>
<top>
<num>14</num>
<title>far from the madding crowd</title>
</top>
Fig. 5 Topic File

column shows the iteration, the third column shows the document ID given in the document format, and the last column marks the absence or presence of that query in the document with 0 or 1.

9 Q0 doc_17/0017 1
9 Q0 doc_18/001 0
9 Q0 doc_19/0019 0
9 Q0 doc_20/0020 0
9 Q0 doc_21/0021 0
9 Q0 doc_22/0022 0
9 Q0 doc_23/0023 0
9 Q0 doc_24/0024 0
9 Q0 doc_25/0025 1
9 Q0 doc_26/0026 0

Number of queries = 17
Retrieved = 184
Relevant = 87
Relevant retrieved = 72
____________________________________
Average Precision: 0.7796
R Precision : 0.7616

Fig. 7 Eval File

We plot the precision values of all the implemented models in Figures 8.1, 8.2 and 8.3.

[Chart of the models DFI0, BM25, BB2, IFB2, InL2, PL2, DFR_BM25, In_expB2 and In_expC2]
Fig. 8.1

In Fig. 8.1, the x axis shows the different probabilistic models used in this paper and the y axis shows their MAP values; the points in the graph are the MAP values obtained in our results. The results show that the IFB2 model has the highest MAP value.

[Chart: R Precision]
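The measures used in this section (precision, recall, P@10, and MAP as in equation (13)) can be sketched as follows. The counts are those of the evaluation file in Fig. 7, and the two relevant document IDs are the qrel lines marked 1 above; the ranking itself and all function names are hypothetical and only illustrate the computation.

```python
def precision_recall(retrieved, relevant, relevant_retrieved):
    """Set-based precision and recall from raw counts."""
    return relevant_retrieved / retrieved, relevant_retrieved / relevant


def precision_at_k(ranking, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant_ids) / k


def average_precision(ranking, relevant_ids):
    """Average of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """Equation (13); runs is a list of (ranking, relevant_ids), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Counts taken from the evaluation file (Fig. 7).
p, r = precision_recall(retrieved=184, relevant=87, relevant_retrieved=72)

# Hypothetical ranking for one query; relevant IDs are the qrel lines marked 1.
relevant = {"doc_17/0017", "doc_25/0025"}
ranking = ["doc_17/0017", "doc_20/0020", "doc_25/0025", "doc_19/0019"]
ap = average_precision(ranking, relevant)  # (1/1 + 2/3) / 2
```

With the counts of Fig. 7 this gives a set-based precision of about 0.39 and a recall of about 0.83. These are not the same quantity as the Average Precision value reported in the evaluation file, which depends on the rank positions of the relevant documents and cannot be derived from the counts alone.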
[Chart: P@10, y-axis from 0.41 to 0.426]
Fig. 8.3

In Fig. 8.3, the x axis shows the different probabilistic models and the y axis shows their precision values; the points in the graph are the precision values obtained in our results. The graph clearly shows that In_expC2 has the highest P@10 value of all the models. We applied various probabilistic models to our data set and compared the results. Table 1 summarizes the comparison. IFB2 gives a MAP value of 0.7846, the highest among all the variants.

Table 1: MAP, R-Precision and P@10 values for different IR Models

Models        BM25    BB2     IFB2    In_expB2  In_expC2  InL2    PL2     DFI0    DFR_BM25
MAP           0.7828  0.7796  0.7846  0.7820    0.7791    0.7828  0.7779  0.7801  0.7828
R-Precision   0.7734  0.7616  0.7734  0.7616    0.7616    0.7734  0.7616  0.7616  0.7734
P@10          0.4176  0.4118  0.4176  0.4176    0.4235    0.4176  0.4176  0.4176  0.4176

MAP, R-Precision and P@10 values for the different IR models are shown in Table 1 above. The table shows that IFB2 has the highest MAP value. Similarly, BM25, IFB2, InL2 and DFR_BM25 share the highest R-Precision value, and In_expC2 is highest at P@10 compared to the other models.

for all the topics file. The results were evaluated and successfully compared with Terrier, which is an open source search engine.

In future work:
- We plan to work with larger data sets.
- We will use other regional languages such as Hindi.
- We plan to use various other tools.

REFERENCES
[1] Saif Rababah. Modern Information Retrieval System, Chapter 7: Text Processing. web2.aabu.edu.jo/tool/course_file/lec_notes/902333_chapter07.
[2] Amati, Giambattista. Probability models for information retrieval based on divergence from randomness. Diss. University of Glasgow, 2003.
[3] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval.
[4] Santosh K Vishwakarma, Kamaljit I Lakhtaria, Chandra Shekhar Jangid. Ad-hoc Retrieval on FIRE Data Set with TF-IDF and Probabilistic Models.
[5] Modern Information Retrieval, Chapter 3: Modeling. IR book.
[6] Vikram Singh and Balwinder Saini. An Effective Pre-processing Algorithm for Information Retrieval Systems.
[7] nlp.stanford.edu/IR-book/html/htmledition/Stemming-and-lemmatization-1.html.
[8] www.cs.stu.ca/CourseCentral/slides/Tokenization.
[9] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond.
[10] Stephen E. Robertson, Steve Walker, and Micheline Hancock-Beaulieu (November 1998). Okapi at TREC-7 (PDF). Proceedings of the Seventh Text REtrieval Conference. Gaithersburg, USA.
[11] Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36.
[12] Ben He and Iadh Ounis. A Query-based Pre-retrieval Model Selection Approach to Information Retrieval.
[13] Karim Gasmi, Mouna Torjmen-Khemakhem, and Maher Ben Jemaa. Word Indexing Versus Conceptual Indexing in Medical Image Retrieval (ReDCAD participation at ImageCLEF Medical Image Retrieval 2012).
[14] Victor Lavrenko. Introduction to Probabilistic Models for Information Retrieval. University of Edinburgh.
[15] Ben He, Iadh Ounis. University of Glasgow at the Robust Track - A Query-based Model Selection Approach for the Poorly-performing Queries.
[16] Saptaditya Maiti, Deba P. Mandal and Pabitra Mitra. Sentence Ranking for Document Indexing.
[17] Frakes, William B. "Stemming Algorithms." (1992): 131-160.
[18] Information Retrieval Methods in Libraries and Information Centres, An International Multidisciplinary Journal, Ethiopia.
[19] airccse.org/journal/ijdms/papers/6614ijdms02.pdf.
[20] Silberschatz, Korth and Sudarshan. Database System Concepts, Chapter 19: Information Retrieval.