JIMMA UNIVERSITY MSC INFORMATION SCIENCE IDF IMPLEMENTATION

JIMMA UNIVERSITY
JIMMA INSTITUTE OF TECHNOLOGY

FACULTY OF COMPUTING AND
INFORMATICS
MSC. IN INFORMATION SCIENCE
(ELECTRONIC AND DIGITAL RECOURSE
MANAGEMENT)
Course Name: Advanced Information Retrieval
Assignment 4: Inverse Document Frequency
Prepared by: Ruth Wondu
Submitted to: Dr Getachew Mamo
January, 2021
1. Definition of Inverse Document Frequency
Raw term frequency as above suffers from a critical problem: all terms are considered
equally important when it comes to assessing relevancy on a query. In fact certain terms have
little or no discriminating power in determining relevance. For instance, a collection of
documents on the auto industry is likely to have the term auto in almost every document. To
this end, we introduce a mechanism for attenuating the effect of terms that occur too often in
the collection to be meaningful for relevance determination. An immediate idea is to scale
down the term weights of terms with high collection frequency, defined to be the total
number of occurrences of a term in the collection. The idea would be to reduce the tf weight
of a term by a factor that grows with its collection frequency.
Instead, it is more commonplace to use for this purpose the document frequency dft, defined
to be the number of documents in the collection that contain a term - t . This is because in
trying to discriminate between documents for the purpose of scoring it is better to use a
document-level statistic (such as the number of documents containing a term) than to use a
collection-wide statistic for the term.
Figure shows: Collection frequency (cf) and document

frequency (df) behave differently, as in this example from
the Reuters collection. The reason to prefer df to cf is illustrated in Figure 6.7 , where a
simple example shows that collection frequency (cf) and document frequency (df) can
behave rather differently. In particular, the cf values for both try and insurance are roughly
equal, but their df values differ significantly. Intuitively, we want the few documents that
contain insurance to get a higher boost for a query on insurance than the many documents
containing try get from a query on try. How is the document frequency df of a term used to
scale its weight? Denoting as usual the total number of documents in a collection by-N, we
define the inverse document frequency of a term -t as follows:
1|Page
Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
Figure 6.8 gives an example of idf's in the Reuters collection of 806,791 documents; in this
example logarithms are to the base 10. In fact, as we will see in Exercise 6.2.2 , the precise base
of the logarithm is not material to ranking. We will give on page 11.3.3 a justification of the
particular form in Equation 21.
2. Advantages and Disadvantages of Inverse Document Frequency

Advantages
 It helps search engines to identify what it is that makes a given document special.
 It computes document similarity directly in the word-count space, which may be slow for
large vocabularies
 Is an efficient and simple algorithm for matching words in a query to documents that are
relevant to the query
Disadvantages
 It computes document similarity directly in the word-count space, which may be slow for
large vocabularies.
 It assumes that the counts of different words provide independent evidence of similarity.
 It makes no use of semantic similarities between words.
2|Page
3. Implementation using any programming language (python/java ...). I recommend you
python programming language
from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy as np
corpus = [
'it is going to rain today',
'today i am not going outside',
'i am going to watch the season premier',
'season premier is over'
def TF(corpus, unique_words):
tf_dict={}
N=len(corpus)
3|Page
for i in unique_words:
count=0
for sen in corpus:
if i in sen.split():
count=count+1
tf_dict[i]=(count/(count+1))+1
return tf_dict
def fit(whole_data):
unique_words = set()
if isinstance(whole_data, (list,)):
for x in whole_data:
for y in x.split():
if len(y)<2:
continue
unique_words.add(y)
unique_words = sorted(list(unique_words))
vocab = {j:i for i,j in enumerate(unique_words)}
tf_values_of_all_unique_words=TF(whole_data,unique_words)
return vocab, tf_values_of_all_unique_words
4|Page
Vocabulary, tf_of_vocabulary=fit(corpus)
print(list(Vocabulary.keys()))
print(list(tf_of_vocabulary.values()))
5|Page
6|Page
Reference
1. Kishore papineni, InverseDocumentFrequency
https://moz.com/blog/inverse-document-frequency-and-the-importance-of-
uniqueness#:~:text=It%20helps%20search%20engines%20identify,term
%20(e.g.%20keyword%20density). Assessed on 2021
7|Page

JIMMA UNIVERSITY MSC INFORMATION SCIENCE IDF IMPLEMENTATION

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

JIMMA UNIVERSITY MSC INFORMATION SCIENCE IDF IMPLEMENTATION

Uploaded by

Copyright:

Available Formats

JIMMA UNIVERSITY

JIMMA INSTITUTE OF TECHNOLOGY

Course Name: Advanced Information Retrieval

Assignment 4: Inverse Document Frequency

Prepared by: Ruth Wondu

Submitted to: Dr Getachew Mamo

Figure shows: Collection frequency (cf) and document

2. Advantages and Disadvantages of Inverse Document Frequency

from collections import Counter

from tqdm import tqdm

from scipy.sparse import csr_matrix

from sklearn.preprocessing import normalize

'it is going to rain today',

'today i am not going outside',

'i am going to watch the season premier',

'season premier is over'

def TF(corpus, unique_words):

for sen in corpus:

vocab = {j:i for i,j in enumerate(unique_words)}

return vocab, tf_values_of_all_unique_words

You might also like