Terms
Terms are usually stems. Terms can also be phrases,
such as “Computer Science”, “World Wide Web”, etc.
Documents and queries are represented as vectors or
“bags of words” (BOW).
Each vector holds a place for every term in the collection.
Position 1 corresponds to term 1, position 2 to term 2,
..., and position n to term n.

Di = (wdi1, wdi2, ..., wdin)
Q = (wq1, wq2, ..., wqn),   where w = 0 if a term is absent

Document Collection
A collection of n documents can be represented in
the vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of
a term in the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn

Documents are represented by binary-weighted or non-binary-weighted vectors of terms.
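As a minimal illustration (not part of the original slides), a term-document matrix of raw counts could be built like this in Python; the toy documents and terms are invented for the example:

```python
from collections import Counter

# Toy collection; the documents are invented for illustration only.
docs = {
    "D1": "information retrieval ranks documents by relevance",
    "D2": "the vector space model represents documents as vectors",
    "D3": "term weighting improves retrieval of relevant documents",
}

# Vocabulary: every term that occurs anywhere in the collection.
vocab = sorted({term for text in docs.values() for term in text.split()})

# counts[d][t] is the "weight" of term t in document d (here a raw count);
# zero means the term does not occur in that document.
counts = {d: Counter(text.split()) for d, text in docs.items()}

for d in docs:
    print(d, [counts[d][t] for t in vocab])
```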
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
• Binary Weights Formula:
   wij = 1 if freqij > 0
   wij = 0 if freqij = 0
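A minimal sketch (not from the slides) of this binary weighting formula, applied to illustrative raw counts:

```python
# Binary weighting keeps only the presence (1) or absence (0) of each term.
# The raw frequency counts below are illustrative, not taken from the table above.
raw_freq = {
    "D1": {"t1": 2, "t2": 0, "t3": 3},
    "D2": {"t1": 1, "t2": 0, "t3": 0},
}

binary = {
    doc: {term: 1 if freq > 0 else 0 for term, freq in freqs.items()}
    for doc, freqs in raw_freq.items()
}
print(binary)  # {'D1': {'t1': 1, 't2': 0, 't3': 1}, 'D2': {'t1': 1, 't2': 0, 't3': 0}}
```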

Why use term weighting?
Binary weights are too limiting:
 Terms are either present or absent.
 They do not allow ordering of documents according to their level of relevance for a given query.
Non-binary weights, obtained with statistical weighting schemes, allow modelling of partial matching.
 Partial matching allows retrieval of documents that approximate the query.
• Term weighting improves the quality of the answer set.
 Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than others.
Term weighting:
This is done by assigning a numerical value to each of
the index terms in a query or a document,
reflecting their relative importance.
A term with a high weight is assumed to be very
relevant to the document or query.
A term with low weight indicates little relevance
to the content of the document or query.
This makes it possible to retrieve documents in the
decreasing order of query-document similarity, the
most similar (presumably relevant) being retrieved
first.
 These weights can then be used to define a function which measures
the similarity or closeness between query and documents.
Some of the term weighting schemes (or functions or methods) suggested are:
 The term frequency (tf) weights
 The inverse document frequency (idf, or collection frequency) weights
 The composite weight (tf*idf)
Term frequency (TF) weights
 The frequency of occurrence of a term is a useful indication
of its relative importance in describing a document.
 In other words, term importance is related to frequency of
occurrence.
 If term A is mentioned more than term B, then the
document is more about A than about B (assuming A
and B to be content bearing terms).
 Such a measure assumes that the value, or weight, of a term
assigned to a document is simply proportional to the
term frequency (i.e., the frequency of occurrence of that
particular term in that particular document).
 The more frequently a term occurs in a document the
more likely it is to be of value in describing the content
of the document.
TF (term frequency): count the number of times a term occurs in a document.
 fij = frequency of term i in document j
The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
Accordingly, the weight of term j in document i, denoted by wij, might be determined by
   wij = FREQij
where FREQij is the frequency of term j in document i.
It is a simple count of the number of occurrences of a
term in a particular document (or query).
It is a measure of term density in a document.
Experiments have shown that, despite its weaknesses, this
simple method gives better results than Boolean (binary) weighting.
Problems with Term frequency (TF)
 Such a weighting system sometimes does not perform as
expected, especially in cases where the high frequency
words are equally distributed throughout the collection.
 Since it does not take into account the role of term j in
any document other than document i.
 This simple measure is not normalized to account for
variances in the length of documents (i.e. Long documents
have an unfair advantage.)
 A one-page document with 10 mentions of A is “more about
A” than a 100-page document with 20 mentions of A, yet raw
term frequency would score the longer document higher.
 Used alone, TF favors common words and long documents. (Why
does it favor common words and long documents?)
Solutions to the Problems with Term frequency (TF)
Two solutions
1. Divide each frequency count by the length of the
document (length Normalization).
In this case the normalized frequency tfij is used instead of FREQij.
2. Divide each frequency count by the maximum frequency
count of any term in the document.
The normalized tf is given by
   tfij = FREQij / maxl(FREQil)
where
 tfij is the normalized frequency of term j in document i
 maxl(FREQil) is the maximum frequency of any term l in document i
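A minimal sketch (not from the slides) of the second normalization, dividing by the maximum frequency; the raw counts are illustrative:

```python
# Max-frequency normalization of term frequencies for a single document.
# The raw counts are illustrative only.
raw_counts = {"computer": 3, "science": 2, "web": 1}   # FREQ for one document

max_freq = max(raw_counts.values())                    # max_l(FREQ_il)
tf = {term: freq / max_freq for term, freq in raw_counts.items()}

print(tf)   # {'computer': 1.0, 'science': 0.666..., 'web': 0.333...}
```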
Document Frequency
 It is defined to be the number of documents in the collection that contain a term.
   DF = document frequency
 Count the frequency considering the whole collection of documents.
 The less frequently a term appears in the whole collection, the more discriminating it is.
   dfi = document frequency of term i
       = number of documents containing term i
• Note that collection frequency and document frequency behave
differently: a term can occur many times overall yet appear in
only a few documents.
Inverse Document Frequency (IDF) weights
The concept was introduced by Karen Spärck Jones.
Assuming that term k occurs in at least one document (dk ≠ 0), a
possible measure of the inverse document frequency is defined by
   wk = log2(N / dk)
where
 N is the total number of documents in the collection,
 dk is the number of documents in which term k occurs,
 wk is the weight assigned to term k (i.e. the inverse document frequency of term k).
That is, the weight of a term in a document is the logarithm of
the number of documents in the collection divided by the
number of documents in the collection that contain the term
(with 2 as the base of the logarithm).
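As a small sketch of this formula (not part of the original slides), the IDF values can be computed directly; the collection size and document frequencies below reuse the worked example that appears a few slides later (N = 1000):

```python
import math

N = 1000                                                  # documents in the collection
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}    # document frequencies

idf = {term: math.log2(N / d_k) for term, d_k in df.items()}
for term, w in idf.items():
    print(f"{term:6s} idf = {w:.3f}")
# the    idf = 0.000   -> occurs in every document: no discriminating power
# some   idf = 3.322
# car    idf = 6.644
# merge  idf = 9.966   -> occurs in a single document: highest weight
```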
IDF measures the rarity of a term in the collection; it is a
measure of the general importance of the term.
It inverts the document frequency.
It diminishes the weight of terms that occur very
frequently in the collection and increases the weight of
terms that occur rarely.
Gives full weight to terms that occur in one
document only.
Gives lowest weight to terms that occur in all
documents.
Terms that appear in many different documents are
less indicative of overall topic.
 The more a term t occurs throughout all documents, the more
poorly that term t discriminates between documents.
If a term occurs in many of the documents in the
collection, then it does not serve well as a document
identifier and should be given low weight as a
potential index term
As the collection frequency of a term decreases, its weight increases.
 (Collection frequency: the total number of occurrences of a term
 across the entire document collection.)
Emphasis is on terms exhibiting the lowest document frequency.
Term importance is
 inversely proportional to the total number of documents to which
 the term is assigned;
 biased towards terms appearing in a smaller number of documents
 (items).
• E.g.: given a collection of 1,000 documents and the document
frequencies below, compute the IDF of each word.

Word N DF IDF
the 1000 1000 0
some 1000 100 3.322
car 1000 10 6.644
merge 1000 1 9.966

• IDF provides high values for rare words and low values for
common words.
• IDF is an indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus
(collection) frequency?
Problems with IDF weights
Identifies that a term that appears in many documents
is not very useful for distinguishing relevant
documents from non-relevant ones.
Because, this function does not take into account
the frequency of a term in a given document (i.e.,
FREQij)
That is, it is possible for a term to occur in only few
documents of a collection and at the same time a
small number of times in such documents but such
a term is not important for an author uses
important terms now and then.

Solution to the problem of IDF weights
Weights should combine two measurements.
Weights should be in direct proportion to the frequency of the
term in a document (= TF).
   tfi,j = FREQi,j / maxk{FREQk,j}
 This quantifies how well a term describes the document (or its content).
Weights should be in inverse proportion to the number of documents
in the collection in which the term appears (= IDF).
   wk = log2(N / dk)
 This quantifies the ability of the term to separate documents.
Altogether: wi,k = tfi,k · idfk
TF*IDF Weighting
Combines term frequency (TF) and inverse document
frequency (IDF).
The most used term-weighting scheme is tf*idf:
   wij = tfij × idfi = tfij × log2(N / dfi)
According to this function,
 the weight of term i in a given document j increases as the
frequency of the term in the document (tfij) increases, but
decreases as the document frequency dfi increases.
A term occurring frequently in the document but rarely in the
rest of the collection is given high weight.
 The tf*idf value for a term will always be greater than or equal to
zero.
Experimentally, tf*idf has been found to work well.
 It is often used in the vector space model together with cosine
similarity to determine the similarity between two documents.
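A minimal sketch (not from the slides) of this composite weight, using max-frequency-normalized tf; the sample numbers reuse term A from the worked example later in the slides:

```python
import math

def tf_idf(freq_in_doc: int, max_freq_in_doc: int, n_docs: int, doc_freq: int) -> float:
    """w = tf * idf, with tf normalized by the document's maximum term frequency
    and idf = log2(N / df)."""
    tf = freq_in_doc / max_freq_in_doc
    idf = math.log2(n_docs / doc_freq)
    return tf * idf

# Term A from the worked example: raw tf = 3, max tf = 3, N = 10,000, df = 50.
print(round(tf_idf(3, 3, 10_000, 50), 3))   # 7.644
```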
A high occurrence frequency in a particular document
indicates that the term carries a great deal of importance in
that document.
A low overall collection frequency (i.e., a small number of
documents in the collection to which the term is assigned)
indicates at the same time that the importance of the term in the
remainder of the collection is relatively small, so the term can
actually distinguish the documents to which it is assigned from
the remainder of the collection.
Thus, such a term can be considered as being of potentially
greater importance for retrieval purposes
This scheme assigns a weight to each term (vocabulary
word) in a given document.
TF*IDF weighting
When does TF*IDF register a high weight? When a
term t occurs many times within a small number of
documents.
Highest tf*idf for a term shows a term has a high term
frequency (in the given document) and a low
document frequency (in the whole collection of
documents);
The weights hence tend to filter out common terms.
Thus, lending high discriminating power to those
documents.
Lower TF*IDF is registered when the term occurs fewer
times in a document, or occurs in many documents.
Thus, offering a less pronounced relevance signal.
Lowest TF*IDF is registered when the term occurs in
virtually all documents.
Computing TF-IDF: An Example
Assume a collection contains 10,000 documents, and statistical
analysis shows that the document frequencies (DF) of three terms
are: A(50), B(1300), C(250). The term frequencies (TF) of these
terms in a given document are: A(3), B(2), C(1), with a maximum
term frequency of 3. Compute TF*IDF for each term.
A: tf = 3/3 = 1.0;   idf = log2(10000/50) = 7.644;   tf*idf = 7.644
B: tf = 2/3 = 0.667; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.333; idf = log2(10000/250) = 5.322;  tf*idf = 1.774
Query vector is typically treated as a document and also tf*idf
weighted.
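The numbers above can be reproduced with a few lines of Python (not part of the original slide):

```python
import math

N = 10_000
max_tf = 3
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}   # term: (raw tf, df)

for term, (raw_tf, df) in terms.items():
    tf = raw_tf / max_tf
    idf = math.log2(N / df)
    print(f"{term}: tf = {tf:.3f}, idf = {idf:.3f}, tf*idf = {tf * idf:.3f}")
# A: tf = 1.000, idf = 7.644, tf*idf = 7.644
# B: tf = 0.667, idf = 2.943, tf*idf = 1.962
# C: tf = 0.333, idf = 5.322, tf*idf = 1.774
```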
More Example
Consider a document containing 100 words in which the word cow
appears 3 times. Now assume we have 10 million documents and cow
appears in one thousand of these.
 The term frequency (TF) for cow: 3/100 = 0.03
 The inverse document frequency: log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.288
 The TF*IDF score is the product of these frequencies: 0.03 × 13.288 ≈ 0.399
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

Word      C   TW  TD  DF  TF  IDF  TF*IDF
airplane  5   46   3   1
blue      1   46   3   1
chair     7   46   3   3
computer  3   46   3   1
forest    2   46   3   1
justice   7   46   3   3
love      2   46   3   1
might     2   46   3   1
perl      5   46   3   2
rose      6   46   3   3
shoe      4   46   3   1
thesis    2   46   3   2
Concluding remarks
Suppose from a set of English documents, we wish to determine
which ones are the most relevant to the query "the brown cow."
A simple way to start out is by eliminating documents that do
not contain all three words "the," "brown," and "cow," but this
still leaves many documents.
To further distinguish them, we might count the number of
times each term occurs in each document and sum them all
together;
The number of times a term occurs in a document is called its TF.
However, because the term "the" is so common, this will tend to
incorrectly emphasize documents which happen to use the word
"the" more, without giving enough weight to the more meaningful
terms "brown" and "cow".
Also the term "the" is not a good keyword to distinguish relevant
and non-relevant documents and terms like "brown" and "cow" that
occur rarely are good keywords to distinguish relevant documents
from the non-relevant ones.
Concluding remarks
Hence IDF is incorporated which diminishes the
weight of terms that occur very frequently in the
collection and increases the weight of terms that
occur rarely.
This leads to the use of TF*IDF as a better weighting
technique.
On top of that we apply similarity measures to
calculate the distance between document i and
query j.
Similarity Measure
Similarity Measure
We now have vectors for all documents in the collection and a
vector for the query; how do we compute similarity?
A similarity measure is a function that computes the degree of
similarity or distance between a document vector and the query vector.
[Figure: document vectors D1, D2 and the query vector Q plotted in a
three-term space (t1, t2, t3), with the angle θ between Q and each
document vector.]
Using a similarity measure between the query and each document:
 It is possible to rank the retrieved documents in the order of
 presumed relevance.
 It is possible to enforce a certain threshold so that the size of
 the retrieved set can be controlled.
[Figure: document vectors d1–d5 plotted in a three-term space
(t1, t2, t3), with angles θ and φ between pairs of vectors.]

Postulate: Documents that are “close together” in the vector
space talk about the same things and are more similar to each
other than to other documents.
Similarity Measure
1. If d1 is near d2, then d2 is near d1.
2. If d1 near d2, and d2 near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
Sometimes it is a good idea to determine the maximum
possible similarity as the “distance” between a document
d and itself.
A similarity measure attempts to compute the distance
between the document vector wj and the query vector wq.
The assumption here is that documents whose vectors are
close to the query vector are more relevant to the query than
documents whose vectors are away from the query vector.
Similarity Measure: Techniques
• There are a number of similarity measures; the most common
similarity measures are
Euclidean distance , Inner or Dot product, Cosine
similarity, etc.
Euclidean distance
It is the most common similarity measure. Euclidean distance
examines the root of square differences between coordinates of a
pair of document and query terms.
Dot product
The dot product is also known as the scalar product or inner
product.
The dot product is computed as the sum of the products of the
corresponding term weights of the query and document vectors.
Cosine similarity (or normalized inner product)
It projects document and query vectors into the term space and
calculates the cosine of the angle between them.
Euclidean distance
Similarity (distance) between the vectors for document dj and
query q can be computed as:

   sim(dj, q) = |dj − q| = √( Σi=1..n (wij − wiq)² )

where wij is the weight of term i in document j and wiq is the
weight of term i in the query.
• Example: Determine the Euclidean distance between the document
vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A
value of 0 means the corresponding term is not found in the
document or query.

   √( (0−2)² + (3−7)² + (2−1)² + (1−0)² + (10−0)² ) = √122 ≈ 11.05
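A minimal sketch (not from the slides) that reproduces the computation above:

```python
import math

def euclidean_distance(doc, query):
    """Root of the sum of squared differences between corresponding term weights."""
    return math.sqrt(sum((w_d - w_q) ** 2 for w_d, w_q in zip(doc, query)))

# Document and query vectors from the example above.
print(euclidean_distance([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]))   # 11.045...
```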
Inner Product
Similarity between the vectors for document dj and query q can be
computed as the vector inner product:

   sim(dj, q) = dj · q = Σi=1..n (wij · wiq)

where wij is the weight of term i in document j and wiq is the
weight of term i in the query q.
For binary vectors, the inner product is the number of matched
query terms in the document (the size of the intersection).
For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
Properties of Inner Product
Favors long documents with a large number of unique
terms.
Again, the issue of normalization.
Measures how many terms matched but not how
many terms are not matched.
Inner Product -- Examples
Binary weights: size of vector = size of vocabulary = 7

        Retrieval  Database  Term  Computer  Text  Manage  Data
   D        1          1       1       0       1      1      0
   Q        1          0       1       0       0      1      1

   sim(D, Q) = 1·1 + 1·0 + 1·1 + 0·0 + 1·0 + 1·1 + 0·1 = 3

• Term weighted:

        Retrieval  Database  Architecture
   D1       2          3          5
   D2       3          7          1
   Q        1          0          2

   sim(D1, Q) = 2·1 + 3·0 + 5·2 = 12
   sim(D2, Q) = 3·1 + 7·0 + 1·2 = 5
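A minimal sketch (not from the slides) of the inner product, applied to the vectors of the weighted example above:

```python
def inner_product(doc, query):
    """Sum of the products of corresponding term weights (dot product)."""
    return sum(w_d * w_q for w_d, w_q in zip(doc, query))

# Weighted vectors over (Retrieval, Database, Architecture) from the example.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [1, 0, 2]
print(inner_product(D1, Q))   # 12
print(inner_product(D2, Q))   # 5

# With binary vectors the same function counts the matched query terms.
print(inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1]))   # 3
```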


Inner Product: Example 1
[Figure: documents d1–d7 grouped by which of the index terms k1, k2, k3 they contain.]

        k1  k2  k3    q · dj
   d1    1   0   1      2
   d2    1   0   0      1
   d3    0   1   1      2
   d4    1   0   0      1
   d5    1   1   1      3
   d6    1   1   0      2
   d7    0   1   0      1

   q     1   1   1
Inner Product: Exercise
[Figure: documents d1–d7 grouped by which of the index terms k1, k2, k3 they contain.]

        k1  k2  k3    q · dj
   d1    1   0   1      ?
   d2    1   0   0      ?
   d3    0   1   1      ?
   d4    1   0   0      ?
   d5    1   1   1      ?
   d6    1   1   0      ?
   d7    0   1   0      ?

   q     1   2   3
Cosine similarity
Measures the similarity between two vectors (e.g. document dj and
query q) as the cosine of the angle between them:

   sim(dj, q) = (dj · q) / (|dj| · |q|)
              = Σi=1..n (wi,j · wi,q) / ( √(Σi=1..n wi,j²) · √(Σi=1..n wi,q²) )

Or, between two documents:

   sim(dj, dk) = (dj · dk) / (|dj| · |dk|)
               = Σi=1..n (wi,j · wi,k) / ( √(Σi=1..n wi,j²) · √(Σi=1..n wi,k²) )

The denominator involves the lengths of the vectors:

   Length |dj| = √( Σi=1..n wi,j² )

So the cosine measure is also known as the normalized inner product.
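A minimal sketch (not from the slides) of the normalized inner product; the example vectors reuse Q and D1 from the next slide:

```python
import math

def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    len_a = math.sqrt(sum(x * x for x in a))
    len_b = math.sqrt(sum(y * y for y in b))
    return dot / (len_a * len_b)

# Q and D1 from the worked example on the next slide.
Q = [0.4, 0.8]
D1 = [0.2, 0.7]
print(round(cosine_similarity(Q, D1), 2))   # 0.98
```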
Example: Computing Cosine Similarity
• Let's say we have a query vector Q = (0.4, 0.8) and a document
D1 = (0.2, 0.7). Compute their similarity using the cosine measure.

   sim(Q, D1) = (0.4 × 0.2 + 0.8 × 0.7) / ( √(0.4² + 0.8²) × √(0.2² + 0.7²) )
              = 0.64 / √(0.80 × 0.53)
              = 0.64 / 0.651 ≈ 0.98
Example: Computing Cosine Similarity
• Let's say we have two documents in our corpus, D1 = (0.8, 0.3)
and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8),
determine which document is the most relevant one for the query.

   cos θ1 = sim(Q, D1) ≈ 0.74
   cos θ2 = sim(Q, D2) ≈ 0.98

   [Figure: Q, D1 and D2 plotted in the two-term space; the angle θ2
   between Q and D2 is smaller than the angle θ1 between Q and D1.]

   Since cos θ2 > cos θ1, D2 is more similar to the query than D1.
Example
Given three documents D1, D2 and D3 with the corresponding
TF*IDF weights below, which documents are most similar to each
other under the three similarity measures?

   Terms      D1     D2     D3
   affection  0.996  0.993  0.847
   jealous    0.087  0.120  0.466
   gossip     0.017  0.000  0.254
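One possible way to check the cosine part of this example (a sketch, not from the slides; the Euclidean distance and inner product can be computed analogously):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# TF*IDF weights over (affection, jealous, gossip) from the table above.
D1 = [0.996, 0.087, 0.017]
D2 = [0.993, 0.120, 0.000]
D3 = [0.847, 0.466, 0.254]

print(round(cosine(D1, D2), 3))   # 0.999 -> D1 and D2 are the most similar pair
print(round(cosine(D1, D3), 3))   # 0.889
print(round(cosine(D2, D3), 3))   # 0.897
```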

Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two
vectors: the inner product normalized by the vector lengths.

   cosine(dj, q) = (dj · q) / (|dj| · |q|)
                 = Σi=1..t (wij · wiq) / ( √(Σi=1..t wij²) · √(Σi=1..t wiq²) )

   InnerProduct(dj, q) = dj · q = Σi=1..t (wij · wiq)

   D1 = 2T1 + 3T2 + 5T3      cosine(D1, Q) = 10 / √((4+9+25)(0+0+4)) ≈ 0.81
   D2 = 3T1 + 7T2 + 1T3      cosine(D2, Q) = 2 / √((9+49+1)(0+0+4)) ≈ 0.13
   Q  = 0T1 + 0T2 + 2T3

D1 is about 6 times better than D2 using cosine similarity, but only
5 times better using the inner product, in terms of closeness to query Q.
Exercises
A database collection consists of 1 million documents,
of which 200,000 contain the term holiday while
250,000 contain the term season. A document repeats
holiday 7 times and season 5 times. It is known that
holiday is repeated more than any other term in the
document. Calculate the weight of both terms in this
document using three different term weight methods.
Try with
(i) normalized and un-normalized TF;
(ii) TF*IDF based on normalized and un-normalized
TF