Terms
Terms are usually stems. Terms can also be phrases,
such as “Computer Science”, “World Wide Web”, etc.
Documents and queries are represented as vectors or
“bags of words” (BOW).
Each vector holds a place for every term in the collection.
Position 1 corresponds to term 1, position 2 to term 2,
..., and position n to term n.
$D_i = (w_{d_i 1}, w_{d_i 2}, \ldots, w_{d_i n})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qn})$, where $w = 0$ if a term is absent.
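As an illustration, here is a minimal sketch of representing a document and a query as term-weight vectors; the vocabulary, texts, and simple raw-count weights are made-up examples, not from the slides.

```python
# Minimal sketch: documents and queries as "bag of words" vectors.
# The vocabulary and texts below are invented for illustration.
vocabulary = ["computer", "science", "web", "data"]

def to_vector(text, vocab):
    """Return a vector with one position per vocabulary term.

    Position i holds the weight of term i; 0 means the term is absent.
    Here the weight is simply the raw occurrence count.
    """
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

doc = "computer science and web science"
query = "web data"

print(to_vector(doc, vocabulary))    # [1, 2, 1, 0]
print(to_vector(query, vocabulary))  # [0, 0, 1, 1]
```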
Document Collection
A collection of n documents can be represented in
the vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of
a term in the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document.
      T1    T2   ...   Tt
D1   w11   w21   ...   wt1
D2   w12   w22   ...   wt2
...
Dn   w1n   w2n   ...   wtn
Term frequency (TF) weights
The frequency of occurrence of a term is a useful indication
of its relative importance in describing a document.
In other words, term importance is related to frequency of
occurrence.
If term A is mentioned more often than term B, then the
document is more about A than about B (assuming A
and B to be content-bearing terms).
Such a measure assumes that the value, or weight, of a term
assigned to a document is simply proportional to the
term frequency (i.e., the frequency of occurrence of that
particular term in that particular document).
The more frequently a term occurs in a document, the
more likely it is to be of value in describing the content
of the document.
TF (term frequency): count the number of times a term
occurs in a document; fij = frequency of term i in
document j.
The more times a term t occurs in document d, the more
likely it is that t is relevant to the document, i.e., more
indicative of the topic.

docs   t1   t2   t3
D1      2    0    3
D2      1    0    0
D3      0    4    7
D4      3    0    0
D5      1    6    3
D6      3    5    0
D7      0    8    0
D8      0   10    0
D9      0    0    1
D10     0    3    5
D11     4    0    1
Accordingly, the weight of term j in document i,
denoted by $w_{ij}$, might be determined by
$w_{ij} = \text{FREQ}_{ij}$
where $\text{FREQ}_{ij}$ is the frequency of term j in document i.
It is a simple count of the number of occurrences of a
term in a particular document (or query), and serves as a
measure of term density in a document.
Despite all its weaknesses, experiments have shown that
this method gives better results than Boolean weighting.
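As a minimal sketch, assuming naive lowercase/whitespace tokenization (the slides do not specify a tokenizer), raw term frequencies can be counted like this:

```python
from collections import Counter

def term_frequencies(text):
    """Count raw occurrences of each term in a document.

    Tokenization here is deliberately naive (lowercase + whitespace split);
    a real system would also stem terms and strip punctuation.
    """
    return Counter(text.lower().split())

doc = "the cat sat on the mat with the cat"
tf = term_frequencies(doc)
print(tf["the"])  # 3
print(tf["cat"])  # 2
```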
Problems with Term frequency (TF)
Such a weighting system sometimes does not perform as
expected, especially in cases where high-frequency words
are equally distributed throughout the collection, since it
does not take into account the role of term j in any
document other than document i.
This simple measure is not normalized to account for
variance in the length of documents (i.e., long documents
have an unfair advantage):
a one-page document with 10 mentions of A is “more about
A” than a 100-page document with 20 mentions of A.
Used alone, TF favors common words and long documents.
(Why does it favor common words and long documents?)
Solutions to the Problems with Term frequency (TF)
Two solutions:
1. Divide each frequency count by the length of the
document (length normalization). In this case the
normalized frequency $tf_{ij}$ is used instead of $\text{FREQ}_{ij}$.
2. Divide each frequency count by the maximum frequency
count of any term in the document. The normalized tf is
given by
$tf_{ij} = \dfrac{\text{FREQ}_{ij}}{\max_l(\text{FREQ}_{il})}$
where $tf_{ij}$ is the normalized frequency of term j in
document i, and $\max_l(\text{FREQ}_{il})$ is the maximum
frequency of any term l in document i.
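A small sketch of the second (maximum-frequency) normalization; the raw counts here are hypothetical:

```python
def normalized_tf(freqs):
    """Divide each raw count by the largest count in the document."""
    max_freq = max(freqs.values())
    return {term: count / max_freq for term, count in freqs.items()}

# Hypothetical raw counts for one document.
freqs = {"cat": 3, "mat": 1, "sat": 2}
print(normalized_tf(freqs))  # {'cat': 1.0, 'mat': 0.333..., 'sat': 0.666...}
```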
Document Frequency
It is defined to be the number of documents in the
collection that contain a term
DF = document frequency
• E.g.: given a collection of 1000 documents and the document
frequency (DF) of each word, compute the IDF of each word,
using IDF = log2(N/DF):

Word    N     DF    IDF
the     1000  1000  0
some    1000  100   3.322
car     1000  10    6.644
merge   1000  1     9.966
• IDF provides high values for rare words and low values for
common words.
• IDF is an indication of a term’s discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus
frequency?
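A minimal sketch reproducing the table above (the word list and document frequencies come from the example; everything else is plain Python):

```python
import math

N = 1000  # total documents in the collection
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

for word, d in df.items():
    idf = math.log2(N / d)  # rare words get high IDF, common words low IDF
    print(f"{word:6s} DF={d:4d} IDF={idf:.3f}")
# the    DF=1000 IDF=0.000
# some   DF= 100 IDF=3.322
# car    DF=  10 IDF=6.644
# merge  DF=   1 IDF=9.966
```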
Problems with IDF weights
IDF captures the fact that a term appearing in many
documents is not very useful for distinguishing relevant
documents from non-relevant ones.
However, this function does not take into account the
frequency of a term in a given document (i.e., FREQij).
That is, a term may occur in only a few documents of a
collection and, at the same time, only a small number of
times within those documents; IDF alone would still rate
such a term highly, even though an author tends to repeat
the terms that are truly important.
Solution to the problem of IDF weights
Weights should combine two measurements.
Weights should be in direct proportion to the
frequency of the term in a document (TF):
$tf_{i,k} = \dfrac{\text{FREQ}_{i,k}}{\max_l\{\text{FREQ}_{i,l}\}}$
This quantifies how well the term describes the
document (or its content).
Weights should be in inverse proportion to the
number of documents in the collection in which the
term appears (IDF):
$idf_k = \log_2\!\left(\dfrac{N}{d_k}\right)$
where N is the number of documents in the collection and
$d_k$ is the number of documents containing term k.
This quantifies the ability of the term to separate documents.
Altogether: $w_{i,k} = tf_{i,k} \cdot idf_k$
TF*IDF Weighting
Combines term frequency (TF) and inverse document
frequency (IDF).
The most widely used term-weighting scheme is tf*idf:
$w_{ij} = tf_{ij} \times idf_j = tf_{ij} \times \log_2(N / df_j)$
According to this function, the weight of term j in a given
document i increases as the frequency of the term in the
document ($\text{FREQ}_{ij}$) increases, but decreases as the
document frequency $df_j$ increases.
A term occurring frequently in the document but rarely in the
rest of the collection is given high weight.
The tf*idf value for a term will always be greater than or equal to
zero.
Experimentally, tf*idf has been found to work well.
It is often used in the vector space model together with cosine
similarity to determine the similarity between two documents.
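Putting the pieces together, a minimal sketch of the tf*idf computation over a toy corpus (the documents are invented; tf uses the maximum-frequency normalization and idf the log2(N/df) form given above):

```python
import math
from collections import Counter

docs = [
    "the brown cow",
    "the brown fox jumps over the brown dog",
    "the cow eats grass",
]

# Raw term frequencies per document.
tfs = [Counter(d.lower().split()) for d in docs]

# Document frequency: number of documents containing each term.
df = Counter(term for tf in tfs for term in tf)
N = len(docs)

def tf_idf(term, doc_index):
    """tf*idf weight of a term in one document."""
    tf = tfs[doc_index]
    norm_tf = tf[term] / max(tf.values())  # max-frequency normalization
    idf = math.log2(N / df[term])          # rare terms get high idf
    return norm_tf * idf

print(tf_idf("cow", 0))  # moderately rare term: positive weight
print(tf_idf("the", 0))  # occurs in every document: weight 0
```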
A high occurrence frequency in a particular document
indicates that the term carries a great deal of importance in
that document.
A low overall collection frequency (i.e., the term is assigned
to few documents in the collection) indicates at the same
time that the importance of the term in the remainder of the
collection is relatively small, so the term can actually
distinguish the documents to which it is assigned from the
remainder of the collection.
Thus, such a term can be considered of potentially greater
importance for retrieval purposes.
This scheme assigns a weight to each term (vocabulary
word) in a given document.
TF*IDF weighting
When does TF*IDF register a high weight? When a
term t occurs many times within a small number of
documents.
The highest tf*idf is reached when a term has a high term
frequency (in the given document) and a low document
frequency (in the whole collection of documents);
the weights hence tend to filter out common terms,
lending high discriminating power to such terms.
A lower TF*IDF is registered when the term occurs fewer
times in a document, or occurs in many documents,
offering a less pronounced relevance signal.
The lowest TF*IDF is registered when the term occurs in
virtually all documents.
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents, and
statistical analysis shows that the document frequencies
(DF) of three terms are A(50), B(1300), C(250), while the
term frequencies (TF) of these terms are A(3), B(2), C(1),
with a maximum term frequency of 3. Compute TF*IDF
for each term.
A: tf = 3/3 = 1.0;   idf = log2(10000/50) = 7.644;   tf*idf = 7.644
B: tf = 2/3 = 0.667; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.333; idf = log2(10000/250) = 5.322;  tf*idf = 1.774
The query vector is typically treated as a document and is
also tf*idf weighted.
More Example
Consider a document containing 100 words, wherein the
word cow appears 3 times. Now assume we have 10 million
documents and cow appears in one thousand of these.
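The slide stops at the setup; here is a worked completion, assuming length normalization for tf and the base-2 logarithm used elsewhere in these notes:
$tf = 3/100 = 0.03$
$idf = \log_2(10{,}000{,}000 / 1{,}000) = \log_2(10{,}000) \approx 13.29$
$tf \times idf \approx 0.03 \times 13.29 \approx 0.40$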
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term
(a sketch of the computation follows the table).

Word      C   TW   TD   DF   TF   IDF   TF*IDF
airplane  5   46   3    1
blue      1   46   3    1
chair     7   46   3    3
computer  3   46   3    1
forest    2   46   3    1
justice   7   46   3    3
love      2   46   3    1
might     2   46   3    1
perl      5   46   3    2
rose      6   46   3    3
shoe      4   46   3    1
thesis    2   46   3    2
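One possible solution sketch, assuming TF = C/TW (length normalization) and IDF = log2(TD/DF), in line with the formulas above:

```python
import math

TW, TD = 46, 3  # words per document, documents in the corpus
words = {  # word: (C, DF), copied from the exercise table
    "airplane": (5, 1), "blue": (1, 1), "chair": (7, 3),
    "computer": (3, 1), "forest": (2, 1), "justice": (7, 3),
    "love": (2, 1), "might": (2, 1), "perl": (5, 2),
    "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2),
}

for word, (c, df) in words.items():
    tf = c / TW
    idf = math.log2(TD / df)
    print(f"{word:9s} TF={tf:.3f} IDF={idf:.3f} TF*IDF={tf * idf:.3f}")
# e.g. airplane: TF=0.109, IDF=log2(3)=1.585, TF*IDF=0.172
```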
Concluding remarks
Suppose that, from a set of English documents, we wish to
determine which ones are the most relevant to the query "the brown cow."
A simple way to start out is by eliminating documents that do
not contain all three words "the," "brown," and "cow," but this
still leaves many documents.
To further distinguish them, we might count the number of
times each term occurs in each document and sum them all
together;
The number of times a term occurs in a document is called its TF.
However, because the term "the" is so common, this will tend to
incorrectly emphasize documents which happen to use the word
"the" more, without giving enough weight to the more meaningful
terms "brown" and "cow".
Also, the term "the" is not a good keyword to distinguish relevant
from non-relevant documents, while terms like "brown" and "cow",
which occur rarely, are good keywords to distinguish relevant
documents from non-relevant ones.
Hence, IDF is incorporated, which diminishes the
weight of terms that occur very frequently in the
collection and increases the weight of terms that
occur rarely.
This leads to the use of TF*IDF as a better weighting
technique.
On top of that, we apply similarity measures to
calculate the distance between document i and
query j.
Similarity Measure
We now have vectors for all documents in the collection
and a vector for the query; how do we compute similarity?
[Figure: document vectors d1, d3, d4, d5 plotted in a three-term
space (t1, t2, t3), with angles θ and φ between vectors]
The inner (dot) product:
$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \cdot w_{iq}$
Properties of Inner Product
Favors long documents with a large number of unique
terms.
Again, the issue of normalization.
Measures how many terms matched but not how
many terms are not matched.
Inner Product -- Examples
Binary weights; size of vector = size of vocabulary = 7.

    Retrieval  Database  Term  Computer  Text  Manage  Data
D      1          1        1      0        1      1      0
Q      1          0        1      0        0      1      1

sim(D, Q) = 1·1 + 1·0 + 1·1 + 0·0 + 1·0 + 1·1 + 0·1 = 3
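A minimal sketch of this binary inner product (vectors copied from the table above):

```python
# Binary weight vectors over the 7-term vocabulary:
# (Retrieval, Database, Term, Computer, Text, Manage, Data)
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]

def inner_product(u, v):
    """Dot product: under binary weights, counts matching terms."""
    return sum(ui * vi for ui, vi in zip(u, v))

print(inner_product(D, Q))  # 3
```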
Inner Product: Exercise
[Figure: documents d1–d7 positioned in a space of keywords k1, k2, k3]

     k1   k2   k3   sim(dj, q)
d1    1    0    1       ?
d2    1    0    0       ?
d3    0    1    1       ?
d4    1    0    0       ?
d5    1    1    1       ?
d6    1    1    0       ?
d7    0    1    0       ?
q     1    2    3

Compute the inner product of each document with the query
(see the sketch below).
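A sketch that fills in the "?" column (the weights are those from the exercise table):

```python
docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1],
    "d4": [1, 0, 0], "d5": [1, 1, 1], "d6": [1, 1, 0],
    "d7": [0, 1, 0],
}
q = [1, 2, 3]

for name, d in docs.items():
    sim = sum(wd * wq for wd, wq in zip(d, q))  # inner product with q
    print(name, sim)
# d1=4, d2=1, d3=5, d4=1, d5=6, d6=3, d7=2 -- d5 matches best.
```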
Cosine similarity
Measures the similarity between two vectors (e.g., a document
dj and a query q) as the cosine of the angle θ between them.
$sim(d_j, q) = \dfrac{d_j \cdot q}{|d_j|\,|q|} = \dfrac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\, \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$
Or, between two documents:
$sim(d_j, d_k) = \dfrac{d_j \cdot d_k}{|d_j|\,|d_k|} = \dfrac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\, \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$
The denominator involves the lengths of the vectors:
$|d_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}$
[Figure: query Q and document vectors plotted in term space;
cos θ ≈ 0.74 between Q and D2]

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
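A minimal sketch computing cosine similarity for the weight columns in this table:

```python
import math

weights = {  # term weights per document, from the table above
    "D1": [0.996, 0.087, 0.017],
    "D2": [0.993, 0.120, 0.000],
    "D3": [0.847, 0.466, 0.254],
}

def cosine(u, v):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(round(cosine(weights["D1"], weights["D2"]), 3))  # ~0.999
print(round(cosine(weights["D1"], weights["D3"]), 3))  # ~0.889
```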
Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle
between two vectors; it is the inner product normalized
by the vector lengths.
$\text{Cosine}(d_j, q) = \dfrac{d_j \cdot q}{|d_j|\,|q|} = \dfrac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$
$\text{InnerProduct}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij}\, w_{iq}$