Terms
• Terms are usually stems. Terms can also be phrases, such
  as “Computer Science”, “World Wide Web”, etc.
• Documents and queries are represented as vectors or
  “bags of words” (BOW).
  – Each vector holds a place for every term in the collection.
  – Position 1 corresponds to term 1, position 2 to term 2, ...,
    position n to term n.

      Di = (wdi1, wdi2, ..., wdin)
      Q  = (wq1, wq2, ..., wqn)

    where w = 0 if a term is absent.
• Documents are represented by binary weights or non-binary
  weighted vectors of terms.
Document Collection
• A collection of n documents can be represented in the vector
  space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term
  in the document; zero means the term has no significance in
  the document or it simply doesn’t exist in the document.

        T1    T2    ...   Tt
  D1    w11   w21   ...   wt1
  D2    w12   w22   ...   wt2
  :     :     :          :
  Dn    w1n   w2n   ...   wtn
Binary Weights
• Only the presence (1) or docs t1 t2 t3
absence (0) of a term is D1 1 0 1
D2 1 0 0
included in the vector D3 0 1 1
• Binary formula gives every D4 1 0 0
D5 1 1 1
word that appears in a D6 1 1 0
document equal relevance. D7 0 1 0
D8 0 1 0
• It can be useful when D9 0 0 1
frequency is not important. D10 0 1 1
D11 1 0 1
• Binary Weights Formula:
1 if freq ij 0
freq ij
0 if freq ij 0
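The binary weighting above can be sketched in a few lines of Python. The two toy documents here are assumptions for illustration, not taken from the slides:

```python
# Build a binary term-document matrix: w_ij = 1 if freq_ij > 0, else 0.
docs = {
    "D1": "information retrieval system",
    "D2": "database system design",
}

# Vocabulary: every term in the collection, in a fixed order.
terms = sorted({t for text in docs.values() for t in text.split()})

def binary_vector(text):
    present = set(text.split())
    return [1 if t in present else 0 for t in terms]

vectors = {name: binary_vector(text) for name, text in docs.items()}
```

Each vector holds one position per collection term, exactly as described above, so every document vector has the same length regardless of document size.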
Why use term weighting?
• Binary weights are too limiting.
  – Terms are either present or absent.
  – They do not allow ordering documents according to their
    level of relevance for a given query.
Problems with term frequency
• Need a mechanism for attenuating the effect of terms
that occur too often in the collection to be meaningful for
relevance/meaning determination
• Scale down the term weight of terms with high collection
frequency
– Reduce the tf weight of a term by a factor that grows
with the collection frequency
• More commonly used for this purpose is document frequency
  – how many documents in the collection contain the term
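The distinction between collection frequency (total occurrences) and document frequency (number of documents containing the term) can be sketched as follows, using an assumed toy corpus:

```python
# Collection frequency counts every occurrence of a term;
# document frequency counts documents containing it at least once.
docs = [
    "the cat sat on the mat",
    "the cat and the dog",
    "a dog barked",
]

def collection_frequency(term):
    return sum(doc.split().count(term) for doc in docs)

def document_frequency(term):
    return sum(term in doc.split() for doc in docs)
```

For example, "the" occurs four times overall but in only two documents, so its collection frequency is 4 while its document frequency is 2.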
More Examples
• Consider a document containing 100 words in which the
  word cow appears 3 times. Now assume we have 10 million
  documents and cow appears in one thousand of these.
  – The term frequency (TF) for cow:
    3/100 = 0.03
• For the table below, let
  – TW = total number of words in a document;
  – TD = total number of documents in a corpus; and
  – DF = total number of documents containing a given word.
  Compute the TF, IDF and TF*IDF score for each term.

  Term       Freq  TW  TD  DF
  blue       1     46  3   1
  chair      7     46  3   3
  computer   3     46  3   1
  forest     2     46  3   1
  justice    7     46  3   3
  love       2     46  3   1
  might      2     46  3   1
  perl       5     46  3   2
  rose       6     46  3   3
  shoe       4     46  3   1
  thesis     2     46  3   2
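The cow example can be checked directly. Using a base-10 logarithm for the IDF is an assumption here, since the slides do not fix the log base:

```python
import math

tf = 3 / 100                          # term frequency: 0.03
idf = math.log10(10_000_000 / 1_000)  # log10(10^4) = 4
tfidf = tf * idf                      # 0.03 * 4 = 0.12
```

So cow gets a TF*IDF weight of 0.12 in this document: a modest TF boosted by the term's rarity across the collection.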
Exercise 2
• A database collection consists of 1 million documents, of
  which 200,000 contain the term holiday while 250,000
  contain the term season. A document repeats holiday 7
  times and season 5 times. It is known that holiday is
  repeated more than any other term in the document.
  Calculate the weight of both terms in this document using
  three different term weighting methods:
  (i) normalized and unnormalized TF;
  (ii) TF*IDF based on normalized TF; and
  (iii) TF*IDF based on unnormalized TF.
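One way to set up the computation for Exercise 2 is sketched below (setup only, not the worked answer). Base-10 logarithms and max-TF normalization are assumptions, since the exercise does not fix either choice:

```python
import math

N = 1_000_000                                 # documents in the collection
df = {"holiday": 200_000, "season": 250_000}  # document frequencies
tf = {"holiday": 7, "season": 5}              # raw term frequencies
max_tf = max(tf.values())                     # holiday is the most frequent term

weights = {}
for term in tf:
    ntf = tf[term] / max_tf            # normalized TF (divide by max TF)
    idf = math.log10(N / df[term])
    weights[term] = {
        "tf": tf[term],
        "ntf": ntf,
        "tfidf": tf[term] * idf,       # unnormalized-TF variant
        "ntfidf": ntf * idf,           # normalized-TF variant
    }
```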
Similarity Measure
• We now have vectors for all documents in the collection and
  a vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree
  of similarity or distance between a document vector and a
  query vector.
• Using a similarity measure between the query and each
  document:
  – It is possible to rank the retrieved documents in the order
    of presumed relevance.
  – It is possible to enforce a certain threshold so that the size
    of the retrieved set can be controlled.

  [Figure: document vectors D1 and D2 and query vector Q in a
  term space with axes t1, t2 and t3]
Intuition

[Figure: document vectors d1–d5 in a term space with axes t1,
t2 and t3; θ and φ are angles between vectors]

Postulate: Documents that are “close together” in the vector
space talk about the same things and are more similar to each
other than to other documents.
• Therefore, retrieve documents based on how close the
  document is to the query (i.e., similarity ~ “closeness”).
Similarity Measure
Desiderata for proximity:
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
   – Sometimes it is a good idea to determine the maximum
     possible similarity as the “distance” between a document d
     and itself.
• A similarity measure attempts to compute the distance
  between a document vector wj and a query vector wq.
  – The assumption here is that documents whose vectors are
    close to the query vector are more relevant to the query than
    documents whose vectors are farther away from it.
Similarity Measure: Techniques
• Euclidean distance
  – The most common distance measure. Euclidean distance
    examines the root of the squared differences between the
    coordinates of a pair of document and query vectors.
• Dot product
  – The dot product is also known as the scalar product or
    inner product.
  – It is computed as the sum of the products of the
    corresponding document and query term weights.
• Cosine similarity (or normalized inner product)
  – It projects document and query vectors into a term space
    and calculates the cosine of the angle between them.
Euclidean distance
• Similarity between the vectors for document dj and query q
  can be computed as:

      sim(dj, q) = |dj − q| = √( Σi=1..n (wij − wiq)² )

  Note that this is a distance: the smaller the value, the more
  similar the document is to the query.
• Binary weighted example:

         Retrieval  Database  Term  Computer  Text  Manage  Data
  D      1          1         1     0         1     1       0
  Q      1          0         1     0         0     1       1

  sim(D, Q) = √3 (the vectors differ in three positions)
• Term weighted example:

         Retrieval  Database  Architecture
  D1     2          3         5
  D2     3          7         1
  Q      1          0         2
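The Euclidean distances for the term-weighted example above can be computed as:

```python
import math

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [1, 0, 2]

def euclidean(d, q):
    # root of the squared differences between coordinates
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

# D1: sqrt(1 + 9 + 9) = sqrt(19); D2: sqrt(4 + 49 + 1) = sqrt(54),
# so D1 is closer to Q (smaller distance = more similar).
```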
Inner Product: Example 1

  [Figure: Venn diagram of index terms k1, k2 and k3, with
  documents d1–d7 placed in the corresponding regions]

         k1  k2  k3  q·dj
  d1     1   0   1   2
  d2     1   0   0   1
  d3     0   1   1   2
  d4     1   0   0   1
  d5     1   1   1   3
  d6     1   1   0   2
  d7     0   1   0   1
  q      1   1   1
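The inner product scores in Example 1 can be reproduced as:

```python
# Binary document vectors over the terms (k1, k2, k3), from the table above.
docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
    "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0],
}
q = [1, 1, 1]

def inner_product(d, q):
    # sum of products of corresponding term weights
    return sum(wd * wq for wd, wq in zip(d, q))

scores = {name: inner_product(vec, q) for name, vec in docs.items()}
# d5 scores highest (3); d2, d4 and d7 score lowest (1)
```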
Cosine similarity
• Measures the similarity between dj and q, captured by the
  cosine of the angle between them:

      sim(dj, q) = (dj · q) / (|dj| |q|)
                 = Σi=1..n (wij · wiq) / ( √(Σi=1..n wij²) · √(Σi=1..n wiq²) )

• Or, between two documents dj and dk:

      sim(dj, dk) = (dj · dk) / (|dj| |dk|)
                  = Σi=1..n (wij · wik) / ( √(Σi=1..n wij²) · √(Σi=1..n wik²) )
Example: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3)
  and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8),
  determine which document is the most relevant one for the
  query.

      cos θ1 ≈ 0.73
      cos θ2 ≈ 0.98

  Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.

  [Figure: D1, D2 and Q plotted in two dimensions; θ1 is the
  angle between D1 and Q, and θ2 the angle between D2 and Q]

i.   Find the relevant documents in ranked order for the query
     using cosine similarity.
ii.  Do the same using the dot product.
iii. Compare the results.
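The cosine computation for this example can be sketched as:

```python
import math

def cosine(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

D1, D2, Q = (0.8, 0.3), (0.2, 0.7), (0.4, 0.8)
# cosine(D1, Q) ≈ 0.73 and cosine(D2, Q) ≈ 0.98, so D2 ranks first.
```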
Example
• Given three documents D1, D2 and D3 with the
  corresponding TF-IDF weights below, which documents are
  most similar under the three measures?

  Terms      D1     D2     D3
  affection  0.996  0.993  0.847
  jealous    0.087  0.120  0.466
  gossip     0.017  0.000  0.254
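As one of the three measures, cosine similarity on the TF-IDF vectors above can be sketched as follows (the pairwise scores are computed here, not taken from the slides):

```python
import math

# TF-IDF vectors over the terms (affection, jealous, gossip), from the table.
vecs = {
    "D1": [0.996, 0.087, 0.017],
    "D2": [0.993, 0.120, 0.000],
    "D3": [0.847, 0.466, 0.254],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

pairs = {(i, j): cosine(vecs[i], vecs[j])
         for i in vecs for j in vecs if i < j}
# D1 and D2 come out as the most similar pair
```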
Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between
  two vectors.
• It is the inner product normalized by the vector lengths:

      CosSim(dj, q) = (dj · q) / (|dj| |q|)
                    = Σi=1..t (wij · wiq) / ( √(Σi=1..t wij²) · √(Σi=1..t wiq²) )

      InnerProduct(dj, q) = dj · q = Σi=1..t wij · wiq
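The normalization matters when documents differ in length. A small sketch with assumed toy vectors:

```python
import math

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

q = [1, 1, 1]
short = [1, 1, 0]
longer = [10, 10, 0]   # same direction as `short`, ten times the magnitude

# The inner product grows with vector length, while cosine does not:
# inner(short, q) = 2 but inner(longer, q) = 20, yet the cosines are equal.
```

This is why the inner product tends to favor long documents, while cosine similarity scores only the direction of the vectors.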