term-document matrix
• Documents and queries are represented as vectors or “bags
of words” (BOW) in a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term
in the document.
– The weight of a term (wij) may be binary or non-binary.
A weight of wij = 0 means the term does not occur in the document.
      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn

$$ w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{otherwise } (freq_{ij} = 0) \end{cases} $$
Binary Weights
• Only the presence (1) or absence (0) of a term is
included in the vector.
• Binary formula gives every word that appears in a
document equal relevance.
• It can be useful when frequency is not important.

docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1
D11    1    0    1
• Binary Weights Formula:
$$ w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{if } freq_{ij} = 0 \end{cases} $$
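The binary weighting rule can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function name `binary_weights`, the whitespace tokenization, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def binary_weights(docs, vocabulary):
    """Return one binary vector per document: 1 if the term occurs, else 0."""
    vectors = []
    for text in docs:
        # Counter returns 0 for terms that never occur in the document.
        freq = Counter(text.lower().split())
        vectors.append([1 if freq[term] > 0 else 0 for term in vocabulary])
    return vectors


# Hypothetical toy corpus and vocabulary, not taken from the slides.
docs = ["apple banana apple", "banana", "cherry apple"]
vocabulary = ["apple", "banana", "cherry"]

weights = binary_weights(docs, vocabulary)
print(weights)  # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```

Note that "apple" occurs twice in the first document but still receives weight 1, which is exactly the information a binary weight discards.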
Why use term weighting?
• Binary weights are too limiting.
– Terms are either present or absent.
– They do not allow documents to be ranked by their level of
relevance to a given query.
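A small sketch of why this matters, under stated assumptions: the two documents, their term frequencies, and the helper `match_score` (a plain sum of products of weights) are hypothetical, and raw term frequency stands in here for just one possible non-binary weight.

```python
def match_score(doc_vec, query_vec):
    """Sum of products of matching term weights."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))


# Hypothetical term frequencies over a two-term vocabulary.
freq_a = [5, 0]  # document A mentions the first term five times
freq_b = [1, 0]  # document B mentions it once
query = [1, 0]   # the query asks for the first term only

# Binary weights: 1 if the term occurs, 0 otherwise.
binary_a = [1 if f > 0 else 0 for f in freq_a]
binary_b = [1 if f > 0 else 0 for f in freq_b]

binary_scores = (match_score(binary_a, query), match_score(binary_b, query))
freq_scores = (match_score(freq_a, query), match_score(freq_b, query))

print(binary_scores)  # (1, 1): a tie, so the documents cannot be ordered
print(freq_scores)    # (5, 1): document A ranks above document B
```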
i 1
where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two
n-dimensional data objects; n is the number of attributes in
each data object's vector; and q = 1, 2, 3, …
Inner Product
• The similarity between the vectors for document dj and query q
can be computed as the vector inner product:
$$ sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{n} w_{ij} \, w_{iq} $$
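The inner product can be sketched as follows. The function name `inner_product` and the example weight vectors are illustrative assumptions; each list position holds the weight of term i in the document or in the query.

```python
def inner_product(doc_weights, query_weights):
    """Sum over all terms i of w_ij * w_iq."""
    if len(doc_weights) != len(query_weights):
        raise ValueError("document and query vectors must have the same length")
    return sum(w_ij * w_iq for w_ij, w_iq in zip(doc_weights, query_weights))


# Hypothetical weights over a three-term vocabulary.
d_j = [2, 0, 3]
q = [1, 1, 2]

score = inner_product(d_j, q)
print(score)  # 2*1 + 0*1 + 3*2 = 8
```

The inner product is unbounded and grows with vector length, so longer documents tend to score higher simply because they contain more terms.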
$$ |d_j| \, |q| = \sqrt{\sum_{i=1}^{n} w_{ij}^{2}} \; \sqrt{\sum_{i=1}^{n} w_{iq}^{2}} $$
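Example 2: Computing Cosine Similarity
• Suppose we have two documents in our corpus, D1 =
(0.8, 0.3) and D2 = (0.2, 0.7). Given the query vector Q =
(0.4, 0.8), determine which document is more relevant
to the query.
$$ \cos(D_2, Q) = \frac{0.64}{\sqrt{0.42}} \approx 0.98 $$
[Figure: D1, D2, and Q plotted as vectors on axes running from 0 to 1.0, with the angle between Q and each document marked.]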
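Example 2 can be checked with a short sketch. It assumes the standard definition of cosine similarity, the inner product divided by the product of the two vector lengths; the function name `cosine_similarity` and the zero-vector handling are choices made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
Q = (0.4, 0.8)

sim_d1 = cosine_similarity(D1, Q)
sim_d2 = cosine_similarity(D2, Q)
print(round(sim_d1, 2), round(sim_d2, 2))  # 0.73 0.98
```

Since cos(D2, Q) is larger than cos(D1, Q), D2 makes the smaller angle with the query and is the more relevant document.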
Terms       D1      D2      D3
affection   0.996   0.993   0.847
jealous     0.087   0.120   0.466
gossip      0.017   0.000   0.254