CHAPTER 1
INTRODUCTION TO THE VECTOR SPACE MODEL
The Vector Space Model (VSM) is a ranking model that ranks individual documents
against a query. Originally introduced by Gerard Salton and his associates [Salton
and Lesk 68], [Salton et al. 81], [Salton and Buckley 88], the VSM has played a
central role in Information Retrieval research. Readers may refer to [Frakes
and Baeza-Yates 92] for an overview of the VSM.
Consider all unique terms included in the collection of documents, a set also known
as the dictionary. This set of terms defines an n-dimensional space where each
document, including the query, is represented as a vector. A measure of similarity
between document d and query q is then the cosine of the angle formed between the
two vectors:
$$ \mathrm{sim}(d,q) \;=\; \frac{\vec{d}\cdot\vec{q}}{\lVert\vec{d}\rVert\,\lVert\vec{q}\rVert}. \qquad (1.1) $$
Considering the array representations of these vectors, d = [w0d, w1d,…, wn–1,d] and
q = [w0q, w1q,…, wn–1,q], where the weight wij represents the presence of term i in
document j (coordinate of vector j in the i-th dimension), the inner product becomes
a normalized sum of weight products:
$$ \mathrm{sim}(d,q) \;=\; \frac{\sum_{i=0}^{n-1} w_{id}\, w_{iq}}{\sqrt{\sum_{i=0}^{n-1} w_{id}^{2}}\;\sqrt{\sum_{i=0}^{n-1} w_{iq}^{2}}}. \qquad (1.2) $$
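As an illustration of (1.2), the sketch below computes the cosine similarity of two
dense term-weight arrays. It is a minimal example of ours, not part of the original
text, and the class and method names are our own.

/** Minimal sketch of the cosine similarity in (1.2). */
public final class CosineSimilarity {

    /** Returns sim(d, q) for two equal-length term-weight vectors. */
    public static double sim(double[] d, double[] q) {
        double dot = 0.0, dNorm = 0.0, qNorm = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot   += d[i] * q[i];   // numerator: sum of weight products
            dNorm += d[i] * d[i];   // squared length of the document vector
            qNorm += q[i] * q[i];   // squared length of the query vector
        }
        if (dNorm == 0.0 || qNorm == 0.0) return 0.0; // empty vector: no overlap
        return dot / (Math.sqrt(dNorm) * Math.sqrt(qNorm));
    }
}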
The weight wij could be a simple binary coordinate, i.e., 1 if term i appears in
document j and 0 otherwise, or it could be adjusted following some specific
weighting scheme. A common approach, called TF-IDF, is to consider the product

$$ w_{ij} \;=\; tf_{ij} \cdot idf_i \qquad (1.3) $$

where
tfij = term frequency of term i in document j, and
idfi = inverse document frequency, showing how rare term i is in the entire
collection of documents.
The inverse document frequency can be computed as

$$ idf_i \;=\; \log_2 (N / n_i) \qquad \text{[originally } \log_2(N/n_i) + 1\text{; other versions exist]} $$

where
N = number of documents in the collection, and
ni = term frequency of term i in the entire collection of documents.
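A small sketch of these weight formulas in code (our own illustration; the class and
method names are not from the text):

/** Sketch of the TF-IDF document weight wij = tfij * idfi of (1.3). */
public final class TfIdf {

    /** idf_i = log2(N / n_i); Java has no base-2 log, so we divide logs. */
    public static double idf(int N, int ni) {
        return Math.log((double) N / ni) / Math.log(2.0);
    }

    /** Raw TF-IDF weight of term i in document j. */
    public static double weight(int tfij, int N, int ni) {
        return tfij * idf(N, ni);
    }
}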
When VSM calculations are used by search engines, the amount of calculation at
query run time has to be minimized to ensure fast query processing. To that end,
similarities between the query vector and each document vector in the collection are
calculated as the dot product

$$ \mathrm{sim}(d,q) \;=\; \sum_{i=0}^{n-1} w'_{id}\, w'_{iq}, \qquad (1.4) $$

where the normalized term weights

$$ w'_{id} \;=\; \frac{w_{id}}{\sqrt{\sum_{i=0}^{n-1} w_{id}^{2}}} $$

are computed for each term-document combination in advance, at indexing time.
Common practice suggests using a weighting scheme such as TF-IDF for collections of
longer documents containing multiple occurrences of terms, and using simple binary
weighting for collections with short documents or collections using controlled
vocabulary.
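The sketch below illustrates this indexing-time normalization and the query-time
accumulator-style dot product of (1.4). It is our own minimal example, with invented
names, assuming sparse vectors stored as term-to-weight maps.

import java.util.HashMap;
import java.util.Map;

/** Sketch of (1.4): precomputed normalized weights, dot-product scoring. */
public final class QueryScorer {

    /** Normalizes a sparse raw-weight vector once, at indexing time. */
    public static Map<String, Double> normalize(Map<String, Double> raw) {
        double sumSquares = 0.0;
        for (double w : raw.values()) sumSquares += w * w;
        double denom = Math.sqrt(sumSquares);        // normalization denominator
        Map<String, Double> normalized = new HashMap<>();
        if (denom == 0.0) return normalized;         // empty document: nothing to do
        raw.forEach((term, w) -> normalized.put(term, w / denom));
        return normalized;
    }

    /** Accumulates w'_id * w'_iq over the terms shared by document and query. */
    public static double sim(Map<String, Double> doc, Map<String, Double> query) {
        double acc = 0.0;                            // the "accumulator"
        for (Map.Entry<String, Double> e : query.entrySet()) {
            Double wid = doc.get(e.getKey());        // absent term contributes zero
            if (wid != null) acc += wid * e.getValue();
        }
        return acc;
    }
}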
In order to recognize the similarity of words with the same root, such as send and
sending, a number of stemming algorithms have emerged in the area of Information
Retrieval [Frakes and Baeza-Yates 92, pp. 131-160]. Although recognizing other
token variants such as singular and plural nouns, verb forms, negation, prepositions,
etc. has been found to produce dramatically different text classification results [Riloff
95], our stemming efforts here will be restricted to the more traditional Porter
Stemmer, which follows the affix removal approach [Porter 80].
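As a quick illustration (ours, not from the text), a stemmer is typically applied as a
token-level filter before indexing; here PorterStemmer stands for any implementation
of Porter's algorithm, and its interface is hypothetical:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical stemmer interface; any Porter implementation would fit. */
interface PorterStemmer {
    String stem(String token);   // e.g., "sending" -> "send"
}

/** Applies stemming to every token before it enters the dictionary. */
final class StemmingFilter {
    static List<String> stemAll(List<String> tokens, PorterStemmer stemmer) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(stemmer.stem(t));
        return out;   // "send" and "sending" now share one dimension
    }
}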
A1. The H.323 standard provides a foundation for audio, video, and data
communications across IP-based networks, including the Internet. H.323 is an
umbrella recommendation from the International Telecommunication Union (ITU)
that sets standards for multimedia communications over Local Area Networks (LANs)
that do not provide a guaranteed Quality of Service (QoS). These networks dominate
today’s corporate desktops and include packet-switched TCP/IP and IPX over
Ethernet, Fast Ethernet and Token Ring network technologies. Therefore, the H.323
standards are important building blocks for a broad new range of collaborative,
LAN-based applications for multimedia communications, including
videoconferencing.
Since this is a long answer, the practical aspect of grading may be better served by
a shorter operational “correct answer” such as:
A2. H.323 is a standard for sending streaming audio and video via the Internet,
used for videoconferencing.
Trying to further reduce the correct answer to the absolute essential keywords, we
can form the list of keywords:
A3. Internet, standard, streaming, audio, video, videoconferencing.
Table 1.1. Ten free-text answers to the exam question “What is H.323 standard?”

docID  Document Text                                                      Grade
0      I have not got a clue                                                0
1      A level of acceptance                                                0
2      A standard for remote meeting protocol                               1.5
3      This is a standard for videoconferencing                             1.5
4      H.323 is an Internet standard that allows audio to be sent and
       received in the style of a telephone conversation                    2
5      The standard used for videoconferencing and sending streaming
       audio and video via the Internet                                     2
6      Standard for videoconferencing transmissions                         1.5
7      It is the standard provided by ISO and related to project
       management                                                           0
8      Standard for videoconferencing transmissions                         1.5
9      Standard for network connection                                      1
Consider all 28 unique terms included in this collection of documents, a set also
known as the dictionary. For our H.323 question example, the dictionary is {acceptance, allows,
audio, clue, connection, conversation, h.323, internet, iso, level, management,
meeting, network, project, protocol, provided, received, related, remote, sending,
sent, standard, streaming, style, telephone, transmissions, video, videoconferencing}.
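A dictionary like this can be produced with a few lines of code; the sketch below
(our own, with invented names) collects the unique tokens of all documents into a
sorted set:

import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch: build the sorted dictionary of unique terms from tokenized docs. */
final class DictionaryBuilder {
    static SortedSet<String> build(List<List<String>> tokenizedDocs) {
        SortedSet<String> dictionary = new TreeSet<>();  // sorted, no duplicates
        for (List<String> doc : tokenizedDocs) dictionary.addAll(doc);
        return dictionary;   // 28 terms for the H.323 example
    }
}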
Table 1.2. Tokenization of the ten documents for the H.323 question example.
docID Tokens
0 clue
1 level, acceptance
2 standard, remote, meeting, protocol
3 standard, videoconferencing
4 h.323, internet, standard, allows, audio, sent, received, style, telephone,
conversation
5 standard, videoconferencing, sending, streaming, audio, video, internet
6 standard, videoconferencing, transmissions
7 standard, provided, iso, related, project, management
8 standard, videoconferencing, transmissions
9 standard, network, connection
Table 1.3. Vector of merged tokens with the term frequency of term i in document j.
Term docID tfij Term docID tfij
acceptance 1 1 sending 5 1
allows 4 1 sent 4 1
audio 4 1 standard 2 1
audio 5 1 standard 3 1
clue 0 1 standard 4 1
connection 9 1 standard 5 1
conversation 4 1 standard 6 1
h.323 4 1 standard 7 1
internet 4 1 standard 8 1
internet 5 1 standard 9 1
iso 7 1 streaming 5 1
level 1 1 style 4 1
management 7 1 telephone 4 1
meeting 2 1 transmissions 6 1
network 9 1 transmissions 8 1
project 7 1 video 5 1
protocol 2 1 videoconferencing 3 1
provided 7 1 videoconferencing 5 1
received 4 1 videoconferencing 6 1
related 7 1 videoconferencing 8 1
remote 2 1
This set of terms defines an n-dimensional space where each document, including the
query, is represented as a vector (here, n = 28). Table 1.4 shows the calculation of the
inverse document frequency idfi for each term used in the H.323 question example.
Each term is shown with its frequency ni in the entire document collection. Terms that
are more frequent in the collection are penalized with a smaller idf.
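For example, with N = 10:

$$ idf_{\text{standard}} = \log_2\tfrac{10}{8} \approx 0.32, \qquad idf_{\text{videoconferencing}} = \log_2\tfrac{10}{4} \approx 1.32, \qquad idf_{\text{clue}} = \log_2\tfrac{10}{1} \approx 3.32. $$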
Combining the term frequencies from Table 1.3, the inverse document
frequencies from Table 1.4, and the normalization denominators from Table 1.5, we
get the term weights shown in Table 1.6 for all term-document combinations in the
H.323 question example. The term “clue” in document 0 has a very high normalized
weight (w’id = 1) because it is a rare term in a short document. On the other hand, the
term “standard” in document 4 has a very low normalized weight (w’id = 0.0343)
because it is a common term in a longer document.
Table 1.4. The inverse document frequencies idfi for each term i.
Term i ni idfi
acceptance 1 3.32
allows 1 3.32
audio 2 2.32
clue 1 3.32
connection 1 3.32
conversation 1 3.32
h.323 1 3.32
internet 2 2.32
iso 1 3.32
level 1 3.32
management 1 3.32
meeting 1 3.32
network 1 3.32
project 1 3.32
protocol 1 3.32
provided 1 3.32
received 1 3.32
related 1 3.32
remote 1 3.32
sending 1 3.32
sent 1 3.32
standard 8 0.32
streaming 1 3.32
style 1 3.32
telephone 1 3.32
transmissions 2 2.32
video 1 3.32
videoconferencing 4 1.32
Table 1.5. Term weight normalization denominators for each document d.
docID  normalization denominator
0 3.321928
1 4.697915631
2 5.762747136
3 1.360562852
4 9.387905857
5 6.763094518
6 2.691185782
7 7.435029645
8 2.691185782
9 4.708932885
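To verify one entry: document 3 contains only standard and videoconferencing, so its
denominator combines their idf-based weights (tf = 1 for both):

$$ \sqrt{0.3219^{2} + 1.3219^{2}} \;\approx\; 1.3606, $$

matching the docID 3 row above.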
The inverted file structure for our H.323 question example, consisting of the
dictionary and the postings table, is shown in Table 1.7.
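In code, an inverted file of this kind boils down to two parallel structures; the sketch
below (our own, with invented field names) mirrors the columns of Table 1.7:

/** One dictionary row: a term, its document count, its collection frequency,
 *  and the index of its first row in the postings table. */
record DictEntry(String word, int docCount, int tf, int postIndex) {}

/** One postings row: the document, the in-document term frequency,
 *  and the precomputed normalized term weight. */
record Posting(int docId, int tf, double normWeight) {}

// The dictionary is kept sorted by word; each entry's postIndex points into a
// single postings array, and consecutive entries delimit each term's run of rows.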
1.11. Performing a Query
Let us refer again to the H.323 question example and consider answer A2. Correct
answers to free-text exam questions will often be referred to as queries
throughout this text, following standard Information Retrieval terminology. A2
(see section 1.5) contains the following tokens: [audio, h.323, internet, sending,
standard, streaming, video, videoconferencing]. Following the steps outlined earlier
in this chapter, one can calculate the normalized term weights for all tokens included
in this query. Table 1.8 shows the normalized weight calculations for all terms in
query A2. Using these normalized weights w'iq, where i is the query term index, the
similarity between a document d and the query q is given by
$\mathrm{sim}(d,q) = \sum_i w'_{id}\, w'_{iq}$.
Table 1.7. Dictionary and Postings for the H.323 question example.
DICTIONARY POSTINGS
word docCount tfi postIndex | row docID tfij normWeight
acceptance 1 1 0 0 1 1 0.7071
allows 1 1 1 1 4 1 0.3539
audio 2 2 2 2 4 1 0.2473
clue 1 1 4 3 5 1 0.3433
connection 1 1 5 4 0 1 1.0000
conversation 1 1 6 5 9 1 0.7055
h.323 1 1 7 6 4 1 0.3539
internet 2 2 8 7 4 1 0.3539
iso 1 1 10 8 4 1 0.2473
level 1 1 11 9 5 1 0.3433
management 1 1 12 10 7 1 0.4468
meeting 1 1 13 11 1 1 0.7071
network 1 1 14 12 7 1 0.4468
project 1 1 15 13 2 1 0.5764
protocol 1 1 16 14 9 1 0.7055
provided 1 1 17 15 7 1 0.4468
received 1 1 18 16 2 1 0.5764
related 1 1 19 17 7 1 0.4468
remote 1 1 20 18 4 1 0.3539
sending 1 1 21 19 7 1 0.4468
sent 1 1 22 20 2 1 0.5764
standard 8 8 23 21 5 1 0.4912
streaming 1 1 31 22 4 1 0.3539
style 1 1 32 23 2 1 0.0559
telephone 1 1 33 24 3 1 0.2366
transmissions 2 2 34 25 4 1 0.0343
video 1 1 36 26 5 1 0.0476
videoconferencing 4 4 37 27 6 1 0.1196
28 7 1 0.0433
29 8 1 0.1196
30 9 1 0.0684
31 5 1 0.4912
32 4 1 0.3539
33 4 1 0.3539
34 6 1 0.8628
35 8 1 0.8628
36 5 1 0.4912
37 3 1 0.9716
38 5 1 0.1955
39 6 1 0.4912
40 8 1 0.4912
Table 1.8. Normalized query term weights for the H.323 question example.
Query Term i idfi tfi wi w'i
audio 1.8745 1 1.8745 0.3283
h.323 2.4594 1 2.4594 0.4307
internet 1.8745 1 1.8745 0.3283
sending 2.4594 1 2.4594 0.4307
standard 0.2895 1 0.2895 0.0507
streaming 2.4594 1 2.4594 0.4307
video 2.4594 1 2.4594 0.4307
videoconferencing 1.1375 1 1.1375 0.1992
Table 1.10. Accumulator components for the similarity calculations.

                                 document j
Term i               0   1   2   3   4   5   6   7   8   9   Query
acceptance               1
allows                               1
audio                                1   1                        1
clue                 1
connection                                                   1
conversation                         1
h.323                                1                            1
internet                             1   1                        1
iso                                              1
level                    1
management                                       1
meeting                      1
network                                                      1
project                                          1
protocol                     1
provided                                         1
received                             1
related                                          1
remote                       1
sending                                  1                        1
sent                                 1
standard                     1   1   1   1   1   1   1   1       1
streaming                                1                        1
style                                1
telephone                            1
transmissions                                1       1
video                                    1                        1
videoconferencing                1       1   1       1            1
Table 1.11. Normalized binary term weights and resulting document similarities.

                                       document j
Term i                 0     1     2     3     4     5     6     7     8     9   Query
acceptance                0.71
allows                                      0.32
audio                                       0.32  0.38                            0.354
clue                   1
connection                                                                  0.58
conversation                                0.32
h.323                                       0.32                                  0.354
internet                                    0.32  0.38                            0.354
iso                                                            0.41
level                     0.71
management                                                     0.41
meeting                         0.5
network                                                                     0.58
project                                                        0.41
protocol                        0.5
provided                                                       0.41
received                                    0.32
related                                                        0.41
remote                          0.5
sending                                           0.38                            0.354
sent                                        0.32
standard                        0.5   0.71  0.32  0.38  0.58   0.41  0.58   0.58  0.354
streaming                                         0.38                            0.354
style                                       0.32
telephone                                   0.32
transmissions                                           0.58         0.58
video                                             0.38                            0.354
videoconferencing                     0.71        0.38  0.58         0.58         0.354
Similarities:       0.00  0.00  0.18  0.50  0.45  0.94  0.41   0.14  0.41   0.20
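As a check on the similarities row (our own arithmetic): document 5 matches seven of
the eight query terms, each contributing 0.38 x 0.354 to the accumulator, so
sim(d5, q) = 7 x 0.38 x 0.354 = 0.94, the highest-ranked document.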
1.13. Entropy
The concept of entropy was introduced in communication theory by the seminal work
of Claude Shannon [Shannon 48]. Entropy is a measure of information content: lack
of order (presence of noise) corresponds to higher entropy, and therefore more bits
of information are required to communicate a message over a noisy channel.
1.14. Document similarities with entropy weights
As we discussed in section 1.1, the similarity between a document d and an IR query
text q is calculated as the weighted and normalized dot product of the two vectors,
sim(d, q). For easier reference, we repeat expression (1.2) here:

$$ \mathrm{sim}(d,q) \;=\; \frac{\sum_{i=0}^{n-1} w_{id}\, w_{iq}}{\sqrt{\sum_{i=0}^{n-1} w_{id}^{2}}\;\sqrt{\sum_{i=0}^{n-1} w_{iq}^{2}}}. \qquad (1.2\text{, repeated}) $$
We will now look into this entropy-based term weighting scheme in detail. The basic
difference between IDF-based and entropy-based weights is that the term frequencies
tfij are multiplied by Gi instead of idfi when calculating the raw weights wij. Table
1.12 summarizes all the formulas needed in document similarity calculations, using
all three term weighting schemes we have covered so far: TF-IDF, entropy-based, and
binary.
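A sketch of the entropy weight Gi in code (our own illustration; the formulas follow
Table 1.12 below):

/** Sketch of the entropy-based term weight of Table 1.12. */
final class EntropyWeight {

    /**
     * G_i = 1 + sum_j (p_ij log p_ij) / log N, with p_ij = tf_ij / n_i.
     * tfs holds the term's frequency in each document that contains it;
     * their sum is n_i. Any log base works, since the two bases cancel.
     */
    static double entropy(int[] tfs, int nDocs) {
        double ni = 0.0;
        for (int tf : tfs) ni += tf;             // collection frequency n_i
        double g = 1.0;
        for (int tf : tfs) {
            double p = tf / ni;                  // p_ij = tf_ij / n_i
            g += p * Math.log(p) / Math.log(nDocs);
        }
        return g;   // 1 for a unique term, near 0 for a ubiquitous one
    }
}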
Table 1.12. Summary of all VSM formulas related to document-query similarities.

Document-query similarity:

$$ \mathrm{sim}(d,q) \;=\; \frac{\sum_{i=0}^{n-1} w_{id}\, w_{iq}}{\sqrt{\sum_{i=0}^{n-1} w_{id}^{2}}\;\sqrt{\sum_{i=0}^{n-1} w_{iq}^{2}}} \;=\; \sum_{i=0}^{n-1} w'_{id}\, w'_{iq} $$

Document weights:

TF-IDF: $w_{ij} = tf_{ij} \cdot idf_i$, with $idf_i = \log_2(N/n_i)$.

Entropy: $w_{ij} = tf_{ij} \cdot G_i$, with $G_i = 1 + \sum_{j} \dfrac{p_{ij}\,\log p_{ij}}{\log N}$ and $p_{ij} = tf_{ij}/n_i$.
NOTE: in a ratio of two logarithms, the two bases cancel each other out, so in
the above formula the logs no longer have to be base 2.

Binary: $w_{ij} = 1$ if term i is present in document j, and $w_{ij} = 0$ otherwise.

Query weights:

TF-IDF: $w_i = tf_i \cdot idf_i$, with $idf_i = \log_2\big((N+1)/(n_i+1)\big)$.
NOTE: the query formulas are slightly different from the corresponding document
formulas because the presence of term i in the query increases ni by 1 and the
addition of this query to the collection increases N by 1.

Entropy: $w_i = tf_i \cdot G_i$, with $G_i = 1 + \dfrac{p_i\,\log p_i}{\log(N+1)}$ and $p_i = tf_i/(n_i+1)$.

Binary: $w_i = 1$ if term i is present in the query, and $w_i = 0$ otherwise.

where:
wij = raw weight of term i in document j,
wi = raw weight of term i in the query,
w'ij = normalized weight of term i in document j,
w'i = normalized weight of term i in the query,
tfij = term frequency of term i in document j,
tfi = term frequency of term i in the query,
idfi = inverse document frequency of term i,
pij = conditional probability of document j given term i,
Gi = entropy of term i,
N = number of documents in the collection, and
ni = term frequency of term i in the entire collection of documents.
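To make these formulas concrete, here is the entropy of the term standard, which
occurs once in each of eight documents (ni = 8, N = 10), followed by its query-side
entropy (tfi = 1, so pi = 1/9):

$$ G_{\text{standard}} \;=\; 1 + 8\cdot\frac{0.125\,\log 0.125}{\log 10} \;\approx\; 0.097, \qquad G^{\,query}_{\text{standard}} \;=\; 1 + \frac{(1/9)\,\log(1/9)}{\log 11} \;\approx\; 0.898, $$

the latter matching the Gi column of Table 1.19.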
Table 1.13. Tokenization of the ten documents for the H.323 question example
(repeated from Table 1.2 for easier reference).
docID Tokens
0 clue
1 level, acceptance
2 standard, remote, meeting, protocol
3 standard, videoconferencing
4 h.323, internet, standard, allows, audio, sent, received, style, telephone,
conversation
5 standard, videoconferencing, sending, streaming, audio, video, internet
6 standard, videoconferencing, transmissions
7 standard, provided, iso, related, project, management
8 standard, videoconferencing, transmissions
9 standard, network, connection
Table 1.14. Vector of merged tokens with the term frequency of term i in document j.
Term docID tfij pij Term docID tfij pij
acceptance 1 1 1 sending 5 1 1
allows 4 1 1 sent 4 1 1
audio 4 1 0.5 standard 2 1 0.125
audio 5 1 0.5 standard 3 1 0.125
clue 0 1 1 standard 4 1 0.125
connection 9 1 1 standard 5 1 0.125
conversation 4 1 1 standard 6 1 0.125
h.323 4 1 1 standard 7 1 0.125
internet 4 1 0.5 standard 8 1 0.125
internet 5 1 0.5 standard 9 1 0.125
iso 7 1 1 streaming 5 1 1
level 1 1 1 style 4 1 1
management 7 1 1 telephone 4 1 1
meeting 2 1 1 transmissions 6 1 0.5
network 9 1 1 transmissions 8 1 0.5
project 7 1 1 video 5 1 1
protocol 2 1 1 videoconferencing 3 1 0.25
provided 7 1 1 videoconferencing 5 1 0.25
received 4 1 1 videoconferencing 6 1 0.25
related 7 1 1 videoconferencing 8 1 0.25
remote 2 1 1
Entropies. Table 1.15 shows the calculation of the entropy Gi for each term used in
the H.323 question example. Each term's frequency in the entire document
collection, ni, is also shown. As with the TF-IDF weights, terms that are more
frequent in the collection are penalized with a smaller Gi.
Term weights. Combining the term frequencies from Table 1.14, the entropies from
Table 1.15, and the normalization denominators from Table 1.16, we get the term
weights shown in Table 1.17 for all term-document combinations in the H.323
question example. The term “clue” in document 0 has a very high normalized weight
(w’ij = 1) because it is a rare term in a short document. On the other hand, the term
“standard” in document 4 has a very low normalized weight (w’ij = 0.0343) because it
is a common term in a longer document. Note that the normalized term weights
calculated here turned out to be the same as the normalized weights calculated using
IDFs (compare Table 1.17 with Table 1.6), but this is not always the case.
Table 1.16. Term weight normalization denominators for each document d.
docID  normalization denominator
0 1
1 1.414213562
2 1.734759796
3 0.409570252
4 2.826041343
5 2.035894377
6 0.810127677
7 2.238167006
8 0.810127677
9 1.417530087
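To verify one entry: document 3 contains only standard and videoconferencing, whose
entropies under the document-weight formula of Table 1.12 are approximately 0.0969
and 0.3979, so its denominator is

$$ \sqrt{0.0969^{2} + 0.3979^{2}} \;\approx\; 0.4096, $$

matching the docID 3 row above.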
Dictionary and postings. The inverted file structure for our H.323 question
example, consisting of the dictionary and the postings table, is shown in Table 1.18.
Since the normalized weights turned out to be the same as in the TF-IDF case, Table
1.18 is also the same as Table 1.7.
Table 1.18. Dictionary and Postings for the H.323 question example.
DICTIONARY POSTINGS
word docCount tfi postIndex | row docID tfij normWeight
acceptance 1 1 0 0 1 1 0.7071
allows 1 1 1 1 4 1 0.3539
audio 2 2 2 2 4 1 0.2473
clue 1 1 4 3 5 1 0.3433
connection 1 1 5 4 0 1 1.0000
conversation 1 1 6 5 9 1 0.7055
h.323 1 1 7 6 4 1 0.3539
internet 2 2 8 7 4 1 0.3539
iso 1 1 10 8 4 1 0.2473
level 1 1 11 9 5 1 0.3433
management 1 1 12 10 7 1 0.4468
meeting 1 1 13 11 1 1 0.7071
network 1 1 14 12 7 1 0.4468
project 1 1 15 13 2 1 0.5764
protocol 1 1 16 14 9 1 0.7055
provided 1 1 17 15 7 1 0.4468
received 1 1 18 16 2 1 0.5764
related 1 1 19 17 7 1 0.4468
remote 1 1 20 18 4 1 0.3539
sending 1 1 21 19 7 1 0.4468
sent 1 1 22 20 2 1 0.5764
standard 8 8 23 21 5 1 0.4912
streaming 1 1 31 22 4 1 0.3539
style 1 1 32 23 2 1 0.0559
telephone 1 1 33 24 3 1 0.2366
transmissions 2 2 34 25 4 1 0.0343
video 1 1 36 26 5 1 0.0476
videoconferencing 4 4 37 27 6 1 0.1196
28 7 1 0.0433
29 8 1 0.1196
30 9 1 0.0684
31 5 1 0.4912
32 4 1 0.3539
33 4 1 0.3539
34 6 1 0.8628
35 8 1 0.8628
36 5 1 0.4912
37 3 1 0.9716
38 5 1 0.1955
39 6 1 0.4912
40 8 1 0.4912
Performing a Query. Let us refer again to the H.323 question example and consider
answer A2. Following the formulas presented earlier in this chapter, one can
calculate the normalized term weights for all tokens included in this query. Table
1.19 shows the normalized weight calculations for all terms in query A2. Using these
normalized weights w'iq, where i is the query term index, the similarity between a
document d and the query q is calculated in Table 1.20. Once again we use the notion
of an accumulator, since the similarities are sums of products. Table 1.20 is
somewhat similar, but not identical, to the IDF-based Table 1.9. Both indicate that
the most relevant document is 5, followed by 4, 3, 6, and 8, but using entropies
document 3 ranks higher than document 4. The grades awarded by the human expert
(1.5 and 2, respectively) favor the TF-IDF approach, but there are many text
evaluation cases where the entropy-based approach fares better.
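The reordering of documents 3 and 4 can be checked by hand (our own arithmetic,
using the normalized document weights of Table 1.18 and the query weights of Table
1.19 below): sim(d3, q) = 0.2366 x 0.3692 + 0.9716 x 0.3558 = 0.43, while the four
query terms of document 4 accumulate to sim(d4, q) = 0.31, so document 3 now
outranks document 4.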
Table 1.19. Normalized query term weights for the H.323 question example.
Query Term i Gi tfi pi wi w'i
audio 0.8473 1 0.333 0.8473 0.3482
h.323 0.8555 1 0.5 0.8555 0.3516
internet 0.8473 1 0.333 0.8473 0.3482
sending 0.8555 1 0.5 0.8555 0.3516
standard 0.8982 1 0.111 0.8982 0.3692
streaming 0.8555 1 0.5 0.8555 0.3516
video 0.8555 1 0.5 0.8555 0.3516
videoconferencing 0.8658 1 0.2 0.8658 0.3558
1.15. Porter Stemmer
Table 1.21 shows the tokens in the H.323 question example after affix removal was
applied to them using the Porter Stemmer algorithm [Porter 80]. In this example the
effect of prefixes, suffixes, plurals, etc. is minimal, and the document similarities turn
out to be identical even after the stemming operation is applied. However, term
stemming, and term conflation in general, are established text mining methods that
work well in a number of text mining applications.
Figure 1.3 shows the class diagram of a VSM class. Only selected public methods
are shown. The source code of a Java implementation of this class is posted on our
course Web site.
VSM
- docVec: Vector
- dictVec: Vector
- postVec: Vector
- stemmer: Porter
- docWeightingMethod: String
- queryWeightingMethod: String
- activatePorterStemmer: Boolean
- eliminateUniqueTerms: Boolean
+ create(): VSM
+ loadStopwords (filename: String)
+ loadBlockwords (filename: String)
+ setDocumentsVector (inputVec: Vector)
+ setDocWeightingMethod (m: String)
+ setQueryWeightingMethod (m: String)
+ setEliminateUniqueTerms (p: Boolean)
+ setActivatePorterStemmer (p: Boolean)
+ createDictPost()
+ createXmatrix()
+ getXMatrix(): Double[][]
+ performOneQuery (queryStr: String)
+ getOneQueryResultsVec(): Vector
+ setQueryVector (queryVec: Vector)
+ createQmatrix()
+ getQMatrix(): Double[][]
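As a usage illustration (ours, not taken from the posted course code), a plausible
calling sequence against the public methods listed in Figure 1.3 might look as
follows; the weighting-method strings, the stopword file name, and the construction
call are assumptions:

import java.util.Vector;

public class VsmDemo {
    public static void main(String[] args) {
        VSM vsm = new VSM();                      // Figure 1.3 also lists a create() factory
        vsm.loadStopwords("stopwords.txt");       // hypothetical file name
        vsm.setDocWeightingMethod("TFIDF");       // assumed method-name strings
        vsm.setQueryWeightingMethod("TFIDF");
        vsm.setActivatePorterStemmer(true);
        vsm.setEliminateUniqueTerms(false);

        Vector docs = new Vector();               // the ten answers of Table 1.1
        docs.add("I have not got a clue");
        docs.add("A level of acceptance");
        // ... the remaining eight answers ...
        vsm.setDocumentsVector(docs);
        vsm.createDictPost();                     // build dictionary and postings

        vsm.performOneQuery("H.323 is a standard for sending streaming audio "
                + "and video via the Internet, used for videoconferencing");
        Vector results = vsm.getOneQueryResultsVec();  // ranked results
    }
}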
References

Frakes, W. B., and R. Baeza-Yates, Eds. (1992). Information Retrieval: Data Structures and
Algorithms. Prentice-Hall, New Jersey.
Lochbaum, K. E., and L. A. Streeter (1989). Comparing and Combining the Effectiveness of Latent
Semantic Indexing and the Ordinary Vector Space Model for Information Retrieval.
Information Processing and Management, 25(6), 665-676.
Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program, 14(3), 130-137.
Riloff, E. (1995). Little Words Can Make a Big Difference for Text Classification. In Proc. 18th Int.
ACM SIGIR Conf. on Research and Development in Information Retrieval, 130-136.
Salton, G., and C. Buckley (1988). Term-Weighting Approaches in Automatic Text Retrieval.
Information Processing and Management, 24(5), 513-523.
Salton, G., and M. E. Lesk (1968). Computer Evaluation of Indexing and Text Processing. Journal
of the Association for Computing Machinery, 15(1), 8-36.
Salton, G., H. Wu, and C. T. Yu (1981). The Measurement of Term Importance in Automatic
Indexing. Journal of the American Society for Information Science, 32(3), 175-186.
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical
Journal, 27, 379-423 and 623-656.