
DSCI 5910 – Text Mining Notes, Nick Evangelopoulos, University of North Texas,
January 2007. Some material based on N. Evangelopoulos’ Master’s Thesis, Univ. of
Kansas, 2003. Revised January 2011.

CHAPTER 1
INTRODUCTION TO THE VECTOR SPACE MODEL

1.1. Document Similarity and the Vector Space Model (VSM)

The Vector Space Model (VSM) is a ranking model that ranks individual documents
against a query. Originally introduced by Gerard Salton and his associates [Salton
and Lesk 68], [Salton et al. 81], [Salton and Buckley 88], the VSM has played a
central role in Information Retrieval research. Readers may refer to [Frakes and
Baeza-Yates 92] for an overview of the VSM.

Consider all unique terms included in the collection of documents, also known as
the dictionary. This set of terms defines an n-dimensional space where each document,
including the query, is represented as a vector. Then, a measure of similarity between
 
document d and query q is the cosine of the angle that is formed between the two
vectors:

$$\mathrm{sim}(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\|\,\|\vec{q}\|}. \qquad (1.1)$$

Considering the array representations of these vectors, d = [w0d, w1d,…, wn–1,d] and

q = [w0q, w1q,…, wn–1,q], where the weight wij represents the presence of term i in

document j (coordinate of vector j in the i-th dimension), the inner product becomes
a normalized sum of weight products:

n 1

 
  w w
sim d , q  i  0 id iq
. (1.2)
i 0 id i0 iq
n 1 2 n 1 2
w w
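Expression (1.2) is straightforward to implement. Below is a minimal sketch in Python (used here purely for illustration; the function and variable names are our own, not part of any reference implementation):

```python
import math

def cosine_similarity(d, q):
    """Cosine similarity between two equal-length weight vectors (Eq. 1.2)."""
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = math.sqrt(sum(wd * wd for wd in d)) * math.sqrt(sum(wq * wq for wq in q))
    return num / den if den else 0.0

# Identical vectors have similarity 1; vectors with no shared terms have similarity 0.
print(round(cosine_similarity([1, 0, 1], [1, 0, 1]), 6))  # 1.0
print(round(cosine_similarity([1, 0], [0, 1]), 6))        # 0.0
```

The zero-denominator guard covers the degenerate case of an all-zero vector (e.g., a document with no dictionary terms).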

1.2. TF-IDF Weights

The weight wij could be a simple binary coordinate, i.e., 1 if term i appears in
document j and 0 otherwise, or it could be adjusted following some specific
weighting scheme. A common approach, called TF-IDF, is to consider the product
wij = tfij * idfi (1.3)
where

tfij = term frequency of term i in document j
idfi = inverse document frequency showing how rare is term i in the entire
collection of documents.
The inverse document frequency can be computed as

idfi = log2(N / ni)    [originally idfi = log2(N / ni) + 1; other versions exist]
where
N = number of documents in the collection
ni = term frequency of term i in the entire collection of documents.
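These definitions can be sketched in a few lines of Python (illustrative only; the numbers in the example calls assume a hypothetical collection of N = 10 documents):

```python
import math

def idf(N, n_i):
    """Inverse document frequency, idf_i = log2(N / n_i)."""
    return math.log2(N / n_i)

def tfidf(tf_ij, N, n_i):
    """Raw TF-IDF weight, w_ij = tf_ij * idf_i (Eq. 1.3)."""
    return tf_ij * idf(N, n_i)

# With N = 10 documents, a term occurring once in 8 of them gets a small
# weight, while a term occurring in only 1 document gets a large one.
print(round(tfidf(1, 10, 8), 2))  # 0.32
print(round(tfidf(1, 10, 1), 2))  # 3.32
```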

When VSM calculations are used by search engines, the amount of calculations at
query run time has to be minimized to ensure fast query processing. To that end,
similarities between the query vector and each document vector in the collection are
calculated as the dot product
 
  n 1  
  wid wiq
sim d , q  i 0  , (1.4)
  wid  wiq 
n 1 2 n 1 2
 i 0 i 0 
wid
where the normalized term weights are computed for each term-document
i 0 wid
n 1 2

combination beforehand and stored in a lookup table. The normalization



n 1
denominators i 0
wid2 are equal within each document d. The normalized weights
are usually stored in a lookup file called the inverted file [Frakes and Baeza-Yates 92,
pp. 28-43, pp. 377-382]. This consists of two tables, called the dictionary and the
postings.

1.3. Binary weights

The idea behind the aforementioned weighting formula is to promote multiple


occurrences of a term of interest within the same document, assuming that such a
document would be more relevant to that term. In some text evaluation applications,
however, this approach may not be desirable. For example, when dealing with the
evaluation of text answers to an exam question, multiple occurrences of a term
included in the correct answer do not make a student’s answer more correct. Also,
the weighting formula promotes terms that are rare in the entire collection, assuming
that common terms are less interesting since they carry less information. But in exam
documents we do not want to underplay a term that appears in many answer
documents just because the question was easy and most of the students got it right. If
all answers contain a good term, we want to compute a high similarity between the
correct answer and all student answers and give all students a high grade. Salton and
Buckley [Salton and Buckley 88] provide some tailoring guidelines for the weighting
scheme. They suggest reducing the query weights wiq to just tfiq for long queries

containing multiple occurrences of terms, and using simple binary weighting for
collections with short documents or collections using controlled vocabulary.

1.4. Porter Stemmer

In order to recognize the similarity of words with the same root such as send and
sending, a number of stemming algorithms emerged in the area of Information
Retrieval [Frakes and Baeza-Yates 92, pp.131-160]. Although recognizing other
token variants such as singular and plural nouns, verb forms, negation, prepositions,
etc. has been found to produce dramatically different text classification results [Riloff
95], our stemming efforts here will be restricted to the more traditional Porter
Stemmer which follows the affix removal approach [Porter 80].

1.5. Test data: The H.323 question example

Consider the following exam question, related to an introductory Information


Systems course: Question: What is H.323 standard? (2 pts.). Table 1.1 lists ten
sampled student answers taken from an actual exam at a large public U.S. university.
The table lists the students’ free-text answers and the instructor’s grades on a scale
from 0 to 2, using 0, 0.5, 1, 1.5, and 2 as the possible grades. In this chapter we will
illustrate all the CBG-related operations and calculations using this example. The
objective of these calculations is to obtain a small test data set of verifiable results, in
order to facilitate a reliable programming implementation. We start by providing a
complete correct answer for the H.323 question, taken from
http://www.protocols.com/pbook/h323.htm:

A1. The H.323 standard provides a foundation for audio, video, and data
communications across IP-based networks, including the Internet. H.323 is an
umbrella recommendation from the International Telecommunications Union (ITU)
that sets standards for multimedia communications over Local Area Networks (LANs)
that do not provide a guaranteed Quality of Service (QoS). These networks dominate
today’s corporate desktops and include packet-switched TCP/IP and IPX over
Ethernet, Fast Ethernet and Token Ring network technologies. Therefore, the H.323
standards are important building blocks for a broad new range of collaborative,
LAN-based applications for multimedia communications, including
videoconferencing.

Since this is a long answer, the practical aspect of grading may be better served by
a shorter operational “correct answer” such as:
A2. H.323 is a standard for sending streaming audio and video via the Internet,
used for videoconferencing.

Trying to further reduce the correct answer to the absolute essential keywords, we
can form the list of keywords:
A3. Internet, standard, streaming, audio, video, videoconferencing.

Table 1.1. Ten free-text answers to the exam question “What is H.323 standard?”
docID  Document Text                                                       Grade
0      I have not got a clue                                               0
1      A level of acceptance                                               0
2      A standard for remote meeting protocol                              1.5
3      This is a standard for videoconferencing                            1.5
4      H.323 is an Internet standard that allows audio to be sent and
       received in the style of a telephone conversation                   2
5      The standard used for videoconferencing and sending streaming
       audio and video via the Internet                                    2
6      Standard for videoconferencing transmissions                        1.5
7      It is the standard provided by ISO and related to project
       management                                                          0
8      Standard for videoconferencing transmissions                        1.5
9      Standard for network connection                                     1

1.7. Document Tokenization


According to the VSM, document similarity calculations start with tokenization of all
documents in the collection. All words are converted to lowercase. Common English
words are excluded. A text file (stopfile) posted on our course Web site lists all such
common words (stopwords). Table 1.2 shows the remaining tokens for each student
answer document in the H.323 question example.
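A tokenizer along these lines might look as follows. This is an illustrative Python sketch, not the course implementation; the tiny STOPWORDS set is a stand-in for the stopfile posted on the course Web site:

```python
import re

# Stand-in stopword list; the real stopfile is much longer.
STOPWORDS = {"i", "have", "not", "got", "a", "the", "is", "an", "of", "for",
             "to", "be", "and", "in", "it", "by", "this", "that", "used"}

def tokenize(text):
    """Lowercase, extract tokens (keeping internal '.' so 'h.323' survives),
    and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("This is a standard for videoconferencing"))
# ['standard', 'videoconferencing']
```

The token pattern deliberately allows embedded periods so that “H.323” is kept as a single token, matching the tokenization in Table 1.2.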

1.8. Term Frequencies


After all documents are tokenized, all tokens are merged into a sorted vector of terms
and the term frequencies are calculated. Table 1.3 shows this vector for the H.323
question example. For each term i, the frequency with which it appears in document j
is listed as tfi,j. All term frequencies listed in Table 1.3 happen to be equal to 1, just
because no term appears in the same answer more than once.

1.9. Document Similarity and Term Weights

Consider all 28 unique terms included in this collection of documents, also known as
the dictionary. For our H.323 question example, the dictionary is {acceptance, allows,
audio, clue, connection, conversation, h.323, internet, iso, level, management,
meeting, network, project, protocol, provided, received, related, remote, sending,
sent, standard, streaming, style, telephone, transmissions, video, videoconferencing}.

Table 1.2. Tokenization of the ten documents for the H.323 question example.
docID  Tokens
0      clue
1      level, acceptance
2      standard, remote, meeting, protocol
3      standard, videoconferencing
4      h.323, internet, standard, allows, audio, sent, received, style,
       telephone, conversation
5      standard, videoconferencing, sending, streaming, audio, video, internet
6      standard, videoconferencing, transmissions
7      standard, provided, iso, related, project, management
8      standard, videoconferencing, transmissions
9      standard, network, connection

Table 1.3. Vector of merged tokens with the term frequency of term i in document j.
Term docID tfij Term docID tfij
acceptance 1 1 sending 5 1
allows 4 1 sent 4 1
audio 4 1 standard 2 1
audio 5 1 standard 3 1
clue 0 1 standard 4 1
connection 9 1 standard 5 1
conversation 4 1 standard 6 1
h.323 4 1 standard 7 1
internet 4 1 standard 8 1
internet 5 1 standard 9 1
iso 7 1 streaming 5 1
level 1 1 style 4 1
management 7 1 telephone 4 1
meeting 2 1 transmissions 6 1
network 9 1 transmissions 8 1
project 7 1 video 5 1
protocol 2 1 videoconferencing 3 1
provided 7 1 videoconferencing 5 1
received 4 1 videoconferencing 6 1
related 7 1 videoconferencing 8 1
remote 2 1

This set of terms defines an n-dimensional space where each document, including the
query, is represented as a vector (here, n = 28). Table 1.4 shows a calculation of the
inverse document frequencies idfi for each term used in the H.323 question example.

Each term is shown with its frequency in the entire document collection. Terms that
are more frequent in the collection are penalized with a smaller idf.

Combining the term frequencies from Table 1.3, the inverse document
frequencies from Table 1.4, and the normalization denominators from Table 1.5, we
get the term weights shown in Table 1.6 for all term-document combinations in the
H.323 question example. The term “clue” in document 0 has a very high normalized
weight (w’id = 1) because it is a rare term in a short document. On the other hand, the
term “standard” in document 4 has a very low normalized weight (w’id = 0.0343)
because it is a common term in a longer document.

Table 1.4. The inverse document frequencies idfi for each term i.
Term i ni idfi
acceptance 1 3.32
allows 1 3.32
audio 2 2.32
clue 1 3.32
connection 1 3.32
conversation 1 3.32
h.323 1 3.32
internet 2 2.32
iso 1 3.32
level 1 3.32
management 1 3.32
meeting 1 3.32
network 1 3.32
project 1 3.32
protocol 1 3.32
provided 1 3.32
received 1 3.32
related 1 3.32
remote 1 3.32
sending 1 3.32
sent 1 3.32
standard 8 0.32
streaming 1 3.32
style 1 3.32
telephone 1 3.32
transmissions 2 2.32
video 1 3.32
videoconferencing 4 1.32

Table 1.5. Term weight normalization denominators for each document d.
docID norm denom
0 3.321928
1 4.697915631
2 5.762747136
3 1.360562852
4 9.387905857
5 6.763094518
6 2.691185782
7 7.435029645
8 2.691185782
9 4.708932885
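The denominators in Table 1.5 are the Euclidean norms of each document’s raw TF-IDF weight vector. As a quick check (Python, illustrative), the value for document 3 can be reproduced from its two terms, “standard” (idf = log2(10/8)) and “videoconferencing” (idf = log2(10/4)):

```python
import math

def norm_denominator(raw_weights):
    """Euclidean norm sqrt(sum_i w_id^2) of a document's raw weight vector."""
    return math.sqrt(sum(w * w for w in raw_weights))

# Document 3 = {standard, videoconferencing}, N = 10 documents.
w_doc3 = [math.log2(10 / 8), math.log2(10 / 4)]
print(round(norm_denominator(w_doc3), 4))  # 1.3606, matching Table 1.5
```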

Table 1.6. Normalized term weights.


Term docID wij Norm(wij) Term docID wij Norm(wij)
acceptance 1 3.32 0.7071 sending 5 3.32 0.4912
allows 4 3.32 0.3539 sent 4 3.32 0.3539
audio 4 2.32 0.2473 standard 2 0.32 0.0559
audio 5 2.32 0.3433 standard 3 0.32 0.2366
clue 0 3.32 1.0000 standard 4 0.32 0.0343
connection 9 3.32 0.7055 standard 5 0.32 0.0476
conversation 4 3.32 0.3539 standard 6 0.32 0.1196
h.323 4 3.32 0.3539 standard 7 0.32 0.0433
internet 4 2.32 0.2473 standard 8 0.32 0.1196
internet 5 2.32 0.3433 standard 9 0.32 0.0684
iso 7 3.32 0.4468 streaming 5 3.32 0.4912
level 1 3.32 0.7071 style 4 3.32 0.3539
management 7 3.32 0.4468 telephone 4 3.32 0.3539
meeting 2 3.32 0.5764 transmissions 6 2.32 0.8628
network 9 3.32 0.7055 transmissions 8 2.32 0.8628
project 7 3.32 0.4468 video 5 3.32 0.4912
protocol 2 3.32 0.5764 videoconferencing 3 1.32 0.9716
provided 7 3.32 0.4468 videoconferencing 5 1.32 0.1955
received 4 3.32 0.3539 videoconferencing 6 1.32 0.4912
related 7 3.32 0.4468 videoconferencing 8 1.32 0.4912
remote 2 3.32 0.5764

1.10. The inverted file: Dictionary and Postings

The inverted file structure for our H.323 question example, consisting of the
dictionary and the postings table, is shown in Table 1.7.
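A minimal in-memory version of the dictionary/postings split can be sketched as follows (Python, illustrative; a full inverted file would also carry the precomputed normalized weights alongside the term frequencies):

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs: {docID: [tokens]}. Returns (dictionary, postings), where the
    dictionary maps each term to (docCount, postIndex) and postings is one
    flat list of (docID, tf) entries, grouped by term."""
    by_term = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in docs.items():
        for t in tokens:
            by_term[t][doc_id] += 1          # accumulate tf_ij

    dictionary, postings = {}, []
    for term in sorted(by_term):             # terms in dictionary order
        dictionary[term] = (len(by_term[term]), len(postings))
        for doc_id in sorted(by_term[term]):
            postings.append((doc_id, by_term[term][doc_id]))
    return dictionary, postings

docs = {3: ["standard", "videoconferencing"],
        5: ["standard", "videoconferencing", "video"]}
dictionary, postings = build_inverted_file(docs)
print(dictionary["standard"])  # (2, 0): in 2 documents, postings start at index 0
```

As in Table 1.7, each dictionary entry points into the flat postings list via postIndex, so a query term’s postings can be scanned without touching the rest of the file.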

1.11. Performing a Query

Let us refer again to the H.323 question example and consider answer A2. Correct
answers to free-text exam questions will often be referred to as queries throughout
this text, following standard Information Retrieval terminology. A2 (see section 1.5)
contains the following tokens: [audio, h.323, internet, sending, standard, streaming,
video, videoconferencing]. Following the steps outlined earlier in this chapter, one
can calculate the normalized term weights for all tokens included in this query.
Table 1.8 shows the normalized weight calculations for all terms in query A2.
Using these normalized weights w'iq, where i is the query term index, the similarity
between a document d and the query q is given by sim(d, q) = Σi w'id w'iq.

Table 1.7. Dictionary and Postings for the H.323 question example.
DICTIONARY POSTINGS
word docCount tfi postIndex docID tfij normWeight
acceptance 1 1 0 0 1 1 0.7071
allows 1 1 1 1 4 1 0.3539
audio 2 2 2 2 4 1 0.2473
clue 1 1 4 3 5 1 0.3433
connection 1 1 5 4 0 1 1.0000
conversation 1 1 6 5 9 1 0.7055
h.323 1 1 7 6 4 1 0.3539
internet 2 2 8 7 4 1 0.3539
iso 1 1 10 8 4 1 0.2473
level 1 1 11 9 5 1 0.3433
management 1 1 12 10 7 1 0.4468
meeting 1 1 13 11 1 1 0.7071
network 1 1 14 12 7 1 0.4468
project 1 1 15 13 2 1 0.5764
protocol 1 1 16 14 9 1 0.7055
provided 1 1 17 15 7 1 0.4468
received 1 1 18 16 2 1 0.5764
related 1 1 19 17 7 1 0.4468
remote 1 1 20 18 4 1 0.3539
sending 1 1 21 19 7 1 0.4468
sent 1 1 22 20 2 1 0.5764
standard 8 8 23 21 5 1 0.4912
streaming 1 1 31 22 4 1 0.3539
style 1 1 32 23 2 1 0.0559
telephone 1 1 33 24 3 1 0.2366
transmissions 2 2 34 25 4 1 0.0343
video 1 1 36 26 5 1 0.0476
videoconferencing 4 4 37 27 6 1 0.1196
28 7 1 0.0433
29 8 1 0.1196
30 9 1 0.0684
31 5 1 0.4912
32 4 1 0.3539
33 4 1 0.3539
34 6 1 0.8628
35 8 1 0.8628
36 5 1 0.4912
37 3 1 0.9716
38 5 1 0.1955
39 6 1 0.4912
40 8 1 0.4912

Table 1.8. Normalized query term weights for the H.323 question example.
Query Term i idfi tfi wi w'i
audio 1.8745 1 1.8745 0.3283
h.323 2.4594 1 2.4594 0.4307
internet 1.8745 1 1.8745 0.3283
sending 2.4594 1 2.4594 0.4307
standard 0.2895 1 0.2895 0.0507
streaming 2.4594 1 2.4594 0.4307
video 2.4594 1 2.4594 0.4307
videoconferencing 1.1375 1 1.1375 0.1992

Table 1.9. Accumulator components for the similarity calculations.
                               Similarity to document:
Query Term i         0     1     2     3     4     5     6     7     8     9
audio                                        0.08  0.11
h.323                                        0.15
internet                                     0.08  0.11
sending                                            0.21
standard                         0.00  0.01  0.00  0.00  0.01  0.00  0.01  0.00
streaming                                          0.21
video                                              0.21
videoconferencing                      0.19        0.04  0.10        0.10
Cumulative
similarities       0.00  0.00  0.00  0.21  0.32  0.90  0.10  0.00  0.10  0.00

A commonly used structure for the efficient calculation of this similarity measure is
called an accumulator, because it starts with the document-query similarities due to
the first query term and continues accumulating similarities until all query terms are
exhausted. Table 1.9 shows the components that get added cumulatively to the
accumulator during the 8 passes required for query A2 in the H.323 question
example. The last row in the table shows the final accumulator values after the 8
passes. These are the final similarity values, indicating that the most relevant
document is 5, followed by 4, 3, 6, and 8. Comparing these similarities to the actual
grades awarded by the grader (Table 1.1), we observe a reasonably good
correspondence between VSM similarities and actual grades.
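The accumulator logic can be sketched as follows (Python, illustrative; the weights in the example call are made-up numbers, not values taken from Table 1.9):

```python
from collections import defaultdict

def accumulate_similarities(query_weights, postings):
    """query_weights: {term: normalized query weight w'_iq}.
    postings: {term: [(docID, normalized document weight w'_id), ...]}.
    Returns {docID: similarity}, built up one query term at a time."""
    acc = defaultdict(float)
    for term, wq in query_weights.items():        # one pass per query term
        for doc_id, wd in postings.get(term, []):
            acc[doc_id] += wd * wq                # add this term's contribution
    return dict(acc)

# Tiny illustration with made-up weights for two terms and two documents.
sims = accumulate_similarities(
    {"video": 0.43, "standard": 0.05},
    {"video": [(5, 0.49)], "standard": [(4, 0.03), (5, 0.05)]},
)
print(round(sims[5], 4))  # 0.2132
```

Only documents that share at least one term with the query ever enter the accumulator, which is what keeps query-time cost low.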

1.12. Binary weights


For our exam grading purposes, since the documents are relatively short and the
vocabulary is controlled, we will adopt binary document weighting. Table 1.10 lists
the binary raw term weights wij for all 10 documents of our H.323 question example
and Table 1.11 lists the corresponding normalized weights as well as the document
similarities to query A2. The resulting document similarities using binary weighting
are not identical, but nevertheless very comparable to the ones in Table 1.9.
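Under binary weighting, every term in a document with k distinct terms gets normalized weight 1/√k, so the cosine similarity reduces to the size of the term overlap divided by √(kd)·√(kq). A Python sketch (illustrative), reproducing two of the similarities in Table 1.11:

```python
import math

def binary_similarity(doc_terms, query_terms):
    """Cosine similarity under binary weighting:
    |overlap| / (sqrt(|d|) * sqrt(|q|))."""
    d, q = set(doc_terms), set(query_terms)
    if not d or not q:
        return 0.0
    return len(d & q) / (math.sqrt(len(d)) * math.sqrt(len(q)))

query = {"audio", "h.323", "internet", "sending", "standard",
         "streaming", "video", "videoconferencing"}
doc3 = {"standard", "videoconferencing"}
doc5 = {"standard", "videoconferencing", "sending", "streaming",
        "audio", "video", "internet"}
print(round(binary_similarity(doc3, query), 2))  # 0.5
print(round(binary_similarity(doc5, query), 2))  # 0.94
```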

Table 1.10. Binary raw term weights wij for the H.323 question example.
document j
Term i 0 1 2 3 4 5 6 7 8 9 Query
acceptance 1
allows 1
audio 1 1 1
clue 1
connection 1
conversation 1
h.323 1 1
internet 1 1 1
iso 1
level 1
management 1
meeting 1
network 1
project 1
protocol 1
provided 1
received 1
related 1
remote 1
sending 1 1
sent 1
standard 1 1 1 1 1 1 1 1 1
streaming 1 1
style 1
telephone 1
transmissions 1 1
video 1 1
videoconferencing 1 1 1 1 1

Table 1.11. Normalized binary term weights and resulting document similarities.
document j
Term i 0 1 2 3 4 5 6 7 8 9 Query
acceptance 0.71
allows 0.32
audio 0.32 0.38 0.354
clue 1
connection 0.58
conversation 0.32
h.323 0.32 0.354
internet 0.32 0.38 0.354
iso 0.41
level 0.71
management 0.41
meeting 0.5
network 0.58
project 0.41
protocol 0.5
provided 0.41
received 0.32
related 0.41
remote 0.5
sending 0.38 0.354
sent 0.32
standard 0.5 0.71 0.32 0.38 0.58 0.41 0.58 0.58 0.354
streaming 0.38 0.354
style 0.32
telephone 0.32
transmissions 0.58 0.58
video 0.38 0.354
videoconferencing 0.71 0.38 0.58 0.58 0.354
Similarities: 0.00 0.00 0.18 0.50 0.45 0.94 0.41 0.14 0.41 0.20

1.13. Entropy

The concept of entropy was introduced in communication theory by the seminal work
of Claude Shannon (Shannon 1948). Entropy represents a measure of information
content: lack of order (presence of noise) corresponds to higher entropy, and
therefore more bits of information are required for a message to be communicated
over a noisy channel.

1.14. Entropy-based term weights

As we discussed in section 1.1, similarity between a document d and an IR query

text q is calculated as the weighted and normalized dot-product of the two vectors,
 
sim(d, q). For easier reference, we repeat expression (1.2) here:

$$\mathrm{sim}(\vec{d}, \vec{q}) = \frac{\sum_{i=0}^{n-1} w_{id} w_{iq}}{\sqrt{\sum_{i=0}^{n-1} w_{id}^2}\,\sqrt{\sum_{i=0}^{n-1} w_{iq}^2}}, \qquad (1.2\text{ - duplicate})$$

where the weights wij could be binary (1 if term i is present in document j, 0


otherwise) or TF-IDF (see formula 1.3). An alternative is to consider an entropy-
based weighting scheme (Lochbaum and Streeter, 1989):
$$w_{ij} = tf_{ij} \cdot G_i = tf_{ij} \cdot \left(1 + \frac{\sum_j p_{ij} \log_2(p_{ij})}{\log_2(N)}\right) \qquad (1.5)$$
where
tfij = term frequency of term i in document j,
pij = tfij / ni = conditional probability of getting document j given term i,
Gi = entropy of term i,
N = number of documents in the collection, and
ni = term frequency of term i in the entire collection of documents.
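The entropy weight Gi can be computed directly from a term’s per-document frequencies. A Python sketch (illustrative), reproducing the value for “standard” that appears later in Table 1.15:

```python
import math

def entropy_weight_G(doc_tfs, N):
    """G_i = 1 + sum_j p_ij * log2(p_ij) / log2(N), with p_ij = tf_ij / n_i
    (Eq. 1.5). doc_tfs lists tf_ij over the documents containing term i."""
    n_i = sum(doc_tfs)
    return 1 + sum((tf / n_i) * math.log2(tf / n_i) for tf in doc_tfs) / math.log2(N)

# "standard" appears once in 8 of the N = 10 example documents: a very
# common term, so its entropy weight is close to 0.
print(round(entropy_weight_G([1] * 8, 10), 3))  # 0.097
# A term unique to one document gets the maximum weight of 1.
print(round(entropy_weight_G([1], 10), 3))  # 1.0
```

Since log2(p) ≤ 0 for probabilities, the summed term is never positive, and Gi stays in [0, 1]: evenly spread (high-entropy) terms are pushed toward 0, concentrated terms toward 1.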

We will now look into this entropy-based term weighting scheme in detail. The basic
difference between IDF-based and entropy-based weights is multiplying the term
frequencies tfij by Gi instead of idfi, when calculating the raw weights wij. Table 1.12
summarizes all the formulas needed in document similarity calculations, using all
three term weighting schemes we have covered so far: TF-IDF, entropy-based, and
binary.

1.15. Document similarities with entropy weights

Term frequencies and conditional probabilities. Referring back to our H.323
example, Table 1.13 replicates Table 1.2, listing all tokens (after stopword removal)
in the ten documents representing students’ answers to the exam question “What is
H.323 standard?”. Table 1.14 lists the term frequencies tfij for all these tokens
(terms), after they have been alphabetically sorted. These term frequencies are
identical to those listed in Table 1.3, but Table 1.14 also lists the conditional
probabilities pij, which are useful for the entropy calculations.

Table 1.12. Summary of all VSM formulas related to document-query similarities.

Document-Query similarity:

$$\mathrm{sim}(\vec{d}, \vec{q}) = \frac{\sum_{i=0}^{n-1} w_{id} w_{iq}}{\sqrt{\sum_{i=0}^{n-1} w_{id}^2}\,\sqrt{\sum_{i=0}^{n-1} w_{iq}^2}} = \sum_{i=0}^{n-1} w'_{id}\, w'_{iq}$$

Document Weights:
TF-IDF:  wij = tfij * idfi  with  idfi = log2(N / ni)
Entropy: wij = tfij * Gi  with  Gi = 1 + (Σj pij log(pij)) / log(N)  and  pij = tfij / ni
         NOTE: in a ratio of two logarithms the two bases cancel each other out, so
         in the above formula the logs no longer have to be base-2.
Binary:  wij = 1 if term i is present in document j, 0 otherwise

Query Weights:
TF-IDF:  wi = tfi * idfi  with  idfi = log2((N + 1) / (ni + 1))
         NOTE: query formulas are slightly different from the corresponding
         document formulas because the presence of term i in the query increases
         ni by 1 and the addition of the query to the collection increases N by 1.
Entropy: wi = tfi * Gi  with  Gi = 1 + pi log(pi) / log(N + 1)  and  pi = tfi / (ni + 1)
Binary:  wi = 1 if term i is present in the query, 0 otherwise

Where:
wij = raw weight of term i in document j,
wi = raw weight of term i in the query,
w'ij = normalized weight of term i in document j,
w'i = normalized weight of term i in the query,
tfij = term frequency of term i in document j,
tfi = term frequency of term i in the query,
idfi = inverse document frequency of term i,
pij = conditional probability of getting document j given term i,
Gi = entropy of term i,
N = number of documents in the collection, and
ni = term frequency of term i in the entire collection of documents.
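The query-side formulas can be checked numerically. A Python sketch (illustrative), reproducing two of the idfi values in Table 1.8:

```python
import math

def query_idf(N, n_i):
    """Query-side inverse document frequency, log2((N + 1) / (n_i + 1)):
    the query itself adds one document and one occurrence of the term."""
    return math.log2((N + 1) / (n_i + 1))

# "audio" occurs in 2 of the N = 10 example documents:
print(round(query_idf(10, 2), 4))  # 1.8745
# "standard" occurs in 8 of them:
print(round(query_idf(10, 8), 4))  # 0.2895
```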

Table 1.13. Tokenization of the ten documents for the H.323 question example.
docID  Tokens
0      clue
1      level, acceptance
2      standard, remote, meeting, protocol
3      standard, videoconferencing
4      h.323, internet, standard, allows, audio, sent, received, style,
       telephone, conversation
5      standard, videoconferencing, sending, streaming, audio, video, internet
6      standard, videoconferencing, transmissions
7      standard, provided, iso, related, project, management
8      standard, videoconferencing, transmissions
9      standard, network, connection

Table 1.14. Vector of merged tokens with the term frequency of term i in document j.
Term docID tfij pij Term docID tfij pij
acceptance 1 1 1 sending 5 1 1
allows 4 1 1 sent 4 1 1
audio 4 1 0.5 standard 2 1 0.125
audio 5 1 0.5 standard 3 1 0.125
clue 0 1 1 standard 4 1 0.125
connection 9 1 1 standard 5 1 0.125
conversation 4 1 1 standard 6 1 0.125
h.323 4 1 1 standard 7 1 0.125
internet 4 1 0.5 standard 8 1 0.125
internet 5 1 0.5 standard 9 1 0.125
iso 7 1 1 streaming 5 1 1
level 1 1 1 style 4 1 1
management 7 1 1 telephone 4 1 1
meeting 2 1 1 transmissions 6 1 0.5
network 9 1 1 transmissions 8 1 0.5
project 7 1 1 video 5 1 1
protocol 2 1 1 videoconferencing 3 1 0.25
provided 7 1 1 videoconferencing 5 1 0.25
received 4 1 1 videoconferencing 6 1 0.25
related 7 1 1 videoconferencing 8 1 0.25
remote 2 1 1

Entropies. Table 1.15 shows the calculation of the entropy Gi for each term used in
the H.323 question example. Each term’s frequency in the entire document
collection, ni, is also shown. As with the TF-IDF weights, terms that are more
frequent in the collection are penalized with a smaller Gi.

Term weights. Combining the term frequencies from Table 1.14, the entropies from
Table 1.15, and the normalization denominators from Table 1.16, we get the term
weights shown in Table 1.17 for all term-document combinations in the H.323
question example. The term “clue” in document 0 has a very high normalized weight
(w’ij = 1) because it is a rare term in a short document. On the other hand, the term
“standard” in document 4 has a very low normalized weight (w’ij = 0.0343) because it
is a common term in a longer document. Note that the normalized term weights
calculated here turned out to be the same as the normalized weights calculated using
IDFs (compare Table 1.17 with Table 1.6), but this is not always the case.

Table 1.15. The entropies Gi for each term i.


Term i ni Gi
acceptance 1 1.000
allows 1 1.000
audio 2 0.699
clue 1 1.000
connection 1 1.000
conversation 1 1.000
h.323 1 1.000
internet 2 0.699
iso 1 1.000
level 1 1.000
management 1 1.000
meeting 1 1.000
network 1 1.000
project 1 1.000
protocol 1 1.000
provided 1 1.000
received 1 1.000
related 1 1.000
remote 1 1.000
sending 1 1.000
sent 1 1.000
standard 8 0.097
streaming 1 1.000
style 1 1.000
telephone 1 1.000
transmissions 2 0.699
video 1 1.000
videoconferencing 4 0.398

Table 1.16. Term weight normalization denominators for each document d.
docID norm denom
0 1
1 1.414213562
2 1.734759796
3 0.409570252
4 2.826041343
5 2.035894377
6 0.810127677
7 2.238167006
8 0.810127677
9 1.417530087

Table 1.17. Normalized term weights.


doc Norm doc Norm
Term i j tfij Gi wij (wij) Term i j tfij Gi wij (wij)
acceptance 1 1 1.000 1.000 0.7071 sending 5 1 1.000 1.000 0.4912
allows 4 1 1.000 1.000 0.3539 sent 4 1 1.000 1.000 0.3539
audio 4 1 0.699 0.699 0.2473 standard 2 1 0.097 0.097 0.0559
audio 5 1 0.699 0.699 0.3433 standard 3 1 0.097 0.097 0.2366
clue 0 1 1.000 1.000 1.0000 standard 4 1 0.097 0.097 0.0343
connection 9 1 1.000 1.000 0.7055 standard 5 1 0.097 0.097 0.0476
conversation 4 1 1.000 1.000 0.3539 standard 6 1 0.097 0.097 0.1196
h.323 4 1 1.000 1.000 0.3539 standard 7 1 0.097 0.097 0.0433
internet 4 1 0.699 0.699 0.2473 standard 8 1 0.097 0.097 0.1196
internet 5 1 0.699 0.699 0.3433 standard 9 1 0.097 0.097 0.0684
iso 7 1 1.000 1.000 0.4468 streaming 5 1 1.000 1.000 0.4912
level 1 1 1.000 1.000 0.7071 style 4 1 1.000 1.000 0.3539
management 7 1 1.000 1.000 0.4468 telephone 4 1 1.000 1.000 0.3539
meeting 2 1 1.000 1.000 0.5764 transmissions 6 1 0.699 0.699 0.8628
network 9 1 1.000 1.000 0.7055 transmissions 8 1 0.699 0.699 0.8628
project 7 1 1.000 1.000 0.4468 video 5 1 1.000 1.000 0.4912
protocol 2 1 1.000 1.000 0.5764 videoconferencing 3 1 0.398 0.398 0.9716
provided 7 1 1.000 1.000 0.4468 videoconferencing 5 1 0.398 0.398 0.1955
received 4 1 1.000 1.000 0.3539 videoconferencing 6 1 0.398 0.398 0.4912
related 7 1 1.000 1.000 0.4468 videoconferencing 8 1 0.398 0.398 0.4912
remote 2 1 1.000 1.000 0.5764

Dictionary and postings. The inverted file structure for our H.323 question
example, consisting of the dictionary and the postings table, is shown in Table 1.18.
Since the normalized weights turned out to be the same as in the TF-IDF case, Table
1.18 is identical to Table 1.7.

Table 1.18. Dictionary and Postings for the H.323 question example.
DICTIONARY POSTINGS
word docCount tfi postIndex docID tfij normWeight
acceptance 1 1 0 0 1 1 0.7071
allows 1 1 1 1 4 1 0.3539
audio 2 2 2 2 4 1 0.2473
clue 1 1 4 3 5 1 0.3433
connection 1 1 5 4 0 1 1.0000
conversation 1 1 6 5 9 1 0.7055
h.323 1 1 7 6 4 1 0.3539
internet 2 2 8 7 4 1 0.3539
iso 1 1 10 8 4 1 0.2473
level 1 1 11 9 5 1 0.3433
management 1 1 12 10 7 1 0.4468
meeting 1 1 13 11 1 1 0.7071
network 1 1 14 12 7 1 0.4468
project 1 1 15 13 2 1 0.5764
protocol 1 1 16 14 9 1 0.7055
provided 1 1 17 15 7 1 0.4468
received 1 1 18 16 2 1 0.5764
related 1 1 19 17 7 1 0.4468
remote 1 1 20 18 4 1 0.3539
sending 1 1 21 19 7 1 0.4468
sent 1 1 22 20 2 1 0.5764
standard 8 8 23 21 5 1 0.4912
streaming 1 1 31 22 4 1 0.3539
style 1 1 32 23 2 1 0.0559
telephone 1 1 33 24 3 1 0.2366
transmissions 2 2 34 25 4 1 0.0343
video 1 1 36 26 5 1 0.0476
videoconferencing 4 4 37 27 6 1 0.1196
28 7 1 0.0433
29 8 1 0.1196
30 9 1 0.0684
31 5 1 0.4912
32 4 1 0.3539
33 4 1 0.3539
34 6 1 0.8628
35 8 1 0.8628
36 5 1 0.4912
37 3 1 0.9716
38 5 1 0.1955
39 6 1 0.4912
40 8 1 0.4912

Performing a Query. Let us refer again to the H.323 question example and consider
answer A2. Following the formulas presented earlier in this chapter, one can
calculate the normalized term weights for all tokens included in this query. Table
1.19 shows the normalized weight calculations for all terms in query A2. Using these
normalized weights w'iq, where i is the query term index, the similarity between a
document d and the query q is calculated in Table 1.20. Once again we use the notion
of an accumulator, since the similarities are sums of products. Table 1.20 is
somewhat similar, but not identical, to the TF-IDF-based Table 1.9. Both indicate
that the most relevant document is 5, followed by 4, 3, 6, and 8, but using entropies
document 3 ranks higher than document 4. The grades awarded by the human expert
(1.5 and 2, respectively) favor the TF-IDF approach, but there are many text
evaluation cases where the entropy-based approach fares better.

Table 1.19. Normalized query term weights for the H.323 question example.
Query Term i Gi tfi pi wi w'i
audio 0.8473 1 0.333 0.8473 0.3482
h.323 0.8555 1 0.5 0.8555 0.3516
internet 0.8473 1 0.333 0.8473 0.3482
sending 0.8555 1 0.5 0.8555 0.3516
standard 0.8982 1 0.111 0.8982 0.3692
streaming 0.8555 1 0.5 0.8555 0.3516
video 0.8555 1 0.5 0.8555 0.3516
videoconferencing 0.8658 1 0.2 0.8658 0.3558

Table 1.20. Accumulator components for the similarity calculations.
                               Similarity to document:
Query Term i         0     1     2     3     4     5     6     7     8     9
audio                                        0.09  0.12
h.323                                        0.12
internet                                     0.09  0.12
sending                                            0.17
standard                         0.02  0.09  0.01  0.02  0.04  0.02  0.04  0.03
streaming                                          0.17
video                                              0.17
videoconferencing                      0.35        0.07  0.17        0.17
Cumulative
similarities       0.00  0.00  0.02  0.43  0.31  0.84  0.22  0.02  0.22  0.03

1.16. Porter Stemmer

Table 1.21 shows the tokens in the H.323 question example after affix removal was
applied to them using the Porter stemmer algorithm [Porter 80]. In this example the
effect of prefixes, suffixes, plurals, etc. is minimal, and the document similarities
turn out to be identical even after the stemming operation is applied. However, term
stemming, and term conflation in general, are established text mining methods that
work well in a number of text mining applications.

Table 1.21. Token stems after application of Porter stemmer.


docID Token stems
0 clue
1 accept, level
2 meet, protocol, remot, standard
3 standard, videoconferenc
4 allow, audio, convers, h323, internet, receiv, sent, standard, style, telephon
5 audio, internet, send, standard, stream, video, videoconferenc
6 standard, transmis, videoconferenc
7 iso, manag, project, provid, relat, standard
8 standard, transmis, videoconferenc
9 connec, network, standard
Query audio, h323, internet, send, standard, stream, video, videoconferenc

1.17. Specification of a VSM Class

Figure 1.3 shows the class diagram of a VSM class. Only selected public methods
are shown. The source code of a Java implementation of this class is posted on our
course Web site.

VSM
- docVec: Vector
- dictVec: Vector
- postVec: Vector
- stemmer: Porter
- docWeightingMethod: String
- queryWeightingMethod: String
- activatePorterStemmer: Boolean
- eliminateUniqueTerms: Boolean
+ create(): VSM
+ loadStopwords (filename: String)
+ loadBlockwords (filename: String)
+ setDocumentsVector (inputVec: Vector)
+ setDocWeightingMethod (m: String)
+ setQueryWeightingMethod (m: String)
+ setEliminateUniqueTerms (p: Boolean)
+ setActivatePorterStemmer (p: Boolean)
+ createDictPost()
+ createXmatrix()
+ getXMatrix(): Double[][]
+ performOneQuery (queryStr: String)
+ getOneQueryResultsVec(): Vector
+ setQueryVector (queryVec: Vector)
+ createQmatrix()
+ getQMatrix(): Double[][]

Fig. 1.3. Class Diagram of VSM class.

References
Frakes, W. B., and R. Baeza-Yates, Eds. (1992). Information Retrieval: Data Structures and
    Algorithms. Prentice-Hall, New Jersey.
Lochbaum, K. E., and L. A. Streeter (1989). Comparing and Combining the Effectiveness of Latent
    Semantic Indexing and the Ordinary Vector Space Model for Information Retrieval.
    Information Processing and Management, 25(6), 665-676.
Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program, 14(3), 130-137.
Riloff, E. (1995). Little Words Can Make a Big Difference for Text Classification. In Proc. 18th Int.
    ACM SIGIR Conf. on Research and Development in Information Retrieval, 130-136.
Salton, G., and C. Buckley (1988). Term-Weighting Approaches in Automatic Text Retrieval.
    Information Processing and Management, 24(5), 513-523.
Salton, G., and M. E. Lesk (1968). Computer Evaluation of Indexing and Text Processing. Journal
    of the Association for Computing Machinery, 15(1), 8-36.
Salton, G., H. Wu, and C. T. Yu (1981). The Measurement of Term Importance in Automatic
    Indexing. Journal of the American Society for Information Science, 32(3), 175-186.
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical
    Journal, 27, 379-423 and 623-656.
