
Automatic Indexing

Automatic Text Processing


by G. Salton, Addison-Wesley, 1989.

1
Indexing

 indexing: assign identifiers to text documents.


 assign: manual vs. automatic indexing
 identifiers:
» objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher names, dates
of publication, …
» controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, …
» single terms vs. term phrases

2
Two Issues

 Issue 1: indexing exhaustivity


» exhaustive: assign a large number of terms
» nonexhaustive: assign fewer terms
 Issue 2: term specificity
» broad terms (generic)
cannot distinguish relevant from nonrelevant documents
» narrow terms (specific)
retrieve relatively fewer documents, but most of them are relevant

3
Parameters of
retrieval effectiveness

 Recall
R = (number of relevant items retrieved) / (total number of relevant items in collection)
 Precision
P = (number of relevant items retrieved) / (total number of items retrieved)
 Goal
high recall and high precision

4
Contingency table of the retrieved part of the collection:
a = relevant items retrieved, b = nonrelevant items retrieved,
c = nonrelevant items not retrieved, d = relevant items not retrieved

Recall = a / (a + d)          Precision = a / (a + b)
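
As an aside (not part of the original slides), a minimal Python sketch that computes
recall and precision from the contingency counts above; the variable names follow the table.

def recall_precision(a, b, d):
    # a: relevant retrieved, b: nonrelevant retrieved, d: relevant not retrieved
    recall = a / (a + d)       # fraction of all relevant items that were retrieved
    precision = a / (a + b)    # fraction of retrieved items that are relevant
    return recall, precision

# Example: 30 relevant retrieved, 20 nonrelevant retrieved, 70 relevant missed
print(recall_precision(30, 20, 70))   # -> (0.3, 0.6)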
A Joint Measure

 F-score

F = ((β² + 1) · P · R) / (β² · P + R)

 β is a parameter that encodes the relative importance of recall and
precision.
β = 1: equal weight
β < 1: precision is more important
β > 1: recall is more important

6
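
A similarly minimal sketch of the F-score, assuming precision and recall have already
been computed as above:

def f_score(precision, recall, beta=1.0):
    # F = ((beta^2 + 1) * P * R) / (beta^2 * P + R)
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return ((b2 + 1) * precision * recall) / (b2 * precision + recall)

print(f_score(0.6, 0.3))             # beta = 1: harmonic mean of P and R -> 0.4
print(f_score(0.6, 0.3, beta=2.0))   # beta > 1: recall weighted more heavily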
Choices of Recall and Precision

 Both recall and precision vary from 0 to 1.


 In principle, the average user wants to achieve both
high recall and high precision.
 In practice, a compromise must be reached because
simultaneously optimizing recall and precision is not
normally achievable.

7
Choices of Recall and Precision (Continued)

 Particular choices of indexing and search policies


have produced variations in performance ranging
from 0.8 precision and 0.2 recall to 0.1 precision and
0.8 recall.
 In many circumstances, recall and precision values
between 0.5 and 0.6 are more satisfactory for the
average user.

8
Term-Frequency Consideration

 Function words
» for example, "and", "or", "of", "but", …
» the frequencies of these words are high in all texts
 Content words
» words that actually relate to document content
» varying frequencies in the different texts of a collection
» their frequencies indicate term importance for content

9
A Frequency-Based Indexing Method

 Eliminate common function words from the document texts by


consulting a special dictionary, or stop list, containing a list of
high frequency function words.
 Compute the term frequency tfij for all remaining terms Tj in
each document Di, specifying the number of occurrences of Tj in
Di.
 Choose a threshold frequency T, and assign to each document
Di all terms Tj for which tfij > T.

10
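
A rough sketch of this procedure in Python; the stop list and the threshold value are
illustrative placeholders, not values prescribed by the slides.

from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "for"}   # illustrative stop list

def frequency_index(document_text, threshold=1):
    # Assign to the document every term whose frequency exceeds the threshold,
    # after removing stop-list words.
    terms = [w.lower() for w in document_text.split() if w.lower() not in STOP_LIST]
    tf = Counter(terms)
    return {term for term, freq in tf.items() if freq > threshold}

print(frequency_index("retrieval of information and retrieval of documents"))
# -> {'retrieval'}   (its frequency of 2 exceeds the threshold of 1)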
Inverse Document Frequency

 Inverse Document Frequency (IDF) for term Tj

idfj = log(N / dfj)

where dfj (the document frequency of term Tj) is the
number of documents in which Tj occurs.
» terms that occur frequently in individual documents but rarely in the
remainder of the collection fulfil both the recall and the precision requirements

11
New Term Importance Indicator

 weight wij of a term Tj in a document Di

wij = tfij × log(N / dfj)
 Eliminating common function words
 Computing the value of wij for each term Tj in each document Di
 Assigning to the documents of a collection all terms with
sufficiently high (tf x idf) factors

12
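
A small illustrative sketch of the tf × idf weighting over a toy, already-tokenized collection:

import math
from collections import Counter

docs = [["information", "retrieval", "retrieval"],
        ["text", "indexing"],
        ["information", "indexing", "text"]]          # toy collection

N = len(docs)
df = Counter(term for d in docs for term in set(d))   # document frequency of each term

def tfidf(doc):
    # w_ij = tf_ij * log(N / df_j) for every term in the document
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))
# "retrieval" (tf = 2, df = 1) receives the highest weight; "information" (df = 2) a lower one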
Term-discrimination Value

 Useful index terms


» distinguish the documents of a collection from each other
 Document Space
» when two documents are assigned very similar term sets, the
corresponding points in the document configuration appear close
together
» when a high-frequency term without discrimination is assigned, it
will increase the document space density

13
A Virtual Document Space

Figure: a document space in its original state, after assignment of a good
discriminator (documents move apart), and after assignment of a poor
discriminator (documents crowd together)

14
Good Term Assignment

 When a term is assigned to the documents of a


collection, the few objects to which the term is
assigned will be distinguished from the rest of the
collection.
 This should increase the average distance between
the objects in the collection and hence produce a
document space less dense than before.

15
Poor Term Assignment

 A high frequency term is assigned that does not


discriminate between the objects of a collection.
 Its assignment will render the documents more similar.
 This is reflected in an increase in document space
density.

16
Term Discrimination Value

 definition
dvj = Q - Qj
where Q and Qj are the space densities before and
after the assignment of term Tj:

Q = (1 / (N(N-1))) · Σi=1..N Σk=1..N, k≠i sim(Di, Dk)

 dvj > 0: Tj is a good term;
dvj < 0: Tj is a poor term.

17
Variations of Term-Discrimination Value
with Document Frequency

Variation of dvj along the document-frequency axis (0 to N):
Low frequency: dvj = 0     Medium frequency: dvj > 0     High frequency: dvj < 0

18
Another Term Weighting

 wij = tfij × dvj

 compared with wij = tfij × log(N / dfj)
» log(N / dfj): decreases steadily with increasing document
frequency
» dvj: increases from zero to positive as the document
frequency of the term increases, then
decreases sharply as the document frequency
becomes still larger

19
Document Centroid

 Issue: efficiency problem
N(N-1) pairwise similarities

 document centroid C = (c1, c2, c3, ..., ct)

cj = Σi=1..N wij

where wij is the weight of the j-th term in document i.

 space density

Q = (1/N) · Σi=1..N sim(C, Di)

20
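
An illustrative sketch of the centroid-based density computation and the resulting
discrimination value; it assumes cosine similarity and a dense document-by-term weight matrix.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(docs):
    # Q = (1/N) * sum_i sim(C, D_i), with the centroid C summed over all documents
    n_terms = len(docs[0])
    centroid = [sum(d[j] for d in docs) for j in range(n_terms)]
    return sum(cosine(centroid, d) for d in docs) / len(docs)

def discrimination_value(docs, j):
    # dv_j = Q - Q_j: density without term j minus density with term j assigned
    without_j = [[w for k, w in enumerate(d) if k != j] for d in docs]
    return space_density(without_j) - space_density(docs)

weights = [[2.0, 1.0, 0.0], [2.0, 0.0, 1.0], [2.0, 1.0, 1.0]]   # term 0 occurs uniformly everywhere
print(discrimination_value(weights, 0))   # negative: a poor, high-frequency discriminator
print(discrimination_value(weights, 2))   # positive: a better, more selective term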
Discussions

 dvj and idfj


global properties of terms in a document collection
 ideal
term characteristics that distinguish the relevant from the
nonrelevant documents of a collection

21
Probabilistic Term Weighting

 Goal
Explicit distinctions between occurrences of terms in relevant
and nonrelevant documents of a collection
 Definition
Given a user query q, and the ideal answer set of the relevant
documents
 From decision theory, the best ranking algorithm

g(x) = log [Pr(x|rel) / Pr(x|nonrel)] + log [Pr(rel) / Pr(nonrel)]

22
Probabilistic Term Weighting

 Pr(x|rel), Pr(x|nonrel):
occurrence probabilities of document x in the relevant and
nonrelevant document sets
 Pr(rel), Pr(nonrel):
document’s a priori probabilities of relevance and nonrelevance
 Further assumptions
Terms occur independently in relevant documents;
terms occur independently in nonrelevant documents.

23
Pr(x|rel) = Πi=1..t Pr(xi|rel)

Pr(x|nonrel) = Πi=1..t Pr(xi|nonrel)

g(x) = Σi=1..t log [Pr(xi|rel) / Pr(xi|nonrel)] + constants
Derivation Process

g(x) = log [Pr(x|rel) / Pr(x|nonrel)] + log [Pr(rel) / Pr(nonrel)]

     = log [ Πi=1..t Pr(xi|rel) / Πi=1..t Pr(xi|nonrel) ] + constants

     = Σi=1..t log [Pr(xi|rel) / Pr(xi|nonrel)] + constants

25
Given a document D = (d1, d2, …, dt),
the retrieval value of D is:

g(D) = Σi=1..t log [Pr(xi = di|rel) / Pr(xi = di|nonrel)] + constants

where di is the term weight of term xi.


Assume di is either 0 or 1.
0: the i-th term is absent from D.
1: the i-th term is present in D.

Pr(xi = di|rel) = pi^di · (1 - pi)^(1 - di)
Pr(xi = di|nonrel) = qi^di · (1 - qi)^(1 - di)

where pi = Pr(xi = 1|rel), 1 - pi = Pr(xi = 0|rel),
qi = Pr(xi = 1|nonrel), 1 - qi = Pr(xi = 0|nonrel)
g(D) = Σi=1..t log [Pr(xi = di|rel) / Pr(xi = di|nonrel)] + constants

     = Σi=1..t log [ pi^di (1 - pi)^(1 - di) / ( qi^di (1 - qi)^(1 - di) ) ] + constants

     = Σi=1..t log [ (pi(1 - qi))^di (1 - pi) / ( (qi(1 - pi))^di (1 - qi) ) ] + constants

g(D) = Σi=1..t log [(1 - pi)/(1 - qi)] + Σi=1..t di · log [ pi(1 - qi) / (qi(1 - pi)) ] + constants

The retrieval value of each term Tj present in a document
(i.e., dj = 1) is its term-relevance weight:

trj = log [ pj(1 - qj) / (qj(1 - pj)) ]
New Term Weighting

 term-relevance weight of term Tj: trj

 indexing value of term Tj in document Di:
wij = tfij · trj
 Issue
It is necessary to characterize both the relevant and nonrelevant
documents of a collection.
How can a representative document sample be found? Use feedback
information from retrieved documents?

30
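
A hedged sketch of the term-relevance weight and the resulting indexing weight; the
probabilities pj and qj are assumed to be known here, whereas in practice they must be
estimated, as the following slides discuss.

import math

def term_relevance_weight(p, q):
    # tr_j = log( p_j (1 - q_j) / ( q_j (1 - p_j) ) )
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))

def probabilistic_weight(tf, p, q):
    # w_ij = tf_ij * tr_j
    return tf * term_relevance_weight(p, q)

# A term occurring in 80% of relevant but only 5% of nonrelevant documents
print(probabilistic_weight(tf=3, p=0.8, q=0.05))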
Estimation of Term-Relevance

 Little is known about the relevance properties of terms.

» The occurrence probability of a term in the nonrelevant documents
qj is approximated by the occurrence probability of the term in the
entire document collection:
qj = dfj / N
» The occurrence probabilities of the terms in the small number of
relevant documents are taken to be equal, using a constant value pj = 0.5
for all j.

31
trj = log [ pj(1 - qj) / (qj(1 - pj)) ]
    = log [ 0.5 · (1 - dfj/N) / ((dfj/N) · 0.5) ]
    = log [ (N - dfj) / dfj ]

When N is sufficiently large, N - dfj ≈ N, so

trj = log [ (N - dfj) / dfj ] ≈ log (N / dfj) = idfj

32
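
A small numeric check (illustrative only) that with pj = 0.5 and qj = dfj / N the
term-relevance weight reduces to log((N - dfj) / dfj), which is close to idfj when N is large:

import math

N, df = 100000, 50
p, q = 0.5, df / N                             # constant p_j; q_j estimated from the collection
tr = math.log((p * (1 - q)) / (q * (1 - p)))   # = log((N - df) / df)
print(tr, math.log((N - df) / df), math.log(N / df))   # the three values nearly coincide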
Estimation of Term-Relevance

 Estimate the number of relevant documents rj in the collection

that contain term Tj as a function of the known document
frequency dfj of the term Tj:
pj = rj / R
qj = (dfj - rj) / (N - R)
R: an estimate of the total number of relevant documents in the
collection.

33
Term Relationships in Indexing

 Single-term indexing
» Single terms are often ambiguous.
» Many single terms are either too specific or too broad to be
useful.
 Complex text identifiers
» subject experts and trained indexers
» linguistic analysis algorithms, e.g., NP chunker
» term-grouping or term clustering methods

34
Tree-Dependence Model

 Only certain dependent term pairs are actually included, the


other term pairs and all higher-order term combinations being
disregarded.
 Example:
sample term-dependence tree

35
(children, school), (school, girls), (school, boys): dependent pairs included in the tree
(children, girls), (children, boys): term pairs disregarded
(school, girls, boys), (children, achievement, ability): higher-order combinations disregarded

36
Term Classification (Clustering)

      T1    T2    T3   ...   Tt
D1    d11   d12   ...        d1t
D2    d21   d22   ...        d2t
...
Dn    dn1   dn2   ...        dnt
37
Term Classification (Clustering)

 Column part
Group terms whose corresponding column representations reveal
similar assignments to the documents of the collection.
 Row part
Group documents that exhibit sufficiently similar term
assignments.

38
Linguistic Methodologies

 Indexing phrases:
nominal constructions including adjectives and nouns
» Assign syntactic class indicators (i.e., part of speech) to the words
occurring in document texts.
» Construct word phrases from sequences of words exhibiting certain
allowed syntactic markers (noun-noun and adjective-noun
sequences).

39
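
A rough sketch of the phrase-construction step, assuming the text has already been
tagged with parts of speech; the tag names and the example tokens are made up for illustration.

def candidate_phrases(tagged_tokens):
    # Collect adjacent noun-noun and adjective-noun pairs as candidate index phrases.
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in {("NOUN", "NOUN"), ("ADJ", "NOUN")}:
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("effective", "ADJ"), ("retrieval", "NOUN"), ("systems", "NOUN"),
          ("are", "VERB"), ("essential", "ADJ")]
print(candidate_phrases(tagged))   # -> ['effective retrieval', 'retrieval systems']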
Term-Phrase Formation

 Term Phrase
a sequence of related text words that carries a more specific meaning
than the single terms
e.g., “computer science” vs. “computer”

Figure: along the document-frequency axis (0 to N), thesaurus transformation
applies to low-frequency terms and phrase transformation to high-frequency terms
40
Simple Phrase-Formation Process

 the principal phrase component (phrase head)


a term with a document frequency exceeding a stated threshold,
or exhibiting a negative discriminator value
 the other components of the phrase
medium- or low-frequency terms with stated co-occurrence
relationships with the phrase head
 common function words
not used in the phrase-formation process

41
An Example

 Effective retrieval systems are essential for people in


need of information.
» “are”, “for”, “in” and “of”:
common function words
» “system”, “people”, and “information”:
phrase heads

42
The Formatted Term-Phrases
effective retrieval systems essential people need information

Phrase heads and components            Phrase heads and components
must be adjacent                       co-occur in the sentence
1. retrieval system*                   6. effective systems
2. systems essential                   7. systems need
3. essential people                    8. effective people
4. people need                         9. retrieval people
5. need information*                   10. effective information*
                                       11. retrieval information*
                                       12. essential information*
useful: 2/5                            useful: 5/12
*: phrases assumed to be useful for content identification
43
The Problems

 A phrase-formation process controlled only by word co-


occurrences and the document frequencies of certain words is
not likely to generate a large number of high-quality phrases.
 Additional syntactic criteria for phrase heads and phrase
components may provide further control in phrase formation.

44
Additional Term-Phrase Formation Steps

 Syntactic class indicator are assigned to the terms, and phrase


formation is limited to sequences of specified syntactic markers,
such as adjective-noun and noun-noun sequences.
(adverb-adjective and adverb-noun sequences are excluded)
 The phrase elements are all chosen from within the same
syntactic unit, such as subject phrase, term phrase, and verb
phrase.

45
Consider Syntactic Unit

 effective retrieval systems are essential for people in


need of information
 subject phrase
» effective retrieval systems
 verb phrase
» are essential
 term phrase
» people in need of information

46
Phrases within Syntactic Components
[subj effective retrieval systems] [vp are essential ]
for [obj people need information]
 Adjacent phrase heads and components within syntactic
components
» retrieval systems*
» people need
» need information*
useful: 2/3
 Phrase heads and components co-occur within syntactic
components
» effective systems

47
Problems

 More stringent phrase formation criteria produce


fewer phrases, both good and bad, than less stringent
methodologies.
 Prepositional phrase attachment, e.g.,
The man saw the girl with the telescope.
 Anaphora resolution
He dropped the plate on his foot and broke it.

48
Problems (Continued)

 Any phrase matching system must be able to deal with the


problems of
» synonym recognition
» differing word orders
» intervening extraneous words
 Example
» retrieval of information vs. information retrieval

49
Equivalent Phrase Formulation

 Base form: text analysis system


 Variants:
» system analyzes the text
» text is analyzed by the system
» system carries out text analysis
» text is subjected to system analysis
 Related term substitution
» text: documents, information documents
» analysis: processing, transformation, manipulation
» system: program, process

50
Thesaurus-Group Generation

 Thesaurus transformation
» broadens index terms whose scope is too narrow to be useful in
retrieval
» a thesaurus must assemble groups of related specific terms under
more general, higher-level class indicators

Figure: document-frequency axis from 0 to N, with thesaurus transformation applied at the
low-frequency end and phrase transformation at the high-frequency end:
Low frequency: dvj = 0     Medium frequency: dvj > 0     High frequency: dvj < 0

51
Sample Classes of Roget’s Thesaurus

Class Indicator    Entries
760                permission, leave, sanction, allowance, tolerance, authorization
761                prohibition, veto, disallowance, injunction, ban, taboo
762                consent, acquiescence, compliance, agreement, acceptance
763                offer, presentation, tender, overture, advance, submission,
                   proposal, proposition, invitation
764                refusal, declining, noncompliance, rejection, denial

52
Methods of Thesaurus Construction

 dij: value of term Tj in document Di


 sim(Tj, Tk): similarity measure between Tj and Tk
 possible measures
sim(Tj, Tk) = Σi=1..n dij · dik

sim(Tj, Tk) = Σi=1..n dij · dik / sqrt( Σi=1..n (dij)² · Σi=1..n (dik)² )

53
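
An illustrative sketch of the two term-term similarity measures, treating the columns of
the term-document matrix as vectors; a dense matrix of weights dij is assumed.

import math

def dot_similarity(matrix, j, k):
    # sim(T_j, T_k) = sum_i d_ij * d_ik over the document rows
    return sum(row[j] * row[k] for row in matrix)

def normalized_similarity(matrix, j, k):
    # Dot product normalized by the column magnitudes (cosine form)
    num = dot_similarity(matrix, j, k)
    den = math.sqrt(sum(row[j] ** 2 for row in matrix) *
                    sum(row[k] ** 2 for row in matrix))
    return num / den if den else 0.0

matrix = [[1, 0, 2], [0, 1, 1], [3, 0, 1]]   # rows: documents D1..D3, columns: terms T1..T3
print(dot_similarity(matrix, 0, 2), round(normalized_similarity(matrix, 0, 2), 3))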
Term-classification strategies

 single-link
Each term must have a similarity exceeding a stated threshold value
with at least one other term in the same class.
Produces fewer, much larger term classes.
 complete-link
Each term has a similarity to all other terms in the same class that
exceeds the threshold value.
Produces a larger number of small classes.

54
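
A rough sketch of single-link term classification, implemented as connected components of a
thresholded similarity graph; complete-link would instead require every pair inside a class to
exceed the threshold. The function and variable names are illustrative.

def single_link_classes(terms, sim, threshold):
    # Group terms into connected components of the graph {(j, k): sim(j, k) > threshold}
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t

    for j in terms:
        for k in terms:
            if j != k and sim(j, k) > threshold:
                parent[find(j)] = find(k)   # merge the two components

    classes = {}
    for t in terms:
        classes.setdefault(find(t), set()).add(t)
    return list(classes.values())

pairs = {("text", "document"): 0.7, ("document", "text"): 0.7, ("query", "request"): 0.8}
sim = lambda j, k: pairs.get((j, k), 0.0)
print(single_link_classes(["text", "document", "query", "request"], sim, 0.5))
# two classes: {text, document} and {query, request}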
The Indexing Prescription (1)

 Identify the individual words in the document collection.


 Use a stop list to delete from the texts the function words.
 Use a suffix-stripping routine to reduce each remaining word to word-
stem form.
 For each remaining word stem Tj in document Di, compute wij.
 Represent each document Di by
Di=(T1, wi1; T2, wi2; …, Tt, wit)

55
Word Stemming

 effectiveness --> effective --> effect


 picnicking --> picnic
 king -\-> k

56
Some Morphological Rules

 Restore a silent e after suffix removal from certain words to


produce “hope” from “hoping” rather than “hop”
 Delete certain doubled consonants after suffix removal, so as to
generate “hop” from “hopping” rather than “hopp”.
 Use a final y for an i in forms such as “easier”, so as to generate
“easy” instead of “easi”.

57
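
A toy sketch of these repair rules applied after naive suffix stripping; it covers only the
three cases mentioned here, unlike a real stemming algorithm.

VOWELS = set("aeiou")

def crude_stem(word):
    # Strip a few suffixes, then apply the repair rules above (toy example only).
    for suffix in ("ing", "er", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            if stem.endswith("i"):                                    # easier -> easi -> easy
                return stem[:-1] + "y"
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                return stem[:-1]                                      # hopping -> hopp -> hop
            if stem[-1] not in VOWELS and stem[-2] in VOWELS and len(stem) <= 3:
                return stem + "e"                                     # hoping -> hop -> hope
            return stem
    return word

print(crude_stem("hoping"), crude_stem("hopping"), crude_stem("easier"))   # hope hop easy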
The Indexing Prescription (2)

 Identify individual text words.


 Use stop list to delete common function words.
 Use automatic suffix stripping to produce word stems.
 Compute term-discrimination value for all word stems.
 Use thesaurus class replacement for all low-frequency terms with
discrimination values near zero.
 Use phrase-formation process for all high-frequency terms with
negative discrimination values.
 Compute weighting factors for complex indexing units.
 Assign to each document single term weights, term phrases, and
thesaurus classes with weights.

58
Query vs. Document

 Differences
» Query texts are short.
» Fewer terms are assigned to queries.
» The occurrence of query terms rarely exceeds 1.

Q=(wq1, wq2, …, wqt) where wqj: inverse document frequency


Di=(di1, di2, …, dit)
where dij: term frequency*inverse document frequency
sim(Q, Di) = Σj=1..t wqj · dij

59
Query vs. Document

 When non-normalized documents are used, the longer documents with


more assigned terms have a greater chance of matching particular
query terms than do the shorter document vectors.
sim(Q, Di) = Σj=1..t wqj · dij / Σj=1..t (dij)²

or

sim(Q, Di) = Σj=1..t wqj · dij / sqrt( Σj=1..t (dij)² · Σj=1..t (wqj)² )

60
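
An illustrative sketch of the unnormalized and cosine-normalized query-document
similarities; the query and document vectors are plain lists over the same term order.

import math

def sim_unnormalized(q, d):
    # Simple inner product of query and document weights
    return sum(wq * wd for wq, wd in zip(q, d))

def sim_cosine(q, d):
    # Inner product divided by the vector lengths, removing the advantage of long documents
    num = sim_unnormalized(q, d)
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return num / den if den else 0.0

query = [1.2, 0.0, 2.3]   # idf-style query weights
doc = [0.4, 1.1, 0.9]     # tf*idf document weights
print(sim_unnormalized(query, doc), round(sim_cosine(query, doc), 3))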
Relevance Feedback

 Terms present in previously retrieved documents that have been


identified as relevant to the user’s query are added to the original
formulations.
 The weights of the original query terms are altered by replacing the
inverse document frequency portion of the weights with term-relevance
weights obtained by using the occurrence characteristics of the terms in
the previously retrieved relevant and nonrelevant documents of the
collection.

61
Relevance Feedback
 Q = (wq1, wq2, ..., wqt)
 Di = (di1, di2, ..., dit)
 The new query may take the following form
Q’ = {wq1, wq2, ..., wqt}+{w’qt+1, w’qt+2, ..., w’qt+m}
 The weights of the newly added terms Tt+1 to Tt+m
may consist of a combined term-frequency and term-
relevance weight.

62
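
A hedged sketch of one feedback iteration: terms from judged-relevant documents are added
to the query, and every term is reweighted with a term-relevance weight estimated from the
feedback sample. The helper name and the 0.5 smoothing constants are illustrative
assumptions, not part of the slides.

import math

def relevance_feedback(query, relevant_docs, nonrelevant_docs):
    # query and documents are dicts mapping term -> weight (or term frequency)
    R, S = len(relevant_docs), len(nonrelevant_docs)
    expanded = {}
    candidates = set(query) | {t for d in relevant_docs for t in d}
    for t in candidates:
        r = sum(1 for d in relevant_docs if t in d)       # relevant docs containing t
        s = sum(1 for d in nonrelevant_docs if t in d)    # nonrelevant docs containing t
        p = (r + 0.5) / (R + 1.0)                         # smoothed estimate of Pr(t | rel)
        q = (s + 0.5) / (S + 1.0)                         # smoothed estimate of Pr(t | nonrel)
        tr = math.log((p * (1 - q)) / (q * (1 - p)))      # term-relevance weight
        expanded[t] = query.get(t, 1.0) * tr
    return expanded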
Final Indexing

 Identify individual text words.


 Use a stop list to delete common words.
 Use suffix stripping to produce word stems.
 Replace low-frequency terms with thesaurus classes.
 Replace high-frequency terms with phrases.
 Compute term weights for all single terms, phrases, and thesaurus
classes.
 Compare query statements with document vectors.
 Identify some retrieved documents as relevant and some as
nonrelevant to the query.

63
Final Indexing

 Compute term-relevance factors based on available relevance


assessments.
 Construct new queries with added terms from relevant documents and
term weights based on combined frequency and term-relevance weight.
 Return to step (7):
compare query statements with document vectors, and so on.

64
Summary of expected effectiveness of automatic
indexing

 Basic single-term automatic indexing: baseline

 Use of a thesaurus to group related terms in the given topic area:
+10% to +20%
 Use of automatically derived term associations obtained from joint term
assignments found in sample document collections:
0% to -10%
 Use of automatically derived term phrases obtained by using co-
occurring terms found in the texts of sample collections:
+5% to +10%
 Use of one iteration of relevance feedback to add new query terms
extracted from previously retrieved relevant documents:
+30% to +60%

65
