
Automatic Indexing

Automatic Text Processing


by G. Salton, Addison-Wesley, 1989.

1
Indexing

 indexing: assign identifiers to text documents.


 assign: manual vs. automatic indexing
 identifiers:
» objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher names, dates
of publication, …
» controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, …
» single terms vs. term phrases

2
Two Issues

 Issue 1: indexing exhaustivity


» exhaustive: assign a large number of terms
» nonexhaustive: assign fewer terms
 Issue 2: term specificity
» broad terms (generic)
cannot distinguish relevant from nonrelevant documents
» narrow terms (specific)
retrieve relatively fewer documents, but most of them are relevant

3
Parameters of
retrieval effectiveness

 Recall
R = (number of relevant items retrieved) / (total number of relevant items in collection)
 Precision
P = (number of relevant items retrieved) / (total number of items retrieved)
 Goal
high recall and high precision

4
Contingency table of the retrieved part of the collection:
a = relevant items retrieved, b = nonrelevant items retrieved,
c = nonrelevant items not retrieved, d = relevant items not retrieved

Recall = a / (a + d)          Precision = a / (a + b)
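
As an aside (not part of the original slides), a minimal Python sketch that computes
recall and precision from the contingency counts above; the variable names follow the table.

def recall_precision(a, b, d):
    # a: relevant retrieved, b: nonrelevant retrieved, d: relevant not retrieved
    recall = a / (a + d)       # fraction of all relevant items that were retrieved
    precision = a / (a + b)    # fraction of retrieved items that are relevant
    return recall, precision

# Example: 30 relevant retrieved, 20 nonrelevant retrieved, 70 relevant missed
print(recall_precision(30, 20, 70))   # -> (0.3, 0.6)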
A Joint Measure

 F-score

F = ((β² + 1) · P · R) / (β² · P + R)

 β is a parameter that encodes the relative importance of recall and
precision.
β = 1: equal weight
β < 1: precision is more important
β > 1: recall is more important

6
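
A similarly minimal sketch of the F-score, assuming precision and recall have already
been computed as above:

def f_score(precision, recall, beta=1.0):
    # F = ((beta^2 + 1) * P * R) / (beta^2 * P + R)
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return ((b2 + 1) * precision * recall) / (b2 * precision + recall)

print(f_score(0.6, 0.3))             # beta = 1: harmonic mean of P and R -> 0.4
print(f_score(0.6, 0.3, beta=2.0))   # beta > 1: recall weighted more heavily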
Choices of Recall and Precision

 Both recall and precision vary from 0 to 1.


 In principle, the average user wants to achieve both
high recall and high precision.
 In practice, a compromise must be reached because
simultaneously optimizing recall and precision is not
normally achievable.

7
Choices of Recall and Precision (Continued)

 Particular choices of indexing and search policies


have produced variations in performance ranging
from 0.8 precision and 0.2 recall to 0.1 precision and
0.8 recall.
 In many circumstances, recall and precision values
between 0.5 and 0.6 are more satisfactory for the
average user.

8
Term-Frequency Consideration

 Function words
» for example, "and", "or", "of", "but", …
» the frequencies of these words are high in all texts
 Content words
» words that actually relate to document content
» varying frequencies in the different texts of a collection
» their frequencies indicate term importance for content

9
A Frequency-Based Indexing Method

 Eliminate common function words from the document texts by


consulting a special dictionary, or stop list, containing a list of
high frequency function words.
 Compute the term frequency tfij for all remaining terms Tj in
each document Di, specifying the number of occurrences of Tj in
Di.
 Choose a threshold frequency T, and assign to each document
Di all terms Tj for which tfij > T.

10
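
A rough sketch of this procedure in Python; the stop list and the threshold value are
illustrative placeholders, not values prescribed by the slides.

from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "for"}   # illustrative stop list

def frequency_index(document_text, threshold=1):
    # Assign to the document every term whose frequency exceeds the threshold,
    # after removing stop-list words.
    terms = [w.lower() for w in document_text.split() if w.lower() not in STOP_LIST]
    tf = Counter(terms)
    return {term for term, freq in tf.items() if freq > threshold}

print(frequency_index("retrieval of information and retrieval of documents"))
# -> {'retrieval'}   (its frequency of 2 exceeds the threshold of 1)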
Inverse Document Frequency

 Inverse Document Frequency (IDF) for term Tj

idfj = log(N / dfj)

where dfj (the document frequency of term Tj) is the
number of documents in which Tj occurs.
» terms that occur frequently in individual documents but rarely in the
remainder of the collection fulfil both the recall and the precision requirements

11
New Term Importance Indicator

 weight wij of a term Tj in a document Di

wij = tfij × log(N / dfj)
 Eliminating common function words
 Computing the value of wij for each term Tj in each document Di
 Assigning to the documents of a collection all terms with
sufficiently high (tf x idf) factors

12
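
A small illustrative sketch of the tf × idf weighting over a toy, already-tokenized collection:

import math
from collections import Counter

docs = [["information", "retrieval", "retrieval"],
        ["text", "indexing"],
        ["information", "indexing", "text"]]          # toy collection

N = len(docs)
df = Counter(term for d in docs for term in set(d))   # document frequency of each term

def tfidf(doc):
    # w_ij = tf_ij * log(N / df_j) for every term in the document
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))
# "retrieval" (tf = 2, df = 1) receives the highest weight; "information" (df = 2) a lower one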
Term-discrimination Value

 Useful index terms


» distinguish the documents of a collection from each other
 Document Space
» when two documents are assigned very similar term sets, the
corresponding points in the document configuration appear close
together
» when a high-frequency term without discrimination is assigned, it
will increase the document space density

13
A Virtual Document Space

Figure: a document space in its original state, after assignment of a good
discriminator (documents move apart), and after assignment of a poor
discriminator (documents crowd together)

14
Good Term Assignment

 When a term is assigned to the documents of a


collection, the few objects to which the term is
assigned will be distinguished from the rest of the
collection.
 This should increase the average distance between
the objects in the collection and hence produce a
document space less dense than before.

15
Poor Term Assignment

 A high frequency term is assigned that does not


discriminate between the objects of a collection.
 Its assignment will render the documents more similar.
 This is reflected in an increase in document space
density.

16
Term Discrimination Value

 definition
dvj = Q - Qj
where Q and Qj are the space densities before and
after the assignment of term Tj:

Q = (1 / (N(N-1))) · Σi=1..N Σk=1..N, k≠i sim(Di, Dk)

 dvj > 0: Tj is a good term;
dvj < 0: Tj is a poor term.

17
Variations of Term-Discrimination Value
with Document Frequency

Variation of dvj along the document-frequency axis (0 to N):
Low frequency: dvj = 0     Medium frequency: dvj > 0     High frequency: dvj < 0

18
Another Term Weighting

 wij = tfij × dvj

 compared with wij = tfij × log(N / dfj)
» log(N / dfj): decreases steadily with increasing document
frequency
» dvj: increases from zero to positive as the document
frequency of the term increases, then
decreases sharply as the document frequency
becomes still larger

19
Document Centroid

 Issue: efficiency problem
N(N-1) pairwise similarities

 document centroid C = (c1, c2, c3, ..., ct)

cj = Σi=1..N wij

where wij is the weight of the j-th term in document i.

 space density

Q = (1/N) · Σi=1..N sim(C, Di)

20
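
An illustrative sketch of the centroid-based density computation and the resulting
discrimination value; it assumes cosine similarity and a dense document-by-term weight matrix.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(docs):
    # Q = (1/N) * sum_i sim(C, D_i), with the centroid C summed over all documents
    n_terms = len(docs[0])
    centroid = [sum(d[j] for d in docs) for j in range(n_terms)]
    return sum(cosine(centroid, d) for d in docs) / len(docs)

def discrimination_value(docs, j):
    # dv_j = Q - Q_j: density without term j minus density with term j assigned
    without_j = [[w for k, w in enumerate(d) if k != j] for d in docs]
    return space_density(without_j) - space_density(docs)

weights = [[2.0, 1.0, 0.0], [2.0, 0.0, 1.0], [2.0, 1.0, 1.0]]   # term 0 occurs uniformly everywhere
print(discrimination_value(weights, 0))   # negative: a poor, high-frequency discriminator
print(discrimination_value(weights, 2))   # positive: a better, more selective term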
Discussions

 dvj and idfj


global properties of terms in a document collection
 ideal
term characteristics that distinguish the relevant from the
nonrelevant documents of a collection

21
Probabilistic Term Weighting

 Goal
Explicit distinctions between occurrences of terms in relevant
and nonrelevant documents of a collection
 Definition
Given a user query q, and the ideal answer set of the relevant
documents
 From decision theory, the best ranking algorithm

g(x) = log [Pr(x|rel) / Pr(x|nonrel)] + log [Pr(rel) / Pr(nonrel)]

22
Probabilistic Term Weighting

 Pr(x|rel), Pr(x|nonrel):
occurrence probabilities of document x in the relevant and
nonrelevant document sets
 Pr(rel), Pr(nonrel):
document’s a priori probabilities of relevance and nonrelevance
 Further assumptions
Terms occur independently in relevant documents;
terms occur independently in nonrelevant documents.

23
Pr(x|rel) = Πi=1..t Pr(xi|rel)

Pr(x|nonrel) = Πi=1..t Pr(xi|nonrel)

g(x) = Σi=1..t log [Pr(xi|rel) / Pr(xi|nonrel)] + constants
Derivation Process

g(x) = log [Pr(x|rel) / Pr(x|nonrel)] + log [Pr(rel) / Pr(nonrel)]

     = log [ Πi=1..t Pr(xi|rel) / Πi=1..t Pr(xi|nonrel) ] + constants

     = Σi=1..t log [Pr(xi|rel) / Pr(xi|nonrel)] + constants

25
Given a document D = (d1, d2, …, dt),
the retrieval value of D is:

g(D) = Σi=1..t log [Pr(xi = di|rel) / Pr(xi = di|nonrel)] + constants

where di is the term weight of term xi.


Assume di is either 0 or 1.
0: the i-th term is absent from D.
1: the i-th term is present in D.

Pr(xi = di|rel) = pi^di · (1 - pi)^(1 - di)
Pr(xi = di|nonrel) = qi^di · (1 - qi)^(1 - di)

where pi = Pr(xi = 1|rel), 1 - pi = Pr(xi = 0|rel),
qi = Pr(xi = 1|nonrel), 1 - qi = Pr(xi = 0|nonrel)
g(D) = Σi=1..t log [Pr(xi = di|rel) / Pr(xi = di|nonrel)] + constants

     = Σi=1..t log [ pi^di (1 - pi)^(1 - di) / ( qi^di (1 - qi)^(1 - di) ) ] + constants

     = Σi=1..t log [ (pi(1 - qi))^di (1 - pi) / ( (qi(1 - pi))^di (1 - qi) ) ] + constants

g(D) = Σi=1..t log [(1 - pi)/(1 - qi)] + Σi=1..t di · log [ pi(1 - qi) / (qi(1 - pi)) ] + constants

The retrieval value of each term Tj present in a document
(i.e., dj = 1) is its term-relevance weight:

trj = log [ pj(1 - qj) / (qj(1 - pj)) ]
New Term Weighting

 term-relevance weight of term Tj: trj

 indexing value of term Tj in document Di:
wij = tfij · trj
 Issue
It is necessary to characterize both the relevant and nonrelevant
documents of a collection.
How can a representative document sample be found? Use feedback
information from retrieved documents?

30
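
A hedged sketch of the term-relevance weight and the resulting indexing weight; the
probabilities pj and qj are assumed to be known here, whereas in practice they must be
estimated, as the following slides discuss.

import math

def term_relevance_weight(p, q):
    # tr_j = log( p_j (1 - q_j) / ( q_j (1 - p_j) ) )
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))

def probabilistic_weight(tf, p, q):
    # w_ij = tf_ij * tr_j
    return tf * term_relevance_weight(p, q)

# A term occurring in 80% of relevant but only 5% of nonrelevant documents
print(probabilistic_weight(tf=3, p=0.8, q=0.05))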
Estimation of Term-Relevance

 Little is known about the relevance properties of terms.

» The occurrence probability of a term in the nonrelevant documents
qj is approximated by the occurrence probability of the term in the
entire document collection:
qj = dfj / N
» The occurrence probabilities of the terms in the small number of
relevant documents are taken to be equal, using a constant value pj = 0.5
for all j.

31
trj = log [ pj(1 - qj) / (qj(1 - pj)) ]
    = log [ 0.5 · (1 - dfj/N) / ((dfj/N) · 0.5) ]
    = log [ (N - dfj) / dfj ]

When N is sufficiently large, N - dfj ≈ N, so

trj = log [ (N - dfj) / dfj ] ≈ log (N / dfj) = idfj

32
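
A small numeric check (illustrative only) that with pj = 0.5 and qj = dfj / N the
term-relevance weight reduces to log((N - dfj) / dfj), which is close to idfj when N is large:

import math

N, df = 100000, 50
p, q = 0.5, df / N                             # constant p_j; q_j estimated from the collection
tr = math.log((p * (1 - q)) / (q * (1 - p)))   # = log((N - df) / df)
print(tr, math.log((N - df) / df), math.log(N / df))   # the three values nearly coincide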
Estimation of Term-Relevance

 Estimate the number of relevant documents rj in the collection

that contain term Tj as a function of the known document
frequency dfj of the term Tj:
pj = rj / R
qj = (dfj - rj) / (N - R)
R: an estimate of the total number of relevant documents in the
collection.

33
Term Relationships in Indexing

 Single-term indexing
» Single terms are often ambiguous.
» Many single terms are either too specific or too broad to be
useful.
 Complex text identifiers
» subject experts and trained indexers
» linguistic analysis algorithms, e.g., NP chunker
» term-grouping or term clustering methods

34
Tree-Dependence Model

 Only certain dependent term pairs are actually included, the


other term pairs and all higher-order term combinations being
disregarded.
 Example:
sample term-dependence tree

35
(children, school), (school, girls), (school, boys): dependent pairs included in the tree
(children, girls), (children, boys): term pairs disregarded
(school, girls, boys), (children, achievement, ability): higher-order combinations disregarded

36
Term Classification (Clustering)

      T1    T2    T3   ...   Tt
D1    d11   d12   ...        d1t
D2    d21   d22   ...        d2t
...
Dn    dn1   dn2   ...        dnt
37
Term Classification (Clustering)

 Column part
Group terms whose corresponding column representations reveal
similar assignments to the documents of the collection.
 Row part
Group documents that exhibit sufficiently similar term
assignments.

38
Linguistic Methodologies

 Indexing phrases:
nominal constructions including adjectives and nouns
» Assign syntactic class indicators (i.e., part of speech) to the words
occurring in document texts.
» Construct word phrases from sequences of words exhibiting certain
allowed syntactic markers (noun-noun and adjective-noun
sequences).

39
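
A rough sketch of the phrase-construction step, assuming the text has already been
tagged with parts of speech; the tag names and the example tokens are made up for illustration.

def candidate_phrases(tagged_tokens):
    # Collect adjacent noun-noun and adjective-noun pairs as candidate index phrases.
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in {("NOUN", "NOUN"), ("ADJ", "NOUN")}:
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("effective", "ADJ"), ("retrieval", "NOUN"), ("systems", "NOUN"),
          ("are", "VERB"), ("essential", "ADJ")]
print(candidate_phrases(tagged))   # -> ['effective retrieval', 'retrieval systems']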
Term-Phrase Formation

 Term Phrase
a sequence of related text words that carries a more specific meaning
than the single terms
e.g., “computer science” vs. “computer”

Figure: along the document-frequency axis (0 to N), thesaurus transformation
applies to low-frequency terms and phrase transformation to high-frequency terms
40
Simple Phrase-Formation Process

 the principal phrase component (phrase head)


a term with a document frequency exceeding a stated threshold,
or exhibiting a negative discriminator value
 the other components of the phrase
medium- or low-frequency terms with stated co-occurrence
relationships with the phrase head
 common function words
not used in the phrase-formation process

41
An Example

 Effective retrieval systems are essential for people in


need of information.
» “are”, “for”, “in” and “of”:
common function words
» “system”, “people”, and “information”:
phrase heads

42
The Formatted Term-Phrases
effective retrieval systems essential people need information

Phrase heads and components            Phrase heads and components
must be adjacent                       co-occur in the sentence
1. retrieval system*                   6. effective systems
2. systems essential                   7. systems need
3. essential people                    8. effective people
4. people need                         9. retrieval people
5. need information*                   10. effective information*
                                       11. retrieval information*
                                       12. essential information*
useful: 2/5                            useful: 5/12
*: phrases assumed to be useful for content identification
43
The Problems

 A phrase-formation process controlled only by word co-


occurrences and the document frequencies of certain words is
not likely to generate a large number of high-quality phrases.
 Additional syntactic criteria for phrase heads and phrase
components may provide further control in phrase formation.

44
Additional Term-Phrase Formation Steps

 Syntactic class indicator are assigned to the terms, and phrase


formation is limited to sequences of specified syntactic markers,
such as adjective-noun and noun-noun sequences.
(adverb-adjective and adverb-noun sequences are excluded)
 The phrase elements are all chosen from within the same
syntactic unit, such as subject phrase, term phrase, and verb
phrase.

45
Consider Syntactic Unit

 effective retrieval systems are essential for people in


need of information
 subject phrase
» effective retrieval systems
 verb phrase
» are essential
 term phrase
» people in need of information

46
Phrases within Syntactic Components
[subj effective retrieval systems] [vp are essential ]
for [obj people need information]
 Adjacent phrase heads and components within syntactic
components
» retrieval systems*
» people need
» need information*
useful: 2/3
 Phrase heads and components co-occur within syntactic
components
» effective systems

47
Problems

 More stringent phrase formation criteria produce


fewer phrases, both good and bad, than less stringent
methodologies.
 Prepositional phrase attachment, e.g.,
The man saw the girl with the telescope.
 Anaphora resolution
He dropped the plate on his foot and broke it.

48
Problems (Continued)

 Any phrase matching system must be able to deal with the


problems of
» synonym recognition
» differing word orders
» intervening extraneous words
 Example
» retrieval of information vs. information retrieval

49
Equivalent Phrase Formulation

 Base form: text analysis system


 Variants:
» system analyzes the text
» text is analyzed by the system
» system carries out text analysis
» text is subjected to system analysis
 Related term substitution
» text: documents, information documents
» analysis: processing, transformation, manipulation
» system: program, process

50
Thesaurus-Group Generation

 Thesaurus transformation
» broadens index terms whose scope is too narrow to be useful in
retrieval
» a thesaurus must assemble groups of related specific terms under
more general, higher-level class indicators

Figure: document-frequency axis from 0 to N, with thesaurus transformation applied at the
low-frequency end and phrase transformation at the high-frequency end:
Low frequency: dvj = 0     Medium frequency: dvj > 0     High frequency: dvj < 0

51
Sample Classes of Roget’s Thesaurus

Class Indicator    Entries
760                permission, leave, sanction, allowance, tolerance, authorization
761                prohibition, veto, disallowance, injunction, ban, taboo
762                consent, acquiescence, compliance, agreement, acceptance
763                offer, presentation, tender, overture, advance, submission,
                   proposal, proposition, invitation
764                refusal, declining, noncompliance, rejection, denial

52
Methods of Thesaurus Construction

 dij: value of term Tj in document Di


 sim(Tj, Tk): similarity measure between Tj and Tk
 possible measures
sim(Tj, Tk) = Σi=1..n dij · dik

sim(Tj, Tk) = Σi=1..n dij · dik / sqrt( Σi=1..n (dij)² · Σi=1..n (dik)² )

53
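
An illustrative sketch of the two term-term similarity measures, treating the columns of
the term-document matrix as vectors; a dense matrix of weights dij is assumed.

import math

def dot_similarity(matrix, j, k):
    # sim(T_j, T_k) = sum_i d_ij * d_ik over the document rows
    return sum(row[j] * row[k] for row in matrix)

def normalized_similarity(matrix, j, k):
    # Dot product normalized by the column magnitudes (cosine form)
    num = dot_similarity(matrix, j, k)
    den = math.sqrt(sum(row[j] ** 2 for row in matrix) *
                    sum(row[k] ** 2 for row in matrix))
    return num / den if den else 0.0

matrix = [[1, 0, 2], [0, 1, 1], [3, 0, 1]]   # rows: documents D1..D3, columns: terms T1..T3
print(dot_similarity(matrix, 0, 2), round(normalized_similarity(matrix, 0, 2), 3))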
Term-classification strategies

 single-link
Each term must have a similarity exceeding a stated threshold value
with at least one other term in the same class.
Produces fewer, much larger term classes.
 complete-link
Each term has a similarity to all other terms in the same class that
exceeds the threshold value.
Produces a larger number of small classes.

54
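
A rough sketch of single-link term classification, implemented as connected components of a
thresholded similarity graph; complete-link would instead require every pair inside a class to
exceed the threshold. The function and variable names are illustrative.

def single_link_classes(terms, sim, threshold):
    # Group terms into connected components of the graph {(j, k): sim(j, k) > threshold}
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t

    for j in terms:
        for k in terms:
            if j != k and sim(j, k) > threshold:
                parent[find(j)] = find(k)   # merge the two components

    classes = {}
    for t in terms:
        classes.setdefault(find(t), set()).add(t)
    return list(classes.values())

pairs = {("text", "document"): 0.7, ("document", "text"): 0.7, ("query", "request"): 0.8}
sim = lambda j, k: pairs.get((j, k), 0.0)
print(single_link_classes(["text", "document", "query", "request"], sim, 0.5))
# two classes: {text, document} and {query, request}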
The Indexing Prescription (1)

 Identify the individual words in the document collection.


 Use a stop list to delete from the texts the function words.
 Use a suffix-stripping routine to reduce each remaining word to word-
stem form.
 For each remaining word stem Tj in document Di, compute wij.
 Represent each document Di by
Di=(T1, wi1; T2, wi2; …, Tt, wit)

55
Word Stemming

 effectiveness --> effective --> effect


 picnicking --> picnic
 king -\-> k

56
Some Morphological Rules

 Restore a silent e after suffix removal from certain words to


produce “hope” from “hoping” rather than “hop”
 Delete certain doubled consonants after suffix removal, so as to
generate “hop” from “hopping” rather than “hopp”.
 Use a final y for an i in forms such as “easier”, so as to generate
“easy” instead of “easi”.

57
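
A toy sketch of these repair rules applied after naive suffix stripping; it covers only the
three cases mentioned here, unlike a real stemming algorithm.

VOWELS = set("aeiou")

def crude_stem(word):
    # Strip a few suffixes, then apply the repair rules above (toy example only).
    for suffix in ("ing", "er", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            if stem.endswith("i"):                                    # easier -> easi -> easy
                return stem[:-1] + "y"
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                return stem[:-1]                                      # hopping -> hopp -> hop
            if stem[-1] not in VOWELS and stem[-2] in VOWELS and len(stem) <= 3:
                return stem + "e"                                     # hoping -> hop -> hope
            return stem
    return word

print(crude_stem("hoping"), crude_stem("hopping"), crude_stem("easier"))   # hope hop easy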
The Indexing Prescription (2)

 Identify individual text words.


 Use stop list to delete common function words.
 Use automatic suffix stripping to produce word stems.
 Compute term-discrimination value for all word stems.
 Use thesaurus class replacement for all low-frequency terms with
discrimination values near zero.
 Use phrase-formation process for all high-frequency terms with
negative discrimination values.
 Compute weighting factors for complex indexing units.
 Assign to each document single term weights, term phrases, and
thesaurus classes with weights.

58
Query vs. Document

 Differences
» Query texts are short.
» Fewer terms are assigned to queries.
» The occurrence of query terms rarely exceeds 1.

Q=(wq1, wq2, …, wqt) where wqj: inverse document frequency


Di=(di1, di2, …, dit)
where dij: term frequency*inverse document frequency
sim(Q, Di) = Σj=1..t wqj · dij

59
Query vs. Document

 When non-normalized documents are used, the longer documents with


more assigned terms have a greater chance of matching particular
query terms than do the shorter document vectors.
sim(Q, Di) = Σj=1..t wqj · dij / Σj=1..t (dij)²

or

sim(Q, Di) = Σj=1..t wqj · dij / sqrt( Σj=1..t (dij)² · Σj=1..t (wqj)² )

60
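
An illustrative sketch of the unnormalized and cosine-normalized query-document
similarities; the query and document vectors are plain lists over the same term order.

import math

def sim_unnormalized(q, d):
    # Simple inner product of query and document weights
    return sum(wq * wd for wq, wd in zip(q, d))

def sim_cosine(q, d):
    # Inner product divided by the vector lengths, removing the advantage of long documents
    num = sim_unnormalized(q, d)
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return num / den if den else 0.0

query = [1.2, 0.0, 2.3]   # idf-style query weights
doc = [0.4, 1.1, 0.9]     # tf*idf document weights
print(sim_unnormalized(query, doc), round(sim_cosine(query, doc), 3))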
Relevance Feedback

 Terms present in previously retrieved documents that have been


identified as relevant to the user’s query are added to the original
formulations.
 The weights of the original query terms are altered by replacing the
inverse document frequency portion of the weights with term-relevance
weights obtained by using the occurrence characteristics of the terms in
the previously retrieved relevant and nonrelevant documents of the
collection.

61
Relevance Feedback
 Q = (wq1, wq2, ..., wqt)
 Di = (di1, di2, ..., dit)
 The new query may take the following form
Q’ = {wq1, wq2, ..., wqt}+{w’qt+1, w’qt+2, ..., w’qt+m}
 The weights of the newly added terms Tt+1 to Tt+m
may consist of a combined term-frequency and term-
relevance weight.

62
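
A hedged sketch of one feedback iteration: terms from judged-relevant documents are added
to the query, and every term is reweighted with a term-relevance weight estimated from the
feedback sample. The helper name and the 0.5 smoothing constants are illustrative
assumptions, not part of the slides.

import math

def relevance_feedback(query, relevant_docs, nonrelevant_docs):
    # query and documents are dicts mapping term -> weight (or term frequency)
    R, S = len(relevant_docs), len(nonrelevant_docs)
    expanded = {}
    candidates = set(query) | {t for d in relevant_docs for t in d}
    for t in candidates:
        r = sum(1 for d in relevant_docs if t in d)       # relevant docs containing t
        s = sum(1 for d in nonrelevant_docs if t in d)    # nonrelevant docs containing t
        p = (r + 0.5) / (R + 1.0)                         # smoothed estimate of Pr(t | rel)
        q = (s + 0.5) / (S + 1.0)                         # smoothed estimate of Pr(t | nonrel)
        tr = math.log((p * (1 - q)) / (q * (1 - p)))      # term-relevance weight
        expanded[t] = query.get(t, 1.0) * tr
    return expanded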
Final Indexing

 Identify individual text words.


 Use a stop list to delete common words.
 Use suffix stripping to produce word stems.
 Replace low-frequency terms with thesaurus classes.
 Replace high-frequency terms with phrases.
 Compute term weights for all single terms, phrases, and thesaurus
classes.
 Compare query statements with document vectors.
 Identify some retrieved documents as relevant and some as
nonrelevant to the query.

63
Final Indexing

 Compute term-relevance factors based on available relevance


assessments.
 Construct new queries with added terms from relevant documents and
term weights based on combined frequency and term-relevance weight.
 Return to step (7):
compare query statements with document vectors, and so on.

64
Summary of expected effectiveness of automatic
indexing

 Basic single-term automatic indexing: baseline

 Use of a thesaurus to group related terms in the given topic area:
+10% to +20%
 Use of automatically derived term associations obtained from joint term
assignments found in sample document collections:
0% to -10%
 Use of automatically derived term phrases obtained by using co-
occurring terms found in the texts of sample collections:
+5% to +10%
 Use of one iteration of relevance feedback to add new query terms
extracted from previously retrieved relevant documents:
+30% to +60%

65
