1
Indexing
2
Two Issues
3
Parameters of Retrieval Effectiveness
Recall

  R = (number of relevant items retrieved) / (total number of relevant items in collection)

Precision

  P = (number of relevant items retrieved) / (total number of items retrieved)

Goal
high recall and high precision
4
                   Nonrelevant    Relevant
                   Items          Items
  Retrieved part       b              a
  Not retrieved        c              d

  Recall = a / (a + d)
  Precision = a / (a + b)
A Joint Measure
F-score

  F = ( (β² + 1) · P · R ) / ( β² · P + R )

where β is a parameter controlling the relative importance of recall and precision.
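A minimal sketch of these measures in Python (the counts a, b, d follow the contingency-table labels above; the numbers are illustrative):

```python
def recall(a, d):
    # a = relevant items retrieved, d = relevant items not retrieved
    return a / (a + d)

def precision(a, b):
    # b = nonrelevant items retrieved
    return a / (a + b)

def f_score(p, r, beta=1.0):
    # F = ((beta^2 + 1) * P * R) / (beta^2 * P + R)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

a, b, d = 30, 10, 20
p, r = precision(a, b), recall(a, d)
print(r, p, f_score(p, r))   # recall 0.6, precision 0.75
```

With β = 1 the F-score reduces to the harmonic mean of precision and recall.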
6
Choices of Recall and Precision
7
Choices of Recall and Precision (Continued)
8
Term-Frequency Consideration
Function words
» for example, "and", "or", "of", "but", …
» the frequencies of these words are high in all texts
Content words
» words that actually relate to document content
» varying frequencies in the different texts of a collection
» indicate term importance for content
9
A Frequency-Based Indexing Method
10
Inverse Document Frequency
11
New Term Importance Indicator
12
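The preceding slides name a frequency-based indexing method that combines term frequency with inverse document frequency; a common formulation is w_ij = tf_ij · log(N / df_j). A minimal sketch (the function name and sample documents are illustrative, not from the slides):

```python
import math

def tfidf_weights(docs):
    """Compute w_ij = tf_ij * log(N / df_j) for tokenized documents.

    tf_ij: frequency of term j in document i
    df_j : number of documents containing term j
    N    : number of documents in the collection
    """
    N = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            w[term] = w.get(term, 0) + 1          # raw term frequency
        for term in w:
            w[term] *= math.log(N / df[term])      # idf factor
        weights.append(w)
    return weights

docs = [["information", "retrieval", "systems"],
        ["retrieval", "of", "information"],
        ["database", "systems"]]
w = tfidf_weights(docs)
# "database" occurs in only one of three documents, so it gets the largest idf
```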
Term-discrimination Value
13
A Virtual Document Space
14
Good Term Assignment
15
Poor Term Assignment
16
Term Discrimination Value
definition
  dv_j = Q - Q_j
  where Q and Q_j are the space densities before and after the
  assignment of term T_j, with

  Q = 1 / ( N (N - 1) ) · Σ_{i=1..N} Σ_{k=1..N, k≠i} sim(D_i, D_k)
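A sketch of the density computation, assuming cosine similarity between binary document vectors (the sample matrix is illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def space_density(docs):
    # Q = 1 / (N (N - 1)) * sum over all pairs i != k of sim(D_i, D_k)
    N = len(docs)
    total = sum(cosine(docs[i], docs[k])
                for i in range(N) for k in range(N) if i != k)
    return total / (N * (N - 1))

def discrimination_value(docs, j):
    # dv_j = Q - Q_j, with Q the density before term j is assigned
    # (column j removed) and Q_j the density after (full matrix)
    without_j = [[x for t, x in enumerate(doc) if t != j] for doc in docs]
    return space_density(without_j) - space_density(docs)

docs = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]   # rows: documents, columns: terms
print(discrimination_value(docs, 0))       # term 0 occurs in every document
```

Term 0 appears in all three documents, so assigning it makes the documents more alike and its discrimination value comes out negative, matching the high-frequency case on the next slide.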
17
Variations of Term-Discrimination Value
with Document Frequency
[Figure: terms arranged along a document-frequency axis from 0 to N]
  Low frequency:    dv_j = 0
  Medium frequency: dv_j > 0
  High frequency:   dv_j < 0
18
Another Term Weighting
19
Document Centroid
20
Discussions
21
Probabilistic Term Weighting
Goal
Explicit distinctions between occurrences of terms in relevant
and nonrelevant documents of a collection
Definition
Given a user query q, and the ideal answer set of the relevant
documents
From decision theory, the best ranking algorithm ranks each document x in decreasing order of g(x) = log [ Pr(x|rel) / Pr(x|nonrel) ] + constants
22
Probabilistic Term Weighting
Pr(x|rel), Pr(x|nonrel):
occurrence probabilities of document x in the relevant and
nonrelevant document sets
Pr(rel), Pr(nonrel):
document’s a priori probabilities of relevance and nonrelevance
Further assumptions
Terms occur independently in relevant documents;
terms occur independently in nonrelevant documents.
23
With the term-independence assumption:

  Pr(x | rel)    = Π_{i=1..t} Pr(x_i | rel)
  Pr(x | nonrel) = Π_{i=1..t} Pr(x_i | nonrel)

so that

  g(x) = Σ_{i=1..t} log [ Pr(x_i | rel) / Pr(x_i | nonrel) ] + constants

Derivation Process

  g(x) = log [ Π_{i=1..t} Pr(x_i | rel) / Π_{i=1..t} Pr(x_i | nonrel) ] + constants
       = Σ_{i=1..t} log [ Pr(x_i | rel) / Pr(x_i | nonrel) ] + constants
25
Given a document D = (d1, d2, ..., dt), the retrieval value of D is:

  g(D) = Σ_{i=1..t} log [ Pr(x_i = d_i | rel) / Pr(x_i = d_i | nonrel) ] + constants

With the abbreviations

  p_i = Pr(x_i = 1 | rel),      1 - p_i = Pr(x_i = 0 | rel)
  q_i = Pr(x_i = 1 | nonrel),   1 - q_i = Pr(x_i = 0 | nonrel)

the component probabilities can be written

  Pr(x_i = d_i | rel)    = p_i^{d_i} (1 - p_i)^{1 - d_i}
  Pr(x_i = d_i | nonrel) = q_i^{d_i} (1 - q_i)^{1 - d_i}
Substituting these into g(D):

  g(D) = Σ_{i=1..t} log [ Pr(x_i = d_i | rel) / Pr(x_i = d_i | nonrel) ] + constants
       = Σ_{i=1..t} log [ p_i^{d_i} (1 - p_i)^{1 - d_i} / ( q_i^{d_i} (1 - q_i)^{1 - d_i} ) ] + constants
       = Σ_{i=1..t} log [ ( p_i (1 - q_i) / ( q_i (1 - p_i) ) )^{d_i} · (1 - p_i) / (1 - q_i) ] + constants

  g(D) = Σ_{i=1..t} log [ (1 - p_i) / (1 - q_i) ] + Σ_{i=1..t} d_i log [ p_i (1 - q_i) / ( q_i (1 - p_i) ) ] + constants

The document-independent term-relevance weight of term T_j is

  tr_j = log [ p_j (1 - q_j) / ( q_j (1 - p_j) ) ]
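A direct transcription of the final formula, with constants dropped (the probability values below are illustrative):

```python
import math

def term_relevance(p, q):
    # tr_j = log( p_j (1 - q_j) / ( q_j (1 - p_j) ) )
    return math.log(p * (1 - q) / (q * (1 - p)))

def retrieval_value(d, p, q):
    # g(D) = sum_i d_i * tr_i + sum_i log((1 - p_i) / (1 - q_i))
    score = sum(di * term_relevance(pi, qi)
                for di, pi, qi in zip(d, p, q))
    score += sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
    return score

d = [1, 0, 1]                 # binary term assignments of document D
p = [0.8, 0.5, 0.6]           # Pr(x_i = 1 | rel)
q = [0.2, 0.5, 0.3]           # Pr(x_i = 1 | nonrel)
print(retrieval_value(d, p, q))
```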
New Term Weighting
30
Estimation of Term-Relevance
31
With the estimates p_j = 0.5 and q_j = df_j / N (df_j: document frequency of term T_j):

  tr_j = log [ p_j (1 - q_j) / ( q_j (1 - p_j) ) ]
       = log [ 0.5 · (1 - df_j / N) / ( (df_j / N) · 0.5 ) ]
       = log [ (N - df_j) / df_j ]
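A one-line transcription of the resulting estimate:

```python
import math

def tr_estimate(df_j, N):
    # tr_j ≈ log((N - df_j) / df_j), from the estimates p_j = 0.5, q_j = df_j / N
    return math.log((N - df_j) / df_j)

# rare terms receive the largest weights, much like inverse document frequency
print(tr_estimate(1, 1000), tr_estimate(100, 1000), tr_estimate(500, 1000))
```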
32
Estimation of Term-Relevance
33
Term Relationships in Indexing
Single-term indexing
» Single terms are often ambiguous.
» Many single terms are either too specific or too broad to be
useful.
Complex text identifiers
» subject experts and trained indexers
» linguistic analysis algorithms, e.g., NP chunker
» term-grouping or term clustering methods
34
Tree-Dependence Model
35
(children, school), (school, girls), (school, boys)
(children, girls), (children, boys)
(school, girls, boys), (children, achievement, ability)
36
Term Classification (Clustering)
        T1     T2     ...    Tt
  D1    d11    d12    ...    d1t
  D2    d21    d22    ...    d2t
  ...
  Dn    dn1    dn2    ...    dnt
37
Term Classification (Clustering)
Column part
Group terms whose corresponding column representations reveal
similar assignments to the documents of the collection.
Row part
Group documents that exhibit sufficiently similar term
assignment.
38
Linguistic Methodologies
Indexing phrases:
nominal constructions including adjectives and nouns
» Assign syntactic class indicators (i.e., part of speech) to the words
occurring in document texts.
» Construct word phrases from sequences of words exhibiting certain
allowed syntactic markers (noun-noun and adjective-noun
sequences).
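The two steps above can be sketched as follows, taking already-tagged text as input (tags follow the Penn Treebank convention, NN* = noun and JJ* = adjective; the tagged sentence is illustrative):

```python
# Construct candidate phrases from word sequences exhibiting the allowed
# syntactic markers: noun-noun and adjective-noun pairs.

def candidate_phrases(tagged):
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        first_ok = t1.startswith("NN") or t1.startswith("JJ")
        second_ok = t2.startswith("NN")
        if first_ok and second_ok:
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("effective", "JJ"), ("retrieval", "NN"), ("systems", "NNS"),
          ("are", "VBP"), ("essential", "JJ")]
print(candidate_phrases(tagged))
```

In practice the part-of-speech tags would come from a tagger such as NLTK's; here they are supplied by hand to keep the sketch self-contained.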
39
Term-Phrase Formation
Term Phrase
  a sequence of related text words carries a more specific meaning
  than the single terms, e.g., "computer science" vs. "computer"
[Figure: document-frequency axis from 0 to N; the thesaurus
transformation applies at the low-frequency end, the phrase
transformation at the high-frequency end]
40
Simple Phrase-Formation Process
41
An Example
42
The Formatted Term-Phrases
effective retrieval systems essential people need information

Phrase heads and components          Phrase heads and components
must be adjacent:                    co-occur in sentence:
 1. retrieval systems*                6. effective systems
 2. systems essential                 7. systems need
 3. essential people                  8. effective people
 4. people need                       9. retrieval people
 5. need information*                10. effective information*
                                     11. retrieval information*
                                     12. essential information*
useful: 2/5                          useful: 5/12

*: phrases assumed to be useful for content identification
43
The Problems
44
Additional Term-Phrase Formation Steps
45
Consider Syntactic Unit
46
Phrases within Syntactic Components
[subj effective retrieval systems] [vp are essential ]
for [obj people need information]
Adjacent phrase heads and components within syntactic
components
» retrieval systems*
» people need
» need information*
useful: 2/3
Phrase heads and components co-occur within syntactic
components
» effective systems
47
Problems
48
Problems (Continued)
49
Equivalent Phrase Formulation
50
Thesaurus-Group Generation
Thesaurus transformation
» broadens index terms whose scope is too narrow to be useful in
retrieval
» a thesaurus must assemble groups of related specific terms under
more general, higher-level class indicators
[Figure: document-frequency axis from 0 to N; the thesaurus
transformation applies to low-frequency terms (dv_j = 0), the phrase
transformation to high-frequency terms (dv_j < 0); medium-frequency
terms (dv_j > 0) are used directly]
51
Sample Classes of Roget’s Thesaurus
52
Methods of Thesaurus Construction
Term-similarity computation between the columns of the
term-assignment matrix:

  sim(T_j, T_k) = Σ_{i=1..n} d_ij · d_ik / sqrt( Σ_{i=1..n} (d_ij)² · Σ_{i=1..n} (d_ik)² )
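A sketch of the column-wise similarity computation (the sample matrix is illustrative):

```python
import math

def term_similarity(D, j, k):
    # sim(T_j, T_k): cosine between columns j and k of the
    # term-assignment matrix D (rows = documents, columns = terms)
    col_j = [row[j] for row in D]
    col_k = [row[k] for row in D]
    dot = sum(a * b for a, b in zip(col_j, col_k))
    norm = math.sqrt(sum(a * a for a in col_j) * sum(b * b for b in col_k))
    return dot / norm if norm else 0.0

D = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]
# terms 0 and 1 are assigned to exactly the same documents
print(term_similarity(D, 0, 1), term_similarity(D, 0, 2))
```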
53
Term-classification strategies
single-link
  Each term must have a similarity exceeding a stated threshold value
  with at least one other term in the same class.
  Produces fewer, much larger term classes.
complete-link
  Each term must have a similarity exceeding the threshold value with
  all other terms in the same class.
  Produces a large number of small classes.
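The single-link strategy can be sketched with a union-find pass over a precomputed similarity matrix (values are illustrative); complete-link would instead require every pair inside a class to clear the threshold:

```python
# Group terms by single-link: union two terms whenever their pairwise
# similarity exceeds the threshold. Classes can chain together through
# intermediate terms, which is why single-link yields fewer, larger groups.

def single_link(sim, threshold):
    n = len(sim)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for j in range(n):
        for k in range(j + 1, n):
            if sim[j][k] > threshold:
                parent[find(k)] = find(j)
    groups = {}
    for t in range(n):
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())

sim = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.8],
       [0.2, 0.8, 1.0]]
print(single_link(sim, 0.7))   # chaining merges all three terms
```

Complete-link at the same threshold would keep term 2 out of term 0's class, since sim(T0, T2) = 0.2 is below it.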
54
The Indexing Prescription (1)
55
Word Stemming
56
Some Morphological Rules
57
The Indexing Prescription (2)
58
Query vs. Document
Differences
» Query texts are short.
» Fewer terms are assigned to queries.
» The occurrence frequency of a query term rarely exceeds 1.
59
Query vs. Document
Similarity between query Q = (w_q1, ..., w_qt) and document D_i:

  sim(Q, D_i) = Σ_{j=1..t} w_qj · d_ij / sqrt( Σ_{j=1..t} (d_ij)² )

or

  sim(Q, D_i) = Σ_{j=1..t} w_qj · d_ij / sqrt( Σ_{j=1..t} (d_ij)² · Σ_{j=1..t} (w_qj)² )
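Both forms can be sketched directly (the weight vectors are illustrative):

```python
import math

def sim_doc_normalized(q, d):
    # sim(Q, D_i) = sum_j w_qj * d_ij / sqrt(sum_j d_ij^2)
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / math.sqrt(sum(wd * wd for wd in d))

def sim_cosine(q, d):
    # second form: additionally divide by the query norm
    return sim_doc_normalized(q, d) / math.sqrt(sum(wq * wq for wq in q))

q = [1, 1, 0]      # query term weights w_qj
d = [2, 0, 1]      # document term weights d_ij
print(sim_doc_normalized(q, d), sim_cosine(q, d))
```

For a fixed query the two forms produce the same document ranking, since the query norm does not vary across documents.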
60
Relevance Feedback
61
Relevance Feedback
Q = (wq1, wq2, ..., wqt)
Di = (di1, di2, ..., dit)
The new query may take the form
Q' = (wq1, wq2, ..., wqt) + (w'q,t+1, w'q,t+2, ..., w'q,t+m)
The weights of the newly added terms Tt+1 to Tt+m
may consist of a combined term-frequency and term-
relevance weight.
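A sketch of the expansion step, using a combined term-frequency × term-relevance weight for the newly added terms (all names and values are illustrative):

```python
def expand_query(query, rel_doc_tf, tr):
    # Add terms from a judged-relevant document to the query; terms not
    # already in the query receive a combined tf * tr weight.
    # query, rel_doc_tf, tr: dictionaries mapping term -> value.
    expanded = dict(query)
    for term, tf in rel_doc_tf.items():
        if term not in expanded:
            expanded[term] = tf * tr.get(term, 0.0)
    return expanded

q = {"retrieval": 1.0, "systems": 1.0}
rel_doc = {"retrieval": 2, "indexing": 3}
tr = {"indexing": 1.5}
print(expand_query(q, rel_doc, tr))   # adds "indexing" with weight 4.5
```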
62
Final Indexing
63
Final Indexing
64
Summary of expected effectiveness of automatic
indexing
65