Information Retrieval
Computer Science Tripos Part II
Simone Teufel
Simone.Teufel@cl.cam.ac.uk
Lent 2014
IR System Components
[Figure: IR system architecture. A document collection passes through document normalisation into the indexer, which builds the indexes; queries enter through a query UI and pass through query normalisation; the ranking/matching module matches normalised queries against the indexes and returns a set of relevant documents.]
Overview
1 Recap
2 Why ranked retrieval?
3 Term frequency
4 tf-idf weighting
5 The vector space model
Recap: Large-scale, distributed indexing
Upcoming
Problem with Boolean search: Feast or famine
Boolean queries often have either too few or too many results.
Query 1: standard AND user AND dlink AND 650
→ 200,000 hits. Feast!
Query 2: standard AND user AND dlink AND 650 AND no AND card AND found
→ 0 hits. Famine!
Take 1: Jaccard coefficient
jaccard(A, B) = |A ∩ B| / |A ∪ B|    (for A ≠ ∅ or B ≠ ∅)
jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size.
The coefficient always assigns a number between 0 and 1.
Jaccard coefficient: Example
Query q: "ides of March"
Document d: "Caesar died in March"
The two term sets share one term (march) and their union has six terms, so jaccard(q, d) = 1/6.
What’s wrong with Jaccard?
Count matrix
Bag of words model
Term frequency tf
Instead of raw frequency: Log frequency weighting
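The log-frequency weight reappears as the tf factor in the tf-idf formula below: w = 1 + log10(tf) if tf > 0, and 0 otherwise. A minimal sketch:

import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in [0, 1, 2, 10, 1000]:
    print(tf, round(log_tf_weight(tf), 2))
# tf 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4

The weight grows with term frequency, but far more slowly than the raw count: a term occurring 1000 times is weighted 4, not 1000.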
Zipf's law
The i-th most frequent term has collection frequency cf_i proportional to 1/i:
cf_i ∝ 1/i
So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences, cf_2 = (1/2)·cf_1 ...
... and the third most frequent term (and) has a third as many occurrences, cf_3 = (1/3)·cf_1, etc.
Equivalently: cf_i = c·i^k and log cf_i = log c + k·log i (for k = −1)
Zipf's law is an example of a power law.
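A quick way to inspect the law empirically: count term frequencies in any sizeable corpus and check that rank × frequency stays roughly constant (equivalently, that log cf against log rank is roughly linear). A sketch, where the variable text is a placeholder for any large body of text:

from collections import Counter

text = "..."  # placeholder: any large body of text
counts = Counter(text.lower().split())
ranked = sorted(counts.values(), reverse=True)

# Under Zipf's law, rank * cf should be roughly constant across ranks.
for rank, cf in enumerate(ranked[:20], start=1):
    print(rank, cf, rank * cf)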
Zipf's Law: Examples from 5 Languages
Zipf's law: Rank × Frequency ≈ Constant
Other collections (allegedly) obeying power laws
Sizes of settlements
Frequency of access to web pages
Income distributions amongst the top-earning 3% of individuals
Korean family names
Sizes of earthquakes
Word senses per word
Notes in musical performances
...
Zipf’s law for Reuters
[Figure: log-log plot of collection frequency (log10 cf, y-axis) against rank (log10 rank, x-axis) for the Reuters collection; the trend is roughly linear, as Zipf's law predicts.]
Desired weight for rare terms
Desired weight for frequent terms
Document frequency
idf weight
idf_t = log10(N / df_t)
where N is the number of documents in the collection and df_t is the number of documents that contain term t.
Examples for idf
Compute idf_t using the formula idf_t = log10(1,000,000 / df_t), i.e. for a collection of N = 1,000,000 documents.
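A quick check of the formula, with N = 1,000,000 as above and a few illustrative document frequencies (the df values are assumptions chosen for illustration):

import math

N = 1_000_000

def idf(df, n=N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n / df)

for df in [1, 10, 100, 1000, 10_000, 100_000, 1_000_000]:
    print(f"df={df:>9,}  idf={idf(df):.1f}")
# idf falls from 6.0 (a term in a single document)
# to 0.0 (a term occurring in every document)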
Effect of idf on ranking
Collection frequency vs. Document frequency

Term        Collection frequency   Document frequency
insurance   10,440                 3,997
try         10,422                 8,760

The two terms have nearly identical collection frequencies but very different document frequencies.
tf-idf weighting
tf-idf weight
w_{t,d} = (1 + log10 tf_{t,d}) · log10(N / df_t)
The first factor is the tf-weight, the second the idf-weight.
Best known weighting scheme in information retrieval.
Alternative names: tf.idf, tf × idf.
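Putting the two factors together, a minimal sketch (base-10 logarithms, matching the worked examples in this section):

import math

def tf_idf(tf, df, n_docs):
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# e.g. a term occurring twice in a document, in 1,000 of 1,000,000 documents:
print(round(tf_idf(2, 1000, 1_000_000), 2))  # (1 + log10 2) * 3.0 ≈ 3.9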
Summary: tf-idf
Count matrix
Binary → count → weight matrix
Documents as vectors
Queries as vectors
How do we formalize vector space similarity?
Why distance is a bad idea
[Figure: query q and documents plotted in a two-dimensional term space with axes "rich" and "poor"; Euclidean distance from q can be large even for a document whose term distribution is similar to the query's.]
Use angle instead of distance
From angles to cosines
Cosine
Length normalization
Cosine similarity between query and document
cos(q⃗, d⃗) = sim(q⃗, d⃗) = (q⃗ · d⃗) / (|q⃗| |d⃗|) = Σ_{i=1}^{|V|} q_i·d_i / (√(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²))
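A direct transcription into Python, over dense vectors indexed by the vocabulary (a sketch for clarity, not an efficient implementation):

import math

def cosine(q, d):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

print(cosine([1, 1, 0], [2, 1, 1]))  # ≈ 0.866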
Cosine for normalized vectors
Cosine similarity illustrated
[Figure: unit-length vectors v⃗(d1), v⃗(q), v⃗(d2), v⃗(d3) in the two-dimensional term space with axes "rich" and "poor".]
Cosine: Example
Computing the cosine score
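One standard way to compute cosine scores with an inverted index is term-at-a-time accumulation, sketched below. The index layout (term → list of (doc_id, weighted_tf) postings) and the precomputed document vector lengths are assumptions of this sketch:

from collections import defaultdict

def cosine_scores(query_weights, index, doc_lengths, k=10):
    """Term-at-a-time cosine scoring over an inverted index.
    query_weights: term -> query weight.
    index: term -> list of (doc_id, document term weight) postings.
    doc_lengths: doc_id -> vector length, for cosine normalization."""
    scores = defaultdict(float)
    for term, w_query in query_weights.items():
        for doc_id, w_doc in index.get(term, []):
            scores[doc_id] += w_query * w_doc
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    # return the k highest-scoring documents
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]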
Components of tf-idf weighting

Term frequency:
  n (natural)    tf_{t,d}
  l (logarithm)  1 + log(tf_{t,d})
  a (augmented)  0.5 + 0.5·tf_{t,d} / max_t(tf_{t,d})
  b (boolean)    1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)    (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)         1
  t (idf)        log(N / df_t)
  p (prob idf)   max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)       1
  c (cosine)     1 / √(w_1² + w_2² + ... + w_M²)
  u (pivoted unique)  1/u
  b (byte size)  1/CharLength^α, α < 1

The letter n denotes the default in each column: no weighting.
tf-idf example
Example: lnc.ltn
Document: l (logarithmic tf), n (no df weighting), c (cosine normalization)
Query: l (logarithmic tf), t (idf), n (no normalization)
tf-idf example: lnc.ltn
Query: “best car insurance”. Document: “car insurance auto insurance”.
            |              query               |           document            |
word        | tf-raw  tf-wght  df     idf  weight | tf-raw  tf-wght  weight  n'lized | product
auto        | 0       0        5000   2.3  0      | 1       1        1       0.52    | 0
best        | 1       1        50000  1.3  1.3    | 0       0        0       0       | 0
car         | 1       1        10000  2.0  2.0    | 1       1        1       0.52    | 1.04
insurance   | 1       1        1000   3.0  3.0    | 2       1.3      1.3     0.68    | 2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document vector length for normalization: √(1² + 0² + 1² + 1.3²) ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.

Final similarity score between query and document: Σ_i w_{q,i} · w_{d,i} = 0 + 0 + 1.04 + 2.04 = 3.08
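The same computation as a Python sketch. The term list and df values are taken from the table above; N = 1,000,000 is an assumption implied by the idf values (e.g. log10(1,000,000/1000) = 3.0):

import math

N = 1_000_000
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
q_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}
d_tf = {"auto": 1, "best": 0, "car": 1, "insurance": 2}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# ltn query weights: log tf * idf, no normalization
q_w = {t: log_tf(q_tf[t]) * math.log10(N / df[t]) for t in df}

# lnc document weights: log tf, no idf, cosine-normalized
d_raw = {t: log_tf(d_tf[t]) for t in df}
length = math.sqrt(sum(w * w for w in d_raw.values()))  # ≈ 1.92
d_w = {t: w / length for t, w in d_raw.items()}

score = sum(q_w[t] * d_w[t] for t in df)
print(round(score, 2))  # ≈ 3.07 (3.08 in the table, which rounds intermediate values)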
Summary: Ranked retrieval in the vector space model
Reading