Information Retrieval
Computer Science Tripos Part II
Helen Yannakoudakis
helen.yannakoudakis@cl.cam.ac.uk
2018
Based on slides from Simone Teufel and Ronan Cummins
Overview
1 Recap
3 Term frequency
IR System Components
[Diagram: Document Collection → Document Normalisation → Indexer → Indexes; Query UI → Query Normalisation → Ranking/Matching Module → Set of relevant documents]
Upcoming
Problem with Boolean search: Feast or famine
Boolean queries often have either too few or too many results.

Query 1: standard AND user AND dlink AND 650
→ 200,000 hits. Feast!

Query 2: standard AND user AND dlink AND 650 AND no AND card AND found
→ 0 hits. Famine!
Ranked retrieval models
Scoring as the basis of ranked retrieval
Take 1: Scoring with the Jaccard coefficient
jaccard(A, B) = |A ∩ B| / |A ∪ B|   (where A ≠ ∅ or B ≠ ∅)
jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = ∅
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.
Jaccard coefficient: Scoring example
Query
“ides of March”
Document 1
“Caesar died in March”
Document 2
“the long March”
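As an illustration (not from the slides), the example above can be scored in Python with simple lowercased whitespace tokenization:

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| for two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()   # shares only "march" with the query
doc2 = "the long march".split()         # shares only "march" with the query

print(round(jaccard(query, doc1), 3))   # 1/6 ≈ 0.167
print(round(jaccard(query, doc2), 3))   # 1/5 = 0.2
```

Note that Document 2 scores higher only because it is shorter, a first hint at the length problem discussed next.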
What’s wrong with Jaccard?
Binary incidence matrix
Count matrix
Bag of words model
Term frequency tf
Instead of raw frequency: Log-frequency weighting
The log-frequency weight of term t in document d:
w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise.

tf_{t,d}:  0   1   2    10   1000
w_{t,d}:   0   1   1.3   2     4
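The weighting can be sketched as a small Python function that reproduces the table above:

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
# reproduces the table: 0→0, 1→1, 2→1.3, 10→2, 1000→4
```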
Zipf’s law
How many frequent vs. infrequent words should we expect in a collection?
In natural language, there are a small number of very high-frequency words and a large number of low-frequency words.
Word frequency distributions obey a power law (Zipf’s law).

Zipf’s law
The i-th most frequent term has collection frequency cf_i proportional to 1/i:
cf_i ∝ 1/i

So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences, cf_2 = (1/2) cf_1 . . .
. . . and the third most frequent term (and) has a third as many occurrences, cf_3 = (1/3) cf_1, etc.
Equivalent: cf_i = p · i^k and log cf_i = log p + k · log i (for k = −1)
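A minimal sketch of the “equivalent” log-log form, using a made-up cf_1: ideal Zipfian frequencies cf_i = cf_1 / i lie on a straight line of slope k = −1 in log-log space.

```python
import math

cf1 = 60_000  # hypothetical frequency of the most frequent term
ranks = list(range(1, 101))
cfs = [cf1 / i for i in ranks]  # ideal Zipfian frequencies cf_i = cf_1 / i

# Slope between successive points in (log i, log cf_i) space
slopes = [
    (math.log(cfs[i]) - math.log(cfs[i - 1]))
    / (math.log(ranks[i]) - math.log(ranks[i - 1]))
    for i in range(1, len(ranks))
]
print(min(slopes), max(slopes))  # all slopes ≈ -1
```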
Frequency in Moby Dick
[Figure: word frequencies in Moby Dick by rank, from “the”, “of”, “and” down the long tail to words like “upon”; frequencies range from 0 to about 14,000. There are a small number of high-frequency words and many low-frequency words.]
Zipf’s Law: Examples from 5 Languages
Zipf’s law for Reuters
Other collections (allegedly) obeying power laws
Sizes of settlements
Frequency of access to web pages
Income distributions amongst top earning 3% individuals
Korean family names
Size of earthquakes
Word senses per word
Notes in musical performances
...
Desired weight for rare terms
Desired weight for frequent terms
Document frequency
idf weight
idf_t = log10(N / df_t)
Examples for idf (suppose N = 1,000,000)
Compute idf_t using the formula: idf_t = log10(1,000,000 / df_t)
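A short sketch computing idf values for N = 1,000,000; the df figures below are the ones used in the lnc.ltn worked example later in the lecture.

```python
import math

N = 1_000_000

def idf(df, N=N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

for term, df in [("insurance", 1000), ("auto", 5000),
                 ("car", 10000), ("best", 50000)]:
    print(term, round(idf(df), 1))
# insurance 3.0, auto 2.3, car 2.0, best 1.3
```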
Effect of idf on ranking
Collection frequency vs. Document frequency

Term        Collection frequency    Document frequency
insurance   10440                   3997
try         10422                   8760
tf–idf weighting
tf–idf weight: w_{t,d} = (1 + log tf_{t,d}) · log(N / df_t)
(the first factor is the tf weight, the second the idf weight)
Best known weighting scheme in information retrieval (alternative names: tf.idf, tf × idf)
Increases with the number of occurrences in the document (tf)
Increases with the rarity of the term in the entire collection (idf)
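A minimal sketch of the weight as a Python function; the example numbers are illustrative.

```python
import math

def tfidf(tf, df, N):
    """tf-idf weight: (1 + log10(tf)) * log10(N / df); 0 when tf = 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# e.g. a term occurring twice in a document, in 1,000 of 1,000,000 docs:
print(round(tfidf(2, 1000, 1_000_000), 2))  # (1 + 0.30) * 3.0 ≈ 3.9
```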
Count matrix
Binary → count → weight matrix
Documents as vectors
Queries as vectors
How do we formalize vector space similarity?
Why distance is a bad idea
[Figure: query q and documents plotted in the rich–poor term space.]
Use angle instead of distance
From angles to cosines
Length normalization
How do we compute the cosine?
A vector can be (length-) normalized by dividing each of its components by its length – here we use the L2 norm:

||x||_2 = sqrt(Σ_i x_i²)

cos(q, d) = sim(q, d) = (q · d) / (|q| |d|) = Σ_{i=1}^{|V|} q_i d_i / (sqrt(Σ_{i=1}^{|V|} q_i²) · sqrt(Σ_{i=1}^{|V|} d_i²))
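The formula translates directly into Python; this is a plain sketch, not an optimized implementation.

```python
import math

def l2(x):
    """L2 norm: sqrt of the sum of squared components."""
    return math.sqrt(sum(xi * xi for xi in x))

def cosine(q, d):
    """Cosine similarity: dot product over the product of L2 norms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (l2(q) * l2(d))

# Vectors at a 45-degree angle:
print(round(cosine([1.0, 0.0], [1.0, 1.0]), 3))  # cos 45° ≈ 0.707
```

For already length-normalized vectors, the denominator is 1 and the cosine reduces to a simple dot product, which is why normalization is done at indexing time.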
Cosine for normalized vectors
Cosine similarity illustrated
[Figure: unit vectors v(d1), v(q), v(d2), v(d3) in the rich–poor plane; the cosine of the angle between the query vector and a document vector gives the similarity.]
Cosine: Example
Term frequencies (raw counts) in three novels (SaS: Sense and Sensibility, PaP: Pride and Prejudice, WH: Wuthering Heights):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38
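A sketch reproducing this example end to end: log tf weighting (no idf), cosine (length) normalization, then pairwise dot products.

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

vecs = {doc: normalize([log_tf(counts[doc][t]) for t in terms])
        for doc in counts}

def sim(a, b):
    # normalized vectors, so cosine = dot product
    return sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(round(sim("SaS", "PaP"), 2))  # ≈ 0.94
print(round(sim("SaS", "WH"), 2))   # ≈ 0.79
print(round(sim("PaP", "WH"), 2))   # ≈ 0.69
```

As expected, the two Austen novels (SaS, PaP) are more similar to each other than either is to Wuthering Heights.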
Components of tf–idf weighting

Term frequency:
  n (natural): tf_{t,d}
  l (logarithm): 1 + log(tf_{t,d})
  a (augmented): 0.5 + 0.5 · tf_{t,d} / max_t(tf_{t,d})
  b (boolean): 1 if tf_{t,d} > 0, 0 otherwise
  L (log ave): (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no): 1
  t (idf): log(N / df_t)
  p (prob idf): max{0, log((N − df_t) / df_t)}

Normalization:
  n (none): 1
  c (cosine): 1 / sqrt(w_1² + w_2² + … + w_M²)
  u (pivoted unique): 1/u
  b (byte size): 1/CharLength^α, α < 1

Default: no weighting
tf–idf example
Example: lnc.ltn
Document: l (logarithmic tf), n (no df weighting), c (cosine normalization)
Query: l (logarithmic tf), t (idf), n (no normalization)
tf-idf example: lnc.ltn
Query: “best car insurance”. Document: “car insurance auto insurance”.

                 query                            document                     product
word        tf-raw  tf-wght  df      idf  weight  tf-raw  tf-wght  weight  n’lized
auto        0       0        5000    2.3  0       1       1        1       0.52     0
best        1       1        50000   1.3  1.3     0       0        0       0        0
car         1       1        10000   2.0  2.0     1       1        1       0.52     1.04
insurance   1       1        1000    3.0  3.0     2       1.3      1.3     0.68     2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n’lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length: sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.

Final similarity score between query and document: Σ_i w_{qi} · w_{di} = 0 + 0 + 1.04 + 2.04 = 3.08
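The whole lnc.ltn computation can be checked with a short Python sketch; exact arithmetic gives ≈3.07, while the slide’s 3.08 comes from rounding the normalized document weights to 0.52 and 0.68 first.

```python
import math

# Query side: ltn (log tf × idf, no normalization).
# Document side: lnc (log tf, no idf, cosine normalization).
N = 1_000_000
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
q_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}
d_tf = {"auto": 1, "best": 0, "car": 1, "insurance": 2}
terms = ["auto", "best", "car", "insurance"]

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# ltn query weights
q_w = [log_tf(q_tf[t]) * math.log10(N / df[t]) for t in terms]

# lnc document weights, cosine-normalized
d_raw = [log_tf(d_tf[t]) for t in terms]
norm = math.sqrt(sum(w * w for w in d_raw))
d_w = [w / norm for w in d_raw]

score = sum(qw * dw for qw, dw in zip(q_w, d_w))
print(round(score, 2))  # ≈ 3.07
```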
Summary: Ranked retrieval in the vector space model
Reading