Introduction to Information Retrieval
Lecture 6: Scoring, Term Weighting and the Vector Space Model
Contents
▪ Parametric and zone indexes
▪ Ranked retrieval
▪ Term weighting schemes
▪ Vector space scoring
[Figure: User view]
Zones
▪ A zone is an identified region within a document
▪ E.g., Title, Abstract, Bibliography
▪ The contents of a zone are free text
▪ Not a “finite” vocabulary
▪ Building an index for each zone allows queries like
▪ find documents with merchant in the title and william in the
author list and the phrase gentle rain in the body (a minimal
sketch of a zone index follows)
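Below is a minimal in-memory sketch of per-zone indexing, assuming one postings map per zone; the names (`zone_index`, `index_doc`) and the set-based postings are mine, not the lecture's, and the phrase query on the body would additionally need positional postings, which this sketch omits:

```python
from collections import defaultdict

# One term -> set-of-docIDs postings map per zone (illustrative only).
zone_index = {
    "title": defaultdict(set),
    "author": defaultdict(set),
    "body": defaultdict(set),
}

def index_doc(doc_id, zones):
    """zones maps a zone name to the free text of that zone."""
    for zone, text in zones.items():
        for term in text.lower().split():
            zone_index[zone][term].add(doc_id)

index_doc(1, {"title": "The Merchant of Venice",
              "author": "William Shakespeare",
              "body": "the quality of mercy droppeth as the gentle rain"})

# Zone-qualified Boolean query: merchant in title AND william in author.
hits = zone_index["title"]["merchant"] & zone_index["author"]["william"]
print(hits)  # {1}
```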
Term-document matrices (Sec. 6.2)

Binary incidence matrix (1 = the term occurs in the play):

Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                   1             0          0       0        1
Brutus              1                   1             0          1       0        0
Caesar              1                   1             0          1       1        1
Calpurnia           0                   1             0          0       0        0
Cleopatra           1                   0             0          0       0        0
mercy               1                   0             1          1       1        1
worser              1                   0             1          1       1        0

Term-frequency (count) matrix:

Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            157                  73             0          0       0        0
Brutus              4                 157             0          1       0        0
Caesar            232                 227             0          2       1        1
Calpurnia           0                  10             0          0       0        0
Cleopatra          57                   0             0          0       0        0
mercy               2                   0             3          5       5        1
worser              2                   0             1          1       1        0
Term frequency tf
▪ The term frequency tf_{t,d} of term t in document d is
defined as the number of times that t occurs in d.
▪ We want to use tf when computing query-document
match scores. But how?
▪ Raw term frequency is not what we want:
▪ A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
▪ But not 10 times more relevant.
▪ Relevance does not increase proportionally with term
frequency.
NB: frequency = count in IR
Log-frequency weighting (Sec. 6.2)
▪ The log-frequency weight of term t in d is

  w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}
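A direct transcription of this weight as a sketch in Python (the function name is mine):

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) for tf > 0, and 0 otherwise."""
    return 1 + math.log10(tf) if tf > 0 else 0

# Term frequencies counted from a toy document.
tf = Counter("car insurance auto insurance".split())
print(log_tf_weight(tf["insurance"]))  # 1 + log10(2) ≈ 1.30
print(log_tf_weight(tf["best"]))       # 0, since the term is absent
```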
Document frequency
▪ Rare terms are more informative than frequent terms
▪ Recall stop words
▪ → We want a high weight for rare terms.
idf weight (Sec. 6.2.1)
▪ df_t is the document frequency of t: the number of
documents that contain t
▪ df_t is an inverse measure of the informativeness of t
▪ df_t ≤ N
▪ We define the idf (inverse document frequency) of t by

  \mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)

▪ We use log10(N/df_t) instead of N/df_t to “dampen” the effect
of idf.
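A one-line sketch of this definition (assuming 1 ≤ df ≤ N):

```python
import math

def idf(df, N):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
print(idf(1_000, N))      # 3.0: a rare term gets a high weight
print(idf(1_000_000, N))  # 0.0: a term in every doc carries no signal
```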
tf-idf weighting
▪ The tf-idf weight of a term is the product of its tf weight
and its idf weight:

  \mathrm{tf\text{-}idf}_{t,d} = \log_{10}(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)

▪ Best-known weighting scheme in information retrieval
▪ Note: the “-” in tf-idf is a hyphen, not a minus sign!
▪ Alternative names: tf.idf, tf × idf
▪ Increases with the number of occurrences within a
document (tf)
▪ Increases with the rarity of the term in the collection (idf)
Scoring with tf-idf (Sec. 6.3)
▪ The score of a document d for a query q is the sum of the
tf-idf weights of the terms that occur in both:

  \mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf\text{-}idf}_{t,d}
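A sketch of this overlap score, using the tf-idf variant above; the df values here are hypothetical, chosen only so the example runs:

```python
import math
from collections import Counter

def tf_idf(tf, df, N):
    """The slide's variant: log10(1 + tf) * log10(N / df)."""
    return math.log10(1 + tf) * math.log10(N / df)

def score(query, doc_tf, df, N):
    """Score(q, d): sum tf-idf over terms in both query and document."""
    return sum(tf_idf(doc_tf[t], df[t], N)
               for t in set(query.split()) & set(doc_tf))

N = 1_000_000
df = {"car": 10_000, "insurance": 1_000, "auto": 5_000, "best": 50_000}
doc_tf = Counter("car insurance auto insurance".split())
print(score("best car insurance", doc_tf, df, N))  # ≈ 2.03
```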
Vector Similarity (Sec. 6.3)
▪ How do we quantify the similarity between two
documents in this vector space?
▪ A first attempt: the magnitude of the vector
difference between the two document vectors.
▪ Drawback: two documents with very similar content can
have a large vector difference simply because one is
much longer than the other.
▪ The solution that compensates for the effect of document
length is to compute the cosine similarity.
cosine(document, document) (Sec. 6.3)

  \mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}

▪ The numerator is the dot product of the two document
vectors; the denominator is the product of their Euclidean
lengths.
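A short sketch of the formula, with a check of the length-compensation point made above (a document concatenated with itself scales its vector but keeps cosine similarity at 1):

```python
import math

def cosine(u, v):
    """cos(u, v) = (u · v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d = [2.3, 0.0, 2.0, 3.9]
print(cosine(d, [2 * x for x in d]))  # 1.0 (up to float rounding)
```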
Example (Sec. 6.4): N = 1,000,000
Document: car insurance auto insurance
Query: best car insurance

Term        Query wt   Document wt   Product (length-normalized)
auto          0            2.3          0
best          1.3          0            0
car           2.0          2.0          0.22
insurance     3.0          3.9          0.63

▪ Weights are tf-idf values with log-frequency tf; e.g., insurance in
the document: (1 + log10 2) × 3.0 ≈ 3.9

Query length = \sqrt{0^2 + 1.3^2 + 2^2 + 3^2} \approx 3.8
Doc length = \sqrt{2.3^2 + 0^2 + 2^2 + 3.9^2} \approx 4.9
Score = 0 + 0 + 0.22 + 0.63 = 0.85
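A quick re-derivation of these numbers in Python; the two weight vectors are read off the table above:

```python
import math

q = [0, 1.3, 2.0, 3.0]    # query weights: auto, best, car, insurance
d = [2.3, 0.0, 2.0, 3.9]  # document weights, same term order

q_len = math.sqrt(sum(x * x for x in q))
d_len = math.sqrt(sum(x * x for x in d))
score = sum((a / q_len) * (b / d_len) for a, b in zip(q, d))
# 3.8 4.9 0.83 (the slide's per-term rounding gives 0.22 + 0.63 = 0.85)
print(round(q_len, 1), round(d_len, 1), round(score, 2))
```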
Introduction to Information Retrieval
Lecture 8: Evaluating IR Systems
Evaluating an IR system (Sec. 8.6)
▪ Note: the information need is translated into a query
▪ Relevance is assessed relative to the information
need, not the query
▪ E.g., Information need: I'm looking for information on
whether travelling by train from Cairo to Assuit is
more effective than flying.
▪ Query: travelling by train from Cairo to Assuit
effective
▪ Evaluate whether the doc addresses the information
need, not whether it has these words
The contingency table of retrieval outcomes (Sec. 8.3):

                 Relevant   Nonrelevant
Retrieved           tp          fp
Not retrieved       fn          tn
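Precision and recall follow directly from this table: P = tp/(tp + fp) and R = tp/(tp + fn). A minimal sketch (the example counts are made up):

```python
def precision_recall(tp, fp, fn):
    """P = tp / (tp + fp); R = tp / (tp + fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# 30 relevant docs among 50 retrieved; 60 relevant docs exist in total.
p, r = precision_recall(tp=30, fp=20, fn=30)
print(p, r)  # 0.6 0.5
```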
Precision/Recall (Sec. 8.3)
▪ You can get high recall (but low precision) by
retrieving all docs for all queries!
▪ Recall is a non-decreasing function of
the number of docs retrieved
A combined measure: F (Sec. 8.3)
▪ A combined measure that assesses the precision/recall
tradeoff is the F measure (a weighted harmonic mean):

  F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}

▪ People usually use the balanced F1 measure
▪ i.e., with α = ½:

  F_1 = \frac{2PR}{P + R}
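A sketch of both forms, showing they agree at α = ½ (the values continue the precision/recall example above):

```python
def f_measure(p, r, alpha=0.5):
    """F = 1 / (alpha * (1/P) + (1 - alpha) * (1/R))."""
    return 1 / (alpha / p + (1 - alpha) / r)

print(f_measure(0.6, 0.5))          # ≈ 0.545
print(2 * 0.6 * 0.5 / (0.6 + 0.5))  # F1 = 2PR/(P+R): same value
```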
[Figure: precision-recall curve (Sec. 8.4); precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1)]