• Retrieval variables:
– queries, documents, terms, relevance judgements, users, information needs
• Advantages:
– efficient
– predictable, easy to explain
– structured queries
– work well when you know exactly what docs you want
Allan, Ballesteros, Croft, and/or Turtle
Exact-match Retrieval Model
• Disadvantages:
– query formulation difficult for most users
– difficulty increases with collection size (why?)
– indexing vocabulary same as query vocabulary
– acceptable precision generally means unacceptable recall
– ranking models are consistently better
• Boolean queries
– Used by Boolean retrieval model and in other models
– Boolean query ≠ Boolean model: other retrieval models can also evaluate Boolean queries
• Proximity operators
– Phrases - “West Publishing”
– Word proximity - West /5 Publishing
– Same sentence - Massachusetts /s technology
– Same paragraph - “information retrieval” /p exact-match
• Restrictions
– (e.g. DATE(AFTER 1992 & BEFORE 1995))
• Document display
– sort order specified by user (e.g. date)
– query term highlighting
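The exact-match operators above can be sketched with a toy inverted index. Everything below — the index contents, the document IDs, and the AND/OR/NOT helper names — is an illustrative assumption, not part of the slides; proximity operators such as /5 or /s would additionally require positional postings, which this sketch omits.

```python
# Toy inverted index: term -> set of document IDs (illustrative data, not from the slides)
index = {
    "west":       {1, 2, 5},
    "publishing": {2, 3, 5},
    "technology": {3, 4},
}

def AND(a, b):
    # Documents containing both terms: set intersection of the postings
    return index.get(a, set()) & index.get(b, set())

def OR(a, b):
    # Documents containing either term: set union of the postings
    return index.get(a, set()) | index.get(b, set())

def NOT(a, b):
    # Documents containing a but not b: set difference of the postings
    return index.get(a, set()) - index.get(b, set())

print(sorted(AND("west", "publishing")))   # exact match: no ranking, a doc is in or out
```

Note how every result is an unordered set — exactly the "no ranking" behavior of exact-match retrieval; display order must come from a user-specified sort such as date.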
• Probabilistic models
• User Task
– Retrieval
– Browsing
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the three-dimensional term space T1, T2, T3]
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
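A minimal sketch of computing fij from a document, using a toy sentence and the common max-frequency normalization — both the example text and the normalization are assumptions for illustration, not stated on this slide:

```python
from collections import Counter

# Toy document (illustrative); in practice this comes from tokenizing document j
doc = "west publishing publishes legal documents for west lawyers".split()

f = Counter(doc)                              # f_ij: raw frequency of term i in document j
max_f = max(f.values())
tf = {t: c / max_f for t, c in f.items()}     # normalized tf (assumed normalization)

print(f["west"], tf["west"], tf["publishing"])
```

The normalization keeps long documents from dominating simply because every raw count in them is larger.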
Cosine Similarity

CosSim(dj, q) = (dj · q) / (|dj| |q|) = Σi=1..t (wij · wiq) / sqrt( (Σi=1..t wij²) (Σi=1..t wiq²) )

where wij is the weight of term i in document j and wiq is the weight of term i in the query.

Weighted example:
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q as vectors in term space (t1, t2); the angle between each document vector and Q determines the score]
D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
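The worked numbers above can be checked directly; a minimal sketch, using the example vectors:

```python
import math

# Example vectors over terms (T1, T2, T3)
D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]

def cos_sim(d, q):
    num = sum(x * y for x, y in zip(d, q))                                   # inner product
    den = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))  # vector lengths
    return num / den

print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13
```

The inner products alone are 10 and 2 (a 5x gap); normalizing by the vector lengths widens the gap to roughly 6x, since D2 is the longer vector.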
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted
vectors, dj, for keyword vocabulary V.
2. Convert the query to a tf-idf weighted vector q.
3. For each dj in D do
Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.
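The steps above can be sketched end to end. The toy collection, the raw-tf-times-idf weighting, and all names below are illustrative assumptions, not the slides' specification:

```python
import math
from collections import Counter

# Toy collection D (illustrative)
docs = {
    "d1": "cat cat dog",
    "d2": "dog bird",
    "d3": "cat bird bird",
}
query = "cat bird"

# Step 1: tf-idf weighted vectors dj over vocabulary V (raw tf * idf, an assumed weighting)
N = len(docs)
tfs = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for tf in tfs.values() for t in tf)          # document frequency per term
idf = {t: math.log(N / df[t]) for t in df}
vecs = {d: {t: c * idf[t] for t, c in tf.items()} for d, tf in tfs.items()}

# Step 2: weight the query the same way
q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.split()).items()}

# Step 3: score each dj with cosine similarity
def cos(a, b):
    num = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in a)
    den = (math.sqrt(sum(w * w for w in a.values()))
           * math.sqrt(sum(w * w for w in b.values())))
    return num / den if den else 0.0

# Step 4: sort documents by decreasing score
ranking = sorted(((cos(v, q), d) for d, v in vecs.items()), reverse=True)
print(ranking)
```

As expected, the document containing both query terms ranks first; documents sharing one term follow.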