Computing Scores in a Complete Search System

2011
Overview

  Recap
  Why rank?
  More on cosine
  Implementation of ranking
  The complete search system
Recap
Log frequency weighting

The log frequency weight of term t in d is defined as follows:

    w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}
idf weight

The document frequency df_t is defined as the number of documents that t occurs in. We define the idf weight of term t as follows:

    \mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}
tf-idf weight

The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}
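These three definitions are easy to check numerically. Below is a minimal Python sketch, not from the lecture; the toy collection and function names are illustrative:

```python
import math

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term, docs):
    """Inverse document frequency: log10(N / df_t)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """tf-idf weight: product of the tf weight and the idf weight."""
    return log_tf(doc.count(term)) * idf(term, docs)

# Toy collection: each document is a list of tokens.
docs = [["best", "car", "insurance"],
        ["car", "insurance", "auto"],
        ["auto", "repair"]]
print(tf_idf("insurance", docs[0], docs))  # "insurance" occurs in 2 of 3 docs
```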
Cosine similarity between query and document

    \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}

q_i is the tf-idf weight of term i in the query. d_i is the tf-idf weight of term i in the document. \vec{q}/|\vec{q}| and \vec{d}/|\vec{d}| are length-1 vectors (i.e., normalized). |\vec{q}| and |\vec{d}| are the lengths of \vec{q} and \vec{d}.
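A small sketch of this computation over sparse vectors, each represented as a dict from term to tf-idf weight (illustrative code, not the lecture's):

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse tf-idf vectors (dicts: term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    q_len = math.sqrt(sum(w * w for w in q.values()))   # |q|
    d_len = math.sqrt(sum(w * w for w in d.values()))   # |d|
    return dot / (q_len * d_len) if q_len and d_len else 0.0

q = {"best": 1.3, "car": 1.0, "insurance": 1.0}
d = {"car": 1.0, "insurance": 1.0, "auto": 2.0}
print(round(cosine(q, d), 2))
```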
[Figure: document vectors in the two-dimensional term space spanned by "rich" and "poor"; shows the length-normalized vector v(d3).]
tf-idf example

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length: \sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92. Normalized weights: 1/1.92 \approx 0.52 and 1.3/1.92 \approx 0.68.

Final similarity score between query and document: \sum_i w_{q_i} \cdot w_{d_i} = 0 + 0 + 1.04 + 2.04 = 3.08.
Take-away today

  The importance of ranking: User studies at Google
  Length normalization: Pivot normalization
  Implementation of ranking
  The complete search system
Why rank?
Next: more data showing that users only look at a few results. In fact, in the vast majority of cases they examine only 1, 2, or 3 results.
The following slides are from Dan Russell's JCDL talk. Dan Russell is the Über Tech Lead for Search Quality & User Happiness at Google.
Importance of ranking: Summary

Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).

Clicking: The distribution is even more skewed for clicking. In 1 out of 2 cases, users click on the top-ranked page. Even if the top-ranked page is not relevant, 30% of users will click on it.

Getting the ranking right is very important. Getting the top-ranked page right is most important.
More on cosine
Why distance is a bad idea

The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar. That's why we do length normalization or, equivalently, use cosine to compute query-document matching scores.
What ranking do we expect in the vector space model? What can we do about this?
Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average). We adjust cosine normalization by a linear adjustment: "turning" the average normalization on the pivot. Effect: similarities of short documents with the query decrease, and similarities of long documents with the query increase. This removes the unfair advantage that short documents have.
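One common parameterization of this idea is pivoted length normalization (Singhal et al.). The sketch below is an illustration under assumptions, with an arbitrary slope value, not necessarily the exact scheme the lecture has in mind:

```python
def pivoted_normalizer(norm, pivot, slope=0.75):
    """Replace the cosine normalizer |d| by (1 - slope) * pivot + slope * |d|.
    For |d| < pivot the pivoted value exceeds |d|, so short documents score
    lower; for |d| > pivot it is smaller, so long documents score higher."""
    return (1.0 - slope) * pivot + slope * norm

def pivoted_score(dot_product, doc_norm, pivot, slope=0.75):
    """Query-document score with pivoted instead of plain cosine normalization."""
    return dot_product / pivoted_normalizer(doc_norm, pivot, slope)
```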
Pivot normalization

[Figure: pivot normalization. Source: Lillian Lee]
Implementation of ranking
[Figure: postings lists augmented with term frequencies; each posting stores a (docID, tf) pair.]
Term frequencies in the index

In each posting, store tf_{t,d} in addition to docID d, as an integer frequency, not as a (log-)weighted real number, because real numbers are difficult to compress. Unary code is effective for encoding term frequencies. Why? Overall, the additional space requirements are small: less than a byte per posting with bitwise compression, or a byte per posting with variable byte code.
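(Unary code works well because most term frequencies are very small, and unary encodes small integers compactly.) As a concrete illustration, here is a sketch of standard variable byte encoding; the encoding is textbook material, the function name is mine:

```python
def vb_encode(n):
    """Variable byte code: 7 payload bits per byte; the high bit is set
    only on the last byte of a number, marking where it ends."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # set the termination bit on the final byte
    return bytes(out)

print(vb_encode(5).hex())       # small tf: one byte, '85'
print(vb_encode(214577).hex())  # larger number: three bytes, '0d0cb1'
```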
How do we compute the top k in ranking?

In many applications, we don't need a complete ranking; we just need the top k for a small k (e.g., k = 100). If we don't need a complete ranking, is there an efficient way of computing just the top k?

Naive approach:
  Compute scores for all N documents
  Sort
  Return the top k
Use a binary min heap

A binary min heap is a binary tree in which each node's value is less than the values of its children. It takes O(N log k) operations to construct (where N is the number of documents), and we can then read off the k winners in O(k log k) steps.
33 / 58
Recap
Why rank?
More on cosine
Implementation of ranking
0.85
0.7
0.9
0.97
0.8
0.95
33 / 58
Selecting the top k scoring documents in O(N log k)

Goal: keep the top k documents seen so far; use a binary min heap. To process a new document d with score s:

  Get the current minimum h_m of the heap (O(1))
  If s ≤ h_m, skip to the next document
  If s > h_m, heap-delete-root (O(log k)), then heap-add ⟨d, s⟩ (O(log k))
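A compact Python version of this selection, using the standard library's heapq as the binary min heap (a sketch; variable names are illustrative):

```python
import heapq

def select_top_k(scored_docs, k):
    """Keep the k highest-scoring (docID, score) pairs with a size-k min heap.
    The heap root is the current minimum h_m; total cost is O(N log k)."""
    heap = []  # min heap of (score, docID)
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:                      # s > h_m
            heapq.heapreplace(heap, (score, doc_id))  # delete root, then add
        # else: s <= h_m, skip to the next document
    return sorted(heap, reverse=True)                 # read off the k winners

print(select_top_k([(1, 0.9), (2, 0.3), (3, 0.97), (4, 0.6)], k=2))
# [(0.97, 3), (0.9, 1)]
```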
Document-at-a-time processing

Both docID-ordering and PageRank-ordering impose a consistent ordering on documents in postings lists. Computing cosines in this scheme is document-at-a-time: we complete the computation of the query-document similarity score of document d_i before starting to compute the similarity score of d_{i+1}.

Alternative: term-at-a-time processing.
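A document-at-a-time scorer can be written as a k-way merge over docID-ordered postings lists. The sketch below makes simplifying assumptions (in-memory lists, plain dot-product scores, no skipping):

```python
import heapq

def document_at_a_time(postings, query_weights, k):
    """postings: term -> list of (docID, w_td), each list sorted by docID.
    Merge all lists by docID so each document's score is finished before
    the next document's score is started."""
    iters = {t: iter(pl) for t, pl in postings.items()}
    frontier = []  # min heap of (docID, term, w_td): the next posting per term
    for t, it in iters.items():
        head = next(it, None)
        if head is not None:
            heapq.heappush(frontier, (head[0], t, head[1]))
    scores = []
    while frontier:
        doc = frontier[0][0]
        score = 0.0
        while frontier and frontier[0][0] == doc:  # drain all postings of doc
            _, t, w_td = heapq.heappop(frontier)
            score += query_weights[t] * w_td
            nxt = next(iters[t], None)
            if nxt is not None:
                heapq.heappush(frontier, (nxt[0], t, nxt[1]))
        scores.append((doc, score))
    return heapq.nlargest(k, scores, key=lambda x: x[1])
```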
Term-at-a-time processing

Simplest case: completely process the postings list of the first query term, creating an accumulator for each docID you encounter; then completely process the postings list of the second query term, and so forth.
CosineScore(q)
 1  float Scores[N] = 0
 2  float Length[N]
 3  for each query term t
 4  do calculate w_{t,q} and fetch postings list for t
 5     for each pair (d, tf_{t,d}) in postings list
 6     do Scores[d] += w_{t,d} × w_{t,q}
 7  Read the array Length
 8  for each d
 9  do Scores[d] = Scores[d] / Length[d]
10  return Top k components of Scores[]

The elements of the array Scores are called accumulators.
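For reference, a hedged Python rendering of CosineScore. It keeps accumulators lazily in a dict rather than a dense array of size N (which anticipates the next point), and the data layout is an assumption:

```python
def cosine_score(query_weights, index, length, k):
    """query_weights: term -> w_tq; index: term -> list of (docID, w_td);
    length: docID -> cosine normalization factor for the document."""
    scores = {}  # accumulators, created only for docs we actually see
    for t, w_tq in query_weights.items():
        for d, w_td in index.get(t, []):       # fetch postings list for t
            scores[d] = scores.get(d, 0.0) + w_td * w_tq
    for d in scores:
        scores[d] /= length[d]                 # divide by document length
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
```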
Accumulators

For the web (20 billion documents), an array of accumulators A in memory is infeasible, so we only create accumulators for docs occurring in postings lists. Equivalently: we do not create accumulators for docs with zero scores (i.e., docs that do not contain any of the query terms).
Accumulators: Example

[Figure: postings lists for Brutus and Caesar.]

For the query [Brutus Caesar]: we only need accumulators for docIDs 1, 5, 7, 13, 17, 83, 87. We don't need accumulators for 8, 40, 85.
Removing bottlenecks

Use a heap / priority queue as discussed earlier. We can further limit the computation to docs with non-zero cosines on rare (high-idf) words, or enforce conjunctive search (à la Google): non-zero cosines on all words in the query. Example: just one accumulator for [Brutus Caesar] in the example above, because only d1 contains both words.
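The conjunctive filter amounts to intersecting the docID sets of the query terms' postings lists before creating any accumulators. A minimal sketch (my own formulation, with illustrative postings):

```python
def conjunctive_candidates(postings_lists):
    """Docs eligible for accumulators under conjunctive search: those that
    appear in every query term's postings list."""
    id_sets = [set(doc_id for doc_id, _ in pl) for pl in postings_lists]
    return set.intersection(*id_sets) if id_sets else set()

brutus = [(1, 2.0), (7, 1.0), (13, 1.0)]
caesar = [(1, 1.0), (5, 1.0), (17, 1.0)]
print(conjunctive_candidates([brutus, caesar]))  # {1}: only d1 has both words
```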
The complete search system
Tiered indexes

Basic idea:
  Create several tiers of indexes, corresponding to importance of indexing terms.
  During query processing, start with the highest-tier index.
  If the highest-tier index returns at least k (e.g., k = 100) results: stop and return results to the user.
  If we've only found < k hits: repeat for the next index in the tier cascade.
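In code, the tier cascade might look like the following sketch. It assumes each tier object exposes a search(query) method returning ranked docIDs; that interface is an assumption for illustration, not part of the lecture:

```python
def tiered_search(query, tiers, k=100):
    """Query tiers from highest to lowest; stop as soon as k hits are found."""
    hits, seen = [], set()
    for tier in tiers:                 # tiers ordered: tier 1 first
        for doc in tier.search(query):
            if doc not in seen:        # a doc may appear in several tiers
                seen.add(doc)
                hits.append(doc)
        if len(hits) >= k:             # enough results: stop descending
            break
    return hits[:k]
```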
Tiered index

[Figure: example of a tiered index over the terms auto, best, and car insurance and the documents Doc1, Doc2, Doc3, split into Tier 1 (most important postings) down to Tier 3.]
Tiered indexes

The use of tiered indexes is believed to be one of the reasons that Google's search quality was significantly higher initially (2000/01) than that of its competitors (along with PageRank, use of anchor text, and proximity constraints).
Exercise

Design criteria for a tiered system:
  Each tier should be an order of magnitude smaller than the next tier.
  The top 100 hits for most queries should be in tier 1, the top 100 hits for most of the remaining queries in tier 2, etc.
  We need a simple test for "can I stop at this tier or do I have to go to the next one?" There is no advantage to tiering if we have to hit most tiers for most queries anyway.

Question 1: Consider a two-tier system where the first tier indexes titles and the second tier everything. What are potential problems with this type of tiering?

Question 2: Can you think of a better way of setting up a multitier system? Which zones of a document should be indexed in the different tiers (title, body of document, others)? What criterion do you want to use for including a document in tier 1?
Components we have introduced thus far:

  Document preprocessing (linguistic and otherwise)
  Positional indexes
  Tiered indexes
  Spelling correction
  k-gram indexes for wildcard queries and spelling correction
  Query processing
  Document scoring
  Term-at-a-time processing
Components we have not yet covered:

  Document cache: needed for generating snippets (= dynamic summaries)
  Zone indexes: separate indexes for different zones of a document (the body, all highlighted text, anchor text, text in metadata fields, etc.)
  Machine-learned ranking functions
  Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other)
  Query parser
Take-away today

  The importance of ranking: User studies at Google
  Length normalization: Pivot normalization
  Implementation of ranking
  The complete search system