Skip Pointers and Phrase Queries
Introduction to
Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
Brutus: 2 4 8 41 48 64 128
Caesar: 1 2 3 8 11 17 21 31
Intersection: 2 8
Can we do better?
Yes (if the index isn’t changing too fast).
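For reference, the baseline being improved on is the plain merge that walks both sorted postings lists in step. The following is a minimal sketch of that merge, not the slides' own code; the function name `intersect` is chosen here for illustration.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists by walking both in step."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each step advances at least one of the two positions, so the work is linear in the combined length of the lists.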
Skip pointer targets: 11 31
Postings: 1 2 3 8 11 17 21 31
Why?
To skip postings that will not figure in the search results.
How?
Where do we place skip pointers?
Placing skips
Simple heuristic: for a postings list of length L, use √L evenly spaced
skip pointers [Moffat and Zobel 1996].
This definitely used to help; with modern hardware it may not, unless
the index is memory-based [Bahle et al. 2002].
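One way to picture the idea is the sketch below, which places roughly √L evenly spaced skips on each list and follows a skip whenever its target does not overshoot the docID on the other list. This is an illustrative sketch under assumptions, not the slides' implementation: the names `build_skips` and `intersect_with_skips` and the dictionary representation of skips are invented here, and the skip positions it produces need not match those drawn in the figure.

```python
import math


def build_skips(postings):
    """Map a list index to the index its skip pointer targets.

    Assumption: skips are spaced isqrt(L) entries apart, giving about sqrt(L) skips.
    """
    n = len(postings)
    step = math.isqrt(n)
    if step < 2:
        return {}  # list too short for skips to be useful
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips where safe."""
    s1, s2 = build_skips(p1), build_skips(p2)
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                # everything before the skip target is smaller than p2[j]
                while i in s1 and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))  # [2, 8]
```

A skip is only taken when its target is less than or equal to the current docID on the other list, so no matching posting can be jumped over. More skips mean shorter jumps but more pointer comparisons, which is the tradeoff the heuristic tries to balance.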
1. Biword indexes
One approach to handling phrases is to consider every pair of
consecutive terms in a document as a phrase.
For example, the text Friends, Romans, Countrymen would
generate the biwords:
friends romans
romans countrymen
In this model, we treat each of these biwords as a vocabulary
term.
The concept of a biword index can be extended to longer
sequences of words, and if the index includes variable length
word sequences, it is generally referred to as a phrase index.
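As a small illustration of how biwords are generated, the sketch below pairs each token with its successor. The tokenization (lowercasing and keeping only runs of letters and digits) is an assumption made for this example, and `biwords` is a hypothetical helper name.

```python
import re


def biwords(text):
    """Return every pair of consecutive terms in text as a biword string."""
    # Assumed tokenization: lowercase, keep runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```

Each resulting string would then be indexed as an ordinary vocabulary term.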
2. Positional indexes
A biword index is not the standard solution. Rather, a positional
index is most commonly employed.
Here, for each term in the vocabulary, we store postings of the
form docID: (position1, position2, ...), e.g.
to, 993427:
(1, 6: (7, 18, 33, 72, 86, 231);
2, 5: (1, 17, 74, 222, 255);
4, 5: (8, 16, 190, 429, 433);
5, 2: (363, 367);
7, 3: (13, 23, 191); ...)
be, 178239:
(1, 2: (17, 25);
4, 5: (17, 191, 291, 430, 434);
5, 3: (14, 19, 101); ...)
The number after each term is its document frequency. Each entry then
gives a docID, the term's frequency in that document, and the positions
at which the term occurs.
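A positional index of this shape could be built along the following lines. This is a minimal sketch under assumptions: the two toy documents are invented for illustration, tokenization is a plain whitespace split, positions are counted from 1, and `build_positional_index` is a hypothetical name.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> {docID: [positions]} from a dict of docID -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase and split on whitespace
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


# Hypothetical toy collection, not taken from the slides
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
print(index["to"])  # {1: [1, 5], 2: [1, 4]}
print(index["be"])  # {1: [2, 6], 2: [5]}
```

The document frequency of a term is then `len(index[term])`, and its frequency within a document is the length of that document's position list.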
To process a phrase query, we still need to access the inverted
index entries for each distinct term.
As before, we would start with the least frequent term and
then work to further restrict the list of possible candidates.
to: (...; 4: (..., 429, 433); ...)
be: (...; 4: (..., 430, 434); ...)
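In the fragment above, document 4 is a candidate for the phrase "to be" because be occurs one position after to (429 then 430, 433 then 434). The sketch below shows one way such a check might be written, using the visible postings for to and be from the positional index example. The function name `phrase_docs` and the nested-dictionary layout are assumptions for illustration, and "least frequent" is approximated by the number of documents in this small sample.

```python
def phrase_docs(index, terms):
    """Return sorted docIDs in which terms occur at consecutive positions."""
    if not terms or any(t not in index for t in terms):
        return []
    # Start from the term that occurs in the fewest documents
    by_rarity = sorted(set(terms), key=lambda t: len(index[t]))
    candidates = set(index[by_rarity[0]])
    for t in by_rarity[1:]:
        candidates &= set(index[t])
    hits = []
    for doc in sorted(candidates):
        # Keep start positions of the first term that the later terms extend
        starts = set(index[terms[0]][doc])
        for offset, t in enumerate(terms[1:], start=1):
            positions = set(index[t][doc])
            starts = {s for s in starts if s + offset in positions}
        if starts:
            hits.append(doc)
    return hits


# Postings copied from the visible part of the example (elided entries omitted)
index = {
    "to": {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]},
    "be": {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(index, ["to", "be"]))  # [4]
```

Documents 1 and 5 contain both terms but never at adjacent positions, so only document 4 survives the positional check.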