Skip Pointers and Phrase Queries

Introduction to Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
Recall basic merge

• Walk through the two postings simultaneously, in
time linear in the total number of postings entries

Brutus: 2 4 8 41 48 64 128
Caesar: 1 2 3 8 11 17 21 31
Intersection: 2 8

If the list lengths are m and n, the merge takes O(m+n) operations.
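
The linear merge can be sketched as follows. This is a minimal illustration, assuming each postings list is a sorted Python list of docIDs; the function name `intersect` is chosen here for the example and is not taken from the slides.

```python
def intersect(p1, p2):
    """Intersect two sorted lists of docIDs in O(m + n) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it, advance both pointers
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer sitting on the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is where the O(m+n) bound comes from.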
Can we do better?
Yes (if the index isn’t changing too fast).
Augment postings with skip pointers (at indexing time)

Upper list: 2 4 8 41 48 64 128 (skip pointers to 41 and 128)
Lower list: 1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)
• Why?
• To skip postings that will not figure in the search results.
• How?
• Where do we place skip pointers?
Query processing with skip pointers

Upper list: 2 4 8 41 48 64 128 (skip pointers to 41 and 128)
Lower list: 1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)
Suppose we’ve stepped through the lists until we process 8 on
each list. We match it and advance.
We then have 41 on the upper list and 11 on the lower. 11 is smaller.
But the skip successor of 11 on the lower list is 31, which is still
smaller than 41, so we can skip ahead past the intervening postings.
Where do we place skips?

• Tradeoff:
• More skips → shorter skip spans → more likely to skip.
But lots of comparisons to skip pointers.
• Fewer skips → few pointer comparisons, but then long skip
spans → few successful skips.
Postings lists intersection with skip pointers
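
A minimal sketch of intersection with skip pointers, under stated assumptions: each postings list is a sorted Python list of docIDs, and its skip pointers are a dict mapping a list index to the index the pointer targets. This representation and the name `intersect_with_skips` are illustrative choices, not the slides' own. The example assumes the skips start at the first entry of each list (2 → 41 → 128 and 1 → 11 → 31); the text above confirms only the 11 → 31 pointer explicitly.

```python
def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect two sorted docID lists, following skip pointers when safe.

    skips1 / skips2 map a list index to the index its skip pointer targets
    (an assumed representation for this sketch).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                # The skip target does not overshoot p2[j]: follow skips
                # for as long as that stays true.
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer


upper = [2, 4, 8, 41, 48, 64, 128]
skips_upper = {0: 3, 3: 6}            # 2 -> 41, 41 -> 128
lower = [1, 2, 3, 8, 11, 17, 21, 31]
skips_lower = {0: 4, 4: 7}            # 1 -> 11, 11 -> 31
print(intersect_with_skips(upper, skips_upper, lower, skips_lower))  # [2, 8]
```

After matching 8, the pointers sit on 41 and 11. Because the skip target 31 is not larger than 41, the lower pointer jumps straight to 31 and never visits 17 and 21.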

Placing skips
• Simple heuristic: for postings of length L, use √L evenly-spaced skip
pointers [Moffat and Zobel 1996]
• This ignores the distribution of query terms.
• Easy if the index is relatively static; harder if L keeps changing because of
updates.
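• This definitely used to help; with modern hardware it may not unless
you’re memory-based, because the I/O cost of loading a larger postings
list can outweigh the gains from faster in-memory merging [Bahle et al. 2002]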
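
The square-root heuristic can be sketched like this. It is only an illustration: the function name `build_skips` and the index-to-index dict representation are assumptions made for the example, and rounding the spacing down with `math.isqrt` is one of several reasonable choices.

```python
import math


def build_skips(postings):
    """Return skip pointers as {source_index: target_index}.

    Places a pointer every floor(sqrt(L)) entries, which gives roughly
    sqrt(L) evenly spaced skips for a list of length L.
    """
    n = len(postings)
    step = math.isqrt(n)
    if step < 2:
        return {}  # list too short for skips to be useful
    return {i: i + step for i in range(0, n - step, step)}


print(build_skips(list(range(16))))  # {0: 4, 4: 8, 8: 12}
```

Because the spacing depends on L, the pointers have to be rebuilt when the list grows, which is why the heuristic suits a relatively static index.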
Positional postings and phrase queries

• Many complex or technical concepts and many organization
and product names are multiword compounds or phrases.
• Most recent search engines support a double quotes syntax
(“stanford university”) for phrase queries.
• As many as 10% of web queries are phrase queries, and many
more are implicit phrase queries (such as person names),
entered without use of double quotes.
1. Biword indexes
• One approach to handling phrases is to consider every pair of
consecutive terms in a document as a phrase.
For example, the text Friends, Romans, Countrymen would
generate the biwords:
friends romans
romans countrymen
• In this model, we treat each of these biwords as a vocabulary
term.
• The concept of a biword index can be extended to longer
sequences of words, and if the index includes variable length
word sequences, it is generally referred to as a phrase index.
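
Generating biwords from a text can be sketched as below. The tokenization (lowercasing and splitting on non-word characters) is an assumption made for this example; the slides do not specify one.

```python
import re


def biwords(text):
    """Return every pair of consecutive tokens as a single biword term."""
    tokens = re.findall(r"\w+", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```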
2. Positional indexes
• A biword index is not the standard solution. Rather, a positional
index is most commonly employed.
• Here, for each term in the vocabulary, we store postings of the
form docID: ⟨position1, position2, ...⟩, e.g.
to, 993427:
(1, 6: (7, 18, 33, 72, 86, 231);
2, 5: (1, 17, 74, 222, 255);
4, 5: (8, 16, 190, 429, 433);
5, 2: (363, 367);
7, 3: (13, 23, 191); ...)
be, 178239:
(1, 2: (17, 25);
4, 5: (17, 191, 291, 430, 434);
5, 3: (14, 19, 101); ...)
Each entry reads term, document frequency: (docID, term frequency: (positions); ...).
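
One possible in-memory layout for such an index is a nested mapping from term to docID to a list of positions. The sketch below builds it from a tiny hypothetical collection; the two documents, the tokenizer, and the name `build_positional_index` are all invented for illustration.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """docs: {docID: text}. Returns {term: {docID: [positions]}}.

    Positions are token offsets within the document, starting at 1.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"\w+", text.lower())
        for pos, token in enumerate(tokens, start=1):
            index[token].setdefault(doc_id, []).append(pos)
    return index


# Hypothetical toy documents, not taken from the slides.
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
print(index["to"])  # {1: [1, 5], 2: [1, 4]}
print(index["be"])  # {1: [2, 6], 2: [5]}
```

In this layout the document frequency of a term is `len(index[term])` and its term frequency in a document is `len(index[term][doc_id])`.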
2. Positional indexes
• To process a phrase query, we still need to access the inverted
index entries for each distinct term.
• As before, we would start with the least frequent term and
then work to further restrict the list of possible candidates.
• In the merge operation, the same general technique is used as
before, but rather than simply checking that both terms are in
a document, we also need to check that their positions of
appearance in the document are compatible with the phrase
query being evaluated.
Example: Satisfying phrase queries

Suppose the postings lists for ‘to’ and ‘be’ are as on the previous slide, and the query is “to
be or not to be”. The postings lists to access are: to, be, or, not. We will examine
intersecting the postings lists for ‘to’ and ‘be’. We first look for documents that
contain both terms. Then, we look for places in the lists where there is an
occurrence of ‘be’ with a token index one higher than a position of ‘to’, and then we
look for another occurrence of each word with token index 4 higher than the first
occurrence. In the above lists, the pattern of occurrences that is a possible match is:
to: (...; 4: (..., 429, 433); ...)
be: (...; 4: (..., 430, 434); ...)
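
The position check in this example can be sketched as follows. The sketch makes several assumptions: postings are held as `{docID: [positions]}` dicts containing only the entries shown on the slide (the real lists are truncated there), the pattern is given as (term, offset) pairs with the first offset equal to 0, and positions are tested with set membership rather than a linear merge of the position lists. The function name `match_offsets` is hypothetical.

```python
def match_offsets(postings, pattern):
    """Find documents where every (term, offset) pair in `pattern` lines up.

    postings: {term: {docID: [positions]}}
    pattern:  list of (term, offset); the first offset is assumed to be 0.
    Returns {docID: [start positions of each match]}.
    """
    terms = {term for term, _ in pattern}
    # Step 1: keep only documents that contain every term.
    docs = set.intersection(*(set(postings[t]) for t in terms))
    first_term = pattern[0][0]
    result = {}
    for doc in sorted(docs):
        positions = {t: set(postings[t][doc]) for t in terms}
        # Step 2: a start position p matches if each term occurs at p + offset.
        starts = [p for p in postings[first_term][doc]
                  if all(p + off in positions[t] for t, off in pattern)]
        if starts:
            result[doc] = starts
    return result


# Only the postings entries visible on the slide.
postings = {
    "to": {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]},
    "be": {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

# "to be": 'be' one position after 'to'.
print(match_offsets(postings, [("to", 0), ("be", 1)]))
# {4: [16, 190, 429, 433]}

# 'to' and 'be' slots of "to be or not to be": offsets 0, 1, 4, 5.
print(match_offsets(postings, [("to", 0), ("be", 1), ("to", 4), ("be", 5)]))
# {4: [429]}
```

The second call reproduces the candidate named above: document 4, with ‘to’ at 429 and 433 and ‘be’ at 430 and 434. A full evaluation would still need the postings for ‘or’ and ‘not’ at offsets 2 and 3.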
