
Indexing and Searching

Modern Information Retrieval


by R. Baeza-Yates and B. Ribeiro-Neto
Chapter 8

Outline
 Inverted Files
 Other Indices for Text
 Sequential Searching
 Pattern Matching
 Compression

Inverted Files
 An inverted file (or inverted index) is a
word-oriented mechanism for indexing a text
collection in order to speed up the searching task.
 Structure: vocabulary and occurrences
 Block addressing
 The text is divided into blocks, and the
occurrences point to the blocks
 Full inverted indices: exact occurrences
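
The difference between a full inverted index and block addressing can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the tokenizer, the use of character offsets as occurrences, and the `block_size` parameter are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Yield (word, character_offset) for each lowercase alphanumeric word."""
    for match in re.finditer(r"[a-z0-9]+", text.lower()):
        yield match.group(), match.start()


def build_full_index(text):
    """Full inverted index: word -> list of exact character offsets."""
    index = defaultdict(list)
    for word, offset in tokenize(text):
        index[word].append(offset)
    return dict(index)


def build_block_index(text, block_size):
    """Block addressing: word -> list of block numbers that contain it.

    block_size (in characters) is a hypothetical parameter for this sketch.
    """
    index = defaultdict(list)
    for word, offset in tokenize(text):
        block = offset // block_size
        # Store each block only once per word; this is where space is saved.
        if not index[word] or index[word][-1] != block:
            index[word].append(block)
    return dict(index)


text = "to be or not to be"
full = build_full_index(text)                   # exact positions
blocks = build_block_index(text, block_size=8)  # coarser block pointers
```

With block addressing the occurrence lists are shorter, but a match inside a block still has to be located by scanning that block.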

Inverted Files
 The search algorithm on an inverted index
 Vocabulary search

 Retrieval of occurrences

 Manipulation of occurrences

 Construction (split the index into two files)


 Posting file: the lists of occurrences are stored
contiguously
 The vocabulary is stored in lexicographical
order and points to its list.
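
The three search steps and the two-file layout above can be sketched as follows. This is an illustrative sketch under stated assumptions: the "files" are plain in-memory lists, occurrences are document identifiers, and the function names `build_files`, `lookup`, and `search_all` are hypothetical.

```python
from bisect import bisect_left


def build_files(index):
    """Split a word -> occurrences mapping into a vocabulary and a posting file.

    vocabulary: list of (word, start, length) in lexicographical order
    postings:   one flat list holding every occurrence list contiguously
    """
    vocabulary, postings = [], []
    for word in sorted(index):
        vocabulary.append((word, len(postings), len(index[word])))
        postings.extend(index[word])
    return vocabulary, postings


def lookup(vocabulary, postings, word):
    """Vocabulary search (binary search), then retrieval of occurrences."""
    pos = bisect_left(vocabulary, (word,))
    if pos < len(vocabulary) and vocabulary[pos][0] == word:
        _, start, length = vocabulary[pos]
        return postings[start:start + length]
    return []


def search_all(vocabulary, postings, words):
    """Manipulation of occurrences: keep documents containing every word."""
    lists = [set(lookup(vocabulary, postings, w)) for w in words]
    return sorted(set.intersection(*lists)) if lists else []


vocabulary, postings = build_files(
    {"search": [1, 3], "index": [1, 2, 3], "text": [2]}
)
hits = search_all(vocabulary, postings, ["search", "index"])
```

Intersection is only one possible manipulation; phrase and proximity queries would compare positions instead of document identifiers.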
Inverted Files
 For large texts
 Partial index

 Merging two indices consists of merging
the sorted vocabularies.
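
A merge of two partial indices might look like the sketch below. It assumes each partial index is a list of `(word, occurrences)` pairs sorted by word, and that every occurrence in the second index lies after those in the first, so the lists of a shared word can simply be concatenated. The function name `merge_indices` is hypothetical.

```python
def merge_indices(first, second):
    """Merge two partial indices given as sorted lists of (word, occurrences)."""
    merged = []
    i = j = 0
    while i < len(first) and j < len(second):
        word_a, occurrences_a = first[i]
        word_b, occurrences_b = second[j]
        if word_a < word_b:
            merged.append((word_a, list(occurrences_a)))
            i += 1
        elif word_b < word_a:
            merged.append((word_b, list(occurrences_b)))
            j += 1
        else:
            # Same word in both vocabularies: join the two occurrence lists.
            merged.append((word_a, list(occurrences_a) + list(occurrences_b)))
            i += 1
            j += 1
    # One of the inputs is exhausted; copy whatever remains of the other.
    merged.extend((w, list(o)) for w, o in first[i:])
    merged.extend((w, list(o)) for w, o in second[j:])
    return merged
```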

Other Indices for Text
 Suffix Trees
 Suffix Arrays
 Signature Files

Suffix Trees and Suffix Arrays
 Each position in the text is considered as a
text suffix
 Index points are selected from the text,
which point to the beginning of the text
positions which will be retrievable

Suffix arrays
 The main drawback of suffix arrays is their
costly construction process.
 They allow binary searches, performed by
comparing the text that each pointer refers to.
 Supra-indices (for large suffix arrays)
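
The binary search over pointers can be sketched as below. This is a simplified illustration: every character position is treated as an index point, the construction is the naive sort of all suffixes (which is exactly the costly step mentioned above), and supra-indices are left out. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions by suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted text positions where pattern starts.

    Two binary searches delimit the range of suffixes that begin with
    pattern; each comparison looks at the text a pointer refers to.
    """
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "abracadabra"
suffix_array = build_suffix_array(text)
positions = find_occurrences(text, suffix_array, "abra")
```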

Construction of Suffix Arrays for Large Texts

Signature Files
 Word-oriented index structures based on hashing
 Maps words to bit masks of B bits
 Divides the text into blocks of b words each
 The mask is obtained by bitwise ORing the
signatures of all the words in the text block.
 Hash the query to a bit mask W
 If W & Bi = W, the text block may contain the
word
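
The signature test W & Bi = W can be sketched as follows. The hash function (SHA-256), the mask width `B = 64`, and the number of bits set per word are illustrative assumptions, not values from the source. A block that passes the test is only a candidate, because unrelated words can set the same bits.

```python
import hashlib

B = 64             # signature width in bits (illustrative choice)
BITS_PER_WORD = 3  # bits set per word signature (illustrative choice)


def word_signature(word):
    """Map a word to a bit mask of B bits with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask


def build_signatures(words, b):
    """Split the word list into blocks of b words and OR their signatures."""
    signatures = []
    for start in range(0, len(words), b):
        mask = 0
        for word in words[start:start + b]:
            mask |= word_signature(word)
        signatures.append(mask)
    return signatures


def candidate_blocks(signatures, word):
    """Return the blocks whose mask contains every bit of the query mask W."""
    w = word_signature(word)
    return [i for i, block_mask in enumerate(signatures)
            if (w & block_mask) == w]


words = "the quick brown fox jumps over the lazy dog".split()
signatures = build_signatures(words, b=3)
candidates = candidate_blocks(signatures, "fox")
```

Each candidate block must still be scanned to confirm that the word is really there.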

Sequential Searching
 Brute Force
 Knuth-Morris-Pratt
 Boyer-Moore Family
 Shift-Or
 Suffix Automaton
 Backward DAWG matching (BDM)

 BNDM

Knuth-Morris-Pratt

Boyer-Moore Family

Shift-Or
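
As a rough sketch of the Shift-Or idea, the search state can be kept in a single integer in which bit i is 0 exactly when the first i+1 pattern characters match the text ending at the current position. The code below is an illustration using Python integers as bit vectors; the function name is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the start positions of exact matches of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)

    state = all_ones
    matches = []
    for pos, c in enumerate(text):
        # Shift in a 0 (a new match may start here), then OR in the mask
        # so that bits for mismatching pattern positions become 1.
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(pos - m + 1)
    return matches
```

On real hardware the technique is fastest when the pattern fits in one machine word; Python integers hide that limit.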

Suffix Automaton

Pattern Matching
 Searching allowing errors
 Dynamic Programming

 Automaton

 Regular Expressions and Extended patterns


 Pattern Matching Using Indices
 Inverted files

 Suffix Trees and Suffix Arrays

Dynamic Programming
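
Searching allowing errors by dynamic programming can be sketched with the classical column-by-column recurrence, assuming unit-cost insertions, deletions, and substitutions. The top row stays 0 so that a match may start at any text position. The function name and the choice to report end positions are assumptions of this sketch.

```python
def approximate_search(text, pattern, k):
    """Return the 0-based end positions in text where some substring
    matches pattern with edit distance at most k."""
    m = len(pattern)
    # column[i] = fewest errors matching pattern[:i] against a text
    # substring ending at the current position.
    column = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        previous_diagonal = column[0]  # always 0: matches start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new_value = min(previous_diagonal + cost,  # match / substitution
                            column[i] + 1,             # extra text character
                            column[i - 1] + 1)         # missing pattern character
            previous_diagonal = column[i]
            column[i] = new_value
        if column[m] <= k:
            ends.append(j)
    return ends


ends = approximate_search("surgery", "survey", 2)
```

The cost is O(mn) time for a pattern of length m and a text of length n, using only O(m) space.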

Automaton

Regular Expressions

Pattern Matching Using Indices
 Inverted Files
 Queries such as suffix or substring queries,
searching allowing errors, and regular
expressions are solved by a sequential
search over the vocabulary
 The restriction is that approximate matches
or regular expressions that span many words
are difficult to find.

Pattern Matching Using Indices
 Suffix Trees
 Suffix trees are able to perform complex
searches
 Word, prefix, suffix, substring, and range
queries
 Regular expressions

 Unrestricted approximate string matching

 Useful in specific areas

 Find the longest substring

 Find the most common substring of a fixed size

Pattern Matching Using Indices
 Suffix Arrays
 Some patterns can be searched directly in
the suffix array without simulating the
suffix tree
 Word, prefix, suffix, subword search and
range search

Compression
 Compressed text: Huffman coding
 Taking words as symbols

 Use an alphabet of bytes instead of bits

 Compressed indices
 Inverted Files

 Suffix Trees and Suffix Arrays

 Signature Files
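
Huffman coding with words as symbols can be sketched as below. Note that this sketch builds an ordinary binary (bit-oriented) code for simplicity; the byte-oriented variant mentioned above would use a tree of much higher degree so that every code is a whole number of bytes. Separators between words are ignored, and all function names are hypothetical.

```python
import heapq
import itertools
from collections import Counter


def huffman_codes(words):
    """Map each distinct word to a binary Huffman code string."""
    frequencies = Counter(words)
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    tie_breaker = itertools.count()  # keeps heap entries comparable
    heap = [(freq, next(tie_breaker), {word: ""})
            for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Join the two least frequent subtrees under a new parent node.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes_a.items()}
        merged.update({w: "1" + c for w, c in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next(tie_breaker), merged))
    return heap[0][2]


def encode(words, codes):
    """Concatenate the code of every word."""
    return "".join(codes[w] for w in words)


def decode(bits, codes):
    """Invert encode; this works because Huffman codes are prefix-free."""
    reverse = {code: word for word, code in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            words.append(reverse[current])
            current = ""
    return words


words = "a rose is a rose is a rose".split()
codes = huffman_codes(words)
round_trip = decode(encode(words, codes), codes)
```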

