
Tries

Inverted Index

 The basic method of Web search and traditional IR is to find documents
that contain the terms in the user query.

 Given a user query, one option is to scan the document database
sequentially to find the documents that contain the query terms.
However, this method is obviously impractical for a large collection,
such as the Web.

 Another option is to build some data structures (called indices) from the
document collection to speed up retrieval or search.

 There are many index schemes for text.

Inverted Index

 The inverted index is a popular scheme that has been shown to be
superior to most other indexing schemes. It is perhaps the most
important indexing method used in search engines.

 This indexing scheme not only allows efficient retrieval of
documents that contain query terms, but is also very fast to build.

 In its simplest form, the inverted index of a document collection is
basically a data structure that associates each distinct term with a
list of all documents that contain the term.
 Thus, in retrieval, it takes constant time to find the documents that
contain a query term.
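To make the simplest form concrete, here is a minimal Python sketch. The toy collection `docs`, the lowercase terms, and the function names `build_index`, `lookup`, and `lookup_all` are assumptions made for illustration only. A Python dictionary stands in for the index, so finding the list for a term is an expected constant-time hash lookup.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> list of already-tokenized terms.
docs = {
    1: ["web", "mining", "useful"],
    2: ["usage", "mining", "applications"],
    3: ["web", "structure", "mining", "studies", "web", "hyperlink", "structure"],
}

def build_index(docs):
    """Map each distinct term to a sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def lookup(index, term):
    """Return the inverted list of a term (empty if the term is unknown)."""
    return index.get(term, [])

def lookup_all(index, terms):
    """Return IDs of documents that contain every query term."""
    lists = [set(lookup(index, t)) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

index = build_index(docs)
print(lookup(index, "web"))                    # [1, 3]
print(lookup_all(index, ["web", "mining"]))    # [1, 3]
```

A multi-term query is answered here by intersecting the inverted lists of its terms, which is one simple choice among several.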
Inverted Index

 Depending on the needs of the retrieval or ranking algorithm, different
pieces of information may be included. For example, to support
phrase and proximity search, a posting for a term t_i usually consists
of the following:

<id_j, f_ij, [o_1, o_2, …, o_|f_ij|]>

where
     id_j is the ID of document d_j that contains the term t_i,
     f_ij is the frequency count of t_i in d_j, and
     o_k are the offsets (or positions) of term t_i in d_j.

 Postings of a term are sorted in increasing order of id_j, and
so are the offsets in each posting.
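The posting layout above can be sketched in Python as follows. This is only an illustration: the three tokenized documents, their offsets, and the names `build_positional_index` and `phrase_search` are assumptions, and the phrase search handles just two adjacent terms.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: document ID -> list of (term, offset) pairs.
    Returns term -> list of postings (doc_id, freq, offsets), sorted by doc_id."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for term, offset in tokens:
            positions[term][doc_id].append(offset)
    index = {}
    for term, by_doc in positions.items():
        index[term] = [
            (doc_id, len(offs), sorted(offs))
            for doc_id, offs in sorted(by_doc.items())
        ]
    return index

def phrase_search(index, first, second):
    """IDs of documents where `second` occurs at the offset right after `first`."""
    second_offsets = {doc_id: set(offs) for doc_id, _, offs in index.get(second, [])}
    hits = []
    for doc_id, _, offs in index.get(first, []):
        following = second_offsets.get(doc_id, set())
        if any(o + 1 in following for o in offs):
            hits.append(doc_id)
    return hits

# Hypothetical documents; gaps in the offsets mark removed stopwords.
docs = {
    1: [("web", 1), ("mining", 2), ("useful", 4)],
    2: [("usage", 1), ("mining", 2), ("applications", 3)],
    3: [("web", 1), ("structure", 2), ("mining", 3), ("studies", 4),
        ("web", 6), ("hyperlink", 7), ("structure", 8)],
}
index = build_positional_index(docs)
print(index["web"])                             # [(1, 1, [1]), (3, 2, [1, 6])]
print(phrase_search(index, "web", "mining"))    # [1]
```

Because the offsets are stored in each posting, the phrase check never has to reread the documents themselves. A proximity search could be written the same way by allowing a larger gap between offsets.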
Inverted Index :: Example

 The numbers below each document are the offset positions of
each word.

 The vocabulary is the set:

{Web, mining, useful, applications, usage, structure, studies, hyperlink}

 Stopwords “is” and “the” have been removed, but no stemming
is applied.
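The preprocessing described here can be sketched as a small tokenizer. Several details are assumptions rather than part of the example: offsets are taken to be 1-based word positions counted before stopwords are dropped, terms are lowercased, and the function name `tokenize` and the sample sentence are hypothetical.

```python
import re

STOPWORDS = {"is", "the"}  # the two stopwords named in the text

def tokenize(text):
    """Return (term, offset) pairs for the non-stopword words of `text`.
    Offsets are 1-based word positions counted before stopword removal
    (an assumption). Terms are lowercased (also an assumption); no stemming."""
    words = re.findall(r"[A-Za-z]+", text)
    return [
        (word.lower(), position)
        for position, word in enumerate(words, start=1)
        if word.lower() not in STOPWORDS
    ]

print(tokenize("Web mining is useful"))
# [('web', 1), ('mining', 2), ('useful', 4)]
```

Counting offsets before removal keeps the original word distances, so a removed stopword still leaves a gap between its neighbors.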
Inverted Index :: Example

 Fig. (A) is a simple version, where each term is associated only with an
inverted list of the IDs of the documents that contain the term.
 Each inverted list in Fig. (B) is more complex, as it contains additional
information, i.e., the frequency count of the term and its positions in
each document.
Index Construction
 Let us build an inverted index for the three documents in the previous
example.
 To build the index efficiently, the trie that stores the terms is usually
kept in memory. However, in the context of the Web, the whole index will
not fit in main memory.

 Instead of using a trie, an alternative method is to use an in-memory
hash table (or other data structures) for terms.
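The two in-memory options can be compared with a short Python sketch. It assumes that each term's inverted list hangs off the trie node where the term ends, that documents arrive in increasing ID order, and that only document IDs are stored. The class names `TermTrie` and `TermHashTable`, the helper `build`, and the toy documents are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.postings = None  # list of document IDs if a term ends here

class TermTrie:
    """Term dictionary stored as a character trie."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, term, doc_id):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        if node.postings is None:
            node.postings = []
        # IDs arrive in increasing order, so checking the last entry
        # is enough to avoid duplicates.
        if not node.postings or node.postings[-1] != doc_id:
            node.postings.append(doc_id)

    def find(self, term):
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.postings or []

class TermHashTable:
    """Term dictionary stored as a hash table (a Python dict)."""
    def __init__(self):
        self.table = {}

    def add(self, term, doc_id):
        postings = self.table.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)

    def find(self, term):
        return self.table.get(term, [])

def build(docs, dictionary):
    """Feed every (term, doc_id) pair to the dictionary in increasing ID order."""
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            dictionary.add(term, doc_id)
    return dictionary

# Hypothetical tokenized documents.
docs = {
    1: ["web", "mining", "useful"],
    2: ["usage", "mining", "applications"],
    3: ["web", "structure", "mining", "studies", "web", "hyperlink", "structure"],
}
trie = build(docs, TermTrie())
table = build(docs, TermHashTable())
print(trie.find("web"), table.find("web"))    # [1, 3] [1, 3]
```

Both structures expose the same `add` and `find` operations, so the construction loop in `build` does not change when one is swapped for the other. The sketch does not address the case where the index outgrows main memory.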
