
Efficient Search in Large Textual Collections with Redundancy

Jiangong Zhang and Torsten Suel
In the Proceedings of the Sixteenth
International World Wide Web Conference
(WWW2007) Banff, CANADA, May 2007

Presented By

14/7/2009 1
 Introduction

 Technical Background

 New Approach

 Experimental Results

 Conclusion

Introduction
 Search engines search only the most recent version
of a web page

 Could full-text search over large web archives be


 Challenge: Achieving a reasonable index size when there are many
similar pages (e.g., different versions of the same page)
 Proposes a new and general framework to

efficiently index and search large web page
collections with redundancy
Technical Background
 Inverted Index
 Consists of a set of inverted lists
 Each list is a sequence of postings
 Posting: (docID, f, p0, …), where f is the frequency of a term
in the document and pi is the position of its i-th occurrence
 Query processing needs to traverse inverted lists

 Ranked Queries
 Consists of a set of terms
 Ranking Function assigns a score to each page
E.g., cosine measure
 Document-at-a-time Query Processing (DAAT)
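
The inverted index and document-at-a-time traversal described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `build_index` and `daat_and` are hypothetical, tokenization is plain whitespace splitting, and the score is a simple sum of term frequencies standing in for a real ranking function such as the cosine measure.

```python
from collections import defaultdict

def build_index(docs):
    """Build term -> list of postings (docID, f, positions), sorted by docID."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return index

def daat_and(index, terms):
    """Document-at-a-time: walk all lists in parallel, keep docs with every term."""
    lists = [index.get(t, []) for t in terms]
    if not lists or any(not lst for lst in lists):
        return []
    cursors = [0] * len(lists)
    results = []
    while all(c < len(lst) for c, lst in zip(cursors, lists)):
        current = [lst[c][0] for c, lst in zip(cursors, lists)]
        target = max(current)
        if all(d == target for d in current):
            # Placeholder score (assumption): sum of term frequencies.
            score = sum(lst[c][1] for c, lst in zip(cursors, lists))
            results.append((target, score))
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the lists that are behind the largest docID.
            cursors = [c + 1 if d < target else c
                       for c, d in zip(cursors, current)]
    return results

docs = {1: "web search engine", 2: "web archive search web", 3: "archive"}
index = build_index(docs)
print(daat_and(index, ["web", "archive"]))
```

Because every list is sorted by docID, each list is scanned once from left to right, and a document is scored as soon as all cursors agree on it.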

Technical Background Cont…

 Index Updates
• Search engine updates
 Old pages are replaced by new, updated versions
 Pages that no longer exist are deleted
 New pages are inserted

Fig.1 Search engine

Technical Background Cont…

 Update of Archival Search

 New and changed pages are inserted
 No deletion of pages
 Multiple versions of one page

Fig.2 Archival search update

New Approach
 Content-dependent partition: Winnowing[2]

 Sharing policies: Local Sharing and Global Sharing


 Modified query processing

 Efficient updates

Content-dependent partition
 Goal: Partition a page into a set of fragments

 Two similar pages will have many fragments in common


 Fragments are identified by a fragID

 Index fragments instead of complete documents

 Winnowing Algorithm

Content-dependent partition

 Winnowing Algorithm
 Uses two hash functions

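Text:         R  A  K  A  B  A  B  A  B  F  H  M  A  C  …
Hash values:  23 17 45 13 48 13 48 87 19 71 22 19 23 …

(A window of size b slides over the text and yields one hash value per
position; a window of size w slides over the hash values and places the
cuts between Block 1, Block 2, and Block 3.)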

Fig.3 winnowing on a file with b=3 and

Content-dependent partition

 Winnowing Algorithm
1. Choose a hash function to map substrings of some
fixed small size to integer values
2. Choose a larger window size and slide this window
over the hash array. Use the following rules to
partition the file
 If the current hash value is strictly smaller
than all other values in the window, cut directly
before it
 If several positions in the current window share the
same minimum value: if a cut was previously made
directly before one of these positions, no cut is
applied in this step. Otherwise, cut before the
rightmost such position
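
The two steps above can be sketched as follows. This is a hedged reading of the rules rather than the authors' code: the names `substring_hashes`, `winnow_cuts`, and `partition` are hypothetical, `zlib.crc32` is an arbitrary stand-in for the hash function, and "the current hash value" is interpreted as the minimum of the current window. A cut before hash position i is taken to mean a cut before character i of the text.

```python
import zlib

def substring_hashes(text, b):
    """Step 1: hash every substring of length b (one value per start position)."""
    return [zlib.crc32(text[i:i + b].encode("utf-8"))
            for i in range(len(text) - b + 1)]

def winnow_cuts(hashes, w):
    """Step 2: slide a window of size w over the hashes and collect cut positions."""
    cuts = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        lowest = min(window)
        minima = [start + i for i, h in enumerate(window) if h == lowest]
        if any(p in cuts for p in minima):
            continue              # already cut before one of the minima
        cuts.add(minima[-1])      # unique minimum, or rightmost of several
    return sorted(cuts)

def partition(text, b, w):
    """Split text into fragments at the winnowing cut positions."""
    cuts = winnow_cuts(substring_hashes(text, b), w)
    bounds = [0] + [c for c in cuts if c > 0] + [len(text)]
    return [text[s:e] for s, e in zip(bounds, bounds[1:])]

# Example hash values (illustrative) with a window of w = 5.
print(winnow_cuts([23, 17, 45, 13, 48, 13, 48, 87, 19, 71, 22, 19, 23], 5))
```

Since each cut depends only on the hash values inside a small window, an edit in one part of a page leaves the cuts in distant, unchanged parts where they were, which is why similar pages tend to share fragments.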
Sharing policies
 Local sharing
 Avoid re-indexing of a fragment if it has previously
occurred in a version of the same page
 Number of fragments that need to be indexed: 13

Fig.4 Local sharing

Sharing policies cont…

 Global sharing
 If a fragment has previously occurred in any
other page, it is not indexed again
 Fragments indexed: 4 + 5 + 2 = 11 (out of 18 in total)
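
The difference between the two policies can be illustrated with a small counting sketch. The function name `count_indexed` and the toy data are hypothetical and do not reproduce the figures' example; fragments are compared by their content, and versions are assumed to arrive in crawl order.

```python
def count_indexed(versions, policy):
    """Count fragments that must be indexed under 'local' or 'global' sharing.

    versions: list of (page_id, [fragment_content, ...]) in crawl order.
    """
    seen = {}        # local: one set per page; global: a single shared set
    indexed = 0
    for page_id, fragments in versions:
        key = page_id if policy == "local" else None
        known = seen.setdefault(key, set())
        for frag in fragments:
            if frag not in known:
                known.add(frag)
                indexed += 1
    return indexed

# Toy data: two versions of page A and one version of page B.
versions = [
    ("A", ["a", "b", "c"]),
    ("A", ["a", "b", "d"]),
    ("B", ["a", "x"]),
]
print(count_indexed(versions, "local"), count_indexed(versions, "global"))
```

In this toy data, fragment "a" of page B counts as new under local sharing because it has not appeared in an earlier version of B, while global sharing skips it because page A already contributed it.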

Data Structures

Fig.6 Standard data structures

 Inverted Index: Consists of inverted lists
sorted by docID.
 Dictionary: Stores a pointer to the start of
the inverted list for each term, plus other
statistical information.
 Page Table: Stores the complete URL, length of
document, PageRank, and other useful
information of the document.
Data Structures cont…

Fig.7 Additional data structures

 Doc/Version Table: stores information about a page and its
various versions.
 Hash Table: stores a hash value of the content of each distinct
fragment.
 Reuse Table: stores information about a fragment, such as the
other pages in which the fragment occurs.
Modified Query Processing
 Local Sharing Query Processing
 Identify pages that contain all query words.
 Check whether any version of each candidate page contains all
query words. Compute the actual score for a page or version.

 Global Sharing Query Processing

 Identify pages that contain all query words
 Uses the Reuse Table and the Doc/Version Table. Compute the
actual score for a page or version.
Both require additional computation and memory
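
The two phases of local sharing query processing can be sketched as below. This is an illustrative assumption about the data layout, not the paper's implementation: `postings` maps a term to the set of fragIDs containing it, a fragID is a pair whose first component is the page's docID, and `versions` maps a docID to a list of versions, each given as the set of fragIDs it is built from. Scoring is omitted.

```python
def local_sharing_query(terms, postings, versions):
    """Return (docID, version number) pairs whose version contains all terms."""
    # Phase 1: pages in which every query term occurs in some fragment.
    pages_per_term = [{frag[0] for frag in postings.get(t, ())} for t in terms]
    candidates = set.intersection(*pages_per_term) if pages_per_term else set()

    # Phase 2: keep only versions that contain a fragment for every term.
    matches = []
    for doc_id in sorted(candidates):
        for number, frags in enumerate(versions[doc_id]):
            if all(any(f in frags for f in postings[t]) for t in terms):
                matches.append((doc_id, number))
    return matches

postings = {
    "web": {("p1", 0)},
    "search": {("p1", 1), ("p2", 0)},
    "archive": {("p1", 2)},
}
versions = {
    "p1": [{("p1", 0), ("p1", 1)}, {("p1", 0), ("p1", 2)}],
    "p2": [{("p2", 0)}],
}
print(local_sharing_query(["web", "search"], postings, versions))
print(local_sharing_query(["search", "archive"], postings, versions))
```

The second query shows why the version check is needed: page p1 contains both words across its history, but never in the same version, so it passes the first phase and is rejected in the second.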
Efficient Updates

Fig.8 Index Updates

 Partition into fragments.

 Hash the content of each fragment.

Efficient Updates cont…

 Index a fragment only if it is not already in the Hash Table.

 Add a posting for each term in the fragment to the
inverted index.
 Posting: (fragID, f, p0, …), where f is the frequency of a
term in the fragment and pi is the position of its i-th
occurrence in the fragment
 FragID: (docID of the fragment's primary page, fragment …
 Update or insert the appropriate records in the
various tables.
 All new postings are first inserted in a main-memory
structure and later periodically merged into disk-based
structures
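
The update steps can be put together in a small sketch. Everything here is an assumption made for illustration: the class name `FragmentIndex`, the use of SHA-1 as the content hash, plain dictionaries as stand-ins for the tables and the disk-based structures, and a per-page sequence number as the second component of the fragID (the slide text for that component is incomplete). Fragments are passed in already partitioned, each as a list of terms, and sharing is global.

```python
import hashlib
from collections import defaultdict

class FragmentIndex:
    def __init__(self):
        self.hash_table = {}                  # content hash -> fragID
        self.reuse_table = defaultdict(set)   # fragID -> docIDs using it
        self.versions = defaultdict(list)     # docID -> list of fragID lists
        self.memory = defaultdict(list)       # term -> new postings (buffer)
        self.disk = defaultdict(list)         # stand-in for disk-based lists
        self.next_number = defaultdict(int)   # docID -> next fragment number

    def add_version(self, doc_id, fragments):
        """Insert one new version of a page, given as a list of term lists."""
        frag_ids = []
        for words in fragments:
            digest = hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()
            frag_id = self.hash_table.get(digest)
            if frag_id is None:               # unseen content: index it
                frag_id = (doc_id, self.next_number[doc_id])
                self.next_number[doc_id] += 1
                self.hash_table[digest] = frag_id
                positions = defaultdict(list)
                for pos, term in enumerate(words):
                    positions[term].append(pos)
                for term, plist in positions.items():
                    self.memory[term].append((frag_id, len(plist), plist))
            self.reuse_table[frag_id].add(doc_id)
            frag_ids.append(frag_id)
        self.versions[doc_id].append(frag_ids)

    def merge(self):
        """Periodically move buffered postings into the main lists."""
        for term, new in self.memory.items():
            self.disk[term] = sorted(self.disk[term] + new, key=lambda p: p[0])
        self.memory.clear()

idx = FragmentIndex()
idx.add_version("p1", [["a", "b"], ["c"]])
idx.add_version("p1", [["a", "b"], ["d"]])   # only ["d"] is new
idx.add_version("p2", [["c"]])               # reuses a fragment of p1
idx.merge()
print(idx.disk["a"], dict(idx.versions))
```

Only fragments with unseen content produce postings, so inserting a new version of a mostly unchanged page touches a small part of the index.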
Experimental Evaluation

Fig.9 Cumulative percentage of unique fragments versus week of crawl

 The experiments used a data set from Stanford WebBase: a total of
6,356,374 versions of pages from 2,528,362 distinct URLs.
 The number of fragments is reduced when duplicate
fragments are eliminated under the local sharing policy.

Experimental Evaluation cont…

Fig.10 Comparison of number of fragments under different policies.

Fig.11 Relative reduction in the number of fragments and positions.

 Global sharing performs better than local sharing in terms of size.

Conclusion
 Provides a new framework for indexing and query
processing on textual collections with significant
amounts of redundancy.

 Results in significant reductions in index size and query

processing cost.

 Supports highly efficient updates.

 It can be used for applications such as desktop search

and indexing of versioning file systems that retain old
versions of all files.

References
[1] J. Zhang and T. Suel. Efficient search in large textual collections
with redundancy. Proceedings of the Sixteenth International World Wide Web
Conference, pages 412-420, 2007.

[2] S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: Local algorithms
for document fingerprinting. Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data, pages 76-85, 2003.

[3] L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. Dynamic
maintenance of web indexes using landmarks. Proceedings of the 12th Int.
World Wide Web Conference, pages 102-111, 2003.

[4] F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of
inverted indexes for fast query evaluation. Proceedings of the 25th Annual
SIGIR Conference on Research and Development in Information Retrieval,
pages 222-229, 2002.

[5] V. Anh and A. Moffat. Index compression using fixed binary codewords.
Proceedings of the 15th Int. Australasian Database Conference,
pages 61-67, 2004.
Thank You
