
Efficient Search in Large Textual Collections with Redundancy
Jiangong Zhang and Torsten Suel
In the Proceedings of the Sixteenth
International World Wide Web Conference
(WWW2007) Banff, CANADA, May 2007

Presented By

14/7/2009 1
Outline
 Introduction

 Technical Background

 New Approach

 Experimental Results

 Conclusion

Introduction
 Search engines search only the most recent version
of a web page

 Could full-text search over large web archives be
supported?

 Challenge: Achieving a reasonable index size when
there are many similar pages (e.g., different versions
of the same page)

 The paper proposes a new and general framework to
efficiently index and search large web page
collections with redundancy
Technical Background
 Inverted Index
 Consists of a set of inverted lists
 Each list is a sequence of postings
 Posting: (docID, f, p_0, ..., p_{f-1})
f: frequency of the term in the document, p_i: positions of the term in the document
 Query processing needs to traverse inverted lists

 Ranked Queries
 Consists of a set of terms
 Ranking Function assigns a score to each page
e.g., the cosine measure
 Document-at-a-time Query Processing (DAAT)
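
The posting layout and DAAT traversal above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `build_index` and `daat_and` are hypothetical, tokenization is plain whitespace splitting, and the score is a placeholder (sum of term frequencies) rather than the cosine measure.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict docID -> text. Returns term -> list of postings
    (docID, f, positions), each list sorted by docID."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return dict(index)

def daat_and(index, terms):
    """Document-at-a-time AND query: one cursor per inverted list,
    all cursors advance together in docID order."""
    lists = [index.get(t, []) for t in terms]
    if not lists or any(not lst for lst in lists):
        return []
    cursors = [0] * len(lists)
    results = []
    while all(c < len(lst) for c, lst in zip(cursors, lists)):
        current = [lst[c][0] for c, lst in zip(cursors, lists)]
        target = max(current)
        if all(d == target for d in current):
            # Placeholder score (assumption): sum of term frequencies.
            score = sum(lst[c][1] for c, lst in zip(cursors, lists))
            results.append((target, score))
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the cursors that are behind the largest docID.
            cursors = [c + 1 if d < target else c
                       for c, d in zip(cursors, current)]
    return results

docs = {1: "web archive search", 2: "search engine index", 3: "web search engine"}
index = build_index(docs)
print(daat_and(index, ["web", "search"]))
```

Because every list is sorted by docID, each list is scanned at most once per query.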

Technical Background Cont…

 Index Updates
• Search engine updates
 Old pages are replaced by their updated
versions
 Pages that no longer exist are deleted
 New pages are inserted

Fig.1 Search engine update
Technical Background Cont…

 Update of Archival Search


 New and changed pages are inserted
 No deletion of pages
 Multiple versions of one page
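
The difference between the two update models can be shown with a small sketch. The function names and the dictionary layouts (URL to content, URL to list of versions) are assumptions made for illustration only.

```python
def search_engine_update(collection, crawl):
    """Mirror the latest crawl: replace changed pages, insert new
    pages, and delete pages that no longer exist."""
    for url in list(collection):
        if url not in crawl:
            del collection[url]
    collection.update(crawl)
    return collection

def archival_update(archive, crawl):
    """Append a version for every new or changed page; nothing is
    ever deleted, so a URL can accumulate many versions."""
    for url, content in crawl.items():
        versions = archive.setdefault(url, [])
        if not versions or versions[-1] != content:
            versions.append(content)
    return archive

crawl = {"a": "v2", "c": "v1"}
print(search_engine_update({"a": "v1", "b": "v1"}, crawl))
print(archival_update({"a": ["v1"], "b": ["v1"]}, crawl))
```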

Fig.2 Archival search update


New Approach
 Content-dependent partition: Winnowing[2]

 Sharing policies: Local Sharing and Global Sharing

 Modified query processing

 Efficient updates

Content-dependent partition
 Goal: Partition a page into a set of fragments

 Two similar pages will have many fragments in common

 Fragments are identified by a fragID

 Index fragments instead of complete documents

 Winnowing Algorithm

Content-dependent partition cont…

 Winnowing Algorithm
 Uses two hash functions

Fig.3 Winnowing on a file with b=3 and w=5
File:    R  A  K  A  B  A  B  A  B  F  H  M  A  C  …
Hashes:  23 17 45 13 48 13 48 87 19 71 22 19 23 …
(each hash covers a window of size b; a window of size w slides over
the hashes and cuts the file into Block 1, Block 2, Block 3)
Content-dependent partition cont…

 Winnowing Algorithm
1. Choose a hash function to map substrings of some
fixed small size to integer values
2. Choose a larger window size and slide this window
over the hash array. Use the following rules to
partition the file
 If a hash value in the current window is strictly smaller
than all other values in the window, cut directly
before it
 If several positions in the current window share the
same minimum value: if a cut was previously made
directly before one of these positions, no cut is
applied in this step; otherwise, cut before the
rightmost such position
Sharing policies
 Local sharing
 Avoid re-indexing of a fragment if it has previously
occurred in a version of the same page
 Number of fragments that need to be indexed: 13

Fig.4 Local sharing


Sharing policies cont…

 Global sharing
 If a fragment has previously occurred in any
other page, it is not indexed again
 Fragments indexed: 4 + 5 + 2 = 11 fragments
(out of 18 in total)
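
The effect of the two policies on the number of indexed fragments can be illustrated with a toy collection. The data below is made up for the example and does not reproduce the counts on the slides; `count_indexed` is a hypothetical helper.

```python
def count_indexed(pages, policy):
    """pages: dict page_id -> list of versions, each version a list of
    fragment contents. policy: "local" or "global".
    Returns how many fragments actually get indexed."""
    seen_global = set()
    indexed = 0
    for versions in pages.values():
        seen_local = set()   # fragments already seen in this page only
        seen = seen_local if policy == "local" else seen_global
        for version in versions:
            for fragment in version:
                if fragment not in seen:
                    seen.add(fragment)
                    indexed += 1
    return indexed

pages = {
    "A": [["a", "b", "c"], ["a", "b", "d"]],
    "B": [["a", "x"], ["a", "x", "d"]],
}
print(count_indexed(pages, "local"))    # 7
print(count_indexed(pages, "global"))   # 5
```

Of the 11 fragment occurrences in this toy collection, local sharing indexes 7 and global sharing indexes 5, since fragments "a" and "d" are shared between the two pages.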

Data Structures

Fig.6 Standard data structures
 Inverted Index: Consists of inverted lists
sorted by docID.
 Dictionary: Stores a pointer to the start of
the inverted list for each term, plus other
statistical information.
 Page Table: Stores the complete URL, length of the
document, PageRank, and other useful
information about the document.
Data Structures cont…

Fig.7 Additional data structures
 Doc/Version Table: stores information about a page and its
various versions.
 Hash Table: stores a hash value of the content of each distinct
fragment.
 Reuse Table: stores information about a fragment such as in
which other pages the fragment occurs.
Modified Query Processing
 Local Sharing Query Processing
 Identify pages that contain all query words.
 Check whether any version of each candidate page contains
all query words, then compute the actual score for the page
or version.

 Global Sharing Query Processing


 Identify pages that contain all query words
 Use the Reuse Table and the Doc/Version Table, then compute
the actual score for a page or version.
Both policies require additional computation and memory.
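
The two-step check under local sharing can be sketched as below. The table layouts are assumptions made for this example: `frag_index` maps a term to the set of fragIDs (docID, fragment number) containing it, and `version_table` maps a docID to its versions and their fragment numbers. Scoring is omitted.

```python
def local_sharing_query(frag_index, version_table, terms):
    """Return (docID, version) pairs whose fragments contain all terms."""
    # Step 1: candidate pages, where every term occurs in some fragment.
    per_term_docs = [{doc for doc, _ in frag_index.get(t, set())}
                     for t in terms]
    candidates = set.intersection(*per_term_docs) if per_term_docs else set()
    # Step 2: keep only versions whose own fragments cover all terms.
    hits = []
    for doc in sorted(candidates):
        for version, frags in sorted(version_table[doc].items()):
            if all(any((doc, f) in frag_index[t] for f in frags)
                   for t in terms):
                hits.append((doc, version))
    return hits

# Page 1 has fragments 0 ("web search"), 1 ("archive"), 2 ("index");
# version 1 uses fragments {0, 1}, version 2 uses fragments {0, 2}.
frag_index = {
    "web": {(1, 0), (2, 0)},
    "search": {(1, 0)},
    "archive": {(1, 1)},
    "index": {(1, 2)},
}
version_table = {1: {1: {0, 1}, 2: {0, 2}}, 2: {1: {0}}}
print(local_sharing_query(frag_index, version_table, ["web", "index"]))
print(local_sharing_query(frag_index, version_table, ["archive", "index"]))
```

The second query shows why step 2 is needed: page 1 contains both words across its fragments, but no single version contains both, so nothing is returned.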
Efficient Updates

Fig.8 Index Updates

 Partition into fragments.


 Hash the content of each fragment.

Efficient Updates cont…

 Index a fragment only if it is not already in the Hash Table.

 Add a posting for each term in the fragment to the
inverted index.
 Posting: (fragID, f, p_0, ..., p_{f-1}), where f is the
frequency of the term in the fragment and p_i are the
positions of the term in the fragment.
 fragID: (docID of the fragment's primary page, fragment
number)
 Update or insert the appropriate records in the
various tables.
 All new postings are first inserted into a main-memory
structure and later periodically merged into disk-based
structures
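
The update steps can be sketched with an in-memory class. This is a simplified illustration under several assumptions: the class and method names are hypothetical, the page arrives already partitioned into fragments, SHA-1 stands in for the content hash, the hash table is shared across all pages (global sharing), and the merge into disk-based structures is not modeled.

```python
import hashlib
from collections import defaultdict

class FragmentIndex:
    def __init__(self):
        self.hash_table = {}                  # content hash -> fragID
        self.inverted = defaultdict(list)     # term -> [(fragID, f, positions)]
        self.versions = defaultdict(list)     # docID -> list of versions (fragID lists)
        self.next_frag_no = defaultdict(int)  # docID -> next fragment number

    def add_version(self, doc_id, fragments):
        """Index a new version of page doc_id, given its fragments."""
        frag_ids = []
        for text in fragments:
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            frag_id = self.hash_table.get(digest)
            if frag_id is None:
                # New fragment: this page becomes its primary page.
                frag_id = (doc_id, self.next_frag_no[doc_id])
                self.next_frag_no[doc_id] += 1
                self.hash_table[digest] = frag_id
                positions = defaultdict(list)
                for pos, term in enumerate(text.lower().split()):
                    positions[term].append(pos)
                for term, plist in positions.items():
                    self.inverted[term].append((frag_id, len(plist), plist))
            frag_ids.append(frag_id)
        self.versions[doc_id].append(frag_ids)
        return frag_ids

idx = FragmentIndex()
print(idx.add_version(1, ["web search", "old news"]))
print(idx.add_version(1, ["web search", "new news"]))
print(idx.add_version(2, ["new news", "sports"]))
```

Only four distinct fragments are indexed for the six fragment occurrences above; the repeated ones are recorded in the version lists by fragID without adding new postings.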
Experimental Evaluation

Fig.9 Cumulative percentage of unique fragments versus week of crawl

 The experiment used a data set from Stanford WebBase: a total of
6,356,374 versions of pages from 2,528,362 distinct URLs.
 The number of fragments is reduced when duplicate
fragments are eliminated under the local sharing policy.

Experimental Evaluation cont…

Fig.10 Comparison of number of fragments under different policies.
Fig.11 Relative reduction in the number of fragments and positions

 Global sharing performs better than local sharing in terms of size.

Conclusion
 Provides a new framework for indexing and query
processing on textual collections with significant
amounts of redundancy.

 Results in significant reductions in index size and query
processing cost.

 Supports highly efficient updates.

 It can be used for applications such as desktop search
and indexing of versioning file systems that retain old
versions of all files.

References
[1] Jiangong Zhang and Torsten Suel. Efficient search in large textual collections
with redundancy. Proceedings of the Sixteenth International World Wide Web
Conference, pages 412-420, 2007.

[2] S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: Local algorithms
for document fingerprinting. Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data, pages 76-85, 2003.

[3] L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. Dynamic
maintenance of web indexes using landmarks. Proceedings of the 12th Int.
World Wide Web Conference, pages 102-111, 2003.

[4] F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted
indexes for fast query evaluation. Proceedings of the 25th Annual SIGIR
Conference on Research and Development in Information Retrieval,
pages 222-229, 2002.

[5] V. Anh and A. Moffat. Index compression using fixed binary codewords.
Proceedings of the 15th Int. Australasian Database Conference,
pages 61-67, 2004.
Thank You
