Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Download
Standard view
Full view
of .
Look up keyword
Like this
2Activity
0 of .
Results for:
No results containing your search query
P. 1
Fast Snippet Search

Fast Snippet Search

Ratings: (0)|Views: 117|Likes:
Published by Simon Wistow

More info:

Published by: Simon Wistow on Dec 05, 2009
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF or read online from Scribd
See more
See less

12/04/2009

 
Fast Generation of Result Snippets in Web Search
Andrew Turpin &Yohannes Tsegay
RMIT UniversityMelbourne, Australia
aht@cs.rmit.edu.auytsegay@cs.rmit.edu.auDavid Hawking
CSIRO ICT CentreCanberra, Australia
david.hawking@acm.orgHugh E. Williams
Microsoft CorporationOne Microsoft WayRedmond, WA.
hughw@microsoft.com
ABSTRACT
The presentation of query biased document snippets as partof results pages presented by search engines has become anexpectation of search engine users. In this paper we ex-plore the algorithms and data structures required as part of a search engine to allow efficient generation of query biasedsnippets. We begin by proposing and analysing a documentcompression method that reduces snippet generation timeby 58% over a baseline using the
zlib
compression library.These experiments reveal that finding documents on sec-ondary storage dominates the total cost of generating snip-pets, and so caching documents in RAM is essential for afast snippet generation process. Using simulation, we exam-ine snippet generation performance for different size RAMcaches. Finally we propose and analyse document reorder-ing and compaction, revealing a scheme that increases thenumber of document cache hits with only a marginal af-fect on snippet quality. This scheme effectively doubles thenumber of documents that can fit in a fixed size cache.
Categories and Subject Descriptors
H.3.3 [
Information Storage and Retrieval
]: InformationSearch and Retrieval; H.3.4 [
Information Storage andRetrieval
]: Systems and Software—
performance evaluation (efficiency and effectiveness)
;
General Terms
Algorithms, Experimentation, Measurement, Performance
Keywords
Web summaries, snippet generation, document caching
1. INTRODUCTION
Each result in search results list delivered by current WWWsearch engines such as
search.yahoo.com
,
google.com
and
search.msn.com
typically contains the title and URL of theactual document, links to live and cached versions of thedocument and sometimes an indication of file size and type.
Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspecificpermissionand/orafee.
SIGIR’07,
July23–27,2007,Amsterdam,TheNetherlands.Copyright2007ACM978-1-59593-597-7/07/0007...
$
5.00.
In addition, one or more
snippets
are usually presented, giv-ing the searcher a sneak preview of the document contents.Snippets are short fragments of text extracted from thedocument content (or its metadata). They may be static(for example, always show the first 50 words of the docu-ment, or the content of its description metadata, or a de-scription taken from a directory site such as
dmoz.org
) or
query-biased 
[20]. A query-biased snippet is one selectivelyextracted on the basis of its relation to the searcher’s query.The addition of informative snippets to search results maysubstantially increase their value to searchers. Accuratesnippets allow the searcher to make good decisions aboutwhich results are worth accessing and which can be ignored.In the best case, snippets may obviate the need to open anydocuments by directly providing the answer to the searcher’sreal information need, such as the contact details of a personor an organization.Generation of query-biased snippets by Web search en-gines indexing of the order of ten billion web pages and han-dling hundreds of millions of search queries per day imposesa very significant computational load (remembering thateach search typically generates ten snippets). The simple-minded approach of keeping a copy of each document in afile and generating snippets by opening and scanning files,works when query rates are low and collections are small,but does not scale to the degree required. The overhead of opening and reading ten files per query on top of access-ing the index structure to locate them, would be manifestlyexcessive under heavy query load. Even storing ten billionfiles and the corresponding hundreds of terabytes of data isbeyond the reach of traditional filesystems. Special-purposefilesystems have been built to address these problems [6].Note that the utility of snippets is by no means restrictedto whole-of-Web search applications. Efficient generation of snippets is also important at the scale of whole-of-governmentsearch services such as
www.firstgov.gov
(c. 25 millionpages) and
govsearch.australia.gov.au
(c. 5 million pages)and within large enterprises such as IBM [2] (c. 50 millionpages). Snippets may be even more useful in database orfilesystem search applications in which no useful URL ortitle information is present.We present a new algorithm and compact single-file struc-ture designed for rapid generation of high quality snippetsand compare its space/time performance against an obviousbaseline based on the
zlib
compressor on various data sets.We report the proportion of time spent for disk seeks, diskreads and cpu processing; demonstrating that the time forlocating each document (seek time) dominates, as expected.
 
As the time to process a document in RAM is small incomparison to locating and reading the document into mem-ory, it may seem that compression is not required. However,this is only true if there is no caching of documents in RAM.Controlling the RAM of physical systems for experimenta-tion is difficult, hence we use simulation to show that cachingdocuments dramatically improves the performance of snip-pet generation. In turn, the more documents can be com-pressed, the more can fit in cache, and hence the more diskseeks can be avoided: the classic data compression tradeoff that is exploited in inverted file structures and computingranked document lists [24].As hitting the document cache is important, we examinedocument
compaction 
, as opposed to compression, schemesby imposing an
a priori 
ordering of sentences within a docu-ment, and then only allowing leading sentences into cache foreach document. This leads to further time savings, with onlymarginal impact on the quality of the snippets returned.
2. RELATED WORK
Snippet generation is a special type of extractive docu-ment summarization, in which sentences, or sentence frag-ments, are selected for inclusion in the summary on the basisof the degree to which they match the search query. Thisprocess was given the name of 
query-biased summarization 
by Tombros and Sanderson [20] The reader is referred toMani [13] and to Radev et al. [16] for overviews of the verymany different applications of summarization and for theequally diverse methods for producing summaries.Early Web search engines presented query-independentsnippets consisting of the first
k
bytes of the result docu-ment. Generating these is clearly much simpler and muchless computationally expensive than processing documentsto extract query biased summaries, as there is no need tosearch the document for text fragments containing queryterms. To our knowledge, Google was the first whole-of-Web search engine to provide query biased summaries, butsummarization is listed by Brin and Page [1] only under theheading of future work.Most of the experimental work using query-biased sum-marization has focused on comparing their value to searchersrelative to other types of summary [20, 21], rather than ef-ficient generation of summaries. Despite the importance of efficient summary generation in Web search, few algorithmsappear in the literature. Silber and McKoy [19] describe alinear-time lexical chaining algorithm for use in generic sum-maries, but offer no empirical data for the performance of their algorithm. White et al [21] report some experimentaltimings of their WebDocSum system, but the snippet gener-ation algorithms themselves are not isolated, so it is difficultto infer snippet generation time comparable to the times wereport in this paper.
3. SEARCH ENGINE ARCHITECTURES
A search engine must perform a variety of activities, and iscomprised of many sub-systems, as depicted by the boxes inFigure 1. Note that there may be several other sub-systemssuch as the “Advertising Engine” or the “Parsing Engine”that could easily be added to the diagram, but we have con-centrated on the sub-systems that are relevant to snippetgeneration. Depending on the number of documents thatthe search engine indexes, the data and processes for each
RankingEngineCrawling EngineIndexing EngineEngineLexiconMeta DataEngineEngineSnippet
   T  e  r  m    &    D  o  c   #  s
 S ni   p p e  t   p e  d  o c 
WEBQuery EngineQueryResults Page
   T  e  r  m    #  s
D o c  s 
    I   n   v  e   r   t  e   d     l    i  s   t  s
D  o  c  s  
 p e  d  o c i   t  l   e  , UR , e  t   c 
   D  o  c   #  s
   
  T e r m s
   Q  u  e  r  y  s   t  r   i  n  g
 e m  s 
Figure 1: An abstraction of some of the sub-systemsin a search engine. Depending on the number of documents indexed, each sub-system could reside ona single machine, be distributed across thousands of machines, or a combination of both.
sub-system could be distributed over many machines, or alloccupy a single server and filesystem, competing with eachother for resources. Similarly, it may be more efficient tocombine some sub-systems in an implementation of the di-agram. For example, the meta-data such as document titleand URL requires minimal computation apart from high-lighting query words, but we note that disk seeking is likelyto be minimized if title, URL and fixed summary informa-tion is stored contiguously with the text from which querybiased summaries are extracted. Here we ignore the fixedtext and consider only the generation of query biased sum-maries: we concentrate on the “Snippet Engine”.In addition to data and programs operating on that data,each sub-system also has its own memory management scheme.The memory management system may simply be the mem-ory hierarchy provided by the operating system used on ma-chines in the sub-system, or it may be explicitly coded tooptimise the processes in the sub-system.There are many papers on caching in search engines (see[3] and references therein for a current summary), but itseems reasonable that there is a
query cache
in the QueryEngine that stores precomputed final result pages for verypopular queries. When one of the popular queries is issued,the result page is fetched straight from the query cache. If the issued query is not in the query cache, then the QueryEngine uses the four sub-systems in turn to assemble a re-sults page.1. The Lexicon Engine maps query terms to integers.2. The Ranking Engine retrieves inverted lists for eachterm, using them to get a ranked list of documents.3. The Snippet Engine uses those document numbers andquery term numbers to generate snippets.4. The Meta Data Engine fetches other information abouteach document to construct the results page.
 
IN
A document broken into one sentence per line,and a sequence of query terms.
1
For each line of the text,
L
= [
w
1
,w
2
,...,w
m
]
2
Let
h
be 1 if 
L
is a heading, 0 otherwise.
3
Let
be 2 if 
L
is the first line of a document,1 if it is the second line, 0 otherwise.
4
Let
c
be the number of 
w
i
that are queryterms, counting repetitions.
5
Let
d
be the number of distinct query termsthat match some
w
i
.
6
Identify the longest contiguous run of queryterms in
L
, say
w
j
...w
j
+
k
.
7
Use a weighted combination of 
c
,
d
,
k
,
h
and
to derive a score
s
.
8
Insert
L
into a max-heap using
s
as the key.
OUT
Remove the number of sentences required fromthe heap to form the summary.
Figure 2: Simple sentence ranker that operates onraw text with one sentence per line.
4. THE SNIPPET ENGINE
For each document identifier passed to the Snippet En-gine, the engine must generate text, preferably containingquery terms, that attempts to summarize that document.Previous work on summarization identifies the sentence asthe minimal unit for extraction and presentation to theuser [12]. Accordingly, we also assume a web snippet extrac-tion process will extract sentences from documents. In orderto construct a snippet, all sentences in a document should beranked against the query, and the top two or three returnedas the snippet. The scoring of sentences against queries hasbeen explored in several papers [7, 12, 18, 20, 21], with dif-ferent features of sentences deemed important.Based on these observations, Figure 2, shows the generalalgorithm for scoring sentences in relevant documents, withthe highest scoring sentences making the snippet for eachdocument. The final score of a sentence, assigned in Step 7,can be derived in many different ways. In order to avoidbias towards any particular scoring mechanism, we comparesentence quality later in the paper using the individual com-ponents of the score, rather than an arbitrary combinationof the components.
4.1 Parsing Web Documents
Unlike well edited text collections that are often the targetfor summarization systems, Web data is often poorly struc-tured, poorly punctuated, and contains a lot of data that donot form part of valid sentences that would be candidatesfor parts of snippets.We assume that the documents passed to the SnippetEngine by the Indexing Engine have all HTML tags andJavaScript removed, and that each document is reduced to aseries of word tokens separated by non-word tokens. We de-fine a word token as a sequence of alphanumeric characters,while a non-word is a sequence of non-alphanumeric charac-ters such as whitespace and the other punctuation symbols.Both are limited to a maximum of 50 characters. Adjacent,repeating characters are removed from the punctuation.Included in the punctuation set is a special end of sentencemarker which replaces the usual three sentence terminators“?!.”. Often these explicit punctuation characters are miss-ing, and so HTML tags such as
<br>
and
<p>
are assumed toterminate sentences. In addition, a sentence must contain atleast five words and no more than twenty words, with longeror shorter sentences being broken and joined as required tomeet these criteria [10].Unterminated HTML tags—that is, tags with an openbrace, but no close brace—cause all text from the open braceto the next open brace to be discarded.A major problem in summarizing web pages is the pres-ence of large amounts of promotional and navigational ma-terial (“navbars”) visually above and to the left of the actualpage content. For example, “The most wonderful companyon earth. Products. Service. About us. Contact us. Trybefore you buy.” Similar, but often not identical, naviga-tional material is typically presented on every page within asite. This material tends to lower the quality of summariesand slow down summary generation.In our experiments we did not use any particular heuris-tics for removing navigational information as the test collec-tion in use (
wt100g
) pre-dates the widespread take up of the current style of web publishing. In
wt100g
, the averageweb page size is more than half the current Web average [11].Anecdotally, the increase is due to inclusion of sophisticatednavigational and interface elements and the JavaScript func-tions to support them.Having defined the format of documents that are pre-sented to the Snippet Engine, we now define our CompressedToken System (
CTS
) document storage scheme, and thebaseline system used for comparison.
4.2 Baseline Snippet Engine
An obvious document representation scheme is to simplycompress each document with a well known adaptive com-pressor, and then decompress the document as required [1],using a string matching algorithm to effect the algorithm inFigure 2. Accordingly, we implemented such a system, using
zlib
[4] with default parameters to compress every documentafter it has been parsed as in Section 4.1.Each document is stored in a single file. While manage-able for our small test collections or small enterprises withmillions of documents, a full Web search engine may requiremultiple documents to inhabit single files, or a special pur-pose filesystem [6].For snippet generation, the required documents are de-compressed one at a time, and a linear search for providedquery terms is employed. The search is optimized for ourspecific task, which is restricted to matching whole wordsand the sentence terminating token, rather than general pat-tern matching.
4.3 The
CTS
Snippet Engine
Several optimizations over the baseline are possible. Thefirst is to employ a semi-static compression method over theentire document collection, which will allow faster decom-pression with minimal compression loss [24]. Using a semi-static approach involves mapping words and non-words pro-duced by the parser to single integer tokens, with frequentsymbols receiving small integers, and then choosing a codingscheme that assigns small numbers a small number of bits.Words and non-words strictly alternate in the compressedfile, which always begins with a word.In this instance we simply assign each symbol its ordinalnumber in a list of symbols sorted by frequency. We use the

Activity (2)

You've already reviewed this. Edit your review.
1 thousand reads
1 hundred reads

You're Reading a Free Preview

Download
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->