Design and Implementation of a High-Performance Distributed Web Crawler

Vladislav Shkapenyuk and Torsten Suel
CIS Department, Polytechnic University, Brooklyn, NY 11201
vshkap@research.att.com, suel@poly.edu
Abstract

Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of [n] million pages on [m] million hosts.
1 Introduction
The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. Due to this explosion in size, web search engines are becoming increasingly important as the primary means of locating relevant information. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and storing downloaded pages in a large database that is later indexed for efficient execution of user queries. Many researchers have looked at web search technology over the last few years, including crawling strategies, storage, indexing, ranking techniques, and a significant amount of work on the structural analysis of the web and web graph; see [1, 7] for overviews of some recent work and [26, 2] for basic techniques.

(Footnote: Work supported by NSF CAREER Award NSF CCR-0093400, Intel Corporation, and the New York State Center for Advanced Technology in Telecommunications (CATT) at Polytechnic University, and by equipment grants from Intel Corporation and Sun Microsystems.)

Thus, highly efficient crawling systems are needed in order to download the hundreds of millions of web pages indexed by the major search engines. In fact, search engines compete against each other primarily based on the size and currency of their underlying database, in addition to the quality and response time of their ranking function. Even the largest search engines, such as Google or AltaVista, currently cover only limited parts of the web, and much of their data is several months out of date. (We note, however, that crawling speed is not the only obstacle to increased search engine size, and that the scaling of query throughput and response time to larger collections is also a major issue.)

A crawler for a large search engine has to address two issues. First, it has to have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs to have a highly optimized system architecture that can download a large number of pages per second while being robust against crashes, manageable, and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on strategies for crawling important pages first [12, 21], crawling pages on a particular topic or of a particular type [9, 8, 23, 13], recrawling (refreshing) pages in order to optimize the overall "freshness" of a collection of pages [11, 10], or scheduling of crawling activity over time [25].

In contrast, there has been less work on the second issue. Clearly, all the major search engines have highly optimized crawling systems, although details of these systems are usually proprietary. The only system described in detail in the literature appears to be the Mercator system of Heydon and Najork at DEC/Compaq [16], which is used by AltaVista. (Some details are also known about the first version of the Google crawler [5] and the system used by the Internet Archive [6].) While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.

Most of the recent work on crawling strategies does not address these performance issues at all, but instead attempts to minimize the number of pages that need to be downloaded, or to maximize the benefit obtained per downloaded page. (An exception is the work in [8] that considers the system performance of a focused crawler built on top of a general-purpose database system, although the throughput of that system is still significantly below that of a high-performance bulk crawler.) In the case of applications that have only very limited bandwidth, that is acceptable. However, in the case of a larger search engine, we need to combine a good crawling strategy and an optimized system design.

In this paper, we describe the design and implementation of such an optimized system on a network of workstations. The choice of crawling strategy is largely orthogonal to our work. We describe the system using the example of a simple breadth-first crawl, although the system can be adapted to other strategies. We are primarily interested in the I/O and network efficiency aspects of such a system, and in scalability issues in terms of crawling speed and number of participating nodes. We are currently using the crawler to acquire large data sets for work on other aspects of web search technology such as indexing, query processing, and link analysis. We note that high-performance crawlers are currently not widely used by academic researchers, and hence few groups have run experiments on a scale similar to that of the major commercial search engines (one exception being the WebBase project [17] and related work at Stanford). There are many interesting questions in this realm of massive data sets that deserve more attention by academic researchers.
1.1 Crawling Applications
There are a number of different scenarios in which crawlers are used for data acquisition. We now describe a few examples and how they differ in the crawling strategies used.
Breadth-First Crawler:
In order to build a major search engine or a large repository such as the Internet Archive [18], high-performance crawlers start out at a small set of pages and then explore other pages by following links in a "breadth first-like" fashion. In reality, the web pages are often not traversed in a strict breadth-first fashion, but using a variety of policies, e.g., for pruning crawls inside a web site, or for crawling more important pages first.
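The traversal just described amounts to a FIFO frontier plus a "URL seen" set. In the minimal sketch below, `fetch` and `extract_links` are hypothetical callbacks standing in for the download and parsing machinery, and everything is kept in memory (the disk-residency issues for large crawls are discussed later in the paper):

```python
from collections import deque

def breadth_first_crawl(seeds, fetch, extract_links, max_pages=1000):
    """Traverse pages in breadth-first order starting from seed URLs.

    fetch(url) returns page content; extract_links(url, page) returns
    the hyperlinks found on the page. Both are illustrative stand-ins.
    """
    frontier = deque(seeds)      # FIFO queue gives breadth-first order
    seen = set(seeds)            # the "URL seen" structure (in memory here)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        crawled.append(url)
        for link in extract_links(url, page):
            if link not in seen:     # enqueue each URL at most once
                seen.add(link)
                frontier.append(link)
    return crawled
```

A strict-BFS policy like this is only one choice; the pruning and importance-first policies mentioned above would replace the plain FIFO queue with a priority queue.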
Recrawling Pages for Updates:
After pages are initially acquired, they may have to be periodically recrawled and checked for updates. In the simplest case, this could be done by starting another broad breadth-first crawl, or by simply requesting all URLs in the collection again. However, a variety of heuristics can be employed to recrawl more important pages, sites, or domains more frequently. Good recrawling strategies are crucial for maintaining an up-to-date search index with limited crawling bandwidth, and recent work by Cho and Garcia-Molina [11, 10] has studied techniques for optimizing the "freshness" of such collections given observations about a page's update history.

(Footnote: See [12] for heuristics that attempt to crawl important pages first, and [21] for experimental results showing that even strict breadth-first will quickly find most pages with high PageRank.)
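One simple way to prioritize recrawling, in the spirit of (but much simpler than) the freshness work cited above, is to rank pages by the expected number of changes missed since each was last crawled. The scoring rule, function name, and parameters below are illustrative assumptions, not the technique of [11, 10]:

```python
import heapq

def recrawl_schedule(pages, now, budget):
    """Choose up to `budget` URLs to recrawl next.

    pages maps url -> (last_crawl_time, estimated_changes_per_day).
    Pages are ranked by (now - last_crawl_time) * change_rate, i.e.
    the expected number of updates missed since the last crawl.
    """
    scored = []
    for url, (last, rate) in pages.items():
        staleness = (now - last) * rate
        heapq.heappush(scored, (-staleness, url))   # max-heap via negation
    return [heapq.heappop(scored)[1] for _ in range(min(budget, len(scored)))]
```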
Focused Crawling:
More specialized search engines may use crawling policies that attempt to focus only on certain types of pages, e.g., pages on a particular topic or in a particular language, images, mp3 files, or computer science research papers. In addition to heuristics, more general approaches have been proposed based on link structure analysis [9, 8] and machine learning techniques [13, 23]. The goal of a focused crawler is to find many pages of interest without using a lot of bandwidth. Thus, most of the previous work does not use a high-performance crawler, although doing so could support large specialized collections that are significantly more up-to-date than a broad search engine.
Random Walking and Sampling:
Several techniques have been studied that use random walks on the web graph (or a slightly modified graph) to sample pages or estimate the size and quality of search engines [3, 15, 14].
Crawling the “Hidden Web”:
A lot of the data accessible via the web actually resides in databases and can only be retrieved by posting appropriate queries and/or filling out forms on web pages. Recently, a lot of interest has focused on automatic access to this data, also called the "Hidden Web", "Deep Web", or "Federated Facts and Figures". Work in [22] has looked at techniques for crawling this data. A crawler such as the one described here could be extended and used as an efficient front-end for such a system. We note, however, that there are many other challenges associated with access to the hidden web, and the efficiency of the front end is probably not the most important issue.
1.2 Basic Crawler Structure
Given these scenarios, we would like to design a flexible system that can be adapted to different applications and strategies with a reasonable amount of work. Note that there are significant differences between the scenarios. For example, a broad breadth-first crawler has to keep track of which pages have been crawled already; this is commonly done using a "URL seen" data structure that may have to reside on disk for large crawls. A link analysis-based focused crawler, on the other hand, may use an additional data structure to represent the graph structure of the crawled part of the web, and a classifier to judge the relevance of a page [9, 8], but the size of the structures may be much smaller. On the other hand, there are a number of common tasks that need to be done in all or most scenarios, such as enforcement of robot exclusion, crawl speed control, or DNS resolution.
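One way a "URL seen" structure can spill to disk is to keep a small in-memory batch of recently seen URLs and periodically merge it into a sorted file of known URLs. The class below is a minimal sketch of that general idea only, with invented names; for brevity it rereads the file on every lookup, which a real implementation would avoid by batching lookups into the periodic merge:

```python
import os

class DiskUrlSeen:
    """Toy disk-backed "URL seen" set: recent URLs live in an in-memory
    batch that is merged into a sorted on-disk file when it fills up."""

    def __init__(self, path, batch_limit=1024):
        self.path = path
        self.batch = set()
        self.batch_limit = batch_limit
        open(path, "a").close()          # make sure the file exists

    def _on_disk(self):
        # Inefficient full scan, kept simple for illustration.
        with open(self.path) as f:
            return set(line.strip() for line in f if line.strip())

    def add_if_new(self, url):
        """Return True iff `url` has never been seen before."""
        if url in self.batch or url in self._on_disk():
            return False
        self.batch.add(url)
        if len(self.batch) >= self.batch_limit:
            self._flush()
        return True

    def _flush(self):
        # Merge the in-memory batch into the sorted disk file.
        merged = sorted(self._on_disk() | self.batch)
        with open(self.path, "w") as f:
            f.write("\n".join(merged) + ("\n" if merged else ""))
        self.batch.clear()
```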
For simplicity, we separate our crawler design into two main components, referred to as the crawling application and the crawling system. The crawling application decides what page to request next given the current state and the previously crawled pages, and issues a stream of requests (URLs) to the crawling system. The crawling system (eventually) downloads the requested pages and supplies them to the crawling application for analysis and storage. The crawling system is in charge of tasks such as robot exclusion, speed control, and DNS resolution that are common to most scenarios, while the application implements crawling strategies such as "breadth-first" or "focused". Thus, to implement a focused crawler instead of a breadth-first crawler, we would use the same crawling system (with a few different parameter settings) but a significantly different application component, written using a library of functions for common tasks such as parsing, maintenance of the "URL seen" structure, and communication with the crawling system and storage.
[Figure 1: Basic two components of the crawler. The application sends URL requests to the crawling system; the system returns downloaded pages.]
At first glance, implementation of the crawling system may appear trivial. This is however not true in the high-performance case, where several hundred or even a thousand pages have to be downloaded per second. In fact, our crawling system itself consists of several components that can be replicated for higher performance. Both crawling system and application can also be replicated independently, and several different applications could issue requests to the same crawling system, showing another motivation for the design. More details on the architecture are given in Section 2. We note here that this partition into application and system components is a design choice in our system, and not used by some other systems, and that it is not always obvious in which component a particular task should be handled. For the work in this paper, we focus on the case of a broad breadth-first crawler as our crawling application.
1.3 Requirements for a Crawler
We now discuss the requirements for a good crawler, and approaches for achieving them. Details on our solutions are given in the subsequent sections.

(Footnote: Of course, in the case of the application, replication is up to the designer of the component, who has to decide how to partition data structures and workload.)

(Footnote: E.g., Mercator [16] does not use this partition, but achieves flexibility through the use of pluggable Java components.)
Flexibility: As mentioned, we would like to be able to use the system in a variety of scenarios, with as few modifications as possible.
Low Cost and High Performance:
The system should scale to at least several hundred pages per second and hundreds of millions of pages per run, and should run on low-cost hardware. Note that efficient use of disk access is crucial to maintain a high speed after the main data structures, such as the "URL seen" structure and crawl frontier, become too large for main memory. This will only happen after downloading several million pages.
Robustness: There are several aspects here. First, since the system will interact with millions of servers, it has to tolerate bad HTML, strange server behavior and configurations, and many other odd issues. Our goal here is to err on the side of caution, and if necessary ignore pages and even entire servers with odd behavior, since in many applications we can only download a subset of the pages anyway. Secondly, since a crawl may take weeks or months, the system needs to be able to tolerate crashes and network interruptions without losing (too much of) the data. Thus, the state of the system needs to be kept on disk. We note that we do not really require strict ACID properties. Instead, we decided to periodically synchronize the main structures to disk, and to recrawl a limited number of pages after a crash.
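The checkpoint-and-recrawl approach can be sketched as an atomic write-then-rename of the crawler state. The JSON format and field names below are invented for illustration; a real crawler would checkpoint much larger on-disk structures incrementally:

```python
import json
import os

def checkpoint(state, path):
    """Write crawler state to disk atomically: write a temp file, fsync,
    then rename over the old checkpoint. After a crash, the last completed
    checkpoint is recovered and pages crawled since then are simply
    recrawled, so no full ACID machinery is needed."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)    # atomic replacement on POSIX systems

def recover(path, initial_state):
    """Load the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return initial_state
    with open(path) as f:
        return json.load(f)
```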
Etiquette and Speed Control:
It is extremely important to follow the standard conventions for robot exclusion (robots.txt and robots meta tags), to supply a contact URL for the crawler, and to supervise the crawl. In addition, we need to be able to control access speed in several different ways. We have to avoid putting too much load on a single server; we do this by contacting each site only once every [n] seconds unless specified otherwise. It is also desirable to throttle the speed on a domain level, in order not to overload small domains, and for other reasons to be explained later. Finally, since we are in a campus environment where our connection is shared with many other users, we also need to control the total download rate of our crawler. In particular, we crawl at low speed during the peak usage hours of the day, and at a much higher speed during the late night and early morning, limited mainly by the load tolerated by our main campus router.
Manageability and Reconfigurability:
An appropriate interface is needed to monitor the crawl, including the speed of the crawler, statistics about hosts and pages, and the sizes of the main data sets. The administrator should be able to adjust the speed, add and remove components, shut down the system, force a checkpoint, or add hosts and domains to a "blacklist" of places that the crawler should avoid. After a crash or shutdown, the software of the system may be modified to fix problems, and we may want to continue the crawl
