For simplicity, we separate our crawler design into two main components, referred to as the crawling application and the crawling system. The crawling application decides what page to request next given the current state and the previously crawled pages, and issues a stream of requests (URLs) to the crawling system. The crawling system (eventually) downloads the requested pages and supplies them to the crawling application for analysis and storage. The crawling system is in charge of tasks such as robot exclusion, speed control, and DNS resolution that are common to most scenarios, while the application implements crawling strategies such as "breadth-first" or "focused". Thus, to implement a focused crawler instead of a breadth-first crawler, we would use the same crawling system (with a few different parameter settings) but a significantly different application component, written using a library of functions for common tasks such as parsing, maintenance of the "URL seen" structure, and communication with the crawling system and storage.
Figure 1: Basic two components of the crawler
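As a rough illustration of this division of labor, the sketch below (our own, not the system's code) shows a minimal interface between the two components; all class and function names, and the in-memory frontier and seen-URL set, are illustrative simplifications.

# Minimal sketch of the application/system split: the crawling system hides
# download and politeness details behind a small interface, and the
# application implements the crawl strategy on top of it.
from collections import deque


class CrawlingSystem:
    """Stub for the crawling system: robot exclusion, speed control, DNS
    resolution, and the actual downloading would live behind this interface."""

    def __init__(self):
        self._queue = deque()

    def request(self, url):
        self._queue.append(url)            # application issues a stream of URLs

    def next_downloaded_page(self):
        url = self._queue.popleft()        # real system downloads asynchronously
        return url, "<html>...</html>"     # placeholder page contents


def extract_links(page):
    """Placeholder parser; a real application would use a parsing library."""
    return []


class BreadthFirstApplication:
    """The crawl strategy lives here; a focused crawler would swap this class."""

    def __init__(self, system, seeds):
        self.system = system
        self.frontier = deque(seeds)       # crawl frontier
        self.seen = set(seeds)             # toy in-memory "URL seen" structure

    def run(self):
        while self.frontier:
            self.system.request(self.frontier.popleft())
            url, page = self.system.next_downloaded_page()
            for link in extract_links(page):
                if link not in self.seen:  # avoid re-requesting known URLs
                    self.seen.add(link)
                    self.frontier.append(link)


BreadthFirstApplication(CrawlingSystem(), ["http://example.com/"]).run()

Swapping in a different strategy (e.g., a focused crawler) would replace only the application class; the crawling system and its interface stay unchanged.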
At first glance, implementation of the crawling system may appear trivial. This is however not true in the high-performance case, where several hundred or even a thousand pages have to be downloaded per second. In fact, our crawling system itself consists of several components that can be replicated for higher performance. Both crawling system and application can also be replicated independently, and several different applications could issue requests to the same crawling system, showing another motivation for the design. (Of course, in the case of the application, replication is up to the designer of the component, who has to decide how to partition data structures and workload.) More details on the architecture are given in Section 2. We note here that this partition into application and system components is a design choice in our system, and not used by some other systems (e.g., Mercator does not use this partition, but achieves flexibility through the use of pluggable Java components), and that it is not always obvious in which component a particular task should be handled. For the work in this paper, we focus on the case of a broad breadth-first crawler as our crawling application.
1.3 Requirements for a Crawler
We now discuss the requirements for a good crawler, and approaches for achieving them. Details on our solutions are given in the subsequent sections.
Flexibility:
As mentioned, we would like to be able to use the system in a variety of scenarios, with as few modifications as possible.
Low Cost and High Performance:
The system should scale to at least several hundred pages per second and hundreds of millions of pages per run, and should run on low-cost hardware. Note that efficient use of disk access is crucial to maintain a high speed after the main data structures, such as the "URL seen" structure and crawl frontier, become too large for main memory. This will only happen after downloading several million pages.
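One common way to keep disk access efficient once these structures outgrow memory is to buffer newly discovered URLs in memory and resolve them against the on-disk set of seen URLs in periodic batch merges, so that disk traffic consists of a few large sequential passes rather than one random access per URL. The sketch below illustrates that general technique under our own simplifying assumptions; it is not this crawler's actual data structure.

# Batched "URL seen" check: candidate URLs are buffered in memory, and a
# periodic merge against a flat on-disk file decides which ones are new.
import os


class BatchedSeenURLs:
    def __init__(self, path, buffer_limit=100_000):
        self.path = path
        self.pending = set()               # URLs waiting for the next merge
        self.buffer_limit = buffer_limit
        if not os.path.exists(path):
            open(path, "w").close()

    def submit(self, url):
        """Buffer a candidate URL; returns True when a merge is due."""
        self.pending.add(url)
        return len(self.pending) >= self.buffer_limit

    def merge(self):
        """Scan the disk file once, append the genuinely new URLs, and
        return them (these are the ones to add to the crawl frontier)."""
        with open(self.path) as f:
            seen = set(line.rstrip("\n") for line in f)
        new = sorted(self.pending - seen)
        with open(self.path, "a") as f:
            for url in new:
                f.write(url + "\n")
        self.pending.clear()
        return new

A production system would keep the on-disk set sorted or hashed and merge rather than rescan it, but the point is the same: the decision "is this URL new?" is deferred and amortized over large sequential disk passes.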
Robustness:
There are several aspects here. First, since the system will interact with millions of servers, it has to tolerate bad HTML, strange server behavior and configurations, and many other odd issues. Our goal here is to err on the side of caution, and if necessary ignore pages and even entire servers with odd behavior, since in many applications we can only download a subset of the pages anyway. Secondly, since a crawl may take weeks or months, the system needs to be able to tolerate crashes and network interruptions without losing (too much of) the data. Thus, the state of the system needs to be kept on disk. We note that we do not really require strict ACID properties. Instead, we decided to periodically synchronize the main structures to disk, and to recrawl a limited number of pages after a crash.
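The sketch below illustrates this kind of checkpointing in simplified form; the file format, the interval, and the class name are placeholders rather than the system's actual mechanism. The main structures are written to disk atomically at intervals, and after a crash the crawl resumes from the last snapshot, accepting that pages fetched since then are recrawled.

# Simplified checkpoint/restart: periodically write the crawl state to disk
# atomically; after a crash, reload the last snapshot and redo recent work.
import json
import os
import time


class CheckpointedState:
    def __init__(self, path, interval_seconds=300):
        self.path = path
        self.interval = interval_seconds
        self.last_checkpoint = 0.0
        self.state = {"frontier": [], "seen": []}   # main crawl structures

    def maybe_checkpoint(self):
        if time.time() - self.last_checkpoint < self.interval:
            return
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)       # no strict ACID guarantees needed
        os.replace(tmp, self.path)         # atomic rename = consistent snapshot
        self.last_checkpoint = time.time()

    def recover(self):
        """After a crash, reload the last snapshot; pages fetched since the
        checkpoint are simply recrawled."""
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.state = json.load(f)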
Etiquette and Speed Control:
It is extremely important to follow the standard conventions for robot exclusion (robots.txt and robots meta tags), to supply a contact URL for the crawler, and to supervise the crawl. In addition, we need to be able to control access speed in several different ways. We have to avoid putting too much load on a single server; we do this by contacting each site only once every fixed number of seconds unless specified otherwise. It is also desirable to throttle the speed on a domain level, in order not to overload small domains, and for other reasons to be explained later. Finally, since we are in a campus environment where our connection is shared with many other users, we also need to control the total download rate of our crawler. In particular, we crawl at low speed during the peak usage hours of the day, and at a much higher speed during the late night and early morning, limited mainly by the load tolerated by our main campus router.
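The sketch below illustrates these three levels of speed control in simplified form; the class, the specific intervals, and the time-of-day thresholds are placeholders of ours, not the values actually used by the crawler.

# Three levels of speed control: per-host politeness interval, per-domain
# throttling, and a global rate that depends on the time of day.
import time
from urllib.parse import urlparse


class SpeedController:
    def __init__(self, host_interval=30.0, domain_interval=1.0):
        self.host_interval = host_interval        # seconds between hits to one host
        self.domain_interval = domain_interval    # gentler limit per domain
        self.last_host_access = {}
        self.last_domain_access = {}

    def global_delay(self):
        """Crawl slowly during peak daytime hours, faster at night."""
        hour = time.localtime().tm_hour
        return 1.0 if 8 <= hour < 20 else 0.1     # seconds between any two requests

    def ready(self, url, now=None):
        """Return True if fetching this URL now violates no politeness rule."""
        now = time.time() if now is None else now
        host = urlparse(url).hostname or ""
        domain = ".".join(host.split(".")[-2:])   # crude domain extraction
        return (now - self.last_host_access.get(host, 0.0) >= self.host_interval
                and now - self.last_domain_access.get(domain, 0.0) >= self.domain_interval)

    def record(self, url, now=None):
        now = time.time() if now is None else now
        host = urlparse(url).hostname or ""
        domain = ".".join(host.split(".")[-2:])
        self.last_host_access[host] = now
        self.last_domain_access[domain] = now

A real scheduler would additionally enforce the global rate by spacing consecutive requests, regardless of host, by at least global_delay() seconds.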
Manageability and Reconﬁgurability:
An appropriate interface is needed to monitor the crawl, including the speed of the crawler, statistics about hosts and pages, and the sizes of the main data sets. The administrator should be able to adjust the speed, add and remove components, shut down the system, force a checkpoint, or add hosts and domains to a "blacklist" of places that the crawler should avoid. After a crash or shutdown, the software of the system may be modified to fix problems, and we may want to continue the crawl