Dominos: A New Web Crawler’s Design
Younès Hafri
Ecole Polytechnique de Nantes
Institut National de l'Audiovisuel
4, avenue de l'Europe
94366 Bry sur Marne Cedex, France
yhafri@ina.fr

Chabane Djeraba
Laboratoire d'Informatique Fondamentale de Lille
UMR CNRS 8022, Bâtiment M3
59655 Villeneuve d'Ascq Cedex, France
djeraba@lifl.fr
ABSTRACT
Today's search engines are equipped with specialized agents known as Web crawlers (download robots) dedicated to crawling large Web contents on line. These contents are then analyzed, indexed and made available to users. Crawlers interact with thousands of Web servers over periods extending from a few weeks to several years. This type of crawling process therefore means that certain judicious criteria need to be taken into account, such as the robustness, flexibility and maintainability of these crawlers. In the present paper, we will describe the design and implementation of a real-time distributed system of Web crawling running on a cluster of machines. The system crawls several thousands of pages every second, includes a high-performance fault manager, is platform independent and is able to adapt transparently to a wide range of configurations without incurring additional hardware expenditure. We will then provide details of the system architecture and describe the technical choices for very high performance crawling. Finally, we will discuss the experimental results obtained, comparing them with other documented systems.
Keywords
Web Crawler, Distributed Systems, High Availability, Fault Tolerance
1. INTRODUCTION
The World Wide Web has grown at a phenomenal pace, from several thousand pages in 1993 to over 3 billion today. This explosion in size has made Web crawlers indispensable for information retrieval on the Web. They download large quantities of data and browse documents by passing from one hypertext link to another.

High performance crawling systems first appeared in academic and industrial sectors, allowing hundreds of millions of documents to be downloaded and indexed. Indeed, search engines may be compared on the basis of the number of
documents indexed, as well as of the quality of replies (score calculation) obtained. Even search engines such as Google or Altavista cover only a limited part of the Web, and the majority of their archives are not updated (we should point out, however, that download speed is not the only obstacle: weak bandwidth, unsuitable server configuration ...). Any crawling system in this category should offer at least the following two features. Firstly, it needs to be equipped with an intelligent navigation strategy, i.e. enabling it to make decisions regarding the choice of subsequent actions to be taken (pages to be downloaded etc). Secondly, its supporting hardware and software architecture should be optimized to crawl large quantities of documents per unit of time (generally per second). To this we may add fault tolerance (machine crash, network failure etc.) and considerations of Web server resources.

Recently we have seen some interest in these two fields. Studies on the first point include crawling strategies for important pages [9, 17], topic-specific document downloading [5, 6, 18, 10], page recrawling to optimize overall refresh frequency of a Web archive [8, 7] or scheduling the downloading activity according to time [22]. However, little research has been devoted to the second point, it being very difficult to implement [20, 13]. We will focus on this latter point in the rest of this paper. Indeed, only a few crawlers are equipped with an optimized scalable crawling system, yet details of their internal workings often remain obscure (the majority being proprietary solutions). The only system to have been given a fairly in-depth description in existing literature is Mercator by Heydon and Najork of DEC/Compaq [13], used in the AltaVista search engine (some details also exist on the first version of the Google [3] and Internet Archive [4] robots). Most recent studies on crawling strategy fail to deal with these features, contenting themselves with the solution of minor issues such as the calculation of the number of pages to be downloaded in order to maximize/minimize some functional objective. This may be acceptable in the case of small applications, but for real time ("soft" real time) applications the system must deal with a much larger number of constraints. We should also point out that little academic research is concerned with high performance search engines, as compared with their commercial counterparts (with the exception of the WebBase project [14] at Stanford).

In the present paper, we will describe a very high availability, optimized and distributed crawling system. We will use the system on what is known as breadth-first crawling, though this may be easily adapted to other navigation strategies. We will first focus on input/output, on management of network traffic and robustness when changing scale. We will also discuss download policies in terms of speed regulation, fault management by supervisors and the introduction/suppression of machine nodes without system restart during a crawl.

Our system was designed within the experimental framework of the Dépôt légal du Web Français (French Web Legal Deposit). This consists of archiving only multimedia documents in French available on line, indexing them and providing ways for these archives to be consulted. Legal deposit requires a real crawling strategy in order to ensure site continuity over time. The notion of registration is closely linked to that of archiving, which requires a suitable strategy to be useful. In the course of our discussion, we will therefore analyze the implication and impact of this experimentation for system construction.
2. STATE OF THE ART

2.1 Prerequisites of a Crawling System
In order to set our work in this field in context, listed below are definitions of services that should be considered the minimum requirements for any large-scale crawling system.
Flexibility: as mentioned above, with some minor adjustments our system should be suitable for various scenarios. However, it is important to remember that crawling is established within a specific framework: namely, Web legal deposit.
High Performance: the system needs to be scalable with a minimum of one thousand pages/second and extending up to millions of pages for each run on low cost hardware. Note that here, the quality and efficiency of disk access are crucial to maintaining high performance.
Fault Tolerance: this may cover various aspects. As the system interacts with several servers at once, specific problems emerge. First, it should at least be able to process invalid HTML code, deal with unexpected Web server behavior, and select good communication protocols etc. The goal here is to avoid this type of problem and, by force of circumstance, to be able to ignore such problems completely. Second, crawling processes may take days or weeks, and it is imperative that the system can handle failure, stopped processes or interruptions in network services, keeping data loss to a minimum. Finally, the system should be persistent, which means periodically switching large data structures from memory to the disk (e.g. restart after failure); a minimal checkpointing sketch is given after this list.
Maintainability and Configurability: an appropriate interface is necessary for monitoring the crawling process, including download speed, statistics on the pages and amounts of data stored. In online mode, the administrator may adjust the speed of a given crawler, add or delete processes, stop the system, add or delete system nodes and supply the black list of domains not to be visited, etc.
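The persistence requirement above can be illustrated with a small sketch. The following Python code is not from the paper (the real system is written in Erlang, and names such as FrontierCheckpoint and the pickle-based format are purely our own illustrative choices); it periodically snapshots a crawler's large in-memory structures, here the URL frontier and the set of seen URLs, to disk so that a crawl can resume after a failure.

import os
import pickle
import tempfile
import time
from collections import deque

class FrontierCheckpoint:
    """Periodically persists the URL frontier and seen-set to disk.

    Illustrative sketch only: the Dominos paper does not describe its
    on-disk format; pickle is used here just to keep the example short.
    """

    def __init__(self, path, interval_s=300):
        self.path = path
        self.interval_s = interval_s
        self._last_save = 0.0

    def maybe_save(self, frontier, seen):
        """Atomically write a snapshot if the checkpoint interval has elapsed."""
        now = time.time()
        if now - self._last_save < self.interval_s:
            return
        # Write to a temporary file first, then rename: a crash mid-write
        # never corrupts the previous valid checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"frontier": list(frontier), "seen": seen}, f)
        os.replace(tmp, self.path)
        self._last_save = now

    def restore(self):
        """Return (frontier, seen) from the last checkpoint, or empty structures."""
        if not os.path.exists(self.path):
            return deque(), set()
        with open(self.path, "rb") as f:
            state = pickle.load(f)
        return deque(state["frontier"]), state["seen"]

The atomic write-then-rename step is the important design point: a restart after failure always finds either the previous snapshot or the new one, never a half-written file.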
2.2 General Crawling Strategies
There are many highly accomplished techniques in terms of Web crawling strategy. We will describe the most relevant of these here.
Breadth-first Crawling: in order to build a wide Web archive like that of the Internet Archive [15], a crawl is carried out from a set of Web pages (initial URLs or seeds). A breadth-first exploration is launched by following hypertext links leading to those pages directly connected with this initial set. In fact, Web sites are not really browsed breadth-first and various restrictions may apply, e.g. limiting crawling processes to within a site, or downloading the pages deemed most interesting first [2]. A minimal code sketch of this strategy is given after this list.
Repetitive Crawling: once pages have been crawled, some systems require the process to be repeated periodically so that indexes are kept updated. In the most basic case, this may be achieved by launching a second crawl in parallel. A variety of heuristics exist to overcome this problem: for example, by frequently relaunching the crawling process of pages, sites or domains considered important to the detriment of others. A good crawling strategy is crucial for maintaining a constantly updated index list. Recent studies by Cho and Garcia-Molina [8, 7] have focused on optimizing the update frequency of crawls by using the history of changes recorded on each site.
Targeted Crawling: more specialized search engines use crawling process heuristics in order to target a certain type of page, e.g. pages on a specific topic or in a particular language, images, mp3 files or scientific papers. In addition to these heuristics, more generic approaches have been suggested. They are based on the analysis of the structures of hypertext links [6, 5] and techniques of learning [9, 18]: the objective here being to retrieve the greatest number of pages relating to a particular subject by using the minimum bandwidth. Most of the studies cited in this category do not use high performance crawlers, yet succeed in producing acceptable results.
Random Walks and Sampling: some studies have focused on the effect of random walks on Web graphs or modified versions of these graphs via sampling in order to estimate the size of documents on line [1, 12, 11].
Deep Web Crawling: a lot of data accessible via the Web are currently contained in databases and may only be downloaded through the medium of appropriate requests or forms. Recently, this often-neglected but fascinating problem has been the focus of new interest. The Deep Web is the name given to the part of the Web containing this category of data [9].
[2] See [9] for the heuristics that tend to find the most important pages first and [17] for experimental results proving that breadth-first crawling allows the swift retrieval of pages with a high PageRank.
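To make the breadth-first strategy above concrete, here is a small Python sketch (our own illustration, not code from the Dominos system): it keeps a FIFO queue of URLs seeded with the initial set, downloads each page, extracts its links and appends unseen URLs to the back of the queue, optionally restricting the crawl to the seed hosts. The naive regex link extraction and the absence of robots.txt handling are simplifications for brevity.

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

def breadth_first_crawl(seeds, max_pages=100, same_site_only=True):
    """Breadth-first crawl: FIFO frontier plus seen-set, as described above."""
    allowed_hosts = {urlparse(s).netloc for s in seeds}
    frontier = deque(seeds)          # FIFO queue of URLs to visit
    seen = set(seeds)                # every URL ever enqueued
    pages = {}                       # url -> HTML content

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()     # FIFO order gives breadth-first exploration
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                 # unreachable page: skip and move on
        pages[url] = html
        # Naive link extraction; a real crawler would use an HTML parser
        # and respect robot exclusion files (robots.txt) before fetching.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if same_site_only and urlparse(link).netloc not in allowed_hosts:
                continue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages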
 
Lastly, we should point out the acknowledged differences that exist between these scenarios. For example, a breadth-first search needs to keep track of all pages already crawled. An analysis of links should use structures of additional data to represent the graph of the sites in question, and a system of classifiers in order to assess the pages' relevancy [6, 5]. However, some tasks are common to all scenarios, such as respecting robot exclusion files (robots.txt), crawling speed, resolution of domain names ...

In the early 1990s, several companies claimed that their search engines were able to provide complete Web coverage. It is now clear that only partial coverage is possible at present. Lawrence and Giles [16] carried out two experiments in order to measure coverage performance of data established by crawlers and of their updates. They adopted an approach known as overlap analysis to estimate the size of the Web that may be indexed (see also Bharat and Broder 1998 on the same subject). Let W be the total set of Web pages and W_a ⊆ W and W_b ⊆ W the pages downloaded by two different crawlers a and b. What is the size of W_a and W_b as compared with W? Let us assume that uniform samples of Web pages may be taken and their membership of both sets tested. Let P(W_a) and P(W_b) be the probability that a page is downloaded by a or b respectively. We know that:

    P(W_a ∩ W_b | W_b) = |W_a ∩ W_b| / |W_b|    (1)

Now, if these two crawling processes are assumed to be independent, the left side of equation (1) may be reduced to P(W_a), that is, data coverage by crawler a. This may be easily obtained by the intersection size of the two crawling processes. However, an exact calculation of this quantity is only possible if we do not really know the documents crawled. Lawrence and Giles used a set of controlled data of 575 requests to provide page samples and count the number of times that the two crawlers retrieved the same pages. By taking the hypothesis that the result P(W_a) is correct, we may estimate the size of the Web as |W_a| / P(W_a). This approach has shown that the Web contained at least 320 million pages in 1997 and that only 60% was covered by the six major search engines of that time. It is also interesting to note that a single search engine would have covered only 1/3 of the Web. As this approach is based on observation, it may reflect a visible Web estimation, excluding for instance pages behind forms, databases etc. More recent experiments assert that the Web contains several billion pages.
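As a quick illustration of the overlap-analysis estimate above (with our own toy numbers, not figures from Lawrence and Giles), the following Python snippet applies equation (1) under the independence assumption and then the |W_a| / P(W_a) estimator to two hypothetical crawl samples.

def estimate_web_size(crawl_a, crawl_b):
    """Estimate total Web size from two independent crawls (overlap analysis).

    crawl_a, crawl_b: sets of URLs retrieved by crawlers a and b.
    Under the independence assumption, P(W_a) ~= |W_a ∩ W_b| / |W_b|,
    and the Web size is estimated as |W_a| / P(W_a).
    """
    overlap = crawl_a & crawl_b
    if not overlap:
        raise ValueError("no overlap: estimate undefined")
    p_a = len(overlap) / len(crawl_b)   # equation (1) with independence
    return len(crawl_a) / p_a           # |W_a| / P(W_a)

# Toy example: crawler a fetched 60,000 pages, crawler b 40,000,
# and 8,000 pages appear in both samples.
a = {f"http://example.org/a/{i}" for i in range(60_000)}
b = {f"http://example.org/a/{i}" for i in range(52_000, 92_000)}
print(round(estimate_web_size(a, b)))   # P(W_a) = 0.2, estimate = 300,000 pages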
2.2.1 Selective Crawling
As demonstrated above, a single crawler cannot archive the whole Web. The fact is that the time required to carry out the complete crawling process is very long, and impossible given the technology currently available. Furthermore, crawling and indexing very large amounts of data implies great problems of scalability, and consequently entails not inconsiderable costs of hardware and maintenance. For maximum optimization, a crawling system should be able to recognize relevant sites and pages, and restrict itself to downloading within a limited time.

A document or Web page's relevancy may be officially recognized in various ways. The idea of selective crawling may be introduced intuitively by associating each URL u with a score calculation function s_θ^(ξ)(u) respecting relevancy criterion ξ and parameters θ. In the most basic case, we may assume a Boolean relevancy function, i.e. s(u) = 1 if the document designated by u is relevant and s(u) = 0 if not. More generally, we may think of s(d) as a function with real values, such as a conditional probability that a document belongs to a certain category according to its content. In all cases, we should point out that the score calculation function depends only on the URL and ξ, and not on the time or state of the crawler.

A general approach for the construction of a selective crawler consists of changing the URL insertion and extraction policy in the queue Q of the crawler. Let us assume that the URLs are sorted in the order corresponding to the value retrieved by s(u). In this case, we obtain the best-first strategy (see [19]), which consists of downloading URLs with the best scores first. If s(u) provides a good relevancy model, we may hope that the search process will be guided towards the best areas of the Web.

Various studies have been carried out in this direction: for example, limiting the search depth in a site by specifying that pages are no longer relevant after a certain depth. This amounts to the following equation:

    s_θ^(depth)(u) = 1, if |root(u) → u| < δ; 0, else    (2)

where root(u) is the root of the site containing u. The interest of this approach lies in the fact that maximizing the search breadth may make it easier for the end-user to retrieve the information. Nevertheless, pages that are too deep may be accessed by the user, even if the robot fails to take them into account.

A second possibility is the estimation of a page's popularity. One method of calculating a document's relevancy would relate to the number of backlinks:

    s_θ^(backlinks)(u) = 1, if indegree(u) > τ; 0, else    (3)

where τ is a threshold.

It is clear that s_θ^(backlinks)(u) may only be calculated if we have a complete site graph (site already downloaded beforehand). In practice, we may take an approximate value and update it incrementally during the crawling process. A derivative of this technique is used in Google's famous PageRank calculation.
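The priority-queue view of selective crawling described above can be sketched as follows. This is our own illustrative Python code, not the paper's implementation: the frontier is a max-priority queue ordered by a pluggable score function s(u), shown here with hypothetical depth-based and indegree-based scorers in the spirit of equations (2) and (3).

import heapq
import itertools
from urllib.parse import urlparse

def depth_score(url, delta=4):
    """Equation (2) style score: 1 if the URL sits fewer than delta levels deep."""
    depth = len([seg for seg in urlparse(url).path.split("/") if seg])
    return 1.0 if depth < delta else 0.0

def backlink_score(url, indegree, tau=10):
    """Equation (3) style score: 1 if the (approximate) indegree exceeds tau."""
    return 1.0 if indegree.get(url, 0) > tau else 0.0

class BestFirstFrontier:
    """URL queue Q where extraction follows decreasing score s(u) (best-first)."""

    def __init__(self, score):
        self._score = score
        self._heap = []                      # entries are (-score, tie-breaker, url)
        self._counter = itertools.count()    # stable ordering for equal scores
        self._enqueued = set()

    def push(self, url):
        if url in self._enqueued:
            return
        self._enqueued.add(url)
        heapq.heappush(self._heap, (-self._score(url), next(self._counter), url))

    def pop(self):
        """Return the URL with the currently best score."""
        return heapq.heappop(self._heap)[2]

# Usage: seed the frontier and always expand the most promising URL first.
frontier = BestFirstFrontier(depth_score)
for seed in ["http://example.org/", "http://example.org/a/b/c/d/e.html"]:
    frontier.push(seed)
print(frontier.pop())   # the shallow root URL comes out first

Because the score depends only on the URL (and ξ), as the text notes, the frontier never needs to rescore queued entries; swapping depth_score for an incrementally updated backlink_score changes the policy without touching the queue logic.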
3. OUR APPROACH: THE DOMINOS SYSTEM
As mentioned above, we have divided the system into two parts: workers and supervisors. All of these processes may be run on various operating systems (Windows, MacOS X, Linux, FreeBSD) and may be replicated if need be. The workers are responsible for processing the URL flow coming from their supervisors and for executing crawling process tasks in the strict sense. They also handle the resolution of domain names by means of their integrated DNS resolver, and adjust download speed in accordance with node policy. A worker is a light process in the Erlang sense, acting as a fault tolerant and highly available HTTP client. The process-handling mode in Erlang makes it possible to create several thousands of workers in parallel. In our system, communication takes place mainly by sending asynchronous messages as described in the specifications
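As a rough analogue of the supervisor/worker split described above: the actual Dominos system is written in Erlang and exchanges asynchronous messages between lightweight processes, whereas the Python asyncio version below, including the supervisor and worker names, is only our own illustrative sketch. A supervisor task feeds URLs to a pool of lightweight worker tasks through a queue and collects their results asynchronously, so a failing fetch never brings down the pool.

import asyncio

async def worker(name, jobs, results):
    """Lightweight worker: consumes URLs, 'fetches' them, reports back."""
    while True:
        url = await jobs.get()
        try:
            # Placeholder for a real HTTP fetch; a real worker would also do
            # DNS resolution and per-node speed regulation here.
            await asyncio.sleep(0.01)
            await results.put((name, url, "ok"))
        except Exception as exc:
            # A failed fetch is reported instead of crashing the crawl.
            await results.put((name, url, f"error: {exc}"))
        finally:
            jobs.task_done()

async def supervisor(urls, n_workers=100):
    """Supervisor: dispatches the URL flow to workers and gathers results."""
    jobs, results = asyncio.Queue(), asyncio.Queue()
    tasks = [asyncio.create_task(worker(f"w{i}", jobs, results))
             for i in range(n_workers)]
    for url in urls:
        await jobs.put(url)
    await jobs.join()                 # wait until every URL has been processed
    for t in tasks:
        t.cancel()                    # shut the worker pool down
    return [results.get_nowait() for _ in range(results.qsize())]

if __name__ == "__main__":
    seed = [f"http://example.org/page{i}" for i in range(1000)]
    done = asyncio.run(supervisor(seed))
    print(len(done), "pages processed")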
