CHAPTER-1 WEB CRAWLER
1.1 Introduction
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, Web robots, and Web scutters. The process itself is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code, and to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A Web crawler is one type of software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
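To make the seed/frontier loop concrete, the following is a minimal sketch in Python using only the standard library. The crawl function, the LinkExtractor helper, and the max_pages cap are illustrative choices for this sketch, not part of any particular crawler described above; a real crawler would also honour robots.txt and rate-limit its requests.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Hypothetical helper: collects the href targets of <a> tags.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        # Seed URLs go into the frontier; pages are visited in FIFO
        # order, and every hyperlink found is appended to the frontier.
        # max_pages is an illustrative cap, not taken from the text.
        frontier = deque(seeds)
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue  # unreachable or malformed URL: skip it
            visited.add(url)
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                frontier.append(urljoin(url, link))  # resolve relative links
        return visited

Because the frontier is a FIFO queue, this sketch visits pages breadth-first, which connects directly to the strategies discussed in Section 1.3.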
1.2 Prerequisites of a Crawling System
In order to set our work in this field in context, listed below are the services that should be considered minimum requirements for any large-scale crawling system.

1. Flexibility: as mentioned above, with some minor adjustments the system should be suitable for various scenarios. However, it is important to remember that crawling is established within a special framework: namely, Web legal deposit.

2. High performance: the system needs to be scalable, from a minimum of one thousand pages per second up to millions of pages for each run, on low-cost hardware. Note that the quality and efficiency of disk access are crucial to maintaining high performance.

3. Fault tolerance: this covers several aspects. As the system interacts with many servers at once, specific problems emerge. First, it should at least be able to process invalid HTML code, deal with unexpected Web server behaviour, and select good communication protocols; the goal is to tolerate such problems and, where possible, ignore them completely. Second, crawling processes may take days or weeks, and it is imperative that the system can handle failures, stopped processes, or interruptions in network services while keeping data loss to a minimum. Finally, the system should be persistent, periodically switching large data structures from memory to disk so that a crawl can restart after a failure (a minimal checkpointing sketch follows this list).

4. Maintainability and configurability: an appropriate interface is necessary for monitoring the crawling process, including download speed, statistics on the pages, and the amounts of data stored. In online mode, the administrator should be able to adjust the speed of a given crawler, add or delete processes, stop the system, add or delete system nodes, and supply a blacklist of domains not to be visited.
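The persistence requirement in item 3 can be illustrated with a simple checkpoint routine that writes the frontier and the visited set to disk and reloads them on restart. This is a minimal sketch: the crawl_state.json file name and the save_state/load_state helpers are hypothetical, and a large-scale system would checkpoint incrementally rather than rewriting the whole state each time.

    import json
    import os

    CHECKPOINT = "crawl_state.json"  # hypothetical checkpoint file name

    def save_state(frontier, visited):
        # Write to a temporary file first, then rename: a crash in the
        # middle of the write leaves the previous checkpoint intact.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"frontier": list(frontier),
                       "visited": list(visited)}, f)
        os.replace(tmp, CHECKPOINT)  # atomic rename

    def load_state(seeds):
        # Resume from the last checkpoint if one exists, otherwise
        # start a fresh crawl from the seed URLs.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                state = json.load(f)
            return state["frontier"], set(state["visited"])
        return list(seeds), set()

Calling save_state periodically from the crawl loop bounds the amount of work lost to a failure to one checkpoint interval.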
1.3 General Crawling Strategies
There are many highly accomplished techniques of Web crawling strategy. The most relevant are described here:

1. Breadth-first crawling: in order to build a wide Web archive like that of the Internet Archive, a crawl is carried out from a set of Web pages (initial URLs, or seeds). A breadth-first exploration is launched by following the hypertext links leading to pages directly connected with this initial set. In practice, Web sites are not browsed strictly breadth-first, and various restrictions may apply, e.g. limiting the crawl to within a site, or downloading the pages deemed most interesting first (a sketch of a restricted breadth-first traversal follows this list).

2. Repetitive crawling: once pages have been crawled, some systems require the process to be repeated periodically so that indexes are kept up to date. In the most basic case, this may be achieved by launching a second crawl in parallel. A variety of heuristics exist to improve on this: for example, frequently re-launching the crawl of pages, sites, or domains considered important, to the detriment of others. A good crawling strategy is crucial for maintaining a constantly updated index.

3. Targeted crawling: more specialized search engines use crawling heuristics in order to target a certain type of page, e.g. pages on a specific topic or in a particular language, images, MP3 files, or scientific papers. In addition to these heuristics, more generic approaches have been suggested, based on the analysis of hypertext link structures and on learning techniques: the objective is to retrieve the greatest number of pages relating to a particular subject while using minimum bandwidth. Most of the studies in this category do not use high-performance crawlers, yet still produce acceptable results.

4. Random walks and sampling: some studies have examined the effect of random walks on Web graphs, or on modified versions of these graphs, using sampling to estimate the size of documents online.

5. Deep Web crawling: a great deal of data accessible via the Web is currently held in databases and may only be downloaded through appropriate requests or forms. The Deep Web is the name given to the part of the Web containing this category of data. Recently, this often neglected but fascinating problem has been the focus of new interest.
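As a sketch of the restricted breadth-first traversal mentioned in strategy 1, the following Python fragment keeps the crawl within the seed's site and bounds its depth. The links_of callable and the max_depth parameter are assumptions for illustration: links_of stands for any fetch-and-parse step, such as the one sketched in Section 1.1.

    from collections import deque
    from urllib.parse import urlparse

    def bfs_within_site(seed, links_of, max_depth=2):
        # links_of(url) is assumed to return the absolute URLs found on
        # a page (a fetch-and-parse step like the Section 1.1 sketch).
        site = urlparse(seed).netloc
        frontier = deque([(seed, 0)])
        seen = {seed}
        order = []  # pages in breadth-first visiting order
        while frontier:
            url, depth = frontier.popleft()
            order.append(url)
            if depth == max_depth:
                continue  # restriction: do not expand deeper levels
            for link in links_of(url):
                # restriction: stay within the initial Web site
                if urlparse(link).netloc == site and link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
        return order

Because each page is expanded only after every page at the previous depth, the traversal covers the site level by level, which is the property a wide archive crawl relies on.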
1.4 Crawling Policies