
Intelligent Crawling

Junghoo Cho
Hector Garcia-Molina
Stanford InfoLab

1
What is a crawler?

 A program that automatically retrieves pages from the Web.

 Widely used by search engines.

2
Challenges
 There are many pages on the Web.
(Major search engines have indexed more than 100M pages.)
 The size of the Web is growing enormously.
 Most pages are not very interesting.

 In most cases, it is too costly or not worthwhile to visit the entire Web space.

3
Good crawling strategy
 Make the crawler visit “important pages” first.

 Save network bandwidth
 Save storage space and management cost
 Serve quality pages to the client application

4
Outline
 Importance metrics: What are important pages?
 Crawling models: How is a crawler evaluated?
 Experiments
 Conclusion & Future work

5
Importance metric
The metric for determining whether a page is HOT

 Similarity to a driving query
 Location metric
 Backlink count
 PageRank

6
Similarity to a driving query
Example: “Sports”, “Bill Clinton”
(the pages related to a specific topic)

 Importance is measured by the closeness of the page to the topic, e.g. the number of occurrences of the topic word in the page (see the sketch after this slide)

 Personalized crawler

7
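A minimal sketch (not from the slides) of how such a similarity score could be computed, assuming the simple “count the topic word” measure mentioned above; the tokenization and the hotness threshold are arbitrary assumptions:

```python
import re

def similarity(page_text: str, query: str) -> int:
    """Crude closeness measure: count occurrences of the query words in the page."""
    words = re.findall(r"\w+", page_text.lower())
    query_words = set(query.lower().split())
    return sum(1 for w in words if w in query_words)

def is_hot(page_text: str, query: str, threshold: int = 3) -> bool:
    # Threshold chosen arbitrarily for illustration.
    return similarity(page_text, query) >= threshold
```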
Importance metric
The metric for determining whether a page is HOT

 Similarity to a driving query
 Location metric
 Backlink count
 PageRank

8
Backlink-based metric
 Backlink count
 number of pages pointing to the page
 Citation metric

 PageRank
 weighted backlink count
 weight is iteratively defined

9
[Diagram: example link graph over pages A–F; F is pointed to by E and C, where E has two out-links and C has one]

BackLinkCount(F) = 2

PageRank(F) = PageRank(E)/2 + PageRank(C)

10
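For reference (not spelled out on the slide), the example generalizes to the usual recursive definition, written here in the deck's own notation: a page's PageRank is the sum, over the pages linking to it, of their PageRank divided by their number of out-links, computed iteratively until the values converge. The original PageRank formulation also adds a damping factor, which the example above omits.

PageRank(P) = PageRank(Q1)/c1 + PageRank(Q2)/c2 + ... + PageRank(Qn)/cn

where Q1, ..., Qn are the pages linking to P and ci is the number of out-links of Qi.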
Ordering metric

 The metric a crawler uses to “estimate” the importance of a page it has not yet visited

 The ordering metric can be different from the importance metric

11
Crawling models

 Crawl and Stop
 Keep crawling until the local disk space is full.

 Limited buffer crawl
 Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

12
Crawl and stop model
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time), with curves for a perfect, a good, a random, and a poor crawler]
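To read these plots (a summary, not text from the slides): at each point in the crawl, the x-axis is the fraction of pages visited so far and the y-axis is the fraction of all HOT pages found among them. For example, if 10% of all pages are HOT, a perfect crawler reaches 100% on the y-axis after crawling just the first 10% of pages, while a random crawler has found only about 10% of the HOT pages at that point and climbs roughly along the diagonal.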
Crawling models
 Crawl and Stop
 Keep crawling until the local disk space is full.

 Limited buffer crawl
 Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

14
Limited buffer model
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time), with curves for a perfect, a good, and a poor crawler]
Architecture
[Diagram: the WebBase Crawler fetches pages from the Stanford WWW; each crawled page is handed to the Virtual Crawler, whose HTML parser extracts URLs and page info; extracted URLs go into the URL pool and page info into the Repository; the URL selector picks the next URL from the pool and hands the selected URL back to the crawler]

16
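A minimal sketch (my own, not from the slides) of the crawl loop implied by this architecture: the URL pool is a priority queue, the URL selector always returns the URL with the best ordering-metric score, and newly extracted URLs are fed back into the pool. The ordering metric shown is the backlink count among pages seen so far, and fetch / extract_urls are placeholder functions supplied by the caller.

```python
import heapq
from collections import defaultdict

def crawl(seed_urls, fetch, extract_urls, max_pages):
    """Visit up to max_pages URLs, always choosing the URL with the
    highest ordering-metric score (here: backlink count seen so far)."""
    backlinks = defaultdict(int)              # ordering-metric scores
    visited, pages = set(), {}
    pool = [(0, url) for url in seed_urls]    # min-heap of (-score, url)
    heapq.heapify(pool)

    while pool and len(visited) < max_pages:
        _, url = heapq.heappop(pool)          # URL selector: best-scored URL
        if url in visited:
            continue                          # skip stale queue entries
        visited.add(url)
        page = fetch(url)                     # crawler fetches the page
        pages[url] = page
        for out_url in extract_urls(page):    # HTML parser extracts URLs
            backlinks[out_url] += 1
            if out_url not in visited:
                # push with the updated score; older entries become stale
                heapq.heappush(pool, (-backlinks[out_url], out_url))
    return pages
```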
Experiments

 Backlink-based importance metrics
 backlink count
 PageRank

 Similarity-based importance metric
 similarity to a query word

17
Ordering metrics in experiments

 Breadth-first order
 Backlink count
 PageRank

18
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time); importance metric: backlink count; ordering metrics compared: PageRank, backlink, breadth, random]
Similarity-based crawling

 The content of a page is not available before it is visited
 Essentially, the crawler has to “guess” the content of the page
 More difficult than backlink-based crawling

20
Promising page
[Diagram: three clues that an unvisited page (shown as “?”) is promising for the query “Sports”: the anchor text of a link to it reads “Sports!!”, its parent page is HOT, or its URL contains the query word, e.g. …/sports.html]

21
Virtual crawler for similarity-based crawling
Promising page:
 Query word appears in its anchor text
 Query word appears in its URL
 The page pointing to it is an “important” page

 Visit “promising pages” first
 Visit “non-promising pages” in ordering-metric order (see the sketch after this slide)

22
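A rough sketch (my own illustration, not from the slides) of these rules; the helper names, the data structures, and the fallback ordering metric (backlink count seen so far) are assumptions:

```python
def is_promising(url: str, anchor_text: str, parent_is_hot: bool, query: str) -> bool:
    """A URL is promising if the query word appears in its anchor text or URL,
    or if the page pointing to it is itself an important (HOT) page."""
    q = query.lower()
    return q in anchor_text.lower() or q in url.lower() or parent_is_hot

def next_url(promising: list, others: dict):
    """Visit promising pages first; otherwise fall back to the URL with the
    highest ordering-metric score (others maps url -> score)."""
    if promising:
        return promising.pop()
    if others:
        url = max(others, key=others.get)
        del others[url]
        return url
    return None
```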
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time); topic word: "admission"; modified ordering metrics compared: PageRank, backlink, breadth, random]
Conclusion

 PageRank is generally good as an ordering metric.

 By applying a good ordering metric, it is possible to gather important pages quickly.

24
