Junghoo Cho
Hector Garcia-Molina
Stanford InfoLab
What is a crawler?
A program that automatically downloads Web pages by following links, building a local page collection (e.g., for a search engine).
Challenges
There are many pages on the Web
(major search engines have indexed more than 100M pages).
The size of the Web is growing enormously.
Most pages are not very interesting.
Good crawling strategy
Make the crawler visit “important pages” first.
Outline
Importance metrics: what are important pages?
Crawling models: how is a crawler evaluated?
Experiments
Conclusion & Future work
Importance metric
The metric for determining whether a page is HOT.
Similarity to a driving query
Examples: “Sports”, “Bill Clinton”
Targets pages related to a specific topic
Enables a personalized crawler
Backlink-based metric
Backlink count: the number of pages pointing to the page (a citation metric)
PageRank: a weighted backlink count, where the weights are defined iteratively (sketched after the figure below)
[Figure: example link graph over pages A–F; BackLinkCount(F) = 2]
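To make these metrics concrete, here is a minimal Python sketch (not the authors' code). The link graph, damping factor, and iteration count are illustrative assumptions; the graph is merely chosen so that F has two in-links, as in the figure.

# Backlink count and iterative PageRank for a small,
# hypothetical link graph.
graph = {  # page -> pages it links to
    "A": ["B", "F"],
    "B": ["C"],
    "C": ["F"],
    "D": ["E"],
    "E": ["A"],
    "F": [],
}

# Backlink count: how many pages point to each page.
backlinks = {page: 0 for page in graph}
for page, outlinks in graph.items():
    for target in outlinks:
        backlinks[target] += 1

# PageRank: a weighted backlink count whose weights are defined
# iteratively (plain power iteration, damping factor d = 0.85;
# dangling pages such as F are left unhandled in this sketch).
d = 0.85
n = len(graph)
rank = {page: 1.0 / n for page in graph}
for _ in range(50):
    rank = {
        page: (1 - d) / n
        + d * sum(rank[q] / len(graph[q]) for q in graph if page in graph[q])
        for page in graph
    }

print(backlinks["F"])  # 2, matching the figure
print(rank)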
Ordering metric
The metric the crawler uses to pick the next URL to visit, estimated from the pages crawled so far.
Crawling models
Crawl and stop model
[Figure: % crawled HOT pages vs. % crawled pages (e.g., time), with curves for perfect, good, random, and poor crawlers]
Crawl and Stop: keep crawling until the local disk space is full.
Limited buffer model
The crawler can keep only a limited number of pages in its buffer, so it must retain the most important ones.
[Figure: % crawled HOT pages vs. % crawled pages (e.g., time) under the limited buffer model, with curves for perfect, good, and poor crawlers]
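Both plots can be reproduced by replaying a crawl order against a known set of HOT pages. Here is a minimal sketch of that evaluation; hot_coverage, the crawl order, and the HOT set are hypothetical names and data for illustration.

def hot_coverage(crawl_order, hot_pages):
    # Fraction of HOT pages found after each crawled page:
    # one (% crawled pages, % crawled HOT pages) point per step,
    # i.e. one curve in the plots above.
    found = 0
    points = []
    for i, page in enumerate(crawl_order, start=1):
        if page in hot_pages:
            found += 1
        points.append((i / len(crawl_order), found / len(hot_pages)))
    return points

# A perfect crawler visits all HOT pages first; a random order
# finds them roughly in proportion to how much it has crawled.
curve = hot_coverage(["h1", "h2", "h3", "p1", "p2", "p3"],
                     hot_pages={"h1", "h2", "h3"})
print(curve)  # reaches 100% of HOT pages after 50% of the crawl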
Architecture
[Figure: architecture. The WebBase Crawler fetches pages from the Stanford WWW into a repository; the Virtual Crawler runs each crawled page through an HTML parser, adds the extracted URLs and page info to a URL pool, and a URL selector hands the selected URL back to the crawler]
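The URL selector can be sketched as a best-first loop over a priority queue keyed by the ordering metric. This is an illustrative sketch, not the WebBase implementation; fetch_page, extract_urls, and ordering_metric are hypothetical stand-ins for the components in the figure.

import heapq

def crawl(seed_urls, fetch_page, extract_urls, ordering_metric, budget):
    # Best-first crawl: always visit the pooled URL with the
    # highest ordering-metric estimate.
    pool = [(-ordering_metric(url), url) for url in seed_urls]
    heapq.heapify(pool)
    seen = set(seed_urls)
    crawled = []
    while pool and len(crawled) < budget:
        _, url = heapq.heappop(pool)        # URL selector
        page = fetch_page(url)              # crawler + repository
        crawled.append(url)
        for link in extract_urls(page):     # HTML parser
            if link not in seen:            # grow the URL pool
                seen.add(link)
                heapq.heappush(pool, (-ordering_metric(link), link))
    return crawled

Estimates such as PageRank change as more pages are seen, so a real crawler would periodically recompute the metric and re-rank the pool; that refinement is omitted here.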
Experiments
Ordering metrics in experiments
Four ordering metrics were compared: PageRank, backlink count, breadth-first, and random.
[Figure: % crawled HOT pages vs. % crawled pages (e.g., time), with backlink count as the importance metric and curves for the PageRank, backlink, breadth, and random ordering metrics]
Similarity-based crawling
Promising page
[Figure: clues that an uncrawled page is promising: the query word (e.g., “Sports”) in the anchor text, a HOT parent page, or the query word in the URL (e.g., …/sports.html)]
Virtual crawler for
similarity-based crawling
A page is promising if the query word appears in its anchor text (see the sketch below).
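A minimal sketch of that rule, assuming only the anchor text of links pointing to a page is examined (is_promising is an illustrative helper, not the authors' code):

def is_promising(anchor_text, query_word):
    # Similarity-based clue: the driving query word appears in
    # the anchor text of a link to the page.
    return query_word.lower() in anchor_text.lower()

print(is_promising("Stanford Sports teams", "sports"))  # True
print(is_promising("Campus directory", "sports"))       # False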
[Figure: % crawled HOT pages for similarity-based crawling with topic word "admission"]