
Intelligent Crawling

Junghoo Cho
Hector Garcia-Molina
Stanford InfoLab

1
What is a crawler?

 A program that automatically retrieves pages from the Web.

 Widely used by search engines.

2
Challenges
 There are many pages on the Web.
(Major search engines have indexed more than 100M pages.)
 The size of the Web is growing enormously.
 Most pages are not very interesting.

 In most cases, it is too costly or not worthwhile to visit the entire Web space.

3
Good crawling strategy
 Make the crawler visit “important pages” first.

 Save network bandwidth
 Save storage space and management cost
 Serve quality pages to the client application

4
Outline
 Importance metrics: What are important pages?
 Crawling models: How is a crawler evaluated?
 Experiments
 Conclusion & Future work

5
Importance metric
The metric for determining whether a page is HOT

 Similarity to a driving query
 Location metric
 Backlink count
 PageRank

6
Similarity to a driving query
Example: “Sports”, “Bill Clinton”
(the pages related to a specific topic)

 Importance is measured by the closeness of the page to the topic, e.g. the number of occurrences of the topic word in the page (see the sketch after this slide)

 Personalized crawler

7
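A minimal sketch (not from the slides) of how such a similarity score could be computed, assuming the simple “count the topic word” measure mentioned above; the tokenization and the hotness threshold are arbitrary assumptions:

```python
import re

def similarity(page_text: str, query: str) -> int:
    """Crude closeness measure: count occurrences of the query words in the page."""
    words = re.findall(r"\w+", page_text.lower())
    query_words = set(query.lower().split())
    return sum(1 for w in words if w in query_words)

def is_hot(page_text: str, query: str, threshold: int = 3) -> bool:
    # Threshold chosen arbitrarily for illustration.
    return similarity(page_text, query) >= threshold
```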
Importance metric
The metric for determining whether a page is HOT

 Similarity to a driving query
 Location metric
 Backlink count
 PageRank

8
Backlink-based metric
 Backlink count
 number of pages pointing to the page
 Citation metric

 PageRank
 weighted backlink count
 weight is iteratively defined

9
[Diagram: example link graph over pages A–F; F is pointed to by E and C, where E has two out-links and C has one]

BackLinkCount(F) = 2

PageRank(F) = PageRank(E)/2 + PageRank(C)

10
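For reference (not spelled out on the slide), the example generalizes to the usual recursive definition, written here in the deck's own notation: a page's PageRank is the sum, over the pages linking to it, of their PageRank divided by their number of out-links, computed iteratively until the values converge. The original PageRank formulation also adds a damping factor, which the example above omits.

PageRank(P) = PageRank(Q1)/c1 + PageRank(Q2)/c2 + ... + PageRank(Qn)/cn

where Q1, ..., Qn are the pages linking to P and ci is the number of out-links of Qi.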
Ordering metric

 The metric a crawler uses to “estimate” the importance of a page it has not yet visited

 The ordering metric can be different from the importance metric

11
Crawling models

 Crawl and Stop
 Keep crawling until the local disk space is full.

 Limited buffer crawl
 Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

12
Crawl and stop model
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time), with curves for a perfect, a good, a random, and a poor crawler]
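To read these plots (a summary, not text from the slides): at each point in the crawl, the x-axis is the fraction of pages visited so far and the y-axis is the fraction of all HOT pages found among them. For example, if 10% of all pages are HOT, a perfect crawler reaches 100% on the y-axis after crawling just the first 10% of pages, while a random crawler has found only about 10% of the HOT pages at that point and climbs roughly along the diagonal.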
Crawling models
 Crawl and Stop
 Keep crawling until the local disk space is full.

 Limited buffer crawl
 Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

14
Limited buffer model
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time), with curves for a perfect, a good, and a poor crawler]
Architecture
[Diagram: the WebBase Crawler fetches pages from the Stanford WWW; each crawled page is handed to the Virtual Crawler, whose HTML parser extracts URLs and page info; extracted URLs go into the URL pool and page info into the Repository; the URL selector picks the next URL from the pool and hands the selected URL back to the crawler]

16
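A minimal sketch (my own, not from the slides) of the crawl loop implied by this architecture: the URL pool is a priority queue, the URL selector always returns the URL with the best ordering-metric score, and newly extracted URLs are fed back into the pool. The ordering metric shown is the backlink count among pages seen so far, and fetch / extract_urls are placeholder functions supplied by the caller.

```python
import heapq
from collections import defaultdict

def crawl(seed_urls, fetch, extract_urls, max_pages):
    """Visit up to max_pages URLs, always choosing the URL with the
    highest ordering-metric score (here: backlink count seen so far)."""
    backlinks = defaultdict(int)              # ordering-metric scores
    visited, pages = set(), {}
    pool = [(0, url) for url in seed_urls]    # min-heap of (-score, url)
    heapq.heapify(pool)

    while pool and len(visited) < max_pages:
        _, url = heapq.heappop(pool)          # URL selector: best-scored URL
        if url in visited:
            continue                          # skip stale queue entries
        visited.add(url)
        page = fetch(url)                     # crawler fetches the page
        pages[url] = page
        for out_url in extract_urls(page):    # HTML parser extracts URLs
            backlinks[out_url] += 1
            if out_url not in visited:
                # push with the updated score; older entries become stale
                heapq.heappush(pool, (-backlinks[out_url], out_url))
    return pages
```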
Experiments

 Backlink-based importance metrics
 backlink count
 PageRank

 Similarity-based importance metric
 similarity to a query word

17
Ordering metrics in experiments

 Breadth-first order
 Backlink count
 PageRank

18
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time); importance metric: backlink count; ordering metrics compared: PageRank, backlink, breadth, random]
Similarity-based crawling

 The content of a page is not available before it is visited
 Essentially, the crawler has to “guess” the content of the page
 More difficult than backlink-based crawling

20
Promising page
[Diagram: three clues that an unvisited page (shown as “?”) is promising for the query “Sports”: the anchor text of a link to it reads “Sports!!”, its parent page is HOT, or its URL contains the query word, e.g. …/sports.html]

21
Virtual crawler for similarity-based crawling
Promising page:
 Query word appears in its anchor text
 Query word appears in its URL
 The page pointing to it is an “important” page

 Visit “promising pages” first
 Visit “non-promising pages” in ordering-metric order (see the sketch after this slide)

22
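A rough sketch (my own illustration, not from the slides) of these rules; the helper names, the data structures, and the fallback ordering metric (backlink count seen so far) are assumptions:

```python
def is_promising(url: str, anchor_text: str, parent_is_hot: bool, query: str) -> bool:
    """A URL is promising if the query word appears in its anchor text or URL,
    or if the page pointing to it is itself an important (HOT) page."""
    q = query.lower()
    return q in anchor_text.lower() or q in url.lower() or parent_is_hot

def next_url(promising: list, others: dict):
    """Visit promising pages first; otherwise fall back to the URL with the
    highest ordering-metric score (others maps url -> score)."""
    if promising:
        return promising.pop()
    if others:
        url = max(others, key=others.get)
        del others[url]
        return url
    return None
```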
[Plot: % crawled HOT pages vs. % crawled pages (e.g. time); topic word: "admission"; modified ordering metrics compared: PageRank, backlink, breadth, random]
Conclusion

 PageRank is generally good as an ordering metric.

 By applying a good ordering metric, it is possible to gather important pages quickly.

24
