L T P J C
3 0 2 0 4
Dr. S M SATAPATHY
Associate Professor,
School of Computer Science and Engineering,
VIT Vellore, TN, India – 632 014.
Module – 2
WEB CRAWLING
Universal Crawler
Focused Crawler
Topical Crawler
2
APPLICATIONS
3
Applications of Web Crawlers
Business Intelligence
Collect information about their competitors and potential
collaborators
Monitor Web site and Pages of Interest
A user or community can be notified when new information appears
in certain places
Support of Search Engine
Crawlers are the main consumers of Internet bandwidth.
They collect pages for search engines to build their indexes.
4
Applications of Web Crawlers
Well-known search engines such as Google, Yahoo! and MSN run very
efficient universal crawlers designed to gather all pages irrespective of
their content.
Other crawlers, sometimes called preferential crawlers, are more
targeted.
They attempt to download only pages of certain types or topics.
5
Web Scraper vs. Web Crawler
A web crawler is a software program that visits websites and reads their
pages and other related information in order to build entries for a search
engine index.
The major search engines on the Web, such as Google, Yahoo! and Bing, all
have such a program, which is also known as a “web spider” or a “bot.”
Web scraping is the process of automatically requesting a web
document and collecting information from it.
To do web scraping, you have to do some degree of web crawling to
move around the websites.
Web Scraping is essentially targeted at specific websites for specific
data, e.g. for stock market data, business leads, supplier product
scraping.
6
Web Scraper vs. Web Crawler
A web scraper typically does things a well-behaved web crawler would not, e.g.:
Does not obey robots.txt
Submits forms with data
Executes JavaScript
Transforms the data into the required form and format
Saves the extracted data into a database
7
TYPES OF CRAWLER
8
Types of Crawlers
• Universal Crawler
• Focused Crawler
• Topical Crawler
9
Basic Crawler Algorithm
A crawler starts from a set of seed pages (URLs) and then uses the
links within them to fetch other pages.
The links in these pages are, in turn, extracted and the corresponding
pages are visited.
The crawler maintains a list of unvisited URLs called the frontier. The
list is initialized with seed URLs which may be provided by the user or
another program.
10
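A minimal sketch of this loop in Python; the use of requests and BeautifulSoup, and the page limit, are illustrative choices rather than anything prescribed by the slides:

```python
# Minimal sequential crawler: frontier seeded with URLs, pages fetched,
# links extracted and appended back to the frontier until a page limit is hit.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # list of unvisited URLs
    visited = set()                      # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                     # skip unreachable or slow pages
        visited.add(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in visited:
                frontier.append(link)    # unvisited links go to the tail
    return visited
```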
Basic Sequential Crawler
11
Breadth-First Crawlers
12
Breadth-First Crawlers
The URL to crawl next comes from the head of the queue and new
URLs are added to the tail of the queue.
Once the frontier reaches its maximum size, the breadth-first crawler
can add to the queue only one unvisited URL from each new page
crawled.
13
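One way the frontier-size rule above might be enforced; the cap of 10,000 URLs and the deque-based frontier are assumptions for the sketch:

```python
from collections import deque

MAX_FRONTIER = 10_000   # assumed cap on the frontier size
frontier = deque()      # FIFO queue: next URL from the head, new URLs to the tail

def add_links(new_urls, visited):
    """Append unvisited URLs to the tail; once the frontier is full,
    accept at most one unvisited URL from the current page."""
    budget = 1 if len(frontier) >= MAX_FRONTIER else MAX_FRONTIER - len(frontier)
    for url in new_urls:
        if budget == 0:
            break
        if url not in visited and url not in frontier:
            frontier.append(url)
            budget -= 1
```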
Breadth-First Crawlers
The breadth-first strategy does not imply that pages are visited in
“random” order: the crawl order is biased by the link structure, because
pages with many in-links tend to be discovered, and therefore visited, early.
14
Crawling Algorithm
17
Depth-First Crawlers
18
Depth-First Crawlers
19
Breadth-First vs. Depth-First Crawlers
depth-first goes off into one branch until it reaches a leaf node
not good if the goal node is on another branch
neither complete nor optimal
uses much less space than breadth-first
20
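In frontier terms, the two strategies differ only in which end of the queue the next URL is taken from; a tiny sketch (the seed URLs are illustrative):

```python
from collections import deque

frontier = deque(["http://example.com/a", "http://example.com/b"])  # illustrative seeds

next_bfs = frontier.popleft()  # breadth-first: FIFO queue, oldest discovered URL first
next_dfs = frontier.pop()      # depth-first: LIFO stack, newest discovered URL first
```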
Preferential Crawlers
21
Preferential Crawlers
If pages are visited in the order specified by the priority values in the
frontier, then we have a best-first crawler.
In this approach, a relevance score is computed for each link, and the
most relevant link, i.e. the one with the highest relevance value, is
fetched from the frontier.
Thus, at every step the best available link is opened and traversed.
22
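A sketch of a best-first frontier built on a priority queue; the heap-based structure and the toy relevance function are assumptions, standing in for whatever scoring the crawler actually uses:

```python
import heapq

def relevance(page_text, topic_keywords):
    """Toy score: fraction of topic keywords that occur in the page text."""
    text = page_text.lower()
    return sum(kw in text for kw in topic_keywords) / len(topic_keywords)

frontier = []                                  # heap of (-score, url): best score on top

def push(url, score):
    heapq.heappush(frontier, (-score, url))    # negate because heapq is a min-heap

def pop_best():
    neg_score, url = heapq.heappop(frontier)   # best-first: highest-priority link
    return url, -neg_score
```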
Preferential Crawlers
Once the frontier’s maximum size is reached, only the best URLs are
kept; the frontier must be pruned after each new set of links is added.
23
Preferential Crawlers
Fish Search
Fish Search is a dynamic heuristic search algorithm.
The key point of the Fish Search algorithm lies in how it maintains the
order of URLs: directions that keep yielding relevant pages are explored
further (the “fish” reproduce), while directions that yield irrelevant pages
are gradually abandoned (the “fish” die off).
24
Preferential Crawlers
A* Search
A* uses best-first search: each candidate node carries two values, the
cost already incurred to reach it and a heuristic estimate of the
remaining cost to the goal.
The sum of these two values serves as the measure for selecting the
best path.
25
Preferential Crawlers
Adaptive A* Search
Adaptive A* Search works on informed heuristics to focus its searches.
With each iteration, it updates the relevancy value of a page and
uses it for the next traversal.
The pages are updated for log(Graph Size) iterations; beyond that, the
overhead of updating outweighs the improvement that can be achieved in
getting more relevant pages.
26
IMPLEMENTATION ISSUES
27
Implementation Issues
Fetching
The client needs to time out connections to prevent spending
unnecessary time waiting for responses from slow servers or reading
huge pages.
Redirect loops are to be detected and broken by storing URLs from a
redirection chain in a hash table and halting if the same URL is
encountered twice.
One may also parse and store the last-modified header to determine
the age of the document.
Error-checking and exception handling are important during the page
fetching process, since the same code must deal with potentially millions
of remote servers.
28
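A sketch of a fetcher with a timeout and redirect-loop detection along the lines described above; the requests library and the specific limits are illustrative choices:

```python
import requests
from urllib.parse import urljoin

def fetch(url, max_redirects=10, timeout=10):
    """Fetch a page with a timeout; break redirect loops by remembering
    every URL already seen in the redirection chain."""
    seen = set()
    while url not in seen and len(seen) <= max_redirects:
        seen.add(url)
        try:
            resp = requests.get(url, timeout=timeout, allow_redirects=False)
        except requests.RequestException:
            return None                              # DNS failure, timeout, etc.
        if resp.is_redirect:
            url = urljoin(url, resp.headers.get("Location", ""))
            continue
        last_modified = resp.headers.get("Last-Modified")   # age of the document
        return resp.text, last_modified
    return None                                      # redirect loop or too many hops
```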
Implementation Issues
Parsing
HTML has the structure of a DOM (Document Object Model) tree
Unfortunately, actual HTML is often incorrect in a strict syntactic sense.
Crawlers, like browsers, must be robust/forgiving.
Many pages are published with missing required tags, tags improperly
nested, missing close tags, misspelled or missing attribute names and
values, missing quotes around attribute values, unescaped special
characters, and so on.
Fortunately, there are tools that can help (e.g. tidy.sourceforge.net)
Must pay attention to HTML entities and Unicode in text.
What to do with a growing number of other formats?
Flash, SVG, RSS, AJAX…
29
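A sketch of forgiving HTML parsing with BeautifulSoup (an illustrative choice of library, not one mandated by the slides); malformed markup is repaired rather than rejected, and links are extracted for the frontier:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

broken_html = "<html><body><p>Missing close tags<a href='/next'>next page"

# A forgiving parser repairs unclosed and improperly nested tags instead of failing.
soup = BeautifulSoup(broken_html, "html.parser")

base_url = "http://example.com/page"
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
text = soup.get_text(separator=" ", strip=True)   # HTML entities decoded to Unicode
print(links)   # ['http://example.com/next']
print(text)    # 'Missing close tags next page'
```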
Implementation Issues
30
Implementation Issues
Stopword Removal
Noise words that do not carry meaning should be eliminated (“stopped”)
before they are indexed
E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
Typically syntactic markers
Typically the most common terms
Typically kept in a negative dictionary
10–1,000 elements
E.g.
http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
Parser can detect these right away and disregard them
31
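A minimal stopword-removal sketch; the word list is a tiny illustrative subset of a real negative dictionary:

```python
# Stopword filtering with a small "negative dictionary".
STOPWORDS = {"and", "the", "a", "at", "or", "on", "for", "of", "to", "in"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the crawler fetched a page on the web".split()))
# ['crawler', 'fetched', 'page', 'web']
```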
Implementation Issues
32
Implementation Issues
Stemming
Morphological conflation based on rewrite rules
Language dependent!
Porter stemmer very popular for English
http://www.tartarus.org/~martin/PorterStemmer/
Context-sensitive grammar rules, e.g.:
“IES” except (“EIES” or “AIES”) --> “Y”
Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
Porter has also developed Snowball, a language to create stemming
algorithms in any language
http://snowball.tartarus.org/
Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
33
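A quick example of Porter stemming via NLTK, assuming the nltk package is available (any of the other listed implementations would do equally well):

```python
# Porter stemming via NLTK (assumes the nltk package is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "crawling"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, crawling -> crawl
```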
Implementation Issues
35
Implementation Issues
36
Implementation Issues
37
Implementation Issues
38
Implementation Issues
Spider Trap
Misleading sites: an indefinite number of pages dynamically generated by
CGI scripts.
These are Web sites where the URLs of dynamically created links are
modified based on the sequence of actions taken by the browsing user
(or crawler).
In practice, spider traps are harmful to the crawler, which wastes
bandwidth and disk space downloading and storing duplicate or useless
data; in effect, a trap acts like a denial-of-service (DoS) attack on the crawler.
39
Implementation Issues
Spider Trap - Solutions
Since the dummy URLs often grow larger and larger as the crawler
becomes entangled in a spider trap, one common heuristic approach to
tackle such traps is to limit URL size to some maximum number of
characters, say 256.
The code associated with the frontier can make sure that every
consecutive sequence of, say, 100 URLs fetched by the crawler
contains at most one URL from each fully qualified host name.
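A sketch of a frontier-side filter combining the two heuristics above; the constants and the sliding-window implementation are assumptions:

```python
from collections import deque
from urllib.parse import urlparse

MAX_URL_LEN = 256        # heuristic cap on URL length
WINDOW = 100             # look-back window of recently accepted URLs
recent_hosts = deque(maxlen=WINDOW)

def accept(url):
    """Frontier-side filter: reject overly long URLs, and accept at most one
    URL per fully qualified host name within each window of accepted URLs."""
    if len(url) > MAX_URL_LEN:
        return False
    host = urlparse(url).netloc
    if host in recent_hosts:
        return False
    recent_hosts.append(host)
    return True
```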
Page Repository
Naïve: store each page as a separate file
Can map URL to unique filename using a hashing function, e.g.
MD5
This generates a huge number of files, which is inefficient from the
storage perspective
Better: combine many pages into a single large file, using some XML
markup to separate and identify them
Must map URL to {filename, page_id}
Database options
Any RDBMS -- large overhead
Light-weight, embedded databases such as Berkeley DB
41
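A sketch of the URL-to-filename mapping via MD5 mentioned above:

```python
import hashlib

def page_filename(url):
    """Map a URL to a fixed-length, filesystem-safe file name via an MD5 digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"

print(page_filename("http://example.com/index.html"))   # 32 hex characters + ".html"
```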
Implementation Issues
Concurrency
A crawler incurs several delays:
Resolving the host name in the URL to an IP address using DNS
Connecting a socket to the server and sending the request
Receiving the requested page in response
42
Implementation Issues
43
Implementation Issues
Concurrency
Can use multi-processing or multi-threading
Each process or thread works like a sequential crawler, except they
share data structures: frontier and repository
Shared data structures must be synchronized (locked for concurrent
writes)
Speedups by a factor of 5-10 are easy to obtain this way
44
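A sketch of a multi-threaded crawler with a lock-protected frontier, as described above; the number of threads and the placeholder fetch function are assumptions:

```python
import threading
from collections import deque

frontier = deque(["http://example.com"])   # shared frontier (seed is illustrative)
visited = set()
lock = threading.Lock()                    # protects the shared data structures

def fetch_and_extract(url):
    """Placeholder for the fetch/parse step of a sequential crawler."""
    return []

def worker():
    while True:
        with lock:                         # synchronized access to the frontier
            if not frontier:               # (a real crawler would wait, not exit,
                return                     #  when the frontier is momentarily empty)
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
        links = fetch_and_extract(url)     # network I/O happens outside the lock
        with lock:
            frontier.extend(l for l in links if l not in visited)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```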
UNIVERSAL CRAWLER
45
Universal Crawler
46
Large Scale Universal Crawler vs. Concurrent Breadth-first Crawler
47
Large Scale Universal Crawler: Scalability
48
Large Scale Universal Crawler: Scalability
49
Coverage vs. Freshness vs. Importance
Coverage
New pages get added all the time
Can the crawler find every page?
Freshness
Pages change over time, get removed, etc.
How frequently can a crawler revisit?
Trade-off!
Focus on most “important” pages (crawler bias)?
“Importance” is subjective
50
Coverage vs. Freshness vs. Importance
51
Estimating Page Change Rate
The most recent and exhaustive study reports that while new pages are
created at a rate of about 8% per week, only about 62% of the content
of these pages is really new because pages are often copied from
existing ones.
The link structure of the Web is more dynamic, with about 25% new
links created per week.
52
Do we need to Crawl the entire web?
53
PREFERENTIAL CRAWLER
54
Preferential Crawler
55
Preferential Crawling Algorithms
Breadth-First
Exhaustively visit all links in order encountered
Best-N-First
Priority queue sorted by similarity, explore top N at a time
Variants: DOM context, hub scores
PageRank
Priority queue sorted by keywords, PageRank
SharkSearch
Priority queue sorted by combination of similarity, anchor text, similarity
of parent, etc. (powerful cousin of FishSearch)
InfoSpiders
Adaptive distributed algorithm using an evolving population of learning
agents. (http://carl.cs.indiana.edu/fil/IS/slides.html)
56
Focused Crawler
57
Focused Crawler :: Classifier
58
Focused Crawler :: Classifier
59
Focused Crawler :: Classifier
Calculating precision and recall is actually quite easy. Imagine there are
100 positive cases among 10,000 cases. You want to predict which
ones are positive, and you pick 200 to have a better chance of catching
many of the 100 positive cases.
60
Focused Crawler :: Classifier
• What percent of positive predictions were correct?
61
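For illustration, suppose 90 of the 100 true positives fall among the 200 predicted positives (this split is assumed purely for the example). Then:

$$\text{precision} = \frac{90}{200} = 45\%, \qquad \text{recall} = \frac{90}{100} = 90\%$$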
Focused Crawler :: Classifier
62
Focused Crawler :: Classifier
64
Focused Crawler
If the classifier judges the crawled page p to be relevant to the topic,
then the URLs from p are added to the frontier; otherwise they are discarded.
65
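A minimal sketch of this rule; the keyword-density score below is only a stand-in for the trained classifier the slides refer to:

```python
def is_relevant(page_text, topic_keywords, threshold=0.05):
    """Stand-in for the trained classifier: scores the page by the
    density of topic keywords in its text."""
    words = page_text.lower().split()
    score = sum(words.count(kw) for kw in topic_keywords) / max(len(words), 1)
    return score >= threshold

def process_page(page_text, out_links, frontier, topic_keywords):
    """Focused-crawler rule: enqueue the out-links of a crawled page only if
    the page itself is judged relevant to the topic; otherwise discard them."""
    if is_relevant(page_text, topic_keywords):
        frontier.extend(out_links)
```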
Focused Crawler
Can have multiple topics with as many classifiers, with scores
appropriately combined (Chakrabarti et al. 1999)
Can use a distiller to find topical hubs periodically, and add these to the
frontier
66
Context - Focused Crawler
67
Context - Focused Crawler
• A naïve Bayesian classifier is built for each layer in the context graph.
68
Context - Focused Crawler
69
Context - Focused Crawler
• If Pr(l* | p) is less than a threshold, p is classified into the “other” class,
which represents pages that do not have a good fit with any of the
layers. If Pr(l* | p) exceeds the threshold, p is classified into l*.
70
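A sketch of the layer-assignment rule; the per-layer posterior probabilities are assumed to come from the naive Bayes classifiers described above:

```python
def assign_layer(layer_posteriors, threshold=0.5):
    """layer_posteriors: {layer: Pr(layer | page)} from the per-layer classifiers.
    Return the winning layer l*, or "other" if even the best fit is weak."""
    best_layer = max(layer_posteriors, key=layer_posteriors.get)
    if layer_posteriors[best_layer] < threshold:
        return "other"
    return best_layer

print(assign_layer({0: 0.7, 1: 0.2, 2: 0.1}))   # -> 0
```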
Topical Crawler
All we have is a topic (query, description, keywords) and a set of seed
pages (not necessarily relevant)
No labeled examples
71
Topical Crawler
72
Pros and Cons of Topical Crawler
73
Topical Locality
Topical locality is a necessary condition for a topical crawler to work,
and for surfing to be a worthwhile activity for humans
Links must encode semantic information, i.e. say something about
neighbour pages, not be random.
It is also a sufficient condition if we start from “good” seed pages.
Crawling algorithms can use cues from words and hyperlinks,
associated respectively with a lexical and a link topology.
In the former, two pages are close to each other if they have similar
textual content; in the latter, if there is a short path between them.
74
Topical Locality
From a crawler's perspective, there are two central questions:
1. link-content conjecture: whether pages that are close to each other in
link space (within a few links) also tend to have similar textual content.
This makes the assumption that pages which link to each other are
closely topically related, i.e. a page on bananas is likely to link to other
pages about bananas (or at least fruit).
75
Topical Locality
2. link-cluster conjecture: whether two pages that link to each other are
more likely to be semantically related to each other, compared to two
randomly selected pages.
This assumes that pages which are clustered together, or in the same
"web-community" are closely topically related, i.e. the page about
bananas from above links to a page about fruit which links to a page
about food - all three are closely topically related.
76
Link-Content Conjecture
• Let us use the cosine similarity measure σ(p1, p2) between pages p1
and p2.
• We can measure the link distance δ(p1, p2) along the shortest directed
path from p1 to p2, revealed by the breadth-first crawl.
• Both distances δ(q, p) and similarities σ(q, p) were averaged for each
topic q over all pages p in the crawl set P_d^q for each depth d:
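A plausible reconstruction of the two averages, based on the definitions given here (the exact notation on the original slide may differ):

$$\sigma_d(q) = \frac{1}{N_d^q} \sum_{p \in P_d^q} \sigma(q, p), \qquad \delta_d(q) = \frac{1}{N_d^q} \sum_{p \in P_d^q} \delta(q, p)$$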
• where N_d^q is the size of the cumulative page set P_d^q = {p | δ(q, p) ≤ d}.
77
Link-Content Conjecture
78
Link-Content Conjecture
79
Link-Cluster Conjecture
81
Correlation between different similarity measures
82
TF - IDF
83
TF - IDF
84
TF - IDF
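For reference, the standard TF-IDF weighting (the exact notation on the original slides may differ): a term t in document d is weighted by its term frequency times the inverse document frequency,

$$w_{t,d} = \mathrm{tf}_{t,d} \times \log\frac{N}{\mathrm{df}_t}$$

where N is the number of documents in the collection and df_t is the number of documents containing t.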
The term "the" is not a good keyword to distinguish relevant and non-
relevant documents and terms, unlike the less common words "brown"
and "cow".
86
Jaccard’s Coefficient of Link Neighbourhood
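The standard Jaccard coefficient, which is presumably the measure intended here:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{N_c}{N_a + N_b - N_c}$$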
where
Na - number of elements in set A, Nb - number of elements in set B
Nc - number of elements in the intersection of A and B
87
Evaluation of Topical Crawlers
88
Evaluation of Topical Crawlers
The harvest rate is the fraction of crawled web pages that are relevant
to the given topic; it measures how well the crawler is doing at rejecting
irrelevant web pages. The expression is given by:
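Reconstructed from the definitions below, the harvest rate is presumably:

$$HR = \frac{1}{V} \sum_{i=1}^{V} r_i$$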
where
V is the number of web pages crawled so far in the current crawl;
r_i is the relevance between web page i and the given topic, and the
value of r_i can only be 0 or 1.
If page i is relevant, then r_i = 1; otherwise r_i = 0.
90
Evaluation of Topical Crawlers
91
Evaluation of Topical Crawlers
92
Evaluation of Topical Crawlers
93
Evaluation of Topical Crawlers
A second approach is to split the set of known relevant pages into two
sets; one set can be used as seeds, the other as targets.
While there is no guarantee that the targets are reachable from the
seeds, this approach is significantly simpler because no back-crawl is
necessary.
Another advantage is that each of the two relevant subsets can be used
in turn as seeds and targets.
In this way, one can measure the overlap between the pages crawled
starting from the two disjoint sets.
94
Evaluation of Topical Crawlers
The use of known relevant pages as proxies for unknown relevant sets
implies an important assumption, which is illustrated by the Venn
diagram.
Here S is a set of crawled pages and T is the set of known relevant
target pages, a subset of the relevant set R.
95
Evaluation of Topical Crawlers
96
Evaluation of Topical Crawlers
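The target-based measures these definitions accompany are presumably recall and precision of the crawl set against the target set (reconstructed here; the original slide's exact form may differ):

$$\text{recall}_t = \frac{|S_t \cap T_\theta|}{|T_\theta|}, \qquad \text{precision}_t = \frac{|S_t \cap T_\theta|}{|S_t|}$$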
where St is the set of pages crawled at time t (t can be wall clock time,
network latency, number of pages visited, number of bytes downloaded,
and so on). Tθ is the relevant target set, where θ represents the
parameters used to select the relevant target pages.
97
Evaluation of Topical Crawlers
98
Crawler Ethics and Conflicts
Crawlers can cause trouble, even unintentionally, if not properly designed
to be “polite” and “ethical”
For example, sending too many requests in rapid succession to a single
server can amount to a Denial of Service (DoS) attack!
Server administrators and users will be upset
Crawler developer/admin IP address may be blacklisted
99
Crawler Etiquette
Identify yourself
Use the ‘User-Agent’ HTTP header to identify the crawler and point to
a web page with a description of the crawler and contact information
for its developer
Use the ‘From’ HTTP header to specify the crawler developer's email
Do not disguise the crawler as a browser by using a browser's
‘User-Agent’ string
Always check that HTTP requests are successful, and in case of error,
use the HTTP error code to determine and immediately address the
problem
Pay attention to anything that may lead to too many requests to any one
server, even unintentionally, e.g.:
redirection loops
spider traps
100
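A sketch of a politely identified request, following the guidelines above; the crawler name, info URL, and e-mail address are placeholders:

```python
import requests

# The crawler name, info URL, and e-mail address below are placeholders.
headers = {
    "User-Agent": "ExampleCrawler/1.0 (+http://example.edu/crawler-info)",
    "From": "crawler-admin@example.edu",
}
resp = requests.get("http://example.com/", headers=headers, timeout=10)
resp.raise_for_status()    # surface HTTP errors instead of silently ignoring them
```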
Crawler Etiquette
Spread the load, do not overwhelm a server
Make sure that no more than some maximum number of requests are
sent to any single server per unit time, say fewer than one per second
Honor the Robot Exclusion Protocol
A server can specify which parts of its document tree any crawler is
or is not allowed to crawl by a file named ‘robots.txt’ placed in the
HTTP root directory, e.g. http://www.indiana.edu/robots.txt
Crawler should always check, parse, and obey this file before
sending any requests to a server
More info at:
http://www.google.com/robots.txt
http://www.robotstxt.org/wc/exclusion.html
101
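A sketch of robots.txt checking with Python's standard urllib.robotparser; the crawler name and URLs are placeholders:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder host
rp.read()                                         # download and parse robots.txt

url = "http://www.example.com/private/page.html"
if rp.can_fetch("ExampleCrawler", url):
    pass    # allowed: go ahead and request the page
else:
    pass    # disallowed: skip this URL

delay = rp.crawl_delay("ExampleCrawler")          # honor Crawl-delay, if specified
```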
Crawler Ethics Issues
Is compliance with robot exclusion a matter of law?
No! Compliance is voluntary, but if you do not comply, you may be
blocked
Someone (unsuccessfully) sued Internet Archive over a robots.txt
related issue
Some crawlers disguise themselves
Using false User-Agent
Randomizing access frequency to look like a human/browser
Example: click fraud for ads
102
Crawler Ethics Issues
Servers can disguise themselves, too
Cloaking: present different content based on the User-Agent header
E.g. stuff keywords on version of page shown to search engine
crawler
Search engines do not look kindly on this type of “spamdexing” and
remove from their index sites that perform such abuse
Case of bmw.de made the news
103
Gray Areas of Crawler Ethics
If you write a crawler that unwittingly follows links to ads, are you just
being careless, are you violating terms of service, or are you violating the
law by defrauding advertisers?
Is non-compliance with Google’s robots.txt in this case equivalent to
click fraud?
If you write a browser extension that performs some useful service,
should you comply with robot exclusion?
104
New Crawling Code?
105
Thank You for Your Attention !
106