
CSE3024 WEB MINING

L T P J C
3 0 2 0 4

Dr. S M SATAPATHY
Associate Professor,
School of Computer Science and Engineering,
VIT Vellore, TN, India – 632 014.
Module – 2

WEB CRAWLING

 Basic Crawler Algorithm

 Universal Crawler

 Focused Crawler

 Topical Crawler

2
APPLICATIONS

3
Applications of Web Crawlers

 Business Intelligence
 Companies collect information about their competitors and potential
collaborators
 Monitor Web Sites and Pages of Interest
 A user or community can be notified when new information appears
in certain places
 Support of Search Engines
 Crawlers are the main consumers of Internet bandwidth.
 They collect pages for search engines to build their indexes.

4
Applications of Web Crawlers

 Harvest Email Addresses (Malicious Application)
 Used by spammers, or to collect personal information to be used in
phishing and other identity theft attacks.

 Well known search engines such as Google, Yahoo! and MSN run very
efficient universal crawlers designed to gather all pages irrespective of
their content.
 Other crawlers, sometimes called preferential crawlers, are more
targeted.
 They attempt to download only pages of certain types or topics.

5
Web Scraper vs. Web Crawler

 A web crawler is a software program that visits websites and reads their
pages and other related information in order to build entries for a search
engine index.
 The major search engines like Google, Yahoo, Bing etc on the Web all
have such a program, which is also known as a “web spider” or a “bot.”
 Web scraping is the process of automatically requesting a web
document and collecting information from it.
 To do web scraping, you have to do some degree of web crawling to
move around the websites.
 Web Scraping is essentially targeted at specific websites for specific
data, e.g. for stock market data, business leads, supplier product
scraping.
6
Web Scraper vs. Web Crawler
 A web scraper may do things a good web crawler wouldn’t do, e.g.:
 Not obey robots.txt
 Submit forms with data
 Execute JavaScript
 Transform the data into the required form and format
 Save the extracted data into a database

7
TYPES OF CRAWLER

8
Types of Crawlers
• Universal Crawler
• Focused Crawler
• Topical Crawler

9
Basic Crawler Algorithm

 A crawler starts from a set of seed pages (URLs) and then uses the
links within them to fetch other pages.

 The links in these pages are, in turn, extracted and the corresponding
pages are visited.

 The process repeats until a sufficient number of pages are visited or
some other objective is achieved.

 The crawler maintains a list of unvisited URLs called the frontier. The
list is initialized with seed URLs which may be provided by the user or
another program.
10
Basic Sequential Crawler

11
Breadth-First Crawlers

12
Breadth-First Crawlers

 The frontier may be implemented as a first-in-first-out (FIFO) queue,
corresponding to a breadth-first crawler.

 The URL to crawl next comes from the head of the queue and new
URLs are added to the tail of the queue.

 Once the frontier reaches its maximum size, the breadth-first crawler
can add to the queue only one unvisited URL from each new page
crawled.

13
Breadth-First Crawlers

 The breadth-first strategy does not imply that pages are visited in a
“random” order, because:

 The order in which pages are visited by a breadth-first crawler is
highly correlated with their PageRank or indegree values.

 They are greatly affected by the choice of seed pages.


 Topical locality measures indicate that pages in the link
neighborhood of a seed page are much more likely to be related
to the seed pages than randomly selected pages.

14
Crawling Algorithm

• Initialize queue (Q) with the initial set of known URLs.
• Until Q is empty or the page or time limit is exhausted:
• Pop URL, L, from the front of Q.
• If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt…),
continue loop (get next URL).
• If L has already been visited, continue loop (get next URL).
• Download page, P, for L.
• If P cannot be downloaded (e.g. 404 error, robot excluded),
continue loop (get next URL).
• Index P (e.g. add to inverted index or store cached copy).
• Parse P to obtain the list of new links N.
• Append N to the end of Q.
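A minimal Python sketch of this loop (not part of the original slides); the page limit, the in-memory "index" (a dict), and the error handling are illustrative assumptions.

```python
# A sketch of the crawling loop above using only the standard library.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)              # Q, a FIFO queue of unvisited URLs
    visited, index = set(), {}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()         # pop URL L from the front of Q
        if url in visited or not url.startswith("http"):
            continue                     # skip duplicates and non-HTTP links
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue             # only HTML pages are parsed
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                     # e.g. 404 or timeout: get next URL
        index[url] = page                # "index" P (here: cache the raw text)
        extractor = LinkExtractor(url)
        extractor.feed(page)
        frontier.extend(extractor.links)  # append new links N to the end of Q
    return index
```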
15
Breadth-First Traversal
Given any graph and a set of seeds at which to start, the graph can be
traversed using the algorithm
1. Put all the given seeds into the queue;
2. Prepare to keep a list of “visited” nodes (initially empty);
3. As long as the queue is not empty:
a. Remove the first node from the queue;
b. Append that node to the list of “visited” nodes
c. For each edge starting at that node:
i. If the node at the end of the edge already appears on the list of
“visited” nodes or it is already in the queue, then do nothing more
with that edge;
ii. Otherwise, append the node at the end of the edge, to the end of
the queue.
16
Depth-First Crawlers

17
Depth-First Crawlers

18
Depth-First Crawlers

 The frontier may be implemented as a last-in-first-out (LIFO) stack,
corresponding to a depth-first crawler.

 A depth-first crawler tends to wander away from the seeds (“lost in
cyberspace”).

Depth-first search (DFS) algorithm:
• Get the 1st non-visited link from the start page
• Visit that link and get its 1st non-visited link
• Repeat the above step until there are no non-visited links left
• Go back to the next non-visited link in the previous level and repeat
step 2
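The switch between breadth-first and depth-first crawling comes down to which end of the frontier the next URL is taken from; a small sketch (the seed URL is a placeholder):

```python
from collections import deque

frontier = deque(["http://example.com/seed"])

next_url = frontier.popleft()   # breadth-first: FIFO queue, take from the head
# next_url = frontier.pop()     # depth-first: LIFO stack, take the newest URL
```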

19
Breadth-First vs. Depth-First Crawlers

 breadth-first is more careful by checking all alternatives


 complete and optimal
 very memory-intensive

 depth-first goes off into one branch until it reaches a leaf node
 not good if the goal node is on another branch
 neither complete nor optimal
 uses much less space than breadth-first

20
Preferential Crawlers

 A different crawling strategy is obtained if the frontier is implemented as
a priority queue rather than a FIFO queue.

 Typically, preferential crawlers assign each unvisited link a priority
based on an estimate of the value of the linked page.

 The estimate can be based on


 topological properties (e.g., the indegree of the target page)
 content properties (e.g., the similarity between a user query and the
source page)
 any other combination of measurable features.

21
Preferential Crawlers

 If pages are visited in the order specified by the priority values in the
frontier, then we have a best-first crawler.

 Best First Search is a heuristic based search algorithm.

 In this approach, a relevancy calculation is done for each link, and the
most relevant link, i.e. the one with the highest relevancy value, is
fetched from the frontier.

 Thus every time the best available link is opened and traversed.

22
Preferential Crawlers

 The frontier is kept in a priority queue, with a hash table keyed by URL
maintained alongside it for duplicate checks. The time complexity of
inserting a URL into the priority queue is O(log F), where F is the
frontier size (looking up the hash table requires constant time).

 To dequeue a URL, it must first be removed from the priority queue
(O(log F)) and then from the hash table (again O(1)). Thus the parallel
use of the two data structures yields a logarithmic total cost per URL.

 Once the frontier’s maximum size is reached, only the best URLs are
kept; the frontier must be pruned after each new set of links is added.
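A possible sketch of such a frontier, combining a binary heap with a hash set; the maximum size and the scoring are assumptions:

```python
# Best-first frontier: the heap gives O(log F) insertion/removal by
# priority, the set gives O(1) duplicate checks, and the frontier is
# pruned once it exceeds a maximum size.
import heapq

class BestFirstFrontier:
    def __init__(self, max_size=10000):
        self.heap, self.seen, self.max_size = [], set(), max_size

    def add(self, url, score):
        if url not in self.seen:                      # O(1) lookup
            self.seen.add(url)
            heapq.heappush(self.heap, (-score, url))  # O(log F); negate: best first
            if len(self.heap) > self.max_size:        # prune: keep only the best URLs
                self.heap = heapq.nsmallest(self.max_size, self.heap)
                heapq.heapify(self.heap)

    def next_url(self):
        return heapq.heappop(self.heap)[1]            # O(log F): highest-priority URL
```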

23
Preferential Crawlers

Fish Search
 Fish Search is a dynamic heuristic search algorithm.

 It works on the intuition that relevant links have relevant neighbours;
hence it starts with a relevant link and goes deep under that link and
stops searching under the links that are irrelevant.

 The key point of Fish Search algorithm lies in the maintenance of URL
order.

24
Preferential Crawlers

A* Search
 A* uses Best First Search.

 It calculates the relevancy of each link and the difference between the
expected relevancy of the goal web-page and that of the current link.

 The sum of these two values serves as the measure for selecting the
best path.

25
Preferential Crawlers

Adaptive A* Search
 Adaptive A* Search works on informed heuristics to focus its searches.

 With each iteration, it updates the relevancy value of the page and
uses it for the next traversal.

 The pages are updated log(Graph Size) times (after log(Graph Size)
updates, the overhead of updating is much more than the improvement
that can be achieved in getting more relevant pages).

 Then normal A* traversal is done.

26
IMPLEMENTATION ISSUES

27
Implementation Issues

Fetching
 The client needs to timeout connections to prevent spending
unnecessary time waiting for responses from slow servers or reading
huge pages.
 Redirect loops are to be detected and broken by storing URLs from a
redirection chain in a hash table and halting if the same URL is
encountered twice.
 One may also parse and store the last-modified header to determine
the age of the document.
 Error-checking and exception handling are important during the page
fetching process since the same code must deal with potentially millions
of remote servers.
28
Implementation Issues

Parsing
 HTML has the structure of a DOM (Document Object Model) tree
 Unfortunately actual HTML is often incorrect in a strict syntactic sense.
 Crawlers, like browsers, must be robust/forgiving.
 Many pages are published with missing required tags, tags improperly
nested, missing close tags, misspelled or missing attribute names and
values, missing quotes around attribute values, unescaped special
characters, and so on.
 Fortunately there are tools that can help (E.g. tidy.sourceforge.net)
 Must pay attention to HTML entities and unicode in text.
 What to do with a growing number of other formats?
 Flash, SVG, RSS, AJAX…
29
Implementation Issues

30
Implementation Issues

Stopword Removal
 Noise words that do not carry meaning should be eliminated (“stopped”)
before they are indexed
 E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
 Typically syntactic markers
 Typically the most common terms
 Typically kept in a negative dictionary
 10–1,000 elements
 E.g.
http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
 Parser can detect these right away and disregard them
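A toy sketch of stopping with a small negative dictionary; a real crawler would load a fuller stopword list such as the one linked above.

```python
STOPWORDS = {"and", "the", "a", "at", "or", "on", "for"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the crawler ran on the web".split()))
# ['crawler', 'ran', 'web']
```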

31
Implementation Issues

Conflation and thesauri


 Idea: improve recall by merging words with same meaning
 We want to ignore superficial morphological features, thus merge
semantically similar tokens
 {student, study, studying, studious} => studi
 We can also conflate synonyms into a single form using a thesaurus
 30-50% smaller index
 Doing this in both pages and queries allows us to retrieve pages about
‘automobile’ when the user asks for ‘car’
 Thesaurus can be implemented as a hash table

32
Implementation Issues

Stemming
 Morphological conflation based on rewrite rules
 Language dependent!
 Porter stemmer very popular for English
 http://www.tartarus.org/~martin/PorterStemmer/
 Context-sensitive grammar rules, eg:
 “IES” except (“EIES” or “AIES”) --> “Y”
 Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
 Porter has also developed Snowball, a language to create stemming
algorithms in any language
 http://snowball.tartarus.org/
 Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
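A brief sketch using the Porter stemmer shipped with NLTK (assumes the nltk package is installed):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["study", "studies", "studying"]:
    print(word, "->", stemmer.stem(word))   # all three should map to "studi"
```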
33
Implementation Issues

Link Extraction and Canonicalization


 HTML parsers provide the functionality to identify tags and associated
attribute-value pairs in a given Web page.
 However, the URLs thus obtained need to be further processed. First,
filtering may be necessary to exclude certain file types that are not to be
crawled.
 This can be achieved with white lists (e.g., only follow links to text/html
content pages) or black lists (e.g., discard links to PDF files).
 The identification of a file type may rely on file extensions. However,
they are often unreliable and sometimes missing altogether.
 A compromise is to send an HTTP HEAD request and inspect the
content-type response header, which is usually a more reliable label.
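A sketch of such a HEAD-based check, assuming the third-party requests package; the URL is a placeholder.

```python
import requests

def is_html(url):
    try:
        head = requests.head(url, allow_redirects=True, timeout=5)
        return "text/html" in head.headers.get("Content-Type", "")
    except requests.RequestException:
        return False

print(is_html("https://example.com/"))
```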
34
Implementation Issues

Link Extraction and Canonicalization :: Static vs. Dynamic Pages


 Is it worth trying to eliminate dynamic pages and only index static
pages?
 Examples:
 http://www.amazon.com/exec/obidos/subst/home/home.html/002-
8332429-6490452
 http://www.imdb.com/name/nm0578801/
 Why or why not? How can we tell if a page is dynamic? What about
‘spider traps’?
 What do Google and other search engines do?

35
Implementation Issues

Link Extraction and Canonicalization :: Relative vs. Absolute URLs


 Crawler must translate relative URLs into absolute URLs
 Need to obtain Base URL from HTTP header, or HTML Meta tag, or
else current page path by default
 Examples
 Base: http://www.cnn.com/linkto/
 Relative URL: intl.html
 Absolute URL: http://www.cnn.com/linkto/intl.html

 Relative URL: /US/


 Absolute URL: http://www.cnn.com/US/
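The examples above can be reproduced with Python's standard urljoin:

```python
from urllib.parse import urljoin

print(urljoin("http://www.cnn.com/linkto/", "intl.html"))
# http://www.cnn.com/linkto/intl.html
print(urljoin("http://www.cnn.com/linkto/", "/US/"))
# http://www.cnn.com/US/
```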

36
Implementation Issues

Link Extraction and Canonicalization :: URL Canonicalization


 In order to avoid duplication, the crawler must transform all URLs into
canonical form
 Definition of “canonical” is arbitrary, e.g.:
 Could always include port
 Or only include port when not default :80
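One possible canonicalization sketch along these lines (the exact canonical form is a design choice, as noted above): lowercase scheme and host, drop the default :80 port, drop the fragment, and guarantee a path.

```python
from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    parts = urlparse(url)
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and parts.port != 80:
        host = "%s:%d" % (host, parts.port)     # keep only non-default ports
    path = parts.path or "/"
    return urlunparse((parts.scheme.lower(), host, path,
                       parts.params, parts.query, ""))  # "" drops the fragment

print(canonicalize("HTTP://www.Example.com:80/a/b.html#section"))
# http://www.example.com/a/b.html
```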

37
Implementation Issues

Link Extraction and Canonicalization :: URL Canonicalization

38
Implementation Issues

Spider Trap
 Misleading sites: indefinite number of pages dynamically generated by
CGI scripts.
 These are Web sites where the URLs of dynamically created links are
modified based on the sequence of actions taken by the browsing user
(or crawler).
 In practice spider traps are not only harmful to the crawler, which
wastes bandwidth and disk space downloading and storing duplicate or
useless data; the unintended stream of requests can also amount to a
denial-of-service (DoS) attack on the server.

39
Implementation Issues
Spider Trap - Solutions
 Since the dummy URLs often become larger and larger in size as the
crawler becomes entangled in a spider trap, one common heuristic
approach to tackle such traps is by limiting the URL sizes to some
maximum number of characters, say 256.

 The code associated with the frontier can make sure that every
consecutive sequence of, say, 100 URLs fetched by the crawler
contains at most one URL from each fully qualified host name.

 Eliminate URLs with non-textual data types.


Crawling of dynamic pages may be disabled, if such pages can be detected
40
Implementation Issues

Page Repository
 Naïve: store each page as a separate file
 Can map URL to unique filename using a hashing function, e.g.
MD5
 This generates a huge number of files, which is inefficient from the
storage perspective
 Better: combine many pages into a single large file, using some XML
markup to separate and identify them
 Must map URL to {filename, page_id}
 Database options
 Any RDBMS -- large overhead
 Light-weight, embedded databases such as Berkeley DB
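A sketch of the URL-to-filename mapping via an MD5 hash:

```python
import hashlib

url = "http://www.example.com/some/page.html"
filename = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
print(filename)   # 32 hex characters followed by ".html"
```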
41
Implementation Issues

Concurrency
 A crawler incurs several delays:
 Resolving the host name in the URL to an IP address using DNS
 Connecting a socket to the server and sending the request
 Receiving the requested page in response

 Solution: Overlap the above delays by fetching many pages
concurrently

42
Implementation Issues

Architecture of Concurrent Crawler

43
Implementation Issues

Concurrency
 Can use multi-processing or multi-threading
 Each process or thread works like a sequential crawler, except they
share data structures: frontier and repository
 Shared data structures must be synchronized (locked for concurrent
writes)
 Speedups by a factor of 5-10 are easily obtained this way
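A rough sketch of such a multi-threaded crawler; the seed URL, page limit, and regex-based link extraction are simplifying assumptions, and a worker simply exits when the frontier looks empty.

```python
# Several workers share the frontier, the visited set, and the repository,
# all guarded by one lock; network I/O happens outside the lock.
import re
import threading
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

frontier = deque(["https://example.com/"])
visited, repository = set(), {}
lock = threading.Lock()

def fetch(url):
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def worker():
    while True:
        with lock:                       # synchronized access to shared structures
            if not frontier or len(repository) >= 100:
                return
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
        page = fetch(url)                # slow network I/O outside the lock
        links = [urljoin(url, h) for h in re.findall(r'href="([^"]+)"', page)]
        with lock:
            repository[url] = page
            frontier.extend(links)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```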

44
UNIVERSAL CRAWLER

45
Universal Crawler

 Support universal search engines


 Large-scale
 Huge cost (network bandwidth) of crawl is amortized over many queries
from users
 Incremental updates to existing index and other data repositories

46
Large Scale Universal Crawler vs. Concurrent Breadth-first Crawler

 Two major issues:


 Performance
 Need to scale up to billions of pages
 Policy
 Need to trade-off coverage, freshness, and bias (e.g. toward
“important” pages)

47
Large Scale Universal Crawler: Scalability

 Need to minimize overhead of DNS lookups


 Need to optimize utilization of network bandwidth and disk throughput
(I/O is bottleneck)
 Use asynchronous sockets
 Multi-processing or multi-threading do not scale up to billions of
pages
 Non-blocking: hundreds of network connections open
simultaneously
 Polling socket to monitor completion of network transfers
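A sketch of non-blocking fetching with asynchronous I/O, assuming the third-party aiohttp package; the URLs are placeholders.

```python
import asyncio
import aiohttp

async def fetch(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, await resp.text()
    except Exception:
        return url, None

async def fetch_all(urls):
    # Many requests can be in flight at once on a single thread.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(fetch_all(["https://example.com/", "https://example.org/"]))
```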

48
Large Scale Universal Crawler: Scalability

49
Coverage vs. Freshness vs. Importance

 Coverage
 New pages get added all the time
 Can the crawler find every page?
 Freshness
 Pages change over time, get removed, etc.
 How frequently can a crawler revisit?
 Trade-off!
 Focus on most “important” pages (crawler bias)?
 “Importance” is subjective

50
Coverage vs. Freshness vs. Importance

 Universal crawlers are never “done”


 High variance in rate and amount of page changes
 HTTP headers are notoriously unreliable
 Last-modified
 Expires
 Solution
 Estimate the probability that a previously visited page has changed
in the meanwhile
 Prioritize by this probability estimate

51
Estimating Page Change Rate

 The most recent and exhaustive study reports that while new pages are
created at a rate of about 8% per week, only about 62% of the content
of these pages is really new because pages are often copied from
existing ones.

 The link structure of the Web is more dynamic, with about 25% new
links created per week.

 Finally, there is an agreement on the observation that the degree of
change of a page is a better predictor of future change than the
frequency of change.

52
Do we need to Crawl the entire web?

 If we cover too much, it will get stale


 There is an abundance of pages in the Web
 For PageRank, pages with very low prestige are largely useless
 What is the goal?
 General search engines: pages with high prestige
 News portals: pages that change often
 Vertical portals: pages on some topic
 What are appropriate priority measures in these cases?
Approximations?

53
PREFERENTIAL CRAWLER

54
Preferential Crawler

 Selective bias toward some pages, e.g. most “relevant”/topical, closest
to seeds, most popular/largest PageRank, unknown servers, highest
rate/amount of change, etc.
 Focused crawlers
 Supervised learning: classifier based on labelled examples
 Topical crawlers
 Best-first search based on similarity(topic, parent)
 Adaptive crawlers
 Reinforcement learning
 Evolutionary algorithms/artificial life

55
Preferential Crawling Algorithms
 Breadth-First
 Exhaustively visit all links in order encountered
 Best-N-First
 Priority queue sorted by similarity, explore top N at a time
 Variants: DOM context, hub scores
 PageRank
 Priority queue sorted by keywords, PageRank
 SharkSearch
 Priority queue sorted by combination of similarity, anchor text, similarity
of parent, etc. (powerful cousin of FishSearch)
 InfoSpiders
 Adaptive distributed algorithm using an evolving population of learning
agents. (http://carl.cs.indiana.edu/fil/IS/slides.html)
56
Focused Crawler

 A focused crawler attempts to bias the crawl towards pages in
certain categories in which the user is interested.
 A focused crawler is based on a classifier. The idea is to first build a
text classifier using labeled example pages from, say, the ODP or
dmoz.
 Then the classifier would guide the crawler by preferentially selecting
from the frontier those pages that appear most likely to belong to the
categories of interest, according to the classifier's prediction.

57
Focused Crawler :: Classifier

58
Focused Crawler :: Classifier

 Precision (also called positive predictive value) is the fraction of
relevant instances among the retrieved instances, while Recall (also
known as sensitivity) is the fraction of relevant instances that have been
retrieved out of the total number of relevant instances.
 When a search engine returns 30 pages only 20 of which were relevant
while failing to return 40 additional relevant pages, its precision is 20/30
= 2/3 while its recall is 20/60 = 1/3.
 So, in this case, precision is "how useful the search results are", and
recall is "how complete the results are".
 Precision can be seen as a measure of exactness or quality, whereas
recall is a measure of completeness or quantity.

59
Focused Crawler :: Classifier

 Calculating precision and recall is actually quite easy. Imagine there are
100 positive cases among 10,000 cases. You want to predict which
ones are positive, and you pick 200 to have a better chance of catching
many of the 100 positive cases.

1. TN / True Negative: case was negative and predicted negative


2. TP / True Positive: case was positive and predicted positive
3. FN / False Negative: case was positive but predicted negative
4. FP / False Positive: case was negative but predicted positive

60
Focused Crawler :: Classifier
• Suppose 60 of your 200 predicted positives turn out to be truly positive
(TP = 60).

• What percent of positive predictions were correct?

You answer: the "precision" was 60 out of 200 = 30%

• What percent of the positive cases did you catch?

You answer: the "recall" was 60 out of 100 = 60%
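The same worked example expressed as code (the count of 60 correct positive predictions follows the answers above):

```python
tp = 60            # predicted positive and truly positive
fp = 200 - tp      # predicted positive but actually negative
fn = 100 - tp      # truly positive but predicted negative

precision = tp / (tp + fp)   # 60 / 200 = 0.30
recall = tp / (tp + fn)      # 60 / 100 = 0.60
print(precision, recall)
```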

61
Focused Crawler :: Classifier

 Precision is the fraction of retrieved documents that are relevant:
Precision = |relevant ∩ retrieved| / |retrieved|

 Recall is the fraction of relevant documents that are retrieved:
Recall = |relevant ∩ retrieved| / |relevant|

62
Focused Crawler :: Classifier

Why have two numbers?

 The advantage of having the two numbers for precision and recall is
that one is more important than the other in many circumstances.

 Typical web surfers:
would like every result on the first page to be relevant (high precision),
but have not the slightest interest in knowing, let alone looking at,
every document that is relevant.

 Professional searchers such as paralegals and intelligence analysts:
are very concerned with trying to get as high a recall as possible, and
will tolerate fairly low precision results in order to get it.
63
Focused Crawler

• For each category c in the taxonomy, we can build a Bayesian classifier
to compute the probability Pr(c|p) that a crawled page p belongs to c.
• Pr(top|p) = 1 for the top or root category.
• The user can select a set c* of categories of interest.
• Each crawled page p is assigned a relevance score
R(p) = Σ_{c ∈ c*} Pr(c|p).
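A sketch of this kind of classifier-guided scoring, using a multinomial naïve Bayes model from scikit-learn as a stand-in for the taxonomy classifier; the categories, training texts, and focus set c* are toy assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["stock market shares trading", "football league match goal",
               "bond yields interest rates", "tennis open final set"]
train_labels = ["finance", "sports", "finance", "sports"]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_texts),
                                 train_labels)

def relevance(page_text, focus_categories=("finance",)):
    """R(p) = sum of Pr(c|p) over the focus categories in c*."""
    probs = classifier.predict_proba(vectorizer.transform([page_text]))[0]
    return sum(p for c, p in zip(classifier.classes_, probs)
               if c in focus_categories)

print(relevance("central bank raises interest rates"))
```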

64
Focused Crawler

Two strategies were explored.


• In the “soft” focused strategy, the crawler uses the score R(p) of each
crawled page p as a priority value for all unvisited URLs extracted from
p. The URLs are then added to the frontier, which is treated as a priority
queue.
• In the “hard” focused strategy, for a crawled page p, the classifier first
finds the leaf category ĉ(p) in the taxonomy most likely to include p.

• If an ancestor of ĉ(p) is a focus category, i.e. if some category on the
path from the root to ĉ(p) belongs to the selected set c*, then the URLs
from the crawled page p are added to the frontier. Otherwise they are
discarded.
65
Focused Crawler
 Can have multiple topics with as many classifiers, with scores
appropriately combined (Chakrabarti et al. 1999)

 Can use a distiller to find topical hubs periodically, and add these to the
frontier

 Can accelerate with the use of a critic (Chakrabarti et al. 2002)

 Can use alternative classifier algorithms to naïve-Bayes, e.g. SVM and
neural nets have reportedly performed better (Pant & Srinivasan 2005)

66
Context - Focused Crawler

• It uses naïve Bayesian classifiers as a guide, but in this case the
classifiers are trained to estimate the link distance between a crawled
page and a set of relevant target pages.

• For example: imagine you are looking for information on “machine
learning.” One might go to the home pages of computer science
departments and from there to faculty pages, which may then lead to
relevant pages and papers.

• A department home page, however, may not contain the keywords
“machine learning.”

67
Context - Focused Crawler

• A typical focused or best-first crawler would give such a page a low
priority and possibly never follow its links.

• However, if the crawler could estimate that pages about machine
learning are only two links away from a page containing the keywords
“computer science department,” then it would give the department
home page a higher priority.

• The context-focused crawler is trained using a context graph with L
layers.

• A naïve Bayesian classifier is built for each layer in the context graph.
68
Context - Focused Crawler

69
Context - Focused Crawler

• A prior probability Pr(l) = 1/L is assigned to each layer.
• All the pages in a layer are used to compute Pr(t|l), the probability of
occurrence of a term t given the layer (class) l.
• At crawling time, these are used to compute Pr(p|l) for each
crawled page p.
• The posterior probability Pr(l|p) of p belonging to layer l can then be
computed for each layer from Bayes’ rule.
• The layer l* with the highest posterior probability wins:
l* = argmax_l Pr(l|p)
• If Pr(l*|p) is less than a threshold, p is classified into the “other” class,
which represents pages that do not have a good fit with any of the
layers. If Pr(l*|p) exceeds the threshold, p is classified into l*.
70
Topical Crawler
 All we have is a topic (query, description, keywords) and a set of seed
pages (not necessarily relevant)

 No labeled examples

 Must predict relevance of unvisited links to prioritize

 Original idea: Menczer 1997, Menczer & Belew 1998

71
Topical Crawler

• Example: the MySpiders applet is designed to demonstrate two topical
crawling algorithms, best-N-first and InfoSpiders.
• Unlike a search engine, this application has no index to search for
results.
• Instead the Web is crawled in real time. As pages deemed relevant are
crawled, they are displayed in a list that is kept sorted by a user-
selected criterion: score or recency.
• The score is simply the content (cosine) similarity between a page and
the query, and the recency of a page is estimated by the last-modified
header, if returned by the server (not a very reliable estimate).

72
Pros and Cons of Topical Crawler

• All hits are fresh by definition.


• This makes this type of crawler suitable for applications that look for
very recently posted documents, which a search engine may not have
indexed yet.
• On the down side, the search is slow compared to a traditional search
engine because the user has to wait while the crawler fetches and
analyzes pages.
• Another disadvantage is that the ranking algorithms cannot take
advantage of global prestige measures, such as PageRank, available to
a traditional search engine.

73
Topical Locality
 Topical locality is a necessary condition for a topical crawler to work,
and for surfing to be a worthwhile activity for humans
 Links must encode semantic information, i.e. say something about
neighbour pages, not be random.
 It is also a sufficient condition if we start from “good” seed pages.
 Crawling algorithms can use cues from words and hyperlinks,
associated respectively with a lexical and a link topology.
 In the former, two pages are close to each other if they have similar
textual content; in the latter, if there is a short path between them.

74
Topical Locality
 From a crawler's perspective, there are two central questions:

1. link-content conjecture: whether two pages that link to each other
are more likely to be lexically similar to each other, compared to two
randomly selected pages.

This makes the assumption that pages which link to each other are
closely topically related, i.e. a page on bananas is likely to link to other
pages about bananas (or at least fruit).

75
Topical Locality
2. link-cluster conjecture: whether two pages that link to each other are
more likely to be semantically related to each other, compared to two
randomly selected pages.

This assumes that pages which are clustered together, or in the same
"web-community" are closely topically related, i.e. the page about
bananas from above links to a page about fruit which links to a page
about food - all three are closely topically related.

76
Link-Content Conjecture

• Let us use the cosine similarity measure σ(p1, p2) between pages p1
and p2.
• We can measure the link distance δ(p1, p2) along the shortest directed
path from p1 to p2, revealed by the breadth-first crawl.
• Both distances δ(q, p) and similarities σ(q, p) were averaged for each
topic q over all pages p in the crawl set P_d^q for each depth d:

δ(q, d) = (1/N_d^q) Σ_{p ∈ P_d^q} δ(q, p)
σ(q, d) = (1/N_d^q) Σ_{p ∈ P_d^q} σ(q, p)

• where N_d^q is the size of the cumulative page set P_d^q = {p | δ(q, p) ≤ d}.

77
Link-Content Conjecture

• More specifically, if a crawler can obtain inlinks to good pages (by
querying a search engine), it can use co-citation to detect hubs.
• If a page links to several good pages, it is probably a good hub and all
its out-links should be given high priority.
• This strategy, related to the so-called sibling locality, has been used in
focused crawlers and in topical crawlers for business intelligence.
• In addition to co-citation, one could look at bibliographic coupling: if
several good pages link to a certain page, that target is likely to be a
good authority so it and its in-links should be given high priority.

78
Link-Content Conjecture

79
Link-Cluster Conjecture

• The link-cluster conjecture, also known as linkage locality, states that
one can infer the meaning of a page by looking at its neighbours.
• The same exhaustive crawl data used to validate the link-content
conjecture can also be used to explore the link-cluster conjecture,
namely the extent to which relevance is preserved within link space
neighbourhoods and the decay in expected relevance as one browses
away from a relevant page.
• The link-cluster conjecture can be simply formulated in terms of the
conditional probability that a page p is relevant with respect to some
query q, given that page r is relevant and that p is within d links from r:

R_q(d) = Pr(rel_q(p) | rel_q(r) ∧ δ(r, p) ≤ d)

where rel_q() is a binary relevance assessment with respect to q.


80
Link-Cluster Conjecture

• R_q(d) is the posterior relevance probability given the evidence of a
relevant page nearby.
• The conjecture is then represented by the likelihood ratio λ(q, d)
between R_q(d) and the prior relevance probability G_q ≡ Pr(rel_q(p)), also
known as the generality of the query.
• If semantic inferences are possible within a link radius d, then the
following condition must hold:

λ(q, d) ≡ R_q(d) / G_q > 1
81
Correlation between different similarity measures

82
TF - IDF

 TF-IDF, short for Term Frequency–Inverse Document Frequency, is
a numerical statistic that is intended to reflect how important a word is
to a document in a collection or corpus.

 Variations of the tf–idf weighting scheme are often used by search
engines as a central tool in scoring and ranking a
document's relevance given a user query.

 Nowadays, tf-idf is one of the most popular term-weighting schemes.
For instance, 83% of text-based recommender systems in the domain
of digital libraries use tf-idf.

83
TF - IDF

 Suppose we have a set of English text documents and wish to
determine which document is most relevant to the query "the brown
cow".

 A simple way to start out is by eliminating documents that do not
contain all three words "the", "brown", and "cow", but this still leaves
many documents.

 To further distinguish them, we might count the number of times each
term occurs in each document; the number of times a term occurs in a
document is called its term frequency.

84
TF - IDF

 Because the term "the" is so common, term frequency will tend to
incorrectly emphasize documents which happen to use the word "the"
more frequently, without giving enough weight to the more meaningful
terms "brown" and "cow".

 The term "the" is not a good keyword to distinguish relevant and non-
relevant documents and terms, unlike the less common words "brown"
and "cow".

 Hence an inverse document frequency factor is incorporated which
diminishes the weight of terms that occur very frequently in the
document set and increases the weight of terms that occur rarely.
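A sketch of one common tf-idf variant (raw term frequency times log inverse document frequency); many other weighting variants exist.

```python
import math

docs = [["the", "brown", "cow"],
        ["the", "the", "quick", "brown", "fox"],
        ["the", "lazy", "dog"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                              # term frequency
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # inverse document frequency
    return tf * idf

print(tf_idf("the", docs[0], docs))   # 0.0   -- "the" appears in every document
print(tf_idf("cow", docs[0], docs))   # ~1.10 -- the rarer term gets more weight
```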
85
Jaccard’s Coefficient of Link Neighbourhood

 The Jaccard Coefficient, also known as the Jaccard index or Jaccard
similarity coefficient, is a statistical measure used for comparing the
similarity of sample sets.
 It is usually denoted as J(x, y), where x and y represent two different
nodes in a network.
 In link prediction, all the neighbours of a node are treated as a set and
the prediction is done by computing and ranking the similarity of the
neighbour sets of each node pair.
 This method is based on the Common Neighbours method and its
complexity is also O(Nk²).

86
Jaccard’s Coefficient of Link Neighbourhood

 The mathematical expression of this method is as follows:

Score(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|

where Γ(x) denotes the set of neighbours of node x.

 The Jaccard / Tanimoto coefficient is one of the metrics used to compare
the similarity and diversity of sample sets. It uses the ratio of the
intersecting set to the union set as the measure of similarity. Thus it
equals zero if there are no intersecting elements and equals one if
all elements intersect. The equation for the Jaccard / Tanimoto coefficient is

Jaccard coefficient = Nc / (Na + Nb - Nc)

 where
Na - number of elements in set A, Nb - number of elements in set B,
Nc - number of elements in the intersecting set
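A small sketch of the Jaccard coefficient over neighbour sets (the sets are toy data):

```python
def jaccard(neighbours_x, neighbours_y):
    intersection = len(neighbours_x & neighbours_y)   # Nc
    union = len(neighbours_x | neighbours_y)          # Na + Nb - Nc
    return intersection / union if union else 0.0

print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))      # 2 / 4 = 0.5
```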
87
Evaluation of Topical Crawlers

 Goal: build “better” crawlers to support applications


 Build an unbiased evaluation framework
 Define common tasks of measurable difficulty
 Identify topics, relevant targets
 Identify appropriate performance measures
 Effectiveness: quality of crawler pages, order, etc.
 Efficiency: separate CPU & memory of crawler algorithms from
bandwidth & common utilities
 Perhaps the most crucial evaluation of a crawler is to measure the rate
at which relevant web pages are acquired and how effectively irrelevant
web pages are filtered out by the crawler.

88
Evaluation of Topical Crawlers

 With this knowledge, we could estimate the precision and recall of a
crawler after crawling n web pages.
 The precision would be the fraction of pages crawled that are relevant
to the topic, and the recall would be the fraction of relevant pages crawled.
 However, the relevant set for any given topic is unknown in the web, so
the true recall is hard to measure.
 Therefore, we adopt harvest rate and target recall to evaluate the
performance of the crawler.
 In case we have boolean relevance scores, we could measure the rate
at which “good” pages are found; if 100 relevant pages are found in the
first 500 pages crawled, we have an acquisition rate or harvest rate of
20% at 500 pages.
89
Evaluation of Topical Crawlers

 The harvest rate is the fraction of web pages crawled that are relevant
to the given topic, which measures how well the crawler is doing at
rejecting irrelevant web pages. The expression is given by:

HR = (1/V) Σ_{i=1}^{V} r_i

 where
 V is the number of web pages crawled by the crawler so far;
 r_i is the relevance between web page i and the given topic, and the
value of r_i can only be 0 or 1.
 If relevant, then r_i = 1; otherwise r_i = 0.

90
Evaluation of Topical Crawlers

 The target recall is the fraction of relevant pages crawled, which
measures how well the crawler is doing at finding all the relevant web pages.
 However, the relevant set for any given topic is unknown in the Web, so
the true target recall is hard to measure.
 In view of this situation, we delineate a specific network: given a
set of seed URLs and a certain depth, the range reached by a crawler
using a breadth-first crawling strategy is the virtual Web.
 We assume that the target set T is the relevant set in the virtual Web; C_t
is the set of the first t pages crawled. The expression is given by:

target recall(t) = |T ∩ C_t| / |T|
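A small sketch of both measures (the relevance flags and the target set are toy data):

```python
def harvest_rate(relevance_flags):
    """Fraction of crawled pages that are relevant (each flag is 0 or 1)."""
    return sum(relevance_flags) / len(relevance_flags)

def target_recall(crawled_urls, target_urls):
    """Fraction of the known target set found among the crawled pages."""
    return len(target_urls & set(crawled_urls)) / len(target_urls)

print(harvest_rate([1, 0, 1, 1, 0]))                          # 3/5 = 0.6
print(target_recall(["u1", "u2", "u3"], {"u2", "u3", "u9"}))  # 2/3 ≈ 0.67
```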

91
Evaluation of Topical Crawlers

 If the relevance scores are continuous (e.g., from cosine similarity or a
trained classifier) they can be averaged over the crawled pages.
 Sometimes running averages are calculated over a window of a number
of pages, e.g., the last 50 pages from a current crawl point.
 Another measure from information retrieval that has been applied to
crawler evaluation is search length, defined as the number of pages
(or the number of irrelevant pages) crawled before a certain percentage
of the relevant pages are found.
 Search length is akin to the reciprocal of precision for a preset level of
recall.

92
Evaluation of Topical Crawlers

 If a set of known relevant target pages is used to measure the
performance of a topical crawler, these same pages cannot be used as
seeds for the crawl.
 Two approaches have been proposed to obtain suitable seed pages.
One is to perform a back-crawl from the target pages.
 By submitting link: queries to a search engine API, one can obtain a list
of pages linking to each given target; the process can be repeated from
these parent pages to find “grandparent” pages, and so on until a
desired link distance is reached.
 The greater the link distance, the harder the task is for the crawler to
locate the relevant targets from these ancestor seed pages.

93
Evaluation of Topical Crawlers

 A second approach is to split the set of known relevant pages into two
sets; one set can be used as seeds, the other as targets.
 While there is no guarantee that the targets are reachable from the
seeds, this approach is significantly simpler because no back-crawl is
necessary.
 Another advantage is that each of the two relevant subsets can be used
in turn as seeds and targets.
 In this way, one can measure the overlap between the pages crawled
starting from the two disjoint sets.

94
Evaluation of Topical Crawlers

 The use of known relevant pages as proxies for unknown relevant sets
implies an important assumption, which is illustrated by the Venn
diagram.
 Here S is a set of crawled pages and T is the set of known relevant
target pages, a subset of the relevant set R.

95
Evaluation of Topical Crawlers

 The recall |R ∩ S| / |R| is estimated by |T ∩ S| / |T|.
This approximation only holds if T is a representative, unbiased sample
of R, independent of the crawl process.
 While the crawler attempts to cover as much as possible of R, it should
not have any information about how pages in T are sampled from R.
 If T and S are not independent, the measure is biased and unreliable.
 For example if a page had a higher chance of being selected in T
because it was in S, or vice versa, then the recall would be
overestimated.

96
Evaluation of Topical Crawlers

 Crawler performance measures can be characterized along two
dimensions: the source of relevance assessments (target pages vs.
similarity to their descriptions) and the normalization factor (average
relevance, or precision, vs. total relevance, or recall).
 Using target pages as the relevant sets we can define crawler precision
and recall as follows:

precision(t) = |S_t ∩ T_θ| / |S_t|        recall(t) = |S_t ∩ T_θ| / |T_θ|

 where S_t is the set of pages crawled at time t (t can be wall clock time,
network latency, number of pages visited, number of bytes downloaded,
and so on). T_θ is the relevant target set, where θ represents the
parameters used to select the relevant target pages.
97
Evaluation of Topical Crawlers

 Analogously we can define crawler precision and recall based on
similarity to target descriptions:

precision(t) = (1/|S_t|) Σ_{p ∈ S_t} σ(p, D_θ)        recall(t) = Σ_{p ∈ S_t} σ(p, D_θ)

 where D_θ is the textual description of the target pages, selected with
parameters θ, and σ is a text-based similarity function, e.g., cosine
similarity.

98
Crawler Ethics and Conflicts
 Crawlers can cause trouble, even unwillingly, if not properly designed to
be “polite” and “ethical”
 For example, sending too many requests in rapid succession to a single
server can amount to a Denial of Service (DoS) attack!
 Server administrator and users will be upset
 Crawler developer/admin IP address may be blacklisted

99
Crawler Etiquette
 Identify yourself
 Use ‘User-Agent’ HTTP header to identify crawler, website with
description of crawler and contact information for crawler developer
 Use ‘From’ HTTP header to specify crawler developer email
 Do not disguise crawler as a browser by using their ‘User-Agent’
string
 Always check that HTTP requests are successful, and in case of error,
use HTTP error code to determine and immediately address problem
 Pay attention to anything that may lead to too many requests to any one
server, even unwillingly, e.g.:
 redirection loops
 spider traps
100
Crawler Etiquette
 Spread the load, do not overwhelm a server
 Make sure that no more than some max. number of requests to any
single server per unit time, say < 1/second
 Honor the Robot Exclusion Protocol
 A server can specify which parts of its document tree any crawler is
or is not allowed to crawl by a file named ‘robots.txt’ placed in the
HTTP root directory, e.g. http://www.indiana.edu/robots.txt
 Crawler should always check, parse, and obey this file before
sending any requests to a server
 More info at:
 http://www.google.com/robots.txt
 http://www.robotstxt.org/wc/exclusion.html
101
Crawler Ethics Issues
 Is compliance with robot exclusion a matter of law?
 No! Compliance is voluntary, but if you do not comply, you may be
blocked
 Someone (unsuccessfully) sued Internet Archive over a robots.txt
related issue
 Some crawlers disguise themselves
 Using false User-Agent
 Randomizing access frequency to look like a human/browser
 Example: click fraud for ads

102
Crawler Ethics Issues
 Servers can disguise themselves, too
 Cloaking: present different content based on UserAgent
 E.g. stuff keywords on version of page shown to search engine
crawler
 Search engines do not look kindly on this type of “spamdexing” and
remove from their index sites that perform such abuse
 Case of bmw.de made the news

103
Gray Areas of Crawler Ethics
 If you write a crawler that unwillingly follows links to ads, are you just
being careless, or are you violating terms of service, or are you violating
the law by defrauding advertisers?
 Is non-compliance with Google’s robots.txt in this case equivalent to
click fraud?
 If you write a browser extension that performs some useful service,
should you comply with robot exclusion?

104
New Crawling Code?

 Reference C implementation of HTTP, HTML parsing, etc


 w3c-libwww package from World-Wide Web Consortium: www.w3c.org/Library/
 LWP (Perl)
http://www.oreilly.com/catalog/perllwp/
http://search.cpan.org/~gaas/libwww-perl-5.804/
 Open source crawlers/search engines
Nutch: http://www.nutch.org/ (Jakarta Lucene: jakarta.apache.org/lucene/)
Heritrix: http://crawler.archive.org/
WIRE: http://www.cwr.cl/projects/WIRE/
Terrier: http://ir.dcs.gla.ac.uk/terrier/
 Open source topical crawlers, Best-First-N (Java)
http://informatics.indiana.edu/fil/IS/JavaCrawlers/
 Evaluation framework for topical crawlers (Perl)
http://informatics.indiana.edu/fil/IS/Framework/

105
Thank You for Your Attention !

106
