
Focused crawler

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by
carefully prioritizing the crawl frontier and managing the hyperlink exploration process.[1] Some predicates
may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to
crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages
about baseball", or "crawl pages with large PageRank". An important page property pertains to topics,
leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar
power, swine flu, or even more abstract concepts like controversy[2] while minimizing resources spent
fetching pages on other topics. Crawl frontier management may not be the only device used by focused
crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.
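As a concrete illustration of frontier prioritization under a hard, deterministic predicate (such as the `.jp`-domain example above), the sketch below keeps a scored priority queue and discards out-of-scope links before they are queued. The `CrawlFrontier` class and scores are invented for illustration; a real crawler would compute scores from page features.

```python
import heapq
from urllib.parse import urlparse

def domain_predicate(url, suffix=".jp"):
    """Hard predicate: keep only URLs whose hostname ends in the given suffix."""
    host = urlparse(url).hostname or ""
    return host.endswith(suffix)

class CrawlFrontier:
    """Priority queue of (score, url); higher-scored URLs are fetched sooner."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Apply the crawl predicate before the URL ever enters the frontier.
        if url not in self._seen and domain_predicate(url):
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: max-heap behavior

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

frontier = CrawlFrontier()
frontier.push("https://example.jp/solar", 0.9)
frontier.push("https://example.com/other", 0.5)   # rejected: not in .jp
frontier.push("https://www.example.jp/news", 0.4)
```

Popping the frontier then yields the highest-scored in-scope URL first; the out-of-scope `.com` link never consumes fetch budget.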

A focused crawler must predict the probability that an unvisited page will be relevant before actually
downloading the page.[3] A possible predictor is the anchor text of links; this was the approach taken by
Pinkerton[4] in a crawler developed in the early days of the Web. Topical crawling was first introduced by
Filippo Menczer.[5][6] Chakrabarti et al. coined the term 'focused crawler' and used a text classifier[7] to
prioritize the crawl frontier. Andrew McCallum and co-authors also used reinforcement learning[8][9] to
focus crawlers. Diligenti et al. traced the context graph[10] leading up to relevant pages, and their text
content, to train classifiers. A form of online reinforcement learning has been used, along with features
extracted from the DOM tree and text of linking pages, to continually train[11] classifiers that guide the
crawl. In a review of topical crawling algorithms, Menczer et al.[12] show that simple strategies are
very effective for short crawls, while more sophisticated techniques such as reinforcement learning and
evolutionary adaptation can give the best performance over longer crawls. It has been shown that spatial
information is important to classify Web documents.[13]
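A minimal sketch of anchor-text-based relevance prediction, in the spirit of Pinkerton's early approach: score an unvisited link by the similarity between its anchor text and a topic description. This uses a plain bag-of-words cosine similarity as a stand-in for the trained classifiers described above; the topic vocabulary and function names are illustrative.

```python
import math
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase, split on whitespace, keep alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]

def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical topic description for a crawl about solar power.
topic = Counter(tokenize("solar power photovoltaic energy panel grid"))

def link_priority(anchor_text):
    """Predict relevance of an unvisited page from its link's anchor text alone."""
    return cosine(Counter(tokenize(anchor_text)), topic)
```

A link anchored "solar panel energy" would then be prioritized over one anchored "football scores", without fetching either page.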

Another type of focused crawler is the semantic focused crawler, which makes use of domain ontologies to represent topical maps and to link Web pages with relevant ontological concepts for selection and categorization purposes.[14] In addition, ontologies can be automatically updated during the crawling process. Dong et al.[15] introduced such an ontology-learning-based crawler that uses a support vector machine to update the content of ontological concepts while crawling Web pages.
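The ontology-driven selection idea can be sketched as follows, assuming a toy hand-built ontology rather than a learned one: pages are scored by the weighted ontological concepts whose terms they mention. The concepts, weights, and vocabularies here are invented for illustration.

```python
# Hypothetical mini-ontology: concept -> (weight, associated terms).
# A semantic focused crawler would derive and update these from a real
# domain ontology rather than hard-coding them.
ONTOLOGY = {
    "renewable_energy": (1.0, {"solar", "wind", "photovoltaic"}),
    "power_grid":       (0.5, {"grid", "transmission", "voltage"}),
}

def ontology_score(page_terms):
    """Sum the weights of concepts whose vocabulary overlaps the page's terms."""
    terms = set(page_terms)
    return sum(weight for weight, vocab in ONTOLOGY.values() if terms & vocab)
```

A page mentioning both "solar" and "grid" matches both concepts and outranks a page matching neither.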

Crawlers are also focused on page properties other than topics. Cho et al.[16] study a variety of crawl prioritization policies and their effects on the link popularity of fetched pages. Najork and Wiener[17] show that breadth-first crawling, starting from popular seed pages, leads to collecting large-PageRank pages early in the crawl. Refinements involving detection of stale (poorly maintained) pages have been reported by Eiron et al.[18] A kind of semantic focused crawler making use of reinforcement learning was introduced by Meusel et al.,[19] who combined online classification algorithms with a bandit-based selection strategy to efficiently crawl pages containing markup languages such as RDFa, Microformats, and Microdata.
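A bandit-based selection strategy in this spirit can be sketched as an epsilon-greedy selector over host buckets, where the reward signals whether a fetched page contained structured-data markup. This is an illustrative simplification, not Meusel et al.'s actual algorithm; the class and host names are hypothetical.

```python
import random

class EpsilonGreedyHostSelector:
    """Pick which host to crawl next; reward = 1.0 if the page had markup."""
    def __init__(self, hosts, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)   # seeded for reproducibility
        self.counts = {h: 0 for h in hosts}
        self.values = {h: 0.0 for h in hosts}  # running mean reward per host

    def select(self):
        # Explore a random host with probability epsilon, else exploit the best.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, host, reward):
        # Incremental running-mean update for the chosen host's value.
        self.counts[host] += 1
        self.values[host] += (reward - self.values[host]) / self.counts[host]
```

With epsilon set to zero the selector becomes purely greedy, always crawling the host whose pages have yielded structured data most often so far.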

The performance of a focused crawler depends on the richness of links in the specific topic being searched,
and focused crawling usually relies on a general web search engine for providing starting points.
Davison[20] presented studies on Web links and text that explain why focused crawling succeeds on broad
topics; similar studies were presented by Chakrabarti et al.[21] Seed selection can be important for focused
crawlers and significantly influence crawling efficiency.[22] A whitelist strategy is to start the focused crawl from a list of high-quality seed URLs and to limit the crawling scope to the domains of those URLs. The high-quality seeds should be selected from candidate URLs accumulated over a sufficiently long period of general Web crawling, and the whitelist should be updated periodically after it is created.
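A minimal sketch of the whitelist scoping rule described above, with hypothetical seed URLs; a real deployment would also normalize hostnames, handle subdomains, and refresh the list periodically.

```python
from urllib.parse import urlparse

# Hypothetical high-quality seed URLs accumulated from prior crawling.
SEED_WHITELIST = [
    "https://www.energy.example.org/",
    "https://solar.example.edu/research/",
]

# The crawl scope is restricted to the domains of the whitelisted seeds.
ALLOWED_DOMAINS = {urlparse(u).hostname for u in SEED_WHITELIST}

def in_scope(url):
    """Return True if the URL's host belongs to a whitelisted seed domain."""
    return urlparse(url).hostname in ALLOWED_DOMAINS
```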

References
1. Soumen Chakrabarti, Focused Web Crawling (http://www.springerreference.com/docs/html/chapterdbid/63300.html), in the Encyclopedia of Database Systems (http://www.springerreference.com/docs/navigation.do?m=Encyclopedia+of+Database+Systems+(Computer+Science)-book65).
2. Controversial topics (https://www.semanticjuice.com/controversial-topics/)
3. Improving the Performance of Focused Web Crawlers (http://www.intelligence.tuc.gr/~petrakis/publications/BaPeMi09.pdf), Sotiris Batsakis, Euripides G. M. Petrakis, Evangelos Milios, 2012-04-09
4. Pinkerton, B. (1994). Finding what people want: Experiences with the WebCrawler (http://www.thinkpink.com/bp/WebCrawler/WWW94.html). In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.
5. Menczer, F. (1997). ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery (http://informatics.indiana.edu/fil/Papers/ICML.ps) Archived (https://web.archive.org/web/20121221113620/http://informatics.indiana.edu/fil/Papers/ICML.ps) 2012-12-21 at the Wayback Machine. In D. Fisher, ed., Proceedings of the 14th International Conference on Machine Learning (ICML97). Morgan Kaufmann.
6. Menczer, F. and Belew, R.K. (1998). Adaptive Information Agents in Distributed Textual Environments (http://informatics.indiana.edu/fil/Papers/AA98.ps) Archived (https://web.archive.org/web/20121221113630/http://informatics.indiana.edu/fil/Papers/AA98.ps) 2012-12-21 at the Wayback Machine. In K. Sycara and M. Wooldridge (eds.) Proceedings of the 2nd International Conference on Autonomous Agents (Agents '98). ACM Press.
7. Focused crawling: a new approach to topic-specific Web resource discovery (http://www8.org/w8-papers/5a-search-query/crawling/index.html), Soumen Chakrabarti, Martin van den Berg and Byron Dom, WWW 1999.
8. A machine learning approach to building domain-specific search engines (http://dl.acm.org/citation.cfm?id=1624313), Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, IJCAI 1999.
9. Using Reinforcement Learning to Spider the Web Efficiently (http://dl.acm.org/citation.cfm?id=657633), Jason Rennie and Andrew McCallum, ICML 1999.
10. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused crawling using context graphs (http://nautilus.dii.unisi.it/pubblicazioni/files/conference/2000-Diligenti-VLDB.pdf) Archived (https://web.archive.org/web/20080307065612/http://nautilus.dii.unisi.it/pubblicazioni/files/conference/2000-Diligenti-VLDB.pdf) 2008-03-07 at the Wayback Machine. In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.
11. Accelerated focused crawling through online relevance feedback (http://dl.acm.org/citation.cfm?id=511466), Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam, WWW 2002.
12. Menczer, F., Pant, G., and Srinivasan, P. (2004). Topical Web Crawlers: Evaluating Adaptive Algorithms (http://doi.acm.org/10.1145/1031114.1031117). ACM Trans. on Internet Technology 4(4): 378–419.
13. Recognition of common areas in a Web page using visual information: a possible application in a page classification (https://ieeexplore.ieee.org/document/1183910/), Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko Milutinovic, IEEE International Conference on Data Mining (ICDM), 2002.
14. Dong, H., Hussain, F.K., Chang, E.: State of the art in semantic focused crawlers (https://www.researchgate.net/publication/44241179_State_of_the_Art_in_Semantic_Focused_Crawlers). Computational Science and Its Applications – ICCSA 2009. Springer-Verlag, Seoul, Korea (July 2009) pp. 910-924.
15. Dong, H., Hussain, F.K.: SOF: A semi-supervised ontology-learning-based focused crawler (https://www.researchgate.net/publication/264620349_SOF_A_semi-supervised_ontology-learning-based_focused_crawler). Concurrency and Computation: Practice and Experience 25(12) (August 2013) pp. 1623-1812.
16. Junghoo Cho, Hector Garcia-Molina, Lawrence Page: Efficient Crawling Through URL Ordering (http://dl.acm.org/citation.cfm?id=297835). Computer Networks 30(1-7): 161-172 (1998).
17. Marc Najork, Janet L. Wiener: Breadth-first crawling yields high-quality pages (http://dl.acm.org/citation.cfm?id=371965). WWW 2001: 114-118.
18. Nadav Eiron, Kevin S. McCurley, John A. Tomlin: Ranking the web frontier (http://dl.acm.org/citation.cfm?id=988714). WWW 2004: 309-318.
19. Meusel R., Mika P., Blanco R. (2014). Focused Crawling for Structured Data (http://dl.acm.org/citation.cfm?doid=2661829.2661902). ACM International Conference on Information and Knowledge Management, pages 1039-1048.
20. Brian D. Davison: Topical locality in the Web (http://dl.acm.org/citation.cfm?doid=345508.345597). SIGIR 2000: 272-279.
21. Soumen Chakrabarti, Mukul Joshi, Kunal Punera, David M. Pennock: The structure of broad topics on the Web (http://dl.acm.org/citation.cfm?id=511480). WWW 2002: 251-262.
22. Jian Wu, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Prasenjit Mitra, Shuyi Zheng, C. Lee Giles: The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists (http://dl.acm.org/citation.cfm?id=2380718.2380762). In Proceedings of the 3rd Annual ACM Web Science Conference, pages 340-343, Evanston, IL, USA, June 2012.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Focused_crawler&oldid=1155353827"
