
Web Crawling

INF 141: Information Retrieval 
Discussion Session 
Week 4 – Winter 2008

Yasser Ganjisaffar

Open Source Web Crawlers

Heritrix

WebSphinx

Nutch

Crawler4j

Heritrix
• Internet Archive's Crawler
• Web-Scale
• Distributed
• Extensible
• Command line tool
• Web-based Management Interface

Internet Archive
• Dedicated to building and maintaining a free and openly accessible online digital library, including an archive of the Web.

Internet Archive's WayBack Machine

Google - 1998

Nutch
• Apache's Open Source Search Engine
• Distributed
• Tested with 100M Pages

WebSphinx
• 1998-2002
• Single Machine
• Lots of problems (memory leaks, …)
• Reported to be very slow

Crawler4j
• Single Machine
• Should Easily Scale to 20M Pages
• Very Fast: crawled and processed the whole English Wikipedia in 10 hours (including time for extracting palindromes and storing link structure and text of articles)

What is a docid?
• A unique sequential integer value that uniquely identifies a Page/URL.
• Why use it?
  – Storing links:
    • "http://www.ics.uci.edu/" - "http://www.ics.uci.edu/about" (53 bytes)
    • 120 - 123 (8 bytes)
  – …
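The space saving in the link example can be checked directly. A small sketch (the exact 53 bytes on the slide presumably includes a separator between the two URLs):

```java
import java.nio.charset.StandardCharsets;

public class DocidSavings {
    public static void main(String[] args) {
        String from = "http://www.ics.uci.edu/";
        String to   = "http://www.ics.uci.edu/about";
        // Storing the link as two URL strings (ASCII, so 1 byte per char):
        int asStrings = from.getBytes(StandardCharsets.UTF_8).length
                      + to.getBytes(StandardCharsets.UTF_8).length;
        // Storing the same link as two 4-byte docids, e.g. 120 and 123:
        int asDocids = 2 * Integer.BYTES;
        System.out.println(asStrings); // 51 bytes for the two URLs alone
        System.out.println(asDocids);  // 8 bytes
    }
}
```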

Docid Server
• Should be extremely fast; otherwise it would be a bottleneck.

Docid Server

public static synchronized int getDocID(String URL) {
    if (there is any key-value pair for key = URL) {
        return value;
    } else {
        lastdocid++;
        put (URL, lastdocid) in storage;
        return lastdocid;
    }
}
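The pseudocode above, made concrete with an in-memory map (a sketch only; the real Docid Server would back this with persistent storage such as Berkeley DB):

```java
import java.util.HashMap;
import java.util.Map;

public class DocidServer {
    private final Map<String, Integer> docids = new HashMap<>();
    private int lastdocid = 0;

    // synchronized: concurrent crawler threads must never be handed duplicate ids
    public synchronized int getDocID(String url) {
        Integer existing = docids.get(url);
        if (existing != null) {
            return existing;          // already assigned: return the same docid
        }
        lastdocid++;                  // otherwise assign the next sequential id
        docids.put(url, lastdocid);
        return lastdocid;
    }
}
```

The first URL seen gets docid 1, the next new URL gets 2, and repeated lookups of the same URL always return the same value.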

Docid Server
• Key-value pairs are stored in a B+-tree data structure.
• Berkeley DB as the storage engine

Berkeley DB
• Unlike traditional database systems like MySQL and others, Berkeley DB comes in the form of a jar file which is linked to the Java program and runs in the process space of the crawler.
• No need for inter-process communication and waiting for context switches between processes.
• You can think of it as a large HashMap: Key → Value
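Berkeley DB itself is a library you link in (its jar file); purely as a mental model, an in-process sorted map shows the same get/put surface, with keys kept in order the way a B+-tree keeps them, and accessed with plain method calls instead of IPC:

```java
import java.util.TreeMap;

public class KeyValueSketch {
    public static void main(String[] args) {
        // Sorted key-value store: same get/put surface as a HashMap,
        // but keys are kept ordered, as in a B+-tree.
        TreeMap<String, Integer> store = new TreeMap<>();
        store.put("http://www.ics.uci.edu/", 120);
        store.put("http://www.ics.uci.edu/about", 123);
        System.out.println(store.get("http://www.ics.uci.edu/about")); // 123
        System.out.println(store.firstKey()); // lexicographically smallest key
    }
}
```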

Crawler4j
• It is not polite
  – Does not respect robots.txt limitations
  – Does not limit the number of requests sent to a host per second
    • Crawler4j has a history of sending 200 requests/second
  – Introduces itself as a Firefox agent!
    • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
• For example:
  – Wikipedia's policy does not allow bots to send requests faster than 1 request/second.
• Compare with Google's user agent:
  – Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
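For contrast, a minimal sketch of the kind of per-host throttle a polite crawler needs; the class name HostThrottle and its API are illustrative, not part of crawler4j:

```java
import java.util.HashMap;
import java.util.Map;

public class HostThrottle {
    private final long minDelayMillis; // e.g. 1000 for Wikipedia's 1 request/second rule
    private final Map<String, Long> lastRequest = new HashMap<>();

    public HostThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Blocks until at least minDelayMillis has passed since the
    // previous request to this host, then records the new request time.
    public synchronized void acquire(String host) throws InterruptedException {
        Long last = lastRequest.get(host);
        if (last != null) {
            long wait = last + minDelayMillis - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
        }
        lastRequest.put(host, System.currentTimeMillis());
    }
}
```

Note the simplification: holding the lock while sleeping also delays requests to other hosts; a real crawler would keep a separate queue or lock per host.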

Crawler4j
• Only Crawls Textual Content
  – Don't try to download images and other media with it.
• Assumes that the page is encoded in UTF-8 format.

Crawler4j
• There is another version of crawler4j which is:
  – Polite
  – Supports all types of content
  – Supports all encodings of text and automatically detects the encoding
  – Not open source :)

Crawler4j - Crawler of Wikijoo

Your job?

Crawler4j Objects
• Page
  – String html: getHTML()
  – String text: getText()
  – String title: getTitle()
  – WebURL url: getWebURL()
  – ArrayList<WebURL> urls: getURLs()
• WebURL
  – String url: getURL()
  – int docid: getDocid()

Assignment 03
• Feel free to use any crawler you like.
• You might want to filter pages which are not in the main namespace:
  – http://en.wikipedia.org/wiki/Talk:Main_Page
  – http://en.wikipedia.org/wiki/Wikipedia:Searching
  – http://en.wikipedia.org/wiki/Category:Linguistics
  – http://en.wikipedia.org/wiki/Special:Random
  – Image:, File:, Help:, Media:, …
• Set maximum heap size for java:
  – java -Xmx1024M -cp .:crawler4j.jar ir.assignment03.Controller
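One way to implement such a filter, sketched as a shouldVisit-style check; the regex and the class name are assumptions for illustration, not part of the assignment:

```java
import java.util.regex.Pattern;

public class NamespaceFilter {
    // Pages outside the main namespace carry a "Prefix:" right after /wiki/.
    private static final Pattern NON_MAIN = Pattern.compile(
        "^http://en\\.wikipedia\\.org/wiki/"
        + "(Talk|Wikipedia|Category|Special|Image|File|Help|Media):.*");

    private static final Pattern WIKI_PAGE =
        Pattern.compile("^http://en\\.wikipedia\\.org/wiki/.+");

    // Accept only article pages in the main namespace.
    public static boolean shouldVisit(String url) {
        return WIKI_PAGE.matcher(url).matches()
            && !NON_MAIN.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://en.wikipedia.org/wiki/Main_Page"));      // true
        System.out.println(shouldVisit("http://en.wikipedia.org/wiki/Talk:Main_Page")); // false
    }
}
```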

Assignment 03
• Dump your partial results
  – For example, after processing each 5000 pages write the results in a text file.
• Use nohup command on remote machines.
  – nohup java -Xmx1024M -cp .:crawler4j.jar ir.assignment03.Controller
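A sketch of the partial-dump idea: buffer results and write a numbered file after every N pages (the file naming here is illustrative):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PartialDumper {
    private final int interval;                      // dump every N pages
    private final List<String> buffer = new ArrayList<>();
    private int processed = 0;
    private int dumps = 0;

    public PartialDumper(int interval) {
        this.interval = interval;
    }

    // Call once per processed page; triggers a dump every `interval` pages.
    public void record(String resultLine) throws IOException {
        buffer.add(resultLine);
        processed++;
        if (processed % interval == 0) {
            flush();
        }
    }

    // Write buffered results to the next numbered partial file.
    public void flush() throws IOException {
        if (buffer.isEmpty()) return;
        Path out = Path.of("partial-" + (++dumps) + ".txt"); // illustrative name
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            buffer.forEach(w::println);
        }
        buffer.clear();
    }

    public int dumpCount() {
        return dumps;
    }
}
```

This way a crash (or a killed remote session) loses at most the last N pages of work.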

Shell Scripts
• run.sh
• Available online: http://www.ics.uci.edu/~yganjisa/TA/

Shell Scripts
• run-nohup.sh
• Check logs
• Available online: http://www.ics.uci.edu/~yganjisa/TA/

Tips
• Do not print on screen frequently.
• Your seeds should be accepted in the shouldVisit() function.
  – Pattern: http://en.wikipedia.org/wiki/*
  – Bad seed: http://en.wikipedia.org/
  – Good seed: http://en.wikipedia.org/wiki/Main_Page
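A minimal illustration of why the bad seed fails: with the pattern above expressed as a simple prefix test, the bad seed is rejected by shouldVisit(), so the crawl never gets started:

```java
public class SeedCheck {
    // The http://en.wikipedia.org/wiki/* pattern as a prefix test.
    static boolean shouldVisit(String url) {
        return url.startsWith("http://en.wikipedia.org/wiki/");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://en.wikipedia.org/"));               // bad seed: rejected
        System.out.println(shouldVisit("http://en.wikipedia.org/wiki/Main_Page")); // good seed: accepted
    }
}
```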

Openlab
• From a Linux/MAC machine:
  – Connecting to shell:
    • ssh myicsid@openlab.ics.uci.edu
  – File transfer:
    • scp or other tools
• From a Windows machine:
  – Connecting to shell:
    • Download putty.exe
    • http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  – File transfer:
    • WinSCP (http://winscp.net/)

QUESTIONS?