
Homework: Crawling the World Wide Web with Apache Nutch

1. Objective
In Homework: Tika you used Tika parsers to extract text from the provided PDF files and helped search
for UFOs in the FBI's Vault dataset.
But it's time to take a step back. Let's consider: how did we obtain all those PDF files in the first place?
The short answer is that we crawled the FBIs Vault site http://vault.fbi.gov/ and downloaded all of the
relevant PDFs from that site, dropped them in a directory, and then packaged the whole directory into a
tarball.
We used Apache Nutch (http://nutch.apache.org/) to collect the vault.tar.gz dataset. Now, you get to
do it too! You'd think getting all the PDF files would be a snap. That would be true if they were named
with URLs that ended in .pdf, and provided that the PDFs were all available under a common
directory structure, like http://vault.fbi.gov/pdfs/xxx.pdf. Unfortunately, they're not.
Your goal in this assignment is to download, install and leverage the Apache Nutch web crawling
framework to obtain all of the PDFs from a tiny subset of the FBI's Vault website that we have mirrored
at http://fs.vsoeclassrooms.usc.edu/minivault/, and then to package those PDFs in order to create your
own vault.tar.gz file.

1.1 Individual Work


This assignment is to be done as individual work only.

2. Preliminaries
We will be making use of the UNIX servers of the Student Compute Facility (SCF) at aludra.usc.edu for
this exercise. We will assume a working knowledge of the SCF from CSCI571 or another prerequisite
course; if you are not familiar with it, please look up the ITS documentation available at:

http://www.usc.edu/its/unix/servers/scf.html
http://www.usc.edu/its/unix/
http://www.usc.edu/its/ssh/putty.html
http://www.usc.edu/its/ssh/
http://www.usc.edu/its/sftp/

Note: If you are an experienced Unix/Linux user with the OS already installed, you may want to consider doing this
exercise on your own computer. Check with the TA first to obtain permission before proceeding.

2.1 Making Space


This exercise will require the use of a very large portion of your SCF disk quota, so if you have used your
UNIX account for other courses or other purposes in the past, then odds are that you will need to back up
your data onto your own local computer.
This exercise will require at least:

- Nutch: 65 MB unpacked
- Crawl data: up to 200 MB
- Extracted PDFs: approximately 100 MB
- plus other assorted logs and outputs

We recommend that you have at least 400 MB free on your account before beginning this exercise!
You can check your current disk usage using the command:
quota -v

To see how much space is occupied by the contents of each of your directories, you can use:
du -k
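To find the biggest candidates for backup, one option is to summarize each top-level entry and sort the result (a sketch; the -s flag summarizes each directory instead of listing every subdirectory):

```shell
# List each top-level entry in your home directory with its size in KB,
# sorted smallest first, so the biggest space consumers appear last.
du -sk ~/* | sort -n
```
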

Further information on managing the storage space available on your SCF account can be found at:
http://www.usc.edu/its/accounts/unixquotas.html
WARNING: even when done correctly, this exercise requires a very large portion of your SCF disk quota. Be very
careful with your settings; if you botch something, you may end up exceeding your available disk space
and/or encounter other technical issues that can be painful to fix. In particular, pay close
attention to the scope of the crawl to ensure that it is properly confined to our mirror site and does not
extend beyond to FBI websites, or worse, the entire Internet!

2.2 Setting up Java


Nutch requires a more recent version of Java compared to the default settings on your SCF account.
You can check your current version of Java on aludra by running:
aludra.usc.edu(1): java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)
aludra.usc.edu(2):

If you get anything lower than 1.6.0, you will need to change it! Full instructions are in Appendix A.

Note: it is probably a good idea to back up all of your data before you begin this exercise anyway, in case something goes wrong.

3. Crawling using Apache Nutch


3.1 Downloading and Installing Apache Nutch
To get started with Apache Nutch, grab the latest release, available from:
http://www.apache.org/dyn/closer.cgi/nutch/
The release of Nutch that we will use in this assignment is Nutch 1.6. Grab the binary distribution
(the -bin.tar.gz file). You can download it directly into your home directory on aludra using:
wget http://<mirror site>/apache/nutch/1.6/apache-nutch-1.6-bin.tar.gz

And then unpack it:


gtar -xzvf apache-nutch-1.6-bin.tar.gz

Make sure to chmod +x bin/nutch, then test it out by running bin/nutch and confirming that
you see its command-line usage output.
Note: Once you have finished unpacking Nutch, you should delete the bin.tar.gz file to free up disk space.

3.2 Configuring Nutch


The best way to get started on configuring Nutch for your Web crawl is to read Sections 3 and 3.1 of
the Quick Start on the wiki:
http://wiki.apache.org/nutch/NutchTutorial
Pay particular attention to the configuration files that you will need to change/update.
Note: You must set the agent name and its related properties appropriately (this will be graded!).
As in previous assignments, you are advised to do any and all testing using a small crawl. You do not
want to spend several hours running a crawl only to find that you botched it and wasted your time.
Caution: Be very careful about the scope of your crawl! If your crawl is not properly confined to our
mirror site and extends beyond to FBI websites, or worse, the entire Internet, you may end up filling up
all available disk space and/or encounter other technical issues.

3.2.1 Politeness
Crawlers can retrieve data much quicker and in greater depth than human users, so they can have a
crippling impact on the performance of a site. A server would naturally have a hard time keeping up with
requests from multiple crawlers as some of you may have already discovered while doing the earlier
crawler assignment!
Except this time all of you are going to be torturing the same server. Please be kind to it if it dies under
your bombardment, none of you will be able to complete the assignment!
To avoid overloading the server, you should set an appropriate interval between successive requests to the
same server. (Do not exceed the default settings in Nutch!)
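Both the agent name from Section 3.2 and the fetch interval are set in conf/nutch-site.xml, which overrides nutch-default.xml. A minimal sketch (the agent-name value below is a placeholder; substitute your own identifying string, and double-check property names and defaults against the nutch-default.xml shipped with your release):

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch): overrides nutch-default.xml.
     The agent name below is a placeholder; use your own. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ttrojan CSCI572 crawler</value>
  </property>
  <property>
    <!-- Seconds between successive requests to the same server.
         Do not set this below the Nutch default. -->
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```
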

3.3 Crawling Hints

- For sanity's sake, please use a wired connection, if possible, to do your crawling.

- Configuring Nutch to perform this task will likely be one of the trickiest parts of the assignment.
  Pay particular attention to the RegexURLFilter required to crawl the site, and to the crawl command
  arguments/parameters.

- It may be helpful to inspect the website manually to gain a better understanding of what you
  are dealing with and what settings you need to adjust.
  o The site that you are crawling, http://fs.vsoeclassrooms.usc.edu/minivault/, was created by
  deleting a large portion of our original mirror site in order to reduce its size. There is no need
  to be alarmed by the 404 errors that you encounter; those files were intentionally deleted.
  o Consequently, you may find it more useful to inspect the larger version of the mirror at
  http://fs.vsoeclassrooms.usc.edu/vault/ or the original site at http://vault.fbi.gov/. However, DO
  NOT CRAWL THESE SITES! They will exceed the space you have.

- You should observe that some of the URLs include spaces; these need to be percent-encoded to
  normalize the URL to a valid form that Nutch can work with.

- Some of the files we want to crawl are pretty big.
4. Extracting the PDFs from the Sequence File compressed format


Nutch stores the data it downloads in a compressed SequenceFile binary format. The SequenceFiles
are part of segments on disk, each a splittable portion of the crawl.
Nutch does this both for efficiency and to allow multiple fetches to be distributed across a
cluster using the Apache Hadoop MapReduce framework.
After successfully crawling the FBI Vault mirror, you should have a crawl folder with a number of segments in it,
and inside those segments, SequenceFiles with the fetched content stored inside.
To get some idea of how to extract data from Nutch SequenceFiles, have a look at this blog post:
http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html
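As a rough sketch of what such a reader can look like (this assumes the Nutch 1.6 and Hadoop jars on the classpath; the class name and the segment path given on the command line are illustrative, and you should confirm the segment layout against your own crawl output):

```java
// Sketch: iterate over one segment's fetched content using Hadoop's
// SequenceFile API. Assumes nutch-1.6 and hadoop-core jars on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Each segment stores fetched content under content/part-00000/data.
        Path data = new Path(args[0], "content/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text key = new Text();          // the record key is the URL
        Content value = new Content();  // the fetched bytes plus metadata
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value.getContentType());
        }
        reader.close();
    }
}
```

Filtering the records whose content type is application/pdf gives you the subset to write out as files.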
Your goal in this portion of the assignment is to write a Java program PDFExtractor.java that will
extract the PDF files out of the SequenceFile format, and to write the PDFs to disk. Your program should
be executed with arguments indicating the source (crawl data) directory and the output (PDF files)
directory, like this:
java PDFExtractor <crawl dir> <output dir>

The result of this command will be to extract all PDF files from <crawl dir>, and to output them as named
individual PDF files to the directory path identified by <output dir>.
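However you read the segments, you will need to turn each record's URL into a safe output filename and write the raw bytes to disk. A small self-contained sketch of that last step (the helper class, its names, and the naming scheme are our own invention for illustration, not part of the Nutch API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PdfWriter {
    // Derive a filesystem-safe name from a record's URL: keep the last
    // path component and replace characters that are awkward in filenames.
    static String nameFromUrl(String url) {
        String last = url.substring(url.lastIndexOf('/') + 1);
        String safe = last.replaceAll("[^A-Za-z0-9._-]", "_");
        return safe.endsWith(".pdf") ? safe : safe + ".pdf";
    }

    // Write one record's raw bytes under outputDir using the derived name.
    static Path writePdf(String outputDir, String url, byte[] bytes)
            throws IOException {
        Path dir = Paths.get(outputDir);
        Files.createDirectories(dir);
        Path out = dir.resolve(nameFromUrl(url));
        Files.write(out, bytes);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path p = writePdf("out", "http://example.org/a/file%20name.pdf",
                          new byte[] {'%', 'P', 'D', 'F'});
        System.out.println(p);  // prints out/file_20name.pdf on Unix
    }
}
```
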

4.1 Extraction Hints

- Once you have completed your crawl, the PDF extraction step may be done in any environment you
  prefer; a Java program such as PDFExtractor.java ought to be cross-platform, so there is no strict
  requirement for this step to be done on aludra.

- A successful PDF extraction should yield 20 files. If you don't have all the files, it may be because
  you failed to extract them, but it could also be because you failed to crawl them in the first place.

- This should be obvious: you can test your extracted PDF files by opening them in Adobe Reader.

5. More Documentation and Resources


You may find it helpful to refer to:
Nutch API docs: http://nutch.apache.org/apidocs-1.6/index.html
Hadoop API docs: http://hadoop.apache.org/common/docs/r1.0.0/api/index.html
Besides consulting available documentation, and using Piazza, you may also wish to try the mailing lists
at dev@nutch.apache.org and/or user@nutch.apache.org if you encounter any difficulties. Both of these
lists are publicly archived:
http://mail-archives.apache.org/mod_mbox/nutch-dev/
http://mail-archives.apache.org/mod_mbox/nutch-user/

6. Submission Guidelines
This assignment is to be submitted electronically, before the start of class on the specified due date, via
https://blackboard.usc.edu/ under Homework: Apache Nutch. A how-to movie is available at:
http://www.usc.edu/its/blackboard/support/bb9/assets/movies/bb91/submittingassignment/index.html

- Include all Nutch configuration files and/or code that you needed to change or implement in order to
  crawl the FBI Vault website. Avoid including unnecessary files. Place these in a directory named nutchconfig.

- Include all source code and external libraries needed to extract the PDFs from the SequenceFile
  compressed format. Put these in a directory named pdfextract.
  o All source code is expected to be commented, to compile, and to run. You should have (at least)
  one Java source file, PDFExtractor.java, containing your main() function. You should
  also include any other Java source files that you added. Do not submit *.class files; we
  will compile your program from the submitted source.
  o There is no need to submit jar files for Tika, Nutch and/or Hadoop. If you have used any other
  external libraries, you should include those jar files in your submission.
  o Prepare a readme.txt containing a detailed explanation of how to compile and execute your
  program. Be especially specific about how to include any other external libraries you have used.

- Also include your name, USC ID number and email address in the readme.txt.

- Do not include your crawled data or extracted PDF files.

- Compress all of the above (the nutchconfig folder, the pdfextract folder and readme.txt) into a
  single zip archive and name it according to the following filename convention:
  <lastname>_<firstname>_CSCI572_HW_NUTCH.zip
  Use only the standard zip format. Do not use other formats such as zipx, rar, ace, etc.

Important Notes:

- Make sure that you have attached the file when submitting. Failure to do so will be treated as
  non-submission.

- Successful submission will be indicated in the assignment's submission history. We advise that
  you verify the timestamp, and download and double-check your zip file for good measure.

- Academic Integrity Notice: Your code will be compared to all other students' code via the MOSS
  code comparison tool to detect plagiarism.

6.1 Late Assignment Policy

- -10% if submitted up to 7 days after the due date

- no submissions will be accepted after that date

Appendix A: Java 1.6 Setup on aludra.usc.edu


Check your current version of java on aludra by running:
aludra.usc.edu(1): java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)
aludra.usc.edu(2):

If you get anything lower than 1.6.0, you need to change it.
In order to set up Java properly, you need to know which shell you are using. By default, aludra accounts
are set up to use the TC shell (tcsh), but users have the option to change that. Here is how to check which
shell you are using:
aludra.usc.edu(3): ps -p $$
  PID TTY      TIME CMD
28945 pts/168  0:00 tcsh
aludra.usc.edu(4):

Alternatively, if the above command does not work for you, use:
aludra.usc.edu(5): finger $USER
Login name: ttrojan                  In real life: Tommy Trojan
Directory: /home/scf-XX/ttrojan      Shell: /bin/bash
On since Sep 5 07:28:12 on pts/472 from xxx.usc.edu
No unread mail
No Plan.
aludra.usc.edu(6):

If neither of the above commands worked for you, it is safe to assume that you are using tcsh since it is
the default shell.
If you are using tcsh or csh, append the following two lines to your ~/.cshrc file:
source /usr/usc/jdk/1.6.0_23/setup.csh
setenv CLASSPATH .

If you are using bash, append the following two lines to your ~/.bashrc file:
source /usr/usc/jdk/1.6.0_23/setup.sh
export CLASSPATH=.

Once you have completed the steps above, check again to make sure that you now have
the correct version of Java set up. Note that you may need to log out and log back in for the
changes to take effect.
If you are still not able to get the desired Java version after following the above instructions, post on
Piazza to obtain TA assistance.