
Hadoop Exercises

by
Matias Fernando Capristo
fcaprist@fi.uba.ar
Facultad de Ingenieria - Universidad de Buenos Aires (U.B.A)

Document developed for LUSSI department of Telecom Bretagne


École Nationale Supérieure des Télécommunications de Bretagne

Based on Cloudera Hadoop Demo 0.3.3 Tutorial

November 2012

Index
Configuring the work environment
Configure the keyboard and the internet access
Exercise 1: Getting familiar with Hadoop
1.a HDFS Access
1.b Running a MapReduce Program
Exercise 2: Our first MapReduce Program - The Inverted index
2.1 Coding the map() method
2.2 Create the Reducer Class
2.3 Run JUnit tests
2.4 Debugging
2.5 Compile your system
2.6 Running and monitoring your program
Exercise 3: Improving the Inverted Index
Exercise 4: The patent inverted index
Exercise 5: Counting things
Bibliography

Configuring the work environment


To start the Cloudera training virtual machine from Linux, use the icon on the desktop.
At the end of the lab, close VMware properly so that you can retrieve your work in the next sessions.
Because HDFS is not stored permanently, do not shut down the VM; suspend it instead.

Configure the keyboard and the internet access


1 The keyboard of the VM is configured in English. If you prefer French, follow these steps:
a Go to System -> Preferences -> Keyboard -> Layout
b Click + to add a language.
c Add the French layout.
d Set it as default.
e Apply the changes. A password will be requested: it is training. You need to
copy/paste this password from the Unix prompt.

2 Since you are working inside the virtual machine, you have no internet access because the
system doesn't know your credentials. To make sure that you can connect to the outside, do the
following:
a Open a browser.
b Try to browse any web page. A page will appear asking for your credentials.
c Enter your user name and password.

Exercise 1: Getting familiar with Hadoop


On the desktop of the virtual machine you will find a folder named instructions containing some tutorials
and exercises. As a first approach to Hadoop, we will do the exercise "Getting familiar with Hadoop", which
is in the exercises folder. Follow the instructions to interact with Hadoop and practice some
basic commands.
Important: In this exercise you will be asked to update the files using a git command. Since this
may take a long time, you can instead download the files from the web page where this document is located.
1 Download the folder data.
2 Copy the content of the folder data, overwriting the content of the folder ~/git/data.
1.a HDFS Access


The goal of this exercise is to practice the ls, put and cat commands. These commands interact with
HDFS in much the same way as their Unix counterparts. All the commands are prefixed with hadoop fs
(for example, hadoop fs -ls).

1.b Running a MapReduce Program


The goal of this exercise is to practice the syntax of the command used to run MapReduce programs,
experimenting with a program already provided by Hadoop.

Exercise 2: Our first MapReduce Program - The Inverted index


Start Eclipse (via the icon on the desktop of the Cloudera VM). A project has already been created for you
called LineIndexer (path: InvertedIndex/stub-src/index/LineIndexer). This project is preloaded with a
source code "skeleton" for the activity. This workspace also contains another project containing all the
Hadoop source code and libraries. This is required as a build dependency for LineIndexer; it's also there
for your reference.
The LineIndexer class is the main "driver" for the program, which configures the MapReduce job. This
class has already been written for you. You may want to examine its body so you understand how
MapReduce jobs are created and dispatched to Hadoop.

2.1 Coding the map() method


If you open the LineIndexMapper class you will see that the map() method is empty. Complete it with the
following code:
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        // Obtain the name of the file that the current input split belongs to
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        // Wrap the file name in a Text object
        Text outVal = new Text(fileName);
        // StringTokenizer splits the line into tokens and iterates over them;
        // by default the tokens are separated by whitespace
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(new Text(word), outVal);
        }
    }
This code has an error in its algorithm. The aim of this exercise is to get you familiar with some basic
commands. To find the error, you can run your program and analyse the output (using hadoop commands
like ls or cat). You must use the ant tool to build the executable jar and then use the hadoop
command to run the jar.
1 Running ant:
a Go to the directory ~/git/exercises/shakespeare
b Run ant
This will compile the files using the compilation info located in the build.xml file, and will generate
the jar file for your application code.

2 Running the jar:
a Change to the directory where the jar file is located.
b Run hadoop jar indexer.jar index.LineIndexer

Tips to find the error in the mapper code :


1 The map function takes four parameters, which by default correspond to:
a LongWritable key - the byte offset of the current line in the file
b Text value - the current line from the file
c OutputCollector output - this has the collect() method to output a <key, value> pair
d Reporter reporter - allows us to retrieve some information about the job (like the current
file name)

2 The program will read the input from the "input" folder. Instead of using the all-shakespeare text of
the previous exercise, which is long, you can reduce the input by using the file
"RomeoAndJuliet-Prologue-Part.txt". Put this file in the input folder and remove the all-shakespeare file. By
reducing the input to a few lines, you can get a better understanding of the code and its output.

3 Since you want to test just the mapper without a reducer, you can run the job without reducer
tasks by adding the parameter -D mapred.reduce.tasks=0, like this:
    hadoop jar indexer.jar index.LineIndexer -D mapred.reduce.tasks=0
(this form assumes that the driver parses Hadoop's generic options, as drivers launched through ToolRunner do),
or by modifying the runJob() method, adding the line conf.setNumReduceTasks(0); as shown below:
    private void runJob() throws IOException {
        JobConf conf = new JobConf(getConf(), LineIndexer.class);
        FileInputFormat.addInputPath(conf, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(conf, new Path(OUTPUT_PATH));
        conf.setMapperClass(LineIndexMapper.class);
        conf.setReducerClass(LineIndexReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setNumReduceTasks(0); // run the mappers only
        JobClient.runJob(conf);
    }

The results of the mappers will be written straight to HDFS with no processing by the reducer.
4 The output file will be written to the "output" folder. You can use the following command to see the
output:
    training@training-vm:~$ hadoop fs -cat output/<<filename>> | less

Note: Despite the name of the task (Line Indexer), we will actually refer to the locations of individual
words by the byte offset at which the line starts, not by the "line number" in the conventional sense. This is
because the line number is not available to us. (We will, however, be indexing one line at a time, hence
the name "Line Indexer".) Large files can be broken up into smaller chunks which are passed to mappers
in parallel; these chunks are broken up on the line ends nearest to specific byte boundaries. Since there
is no easy correspondence between lines and bytes, a mapper over the second chunk in the file would
need to have read all of the first chunk to establish its starting line number, defeating the point of parallel
processing!
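
To make the note concrete, here is a minimal plain-Java sketch (no Hadoop required) that computes the byte offset at which each line of a small text starts. This is the kind of value a mapper receives as its LongWritable key when reading text line by line. The class name LineOffsets and the sample text are made up for illustration, and the sketch assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    // Returns the byte offset at which each line of the text starts,
    // assuming UTF-8 encoding and '\n' as the line terminator.
    static List<Long> lineStartOffsets(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        if (bytes.length > 0) {
            offsets.add(0L); // the first line always starts at byte 0
        }
        // A new line starts right after every '\n' that is not the last byte.
        for (int i = 0; i < bytes.length - 1; i++) {
            if (bytes[i] == '\n') {
                offsets.add((long) (i + 1));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical two-line sample input
        String sample = "Two households\nIn fair Verona\n";
        System.out.println(lineStartOffsets(sample));
    }
}
```

Each offset can be computed from the bytes alone, without knowing how many lines came before, which is why offsets remain usable when a file is split into chunks.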

2.2 Create the Reducer Class


Open the LineIndexReducer class in the project. The line indexer Reducer takes in all the <"word",
"filename@offset"> key/value pairs output by the Mapper for a single word. For example, for a given key,
the pairs look like:
<key, V1>
<key, V2>
...
<key, Vn >
Given all those <key, value> pairs, the reducer generates a single value string. For the line indexer
problem, the strategy is simply to concatenate all the values into a single large string, using "," to
separate the values. The choice of "," is arbitrary; later code can split on the "," to recover the separate
values. So for the given key, the output value string will look like:
<key, V1,V2,...,Vn>

To do this, the Reducer code simply iterates over the values to get all the value strings and concatenates
them into the result String. In the following exercises you can test the reducer class and debug it to
find errors if the tests fail.
Tips:
1 You will have to iterate over the values collection. If you don't remember much about iterators, you
can check the documentation at http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Iterator.html

Important: To improve performance, instead of concatenating strings in the usual way:
    s1 = s1 + s2; // in a loop, this takes O(n^2) time overall
you can use the StringBuilder class, which provides more efficient string operations, e.g.:
    StringBuilder sb = new StringBuilder();
    sb.append(s1);
    sb.append(s2);
    sb.toString(); // returns the fully concatenated string at the end
The time to perform this operation is linear, which is much better than O(n^2).
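
The two tips can be combined as in the following plain-Java sketch, which walks an Iterator and joins its elements with "," using a single StringBuilder. It is only an illustration under simplifying assumptions: the values are plain String objects rather than Hadoop Text objects, and the class name JoinValues is hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;

class JoinValues {
    // Concatenates all values from the iterator, separated by commas,
    // appending to one StringBuilder so the total work stays linear.
    static String join(Iterator<String> values) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        while (values.hasNext()) {
            if (!first) {
                sb.append(","); // separator goes between values only
            }
            sb.append(values.next());
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Iterator<String> values = Arrays.asList("V1", "V2", "V3").iterator();
        System.out.println(join(values));
    }
}
```

In the actual reducer the same loop shape applies, with each value converted to a string before it is appended.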

2.3 Run JUnit tests


A unit test is an automated piece of code that invokes the unit of work being tested and then checks
some assumptions about the end result of that unit. A unit test is almost always written using a
unit-testing framework. It can be written easily and runs quickly. It is trustworthy, readable, and maintainable.
It is consistent in its results as long as the production code has not changed.
The piece of code under test, usually called the system under test (SUT), should be as small as possible.
One method may have multiple unit tests, depending on the usage and outputs of the function.
The goals of a test are:

1 Ensure that the code meets expectations and specifications: it does what is expected.
2 Ensure that the code continues to meet expectations over time: it avoids regressions.

We will work with tests in this exercise. For this, go to Package Explorer in Eclipse and open the
test/index directory. Right click on AllTests.java, select "Run as..." and "JUnit test." This will use the
Eclipse JUnit plugin to test your Mapper and Reducer. The unit tests have already been written in
MapperTest.java and ReducerTest.java. These use a library developed by Cloudera to test mappers and
reducers, called MRUnit. The source for this library is included in your workspace.

2.4 Debugging
Now that you are more familiar with the tests, you can use them to debug your MapReduce operations.
To do this:
1 Put a breakpoint in the code that you want to debug.
2 Locate the AllTests.java class in the Package Explorer.
3 Right-click the class and select Debug as... -> JUnit test. The execution of the mapper will
start and will stop at your breakpoint. You can then inspect variables, step through the
program, etc.

2.5 Compile your system


Open a terminal window, navigate to ~/workspace/LineIndexer/, and compile your Java sources into a jar:
$ cd ~/workspace/LineIndexer
$ ant
This will create a file named indexer.jar in the current directory. If you're curious about the instructions
that ant followed to create the compiled object, read the build.xml file in a text editor.
You can also run JUnit tests from the ant buildfile by running :
$ ant test

2.6 Running and monitoring your program


In the previous exercise, you should have loaded your sample input data into Hadoop. If you replaced
all-shakespeare.txt with "RomeoAndJuliet-Prologue-Part.txt", switch back, because now that our
MapReduce program is working, it is better to run it with more data. Also, before you run the program
again, you need to remove the output directory that you already created, using the command
$ hadoop fs -rmr output; otherwise Hadoop will refuse to run the job and print an error message
("Directory already exists").

Once you have only the all-shakespeare.txt file in the input folder, and the output folder is deleted, run:
$ hadoop jar indexer.jar index.LineIndexer
This will read all the files in the input directory in HDFS and compute an inverted index. It will be written to
a directory named output. View your results by using hadoop fs -cat filename. You may want to pipe this
into less.
If your program didn't work, go back and fix places you think are buggy. Recompile using the steps
above, and then re-run the program.
Tips:
1 You can use the web user interface in your browser to watch the progress of your job, monitor it,
and see other interesting statistics about it. Browse to http://localhost:50030 to access the
JobTracker.
2 You can check
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/welcome.html
which has a description of how to use and understand the Hadoop UI.

Exercise 3: Improving the Inverted Index


If you take a look at exercise 2, you may notice that the StringTokenizer class doesn't do anything clever
with regard to punctuation or capitalization. You might want to improve your mapper to merge these
related tokens, and avoid indexing the same word several times just because it appears once with a
capital letter, once with a semicolon, etc.
Example: the index produced in exercise 2 looks like this:
romeo shakespeare.txt@38624
Romeo shakespeare.txt@38625
Romeo; shakespeare.txt@38626,shakespeare.txt@12047

The output of exercise 3 must look like this:

romeo shakespeare.txt@38624,shakespeare.txt@38625,shakespeare.txt@38626,shakespeare.txt@12047

Modify the program and run it to obtain the expected result.
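
One possible way to merge such tokens is sketched below in plain Java (no Hadoop required): each token is lowercased and stripped of every character that is not a letter or a digit. The class and method names are hypothetical, and the normalization rule is an assumption chosen for illustration; other rules (for example, keeping apostrophes or hyphens) may suit the texts better.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

class TokenNormalizer {
    // Lowercases the token and removes every character that is
    // neither a letter nor a decimal digit.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
    }

    // Splits a line on whitespace and returns the normalized tokens,
    // skipping tokens that become empty (pure punctuation).
    static List<String> normalizedTokens(String line) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String word = normalize(tokenizer.nextToken());
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalizedTokens("romeo Romeo Romeo; --"));
    }
}
```

With this rule, "romeo", "Romeo" and "Romeo;" all map to the same key. Note that it also turns a word like "o'er" into "oer", which may or may not be what you want.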

Exercise 4: The patent inverted index


Let's now work with a larger amount of data. We will use the patent data sets, which are available on the
National Bureau of Economic Research (NBER) site at http://www.nber.org/patents/. You must download
the acite75_99.zip file. Unzipped, the file is approximately 250 MB, which is small enough to make our
examples runnable in Hadoop's standalone or pseudo-distributed mode.
The patent citation data set contains citations from U.S. patents issued between 1975 and 1999. It has
more than 16 million rows and the first few lines resemble the following:
"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004

The data set is in the standard comma-separated values (CSV) format, with the first line describing
the columns. Each of the other lines records one particular citation. For example, the second line shows
that patent 3858241 cites patent 956203.
The aim of this exercise is to develop a program that takes the patent citation data and inverts it. For
each patent, we want to find and group the patents that cite it. Our output should be similar to the
following:
1 3964859,4647229
10000 4539112
100000 5031388
1000006 4714284
1000007 4766693
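
The inversion itself can be tried out on a few lines without Hadoop. The plain-Java sketch below groups the citing patents under each cited patent, in memory, which mirrors what the map step (emit cited as key, citing as value) and the reduce step (join the values) do together. The class name CitationInverter is hypothetical, and the sample rows in main are simply the citations implied by the first two lines of the expected output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CitationInverter {
    // Each input line has the form CITING,CITED. Returns, for every cited
    // patent, the comma-separated list of patents that cite it.
    static Map<String, String> invert(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            if (line.startsWith("\"")) {
                continue; // skip the quoted header row
            }
            String[] parts = line.split(",", 2);
            if (parts.length < 2) {
                continue; // skip malformed lines
            }
            String citing = parts[0];
            String cited = parts[1];
            grouped.computeIfAbsent(cited, k -> new ArrayList<>()).add(citing);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), String.join(",", entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "\"CITING\",\"CITED\"", "3964859,1", "4647229,1", "4539112,10000");
        for (Map.Entry<String, String> entry : invert(sample).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The keys are kept as strings and sorted as text, which is consistent with the order seen in the expected output (1, 10000, 100000, ...).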

You can use the project of the previous exercise as a template. The configuration of the job will look like
this:
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);
        // Input and output paths come from the command line
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        // Each input line is split into a key and a value at the separator set below
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }

Tips:

1 You must indicate the input and output paths when you run your program from the command line.
They are not hard-coded as in the previous exercise.
2 The mapper should implement the Mapper interface (Mapper<K1, V1, K2, V2>) using the Text
class for each key and value (Mapper<Text, Text, Text, Text>).
3 The reducer should implement the Reducer interface (Reducer<K2, V2, K3, V3>) using
the Text class for each key and value (Reducer<Text, Text, Text, Text>).
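
The mapper receives Text keys and Text values because KeyValueTextInputFormat divides each input line at the first occurrence of the configured separator (here ","). The following plain-Java sketch imitates that split so you can see what the mapper will be handed for a given line; the class name KeyValueSplit is hypothetical and the code is an approximation for illustration, not Hadoop's own implementation.

```java
class KeyValueSplit {
    // Splits a line at the first occurrence of the separator and returns
    // {key, value}. If the separator is absent, the whole line becomes
    // the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] pair = split("3858241,956203", ',');
        System.out.println("key=" + pair[0] + " value=" + pair[1]);
    }
}
```

For the line 3858241,956203 the key is the citing patent and the value is the cited patent, so the mapper has to swap them to invert the data. The header line is split in the same way and reaches the mapper like any other line.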

Exercise 5: Counting things


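In this exercise we will count how often patents are cited: for each citation count, we want the number of
patents that are cited exactly that many times. Our output must look like this:
1 921128
2 552246
3 380319
4 278438
5 210814
6 163149
7 127941
8 102155
9 82126
10 66634
...
411 1

This output shows us, for example, that 66634 patents are cited exactly 10 times, and that one patent is
cited 411 times.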
Tips:

1 The mapper should implement Mapper<Text, Text, IntWritable, IntWritable>
2 The reducer should implement Reducer<IntWritable, IntWritable, IntWritable, IntWritable>
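
One way to read these signatures is that the mapper emits the pair (citation count, 1) for every patent and the reducer sums the ones for each count. The plain-Java sketch below performs the same counting in memory. It assumes that the citation count of each patent is already known (for instance, produced by an earlier job), which this handout does not specify; the class name CitationHistogram and the sample counts are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class CitationHistogram {
    // Input: the citation count of each patent.
    // Output: for each citation count, how many patents have that count.
    static Map<Integer, Integer> histogram(int[] citationCounts) {
        Map<Integer, Integer> hist = new TreeMap<>();
        for (int count : citationCounts) {
            // Equivalent to emitting (count, 1) and summing per key
            hist.merge(count, 1, Integer::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        // Hypothetical counts for five patents
        int[] counts = { 1, 1, 2, 1, 3 };
        for (Map.Entry<Integer, Integer> entry : histogram(counts).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Using integer keys makes the result sort numerically (1, 2, 3, ..., 10, ...), which matches the order of the expected output.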

When the exercise is done, you can put the output into a spreadsheet and plot it. You will see that
you have processed a large data file with a specific format and obtained an analysis of its data very
easily.

Bibliography

Chuck Lam, Hadoop in Action, Manning Publications Co., 2011
Cloudera Hadoop Demo 0.3.3, Tutorial and exercises,
http://www.cloudera.com/content/cloudera/en/home.html
Amazon Web Services, Amazon Elastic MapReduce,
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/welcome.html
