Hadoop

An elephant can't jump. But it can carry a heavy load.

A 20-page introduction to Hadoop and friends.

Prashant Sharma

Table of Contents
1. INTRODUCTION
   1.1 What is distributed computing?
   1.2 What is hadoop? (Name of a toy elephant actually)
   1.3 How does Hadoop eliminate complexities?
   1.4 What is map-reduce?
   1.5 What is HDFS?
   1.6 What is Namenode?
   1.7 What is a datanode?
   1.8 What is a Jobtracker and tasktracker?
2. HOW DOES MAP-REDUCE WORK?
   2.1 Introduction
   2.2 Map-reduce is the answer
   2.3 An example program which puts inverted index in action using the Hadoop 0.20.203 API
   2.4 How Hadoop runs map-reduce
       2.4.1 Submit Job
       2.4.2 Job Initialization
       2.4.3 Task Assignment
       2.4.4 Task Execution
3. HADOOP STREAMING
   3.1 A simple example run
   3.2 How it works?
   3.3 Features
4. HADOOP DISTRIBUTED FILE SYSTEM
   4.1 Introduction
   4.2 What can HDFS not do?
   4.3 Anatomy of HDFS
       4.3.1 Filesystem Metadata
       4.3.2 Anatomy of a write
       4.3.3 Anatomy of a read
   4.4 Accessibility
       4.4.1 DFS Shell
       4.4.2 DFS Admin
       4.4.3 Browser Interface
       4.4.4 Mountable HDFS
5. SERIALIZATION
   5.1 Introduction
   5.2 Write your own composite writable
   5.3 An example explained on serialization and custom Writables from the Hadoop repos
   5.4 Why Java Object Serialization is not so efficient compared to other serialization frameworks
       5.4.1 Java Serialization does not meet the criteria of a serialization format
       5.4.2 Java Serialization is not compact
       5.4.3 Java Serialization is not fast
       5.4.4 Java Serialization is not extensible
       5.4.5 Java Serialization is not interoperable
       5.4.6 Serialization IDL
6. DISTRIBUTED CACHE
   6.1 Introduction
   6.2 An example usage
7. SECURING THE ELEPHANT
   7.1 Kerberos tickets
   7.2 An example of using Kerberos
   7.3 Delegation tokens
   7.4 Further securing the elephant
8. HADOOP JOB SCHEDULING
   8.1 Three schedulers: Default, Capacity and Fair
APPENDIX 1A: AVRO SERIALIZATION
APPENDIX 1B: Documented examples (QMC Pi estimator, WordCount, Grep)

1. Introduction

1.1 What is distributed computing?

Multiple autonomous systems appear as one, interacting via a message-passing interface. The major challenges of distributed computing are:

1. Heterogeneity: different operating systems, different hardware.
2. Scalability: handling extra load, like an increase in users.
3. Fault tolerance: no single point of failure and no data loss, achieved by having provisions for redundancy and recovery.
4. Concurrency: allowing concurrent access to and update of shared resources.
5. Openness: extensions, interoperability, portability.
6. Transparency: the system should appear as a whole instead of a collection of computers.
7. Resource sharing: access any data and utilize CPU resources across the system.

The biggest challenge is to hide the details and complexity of accomplishing the above from the user and to present a common unified interface to interact with. A middleware system allows this. Which is where Hadoop comes in.

1.2 What is hadoop? (Name of a toy elephant actually)

Hadoop is a framework which provides open source libraries for distributed computing using a simple map-reduce interface and its own distributed filesystem called HDFS. It facilitates scalability and takes care of detecting and handling failures.

1.3 How does Hadoop eliminate complexities?

Hadoop has components which take care of all these complexities for us, so by using the simple map-reduce framework we can harness the power of distributed computing without having to worry about things like fault tolerance and data loss. It has a replication mechanism for data recovery, job scheduling, and blacklisting of faulty nodes by a configurable blacklisting policy. The major components are:

1. Map-reduce (JobTracker and TaskTracker)
2. Namenode and Secondary Namenode (an HDFS NameNode stores the edit log and the filesystem image)
3. Datanode (runs on slaves)
4. JobTracker (runs on the server)
5. TaskTracker (runs on slaves)

1.4 What is map-reduce?

The map-reduce framework was introduced by Google: "A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs" (definition from the Google paper on map-reduce). It broadly consists of two mandatory functions to implement: map and reduce. A map is a function which is executed on each key-value pair from an input split; it does some processing and again emits a key and value pair. After map, and before reduce can begin, there is a phase called shuffle, which copies the output, sorts it on the key and aggregates the values. This process is also called aggregation, because you get the values aggregated for a particular key as input to the reduce method. These key and "aggregated value" pairs are consumed by reduce, which outputs a reduced key-value pair. Again, in reduce you may play around with the key-values as you want, and what you emit now is also key-value pairs, which are dumped directly to a file. Simply by expressing a problem in terms of map-reduce we can execute a task in parallel, distribute it across a broad cluster, and be relieved of taking care of all the complexities of distributed computing. Had you tried doing the same thing with MPI libraries, you would understand the complexity of scaling there to hundreds or even thousands of nodes. Indeed, "life made easy". There is a lot more going on in map-reduce than just map and reduce, but the beauty of Hadoop is that it takes care of most of those things, and a user need not dig into the details simply to run a job, though knowledge of those features helps in tuning parameters and thus improving efficiency and fault tolerance.

1.5 What is HDFS?

Hadoop has its own implementation of a distributed file system, called the Hadoop Distributed File System, which is coherent and provides all the facilities of a file system. It implements ACLs and provides a subset of the usual UNIX commands for accessing or querying the filesystem, and if one mounts it as a FUSE DFS then it is possible to access it like any other Linux filesystem with standard UNIX commands.

1.6 What is Namenode?

A single point of failure for an HDFS installation. It contains information regarding each block's location as well as the entire directory structure and files. By saying it is a single point of failure I mean that if the namenode goes down, the whole filesystem is offline. Hadoop also has a secondary namenode which contains an edit log, which in case of a failure of the namenode can be used to replay all the actions of the filesystem and thus restore its state. The secondary namenode regularly contacts the namenode and takes checkpointed snapshot images. At any time of failure these checkpointed images can be used to restore the namenode. Efforts are currently going on to provide high availability for the Namenode.

1.7 What is a datanode?

A datanode stores the actual blocks of data, and stores and retrieves blocks when asked. Datanodes periodically report back to the namenode with the list of blocks they are storing.

1.8 What is a Jobtracker and tasktracker?

There is one JobTracker (also a single point of failure) running on a master node and several TaskTrackers running on slave nodes. Each TaskTracker has multiple task instances running, and every TaskTracker reports to the JobTracker in the form of a heartbeat at regular intervals, which also carries a message about the progress of the current job it is executing, or "idle" if it has finished. The JobTracker schedules jobs and takes care of failed ones by re-executing them on some other nodes. In MRv2, efforts are being made to have high availability for the JobTracker, which would definitely change the way it has been.

2. How does map-reduce work?

Reference: "MapReduce: Simplified Data Processing on Large Clusters" by Google.

2.1 Introduction

We will understand this by taking an inverted index as an example. An inverted index is the same as the one that appears at the back of a book, where each word is listed along with the locations where it occurs. Its main usage is to build indexes for search engines. Suppose you were to build a search engine with an inverted index as the index. The conventional way is to build the inverted index in a large map (the data structure) and update the map by reading the documents and updating the index. The limitation of this approach: if the number of documents is large, then disk I/O will become a bottleneck. And what if the data is in PBs? Will it scale?

2.2 Map-reduce is the answer.

The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. The word is the key; as the name says it is unique, and different document IDs with the same word are merged when passed to the reduce function as input, which is done in the sort and shuffle phase. It is easy to augment this computation to keep track of word positions. These are the basic steps of a typical map-reduce program, as described in the Google map-reduce paper.

2.3 An example program which puts inverted index in action using the Hadoop 0.20.203 API.

Read the Hadoop quickstart guide for installation instructions.

package testhadoop;

import java.io.IOException;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Here we build the inverted index of a corpus (you can use the wikipedia
 * corpus). Map: emits the word as key and the filename as value.
 * Reduce: emits the word and its occurrences per filename.
 *
 * @author Prashant
 */
public class InvertedIndex {

  /**
   * Mapper is the abstract class which needs to be extended to write a mapper.
   * We specify the input key/value and output key/value formats as
   * <InKey, InVal, OutKey, OutVal>. Remember we can only use classes that
   * implement Writable for keys and values (serialization issues are
   * discussed later), so in the mapper we chose <Object, Text, Text, Text>.
   * Emits (word, filename in which it occurs).
   *
   * @author Prashant
   */
  public static class WFMapper extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      Text word = new Text();
      // Here we use the context object to retrieve the name of the file
      // this map task is working on.
      String file = new String(((FileSplit) (context.getInputSplit()))
          .getPath().toString());
      Text FN = new Text(file);
      // Tokenize each line on the basis of ", . \t and \n; you should add
      // more symbols if you are using an HTML or XML corpus.
      StringTokenizer itr = new StringTokenizer(value.toString(), "\". \t\n");
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // Emits intermediate key and value pairs.
        context.write(word, FN);
      }
    }
  }

  /**
   * Almost the same concept for the reducer as for the mapper (read the
   * WFMapper documentation). Emits (word, "<filename, count> <filename, count> ...").
   * Here we store the file names in a hashtable and increment the count to
   * augment the index.
   */
  public static class WFReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Hashtable<String, Long> table = new Hashtable<String, Long>();
      for (Text val : values) {
        if (table.containsKey(val.toString())) {
          Long temp = table.get(val.toString());
          temp = temp.longValue() + 1;
          table.put(val.toString(), temp);
        } else {
          table.put(val.toString(), new Long(1));
        }
      }
      String result = "";
      Enumeration<String> e = table.keys();
      while (e.hasMoreElements()) {
        String tempkey = e.nextElement().toString();
        Long tempvalue = (Long) table.get(tempkey);
        result = result + "< " + tempkey + ", " + tempvalue.toString() + " > ";
      }
      context.write(key, new Text(result));
    }
  }

  public static void main(String[] args) throws Exception {
    // Load the configuration into the Configuration object (from the XML
    // files you set up while installing Hadoop).
    Configuration conf = new Configuration();
    // Pass the arguments to the Hadoop utility for options parsing.
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: invertedIndex <in> <out>");
      System.exit(2);
    }
    // Create the Job object from the configuration.
    Job job = new Job(conf, "Inverted index");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(WFMapper.class);
    /*
     * Why do we use a combiner when it is optional? A combiner reduces the
     * output at the mapper end itself, so the bandwidth load over the network
     * is reduced and the efficiency of the reducer increases. Although it is
     * possible to write a separate one, we used the same class for the
     * combiner as for the reducer. (Caveat: the reducer's output values here
     * are already formatted strings, so for exact counts the combiner should
     * really be a separate class or be omitted.)
     */
    job.setCombinerClass(WFReducer.class);
    job.setReducerClass(WFReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
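To actually run the example you compile it against the Hadoop core jar, package it, and submit it with the hadoop launcher. A minimal sketch, assuming a 0.20.203 install under $HADOOP_HOME; the core jar name varies by release, and the jar name, classes directory and HDFS input/output paths below are placeholders rather than anything from the original text:

$ mkdir classes
$ javac -classpath $HADOOP_HOME/hadoop-core-0.20.203.0.jar -d classes InvertedIndex.java
$ jar -cvf invertedindex.jar -C classes/ .
$ $HADOOP_HOME/bin/hadoop jar invertedindex.jar testhadoop.InvertedIndex /user/prashant/corpus /user/prashant/index-out

The output directory must not already exist; as described in section 2.4.1 below, the job is rejected otherwise.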

2.4 How does Hadoop run map-reduce?

If you are curious about what goes on behind the scenes when you deploy a map-reduce program and see a dump of "spam" printed on stderr (which, if everything went fine, is information about your job's status, not errors), the steps below describe it. The diagram above is self-explanatory and shows how map-reduce works. Note that there are both asynchronous and synchronous ways of submitting a job.

2.4.1 Submit Job

● Pass the JobConf to JobClient.runJob() or submitJob(). runJob() blocks until the job completes; submitJob() does not.
● The JobClient asks the JobTracker for a new job ID.
● Checks the output spec of the job: it checks the output directory, and if it already exists the job is not submitted and an error is thrown.
● Computes the input splits for the job (the JobClient determines the proper division of the input into InputSplits). If the splits cannot be computed, for example because the inputs do not exist, the job is not submitted and an error is thrown.
● Copies the resources needed to run the job to HDFS, into a directory specified by the "mapred.system.dir" property. The job jar file is copied with a high replication factor, which can be set by the "mapred.submit.replication" property.
● Submits the job to the JobTracker and tells the JobTracker that the job is ready.
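With the newer org.apache.hadoop.mapreduce.Job API used in section 2.3, the same synchronous/asynchronous choice looks roughly like the sketch below. It assumes job is a fully configured Job object, exactly as built in the inverted index driver, and in practice you call one of the two, not both:

// Asynchronous: hands the job to the JobTracker and returns immediately.
job.submit();

// Synchronous: blocks until the job finishes, printing progress along the
// way, and returns true on success.
boolean success = job.waitForCompletion(true);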

2.4.2 Job Initialization

● Puts the job in an internal queue.
● The job scheduler will pick it up and initialize it:
  ○ Creates a Job object representing the job being run.
  ○ Encapsulates its tasks.
  ○ Keeps book-keeping information to track the tasks' status and progress.
● Creates the list of tasks to run.
● Retrieves the number of input splits computed by the JobClient from the shared filesystem.
● Creates one map task for each split.
● The scheduler creates the reduce tasks and assigns them to TaskTrackers.
  ○ The number of reduce tasks is determined by the mapred.reduce.tasks property.
● Task IDs are given to each task.

2.4.3 Task Assignment

● TaskTrackers send heartbeats to the JobTracker via RPC.
● A TaskTracker indicates readiness for a new task.
● The JobTracker will allocate a task.
● The JobTracker communicates the task in the response to a heartbeat.
● Choosing a TaskTracker:
  ○ The JobTracker must choose a task for a TaskTracker.
  ○ It uses the scheduler to choose a task.
  ○ Job scheduling algorithms: the default one is based on priority and FIFO.

2.4.4 Task Execution

● The TaskTracker has been assigned the task; the next step is to run it.
● It localizes the job by copying the jar file from "mapred.system.dir" to job-specific directories, and copies any other files required.
● It creates a local working directory for the task and un-jars the contents of the jar into this directory.
● It creates an instance of TaskRunner to run the task.
● The TaskRunner launches a new JVM to run each task:
  ○ This avoids the TaskTracker failing if there are bugs in the map-reduce tasks; only the child JVM exits in case of a problem.
● TaskTracker.Child.main():
  ○ Sets up the child TaskInProgress attempt.
  ○ Reads the XML configuration.
  ○ Connects back to the necessary MapReduce components via RPC.
  ○ Uses TaskRunner to launch the user process.

3. Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.

3.1 A simple example run

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -input myInputDirs \
 -output myOutputDir \
 -mapper /bin/cat \
 -reducer /bin/wc

3.2 How it works?

Mapper side: when an executable is specified for mappers, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented output from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. Similarly for the reducer: each reducer task converts its input key/value pairs into lines readable on the stdin of the executable, and the line-oriented output of the executable is converted back into key/value pairs.

3.3 Features

You can specify an internal class as a mapper instead of an executable, like this:

-mapper org.apache.hadoop.mapred.lib.IdentityMapper

Input and output format classes, partitioners and combiners can be specified like this:

-inputformat JavaClassName
-outputformat JavaClassName
-partitioner JavaClassName
-combiner JavaClassName

You can also specify JobConf parameters:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -input myInputDirs \
 -output myOutputDir \
 -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
 -reducer /bin/wc \
 -jobconf mapred.reduce.tasks=2
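If the mapper or reducer is a script of your own rather than a program already installed on every node, the streaming utility's -file option ships it with the job. A sketch; mapper.py and reducer.py are placeholder script names, not from the original text:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -input myInputDirs \
 -output myOutputDir \
 -mapper mapper.py \
 -reducer reducer.py \
 -file mapper.py \
 -file reducer.py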

4. Hadoop Distributed File System

4.1 Introduction

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

● Very large files: it has a large block size (default 64 MB) to compensate for seek time relative to network bandwidth, so very large files are ideal for storage.
● Streaming data access: a write-once, read-many-times architecture. Since files are large, the time to read them matters more than the seek to the first record.
● Commodity hardware: it is designed to run on commodity hardware which may fail; HDFS is capable of handling that.

4.2 What can HDFS not do?

● Low-latency data access: it is not optimized for low-latency access; it trades latency to increase data throughput.
● Lots of small files: since the block size is 64 MB, lots of small files waste blocks and increase the memory requirements of the namenode.
● Multiple writers and arbitrary modification: there is no support for multiple writers in HDFS, and files are written by a single writer, with writes appended only at the end of each file.

4.3 Anatomy of HDFS!

4.3.1 Filesystem Metadata

● The HDFS namespace is stored by the Namenode.
● The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata, for example creating a new file or changing the replication factor of a file.
● The EditLog is stored in the Namenode's local filesystem.
● The entire filesystem namespace, including the mapping of blocks to files and the filesystem properties, is stored in a file called FsImage, which is also stored in the Namenode's local filesystem.
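Before walking through the internals of a write and a read, here is the client's view of both: a minimal sketch using the Java FileSystem API described in section 4.4. This is an illustration only; the path and the text written are made up, and the configuration is picked up from the usual Hadoop XML files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);          // the HDFS client
    Path p = new Path("/user/prashant/hello.txt"); // hypothetical path

    FSDataOutputStream out = fs.create(p);         // goes through the write pipeline of 4.3.2
    out.writeUTF("hello hdfs");
    out.close();

    FSDataInputStream in = fs.open(p);             // goes through the read path of 4.3.3
    System.out.println(in.readUTF());
    in.close();
  }
}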

4.3.2 Anatomy of a write

● DFSOutputStream splits the data into packets and writes them into an internal data queue.
● The DataStreamer asks the namenode for a list of datanodes, and the namenode gives a list of datanodes that form the pipeline.
● The DataStreamer consumes the internal data queue and streams the packets to the pipeline.
● An internal queue of packets waiting to be acknowledged is also maintained.

4.3.3 Anatomy of a read

● The namenode returns the locations of the blocks for the first few blocks of the file.
● The datanode list is sorted according to proximity to the client.
● FSDataInputStream wraps DFSInputStream, which manages the datanode and namenode I/O.
● Read is called repeatedly on the datanode until the end of the block is reached.
● DFSInputStream then finds the next datanode for the next data block.
● All of this happens transparently to the client.
● The client calls close after finishing reading the data.

4.4 Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use; a C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance, or HDFS can be mounted as a UNIX filesystem.

4.4.1 DFS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called DFSShell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action: Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir

Action: View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt

DFSShell is targeted at applications that need a scripting language to interact with the stored data.

4.4.2 DFS Admin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

Action: Put the cluster in SafeMode
Command: bin/hadoop dfsadmin -safemode enter

Action: Generate a list of Datanodes
Command: bin/hadoop dfsadmin -report

Action: Decommission Datanode datanodename
Command: bin/hadoop dfsadmin -decommission datanodename

4.4.3 Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

4.4.4 Mountable HDFS

Please visit the MountableHDFS wiki page for more details.

5. Serialization

5.1 Introduction

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Expectations from a serialization interface:

○ Compact: to utilize bandwidth efficiently.
○ Fast: reduced processing overhead of serializing and deserializing.
○ Extensible: easily enhanceable protocols.
○ Interoperable: support for different languages.

Hadoop has the Writable interface, which has all of those features except interoperability, which is implemented in Avro. The following predefined implementations of WritableComparable are available:

1. IntWritable
2. LongWritable
3. DoubleWritable
4. VIntWritable: less used, as it is pretty much covered by VLongWritable.
5. VLongWritable: variable size, stores as much as needed; 1-9 bytes of storage.
6. BooleanWritable
7. FloatWritable
8. BytesWritable
9. NullWritable: this does not store anything and may be used when we do not want to give anything as the key or value. It also has one important usage: for example, if we want to write a sequence file and do not want anything stored in the key position, we can give a NullWritable object as the key, and since it stores nothing, all values will be merged by the reduce method into one single instance.

10. MD5Hash
11. ObjectWritable
12. GenericWritable

Apart from the above there are four Writable collection types:

1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable

5.2 Write your own composite writable.

Besides the predefined "writables" we can implement WritableComparable to serialize a class of our own and use it as a key and/or value in map-reduce. This is important for any class that is to be used as a key and/or value, since it is the class that gets serialized.

5.3 An example explained on serialization and having custom Writables, from the Hadoop repos. (See the comments.)

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package testhadoop;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads text input files that must contain two integers per line.
 * The output is sorted by the first and second number and grouped on the
 * first number.
 *
 * In this example the use of a composite key is demonstrated: since there is
 * no default implementation of a composite key, we had to override methods
 * from WritableComparable.
 *
 * MapClass simply reads a line from the input and emits the pair
 * IntPair(left, right) as the key and the right value as the value.
 * The reducer class just emits the sum of the input values.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar secondarysort
 *         <in-dir> <out-dir>
 */
public class SecondarySort2 {

  /**
   * Define a pair of integers that are writable.
   * They are serialized in a byte comparable format.
   */
  public static class IntPair implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    /**
     * Set the left and right values.
     */
    public void set(int left, int right) {
      first = left;
      second = right;
    }

    public int getFirst() {
      return first;
    }

    public int getSecond() {
      return second;
    }

    /**
     * Read the two integers.
     * Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE, MAX_VALUE -> -1
     */
    @Override
    public void readFields(DataInput in) throws IOException {
      first = in.readInt() + Integer.MIN_VALUE;
      second = in.readInt() + Integer.MIN_VALUE;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(first - Integer.MIN_VALUE);
      out.writeInt(second - Integer.MIN_VALUE);
    }

    @Override
    public int hashCode() {
      return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
      if (right instanceof IntPair) {
        IntPair r = (IntPair) right;
        return r.first == first && r.second == second;
      } else {
        return false;
      }
    }

    /** A Comparator that compares serialized IntPair. */
    public static class Comparator extends WritableComparator {
      public Comparator() {
        super(IntPair.class);
      }

      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        return compareBytes(b1, s1, l1, b2, s2, l2);
      }
    }

    static {
      // register this comparator
      WritableComparator.define(IntPair.class, new Comparator());
    }

    /** Compare on the basis of first first, then second. */
    @Override
    public int compareTo(IntPair o) {
      if (first != o.first) {
        return first < o.first ? -1 : 1;
      } else if (second != o.second) {
        return second < o.second ? -1 : 1;
      } else {
        return 0;
      }
    }
  }

  /**
   * Partition based on the first part of the pair. Since we have our own
   * implementation of the key and its hash function, we need to override the
   * partitioner; we cannot go for the default HashPartitioner. The partition
   * function is (first * 127) MOD (numberOfPartitions).
   */
  public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
      return Math.abs(key.getFirst() * 127) % numPartitions;
    }
  }

  /**
   * Read two integers from each line and generate a key, value pair
   * as ((left, right), right).
   */
  public static class MapClass
      extends Mapper<LongWritable, Text, IntPair, IntWritable> {

    private final IntPair key = new IntPair();
    private final IntWritable value = new IntWritable();

    @Override
    public void map(LongWritable inKey, Text inValue, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(inValue.toString());
      int left = 0;
      int right = 0;
      if (itr.hasMoreTokens()) {
        left = Integer.parseInt(itr.nextToken());
        if (itr.hasMoreTokens()) {
          right = Integer.parseInt(itr.nextToken());
        }
        key.set(left, right);
        value.set(right);
        context.write(key, value);
      }
    }
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce
      extends Reducer<IntPair, IntWritable, Text, IntWritable> {

    private final Text first = new Text();

    @Override
    public void reduce(IntPair key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      first.set(Integer.toString(key.getFirst()));
      for (IntWritable value : values) {
        context.write(first, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: secondarysort2 <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "secondary sort");
    job.setJarByClass(SecondarySort2.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);

    // group and partition by the first int in the pair
    job.setPartitionerClass(FirstPartitioner.class);

    // the map output is IntPair, IntWritable
    job.setMapOutputKeyClass(IntPair.class);
    job.setMapOutputValueClass(IntWritable.class);

    // the reduce output is Text, IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

5.4 Why is Java Object Serialization not so efficient compared to other serialization frameworks?

5.4.1 Java Serialization does not meet the criteria of a serialization format

A serialization format should be 1) compact, 2) fast, 3) extensible and 4) interoperable.

5.4.2 Java Serialization is not compact

Java writes the classname of each object being written to the stream; this is true of classes that implement java.io.Serializable or java.io.Externalizable. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, since the referent class may occur at any point in the preceding stream; that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case. All of these problems are avoided by not writing the classname to the stream at all, which is the approach that Writable takes. This makes the assumption that the client knows the expected type. The result is a format considerably more compact than Java Serialization, and random access and sorting work as expected since each record is independent of the others (so there is no stream state).

5.4.3 Java Serialization is not fast

Java Serialization is a general-purpose mechanism for serializing graphs of objects, so it necessarily has some overhead for serialization and deserialization operations. A map-reduce job at its core serializes and deserializes billions of records of different types; Writables can be reused, which benefits memory and bandwidth by not allocating new objects.

5.4.4 Java Serialization is not extensible

In terms of extensibility, Java Serialization has some support for evolving a type, but it is hard to use effectively. (Writables have no support for this either: the programmer has to manage evolution himself.)

5.4.5 Java Serialization is not interoperable

In principle, other languages could interpret the Java Serialization stream protocol (defined by the Java Object Serialization Specification), but in practice there are no widely used implementations in other languages, so it is a Java-only solution. The situation is the same for Writables.

5.4.6 Serialization IDL

There are many serialization frameworks that approach serialization in a different way: rather than defining types through code, they allow types to be defined in a language-neutral, declarative fashion using an Interface Description Language (IDL). The system then generates types for different languages, which encourages interoperability. Avro is one of the serialization frameworks which uses the IDL mechanism. Please see the appendix for details.

6. Distributed Cache

6.1 Introduction

The distributed cache is a facility provided by the map-reduce framework to distribute the files explicitly specified with the -files option across the cluster, where they are kept in cache for processing. Generally, all extra files needed by map-reduce tasks should be distributed this way to save network bandwidth.

6.2 An example usage:

$hadoop jar job.jar XYZUsingDistributedCache -files input/somethingtobecached input/data output

7. Securing the Elephant

In order to safeguard a Hadoop cluster against unauthorized access, which may lead to loss or leakage of important data, we need some mechanism to ensure that access is granted only to authorized entities, with rights matching whatever extent of security we want to enforce for each user. Hadoop uses Kerberos for ensuring cluster security.

7.1 Kerberos tickets

Kerberos is a computer network authentication protocol which assigns tickets to nodes communicating over an insecure network to establish each other's identity in a secure manner. There are three steps to access a service:

a) Authentication: receive a TGT (Ticket Granting Ticket) from the authentication server.
b) Authorization: contact the ticket granting server with the TGT to obtain a service ticket.
c) Service request: the service ticket can now be used to access the resource.

A Kerberos ticket is valid for 10 hours once received and can be renewed.

7.2 An example of using Kerberos

$kinit
provide password for HADOOP@user:******
$hadoop fs -put anything ….

7.3 Delegation tokens

Delegation tokens are used by Hadoop in the background so that the user does not have to authenticate at every command by contacting the KDC.

7.4 Further securing the elephant

● To enforce ACLs, so that each user is able to access only his own jobs, set mapred.acls.enabled to true and set the acl-view-job and acl-modify-job ACLs for each job.
● Besides this, in order to isolate tasks on the basis of the submitting user rather than the operating system account, set mapred.task.tracker.task-controller to org.apache.hadoop.mapred.LinuxTaskController.
● Note that Hadoop does not use encryption for RPC or for transferring HDFS blocks to and from datanodes.

8. Hadoop Job Scheduling

Now that we have jobs running, we need somebody to take care of them. Hadoop has pluggable schedulers as well as a default scheduler. These are necessary to provide users access to cluster resources and also to control access and limit usage.

8.1 Three schedulers

Default scheduler:
● A single priority-based queue of jobs.
● Scheduling tries to balance map and reduce load on all TaskTrackers in the cluster.

Capacity Scheduler (Yahoo!'s scheduler) supports the following features:
● Support for multiple queues, where a job is submitted to a queue.
● Queues are allocated a fraction of the capacity of the grid, in the sense that a certain capacity of resources will be at their disposal. All jobs submitted to a queue will have access to the capacity allocated to the queue.
● Free resources can be allocated to any queue beyond its capacity. When there is demand for these resources from queues running below capacity at a future point in time, as tasks scheduled on these resources complete, they will be assigned to jobs on queues running below the capacity.
● Queues optionally support job priorities (disabled by default).
● Within a queue, jobs with higher priority will have access to the queue's resources before jobs with lower priority. However, once a job is running, it will not be preempted for a higher priority job, though new tasks from the higher priority job will be preferentially scheduled.

● In order to prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them.
● Support for memory-intensive jobs, wherein a job can optionally specify higher memory requirements than the default, and the tasks of the job will only be run on TaskTrackers that have enough memory to spare.

Fair Scheduler (ensures fairness amongst users):
● Built by Facebook.
● Multiple queues (pools) of jobs, sorted in FIFO order or by fairness limits.
● Each pool is guaranteed a minimum capacity, and excess capacity is shared by all jobs using a fairness algorithm.
● The scheduler tries to ensure that over time, all jobs receive the same amount of resources.

Appendix 1A: Avro Serialization

Apache Avro is one of the serialization frameworks used in Hadoop. Avro is a data serialization system. It provides rich data structures that are compact and are transported in a binary data format. It provides a container file to store persistent data. It provides Remote Procedure Call (RPC). It provides simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols; code generation is an optional optimization, only worth implementing for statically typed languages. Avro relies on JSON schemas and is compatible with C, Java, Python and several more languages.

Avro offers a lot of advantages compared to Java Serialization: 1) compact 2) fast 3) interoperable 4) extensible.

Avro Serialization is compact. When Avro data is read, the schema used when writing it is always present. Since the schema is present when data is read, considerably less type information needs to be encoded with the data, so the strategy employed by Avro generates a minimal amount of data and results in a smaller serialization size.

Avro Serialization is fast. The smaller serialization size makes the data faster to transport to remote machines.

Avro Serialization is interoperable. Avro schemas are defined with JSON, which facilitates implementation in languages that already have JSON libraries. Avro data is always serialized with its schema.
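As an illustration of the JSON schemas Avro relies on, a hypothetical record type might be declared like this (the record and field names are made up, not from the original text):

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]}
 ]
}

Both the writer and the reader work from such a schema, which is why so little per-record type information has to be stored alongside the data.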

30.BooleanWritable. software 13.conf.examples. */ 18.output.lib. 21.apache.apache. See the NOTICE file 4.FileSystem.hadoop.import org. You may obtain a copy of the License at 9.apache. 40.Tool.21 trunk version./** Firstly this is the only Mapreduce program that Prints output to the screen 43.Path.apache.apache. The ASF licenses this file 6.mapreduce. 24. 41. Version 2.import org. .import org.import org. either express or implied. Can be ported to hadoop 0. * to you under the Apache License.import org.0 11.QMC pie estimator. 20.package org.hadoop.apache. 1.Configuration.hadoop.io.io.import org.apache. 37.IOException.hadoop.SequenceFile.import org.import org.import org.apache.0 (the 7.hadoop.Writable.hadoop. 15.SequenceFile.conf.apache.*.input. * distributed with this work for additional information 5. * limitations under the License.apache. 27. * http://www.output.fs. /** 2.FileOutputFormat.hadoop.import java.import org.WritableComparable.Configured. 33.import org.203 api and used. 23.io. 29.import org.apache. * HaltonSequence: instead of file.hadoop.FileInputFormat. * regarding copyright ownership. you may not use this file except in compliance 8.lib. 28. 26.mapreduce. * 10.hadoop.apache.hadoop.BigDecimal. 1. * distributed under the License is distributed on an "AS IS" BASIS. 14.apache.hadoop. 39.SequenceFileInputFormat.CompressionType.io.LongWritable. * "License").math. * with the License.import org. 34.apache. 17.hadoop. 31.Appendix 1B Documented examples from latest Apache hadoop distribution in the new hadoop 0.hadoop.hadoop.import java. 19.ToolRunner.hadoop.io.io.import org. 36. 35.import org.apache.hadoop.lib. * Licensed to the Apache Software Foundation (ASF) under one 3. * Unless required by applicable law or agreed to in writing.hadoop.input.org/licenses/LICENSE-2. 42.apache.math. 25. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND.import org. 38. * 12.apache.apache.fs.RoundingMode.util.lib.mapreduce. 32.hadoop.import java.20.SequenceFileOutputFormat.mapreduce. * or more contributor license agreements. 22.io.mapreduce.import org.util.apache. * See the License for the specific language governing permissions and 16.

in first iteraion when i=0 q is. 66. *QmcReducer: 53. */ 64. * A reducer does not emit any thing it simply iterates over the keys of true and false and sums 54. *QmcMapper creates size samples and checks if they are inside our outside by first subtracting 0. * We emphasize numerical approximation of arbitrary integrals in this example. * 58.e. 81. consider using bbp./** 67. * 63. * Reducer: 90. * reduce tasks (Thus it is possible else concurrency issues).1/3. * It is easy to see that Pi is equal to $4I$. * Arbitrary integrals can be approximated numerically by qMC methods. 57. * 92. 1. * 79. * $x=(x_1. * This simply undoes what the constructor has done to get the next point 49. 46. * where $S=[0.x_2)$ is a 2-dimensional point. 95. * Let numTotal = numInside + numOutside. 80. 88. * and then count points inside/outside of the inscribed circle of the square. *nextPoint(): 48. i. 69. * A map/reduce program that estimates the value of Pi 68. * x is sum of all elements of series d * q. q is. It has a seperate overriden 55. 71. * There are better methods for computing Pi. * close method wherein it has written the output to the file reduce-out and as it has only one 56. 1 . *overridden cleanup():seperate close to register the output to file. 52. * Generate points in a unit square 87. * So an approximation of Pi is obtained once $I$ is evaluated numerically. * 89. 76. * Accumulate points inside/outside results from the mappers.. 65.44.. for second iteration when i=1 . * and $f$ is a function describing the inscribed circle of the square $S$. * we use a qMC method to approximate the integral $I = \int_S f(x) dx$. 45. 78. 47.1/9 . 72. 62. * 85. 70. * 83. 60. * In this example. * the reducer. 84. * (x^2 + y^2 > r^2) then it emits 2 values one is numInside and numoutside seperated by a true or false value as keys. * variable k is moded for randomization. * The fraction numInside/numTotal is a rational approximation of 94. 59. * using a quasi-Monte Carlo (qMC) method. * estimate: calls specified no of mapper and 1 reducer and reads the output from the file written by 61.1)^2$ is a unit square. 1/2 . otherwise. * The implementation is discussed below. 91. 1/4 . * the no of points and updates the variable numInside and numoutside . 82. * For computing many digits of Pi. * the value (Area of the circle)/(Area of the square) = $I$. * Populates the arrays with Halton Sequence. * coordinates (supposingly) from x and y and then putting it in equation of the circle and checking if it satisfies 51.5 (which are center 50. 77. * $f(x)=1$ if $(2x_1-1)^2+(2x_2-1)^2 <= 1$ and $f(x)=0$. 93. 73. 74. 63 nos.. * where the area of the inscribed circle is Pi/4 . 75. * Mapper: 86.

113. 105. * where H(i) is a 2-dimensional point and i >= 1 is the index. 116. q[i][j] = (j == 0? 1. * 151. * Assume the current point is H(index). * @return a 2-dimensional point with coordinates in [0. 150.0: q[i][j-1])/P[i]. the estimated value of Pi is 4(numInside/numTotal). index++. /** 2-dimensional Halton sequence {H(i)}. k = (k . */ 99. /** Maximum number of digits allowed */ 114. index = startindex.public class QuasiMonteCarlo extends Configured implements Tool { 100. } 145. 115. 140. 122. 109.1)^2 152. 108. i < K. 143.96. i++) { 135. long k = index.length. = "A map/reduce program that estimates Pi using a quasi-Monte Carlo 102. x[i] = 0. 98.d[i][j])/P[i]. q[i] = new double[K[i]]. 148. 126. static final String DESCRIPTION 101. 137. . 107. */ 124.class. 134. * Finally. method. i++) { 130.length][]. 119. 147. static private final Path TMP_DIR = new Path( 104. } 144. */ 153. static final int[] K = {63. private double[][] q. q = new double[K. double[] nextPoint() { 154. 120. } 133. for(int i = 0. 149. static final int[] P = {2. /** tmp directory for input/output */ 103. for(int i = 0. private long index. d = new int[K. 3}. private double[] x. * Compute H(index+1). /** Bases */ 112. j < K[i].length]. HaltonSequence(long startindex) { 125. 121. 118.length][]. } 146. for(int j = 0. 131. /** Compute next point. 127.". d[i][j] = (int)(k % P[i]).length. private static class HaltonSequence { 111. private int[][] d. j++) { 139. * and the area of unit square is 1. 40}. 129.getSimpleName() + "_TMP_3_141592654"). 117. * Halton sequence is used to generate sample points for Pi estimation. 106. d[i] = new int[K[i]]. */ 110. 141. x[i] += d[i][j] * q[i][j]. 138. 123. /** Initialize to H(startindex). QuasiMonteCarlo. 128. 142. i < K. 97. 132. x = new double[K. * so the sequence begins with H(startindex+1). 136.

} d[i][j] = 0. 175. 157. 190. 159. } else { numInside++.5. 212. 177. i++) { for(int j = 0. 207. 210. 170. */ public static class QmcMapper extends Mapper<LongWritable. if (d[i][j] < P[i]) { break. 162. ) { //generate points in a unit square final double[] point = haltonsequence. } } return x. } } //output map results context. 206. 176. 193. j < K[i]. 166. j++) { d[i][j]++.155. long numInside = 0L. } //report status i++. 163. 181. LongWritable> { /** Map method.get(). Context context) throws IOException. 164. 211. x[i] -= (j == 0? 1. 174. 171. 203. InterruptedException { final HaltonSequence haltonsequence = new HaltonSequence(offset. LongWritable.setStatus("Generated " + i + " samples. 194. * Generate points in a unit square * and then count points inside/outside of the inscribed circle of the square. 173.get()). . LongWritable size.length.0. if (x*x + y*y > 0. } } /** * Mapper class for Pi estimation. * @param offset samples starting from the (offset+1)th sample. 180.0: q[i][j-1]). 191. 200. if (i % 1000 == 0) { context. 202. * @param size the number of samples for this map * @param context output {ture->numInside. i < size. 188. for(long i = 0. 199. 209. 205. 183. 196. 201. 208. 187. 169. 156. false->numOutside} */ public void map(LongWritable offset. new LongWritable(numInside)). 197."). 213. 186. 167. 158. 179. 172. i < K. 161.25) { numOutside++.nextPoint().write(new BooleanWritable(true). long numOutside = 0L. 185. 189. //count points inside/outside of the inscribed circle of the square final double x = point[0] . 168.0. 204. final double y = point[1] . 198. 192. 195. 178. for(int i = 0. x[i] += q[i][j].5. 160. 182. 165. BooleanWritable. 184.

LongWritable. 271. 245. */ public void reduce(BooleanWritable isInside. 248. 234. 251. 250. 263. not used here. InterruptedException { if (isInside. 236. LongWritable. } } /** * Run a map/reduce job for estimating Pi. * * @return the estimated value of Pi */ public static BigDecimal estimatePi(int numMaps.get(). * @param isInside Is the points inside? * @param values An iterator to a list of point counts * @param context dummy. 258. new LongWritable(numOutside)). 252. Path outFile = new Path(outDir. * Accumulate points inside/outside results from the mappers. long numPoints. 217. Iterable<LongWritable> values. 249. 265. 237. 229.getConfiguration(). write output to a file. 221. 267. 226. } } else { for (LongWritable val : values) { numOutside += val. new LongWritable(numOutside)). 266. CompressionType. 238. 230. "reduce-out"). 259. 269. 231. 261. 239. 257. 216. */ public static class QmcReducer extends Reducer<BooleanWritable. 247.get(). 253. 241.createWriter(fileSys. 227. 220. outFile.Writer writer = SequenceFile.class. 223. Configuration conf . 264. 240. 260. 243. 270. 222. conf. 233. 242.append(new LongWritable(numInside). private long numOutside = 0. 244. writer. LongWritable. "out"). } } } /** * Reduce task done. 255. 232. 218. 272. context.get(conf). Configuration conf = context. 254. 246. 268. Writable> { private long numInside = 0.get()) { for (LongWritable val : values) { numInside += val.class. Context context) throws IOException.NONE). /** * Accumulate number of points inside/outside results from the mappers. */ @Override public void cleanup(Context context) throws IOException { //write output to a file Path outDir = new Path(TMP_DIR. FileSystem fileSys = FileSystem. 225.close(). 262. SequenceFile. 224. 228. 215. 256. 235. } } /** * Reducer class for Pi estimation. 219.214. writer.write(new BooleanWritable(false). WritableComparable<?>.

  /**
   * Run a map/reduce job for estimating Pi.
   *
   * @return the estimated value of Pi
   */
  public static BigDecimal estimatePi(int numMaps, long numPoints,
      Configuration conf
      ) throws IOException, ClassNotFoundException, InterruptedException {
    Job job = new Job(conf);
    //setup job conf
    job.setJobName(QuasiMonteCarlo.class.getSimpleName());
    job.setJarByClass(QuasiMonteCarlo.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);

    job.setOutputKeyClass(BooleanWritable.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setMapperClass(QmcMapper.class);

    job.setReducerClass(QmcReducer.class);
    job.setNumReduceTasks(1);

    // turn off speculative execution, because DFS doesn't handle
    // multiple writers to the same file.
    job.setSpeculativeExecution(false);

    //setup input/output directories
    final Path inDir = new Path(TMP_DIR, "in");
    final Path outDir = new Path(TMP_DIR, "out");
    FileInputFormat.setInputPaths(job, inDir);
    FileOutputFormat.setOutputPath(job, outDir);

    final FileSystem fs = FileSystem.get(conf);
    if (fs.exists(TMP_DIR)) {
      throw new IOException("Tmp directory " + fs.makeQualified(TMP_DIR)
          + " already exists.  Please remove it first.");
    }
    if (!fs.mkdirs(inDir)) {
      throw new IOException("Cannot create input directory " + inDir);
    }

    try {
      //generate an input file for each map task
      for(int i=0; i < numMaps; ++i) {
        final Path file = new Path(inDir, "part"+i);
        final LongWritable offset = new LongWritable(i * numPoints);
        final LongWritable size = new LongWritable(numPoints);
        final SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, file,
            LongWritable.class, LongWritable.class, CompressionType.NONE);
        try {
          writer.append(offset, size);
        } finally {
          writer.close();
        }
        System.out.println("Wrote input for Map #"+i);
      }

      //start a map/reduce job
      System.out.println("Starting Job");
      final long startTime = System.currentTimeMillis();
      job.waitForCompletion(true);
      final double duration = (System.currentTimeMillis() - startTime)/1000.0;
      System.out.println("Job Finished in " + duration + " seconds");

      //read outputs
      Path inFile = new Path(outDir, "reduce-out");
      LongWritable numInside = new LongWritable();
      LongWritable numOutside = new LongWritable();
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, inFile, conf);
      try {
        reader.next(numInside, numOutside);
      } finally {
        reader.close();
      }

      //compute estimated value
      final BigDecimal numTotal
          = BigDecimal.valueOf(numMaps).multiply(BigDecimal.valueOf(numPoints));
      return BigDecimal.valueOf(4).setScale(20)
          .multiply(BigDecimal.valueOf(numInside.get()))
          .divide(numTotal, RoundingMode.HALF_UP);
    } finally {
      fs.delete(TMP_DIR, true);
    }
  }

  /**
   * Parse arguments and then runs a map/reduce job.
   * Print output in standard out.
   *
   * @return a non-zero if there is an error. Otherwise, return 0.
   */
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: "+getClass().getName()+" <nMaps> <nSamples>");
      ToolRunner.printGenericCommandUsage(System.err);
      return 2;
    }

    final int nMaps = Integer.parseInt(args[0]);
    final long nSamples = Long.parseLong(args[1]);

    System.out.println("Number of Maps  = " + nMaps);
    System.out.println("Samples per Map = " + nSamples);

    System.out.println("Estimated value of Pi is "
        + estimatePi(nMaps, nSamples, getConf()));
    return 0;
  }

  /**
   * main method for running it as a stand alone command.
   */
  public static void main(String[] argv) throws Exception {
    System.exit(ToolRunner.run(null, new QuasiMonteCarlo(), argv));
  }
}
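This example ships with the Hadoop examples jar, where it is normally registered under the driver name pi. The two arguments are the number of map tasks and the number of samples per map, exactly as parsed in run() above. A typical invocation might look like the following (the jar file name varies between Hadoop releases, so treat it as a placeholder):

bin/hadoop jar hadoop-*-examples*.jar pi 10 100000

With 10 maps and 100,000 samples per map, the job prints the estimate computed by estimatePi() to standard out.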

Grep example:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
/** Grep search uses RegexMapper to read from the input whatever satisfies the regex
 * and then emits the match as key and a count of one as value.
 * LongSumReducer simply sums all the long values emitted by the mapper for a
 * particular key, which gives the count for that key.
 * It first searches by the above procedure and, since the output obtained is
 * sorted on words, it then uses the InverseMapper class to sort the output on
 * frequencies.
 */
public class Grep extends Configured implements Tool {
  private Grep() {}                               // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return 2;
    }

    Path tempDir =
      new Path("grep-temp-"+
          Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    Configuration conf = getConf();
    conf.set(RegexMapper.PATTERN, args[2]);
    if (args.length == 4)
      conf.set(RegexMapper.GROUP, args[3]);

    Job grepJob = new Job(conf);

    try {

      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      grepJob.waitForCompletion(true);

      Job sortJob = new Job(conf);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormatClass(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      sortJob.setNumReduceTasks(1);               // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setSortComparatorClass(             // sort by decreasing freq
        LongWritable.DecreasingComparator.class);

      sortJob.waitForCompletion(true);
    } finally {
      FileSystem.get(conf).delete(tempDir, true);
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
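Grep can be run from the Hadoop examples jar against a directory that already exists in HDFS; the arguments correspond to <inDir> <outDir> <regex> [<group>] from the usage message above. A typical run (again, the jar file name below is only a placeholder and differs between releases) might be:

bin/hadoop jar hadoop-*-examples*.jar grep input output 'dfs[a-z.]+'

Because the second (grep-sort) job inverts each (match, count) pair and uses a decreasing comparator with a single reducer, the final output lists the matched strings ordered by descending frequency.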

WordCount

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Word count documentation.
 * TokenizerMapper: takes each line from the input set, tokenizes it into words
 * and emits each word as key with the integer one as value.
 * IntSumReducer: accepts each word as key and aggregates all the values, i.e.
 * the ones the mapper has emitted; iterating over all the values and adding
 * them up gives the count of that word.
 */
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
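WordCount is launched the same way as the other examples: it expects just an input and an output path, as the usage check in main() shows. Assuming the standard examples jar (the jar name is a placeholder and varies by release):

bin/hadoop jar hadoop-*-examples*.jar wordcount input output

Each line of the result is a word followed by the total number of times it appears across all files in the input directory.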
