Large Scale Data Management Systems

Hadoop API

Dimitrios Michail

Dept. of Informatics and Telematics, Harokopio University of Athens

2012-2013


Hadoop: Input/Output

The framework views a job as a transformation of key/value pairs:

(input) <k1, v1>
   ↓ map
<k2, v2>
   ↓ combine
<k2, v2>
   ↓ reduce
<k3, v3> (output)


Hadoop: Key and Value Classes

Serializable
The key and value classes need to be serializable by the framework: they need to implement the Writable interface.

Comparable
To facilitate sorting by the framework, the key class also needs to be comparable: it needs to also implement the WritableComparable interface.

There are implementations for all basic types, e.g. IntWritable, LongWritable, DoubleWritable, Text.
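As an illustration, here is a minimal sketch of a custom key type; the class name and fields are ours, not part of the Hadoop API:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a (year, temperature) pair.
public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    // A no-argument constructor is required: the framework
    // instantiates keys reflectively during deserialization.
    public YearTempKey() { }

    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readInt();
    }

    public int compareTo(YearTempKey other) {
        if (year != other.year)
            return year < other.year ? -1 : 1;
        if (temperature != other.temperature)
            return temperature < other.temperature ? -1 : 1;
        return 0;
    }
}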

WordCount Example

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

WordCount Example

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

WordCount Example

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
} // end of class WordCount
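Assuming the compiled classes are packaged into a jar named wordcount.jar (the jar name and paths are ours, for illustration), the job could be launched from the command line with:

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output

Here args[0] is the input directory and args[1] the output directory; as noted later, the output directory must not already exist.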

Input

There are three major interfaces for providing input to a map-reduce job:
- InputFormat<K,V>
- InputSplit
- RecordReader

InputFormat<K,V>

Interface InputFormat<K,V> describes the input-specification for a Map-Reduce job.

The Map-Reduce framework relies on the InputFormat of the job to:
1. Validate the input-specification of the job.
2. Split-up the input file(s) into logical InputSplit(s), each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation to be used to produce input records from the logical InputSplit for processing by the Mapper.

InputSplit

Interface InputSplit represents the data to be processed by an individual Mapper.

It presents a byte-oriented view on the input, and it is the responsibility of the RecordReader of the job to process this and present a record-oriented view.

RecordReader

Interface RecordReader reads <key, value> pairs from an InputSplit.

It converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view for the Mapper & Reducer tasks for processing. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.

Default InputFormat

The default InputFormat<K,V> is the TextInputFormat.

TextInputFormat
Implements InputFormat<LongWritable, Text> for plain text files. Files are broken into lines: keys are the position in the file, and values are the line of text. Either linefeed or carriage-return are used to signal end of line.
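As a worked example of our own, consider a small text file containing the two lines

hello world
goodbye

TextInputFormat would produce the records <0, "hello world"> and <12, "goodbye">, since the second line starts at byte offset 12 (the first line's 11 characters plus the newline).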

Output

The major interfaces involved in the output of a map-reduce job are:
- OutputCollector<K,V>
- OutputFormat<K,V>
- RecordWriter
- FileSystem

OutputCollector

Interface OutputCollector collects the <key, value> pairs output by Mappers and Reducers.

OutputCollector<K,V> is the generalization of the facility provided by the Map-Reduce framework to collect data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.

OutputFormat<K,V>

Interface OutputFormat<K,V> describes the output-specification for a Map-Reduce job.

The Map-Reduce framework relies on the OutputFormat<K,V> of the job to:
1. Validate the output-specification of the job, e.g. check that the output directory doesn't already exist.
2. Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.

RecordWriter

Interface RecordWriter writes the output <key, value> pairs to an output file. RecordWriter implementations write the job outputs to the FileSystem.

FileSystem

An abstract base class for a fairly generic filesystem. Implemented either as:
- "local": reflects the locally-connected disk
- Distributed File System: the Hadoop DFS, a multi-machine system that appears as a single disk, with fault tolerance and potentially very large capacity
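A minimal sketch of programmatic FileSystem access; the path and file contents are ours, for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Returns the filesystem configured in fs.default.name:
        // the local filesystem by default, HDFS on a cluster.
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
        out.writeBytes("hello\n");
        out.close();
    }
}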

Default OutputFormat

The default OutputFormat<K,V> is the TextOutputFormat.

TextOutputFormat
An OutputFormat<K,V> that writes plain text files.

JobConf: map/reduce job configuration

JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat<K,V> and OutputFormat<K,V> implementations to be used, etc.

Mapper

Interface Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records.

The user implements the following interface:

public interface Mapper<K1, V1, K2, V2>
        extends JobConfigurable, Closeable {

    void map(K1 key, V1 value,
             OutputCollector<K2, V2> output,
             Reporter reporter) throws IOException;
}

Hadoop Map-Reduce Framework

The framework spawns one map task for each InputSplit generated by the InputFormat<K,V> for the job, and calls map(K1, V1, OutputCollector<K2,V2>, Reporter) for each key/value pair in the InputSplit for that task.

Hadoop Map-Reduce Framework

Grouping
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

Hadoop Map-Reduce Framework

Partitioning
The grouped Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner, as in the sketch below.

A partitioner is represented by the following interface:

interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
}
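As an illustration, a sketch of a hash-based partitioner for Text keys; the class name is ours (Hadoop ships an equivalent default, HashPartitioner):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
        // No configuration needed for this sketch.
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with conf.setPartitionerClass(WordPartitioner.class).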

Hadoop Map-Reduce Framework

Combiners
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class). Combiners perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
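In the WordCount example above, the Reduce class doubles as the combiner: summing partial counts locally yields the same final result. In general, a reduce function can safely be reused as a combiner only when the operation is associative and commutative (as addition is), since the framework may apply the combiner zero, one, or several times.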

Hadoop Map-Reduce Framework

The intermediate, grouped outputs are stored in SequenceFiles, which are flat files consisting of binary key/value pairs.

There are three SequenceFile Writers:
1. Writer: uncompressed records.
2. RecordCompressWriter: record-compressed files; only values are compressed.
3. BlockCompressWriter: block-compressed files; both keys & values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

The actual compression algorithm used to compress keys and/or values can be specified by using the appropriate CompressionCodec. There are also Readers and Sorters for each SequenceFile type.
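A minimal sketch of writing a block-compressed SequenceFile directly; the path and records are ours, for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.seq");
        // Block compression with the default codec.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, IntWritable.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        writer.append(new Text("hello"), new IntWritable(1));
        writer.close();
    }
}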

Reducer

Interface Reducer reduces a set of intermediate values which share a key to a smaller set of values.

The user implements the following interface:

public interface Reducer<K2, V2, K3, V3>
        extends JobConfigurable, Closeable {

    void reduce(K2 key, Iterator<V2> values,
                OutputCollector<K3, V3> output,
                Reporter reporter) throws IOException;
}

The number of Reducers for the job is set by the user via JobConf.setNumReduceTasks(int).
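For example, a hypothetical configuration with four reducers (the count is ours):

conf.setNumReduceTasks(4); // four reduce tasks, hence four output files

Each reduce task writes its own file in the job's output directory (part-00000, part-00001, and so on).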

Hadoop Map-Reduce Framework: Reducer Phases

Phase 1: Shuffle
The framework, for each reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

Phase 2: Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

Hadoop Map-Reduce Framework: Reducer Phases

Phase 3: Reduce
The reduce(K2, Iterator<V2>, OutputCollector<K3,V3>, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(K3, V3).

Reporter

A facility for Map-Reduce applications to report progress and update counters, status information etc.

Is Alive?
Mapper and Reducer can use the provided Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might otherwise assume that the task has timed out and kill it.
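As a sketch, the WordCount map method from earlier could report progress and update a counter as follows; the counter group and name are arbitrary strings of our choosing:

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    // Update a custom counter, visible in the job's status.
    reporter.incrCounter("WordCount", "input-lines", 1);

    // ... per-record processing as before ...

    // Tell the framework this task is still alive.
    reporter.progress();
}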

Material

http://hadoop.apache.org
http://hadoop.apache.org/docs/r1.0.4
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
