
Large Scale Data Management Systems

Hadoop API

Dimitrios Michail

Dept. of Informatics and Telematics Harokopio University of Athens

2012-2013

Hadoop
Input/Output

(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> → (output)



Hadoop

Serializable: The key and value classes need to be serializable by the framework, i.e. they need to implement the Writable interface.

Comparable: To facilitate sorting by the framework, the key class also needs to be comparable, i.e. it needs to also implement the WritableComparable interface. There are implementations for all basic types, e.g. DoubleWritable.
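As a minimal sketch, a custom key type might look as follows (the class IntPairWritable is hypothetical, assuming the classic org.apache.hadoop.io API):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key holding two ints.
public class IntPairWritable implements WritableComparable<IntPairWritable> {
    private int first;
    private int second;

    public void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Writable: serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Writable: deserialize in the same order.
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Comparable: defines the sort order used by the framework.
    public int compareTo(IntPairWritable o) {
        if (first != o.first) return first < o.first ? -1 : 1;
        if (second != o.second) return second < o.second ? -1 : 1;
        return 0;
    }
}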

WordCount Example

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);  // emit <word, 1> for every token
            }
        }
    }

WordCount Example

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // add up the counts for this word
            }
            output.collect(key, new IntWritable(sum));
        }
    }


WordCount Example

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // the reducer doubles as a combiner
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
} // end of class WordCount
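The job can then be packaged into a jar and run in the usual way; for example (the jar name and input/output paths here are illustrative):

hadoop jar wordcount.jar WordCount /user/joe/input /user/joe/output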

Input

There are three major interfaces for providing input to a map-reduce job:
- InputFormat<K,V>
- InputSplit
- RecordReader


InputFormat<K,V> Interface

InputFormat<K,V> describes the input specification for a Map-Reduce job. The Map-Reduce framework relies on the InputFormat of the job to:

1. Validate the input specification of the job.
2. Split up the input file(s) into logical InputSplit(s), each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation to be used to produce input records from the logical InputSplit for processing by the Mapper.
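For reference, the interface in the classic org.apache.hadoop.mapred API looks roughly like this:

public interface InputFormat<K, V> {
    // Logically split the job's input into numSplits pieces.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Create a record reader for the given split.
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException;
}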


InputSplit Interface

InputSplit represents the data to be processed by an individual Mapper. It presents a byte-oriented view of the input; it is the responsibility of the RecordReader of the job to process this and present a record-oriented view.
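Roughly, in the classic org.apache.hadoop.mapred API:

public interface InputSplit extends Writable {
    // Total number of bytes in this split.
    long getLength() throws IOException;

    // Hostnames where the split's data lives, for locality-aware scheduling.
    String[] getLocations() throws IOException;
}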


RecordReader Interface

RecordReader reads <key, value> pairs from an InputSplit. It converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper and Reducer tasks. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.
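Roughly, in the classic API:

public interface RecordReader<K, V> {
    boolean next(K key, V value) throws IOException;  // read the next record into key/value
    K createKey();                                    // create an empty key instance
    V createValue();                                  // create an empty value instance
    long getPos() throws IOException;                 // current position in the input
    void close() throws IOException;
    float getProgress() throws IOException;           // fraction consumed, in [0, 1]
}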


Default InputFormat

The default InputFormat<K,V> is TextInputFormat.

TextInputFormat implements InputFormat<LongWritable, Text> for plain text files. Files are broken into lines; either linefeed or carriage-return is used to signal end of line. Keys are the position in the file, and values are the line of text.


Output

The major interfaces involved in the output of a map-reduce job are:
- OutputCollector<K,V>
- OutputFormat<K,V>
- RecordWriter
- FileSystem


OutputCollector Interface

OutputCollector<K,V> collects the <key, value> pairs output by Mappers and Reducers. It is the generalization of the facility provided by the Map-Reduce framework to collect data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
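The interface itself is a single method (classic API, roughly):

public interface OutputCollector<K, V> {
    // Add a <key, value> pair to the output.
    void collect(K key, V value) throws IOException;
}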


OutputFormat<K,V> Interface

OutputFormat<K,V> describes the output specification for a Map-Reduce job. The Map-Reduce framework relies on the OutputFormat<K,V> of the job to:

1. Validate the output specification of the job, e.g. check that the output directory doesn't already exist.
2. Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.
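Roughly, in the classic API:

public interface OutputFormat<K, V> {
    // Check the validity of the output specification, e.g. that the
    // output directory does not already exist.
    void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException;

    // Get the record writer for the given job.
    RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException;
}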


RecordWriter Interface

RecordWriter writes the output <key, value> pairs to an output file. RecordWriter implementations write the job outputs to the FileSystem.
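Roughly, in the classic API:

public interface RecordWriter<K, V> {
    // Write a key/value pair.
    void write(K key, V value) throws IOException;

    // Close this writer, flushing any buffered output.
    void close(Reporter reporter) throws IOException;
}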


FileSystem

An abstract base class for a fairly generic filesystem. Implemented either as:
- local: reflects the locally-connected disk
- Distributed File System: the Hadoop DFS, a multi-machine system that appears as a single disk, offering fault tolerance and potentially very large capacity
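A minimal usage sketch (the path is illustrative; FileSystem.get returns the local or distributed implementation depending on the configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // local or HDFS, per configuration
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
        out.writeUTF("hello");
        out.close();
    }
}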


Default OutputFormat

The default OutputFormat<K,V> is TextOutputFormat, an OutputFormat<K,V> that writes plain text files: each output record is a line with the key and value as text, separated by a tab.


JobConf
map/reduce job configuration

JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat<K,V> and OutputFormat<K,V> implementations to be used, etc.

Mapper Interface

Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The user implements the following interface:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
    void map(K1 key, V1 value, OutputCollector<K2, V2> output,
            Reporter reporter) throws IOException;
}


Hadoop Map-Reduce Framework

The framework spawns one map task for each InputSplit generated by the InputFormat<K,V> for the job, and calls map(K1, V1, OutputCollector<K2,V2>, Reporter) for each key/value pair in the InputSplit for that task.


Hadoop Map-Reduce Framework

Grouping: All intermediate values associated with a given output key are subsequently grouped by the framework and passed to a Reducer to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
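As a sketch, a comparator that reverses the sort order of Text keys might look like this (the class name is hypothetical):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical: sort Text keys in descending order.
public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true);  // true: instantiate keys for compare()
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);  // invert the natural order
    }
}

// registered via: conf.setOutputKeyComparatorClass(DescendingTextComparator.class);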



Hadoop Map-Reduce Framework

Partitioning: The grouped Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

A partitioner is represented by the following interface:


public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
}
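A minimal sketch of a custom partitioner, partitioning by the key's hash (essentially what the default HashPartitioner does; the class name is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashKeyPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
        // from JobConfigurable; no configuration needed here
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask the sign bit so the result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// registered via: conf.setPartitionerClass(HashKeyPartitioner.class);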


Hadoop Map-Reduce Framework

Combiners: Users can optionally specify a combiner, via JobConf.setCombinerClass(Class). Combiners perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.


Hadoop Map-Reduce Framework

The intermediate, grouped outputs are stored in SequenceFiles, which are flat files consisting of binary key/value pairs. There are three SequenceFile Writers:

1. Writer: uncompressed records.
2. RecordCompressWriter: record-compressed files; only values are compressed.
3. BlockCompressWriter: block-compressed files; both keys and values are collected in blocks separately and compressed. The size of the block is configurable.


The actual compression algorithm used to compress keys and/or values can be specified by using the appropriate CompressionCodec. There are also Readers and Sorters for each SequenceFile type.
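A sketch of writing a block-compressed SequenceFile (the path and codec choice are illustrative, assuming the classic org.apache.hadoop.io.SequenceFile API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Block compression: keys and values are gathered in blocks and compressed.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/tmp/data.seq"),
                Text.class, IntWritable.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        writer.append(new Text("key"), new IntWritable(1));
        writer.close();
    }
}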

Reducer Interface

Reducer reduces a set of intermediate values which share a key to a smaller set of values. The user implements the following interface:

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
    void reduce(K2 key, Iterator<V2> values,
            OutputCollector<K3, V3> output, Reporter reporter)
            throws IOException;
}

The number of Reducers for the job is set by the user via JobConf.setNumReduceTasks(int).


Hadoop Map-Reduce Framework


Reducer Phases

Phase 1: Shuffle
The framework, for each reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

Phase 2: Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.


Phase 3: Reduce
The reduce(K2, Iterator<V2>, OutputCollector<K3,V3>, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(K3, V3).


Reporter

A facility for Map-Reduce applications to report progress and update counters, status information, etc.

Is Alive? Mappers and Reducers can use the Reporter provided to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial, since the framework might otherwise assume that the task has timed out and kill it.
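A minimal usage sketch inside a map() implementation (the counter enum is hypothetical):

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Hypothetical application-defined counters.
    public static enum MyCounters { RECORDS_SEEN }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        reporter.setStatus("processing offset " + key);    // human-readable status
        reporter.incrCounter(MyCounters.RECORDS_SEEN, 1);  // update a counter
        reporter.progress();                               // just signal liveness
        // ... actual per-record work ...
    }
}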


Material

http://hadoop.apache.org
http://hadoop.apache.org/docs/r1.0.4
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
