
Introduction to Google MapReduce

WING Group Meeting 13 Oct 2006 Hendra Setiawan

What is MapReduce?

- A programming model (and its associated implementation)
- For processing large data sets
- Exploits a large set of commodity computers
- Executes processes in a distributed manner
- Offers a high degree of transparency

In other words: simple, and maybe suitable for your tasks!

Distributed Grep

[Diagram: very big data -> split data -> grep (one per split) -> matches -> cat -> all matches]

Distributed Word Count

[Diagram: very big data -> split data -> count (one per split) -> count -> merge -> merged count]

Map Reduce

[Diagram: very big data -> MAP -> partitioning function -> REDUCE -> result]

- Map:
    - Accepts an input key/value pair
    - Emits intermediate key/value pairs
- Reduce:
    - Accepts an intermediate key/value* pair (a key together with all the values emitted for it)
    - Emits an output key/value pair
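The map and reduce roles above can be illustrated with a minimal single-process sketch. It is not Google's library or Hadoop: the class `MiniMapReduce`, its `Pair` type, and the `run` and `wordCount` methods are hypothetical names made up for this example, and the partitioning step is reduced to grouping intermediate pairs by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical in-memory illustration of the map -> group -> reduce flow.
// None of these names come from Google's MapReduce or from Hadoop.
class MiniMapReduce {

    /** A simple immutable key/value pair. */
    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    static <I, K extends Comparable<K>, V, O> Map<K, O> run(
            List<I> inputs,
            Function<I, List<Pair<K, V>>> mapFn,
            BiFunction<K, List<V>, O> reduceFn) {
        // Map phase: every input record emits zero or more intermediate pairs,
        // which are grouped by key (this list is the "value*" that reduce sees).
        Map<K, List<V>> groups = new TreeMap<>();
        for (I input : inputs) {
            for (Pair<K, V> p : mapFn.apply(input)) {
                groups.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
            }
        }
        // Reduce phase: one call per distinct intermediate key.
        Map<K, O> result = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : groups.entrySet()) {
            result.put(e.getKey(), reduceFn.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    /** Word count expressed with the sketch above. */
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
            lines,
            line -> {
                List<Pair<String, Integer>> out = new ArrayList<>();
                for (String w : line.trim().split("\\s+")) {
                    if (!w.isEmpty()) out.add(new Pair<>(w, 1)); // emit(w, 1)
                }
                return out;
            },
            (word, ones) -> {
                int sum = 0;
                for (int v : ones) sum += v; // sum(value*)
                return sum;
            });
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

In a real deployment the grouping step is where the partitioning function decides which reduce task receives each key; here everything stays in one process so the data flow is easy to follow.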

Partitioning Function

Partitioning Function (2)

- Default: hash(key) mod R
- Guarantee:
    - Relatively well-balanced partitions
    - Ordering guarantee within partition
- Distributed Sort
    - Map: emit(key, value)
    - Reduce (with R=1): emit(key, value)
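A small sketch of the default partitioning rule, hash(key) mod R, may help here. The class `PartitionSketch` and its methods are hypothetical names for illustration only. `Math.floorMod` is used as an assumption to keep the partition index non-negative, since Java's `hashCode()` can be negative. Sorting each partition stands in for the per-partition ordering guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of hash(key) mod R partitioning; not a Hadoop API.
class PartitionSketch {

    /** Returns the reduce partition (0 .. R-1) that a key is sent to. */
    static int partition(String key, int numReducers) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numReducers);
    }

    /** Assigns keys to R partitions and sorts the keys inside each partition. */
    static List<List<String>> assign(List<String> keys, int numReducers) {
        List<List<String>> partitions = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partition(key, numReducers)).add(key);
        }
        for (List<String> p : partitions) {
            Collections.sort(p); // ordering holds within a partition only
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("pear", "apple", "fig", "apple");
        System.out.println(assign(keys, 3)); // several partitions, each sorted
        System.out.println(assign(keys, 1)); // R=1: one fully sorted partition
    }
}
```

With R=1 every key lands in the same partition, so the per-partition ordering becomes a total ordering. This is why the distributed sort above uses a single reduce task with identity map and reduce functions.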

MapReduce

- Distributed Grep
    - Map: if match(value, pattern) emit(value, 1)
    - Reduce: emit(key, sum(value*))
- Distributed Word Count
    - Map: for all w in value do emit(w, 1)
    - Reduce: emit(key, sum(value*))
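The distributed grep pseudocode can be traced with a minimal single-process sketch. The class `GrepSketch` and its `map`, `reduce`, and `grep` methods are hypothetical names invented for this example. Each inner list stands in for one data split, and the splits are processed in a plain loop rather than on separate machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory illustration of distributed grep; not a Hadoop API.
class GrepSketch {

    /** Map: if match(value, pattern) emit(value, 1), applied to one split. */
    static List<Map.Entry<String, Integer>> map(List<String> split, Pattern pattern) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : split) {
            if (pattern.matcher(line).find()) {
                emitted.add(new SimpleEntry<>(line, 1));
            }
        }
        return emitted;
    }

    /** Reduce: emit(key, sum(value*)), summing the counts of each matching line. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Runs map on every split independently, then reduces the combined output. */
    static Map<String, Integer> grep(List<List<String>> splits, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (List<String> split : splits) {
            intermediate.addAll(map(split, pattern));
        }
        return reduce(intermediate);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("error: disk", "ok"),
            Arrays.asList("error: disk", "error: net"));
        System.out.println(grep(splits, "error"));
    }
}
```

Because `map` only looks at one split at a time, the calls are independent of each other, which is the property that lets a real cluster run them in parallel.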

MapReduce Transparencies

Plus Google Distributed File System:

- Parallelization
- Fault-tolerance
- Locality optimization
- Load balancing

Suitable for your task if you

- Have a cluster
- Are working with a large dataset
- Are working with independent data (or data assumed to be independent)
- Can cast the task into map and reduce

MapReduce outside Google

- Hadoop (Java)
    - Emulates MapReduce and GFS
- The architecture of Hadoop MapReduce and DFS is master/slave

                 Master        Slave
    MapReduce    jobtracker    tasktracker
    DFS          namenode      datanode

Example Word Count (1)

Map

    public static class MapClass extends MapReduceBase implements Mapper {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter) throws IOException {
        String line = ((Text)value).toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

Example Word Count (2)

Reduce

    public static class Reduce extends MapReduceBase implements Reducer {
      public void reduce(WritableComparable key, Iterator values,
                         OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += ((IntWritable) values.next()).get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Example Word Count (3)

Main

    public static void main(String[] args) throws IOException {
      //checking goes here
      JobConf conf = new JobConf();

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(MapClass.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputPath(new Path(args[0]));
      conf.setOutputPath(new Path(args[1]));

      JobClient.runJob(conf);
    }

- One time setup
    - Set hadoop-site.xml and slaves
    - Initiate namenode
- Run Hadoop MapReduce and DFS
- Upload your data to DFS
- Run your process…
- Download your data from DFS

Summary

- A simple programming model for processing large datasets on a large cluster of computers
- Fun to use: focus on the problem, and let the library deal with the messy details

References

- Original paper (http://labs.google.com/papers/mapreduce.html)
- On Wikipedia (http://en.wikipedia.org/wiki/MapReduce)
- Hadoop: MapReduce in Java (http://lucene.apache.org/hadoop/)
- Starfish: MapReduce in Ruby (http://rufy.com/starfish/)