This example illustrates one way to implement the MapReduce design pattern on the Sun Grid Compute Utility. The example leverages Compute Server technology to simplify development of the solution using the Java programming language.
The MapReduce design pattern is widely used today to solve a set of data processing problems that involve two phases of execution: a map phase and a reduce phase. In the first phase, input key/value pairs are processed through a mapping function to produce a set of intermediate results, also as key/value pairs. Those intermediate results are then reduced in a second execution phase, wherein all of the values for a single key are consolidated to produce a final set of unique key/value pairs. This pattern was popularized by Google, which employs it to process large volumes of data, and takes its name from Google's MapReduce programming model and associated implementation, introduced by Jeffrey Dean and Sanjay Ghemawat in the 2004 Google Labs paper "MapReduce: Simplified Data Processing on Large Clusters," published at http://labs.google.com/papers/mapreduceosdi04.pdf.
Typically, the work to be completed in both the map and reduce phases is divided into independent tasks that execute in parallel, making these solutions highly scalable and well suited to run on the Sun Grid Compute Utility. That said, the simplicity of the MapReduce design pattern is easily lost among complexities introduced by distributed systems development. For that reason, many MapReduce solutions are implemented atop a distributed systems execution layer that is customized to insulate developers from these complexities and to preserve the clarity of the MapReduce model.
In this example we leverage Sun's Compute Server technology to simplify the development of a Word Counter application, modeled as a MapReduce problem, using the Java programming language. Compute Server technology handles all of the distributed systems work for us, including the distribution and load balancing of tasks across many grid nodes, the collection of results from those distributed execution nodes, partial failure tolerance, and more, thereby allowing us to focus our energies exclusively on development of the logic for our map and reduce functions. Using the Grid Compute Server Plugin for NetBeans™ IDE, we are also able to take advantage of the features of this popular IDE to aid in development and off-grid debugging of our application.
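To make the two phases concrete before any grid infrastructure enters the picture, the following is a minimal, single-process sketch of word counting as a MapReduce problem. It is not taken from the WordCounter example: the class name MapReduceSketch and its map and reduce methods are hypothetical, the input documents are in-memory strings rather than files, and Java 9 or later is assumed (for Map.entry and List.of).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the WordCounter example
// or of the Compute Server API.
class MapReduceSketch {

    // Map phase: turn one input document into intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: consolidate all values for each key so that the result
    // holds exactly one total per unique word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> intermediate) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : List.of("the quick brown fox", "the lazy dog and the fox")) {
            intermediate.addAll(map(doc));
        }
        System.out.println(reduce(intermediate));
    }
}
```

Each call to map depends only on its own document, and each key can be reduced without looking at any other key. That independence is what allows both phases to be split into tasks that run in parallel.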
The MapReduce design pattern is easily implemented using the Compute Server programming model. This simple programming model helps Java developers easily and efficiently use the Sun Grid Compute Utility for the distributed execution of parallel computations. Any application that can be modeled as a set of independent, compute-bound tasks that execute in parallel, or as a series of such parallel execution phases, can take advantage of Compute Server technology. The MapReduce pattern fits this characterization, and is one of many design patterns supported by Compute Server technology.
[insert architecture diagram here]
The Compute Server interfaces implemented in the WordCounter example application are:
In the Compute Server programming model, tasks are independent units of work that are distributed across nodes on the Sun Grid Compute Utility for parallel execution. The WordCounter example includes a Map task class (WCMapTask.java) and a Reduce task class (WCReduceTask.java). WordCounter Map tasks are used to count the number of occurrences of each word in each input file. WordCounter Reduce tasks consolidate the map tasks' output to produce a total count, for each word, of all occurrences across all files.
A Compute Server task generator generates task objects. In the WordCounter example, one generator class (WCMapGenerator.java) is used during the map phase to generate Map tasks, and another generator class (WCReduceGenerator.java) is used during the reduce phase to generate Reduce tasks. The map phase has been configured to run first, followed by the reduce phase.
A JobInputProducer is used off-grid, prior to executing a job, to prepare input for use by a Compute Server application. The WordCounter example uses an input producer class (WCInputProducer) to prepare a list of files for use as input to a WordCounter job run.
An OutputProcessor is also run off-grid, and is used to retrieve and process the final results of a completed Compute Server job. The OutputProcessor used by the WordCounter example (WCOutputProcessor.java) simply prints job execution statistics and the list of word counts returned from the job run.
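The division of labor among these four roles can be sketched in plain Java. The code below is an illustration only, and it does not use or reproduce the Compute Server interfaces. Every name in it (WordCounterPipelineSketch, produceInput, mapTask, reduceTask, generateMapTasks, generateReduceTasks, runPhase, runJob, processOutput) is hypothetical, in-memory strings stand in for input files, and a local thread pool from java.util.concurrent stands in for the grid. Java 9 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-ins for the four roles described above. None of these
// names belong to the Compute Server API; a local thread pool plays the grid.
class WordCounterPipelineSketch {

    // Input producer role (off-grid): prepare the job input.
    static List<String> produceInput() {
        return List.of("the quick brown fox", "the lazy dog", "the fox");
    }

    // Map task role: count the occurrences of each word in one document.
    static Callable<Map<String, Integer>> mapTask(String document) {
        return () -> {
            Map<String, Integer> counts = new TreeMap<>();
            for (String word : document.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
            return counts;
        };
    }

    // Reduce task role: consolidate the per-document counts of a single word.
    static Callable<Map.Entry<String, Integer>> reduceTask(String word, List<Integer> counts) {
        return () -> {
            int total = 0;
            for (int c : counts) total += c;
            return Map.entry(word, total);
        };
    }

    // Map-phase generator role: one map task per input document.
    static List<Callable<Map<String, Integer>>> generateMapTasks(List<String> documents) {
        List<Callable<Map<String, Integer>>> tasks = new ArrayList<>();
        for (String document : documents) tasks.add(mapTask(document));
        return tasks;
    }

    // Reduce-phase generator role: group map output by word, one task per word.
    static List<Callable<Map.Entry<String, Integer>>> generateReduceTasks(
            List<Map<String, Integer>> mapResults) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> result : mapResults) {
            result.forEach((w, c) -> grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(c));
        }
        List<Callable<Map.Entry<String, Integer>>> tasks = new ArrayList<>();
        grouped.forEach((w, cs) -> tasks.add(reduceTask(w, cs)));
        return tasks;
    }

    // Runs every task of one phase in parallel and waits for all of them.
    static <T> List<T> runPhase(ExecutorService pool, List<Callable<T>> tasks) {
        try {
            List<T> results = new ArrayList<>();
            for (Future<T> future : pool.invokeAll(tasks)) results.add(future.get());
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("phase interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e);
        }
    }

    // The map phase runs to completion first, followed by the reduce phase.
    static Map<String, Integer> runJob(List<String> documents) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Map<String, Integer>> mapped = runPhase(pool, generateMapTasks(documents));
            Map<String, Integer> totals = new TreeMap<>();
            for (Map.Entry<String, Integer> e : runPhase(pool, generateReduceTasks(mapped))) {
                totals.put(e.getKey(), e.getValue());
            }
            return totals;
        } finally {
            pool.shutdown();
        }
    }

    // Output processor role (off-grid): print the final word counts.
    static void processOutput(Map<String, Integer> totals) {
        totals.forEach((word, count) -> System.out.println(word + "\t" + count));
    }

    public static void main(String[] args) {
        processOutput(runJob(produceInput()));
    }
}
```

In this sketch the reduce-phase generator cannot create its tasks until every map task has returned, which mirrors the phase ordering described above. On the grid, the distribution of tasks, collection of results, and failure handling that runPhase only hints at are the responsibility of Compute Server technology rather than of application code.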
Example Code and Documentation
The WordCounter example code and the supporting Compute Server technology infrastructure are freely available under open source license through the Compute Server Project. Both are included as part of the Grid Compute Server Plugin for NetBeans™ IDE, which can be downloaded from the project's download page. In order to run Compute Server technology, you will also need the Java™ SE platform and the NetBeans™ Integrated Development Environment. Once you have downloaded and installed the Compute Server plugin, simply open the WordCounter project (see Compute Server documentation for the location of example projects) to examine and run the code.
The WordCounter example also includes detailed documentation describing its implementation. If you would like to review the documentation prior to