Author: Thang M. Le email@example.com
1. INTRODUCTION
Collecting and processing data are driving forces behind business advances. However, building a system to leverage these advantages is a challenging task. Without a proper infrastructure in place, companies can neither manage nor keep up with the exponential growth of their data. In the midst of the search for a standard model for data-intensive computing, the MapReduce framework arose as a simplified distributed computing model that can effectively process data on a large cluster of commodity machines. MapReduce enjoys widespread adoption thanks to its simplicity and efficiency. Within Google, the implementation of MapReduce has consistently produced good results for applications ranging from extracting properties of web pages to indexing documents for the Google web search service. Yahoo! has developed Hadoop, an open-source implementation of the MapReduce architecture, for its data cloud computing. The Open Science Data Cloud, maintained by the Open Cloud Consortium, integrates Hadoop into its software stack for managing, analyzing, and sharing scientific data. The Phoenix project adopts MapReduce to build a scalable, high-performance system capable of processing large data sets on shared-memory machines. More and more companies are planning to build their data analysis systems on the MapReduce architecture.

Recently, there have been concerns regarding the efficiency of MapReduce systems. Direct comparisons between Hadoop and various parallel database management systems (DBMSs) have been carried out. All results indicate that parallel DBMSs perform substantially faster than Hadoop, by a factor of 3 to 6. On August 8, 2009, another benchmark was performed using the Terabyte Sort benchmark on the LexisNexis High-Performance Computing Cluster (HPCC) and Hadoop (release 0.19). In this benchmark, the HPCC system needed a total of 6 minutes and 27 seconds to complete the sorting test, while Hadoop required 25 minutes and 28 seconds.
We carefully revisited the MapReduce model to get a better understanding of some of the questionable design decisions which we think are the culprits behind the differences in performance. We also reviewed the above benchmark tests and various comparisons between parallel DBMSs and MapReduce. We came to the conclusion that the benchmarks do indicate differences in performance between the MapReduce architecture and the approach taken by commercial parallel DBMSs. However, these differences complement the overall reliability of the MapReduce system, which is much more important when operating on a large cluster of thousands of commodity machines where "component failures are the norm rather than the exception". In such an environment, MapReduce has shown its fast-recovery advantage through a sorting benchmark performed with the deaths of 200 out of 1,746 workers: the sort program completed in 933 seconds, compared with 891 seconds when there was no failure. In contrast, we have not seen this type of benchmark test for parallel DBMSs. Without statistics on how well a parallel DBMS reacts to large-scale machine failures, there is insufficient ground for a fair comparison between MapReduce and traditional parallel DBMSs. To further clarify our conclusion, we analyze the MapReduce model and compare it with the approach taken by parallel DBMSs in Section 2. We also discuss various optimization capabilities in the MapReduce model which developers should be aware of in order to speed up MapReduce executions. Even though we believe the MapReduce model is the right approach to data-intensive computing, designers and developers still need to build enhancements for optimization. In Section 3, we provide further discussion and improvements which we think would significantly contribute to the overall performance of MapReduce systems.
2. A REVIEW OF THE MAPREDUCE MODEL
The MapReduce programming model expresses an operation as two abstract functions: map and reduce. A map function operates directly on input data and generates intermediate results in the form of key/value pairs (which may contain duplicate keys). The intermediate results are then passed to the reduce function to produce a new set of key/value pairs (with no duplicate keys), which can be the final output or the input for a subsequent map function. With this flexibility, developers can easily take advantage of MapReduce in one execution or in a chained list of executions for complex analytics. Working on the consistent format of key/value pairs for input/output, and the ability to execute a chained list of MapReduce executions, are strengths of the MapReduce programming model over SQL-like languages. A map function and its associated reduce function have a producer-consumer relationship; they are executed in sequential order. In order to exploit parallelism, MapReduce further assumes:

Assumption 1: Map functions can be applied to any subset of input data items independently.
Assumption 2: Reduce functions can be applied to pairs of intermediate results in any order.
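As a concrete illustration of the model, consider the classic word-count job. The sketch below is a minimal, single-process simulation in Python; the function names and the driver are our own, and a real system would distribute the map and reduce calls across workers:

```python
from collections import defaultdict

def map_fn(_key, document):
    """Map: emit an intermediate (word, 1) pair for every word seen.
    Duplicate keys are allowed in the intermediate output."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: collapse all values for one key into a single pair."""
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce phase: one call per distinct key, so no duplicate keys remain.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
# counts == {"a": 2, "b": 2, "c": 1}
```

Note that because the output is again a set of key/value pairs, it can serve directly as the input of a further MapReduce execution, which is the chaining property discussed above.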
Assumption 1 allows MapReduce to split a lengthy execution into smaller executions. Assumption 2 lets MapReduce perform tasks concurrently with no worry about synchronization. Together, they let MapReduce split a lengthy execution into smaller tasks and then execute these tasks on many machines to achieve a high degree of concurrency. Because of this unique way of performing tasks, we relate MapReduce to distributed computing rather than parallel computing. To maximize efficiency, MapReduce introduces:

Master node: a single master node is responsible for assigning map tasks to map workers and reduce tasks to reduce workers, coordinating reduce workers, maintaining various states for each task, and taking action when workers become unavailable.

Worker: a machine which executes map tasks or reduce tasks.

Map task: there are M map tasks, which result from splitting the input files. Each map task is assigned to a worker (a map worker) for execution. The intermediate results are partitioned into R regions by hashing (typically hash(key) mod R) and kept in a memory buffer. Periodically, the buffer is written out to R regions on the local disk of the map worker.

Reduce task: there are R reduce tasks. Each reduce task is assigned to a worker (a reduce worker). When a map task is completed, the map worker registers the locations of its regions with the master node. In response, the master node pushes the updated information to all current reduce workers. Upon receiving the notification, a reduce worker performs a remote procedure call (RPC) to read the data stored on the local disk of the map worker. The reduce worker then sorts this data by the intermediate keys and passes the sorted results to the user-defined reduce function.
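The partitioning step can be sketched in a few lines of Python. This is an illustration under the assumption of R = 4 regions; crc32 stands in for the system's partitioning hash because Python's built-in hash() is salted per process and would disagree across workers:

```python
import zlib

R = 4  # number of reduce tasks; an illustrative value chosen by the user

def partition(key: str, r: int = R) -> int:
    # A stable stand-in for the partitioning hash: crc32 yields the same
    # region for the same key on every machine, unlike Python's hash().
    return zlib.crc32(key.encode("utf-8")) % r

# Every occurrence of a key lands in the same region, so exactly one
# reduce worker ends up seeing all intermediate values for that key.
assert partition("apple") == partition("apple")
assert all(0 <= partition(w) < R for w in ["a", "b", "c"])
```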
At its foundation, the MapReduce system embraces a shared-nothing hardware design. This is also the architecture used by many parallel DBMSs; the alternatives are shared-memory and shared-disk. In the shared-nothing paradigm, machines work on their own memory and disks, exchanging data with each other through message-based communication over a high-speed network. DeWitt and Gray provide full details on the advantages of using the shared-nothing design in parallel DBMSs. With the shared-nothing design, a parallel DBMS system naturally performs computation right at the node storing the data. This follows the
important principle of "move the code to the data" in data-intensive computing, which immediately results in minimal network traffic. In contrast, the MapReduce model does not get this optimal locality for free. Given the assumption that the contents of input files are stored locally on the workers which make up a MapReduce cluster, the MapReduce model requires complex scheduling logic to correctly assign a map task processing certain input files to a worker which has these files in local storage. If such scheduling is not possible, the logic should attempt to assign the map task to a worker near the one that has the data in its local storage. This is an uncertainty in the MapReduce model which can cause unexpected performance. Future improvements should allow MapReduce systems to inherit optimal task locality directly from the MapReduce model with less implementation effort.

MapReduce differentiates a map function from a reduce function. It further distinguishes a map worker from a reduce worker. Executing a map function on one machine and a reduce function on another machine requires moving local data stored on the map worker to the reduce worker. The amount of data transferred between workers is the main overhead in MapReduce. In parallel DBMSs, exchanged data is minimized thanks to built-in query optimizers. For example, if a query aggregates the items matched by a sequential search, the size of the "push-back" data is negligible compared with the amount of intermediate results that have to be pulled by reduce workers in MapReduce. To compensate for this issue, MapReduce introduces the combiner function, which performs a partial merge on intermediate results prior to the completion of a map task. Combiner functions essentially play the role of query optimizers in MapReduce systems. Separating a map worker from a reduce worker provides various advantages rather than drawbacks.
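A combiner can be sketched as follows: a partial merge performed on the map worker's local output before any data crosses the network. The function name and driver are our own illustration; in a real system the framework invokes the combiner between the map and shuffle phases:

```python
from collections import Counter

def combiner(intermediate_pairs):
    """Partial merge: sum the values of (word, 1) pairs that share a key,
    locally on the map worker, before the shuffle."""
    combined = Counter()
    for word, count in intermediate_pairs:
        combined[word] += count
    return list(combined.items())

# A map task over "the cat sat on the mat" emits six (word, 1) pairs;
# the combiner merges the two "the" pairs, shrinking shuffle volume.
pairs = [(w, 1) for w in "the cat sat on the mat".split()]
merged = dict(combiner(pairs))
assert merged == {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
```

Because word-count's reduce is associative and commutative, the same aggregation logic works both as a combiner and as the final reduce, which is why the two often share code.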
The biggest benefit is that MapReduce systems can achieve fast recovery, dynamic load balancing, and efficiency. In an environment where hardware failures are common, failure resilience and fast recovery are vital to the design of distributed systems. In MapReduce, completed map tasks of a failed worker must be re-executed, since the intermediate data stored locally on the failed worker are no longer accessible. In contrast, the output of completed reduce tasks of a failed worker is already stored in the global file system, so no further work is needed. For each execution, a master node keeps track of the states of all M map tasks and all R reduce tasks. Maintaining these states at all times allows the master node to effectively determine and reschedule completed map tasks for re-execution. Such fine-grained recovery is not possible in parallel DBMSs. Periodically, a map worker writes its memory buffer to local disk, for two reasons. First, intermediate results are presumably much larger than the memory capacity of a map worker, so writing out to local disk is unavoidable. (In this respect, we suspect parallel DBMSs also perform various I/O operations internally.) Second, writing the in-memory data out to disk in R regions expedites concurrent data transfer over the network. This advantage relies on a high-performance distributed file system with support for parallel I/O operations; the Google File System is an example of this type of distributed file system.

3. DISCUSSIONS AND IMPROVEMENTS
There are "hot spots" in the MapReduce model where a large number of tasks are executed concurrently; master nodes and map workers are the two. With M map tasks and R reduce tasks, a master node has to make O(M + R) scheduling decisions and maintain O(M × R) states. Memory consumption is not an issue, but the high degree of concurrency placed on the master node is a big concern.
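The master's bookkeeping and the fine-grained recovery rule described above can be sketched as a simplified in-memory model. The task dictionaries and helper function are our own illustration, not the system's actual data structures:

```python
# Recovery rule: completed map tasks on a failed worker return to "idle"
# (their intermediate output died with the worker's local disk), while
# completed reduce tasks stay done (their output is in the global file
# system). In-progress tasks of either kind are reset.

def handle_worker_failure(tasks, failed_worker):
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map":
            task["state"] = "idle"          # re-execute on another worker
        elif task["state"] != "completed":  # in-progress reduce task
            task["state"] = "idle"
    return tasks

tasks = [
    {"kind": "map", "state": "completed", "worker": "w1"},
    {"kind": "reduce", "state": "completed", "worker": "w1"},
    {"kind": "map", "state": "completed", "worker": "w2"},
]
handle_worker_failure(tasks, "w1")
assert tasks[0]["state"] == "idle"       # map output on w1 is lost
assert tasks[1]["state"] == "completed"  # reduce output survived in the global FS
assert tasks[2]["state"] == "completed"  # w2 was unaffected
```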
Running on a large cluster of thousands of machines, the master node frequently needs to process thousands of simultaneous update requests from map workers that have just finished their map tasks. This high level of parallel processing substantially impacts efficiency. Load balancing should be in place to
distribute computations across master nodes. In a similar manner, map workers face the same problem. Once a master node notifies reduce workers of a completed map task, these reduce workers independently pull data from the local disk of the completed map worker. This imposes "back-pressure" on the running system. MapReduce relies on its underlying distributed file system to balance requests across a map worker's replicas.

Even though the MapReduce paradigm contributes to a scalable, high-performance distributed system applicable to data-intensive computing, the model receives many complaints for taking a "brute force" approach to a problem. Working with unstructured data is a main reason. How much difference does operating on structured data make? A trivial example: searching for an item in an unsorted list of n items requires O(n) time, while the same operation on a sorted list costs O(log n). In a case where the data is petabytes in size, an O(log n) result is attractive. Can MapReduce achieve a better running time than O(n)? The answer is simply no. This explains why HPCC and commercial parallel DBMSs take the lead over MapReduce systems in practice. The picture is not entirely bleak, since the door to efficiency is wide open. MapReduce maintains a strict single pass over the data content. In return, MapReduce users should define how much data the system needs to parse. A naïve way is to process all existing files. However, if metadata are built in place for guidance, the number of input files can be reduced substantially. Metadata are extensively utilized in parallel DBMSs and the HPCC system. For this reason, not only does a MapReduce system have to do its best, but its users are also responsible for providing an optimal set of input files. Where appropriate, MapReduce should work with structured data stored in a distributed storage system such as BigTable rather than a generic distributed file system such as GFS.
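The asymptotic gap between searching unsorted and sorted data can be seen in a few lines of Python (the standard bisect module implements the O(log n) binary search; the data values are arbitrary):

```python
import bisect

def linear_search(items, target):
    """O(n): an unsorted list may force a scan over every element."""
    for i, x in enumerate(items):
        if x == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """O(log n): halves the search space on each comparison."""
    i = bisect.bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = [7, 3, 9, 1, 5]
assert linear_search(data, 9) == 2
assert binary_search(sorted(data), 9) == 4  # sorted(data) == [1, 3, 5, 7, 9]
```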
4. CONCLUSION
MapReduce is well suited for processing data on large-scale clusters. Its strong support for data-intensive applications has its roots in the programming model of map and reduce functions, which can be executed concurrently across the machines of a cluster. The model has many advantages over traditional parallel DBMSs, including a powerful programming model compared with the SQL language, better fault tolerance, and fast recovery. The performance differences mentioned above are mainly due to the gap in optimization between a specific implementation and highly optimized, commercial-grade parallel DBMSs. Many optimizations can be introduced later, when the system is more mature. Future improvements to MapReduce should focus on the ease of achieving optimal task locality and on lowering the degree of concurrency at master nodes and map workers.

REFERENCES
Bigtable: A Distributed Storage System for Structured Data. http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf

MapReduce: Simplified Data Processing on Large Clusters. http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/mapreduce-osdi04.pdf

David J. DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Processing. Communications of the ACM, 35(6):85-98, 1992.

The Google File System. http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/gfs-sosp2003.pdf

An Overview of the Open Science Data Cloud. HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010.

Anthony M. Middleton. Data-Intensive Technologies for Cloud Computing. In Handbook of Cloud Computing. Springer, 2010.

A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD '09: Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 165-178, 2009.

M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, 53(1):64-71, 2010.

Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System. http://mapreduce.stanford.edu/