
MapReduce Work

1. MapReduce splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
2. A map task takes (key, value) pairs as input and produces (key, value) pairs as output (a minimal example job is sketched after this list).
3. The framework sorts the outputs of the maps, which then become the input to the reduce tasks.
4. Both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
5. Typically the compute nodes and the storage nodes are the same.
6. The MapReduce framework and the distributed file system run on the same set of nodes.
7. This configuration allows the framework to schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
8. Two types of nodes control the job execution process: jobtrackers and tasktrackers.
9. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
10. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
11. If a task fails, the jobtracker can reschedule it on a different tasktracker.
12. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
13. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
14. The quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default. Map tasks write their output to local disk, not to HDFS.
15. Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. Storing it in HDFS, with replication, would therefore be wasteful.
16. It is also possible that the node running the map task fails before the map output has been consumed by the reduce task.
17. Reduce tasks do not have the advantage of data locality; the input to a single reduce task is normally the output from all mappers.
18. In the case of a single reduce task fed by all of the map tasks, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function.
19. The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with the other replicas stored on off-rack nodes. There can be many mappers and reducers; how many to use is decided by the jobtracker.
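To make the map/sort/reduce flow above concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API. The class names and the input/output paths are illustrative only, not part of the notes above.

```java
// Minimal word-count job: the map tasks emit (word, 1) pairs, the framework
// sorts/groups them by key, and the reduce tasks sum the counts per word.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: input (byte offset, line of text) -> output (word, 1)
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: input (word, [1, 1, ...]) -> output (word, total count)
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // optional map-side aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the job driver mirrors the notes: the framework handles splitting, scheduling, sorting, and re-execution; the programmer only supplies the map and reduce functions.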

HDFS

1. Hadoop comes with the Hadoop Distributed File System (HDFS).
2. A file system that manages storage across a network is called a distributed file system.
3. HDFS is a distributed file system which holds large amounts of data (petabytes or terabytes) and provides high-throughput access to that information.
4. HDFS stores data or files in blocks: files in HDFS are divided into blocks and then stored as independent units.
5. The typical block size for a file system is 512 bytes, but for HDFS it is a much larger unit, 64 MB by default.
6. The blocks of files are replicated for fault tolerance. To deal with disk and machine failures and corrupted blocks, each block is replicated to a number of different computers (3 by default).
7. The block size and the number of replicas are configurable per file (see the client-API sketch after this list).
8. An HDFS cluster has two types of nodes operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (the workers). HDFS is thus composed of one NameNode and multiple DataNodes; this architecture is shown in Fig 2.4 (HDFS Architecture).
9. The NameNode is responsible for metadata management and the DataNodes store the data.
10. The functions of the NameNode are to manage the file system namespace and to maintain the file system tree and the metadata for all the files and directories in the tree. This information is stored on local disk in the form of two files: the namespace image and the edit log.
11. The NameNode also knows the DataNodes on which all the blocks for a given file are located. However, it does not store block locations persistently, because this information is reconstructed from the DataNodes each time the system starts.
12. The NameNode makes all decisions regarding the replication of blocks. It periodically receives heartbeats and block reports from each DataNode in the cluster: receipt of a heartbeat tells the NameNode that the DataNode is functioning properly, and a block report contains the list of all blocks stored on that DataNode.
13. The heartbeat message is also used for fault tolerance; it additionally contains CPU usage and the status of each task, and this information is used for task scheduling by the JobTracker.
14. DataNodes store and retrieve blocks when a client or the NameNode asks them to, and they report back to the NameNode with the list of blocks they hold.
15. A client accesses the file system on behalf of the user by communicating with the NameNode and the DataNodes.
16. HDFS supports a traditional hierarchical file system: any user or application can create directories and store files in them.
17. In most configurations, one dedicated node works as the JobTracker and the NameNode, and all other nodes work as TaskTrackers and DataNodes.
18. A task is referred to as a straggler if its progress is slower than that of the other tasks; if it does not complete along with the other tasks, MapReduce allocates it to an idle node, for which data migration takes place.
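As a rough sketch of points 4-7 above (blocks, per-file replication factor and block size), the following fragment uses the standard HDFS FileSystem client API. The NameNode address and file path are placeholder values; in a real deployment they would come from the cluster configuration.

```java
// Illustrative HDFS client usage: write a file with an explicit per-file
// replication factor and block size, then read it back.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; normally read from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/sample.txt");   // hypothetical path

    // Per-file settings: 3 replicas, 64 MB blocks (the classic HDFS default).
    short replication = 3;
    long blockSize = 64L * 1024 * 1024;
    int bufferSize = 4096;

    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back; the client asks the NameNode for block locations
    // and then streams the data directly from the DataNodes.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}
```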

HPC and MapReduce

1. Today's research deals with the increasing volume and complexity of data produced by ultra-scale simulations and high-resolution scientific equipment and experiments. Such data sets present researchers with many challenges in representing, processing and managing them.
2. These data sets are complex, which affects the way they are stored in a distributed file system (DFS). They are stored in parallel and distributed file systems and frequently retrieved for analytics applications. In HPC analytics, the raw data obtained from simulations/experiments needs to be stored in a data-intensive file system in a format that is useful for subsequent analytics applications.
3. Currently, scientists are using frameworks like MapReduce and require data sets to be copied to the accompanying distributed and parallel file system, e.g. HDFS. However, the way to access data in HDFS-like file systems is to use the MapReduce programming abstraction.
4. There is an information gap between the way current HPC applications write data and the way these new file systems store data, which generates unoptimized writes to the file system. This information gap becomes critical because the source of the data and the commodity-based system do not have the same data semantics.
5. Because MapReduce is not designed for semantics-based HPC analytics, some of the existing analytics applications use multiple MapReduce programs to specify and analyze data.
6. For example, consider an application which needs to merge different data sets and then extract subsets of that data. Two MapReduce programs are used to implement this access pattern: the first program merges the data sets and the second extracts the subsets (a sketch of this two-job pattern follows this list).
7. The overhead of this approach is quantified as 1) the effort to transform the data patterns into MapReduce programs, 2) the number of lines of code required for MapReduce data preprocessing, and 3) the performance penalty of reading excessive data from disk in each MapReduce program.
8. The challenge is to find the best way to retrieve data from the file system while following the semantics of the HPC data. One way to approach this problem is to use a high-level programming abstraction for specifying the semantics and bridging the gap between the way data was written and the way it will be accessed.
9. A framework based on MapReduce, called MapReduce with Access Patterns (MRAP), is capable of understanding data semantics. It is a unique combination of the data access semantics and the programming framework (MapReduce) used in implementing HPC analytics applications.
10. The aim of this work is to utilize the scalability and fault tolerance benefits of MapReduce and combine them with scientific access patterns, simplifying the writing of analytics applications and potentially improving performance by reducing MapReduce phases.
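The merge-then-subset example in point 6 can be written as two chained Hadoop jobs, as in the sketch below. The mapper classes, paths, and filter predicate are hypothetical and serve only to illustrate the overhead in point 7: the second job must re-read the entire merged data set from disk.

```java
// Sketch of the two-job access pattern described above: job 1 merges several
// input data sets into one intermediate data set, job 2 scans that output and
// extracts a subset. Class, path, and predicate names are illustrative only.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeThenSubset {

  // Job 1: pass every record through; with several input paths this simply
  // merges the data sets into one output directory.
  public static class MergeMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(value, NullWritable.get());
    }
  }

  // Job 2: keep only records matching a (hypothetical) filter condition.
  public static class SubsetMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      if (value.toString().contains("REGION_A")) {   // placeholder predicate
        ctx.write(value, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path merged = new Path("/tmp/merged");           // intermediate data set

    Job merge = Job.getInstance(conf, "merge data sets");
    merge.setJarByClass(MergeThenSubset.class);
    merge.setMapperClass(MergeMapper.class);
    merge.setNumReduceTasks(0);                      // map-only merge
    merge.setOutputKeyClass(Text.class);
    merge.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(merge, new Path("/data/set1"));
    FileInputFormat.addInputPath(merge, new Path("/data/set2"));
    FileOutputFormat.setOutputPath(merge, merged);
    if (!merge.waitForCompletion(true)) System.exit(1);

    Job subset = Job.getInstance(conf, "extract subset");
    subset.setJarByClass(MergeThenSubset.class);
    subset.setMapperClass(SubsetMapper.class);
    subset.setNumReduceTasks(0);
    subset.setOutputKeyClass(Text.class);
    subset.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(subset, merged);    // re-reads the merged data
    FileOutputFormat.setOutputPath(subset, new Path("/data/subset"));
    System.exit(subset.waitForCompletion(true) ? 0 : 1);
  }
}
```

MRAP aims to collapse such multi-phase patterns into a single specification, as described next.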

The MRAP framework consists of three components to handle these two patterns:
1. The MRAP API, provided to eliminate the multiple MapReduce phases used to specify data access patterns.
2. MRAP data restructuring, provided to further improve the performance of access patterns with the small I/O problem.
3. MRAP data-centric scheduling, provided to improve the performance of access patterns with data chunks distributed across nodes.