Apache Hadoop and MapReduce

Apache Hadoop is an active top-level project of the Apache Software Foundation (ASF), distributed under the Apache License 2.0 (a free software license). Hadoop's main objective is to provide a framework that supports data-intensive distributed applications in a scalable and reliable manner. The framework consists of several software libraries that allow distributed processing of large data sets across clusters of computers using a simple, well-defined programming model. It lets users scale from a single server to thousands of machines working as a cluster, with each machine offering local storage and computation. Hadoop is written in Java and runs cross-platform. It was inspired by Google's MapReduce and Google File System papers and was initially developed to support distribution in the Nutch search engine project. Apache Hadoop 1.0.0, released on December 27, 2011, is the current stable version.

There are three subprojects associated with Apache Hadoop: Hadoop Common, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce. Hadoop Common contains the common utilities, source code, and documentation needed by the other subprojects; HDFS provides a distributed file system that ensures high-throughput access to application data; and Hadoop MapReduce is a framework for writing applications that rapidly process huge amounts of data in parallel on large clusters of compute nodes. A small Hadoop cluster consists of a single master running the JobTracker, TaskTracker, NameNode, and DataNode daemons, plus multiple worker nodes. A worker (slave) node acts as both a DataNode and a TaskTracker, although it is possible to have data-only and compute-only worker nodes.
In a large cluster, HDFS is managed via a dedicated NameNode server that hosts the file system index, along with a secondary NameNode that generates snapshots of the NameNode's memory structures to reduce the risk of file system corruption and data loss.

MapReduce

A MapReduce job is parallelized by splitting the input data set into several chunks, which are first processed by map tasks. The framework then sorts the output of the maps, and this sorted output serves as the input to the reduce tasks. Typically, both the input and the output of a job are stored in a file system, and the framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The JobTracker is responsible for scheduling the jobs' component tasks on the slaves and for monitoring them to ensure error-free execution; the TaskTracker executes the tasks directed to it by the master (JobTracker). An application has to specify the locations of its input and supply the map and reduce functions by implementing the appropriate interfaces and/or abstract classes. Together with the job configuration, the Hadoop job client submits the job to the JobTracker, which then takes responsibility for scheduling and monitoring the tasks and for providing status and diagnostic information back to the job client.
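The data flow described above — split the input into chunks, apply a map function to each chunk, sort and group the map output by key, then apply a reduce function to each group — can be sketched as a small single-process Java program. This is only an illustration of the programming model (a word count, the canonical MapReduce example), not the Hadoop API itself; the class and method names here are hypothetical.

```java
import java.util.*;

public class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in one input chunk.
    // On a real cluster, one map task runs per input split.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reduce phase: fold all values collected for one key into a result —
    // here, summing the per-occurrence counts for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    static Map<String, Integer> run(List<String> chunks) {
        // Shuffle/sort: group every mapped (key, value) pair by key, with
        // keys kept sorted, as the framework does between map and reduce.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String chunk : chunks) {
            for (Map.Entry<String, Integer> kv : map(chunk)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        // Reduce each group independently; on a cluster these groups are
        // partitioned across reduce tasks and processed in parallel.
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        // Two "chunks" standing in for input splits stored on different nodes.
        List<String> chunks = List.of("the quick brown fox",
                                      "the lazy dog the end");
        System.out.println(run(chunks));
    }
}
```

Note that the map and reduce calls never share state: each map task sees only its own chunk, and each reduce call sees only the values for one key. That independence is what lets the JobTracker schedule the tasks across many nodes and simply re-execute any task that fails.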
