Iroshan Priyantha

Apache Hadoop Project

Apache Hadoop is a software framework for running applications on large clusters built of commodity hardware. The framework transparently provides applications with both reliability and data motion. It is designed to scale from a single server up to thousands of machines, each offering local computation and storage.

Hadoop provides a distributed file system, the Hadoop Distributed File System (HDFS), that stores data on the compute nodes. It uses the existing file system of the operating system but extends it with redundancy and distribution. Data stored in HDFS can then be filtered and aggregated. HDFS hides the complexity of distributed storage and redundancy from the programmer.

The other subprojects of Hadoop are Hadoop Common and Hadoop MapReduce. Hadoop Common includes the common utilities that support the other subprojects. Apart from these main subprojects, there are many Hadoop-related projects such as Avro, Cassandra, HBase, Chukwa and Hive.

Hadoop MapReduce

Hadoop MapReduce is a software framework for processing large datasets in a parallel, distributed fashion. It allows users to easily write applications that process vast amounts of data in parallel on large clusters. A MapReduce program has two components, Map and Reduce, which together transform lists of input data elements into lists of output data elements. In more formal functional mapping and reducing, a mapper must produce exactly one output element for each input element, and a reducer must produce exactly one output element for each input list. Hadoop relaxes the first of these constraints, as described next.

The Map transform takes an input data row, consisting of a key and a value, and transforms it into output key/value pairs. In MapReduce no value stands on its own: every value has a key associated with it, and keys identify related values. For a single input, the map returns a list containing zero or more pairs, so the output can differ from the input by having multiple entries and different keys. When the map operation outputs its pairs, they are available in memory.
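The Map transform described above can be illustrated with a minimal in-memory sketch. It is plain Python rather than the Hadoop API, and the names map_words and run_map, along with the word-count task itself, are assumptions chosen only for illustration. Each input record is a (key, value) pair, and the mapper emits zero or more output pairs for it.

```python
from typing import Iterable, Iterator, List, Tuple


def map_words(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical mapper: `key` is a line offset (unused here),
    `value` is the text of the line. Emits one (word, 1) pair per word,
    so an empty line produces zero output pairs."""
    for word in value.split():
        yield (word.lower(), 1)


def run_map(records: Iterable[Tuple[int, str]]) -> List[Tuple[str, int]]:
    """Apply the mapper to every input record and collect the emitted
    pairs in memory, in the order they were produced."""
    pairs: List[Tuple[str, int]] = []
    for key, value in records:
        pairs.extend(map_words(key, value))
    return pairs


if __name__ == "__main__":
    lines = [(0, "Hadoop stores data"), (19, ""), (20, "data data")]
    for pair in run_map(lines):
        print(pair)
```

Note that the output keys (words) differ from the input keys (line offsets), and that one input row can yield several output pairs or none at all.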
The Reduce transform takes all the values for a specific key and generates a new list of reduced output. When a reduce task starts, its input is scattered across many files on all the nodes where map tasks ran. When the job runs in distributed mode, these files must first be copied to the local file system of the node running the reduce task. Once all the data are available, the values can be aggregated. A reducer function receives an iterator of input values from an input list. It then combines these values and returns a single output value. Reducing is often used to produce summary data, turning a large volume of data into a smaller summary of itself.
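The Reduce transform can likewise be sketched in plain Python. This is an in-memory illustration under stated assumptions, not the Hadoop API: the names group_by_key, reduce_sum and run_reduce are hypothetical, the input is assumed to be word-count style (word, 1) pairs, and the copying of files between nodes is left out entirely. The values are grouped by key, and the reducer receives an iterator over each group and returns a single output value.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect all values that share a key, standing in for the step that
    gathers map output for each key before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_sum(key: str, values: Iterator[int]) -> Tuple[str, int]:
    """Hypothetical reducer: combine every value for `key` into one sum."""
    return (key, sum(values))


def run_reduce(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Group the pairs by key, then call the reducer once per key.
    Keys are processed in sorted order to keep the output deterministic."""
    grouped = group_by_key(pairs)
    return [reduce_sum(key, iter(values)) for key, values in sorted(grouped.items())]


if __name__ == "__main__":
    map_output = [("data", 1), ("hadoop", 1), ("data", 1), ("stores", 1)]
    for key, total in run_reduce(map_output):
        print(key, total)
```

Four input pairs become three summarized pairs here, which is the "smaller summary" behaviour of a reduce on a tiny scale.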
[Figures: Map Transform; Reduce Transform]
