
Lab 2

Title: Hadoop Lab Report


Objectives:

- To understand the current trends and future developments in big data processing using Hadoop
- To become familiar with the Hadoop Distributed File System (HDFS) and the MapReduce programming model
- To learn how to handle large amounts of data and perform distributed processing
- To understand how to use the different components of the Hadoop ecosystem, such as Pig, Hive, and HBase
- To understand how to optimize the performance of a Hadoop cluster
- To understand the concepts of data warehousing and data mining using Hadoop

Theory

Introduction:

Apache Hadoop is an open-source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one
large computer to store and process the data, Hadoop allows clustering multiple
computers to analyze massive datasets in parallel more quickly.

Hadoop consists of four main modules:

- Hadoop Distributed File System (HDFS) – A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support for large datasets.

- Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.

- MapReduce – A framework that helps programs perform parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key-value pairs. The output of the map task is consumed by reduce tasks to aggregate the output and provide the desired result (see the worked example after this list).

- Hadoop Common – Provides common Java libraries that can be used across all modules.
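To make the map and reduce phases concrete, here is a small hand-worked word-count trace. The input line is illustrative only and is not taken from the lab's words.txt:

Input line:     big data needs big clusters
Map output:     (big, 1) (data, 1) (needs, 1) (big, 1) (clusters, 1)
Shuffle/sort:   big -> [1, 1]   clusters -> [1]   data -> [1]   needs -> [1]
Reduce output:  (big, 2) (clusters, 1) (data, 1) (needs, 1)
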
Hadoop History

As the World Wide Web grew in the late 1990s and early 2000s, search engines and
indexes were created to help locate relevant information amid the text-based content. In
the early years, search results were returned by humans. But as the web grew from dozens
to millions of pages, automation was needed. Web crawlers were created, many as
university-led research projects, and search engine start-ups took off (Yahoo, AltaVista,
etc.).

Why is Hadoop important?

- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.

- Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

- Flexibility. Unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.

- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

Popular uses of Hadoop today include:

- Low-cost storage and data archive

The modest cost of commodity hardware makes Hadoop useful for storing and combining
data such as transactional, social media, sensor, machine, scientific, click streams, etc.
The low-cost storage lets you keep information that is not deemed currently critical but
that you might want to analyze later.

- Sandbox for discovery and analysis

Because Hadoop was designed to deal with volumes of data in a variety of shapes and
forms, it can run analytical algorithms. Big data analytics on Hadoop can help your
organization operate more efficiently, uncover new opportunities and derive next-level
competitive advantage. The sandbox approach provides an opportunity to innovate with
minimal investment.

- Data lake

Data lakes support storing data in its original or exact format. The goal is to offer a raw or unrefined view of data to data scientists and analysts for discovery and analytics. It helps them ask new or difficult questions without constraints. Data lakes are not a replacement for data warehouses. In fact, how to secure and govern data lakes is a huge topic for IT. They may rely on data federation techniques to create a logical data structure.

- Complement your data warehouse

We're now seeing Hadoop beginning to sit beside data warehouse environments, with certain data sets being offloaded from the data warehouse into Hadoop, or new types of data going directly to Hadoop. The end goal for every organization is to have the right platform for storing and processing data of different schemas, formats, etc., to support different use cases that can be integrated at different levels.

- IoT and Hadoop

Things in the IoT need to know what to communicate and when to act. At the core of the
IoT is a streaming, always-on torrent of data. Hadoop is often used as the data store for
millions or billions of transactions. Massive storage and processing capabilities also allow
you to use Hadoop as a sandbox for discovery and definition of patterns to be monitored
for prescriptive instruction. You can then continuously improve these instructions,
because Hadoop is constantly being updated with new data that doesn’t match previously
defined patterns.

What are the challenges of using Hadoop?

MapReduce programming is not a good match for all problems. It’s good for simple
information requests and problems that can be divided into independent units, but it's not
efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because
the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms
require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files
between MapReduce phases and is inefficient for advanced analytic computing.

There's a widely acknowledged talent gap. It can be difficult to find entry-level
programmers who have sufficient Java skills to be productive with MapReduce. That's
one reason distribution providers are racing to put relational (SQL) technology on top of
Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills.
And, Hadoop administration seems part art and part science, requiring low-level
knowledge of operating systems, hardware and Hadoop kernel settings.

Data security. Another challenge centers around the fragmented data security issues,
though new tools and technologies are surfacing. The Kerberos authentication protocol is
a great step toward making Hadoop environments secure.

Full-fledged data management and governance. Hadoop does not have easy-to-use,
full-feature tools for data management, data cleansing, governance and metadata.
Especially lacking are tools for data quality and standardization.

Hadoop Architecture

At its core, Hadoop has two major layers, namely:

- Processing/Computation layer (MapReduce), and

- Storage layer (Hadoop Distributed File System).

Fig: Architecture of Hadoop


MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop, which is an Apache open-source framework.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS)
and provides a distributed file system that is designed to run on commodity hardware. It
has many similarities with existing distributed file systems. However, the differences
from other distributed file systems are significant. It is highly fault-tolerant and is
designed to be deployed on low-cost hardware. It provides high throughput access to
application data and is suitable for applications having large datasets.
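
As a quick familiarization, HDFS can be exercised from the command line with the standard hadoop fs shell. The following is a minimal sketch; the directory /user/hadoop/demo and the local file sample.txt are illustrative names and are not part of this lab's setup:

hadoop fs -mkdir -p /user/hadoop/demo        # create a directory in HDFS
hadoop fs -put sample.txt /user/hadoop/demo  # copy a local file into HDFS
hadoop fs -ls /user/hadoop/demo              # list the directory contents
hadoop fs -cat /user/hadoop/demo/sample.txt  # print the file stored in HDFS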

Apart from the above-mentioned two core components, the Hadoop framework also includes the following two modules:

- Hadoop Common – These are Java libraries and utilities required by other Hadoop modules.

- Hadoop YARN – This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. So, the first motivational factor behind using Hadoop is that it runs across clustered, low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −

- Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); see the configuration sketch after this list.
- These files are then distributed across various cluster nodes for further processing.
- HDFS, being on top of the local file system, supervises the processing.
- Blocks are replicated to handle hardware failure.
- Checking that the code was executed successfully.
- Performing the sort that takes place between the map and reduce stages.
- Sending the sorted data to a certain computer.
- Writing the debugging logs for each job.
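
The block size and replication behaviour described in this list are driven by the HDFS configuration. Below is a minimal sketch of the relevant hdfs-site.xml properties; the values shown are the usual Hadoop 2.x defaults (128 MB blocks, 3 replicas) and may differ on your cluster:

<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>   <!-- 128 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- each block is stored on 3 nodes -->
  </property>
</configuration>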

Advantages of Hadoop

- The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
- Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
- Servers can be added to or removed from the cluster dynamically and Hadoop continues to operate without interruption.
- Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.

Source Code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountTest {

    public static void main(String[] args) throws Exception {
        // Job configuration: wire up the driver, mapper, reducer and I/O paths.
        Configuration conf = new Configuration();
        Path inputFolder = new Path(args[0]);
        Path outputFolder = new Path(args[1]);

        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCountTest.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, inputFolder);
        FileOutputFormat.setOutputPath(job, outputFolder);

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: emits (WORD, 1) for every word in each input line.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts for each word and writes (WORD, total).
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
Folder Structure:

After a successful run, go to the browser: 127.0.0.1:8888

Click Launch Dashboard. Sign in with the username and password "raj_ops".

Go to the Files View (HDFS) of Hadoop.

Create the input_data folder and an input folder inside input_data, then upload words.txt. Also upload the compiled WordCount.jar inside the input_data folder.

You can check the files with: hadoop fs -ls /input_data
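
Equivalently, the folders, uploads and job run can be done from the sandbox's command line. The following is a hedged sketch, assuming words.txt and WordCount.jar are in the current local directory and using /output_data as an illustrative (not pre-existing) output path; part-r-00000 is the standard name of the first reducer's output file:

hadoop fs -mkdir -p /input_data/input                                   # create the HDFS folders
hadoop fs -put words.txt /input_data/input/                             # upload the input text
hadoop fs -ls /input_data                                               # verify the upload
hadoop jar WordCount.jar WordCountTest /input_data/input /output_data   # run the MapReduce job
hadoop fs -cat /output_data/part-r-00000                                # inspect the result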


Input:

words.txt contains:

Output:
Discussion:
In this Hadoop lab, I learned about Hadoop, the implementation of Hadoop, its advantages and disadvantages, and its challenges.
Apache Hadoop is an open-source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one
large computer to store and process the data, Hadoop allows clustering multiple
computers to analyze massive datasets in parallel more quickly. As the World Wide Web
grew in the late 1990s and early 2000s, search engines and indexes were created to help
locate relevant information amid the text-based content. In the early years, search results
were returned by humans. But as the web grew from dozens to millions of pages,
automation was needed. Web crawlers were created, many as university-led research
projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
Low-cost storage, a sandbox for discovery and analysis, data lakes, complementing the data warehouse, and IoT are among the popular uses of Hadoop.

The challenge areas for Hadoop are that MapReduce programming is not a good match for all problems, there is a widely acknowledged talent gap, data security, and full-fledged data management and governance. MapReduce, HDFS, the YARN framework, and the common utilities are the most important components of the Hadoop architecture. Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB). These files are then distributed across various cluster nodes for further processing. HDFS, being on top of the local file system, supervises the processing, and blocks are replicated to handle hardware failure. Hadoop also checks that the code was executed successfully, performs the sort that takes place between the map and reduce stages, sends the sorted data to a certain computer, and writes the debugging logs for each job.

The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores. Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer. Servers can be added to or removed from the cluster dynamically and Hadoop continues to operate without interruption. Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.
Conclusion
According to my Hadoop lab objectives, my goals were fully achieved. I was able to learn about Hadoop, its implementation, its usage, and its advantages and disadvantages.

So, my goals were fully achieved.
