Program: B.TECH
Course Code: BCSE4036
Course Name: Cloud Computing Technologies
School of Computing Science and Engineering
Course Outcomes:
CO 1: Able to analyse, design and implement sustainable and ethical solutions in the field of computer science.
Syllabus
MapReduce – Hadoop Distributed File System – Hadoop I/O – Developing MapReduce Applications – Hadoop Cluster.
MapReduce
• It is a powerful paradigm for parallel computation.
Traditional way → Smart way → Smarter way?
MapReduce
MapReduce is a paradigm with two phases: the map phase and the reduce phase.
In the map phase, the input is given in the form of key-value pairs, and the output of the map is fed to the reducer as input. The reducer runs only after all mappers have finished. The reducer also takes its input in key-value format, and the output of the reducer is the final output.
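As a small worked illustration of the two phases (a hypothetical two-line word-count input, not taken from the slides):

Input lines:         "cloud computing", "cloud storage"
Map output:          (cloud, 1), (computing, 1), (cloud, 1), (storage, 1)
Grouped for reduce:  (cloud, [1, 1]), (computing, [1]), (storage, [1])
Reduce output:       (cloud, 2), (computing, 1), (storage, 1)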
Usage of MapReduce
• It can be used in various applications such as document clustering, distributed sorting, and web link-graph reversal.
• It can be used for distributed pattern-based searching.
Why DFS?
1 Machine: 4 I/O channels, each channel 100 MB/s – read 1 TB of data.
10 Machines: 4 I/O channels each, each channel 100 MB/s – read the same 1 TB of data in parallel.
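To make the comparison concrete (a rough calculation, taking 1 TB ≈ 1,000,000 MB): one machine with 4 channels at 100 MB/s has an aggregate read bandwidth of 400 MB/s, so scanning 1 TB takes about 1,000,000 MB ÷ 400 MB/s = 2,500 seconds, roughly 42 minutes. Ten such machines, each reading one tenth of the data in parallel, finish in about 250 seconds, roughly 4 minutes. This tenfold speed-up from parallel I/O is the motivation for a distributed file system.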
Hadoop I/O
What is Hadoop?
In a nutshell
• Hadoop provides a reliable shared storage and analysis system.
• The storage is provided by HDFS.
• The analysis is provided by MapReduce.
Modules of Hadoop
• HDFS (Hadoop Distributed File System)
HDFS Definition
• The Hadoop Distributed File System (HDFS) is a distributed file system designed
to run on commodity hardware.
• HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework.
• It has many similarities with existing distributed file systems.
• Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
MapReduce
• MapReduce is the processing layer in Hadoop. It is a software framework for writing applications that process large volumes of data in parallel across a cluster.
Features of Hadoop
• Hadoop is open source.
Components of HDFS
• There are three major components of HDFS, as follows:
DataNode
• These are the nodes that store the actual data. HDFS stores data in a distributed manner: it divides input files of varied formats into blocks, and the DataNodes store each of these blocks. The functions of DataNodes are:
• On startup, a DataNode performs a handshake with the NameNode, which verifies the namespace ID and the software version of the DataNode.
• It also sends a block report to the NameNode and verifies the block replicas.
• It sends a heartbeat to the NameNode every 3 seconds to indicate that it is alive.
NameNode
• The NameNode is the master node. It is responsible for managing the file system namespace and controlling clients' access to files. It also executes tasks such as opening, closing and naming files and directories. The NameNode has two major files – the FSImage and the Edits log.
• FSImage – a point-in-time snapshot of HDFS's metadata. It contains information such as file permissions, disk quotas, modification timestamps, access times, etc.
• Edits log – contains modifications to the FSImage. It records incremental changes such as renaming a file or appending data to a file.
Secondary NameNode
• If the NameNode has not been restarted for months, the Edits log grows very large.
• This, in turn, increases the downtime of the cluster when the NameNode does restart. This is where the Secondary NameNode comes into the picture.
• The Secondary NameNode applies the Edits log to the FSImage at regular intervals and updates the new FSImage on the primary NameNode.
Developing MapReduce Applications
Problem
• A number of documents, each with a set of terms.
• Need to calculate the number of occurrences of each term (word count), or some arbitrary function over the terms (e.g. average response time in log files).
Solution
• Map: For each term, emit the term and "1".
• Reduce: Take the sum (or any other operation) of each term's values, as sketched below.
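A minimal sketch of this solution in Java, using the Hadoop org.apache.hadoop.mapreduce API (the class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each term in the input line, emit (term, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // emit the term and "1"
        }
    }
}

// Reduce: take the sum of the values received for each term.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));      // final (term, count) pair
    }
}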
Collating
Problem
• A number of documents with a set of terms, and some function of one item.
• Need to group all items that have the same value of the function, to either store the items together or perform some computation over them.
Solution
• Map: For each item, compute the given function and emit the function value as the key and the item as the value.
• Reduce: Either save all grouped items or perform some further computation.
• Example: Inverted index – items are words and the function is the document ID (sketched below).
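A sketch of the inverted-index case, assuming one input file per document so that the file name can serve as the document ID (class names are illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: emit (word, documentId) for every word in the document.
public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: one file per document, so the file name is the document ID.
        docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, docId);
        }
    }
}

// Reduce: collect the set of document IDs in which each word occurs.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> docs = new HashSet<>();
        for (Text v : values) {
            docs.add(v.toString());
        }
        context.write(key, new Text(String.join(",", docs)));
    }
}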
Problem
• A set of records
• Need to collect all records that meet some condition or transform each
record into another representation
Solution
• Map: For each record, emit it if it passes the condition, or emit its transformed version.
• Reduce: Identity.
• Example: Text parsing or transformation, such as word capitalization (a grep-style filtering sketch follows below).
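A sketch of the filtering case as a map-side "grep"; the pattern string is an illustrative assumption, and with an identity reducer (or zero reduce tasks) the matching records pass straight through to the output:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: emit a record only if it meets the condition.
public class GrepMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private static final String PATTERN = "ERROR";     // illustrative condition

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains(PATTERN)) {
            context.write(NullWritable.get(), value);  // pass the matching record through
        }
    }
}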
Problem
• Large computational problem
• Need to divide it into multiple parts and combine results from all parts to
obtain a final result
Solution
• Map: Perform corresponding computation
• Reduce: Combine all emitted results into a final one
• Example: RGB histogram calculation of bitmap images
Input Files
• The data for a MapReduce task is stored in input files, and input files typically live in HDFS. The format of these files is arbitrary; line-based log files and binary formats can both be used.
InputFormat
• The InputFormat defines how these input files are split and read. It selects the files or other objects that are used for input. The InputFormat creates the InputSplits.
InputSplits
• InputSplits are created by the InputFormat; an InputSplit logically represents the data that will be processed by an individual Mapper (the Mapper is described below).
• One map task is created for each split; thus the number of map tasks equals the number of InputSplits.
• The split is divided into records, and each record is processed by the mapper.
RecordReader
• It communicates with the InputSplit in Hadoop MapReduce and converts the data into key-value pairs suitable for reading by the mapper.
• By default, it uses TextInputFormat for converting the data into key-value pairs.
• The RecordReader communicates with the InputSplit until reading of the file is complete.
• It assigns a byte offset (a unique number) to each line present in the file.
• These key-value pairs are then sent to the mapper for further processing.
Mapper
• It processes each input record (from the RecordReader) and generates new key-value pairs; the key-value pairs generated by the Mapper can be completely different from the input pair.
• The output of the Mapper is also known as the intermediate output, and it is written to the local disk.
• The output of the Mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system). The Mapper's output is passed to the combiner for further processing.
Combiner
• The combiner is also known as a 'mini-reducer'. The Hadoop MapReduce combiner performs local aggregation on the mappers' output, which helps to minimize the data transfer between the mapper and the reducer (the reducer is described below).
• Once the combiner has run, its output is passed to the partitioner for further work (a driver-side sketch follows below).
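For the word-count example, the reducer itself can act as the combiner because summation is associative and commutative. A sketch of the relevant line, assuming 'job' is the org.apache.hadoop.mapreduce.Job instance configured in the driver (a full driver sketch appears in a later section):

// Reuse the reducer as a combiner for local aggregation of each mapper's output.
// Hadoop may run the combiner zero, one, or several times per map task, so the job
// must remain correct even if the combiner never runs.
job.setCombinerClass(WordCountReducer.class);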
Partitioner
• In Hadoop MapReduce, the Partitioner comes into the picture when we are working with more than one reducer (with a single reducer the Partitioner is not used).
• The Partitioner takes the output from the combiners and performs partitioning. The output is partitioned on the basis of the key and then sorted; a hash function over the key (or a subset of the key) is used to derive the partition.
• According to its key value, each combiner output record is assigned to a partition; records having the same key go into the same partition, and each partition is then sent to a reducer. Partitioning allows an even distribution of the map output over the reducers (a custom partitioner is sketched below).
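The default HashPartitioner derives the partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A minimal sketch of a custom partitioner (the first-letter scheme is purely illustrative), activated in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys by their first letter so that all words starting with the
// same letter are handled by the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0 || key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return first % numReduceTasks;   // non-negative, since char values are unsigned
    }
}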
Reducer
• It takes the set of intermediate key-value pairs produced by the mappers
as the input and then runs a reducer function on each of them to generate
the output.
• The output of the reducer is the final output, which is stored in HDFS.
RecordWriter
• It writes the output key-value pairs from the Reducer phase to the output files.
OutputFormat
• The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat. The OutputFormat instances provided by Hadoop are used to write files to HDFS or to the local disk.
• Thus, the final output of the reducer is written to HDFS by an OutputFormat instance.
Data Types
• Basically, Hadoop MapReduce needs data types that support both serialization (for efficient reading and writing) and comparability (to sort keys in the sort-and-shuffle phase). For these purposes Hadoop has the WritableComparable<T> interface, which extends Writable (a serializable object that implements a simple, efficient serialization protocol) and Comparable<T>. Some of these implementations are shown below.
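Common built-in implementations include IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, Text (UTF-8 strings) and NullWritable. A minimal sketch of a custom key type implementing WritableComparable (the class and field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Sketch of a custom key type: a (termId, year) pair usable as a MapReduce key.
public class TermYearKey implements WritableComparable<TermYearKey> {
    private long termId;
    private int year;

    public TermYearKey() { }                                   // required no-arg constructor

    public TermYearKey(long termId, int year) {
        this.termId = termId;
        this.year = year;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeLong(termId);
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization
        termId = in.readLong();
        year = in.readInt();
    }

    @Override
    public int compareTo(TermYearKey other) {                   // ordering for sort and shuffle
        int cmp = Long.compare(termId, other.termId);
        return (cmp != 0) ? cmp : Integer.compare(year, other.year);
    }

    @Override
    public int hashCode() {                                      // used by the default HashPartitioner
        return Long.hashCode(termId) * 31 + year;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TermYearKey)) return false;
        TermYearKey k = (TermYearKey) o;
        return termId == k.termId && year == k.year;
    }
}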
● For the job to work properly, you should set the input and output format classes in the job configuration, as sketched below.
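A sketch of a driver using the newer org.apache.hadoop.mapreduce.Job API rather than the older JobConf; the mapper and reducer class names are the illustrative ones from the word-count sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);

        // Explicitly set the input/output formats (TextInputFormat is the default).
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}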
File Formats
● Avro (created within the Hadoop project) – it is record-oriented rather than a key/value format.
● XML – MapReduce has no native XML support. If you want to work with large XML files in MapReduce and be able to split and process them in parallel, you can use Mahout's XmlInputFormat to work with XML files in HDFS. It reads records that are delimited by specific XML begin and end tags.
● JSON – there are two problems: MapReduce doesn't come with an InputFormat that works with JSON, and splitting JSON for parallel processing is not straightforward. If you want to work with JSON inputs in MapReduce, you can use Elephant Bird's LzoJsonInputFormat as a basis for creating an input format class that works with JSON elements.
● Sequence File – created specifically for MapReduce tasks, it is a row-oriented key/value file format that is well supported across Hadoop ecosystem projects (Hive, Pig, etc.).
● Apart from Avro (covered next), the formats mentioned above do not support code generation or schema evolution.
Avro
• Avro is an RPC and data serialization framework developed within the Hadoop project to improve data interchange, interoperability, and versioning in MapReduce. Natively, Avro is not a key/value format but a record-based file format.
● It supports code generation and schema evolution.
● It uses JSON to define schemas.
● There are 3 ways you can use Avro in MapReduce (mixed, record-based, and key/value-based modes).
○ Mixed – for cases where you have non-Avro input and generate Avro output, or vice versa, in which case the Avro mapper and reducer classes aren't suitable; use the AvroWrapper class.
○ Record-based – in this case Avro is used end-to-end. As Avro isn't a key/value format, you should use the specific Mapper (AvroMapper) and Reducer (AvroReducer) classes.
○ Key/Value-based – if you want to use Avro as a native key/value format, use the AvroKeyValue, AvroKey, and AvroValue classes to work with Avro key/value data.
Parquet
• Parquet is a newer columnar storage format (a storage format, not just a file format). Parquet doesn't have its own object model (in-memory representation); instead it has object-model converters (to represent data as Avro, Thrift, Protocol Buffers, Pig, Hive, etc.).
● Parquet physically stores data column by column, instead of row by row (as, e.g., Avro does).
● When you often need projection onto particular columns, or need to compute operations (avg, max, min, etc.) only on specific columns, it is more efficient to store the data in a columnar format.
● It is well supported by many Hadoop projects (Pig, Hive, Spark, Impala, HBase, etc.).
● CSV and similar simple text formats can be handled with the default text input/output by doing the parsing and writing in your map and reduce tasks appropriately. But if you want to write reusable and convenient code for a specific file format (e.g. CSV), you need to implement your own InputFormat/RecordReader and OutputFormat/RecordWriter classes, and use these as the input and output classes in your MapReduce job by setting them in the job configuration.
Hadoop Cluster
a. Functions of NameNode
b. Functions of NodeManager
• Client nodes in a Hadoop cluster – we install Hadoop on the client nodes and configure it there.
c. Functions of the client node
• To load the data onto the Hadoop cluster.
• To tell how to process the data by submitting a MapReduce job.
• To collect the output from a specified location.
• In a single-node Hadoop cluster, all the processes run in one JVM instance, and the user need not make any configuration settings.
• The Hadoop user only needs to set the JAVA_HOME variable.
Robustness
• Data disk failures, heartbeats and re-replication
• Cluster rebalancing
• Data integrity