
School of Computing Science and Engineering
Program: B.Tech    Program Code: CSE
Course Code: BCSE4036
Course Name: Cloud Computing Technologies

Course Outcomes:
CO 1: Analyse, design and implement sustainable and ethical solutions in the
field of computer science.
CO 2: Identify the architecture, infrastructure and delivery models of cloud
computing.
CO 4: Choose the appropriate technologies, algorithms and approaches for the
related issues.
CO 5: Understand current research trends in cloud computing.


Syllabus

Unit I: INTRODUCTION 9 lecture hours


Evolution of Cloud Computing - System Models for Distributed and Cloud Computing -
NIST Cloud Computing Reference Architecture - IaaS - On-demand Provisioning -
Elasticity in Cloud - Examples of IaaS Providers - PaaS - Examples of PaaS Providers -
SaaS - Examples of SaaS Providers - Public, Private and Hybrid Clouds - Google App
Engine, Amazon AWS - Cloud Software Environments - Eucalyptus, OpenNebula,
OpenStack, Nimbus

Unit II: VIRTUALIZATION 9 lecture hours


Basics of Virtual Machines - Process Virtual Machines - System Virtual Machines -
Emulation - Interpretation - Binary Translation - Taxonomy of Virtual Machines -
Virtualization - Management Virtualization - Hardware Maximization - Architectures -
Virtualization Management - Storage Virtualization - Network Virtualization

Unit III: VIRTUALIZATION INFRASTRUCTURE 9 lecture hours


Comprehensive Analysis - Resource Pool - Testing Environment - Server
Virtualization - Virtual Workloads - Provision Virtual Machines - Desktop
Virtualization - Application Virtualization - Work with AppV - Mobile OS for
smartphones - Mobile Platform Virtualization - Collaborative Applications for
Mobile platforms

Unit IV: PROGRAMMING MODEL 9 lecture hours

MapReduce - Hadoop Distributed File Systems - Hadoop I/O - Developing
MapReduce Applications - Working of MapReduce - Types and Formats - Setting
up Hadoop Cluster
Unit V: CLOUD INFRASTRUCTURE AND SECURITY 9 lecture hours
Architectural Design of Compute and Storage Clouds - Inter-Cloud Resource
Management - Resource Provisioning and Platform Deployment - Global Exchange of
Cloud Resources - Security Overview - Cloud Security Challenges - Software as a
Service Security - Security Governance - Risk Management - Security Monitoring -
Security Architecture Design - Data Security - Application Security - Virtual Machine
Security

Unit IV - Programming Model

MapReduce - Hadoop Distributed File Systems - Hadoop I/O - Developing
MapReduce Applications - Working of MapReduce - Types and Formats - Setting
Up Hadoop Cluster.


MapReduce and Hadoop Distributed File Systems


MapReduce

• HDFS handles the distributed filesystem layer.

• MapReduce is a programming model for data processing.
• MapReduce:
– is a framework for parallel computing
– gives programmers a simple API
– frees them from worrying about
• parallelization
• data distribution
• load balancing
• fault tolerance


MapReduce
• It is a powerful paradigm for parallel computation.

• MapReduce is a data processing tool used to process data in parallel in a
distributed form. It was introduced in 2004 by the Google paper "MapReduce:
Simplified Data Processing on Large Clusters."
• It allows one to process huge amounts of data (terabytes and petabytes) on
thousands of processors.


Analogy: Counting Fans


Given a cricket stadium, count the number of fans for each player /
team

Traditional way

Smart way

Smarter way?


MapReduce
MapReduce is a paradigm with two phases: the map phase and the reduce phase.


Two phases

• In the map phase, the input is given in the form of key-value pairs, and the
output of the map is fed to the reducer as input.

• The reducer runs only after the mapper is over. The reducer also takes its
input in key-value format, and the output of the reducer is the final output.


Steps in MapReduce

• The mapper reads a block of data and converts it into key-value pairs.

• These key-value pairs become the input to the reducer.

• The reducer receives data tuples from multiple mappers.

• The reducer applies aggregation to these tuples based on the key.

• The final output from the reducer is written to HDFS (see the sketch below).
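As a concrete illustration of these steps, here is a minimal sketch of the classic
word-count job written against the org.apache.hadoop.mapreduce API; the class
names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, not part
of Hadoop itself.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: for every input line, emit (word, 1) for each word in the line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word; the result is written to HDFS.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}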


Usage of MapReduce
• It can be used in various applications such as document clustering,
distributed sorting, and web link-graph reversal.
• It can be used for distributed pattern-based searching.

• MapReduce can also be used in machine learning.

• It was used by Google to regenerate Google's index of the World Wide Web.
• It can be used in multiple computing environments such as multi-cluster,
multi-core, and mobile environments.


Why DFS?

Suppose we need to read 1 TB of data. Compare 1 machine with 10 machines,
where every machine has 4 I/O channels and each channel reads at 100 MB/s.


Read 1 TB of data

                  1 Machine               10 Machines
I/O channels      4                       4 per machine
Channel speed     100 MB/s                100 MB/s
Time to read      45 minutes              4.5 minutes
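As a rough sanity check of these figures: one machine reading through 4 channels
at 100 MB/s each sustains about 400 MB/s, so scanning 1 TB takes roughly
1,000,000 MB / 400 MB/s ≈ 2,500 seconds, on the order of 45 minutes. Ten
machines, each scanning a tenth of the data in parallel, cut this to about 4.5
minutes. This tenfold speed-up from simply adding machines is the motivation
for a distributed file system.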


Hadoop I/O


What is Hadoop?

• Apache Hadoop is a framework that allows for the distributed processing of
large data sets across clusters of commodity computers using a simple
programming model.

• It is open-source data management with scale-out storage and distributed
processing.


In a nutshell
• Hadoop provides a reliable shared storage and analysis system.
• The storage is provided by HDFS.
• The analysis is provided by MapReduce.


Modules of Hadoop
• HDFS (Hadoop Distributed File System)

• It is the storage layer of Hadoop. Files in HDFS are broken into block-sized
chunks. HDFS consists of two types of nodes: the NameNode and DataNodes.
• The NameNode stores metadata about block locations.

• DataNodes store the blocks and send block reports to the NameNode at a
definite time interval.


HDFS Definition

• The Hadoop Distributed File System (HDFS) is a distributed file system designed
to run on commodity hardware.
• HDFS is a distributed, scalable, and portable filesystem written in Java for the
Hadoop framework.
• It has many similarities with existing distributed file systems.
• HDFS is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on
compute nodes throughout a cluster to enable reliable, extremely rapid
computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• HDFS provides high-throughput access to application data and is suitable for
applications that have large data sets.


MapReduce
• MapReduce is the processing layer in Hadoop. It is a software framework for
writing applications that perform distributed processing.

YARN
• YARN is the resource management layer. It is responsible for resource
allocation and job scheduling.



Features of Hadoop
• Hadoop is open source

• Hadoop clusters are highly scalable

• Hadoop provides fault tolerance

• Hadoop provides high availability

• Hadoop is very cost-effective

• Hadoop is fast in data processing

• Hadoop is based on the data-locality concept

• Hadoop provides feasibility

• Hadoop is easy to use

• Hadoop ensures data reliability


Components of HDFS
• There are three major components of Hadoop HDFS: the DataNodes, the
NameNode, and the Secondary NameNode, described in turn below.


DataNode
• These are the nodes which store the actual data. HDFS stores the data in a
distributed manner: it divides input files of varied formats into blocks, and
the DataNodes store each of these blocks. The functions of a DataNode are:
• On startup, the DataNode performs a handshake with the NameNode, which
verifies the namespace ID and the software version of the DataNode.
• It also sends a block report to the NameNode, which verifies the block replicas.
• It sends a heartbeat to the NameNode every 3 seconds to signal that it is alive.


NameNode
• The NameNode is the master node. It is responsible for managing the file
system namespace and controlling clients' access to files. It also executes
tasks such as opening, closing and naming files and directories. The
NameNode maintains two major files - the FSImage and the Edits log.
• FSImage - a point-in-time snapshot of HDFS's metadata. It contains
information such as file permissions, disk quotas, modification timestamps,
access times, etc.
• Edits log - contains modifications relative to the FSImage. It records
incremental changes such as renaming a file or appending data to a file.


Secondary NameNode
• If the NameNode has not restarted for months, the size of the Edits log
grows.
• This, in turn, increases the downtime of the cluster when the NameNode does
restart. In this case, the Secondary NameNode comes into the picture.
• The Secondary NameNode applies the Edits log to the FSImage at regular
intervals and pushes the updated FSImage back to the primary NameNode.


Developing MapReduce Applications


How MapReduce Works?

• A MapReduce application is based on the Hadoop Distributed File System,
HDFS.
• Steps of the MapReduce process (in classic MapReduce v1 terms):
– The client submits the job.
– The JobTracker coordinates the job and splits it into tasks.
– The TaskTrackers run the tasks; this is where the map and reduce phases
execute. (In Hadoop 2.x with YARN, the JobTracker's duties are split between
the ResourceManager and a per-application ApplicationMaster, and
TaskTrackers are replaced by NodeManagers.)


Shuffle and Sort

• These two facilities are the heart of MapReduce and are what make it so
powerful.
• Sort phase: guarantees that the input to every reducer is sorted by key.

• Shuffle phase: transfers the map output to the reducers as input.


Applications


Counting and Summing

Problem
• A number of documents, each with a set of terms
• Need to calculate the number of occurrences of each term (word count) or
some arbitrary function over the terms (e.g. average response time in log
files)
Solution
• Map: for each term, emit the term and "1"
• Reduce: take the sum (or any other aggregate) of each term's values; a driver
sketch for this pattern follows below
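A minimal driver sketch for this pattern, reusing the word-count mapper and
reducer sketched earlier (class and path names are illustrative). Because integer
addition is associative and commutative, the same reducer class can also be
registered as a combiner to pre-aggregate map output locally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}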


Collating
Problem
• A number of documents with a set of terms, and some function of one item
• Need to group all items that have the same value of the function, to either
store the items together or perform some computation over them
Solution
• Map: for each item, compute the given function and emit the function value
as the key and the item as the value
• Reduce: either save all grouped items or perform further computation
• Example: inverted index - items are words and the function is the document
ID (see the sketch below)
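A sketch of that inverted-index example; it assumes the input comes from
FileInputFormat, so the source file name (taken from the FileSplit) can serve as
the document ID. All class names are illustrative.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  // Map: emit (word, documentId) for every word occurrence.
  public static class IndexMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Use the source file name as the document ID.
      docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, docId);
      }
    }
  }

  // Reduce: group the document IDs sharing the same word into one posting list.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Set<String> docs = new HashSet<>(); // deduplicate repeated occurrences
      for (Text v : values) {
        docs.add(v.toString());
      }
      context.write(key, new Text(String.join(", ", docs)));
    }
  }
}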


Filtering, Parsing, and Validation

Problem
• A set of records
• Need to collect all records that meet some condition, or transform each
record into another representation

Solution
• Map: for each record, emit it if it passes the condition, or emit its transformed
version
• Reduce: identity
• Example: text parsing or transformation such as word capitalization (a
map-only sketch follows below)
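A minimal map-only sketch of the filtering pattern; the condition shown
(keeping lines that contain "ERROR") is purely illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit the record unchanged only if it meets the condition.
    if (value.toString().contains("ERROR")) {
      context.write(value, NullWritable.get());
    }
  }
}

In the driver, job.setNumReduceTasks(0) skips the reduce phase entirely, which
has the same effect as an identity reducer but avoids the shuffle.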


Distributed Task Execution

Problem
• A large computational problem
• Need to divide it into multiple parts and combine the results from all parts to
obtain a final result

Solution
• Map: perform the corresponding computation on each part
• Reduce: combine all emitted results into a final one
• Example: RGB histogram calculation of bitmap images


How Hadoop MapReduce Works


Data Flow in MapReduce

MapReduce is used to compute huge amounts of data. To handle the incoming
data in a parallel and distributed form, the data has to flow through the various
phases described below.


Input Files
• The data for a MapReduce task is stored in input files, and input files typically
live in HDFS. The format of these files is arbitrary; line-based log files and
binary formats can also be used.

InputFormat
• The InputFormat defines how these input files are split and read. It selects the
files or other objects that are used for input, and it creates the InputSplits.


InputSplits
• An InputSplit is created by the InputFormat and logically represents the data
that will be processed by an individual mapper (the mapper is described
below).
• One map task is created for each split; thus the number of map tasks equals
the number of InputSplits (a worked example follows).
• Each split is divided into records, and each record is processed by the
mapper.
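To make the map-task count concrete: with HDFS's common default block size
of 128 MB (and the split size left equal to the block size), a 1 GB input file yields
1024 MB / 128 MB = 8 InputSplits, so the job launches 8 map tasks.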


RecordReader
• It communicates with the InputSplit in Hadoop MapReduce and converts the
data into key-value pairs suitable for reading by the mapper.

• By default, TextInputFormat is used for converting the data into key-value
pairs.

• The RecordReader communicates with the InputSplit until the reading of the
file is completed.

• It assigns a byte offset (a unique number) to each line present in the file.

• These key-value pairs are then sent to the mapper for further processing.


Mapper
• It processes each input record (from the RecordReader) and generates a new
key-value pair; the key-value pair generated by the mapper can be completely
different from the input pair.
• The output of the mapper is also known as the intermediate output, and it is
written to the local disk.
• The output of the mapper is not stored on HDFS, as this is temporary data and
writing it to HDFS would create unnecessary copies (HDFS is also a
high-latency system). The mapper's output is passed to the combiner for
further processing.


Combiner
• The combiner is also known as a 'mini-reducer'. The Hadoop MapReduce
combiner performs local aggregation on the mappers' output, which helps to
minimize the data transfer between mapper and reducer (the reducer is
described below).
• Once the combiner has executed, its output is passed to the partitioner for
further work.


Partitioner
• In Hadoop MapReduce, the partitioner comes into the picture when we are
working with more than one reducer (for a single reducer the partitioner is
not used).
• The partitioner takes the output from the combiners and performs
partitioning. The output is partitioned on the basis of the key and then sorted.
A hash function over the key (or a subset of the key) is used to derive the
partition, as the sketch below shows.
• Each combiner output record is partitioned according to its key value; records
having the same key value go into the same partition, and each partition is
then sent to a reducer. Partitioning spreads the map output evenly over the
reducers.
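A sketch of that hash-based scheme, mirroring the logic of Hadoop's default
HashPartitioner (the class name here is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashingPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is a valid, non-negative index;
    // equal keys always hash to the same partition, hence the same reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered in the driver with
job.setPartitionerClass(HashingPartitioner.class) and only takes effect when the
job runs with more than one reducer.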


Shuffling and Sorting

• The output is now shuffled to the reduce node (a normal slave node, called
the reducer node because the reduce phase runs there).
• Shuffling is the physical movement of the data, which is done over the
network.
• Once all the mappers have finished and their output has been shuffled to the
reducer nodes, this intermediate output is merged and sorted, and then
provided as input to the reduce phase.


Reducer
• It takes the set of intermediate key-value pairs produced by the mappers
as the input and then runs a reducer function on each of them to generate
the output.
• The output of the reducer is the final output, which is stored in HDFS.


RecordWriter
• It writes the output key-value pairs from the reducer phase to the output
files.
OutputFormat
• The way these output key-value pairs are written to the output files by the
RecordWriter is determined by the OutputFormat. OutputFormat instances
provided by Hadoop are used to write files in HDFS or on the local disk.
• Thus the final output of the reducer is written to HDFS by an OutputFormat
instance, as in the fragment below.
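A brief driver fragment showing that wiring. TextOutputFormat is Hadoop's
default OutputFormat and writes one key<TAB>value line per record; the output
path used here is illustrative and must not already exist when the job starts.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputSetup {
  // Configure how and where the reducer's final key-value pairs are written.
  static void configureOutput(Job job) {
    job.setOutputFormatClass(TextOutputFormat.class);        // key<TAB>value lines
    FileOutputFormat.setOutputPath(job, new Path("/user/output/run1"));
  }
}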


In conclusion, the data flow in MapReduce is the combination of different
processing phases: input files, InputFormat, InputSplits, RecordReader, mapper,
combiner, partitioner, shuffling and sorting, reducer, RecordWriter, and
OutputFormat. All of these components play an important role in how Hadoop
MapReduce works.


MapReduce Data Types and Formats


Data Types
• Hadoop MapReduce needs data types that support both serialization (for
efficient reads and writes) and comparability (to sort keys within the sort and
shuffle phase). For that purpose Hadoop has the WritableComparable<T>
interface, which extends Writable (a serializable object implementing a
simple, efficient serialization protocol) and Comparable<T>. Some of these
implementations are listed below, and a sketch of a custom implementation
follows.

● Data types: ByteWritable, IntWritable, LongWritable, FloatWritable,
DoubleWritable, Text, BooleanWritable, VIntWritable, VLongWritable

● Data structures: BytesWritable, ArrayWritable, MapWritable and
SortedMapWritable (both extend AbstractMapWritable, which implements
Configurable).
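A sketch of a custom key type implementing WritableComparable, as anything
used as a MapReduce key must: write()/readFields() provide the serialization,
and compareTo() provides the ordering used during sort and shuffle. The type
and field names are illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTempPair implements WritableComparable<YearTempPair> {
  private int year;
  private int temperature;

  public YearTempPair() { } // no-arg constructor required for deserialization

  public YearTempPair(int year, int temperature) {
    this.year = year;
    this.temperature = temperature;
  }

  @Override
  public void write(DataOutput out) throws IOException { // serialization
    out.writeInt(year);
    out.writeInt(temperature);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialization
    year = in.readInt();
    temperature = in.readInt();
  }

  @Override
  public int compareTo(YearTempPair other) { // ordering used in the sort phase
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
  }
}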

In/Out Format Class Hierarchy

● InputFormat<K,V> and RecordReader<K,V> are used to read records
separated by a delimiter and provide them directly to your map() function.
You do not need to use their functionality directly, as the MapReduce
framework takes care of all of this for you. Every map() call gets a single
record, and your task is to handle that single record and nothing else.

● For the reduce() function there are similar classes and interfaces
(OutputFormat<K,V>, RecordWriter<K,V>, FileOutputFormat<K,V>).

● For things to work properly, you should set the input and output format
classes in the job configuration.


In/Out Format Class Hierarchy

class CustomInputFormat extends FileInputFormat<K, V> {
  public List<InputSplit> getSplits(JobContext context) {
    // read and return the list of splits (one per HDFS block by default) for the given DFS file
  }
  public RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) {
    // return an instance of CustomRecordReader for each split and set its record delimiter
  }
}

// In the real API, FileInputFormat is an abstract class extending InputFormat,
// which declares the two methods implemented above:
abstract class FileInputFormat<K, V> extends InputFormat<K, V> { ... }

abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context);
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                                                        TaskAttemptContext context);
}


In/Out Format Class Hierarchy

class CustomRecordReader extends RecordReader<K, V> {
  public void initialize(InputSplit split, TaskAttemptContext context) {
    // find and open the DFS file for this split, seek to the split's start,
    // and create a LineReader to read the data records
  }
  public boolean nextKeyValue() {
    // physically read the next record from the current position, respecting the delimiter
  }
  public K getCurrentKey() { ... }   // getter for the current key
  public V getCurrentValue() { ... } // getter for the current value
}

// The abstract base class that declares these methods:
abstract class RecordReader<K, V> {
  public abstract void initialize(InputSplit split, TaskAttemptContext context);
  public abstract boolean nextKeyValue();
  public abstract K getCurrentKey();
  public abstract V getCurrentValue();
}


File Formats

● XML (one of the most common file formats)

● JSON (as widespread as XML, but richer)

● SequenceFile (the native MapReduce key/value format)

● Avro (a Hadoop project; record-oriented rather than a key/value format)

● Parquet (a column-oriented data storage format)

● Thrift (not used directly)

● Protocol Buffers (MapReduce uses Elephant Bird to read these)

● Other custom formats (e.g. CSV)


XML, JSON and SequenceFile

● XML - MapReduce has no native XML support. If you want to work with large
XML files in MapReduce and be able to split and process them in parallel, you
can use Mahout's XmlInputFormat to work with XML files in HDFS; it reads
records delimited by specific XML begin and end tags.
● JSON - there are two problems: MapReduce doesn't come with an
InputFormat that works with JSON, and how does one even go about splitting
JSON? If you want to work with JSON inputs in MapReduce, Elephant Bird's
LzoJsonInputFormat can be used as a basis for an input format class that
works with JSON elements.
● SequenceFile - created specifically for MapReduce tasks; it is a row-oriented
key/value file format that is well supported across Hadoop ecosystem
projects (Hive, Pig, etc.).

● None of the formats mentioned above support code generation or schema
evolution.


Avro

• Avro is an RPC and data serialization framework developed within the Hadoop
project to improve data interchange, interoperability, and versioning in
MapReduce. Natively, Avro is a record-based rather than a key/value file format.
● It supports code generation and schema evolution.
● It uses JSON to define schemas.
● There are 3 ways you can use Avro in MapReduce (mixed, record-based, and
key/value-based modes):

○ Mixed - for cases where you have non-Avro input and generate Avro output,
or vice versa, in which case the Avro mapper and reducer classes aren't
suitable. Use the AvroWrapper class.
○ Record-based - here Avro is used end-to-end. As Avro isn't a key/value
format, you should use the specific mapper (AvroMapper) and reducer
(AvroReducer) classes.
○ Key/value-based - if you want to use Avro as a native key/value format, use
the AvroKeyValue, AvroKey, and AvroValue classes to work with Avro
key/value data.


Parquet
Parquet is a columnar storage format (a storage format, not merely a file
format). Parquet doesn't have its own object model (in-memory representation);
instead it has object model converters (to represent data as Avro, Thrift,
Protocol Buffers, Pig, Hive, etc.).

● Parquet physically stores data column by column instead of row by row (as,
e.g., Avro does). For that reason it is called columnar storage.

● When you often need projection by columns, or you need to perform
operations (avg, max, min, etc.) only on specific columns, it is more efficient to
store the data in a columnar format, because accessing it becomes faster than
with row storage.

● It supports schema evolution but does not support code generation.

● You can use AvroParquetInputFormat and AvroParquetOutputFormat in
MapReduce, and the AvroParquetWriter and AvroParquetReader classes in a
simple Java application.

● It is well supported by many Hadoop ecosystem projects (Pig, Hive, Spark,
Impala, HBase, etc.).

Custom Formats (CSV)

• You can read custom file formats with the TextInputFormat class and
implement the reading and writing in your map and reduce tasks as
appropriate. But if you want reusable and convenient code for a specific file
format (e.g. CSV), you need to implement your own:

● CSVInputFormat, which extends FileInputFormat

● CSVRecordReader, which extends RecordReader

● CSVOutputFormat, which extends TextOutputFormat

● CSVRecordWriter, which extends RecordWriter

• Then use these as the input and output classes of your MapReduce job by
setting them in the job configuration, as sketched below.
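A minimal sketch of that wiring, assuming the four CSV classes named above
have been implemented by you (they are not classes shipped with Hadoop):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CsvJobSetup {
  static Job newCsvJob(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "csv job");
    job.setInputFormatClass(CSVInputFormat.class);   // your FileInputFormat subclass
    job.setOutputFormatClass(CSVOutputFormat.class); // your TextOutputFormat subclass
    return job;
  }
}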


Hadoop Cluster


What is a Hadoop Cluster?

• A Hadoop cluster is a group of computers connected together via a LAN.

• We use it for storing and processing large data sets. Hadoop clusters consist
of a number of commodity machines connected together.
• These communicate with a high-end machine which acts as the master.

• The master and slaves implement distributed computing over distributed
data storage.
• The cluster runs open source software to provide this distributed
functionality.


What is the Basic Architecture of a Hadoop Cluster?

• A Hadoop cluster has a master-slave architecture.

Master in the Hadoop Cluster

• The master is a machine with a good configuration of memory and CPU. Two
daemons run on the master: the NameNode and the ResourceManager.


Functions of the NameNode

• Manages the file system namespace

• Regulates clients' access to files
• Stores metadata about the actual data, for example the file path, the number
of blocks, the block IDs, the locations of blocks, etc.
• Executes file system namespace operations such as opening, closing, and
renaming files and directories


Functions of the ResourceManager

• It arbitrates resources among competing nodes
• It keeps track of live and dead nodes


Slaves in the Hadoop Cluster

• A slave is a machine with a normal configuration. Two daemons run on the
slave machines: the DataNode and the NodeManager.
a. Functions of the DataNode
• It stores the business data
• It performs read, write and data processing operations
• Upon instruction from the master, it performs creation, deletion, and
replication of data blocks
b. Functions of the NodeManager
• It runs services on the node to check its health and reports the same to the
ResourceManager


Client Nodes in the Hadoop Cluster
• We install Hadoop on client nodes and configure it there.
c. Functions of the client node
• To load the data onto the Hadoop cluster
• To tell how to process the data by submitting a MapReduce job
• To collect the output from a specified location


Single-Node Cluster vs. Multi-Node Cluster

• In a single-node Hadoop cluster, all the processes run in one JVM instance,
and the user need not make any configuration settings.
• The Hadoop user only needs to set the JAVA_HOME variable, as sketched
below.
• The default replication factor for a single-node Hadoop cluster is one.
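A minimal sketch of that one setting, assuming a typical Linux OpenJDK install
path (adjust the path for your system); it goes into etc/hadoop/hadoop-env.sh
under the Hadoop installation directory:

# etc/hadoop/hadoop-env.sh -- the one mandatory setting for a single-node cluster
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64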


Single-Node Cluster vs. Multi-Node Cluster

• A multi-node Hadoop cluster has a master-slave architecture, in which the
NameNode daemon runs on the master machine and the DataNode daemon
runs on the slave machines.
• In a multi-node Hadoop cluster, slave daemons like the DataNode and
NodeManager run on cheap machines.
• On the other hand, master daemons like the NameNode and
ResourceManager run on powerful servers. In a multi-node Hadoop cluster,
slave machines can be present in any location irrespective of the physical
location of the master server.


Communication Protocols Used in Hadoop Clusters

• The HDFS communication protocols work on top of TCP/IP.

• The client establishes a connection with the NameNode using a configurable
TCP port.
• The Hadoop cluster connects to the client using the Client Protocol, and the
DataNodes talk to the NameNode using the DataNode Protocol.
• A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol
and the DataNode Protocol.
• The NameNode never initiates an RPC; it only responds to RPCs issued by the
DataNodes and clients.


How to Build a Cluster in Hadoop

• For choosing the right hardware, one must consider the following points:
• Understand the kind of workloads the cluster will be dealing with, the volume
of data the cluster needs to handle, and the kind of processing required (CPU
bound, I/O bound, etc.).
• The data storage methodology, such as any data compression technique
used.
• The data retention policy, i.e. how frequently we need to flush data.
– Sizing the Hadoop cluster
– Configuring the Hadoop cluster

Hadoop Cluster Management

A good cluster management tool should have the following features:

• It should provide diverse workload management, security, resource
provisioning, performance optimization and health monitoring. It also needs
to provide policy management, job scheduling, and backup and recovery
across one or more nodes.
• Implement NameNode high availability with load balancing, auto-failover,
and hot standbys.
• Enable policy-based controls that prevent any application from consuming
more resources than others.
• Manage the deployment of any layers of software over Hadoop clusters by
performing regression testing, to make sure that jobs and data won't crash or
encounter bottlenecks in daily operations.


Benefits of Hadoop Clusters

Here is a list of benefits provided by clusters in Hadoop:

• Robustness
• Handling of data disk failures, heartbeats and re-replication

• Cluster rebalancing
• Data integrity

• Tolerance of metadata disk failure

• Snapshots
