Program: B.TECH
Course Code: BCSE4036
Course Name: Cloud Computing Technologies
School of Computing Science and Engineering
Course Outcomes:
CO 1: Able to analyse, design and implement sustainable and ethical solutions in the field of computer science.
Syllabus
MapReduce – Hadoop Distributed File System – Hadoop I/O – Developing MapReduce Applications – Hadoop Cluster.
MapReduce
• It is a powerful paradigm for parallel computation.
Traditional way → Smart way → Smarter way?
MapReduce
MapReduce is a paradigm with two phases: the map phase and the reduce phase.
In the map phase, the input is given in the form of key-value pairs, and the output of the map is fed to the reducer as input. The reducer runs only after all mappers have finished. The reducer also takes its input in key-value format, and the output of the reducer is the final output.
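As a small worked illustration of the two phases (a hypothetical two-line word-count input, not taken from the slides):

Input lines:         "cloud computing", "cloud storage"
Map output:          (cloud, 1), (computing, 1), (cloud, 1), (storage, 1)
Grouped for reduce:  (cloud, [1, 1]), (computing, [1]), (storage, [1])
Reduce output:       (cloud, 2), (computing, 1), (storage, 1)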
Usage of MapReduce
• It can be used in various applications such as document clustering, distributed sorting, and web link-graph reversal.
• It can be used for distributed pattern-based searching.
Why DFS?
1 Machine: 4 I/O channels, each channel 100 MB/s – read 1 TB of data.
10 Machines: 4 I/O channels each, each channel 100 MB/s – read the same 1 TB of data in parallel.
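To make the comparison concrete (a rough calculation, taking 1 TB ≈ 1,000,000 MB): one machine with 4 channels at 100 MB/s has an aggregate read bandwidth of 400 MB/s, so scanning 1 TB takes about 1,000,000 MB ÷ 400 MB/s = 2,500 seconds, roughly 42 minutes. Ten such machines, each reading one tenth of the data in parallel, finish in about 250 seconds, roughly 4 minutes. This tenfold speed-up from parallel I/O is the motivation for a distributed file system.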
Hadoop I/O
What is Hadoop?
In a nutshell
• Hadoop provides a reliable shared storage and analysis system.
• The storage is provided by HDFS.
• The analysis is provided by MapReduce.
Modules of Hadoop
• HDFS (Hadoop Distributed File System)
HDFS Definition
• The Hadoop Distributed File System (HDFS) is a distributed file system designed
to run on commodity hardware.
• HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework.
• It has many similarities with existing distributed file systems.
• Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
MapReduce
• MapReduce is the processing layer in Hadoop. It is a software framework for writing applications that process large volumes of data in parallel across a cluster.
Features of Hadoop
• Hadoop is open source.
Components of HDFS
• There are three major components of HDFS, as follows:
DataNode
• These are the nodes that store the actual data. HDFS stores data in a distributed manner: it divides input files of varied formats into blocks, and the DataNodes store each of these blocks. The functions of DataNodes are:
• On startup, a DataNode performs a handshake with the NameNode, which verifies the namespace ID and the software version of the DataNode.
• It also sends a block report to the NameNode and verifies the block replicas.
• It sends a heartbeat to the NameNode every 3 seconds to indicate that it is alive.
NameNode
• The NameNode is the master node. It is responsible for managing the file system namespace and controlling clients' access to files. It also executes tasks such as opening, closing and naming files and directories. The NameNode has two major files – the FSImage and the Edits log.
• FSImage – a point-in-time snapshot of HDFS's metadata. It contains information such as file permissions, disk quotas, modification timestamps, access times, etc.
• Edits log – contains modifications to the FSImage. It records incremental changes such as renaming a file or appending data to a file.
Secondary NameNode
• If the NameNode has not been restarted for months, the Edits log grows very large.
• This, in turn, increases the downtime of the cluster when the NameNode does restart. This is where the Secondary NameNode comes into the picture.
• The Secondary NameNode applies the Edits log to the FSImage at regular intervals and updates the new FSImage on the primary NameNode.
Developing MapReduce Applications
Problem
• A number of documents, each with a set of terms.
• Need to calculate the number of occurrences of each term (word count), or some arbitrary function over the terms (e.g. average response time in log files).
Solution
• Map: For each term, emit the term and "1".
• Reduce: Take the sum (or any other operation) of each term's values, as sketched below.
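A minimal sketch of this solution in Java, using the Hadoop org.apache.hadoop.mapreduce API (the class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each term in the input line, emit (term, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // emit the term and "1"
        }
    }
}

// Reduce: take the sum of the values received for each term.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));      // final (term, count) pair
    }
}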
Collating
Problem
• A number of documents with a set of terms, and some function of one item.
• Need to group all items that have the same value of the function, to either store the items together or perform some computation over them.
Solution
• Map: For each item, compute the given function and emit the function value as the key and the item as the value.
• Reduce: Either save all grouped items or perform some further computation.
• Example: Inverted index – items are words and the function is the document ID (sketched below).
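A sketch of the inverted-index case, assuming one input file per document so that the file name can serve as the document ID (class names are illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: emit (word, documentId) for every word in the document.
public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: one file per document, so the file name is the document ID.
        docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, docId);
        }
    }
}

// Reduce: collect the set of document IDs in which each word occurs.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> docs = new HashSet<>();
        for (Text v : values) {
            docs.add(v.toString());
        }
        context.write(key, new Text(String.join(",", docs)));
    }
}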
Problem
• A set of records
• Need to collect all records that meet some condition or transform each
record into another representation
Solution
• Map: For each record, emit it if it passes the condition, or emit its transformed version.
• Reduce: Identity.
• Example: Text parsing or transformation, such as word capitalization (a grep-style filtering sketch follows below).
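A sketch of the filtering case as a map-side "grep"; the pattern string is an illustrative assumption, and with an identity reducer (or zero reduce tasks) the matching records pass straight through to the output:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: emit a record only if it meets the condition.
public class GrepMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private static final String PATTERN = "ERROR";     // illustrative condition

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains(PATTERN)) {
            context.write(NullWritable.get(), value);  // pass the matching record through
        }
    }
}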
Problem
• Large computational problem
• Need to divide it into multiple parts and combine results from all parts to
obtain a final result
Solution
• Map: Perform corresponding computation
• Reduce: Combine all emitted results into a final one
• Example: RGB histogram calculation of bitmap images
Input Files
• The data for a MapReduce task is stored in input files, and input files typically live in HDFS. The format of these files is arbitrary; line-based log files and binary formats can both be used.
InputFormat
• The InputFormat defines how these input files are split and read. It selects the files or other objects that are used for input. The InputFormat creates the InputSplits.
InputSplits
• InputSplits are created by the InputFormat; an InputSplit logically represents the data that will be processed by an individual Mapper (the Mapper is described below).
• One map task is created for each split; thus the number of map tasks equals the number of InputSplits.
• The split is divided into records, and each record is processed by the mapper.
RecordReader
• It communicates with the InputSplit in Hadoop MapReduce and converts the data into key-value pairs suitable for reading by the mapper.
• By default, it uses TextInputFormat for converting the data into key-value pairs.
• The RecordReader communicates with the InputSplit until reading of the file is complete.
• It assigns a byte offset (a unique number) to each line present in the file.
• These key-value pairs are then sent to the mapper for further processing.
Mapper
• It processes each input record (from the RecordReader) and generates new key-value pairs; the key-value pairs generated by the Mapper can be completely different from the input pair.
• The output of the Mapper is also known as the intermediate output, and it is written to the local disk.
• The output of the Mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system). The Mapper's output is passed to the combiner for further processing.
Combiner
• The combiner is also known as a 'mini-reducer'. The Hadoop MapReduce combiner performs local aggregation on the mappers' output, which helps to minimize the data transfer between the mapper and the reducer (the reducer is described below).
• Once the combiner has run, its output is passed to the partitioner for further work (a driver-side sketch follows below).
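For the word-count example, the reducer itself can act as the combiner because summation is associative and commutative. A sketch of the relevant line, assuming 'job' is the org.apache.hadoop.mapreduce.Job instance configured in the driver (a full driver sketch appears in a later section):

// Reuse the reducer as a combiner for local aggregation of each mapper's output.
// Hadoop may run the combiner zero, one, or several times per map task, so the job
// must remain correct even if the combiner never runs.
job.setCombinerClass(WordCountReducer.class);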
Partitioner
• In Hadoop MapReduce, the Partitioner comes into the picture when we are working with more than one reducer (with a single reducer the Partitioner is not used).
• The Partitioner takes the output from the combiners and performs partitioning. The output is partitioned on the basis of the key and then sorted; a hash function over the key (or a subset of the key) is used to derive the partition.
• According to its key value, each combiner output record is assigned to a partition; records having the same key go into the same partition, and each partition is then sent to a reducer. Partitioning allows an even distribution of the map output over the reducers (a custom partitioner is sketched below).
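The default HashPartitioner derives the partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A minimal sketch of a custom partitioner (the first-letter scheme is purely illustrative), activated in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys by their first letter so that all words starting with the
// same letter are handled by the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0 || key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return first % numReduceTasks;   // non-negative, since char values are unsigned
    }
}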
Reducer
• It takes the set of intermediate key-value pairs produced by the mappers
as the input and then runs a reducer function on each of them to generate
the output.
• The output of the reducer is the final output, which is stored in HDFS.
RecordWriter
• It writes the output key-value pairs from the Reducer phase to the output files.
OutputFormat
• The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat. The OutputFormat instances provided by Hadoop are used to write files to HDFS or to the local disk.
• Thus, the final output of the reducer is written to HDFS by an OutputFormat instance.
Data Types
• Basically, Hadoop MapReduce needs data types that support both serialization (for efficient reading and writing) and comparability (to sort keys in the sort-and-shuffle phase). For these purposes Hadoop has the WritableComparable<T> interface, which extends Writable (a serializable object that implements a simple, efficient serialization protocol) and Comparable<T>. Some of these implementations are shown below.
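Common built-in implementations include IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, Text (UTF-8 strings) and NullWritable. A minimal sketch of a custom key type implementing WritableComparable (the class and field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Sketch of a custom key type: a (termId, year) pair usable as a MapReduce key.
public class TermYearKey implements WritableComparable<TermYearKey> {
    private long termId;
    private int year;

    public TermYearKey() { }                                   // required no-arg constructor

    public TermYearKey(long termId, int year) {
        this.termId = termId;
        this.year = year;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeLong(termId);
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization
        termId = in.readLong();
        year = in.readInt();
    }

    @Override
    public int compareTo(TermYearKey other) {                   // ordering for sort and shuffle
        int cmp = Long.compare(termId, other.termId);
        return (cmp != 0) ? cmp : Integer.compare(year, other.year);
    }

    @Override
    public int hashCode() {                                      // used by the default HashPartitioner
        return Long.hashCode(termId) * 31 + year;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TermYearKey)) return false;
        TermYearKey k = (TermYearKey) o;
        return termId == k.termId && year == k.year;
    }
}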
● For the job to work properly, you should set the input and output format classes in the job configuration, as sketched below.
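A sketch of a driver using the newer org.apache.hadoop.mapreduce.Job API rather than the older JobConf; the mapper and reducer class names are the illustrative ones from the word-count sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);

        // Explicitly set the input/output formats (TextInputFormat is the default).
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}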
File Formats
● Avro (created within the Hadoop project) – it is record-oriented rather than a key/value format.
● XML – MapReduce has no native XML support. If you want to work with large XML files in MapReduce and be able to split and process them in parallel, you can use Mahout's XmlInputFormat to work with XML files in HDFS. It reads records that are delimited by specific XML begin and end tags.
● JSON – there are two problems: MapReduce doesn't come with an InputFormat that works with JSON, and splitting JSON for parallel processing is not straightforward. If you want to work with JSON inputs in MapReduce, you can use Elephant Bird's LzoJsonInputFormat as a basis for creating an input format class that works with JSON elements.
● Sequence File – created specifically for MapReduce tasks, it is a row-oriented key/value file format that is well supported across Hadoop ecosystem projects (Hive, Pig, etc.).
● Apart from Avro (covered next), the formats mentioned above do not support code generation or schema evolution.
Avro
• Avro is an RPC and data serialization framework developed within the Hadoop project to improve data interchange, interoperability, and versioning in MapReduce. Natively, Avro is not a key/value format but a record-based file format.
● It supports code generation and schema evolution.
● It uses JSON to define schemas.
● There are 3 ways you can use Avro in MapReduce (mixed, record-based, and key/value-based modes).
○ Mixed – for cases where you have non-Avro input and generate Avro output, or vice versa, in which case the Avro mapper and reducer classes aren't suitable; use the AvroWrapper class.
○ Record-based – in this case Avro is used end-to-end. As Avro isn't a key/value format, you should use the specific Mapper (AvroMapper) and Reducer (AvroReducer) classes.
○ Key/Value-based – if you want to use Avro as a native key/value format, use the AvroKeyValue, AvroKey, and AvroValue classes to work with Avro key/value data.
Parquet
• Parquet is a newer columnar storage format (a storage format, not just a file format). Parquet doesn't have its own object model (in-memory representation); instead it has object-model converters (to represent data as Avro, Thrift, Protocol Buffers, Pig, Hive, etc.).
● Parquet physically stores data column by column, instead of row by row (as, e.g., Avro does).
● When you often need projection onto particular columns, or need to compute operations (avg, max, min, etc.) only on specific columns, it is more efficient to store the data in a columnar format.
● It is well supported by many Hadoop projects (Pig, Hive, Spark, Impala, HBase, etc.).
● CSV and similar simple text formats can be handled with the default text input/output by doing the parsing and writing in your map and reduce tasks appropriately. But if you want to write reusable and convenient code for a specific file format (e.g. CSV), you need to implement your own InputFormat/RecordReader and OutputFormat/RecordWriter classes, and use these as the input and output classes in your MapReduce job by setting them in the job configuration.
Hadoop Cluster
a. Functions of NameNode
b. Functions of NodeManager
• Client nodes in a Hadoop cluster – we install Hadoop on the client nodes and configure it there.
c. Functions of the client node
• To load the data onto the Hadoop cluster.
• To tell how to process the data by submitting a MapReduce job.
• To collect the output from a specified location.
• In a single-node Hadoop cluster, all the processes run in one JVM instance, and the user need not make any configuration settings.
• The Hadoop user only needs to set the JAVA_HOME variable.
Robustness
• Data disk failures, heartbeats and re-replication
• Cluster rebalancing
• Data integrity