
How To Make The Best Use Of Live Sessions

• Please log in 10 minutes before the class starts and check your internet connection to avoid network issues during the LIVE session

• All participants are muted by default to avoid background noise; the instructor will unmute you if required. Please use the “Questions” tab in your webinar tool to interact with the instructor at any point during the class

• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic

• If you want to connect with your Personal Learning Manager (PLM), dial +917618772501

• We have a dedicated support team to assist with all your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772

• Your feedback is much appreciated. Please share feedback after each class to help us enhance your learning experience



Big Data & Hadoop Certification Training



Course Outline
▪ Understanding Big Data and Hadoop
▪ Hadoop Architecture and HDFS
▪ Hadoop MapReduce Framework
▪ Advance MapReduce
▪ Pig
▪ Hive
▪ Advance Hive & HBase
▪ Advance HBase
▪ Processing Distributed Data with Apache Spark
▪ Apache Oozie and Hadoop Project
▪ Kafka Producer
▪ Kafka Consumer
▪ Kafka Operation and Performance Tuning
▪ Kafka Cluster Architectures & Administering Kafka
▪ Kafka Monitoring & Stream Processing
▪ Integration of Kafka with Hadoop and Storm
▪ Integration of Kafka with Spark & Flume
▪ Kafka Project



Module 3: Hadoop MapReduce Framework



Topics
Following are the topics covered in this module:
▪ Where is MapReduce used?
▪ Difference between the traditional way and the MapReduce way
▪ Hadoop 2.x MapReduce Architecture
▪ YARN MR Application Execution Flow
▪ MapReduce Paradigm
▪ MapReduce Job Submission Flow
▪ Combiner and Partitioner in MapReduce



Objectives
At the end of this module, you will be able to:
▪ Analyze different use cases where MapReduce is used

▪ Differentiate between the traditional way and the MapReduce way

▪ Learn about Hadoop 2.x MapReduce architecture and components

▪ Understand the execution flow of a YARN MapReduce application

▪ Implement basic MapReduce concepts

▪ Run a MapReduce program

▪ Understand the Input Splits concept in MapReduce

▪ Understand the MapReduce job submission flow

▪ Implement Combiner and Partitioner in MapReduce



Let’s Revise
▪ Hadoop Cluster Configuration
▪ Data Loading Techniques
▪ Hadoop Cluster Modes

Configuration files:
▪ Core → core-site.xml
▪ HDFS → hdfs-site.xml
▪ YARN → yarn-site.xml
▪ MapReduce → mapred-site.xml


Let’s Revise
Sqoop & Flume – Questions

Data Loading into HDFS:
▪ Using Flume
▪ Using Sqoop
▪ Using Hadoop Copy Commands

Data Analysis:
▪ Using Pig
▪ Using Hive


Annie’s Question
The Secondary NameNode is a hot backup for the NameNode:
» TRUE
» FALSE



Annie’s Answer
Ans. FALSE. The Secondary NameNode (SNN) is the most
misunderstood component of the HDFS architecture. The SNN is not
a hot backup for the NameNode; it performs periodic checkpointing
of the NameNode’s metadata in a Hadoop cluster.



Where is MapReduce Used?

▪ Weather Forecasting — Problem Statement: analyze large volumes of weather data (the NOAA dataset used later in this module), e.g. to find maximum temperatures and hot/cold days.

▪ HealthCare — Problem Statement: de-identify personal health information.



The Traditional Way

[Diagram] “Very Big Data” is split into pieces; each split is passed through grep to produce its matches, and the per-split matches are concatenated (cat) into “All matches”.



MapReduce Way

[Diagram] In the MapReduce framework, “Very Big Data” is split into pieces; MAP tasks process the splits in parallel, and a REDUCE step merges their outputs into “All matches”.



Why MapReduce?

▪ Two biggest advantages:

• Taking processing to the data

• Processing data in parallel

[Diagram] A data center consists of racks of nodes; each map task is scheduled on a node that holds the corresponding HDFS block, so processing moves to the data.



Solving the Problem with MapReduce

1. Sqoop takes the DB dump in CSV format and copies it onto HDFS.
2. The CSV file is stored in HDFS.
3. Map tasks read the CSV file from HDFS and apply the map logic.
4. Reduce tasks apply the reduce logic and write the matches back to HDFS.


Hadoop 2.x MapReduce Architecture

[Diagram] The Client submits a job to the Resource Manager, which runs on the management node alongside the NameNode and the Job History Server. Each DataNode runs a Node Manager that reports node status to the Resource Manager and launches containers on request: one container hosts the Application Master, and the remaining containers run the Map and Reduce tasks. The arrows in the diagram denote Job Submission, Node Status, Resource Requests, and MapReduce Status updates.



Hadoop 2.x MapReduce Components

▪ Client
  • Submits a MapReduce job

▪ Resource Manager
  • Cluster-level resource manager
  • Long life, high-quality hardware

▪ Node Manager
  • One per DataNode
  • Monitors resources on the DataNode

▪ Job History Server
  • Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates

▪ ApplicationMaster
  • One per application
  • Short life
  • Coordinates and manages MapReduce jobs
  • Negotiates with the Resource Manager to schedule tasks; the tasks are started by NodeManager(s)

▪ Container
  • Created by the NM when requested
  • Allocates a certain amount of resources (memory, CPU, etc.) on a slave node



Executing MapReduce Application on YARN



YARN MR Application Execution Flow
▪ MapReduce Job Execution
▪ Job Submission
▪ Job Initialization
▪ Task Assignment
▪ Memory Assignment
▪ Status Updates
▪ Failure Recovery



YARN MR Application Execution Flow

1. Run Job — the Client creates a Job object in the client JVM.
2. Get New Application — the Client requests a new application ID from the Resource Manager on the management node.
3. Copy Job Resources — the Client copies the job resources to HDFS.
4. Submit Application — the Client submits the application to the Resource Manager.
5. Start MR AppMaster container — the Resource Manager asks a Node Manager to start the MR AppMaster container.
6. Create container — the Node Manager creates the container for the MR AppMaster on its Data Node.
7. Get Input Splits — the MR AppMaster retrieves the input splits from HDFS.
8. Request Resources — the MR AppMaster requests containers for the map and reduce tasks from the Resource Manager.
9. Start container — the MR AppMaster asks Node Managers to start the allocated task containers.
10. Create Container — the Node Manager creates a task JVM (YarnChild).
11. Acquire Job Resources — the YarnChild acquires the job resources from HDFS.
12. Execute Map/Reduce Task — the YarnChild executes the Map or Reduce task.

While the job runs, each Map/Reduce task updates its status to the MR AppMaster, and the Client polls the MR AppMaster for job status.



Hadoop 2.x: YARN Workflow

[Diagram] The Resource Manager is composed of the Scheduler and the Applications Manager (AsM). A Node Manager runs on every node of the cluster. Each application gets its own App Master running in a container (App Master 1, App Master 2), and its task containers (Containers 1.1–1.2 for application 1; Containers 2.1–2.3 for application 2) are spread across Node Managers throughout the cluster.



Annie’s Question
Which disadvantage of the Hadoop 1.0 MapReduce framework was
YARN developed to overcome?
» Single Point of Failure of the NameNode
» Only one version can be run in classic MapReduce
» Too much burden on the JobTracker



Annie’s Answer

Ans. Too much burden on the JobTracker.



Annie’s Question
In YARN, the functionality of JobTracker has been replaced by which of
the following YARN features:
» Job Scheduling
» Task Monitoring
» Resource Management
» Node Management



Annie’s Answer
Ans. Task Monitoring and Resource Management. The fundamental idea of YARN
is to split up the two major functionalities of the JobTracker, i.e. resource
management and job scheduling/monitoring, into separate daemons: a global
Resource Manager (RM) for resource management and a per-application
ApplicationMaster (AM) for task monitoring.



Annie’s Question
In YARN, which of the following daemons takes care of the container
and the resource utilization by the applications?
» Node Manager
» JobTracker
» TaskTracker
» ApplicationMaster



Annie’s Answer

ApplicationMaster



Annie’s Question
Can we run MRv1 jobs in a YARN-enabled Hadoop cluster?

» Yes
» No



Annie’s Answer

Yes. MapReduce on YARN ensures full binary compatibility: existing MRv1
applications can run on YARN directly without recompilation.



MapReduce Paradigm



MapReduce Paradigm
The overall MapReduce word count process:

Input (K1,V1): the file "Deer Bear River / Car Car River / Deer Car Bear"

Splitting — each line becomes one split:
  "Deer Bear River" | "Car Car River" | "Deer Car Bear"

Mapping — each mapper emits List(K2,V2), one (word, 1) pair per word:
  (Deer,1) (Bear,1) (River,1) | (Car,1) (Car,1) (River,1) | (Deer,1) (Car,1) (Bear,1)

Shuffling — pairs are grouped by key into (K2, List(V2)):
  Bear,(1,1) | Car,(1,1,1) | Deer,(1,1) | River,(1,1)

Reducing — each reducer sums the values for its key:
  (Bear,2) | (Car,3) | (Deer,2) | (River,2)

Final Result List(K3,V3): Bear,2  Car,3  Deer,2  River,2



Anatomy of a MapReduce Program

Map:    (K1, V1) → List(K2, V2)

Reduce: (K2, List(V2)) → List(K3, V3)

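To make these signatures concrete, below is a minimal WordCount Mapper and Reducer sketch against the Hadoop 2.x org.apache.hadoop.mapreduce API (class names are illustrative; this is not the exact demo code):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1=byte offset, V1=Text line) -> List(K2=Text word, V2=IntWritable 1)
public class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);      // emit (word, 1)
        }
    }
}

// Reduce: (K2=Text word, List(V2)=counts) -> List(K3=Text word, V3=IntWritable sum)
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                // sum the 1s for this word
        }
        result.set(sum);
        context.write(key, result);        // emit (word, total)
    }
}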


Demo of WordCount program

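A driver wires the two classes together and submits the job; a minimal sketch (input/output paths come from the command line, class names as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key/value types of the reducer (and, here, the mapper too)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be run as, e.g., hadoop jar wordcount.jar WordCountDriver /input /output (the jar name is an assumption); note that the output directory must not already exist.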


Annie’s Question
Input to the mapper is in the form of?
» A flat file
» (key, value) pair
» Only string
» All of the above



Annie’s Answer

A Mapper accepts a (key, value) pair as input.



Input Splits

▪ HDFS Blocks — the physical division of the input data.
▪ Input Splits — the logical division of the input data.



Relation Between Input Splits and HDFS Blocks

[Diagram] A file of 11 lines stored across four HDFS blocks; the line boundaries do not align with the block boundaries, so the three input splits cut across blocks.

▪ Logical records do not fit neatly into the HDFS blocks.

▪ Logical records are lines that cross the boundaries of the blocks.

▪ The first split contains line 5, although line 5 spans across blocks (see the sketch below).
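The split size itself is derived from the block size. As a sketch, the rule FileInputFormat applies is splitSize = max(minSize, min(maxSize, blockSize)), where minSize and maxSize come from the mapreduce.input.fileinputformat.split.minsize/maxsize properties:

// Standalone illustration of FileInputFormat-style split sizing.
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block
        // With the default min (1) and max (Long.MAX_VALUE), one split per block:
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}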



MapReduce Job Submission Flow



MapReduce Job Submission Flow

▪ Input data is distributed to nodes.

▪ Each map task works on a “split” of data.

▪ The mapper outputs intermediate data.

▪ The data is copied by the reducer process once it identifies, via the Application Master, the tasks holding the data that the reducer is responsible for.

▪ The shuffle process sorts and merges the data for each key.

▪ The reducer output is stored.



Annie’s Question
Does the MapReduce programming model provide a way for reducers to
communicate with each other?
» Yes, reducers running on the same machine can communicate
with each other through shared memory
» No, each reducer runs independently and in isolation



Annie’s Answer
Ans. No, reducers run independently and in isolation. Individual
tasks do not know the source of their input; reducer tasks rely on the
Hadoop framework to deliver the appropriate input for processing.



Annie’s Question
Who specifies the input split information?
» It is decided randomly by the NameNode
» It is decided randomly by the JobTracker
» Line by line, decided by the Input Splitter
» We have to specify it explicitly



Annie’s Answer

Ans. The client submits the input split information, specifying the
start and end points, via the InputFormat configuration.



Overview of MapReduce
Complete view of MapReduce, illustrating Combiners and Partitioners in addition to
Mappers and Reducers:

▪ Combiners can be viewed as ‘mini-reducers’ in the Map phase.

▪ Partitioners determine which reducer is responsible for a particular key.



Combiner – Local Reduce

▪ Combiners are mini-reducers that perform a “local reduce” on each mapper’s output before the mapper results are distributed.

▪ This passes a smaller workload on to the Reducers.



Combiner

[Diagram] Block 1 contains B C D E D B; its mapper emits (B,1) (C,1) (D,1) (E,1) (D,1) (B,1), which its combiner locally reduces to (B,2) (C,1) (D,2) (E,1). Block 2 contains D A A C B D; its mapper emits (D,1) (A,1) (A,1) (C,1) (B,1) (D,1), which its combiner reduces to (A,2) (B,1) (C,1) (D,2). The shuffle groups the combined outputs into (A,[2]) (B,[2,1]) (C,[1,1]) (D,[2,2]) (E,[1]), and the reducer emits the final counts (A,2) (B,3) (C,2) (D,4) (E,1).

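Enabling a combiner is a one-line addition to the driver. A sketch, reusing the WordCountReducer from earlier (safe here because summing is commutative and associative):

// In the driver, after setting the mapper and reducer classes:
job.setCombinerClass(WordCountReducer.class);  // local reduce on each mapper's output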


Annie’s Question
Combiner works at?
» Mapper Level
» Partitioner Level
» Reducer Level
» All of the above



Annie’s Answer

Ans. Mapper level, as the Combiner works on the output data from the Mapper.



Annie’s Question
Combiner can be considered as:
» Semi Partitioner
» Semi Reducer
» Semi Shuffler
» Major Reducer



Annie’s Answer

Ans. Semi Reducer. The Combiner works on the Mapper output and lessens
the burden on the Reducer.



Partitioner – Redirecting Output from Mapper

[Diagram] Each Map task’s output passes through a Partitioner, which redirects every (key, value) pair to the Reducer responsible for that key.

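Hash partitioning is the default behavior; a minimal custom Partitioner sketch (the class name and partitioning rule are illustrative) looks like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer; this mimics the default hash partitioning.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then bucket by reducer count
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It is registered in the driver with job.setPartitionerClass(WordPartitioner.class), together with job.setNumReduceTasks(n) so there is more than one partition to route to.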


Demo: Combiner and Partitioner



Annie’s Question
Can we use the same logic for the combiner and the reducer?

» No, they are separate entities.

» Yes, but only if the reducer and combiner logic is commutative and
associative and both use the same data types.



Annie’s Answer
Ans. Yes, you can use the same logic if the Reducer and Combiner logic
is commutative and associative and both use the same data types.



Annie’s Question
Can we change the format of the output key class and the output value class?

» TRUE
» FALSE



Annie’s Answer

Ans. TRUE

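In the driver this is just a matter of declaring the types; a sketch for the common case where the mapper's output types differ from the job's final output types (the concrete classes here are placeholders):

// Mapper output types (needed when they differ from the final output types):
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Final (reducer) output types:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);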


HealthCare Dataset



Revisit De-identification Architecture

1. Sqoop takes a DB dump in CSV format and ingests it into HDFS.
2. Map tasks read the CSV file from HDFS and de-identify columns based on the configuration.
3. Reduce tasks store the de-identified CSV file back into HDFS.
DeIdentify MapReduce Code

// Encrypts a column value with AES so that personal health information is de-identified.
// Requires javax.crypto.Cipher, javax.crypto.spec.SecretKeySpec,
// org.apache.commons.codec.binary.Base64, and a logger in the enclosing class.
public static String encrypt(String strToEncrypt, byte[] key)
{
    try
    {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
        cipher.init(Cipher.ENCRYPT_MODE, secretKey);

        // Encrypt the value and Base64-encode the ciphertext so it stays printable in a CSV
        String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));

        return encryptedString.trim();
    }
    catch (Exception e)
    {
        logger.error("Error while encrypting", e);
    }
    return null;
}
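For context, encrypt() would typically be invoked from a mapper's map() method; a hypothetical sketch (the column indices and the hard-coded key are illustrative assumptions, not the actual course code):

// Hypothetical mapper body, assuming Mapper<Object, Text, Text, NullWritable>
// (NullWritable is org.apache.hadoop.io.NullWritable).
@Override
protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");
    int[] sensitiveCols = {2, 5};                   // assumed: name and SSN columns
    byte[] aesKey = "hadoopdeidentify".getBytes();  // placeholder 16-byte AES key
    for (int i : sensitiveCols) {
        if (i < cols.length) {
            cols[i] = encrypt(cols[i], aesKey);     // replace PHI with ciphertext
        }
    }
    context.write(new Text(String.join(",", cols)), NullWritable.get());
}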



Demo of De-identify Program



Weather Data

Dataset Link: ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/
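As a hypothetical shape for a maximum-temperature job over such data (the field positions below are placeholders; the real daily01 column layout must be checked against the dataset's documentation):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: field indices are placeholders, NOT the actual daily01 layout.
public class MaxTempMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().trim().split("\\s+");
        String station = fields[0];                      // assumed: station id
        double maxTemp = Double.parseDouble(fields[6]);  // assumed: daily max temperature
        if (maxTemp > -9999) {                           // skip missing-value sentinels
            context.write(new Text(station), new DoubleWritable(maxTemp));
        }
    }
}

class MaxTempReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double max = Double.NEGATIVE_INFINITY;
        for (DoubleWritable v : values) {
            max = Math.max(max, v.get());   // keep the highest reading per station
        }
        context.write(key, new DoubleWritable(max));
    }
}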


Demo of WeatherData Program



Assignment
Write MapReduce code for WordCount on your own and run it on Edureka’s Cloud Lab

Download all the MapReduce code from the LMS, import it into your Eclipse IDE, and execute it

Try Maximum Temperature problem in MapReduce

Try Hot and Cold day problem in MapReduce



Pre-work
Watch the video “Running MapReduce Program” under Module 3 of your LMS

Attempt the Word Count, Patents, & Alphabets assignments using the items present in the LMS under the
Module 3 tab

Review the Interview Questions for MapReduce

http://www.edureka.in/blog/hadoop-interview-questions-mapreduce/

Review the Next Generation MapReduce (MRv2 or YARN)

http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/

http://www.edureka.in/blog/hadoop-2-0-setting-up-a-single-node-cluster-in-15-minutes/

Setup the CDH4 Hadoop development environment using the documents present in the LMS

http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/



Agenda for Next Class
▪ Map and Reduce Side Join

▪ Counters

▪ Distributed Cache

▪ Custom Input Format

▪ Sequence Input Format

▪ MRUnit



Copyright © 2017, edureka and/or its affiliates. All rights reserved.