
How To Make The Best Use Of Live Sessions

• Please log in 10 minutes before the class starts and check your internet connection to avoid network issues during the LIVE session

• All participants are muted by default to avoid background noise; the instructor will unmute you if required. Please use the “Questions” tab in your webinar tool to interact with the instructor at any point during the class

• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic

• If you want to connect with your Personal Learning Manager (PLM), dial +917618772501

• We have a dedicated support team to assist with all your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772

• Your feedback is much appreciated. Please share feedback after each class to help us enhance your learning experience



Big Data & Hadoop Certification Training



Course Outline
▪ Understanding Big Data and Hadoop
▪ Hadoop Architecture and HDFS
▪ Hadoop MapReduce Framework
▪ Advance MapReduce
▪ Pig
▪ Hive
▪ Advance Hive & HBase
▪ Advance HBase
▪ Processing Distributed Data with Apache Spark
▪ Apache Oozie and Hadoop Project
▪ Kafka Producer
▪ Kafka Consumer
▪ Kafka Operation and Performance Tuning
▪ Kafka Cluster Architectures & Administering Kafka
▪ Kafka Monitoring & Stream Processing
▪ Integration of Kafka with Hadoop and Storm
▪ Integration of Kafka with Spark & Flume
▪ Kafka Project



Module 3: Hadoop MapReduce Framework



Topics
Following are the topics covered in this module:
▪ Where is MapReduce used?
▪ Difference between the traditional way and the MapReduce way
▪ Hadoop 2.x MapReduce Architecture
▪ YARN MR Application Execution Flow
▪ MapReduce Paradigm
▪ MapReduce Job Submission Flow
▪ Combiner and Partitioner in MapReduce



Objectives
At the end of this module, you will be able to:
▪ Analyze different use cases where MapReduce is used

▪ Differentiate between the traditional way and the MapReduce way

▪ Learn about Hadoop 2.x MapReduce architecture and components

▪ Understand the execution flow of a YARN MapReduce application

▪ Implement basic MapReduce concepts

▪ Run a MapReduce program

▪ Understand the Input Splits concept in MapReduce

▪ Understand the MapReduce job submission flow

▪ Implement Combiner and Partitioner in MapReduce



Let’s Revise
▪ Hadoop Cluster Configuration
▪ Data Loading Techniques
▪ Hadoop Cluster Modes

Configuration files:
▪ Core → core-site.xml
▪ HDFS → hdfs-site.xml
▪ YARN → yarn-site.xml
▪ MapReduce → mapred-site.xml


Let’s Revise
Sqoop & Flume – Questions

Data Loading into HDFS:
▪ Using Flume
▪ Using Sqoop
▪ Using Hadoop Copy Commands

Data Analysis:
▪ Using Pig
▪ Using Hive


Annie’s Question
The Secondary NameNode is a hot backup for the NameNode:
» TRUE
» FALSE



Annie’s Answer
Ans. FALSE. The Secondary NameNode (SNN) is the most
misunderstood component of the HDFS architecture. The SNN is not
a hot backup for the NameNode; it performs periodic checkpointing
of the NameNode’s metadata in a Hadoop cluster.



Where is MapReduce Used?

▪ Weather Forecasting — Problem Statement: analyze large volumes of weather data (the NOAA dataset used later in this module), e.g. to find maximum temperatures and hot/cold days.

▪ HealthCare — Problem Statement: de-identify personal health information.



The Traditional Way

[Diagram] “Very Big Data” is split into pieces; each split is passed through grep to produce its matches, and the per-split matches are concatenated (cat) into “All matches”.



MapReduce Way

[Diagram] In the MapReduce framework, “Very Big Data” is split into pieces; MAP tasks process the splits in parallel, and a REDUCE step merges their outputs into “All matches”.



Why MapReduce?

▪ Two biggest advantages:

• Taking processing to the data

• Processing data in parallel

[Diagram] A data center consists of racks of nodes; each map task is scheduled on a node that holds the corresponding HDFS block, so processing moves to the data.



Solving the Problem with MapReduce

1. Sqoop takes the DB dump in CSV format and copies it onto HDFS.
2. The CSV file is stored in HDFS.
3. Map tasks read the CSV file from HDFS and apply the map logic.
4. Reduce tasks apply the reduce logic and write the matches back to HDFS.


Hadoop 2.x MapReduce Architecture

[Diagram] The Client submits a job to the Resource Manager, which runs on the management node alongside the NameNode and the Job History Server. Each DataNode runs a Node Manager that reports node status to the Resource Manager and launches containers on request: one container hosts the Application Master, and the remaining containers run the Map and Reduce tasks. The arrows in the diagram denote Job Submission, Node Status, Resource Requests, and MapReduce Status updates.



Hadoop 2.x MapReduce Components

▪ Client
  • Submits a MapReduce job

▪ Resource Manager
  • Cluster-level resource manager
  • Long life, high-quality hardware

▪ Node Manager
  • One per DataNode
  • Monitors resources on the DataNode

▪ Job History Server
  • Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates

▪ ApplicationMaster
  • One per application
  • Short life
  • Coordinates and manages MapReduce jobs
  • Negotiates with the Resource Manager to schedule tasks; the tasks are started by NodeManager(s)

▪ Container
  • Created by the NM when requested
  • Allocates a certain amount of resources (memory, CPU, etc.) on a slave node



Executing MapReduce Application on YARN



YARN MR Application Execution Flow
▪ MapReduce Job Execution
▪ Job Submission
▪ Job Initialization
▪ Task Assignment
▪ Memory Assignment
▪ Status Updates
▪ Failure Recovery



YARN MR Application Execution Flow

1. Run Job — the Client creates a Job object in the client JVM.
2. Get New Application — the Client requests a new application ID from the Resource Manager on the management node.
3. Copy Job Resources — the Client copies the job resources to HDFS.
4. Submit Application — the Client submits the application to the Resource Manager.
5. Start MR AppMaster container — the Resource Manager asks a Node Manager to start the MR AppMaster container.
6. Create container — the Node Manager creates the container for the MR AppMaster on its Data Node.
7. Get Input Splits — the MR AppMaster retrieves the input splits from HDFS.
8. Request Resources — the MR AppMaster requests containers for the map and reduce tasks from the Resource Manager.
9. Start container — the MR AppMaster asks Node Managers to start the allocated task containers.
10. Create Container — the Node Manager creates a task JVM (YarnChild).
11. Acquire Job Resources — the YarnChild acquires the job resources from HDFS.
12. Execute Map/Reduce Task — the YarnChild executes the Map or Reduce task.

While the job runs, each Map/Reduce task updates its status to the MR AppMaster, and the Client polls the MR AppMaster for job status.



Hadoop 2.x: YARN Workflow

[Diagram] The Resource Manager is composed of the Scheduler and the Applications Manager (AsM). A Node Manager runs on every node of the cluster. Each application gets its own App Master running in a container (App Master 1, App Master 2), and its task containers (Containers 1.1–1.2 for application 1; Containers 2.1–2.3 for application 2) are spread across Node Managers throughout the cluster.



Annie’s Question
Which disadvantage of the Hadoop 1.0 MapReduce framework was
YARN developed to overcome?
» Single Point of Failure of the NameNode
» Only one version can be run in classic MapReduce
» Too much burden on the JobTracker



Annie’s Answer

Ans. Too much burden on the JobTracker.



Annie’s Question
In YARN, the functionality of JobTracker has been replaced by which of
the following YARN features:
» Job Scheduling
» Task Monitoring
» Resource Management
» Node Management



Annie’s Answer
Ans. Task Monitoring and Resource Management. The fundamental idea of YARN
is to split up the two major functionalities of the JobTracker, i.e. resource
management and job scheduling/monitoring, into separate daemons: a global
Resource Manager (RM) for resource management and a per-application
ApplicationMaster (AM) for task monitoring.



Annie’s Question
In YARN, which of the following daemons takes care of the container
and the resource utilization by the applications?
» Node Manager
» JobTracker
» TaskTracker
» ApplicationMaster



Annie’s Answer

ApplicationMaster



Annie’s Question
Can we run MRv1 jobs in a YARN-enabled Hadoop cluster?

» Yes
» No



Annie’s Answer

Yes. MapReduce on YARN ensures full binary compatibility: existing MRv1
applications can run on YARN directly without recompilation.



MapReduce Paradigm



MapReduce Paradigm
The overall MapReduce word count process:

Input (K1,V1): the file "Deer Bear River / Car Car River / Deer Car Bear"

Splitting — each line becomes one split:
  "Deer Bear River" | "Car Car River" | "Deer Car Bear"

Mapping — each mapper emits List(K2,V2), one (word, 1) pair per word:
  (Deer,1) (Bear,1) (River,1) | (Car,1) (Car,1) (River,1) | (Deer,1) (Car,1) (Bear,1)

Shuffling — pairs are grouped by key into (K2, List(V2)):
  Bear,(1,1) | Car,(1,1,1) | Deer,(1,1) | River,(1,1)

Reducing — each reducer sums the values for its key:
  (Bear,2) | (Car,3) | (Deer,2) | (River,2)

Final Result List(K3,V3): Bear,2  Car,3  Deer,2  River,2



Anatomy of a MapReduce Program

Map:    (K1, V1) → List(K2, V2)

Reduce: (K2, List(V2)) → List(K3, V3)

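To make these signatures concrete, below is a minimal WordCount Mapper and Reducer sketch against the Hadoop 2.x org.apache.hadoop.mapreduce API (class names are illustrative; this is not the exact demo code):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1=byte offset, V1=Text line) -> List(K2=Text word, V2=IntWritable 1)
public class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);      // emit (word, 1)
        }
    }
}

// Reduce: (K2=Text word, List(V2)=counts) -> List(K3=Text word, V3=IntWritable sum)
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                // sum the 1s for this word
        }
        result.set(sum);
        context.write(key, result);        // emit (word, total)
    }
}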


Demo of WordCount program

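A driver wires the two classes together and submits the job; a minimal sketch (input/output paths come from the command line, class names as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key/value types of the reducer (and, here, the mapper too)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be run as, e.g., hadoop jar wordcount.jar WordCountDriver /input /output (the jar name is an assumption); note that the output directory must not already exist.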


Annie’s Question
Input to the mapper is in the form of?
» A flat file
» (key, value) pair
» Only string
» All of the above



Annie’s Answer

A Mapper accepts a (key, value) pair as input.



Input Splits

▪ HDFS Blocks — the physical division of the input data.
▪ Input Splits — the logical division of the input data.



Relation Between Input Splits and HDFS Blocks

[Diagram] A file of 11 lines stored across four HDFS blocks; the line boundaries do not align with the block boundaries, so the three input splits cut across blocks.

▪ Logical records do not fit neatly into the HDFS blocks.

▪ Logical records are lines that cross the boundaries of the blocks.

▪ The first split contains line 5, although line 5 spans across blocks (see the sketch below).
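The split size itself is derived from the block size. As a sketch, the rule FileInputFormat applies is splitSize = max(minSize, min(maxSize, blockSize)), where minSize and maxSize come from the mapreduce.input.fileinputformat.split.minsize/maxsize properties:

// Standalone illustration of FileInputFormat-style split sizing.
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block
        // With the default min (1) and max (Long.MAX_VALUE), one split per block:
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}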



MapReduce Job Submission Flow



MapReduce Job Submission Flow

▪ Input data is distributed to nodes.

▪ Each map task works on a “split” of data.

▪ The mapper outputs intermediate data.

▪ The data is copied by the reducer process once it identifies, via the Application Master, the tasks holding the data that the reducer is responsible for.

▪ The shuffle process sorts and merges the data for each key.

▪ The reducer output is stored.



Annie’s Question
Does the MapReduce programming model provide a way for reducers to
communicate with each other?
» Yes, reducers running on the same machine can communicate
with each other through shared memory
» No, each reducer runs independently and in isolation



Annie’s Answer
Ans. No, reducers run independently and in isolation. Individual
tasks do not know the source of their input; reducer tasks rely on the
Hadoop framework to deliver the appropriate input for processing.



Annie’s Question
Who specifies the input split information?
» It is decided randomly by the NameNode
» It is decided randomly by the JobTracker
» Line by line, decided by the Input Splitter
» We have to specify it explicitly



Annie’s Answer

Ans. The client submits the input split information, specifying the
start and end points, via the InputFormat configuration.



Overview of MapReduce
Complete view of MapReduce, illustrating Combiners and Partitioners in addition to
Mappers and Reducers:

▪ Combiners can be viewed as ‘mini-reducers’ in the Map phase.

▪ Partitioners determine which reducer is responsible for a particular key.



Combiner – Local Reduce

▪ Combiners are mini-reducers that perform a “local reduce” on each mapper’s output before the mapper results are distributed.

▪ This passes a smaller workload on to the Reducers.



Combiner

[Diagram] Block 1 contains B C D E D B; its mapper emits (B,1) (C,1) (D,1) (E,1) (D,1) (B,1), which its combiner locally reduces to (B,2) (C,1) (D,2) (E,1). Block 2 contains D A A C B D; its mapper emits (D,1) (A,1) (A,1) (C,1) (B,1) (D,1), which its combiner reduces to (A,2) (B,1) (C,1) (D,2). The shuffle groups the combined outputs into (A,[2]) (B,[2,1]) (C,[1,1]) (D,[2,2]) (E,[1]), and the reducer emits the final counts (A,2) (B,3) (C,2) (D,4) (E,1).

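Enabling a combiner is a one-line addition to the driver. A sketch, reusing the WordCountReducer from earlier (safe here because summing is commutative and associative):

// In the driver, after setting the mapper and reducer classes:
job.setCombinerClass(WordCountReducer.class);  // local reduce on each mapper's output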


Annie’s Question
Combiner works at?
» Mapper Level
» Partitioner Level
» Reducer Level
» All of the above



Annie’s Answer

Ans. Mapper level, as the Combiner works on the output data from the Mapper.



Annie’s Question
Combiner can be considered as:
» Semi Partitioner
» Semi Reducer
» Semi Shuffler
» Major Reducer



Annie’s Answer

Ans. Semi Reducer. The Combiner works on the Mapper output and lessens
the burden on the Reducer.



Partitioner – Redirecting Output from Mapper

[Diagram] Each Map task’s output passes through a Partitioner, which redirects every (key, value) pair to the Reducer responsible for that key.

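Hash partitioning is the default behavior; a minimal custom Partitioner sketch (the class name and partitioning rule are illustrative) looks like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer; this mimics the default hash partitioning.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then bucket by reducer count
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It is registered in the driver with job.setPartitionerClass(WordPartitioner.class), together with job.setNumReduceTasks(n) so there is more than one partition to route to.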


Demo: Combiner and Partitioner



Annie’s Question
Can we use the same logic for the combiner and the reducer?

» No, they are separate entities.

» Yes, but only if the reducer and combiner logic is commutative and
associative and both use the same data types.



Annie’s Answer
Ans. Yes, you can use the same logic if the Reducer and Combiner logic
is commutative and associative and both use the same data types.



Annie’s Question
Can we change the format of the output key class and the output value class?

» TRUE
» FALSE



Annie’s Answer

Ans. TRUE

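In the driver this is just a matter of declaring the types; a sketch for the common case where the mapper's output types differ from the job's final output types (the concrete classes here are placeholders):

// Mapper output types (needed when they differ from the final output types):
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Final (reducer) output types:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);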


HealthCare Dataset



Revisit De-identification Architecture

1. Sqoop takes a DB dump in CSV format and ingests it into HDFS.
2. Map tasks read the CSV file from HDFS and de-identify columns based on the configuration.
3. Reduce tasks store the de-identified CSV file back into HDFS.
DeIdentify MapReduce Code

// Encrypts a column value with AES so that personal health information is de-identified.
// Requires javax.crypto.Cipher, javax.crypto.spec.SecretKeySpec,
// org.apache.commons.codec.binary.Base64, and a logger in the enclosing class.
public static String encrypt(String strToEncrypt, byte[] key)
{
    try
    {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
        cipher.init(Cipher.ENCRYPT_MODE, secretKey);

        // Encrypt the value and Base64-encode the ciphertext so it stays printable in a CSV
        String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));

        return encryptedString.trim();
    }
    catch (Exception e)
    {
        logger.error("Error while encrypting", e);
    }
    return null;
}
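For context, encrypt() would typically be invoked from a mapper's map() method; a hypothetical sketch (the column indices and the hard-coded key are illustrative assumptions, not the actual course code):

// Hypothetical mapper body, assuming Mapper<Object, Text, Text, NullWritable>
// (NullWritable is org.apache.hadoop.io.NullWritable).
@Override
protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");
    int[] sensitiveCols = {2, 5};                   // assumed: name and SSN columns
    byte[] aesKey = "hadoopdeidentify".getBytes();  // placeholder 16-byte AES key
    for (int i : sensitiveCols) {
        if (i < cols.length) {
            cols[i] = encrypt(cols[i], aesKey);     // replace PHI with ciphertext
        }
    }
    context.write(new Text(String.join(",", cols)), NullWritable.get());
}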



Demo of De-identify Program



Weather Data

Dataset Link: ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/
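As a hypothetical shape for a maximum-temperature job over such data (the field positions below are placeholders; the real daily01 column layout must be checked against the dataset's documentation):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: field indices are placeholders, NOT the actual daily01 layout.
public class MaxTempMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().trim().split("\\s+");
        String station = fields[0];                      // assumed: station id
        double maxTemp = Double.parseDouble(fields[6]);  // assumed: daily max temperature
        if (maxTemp > -9999) {                           // skip missing-value sentinels
            context.write(new Text(station), new DoubleWritable(maxTemp));
        }
    }
}

class MaxTempReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double max = Double.NEGATIVE_INFINITY;
        for (DoubleWritable v : values) {
            max = Math.max(max, v.get());   // keep the highest reading per station
        }
        context.write(key, new DoubleWritable(max));
    }
}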


Demo of WeatherData Program



Assignment
Write MapReduce code for WordCount on your own and run it on Edureka’s Cloud Lab

Download all the MapReduce code from the LMS, import it into your Eclipse IDE, and execute it

Try Maximum Temperature problem in MapReduce

Try Hot and Cold day problem in MapReduce



Pre-work
Watch the video “Running MapReduce Program” under Module 3 of your LMS

Attempt the Word Count, Patents, & Alphabets assignments using the items present in the LMS under the
Module 3 tab

Review the Interview Questions for MapReduce

http://www.edureka.in/blog/hadoop-interview-questions-mapreduce/

Review the Next Generation MapReduce (MRv2 or YARN)

http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/

http://www.edureka.in/blog/hadoop-2-0-setting-up-a-single-node-cluster-in-15-minutes/

Setup the CDH4 Hadoop development environment using the documents present in the LMS

http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/



Agenda for Next Class
▪ Map and Reduce Side Join

▪ Counters

▪ Distributed Cache

▪ Custom Input Format

▪ Sequence Input Format

▪ MRUnit



Copyright © 2017, edureka and/or its affiliates. All rights reserved.