How to make the best use of Live Sessions

• Please log in on time

• Please check your network connection and audio before the class to ensure a smooth session

• All participants will be on mute by default. You will be unmuted when requested or as needed

• Please use the “Questions” panel on your webinar tool to interact with the instructor at any point during the class

• Ask and answer questions to make your learning interactive

• Please keep the support phone number handy (US: 1855 818 0063 (toll free), India: +91 90191 17772) and raise tickets from the LMS in case of any issues with the tool

• Most often, logging off and rejoining will resolve tool-related issues

Copyright © 2017, edureka and/or its affiliates. All rights reserved.


Big Data & Hadoop Certification Training


Course Outline
Hadoop modules:
• Understanding Big Data and Hadoop
• Hadoop Architecture and HDFS
• Hadoop MapReduce Framework
• Advance MapReduce
• Pig
• Hive
• Advance Hive and HBase
• Advance HBase
• Processing Distributed Data with Apache Spark
• Apache Oozie and Hadoop Project

Kafka modules:
• Kafka Monitoring & Stream Processing
• Integration of Kafka with Hadoop & Storm
• Integration of Kafka with Spark & Flume
• Kafka Producer
• Kafka Consumer
• Kafka Operation and Performance Tuning
• Kafka Cluster Architectures & Administering Kafka
• Kafka Project


Module 4: Advance MapReduce


Objectives
At the end of this module, you will be able to:

• Implement Counters in MapReduce

• Understand Map and Reduce Side joins

• Test MapReduce Programs

• Implement Distributed Cache Concept in MapReduce

• Implement Custom Input Format in MapReduce

• Implement Sequence Input Format in MapReduce


Let’s Revise
INPUT DATA (distributed across Node 1 and Node 2)

▪ Input data is distributed to nodes
▪ Map: each map task works on a “split” of data
▪ Mapper outputs intermediate data
▪ Shuffle: data is exchanged between nodes in a “shuffle” process
▪ Intermediate data of the same key goes to the same reducer
▪ Reduce: reducer output is stored (on Node 1 and Node 2)
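
The map, shuffle and reduce stages recapped here can be illustrated with a small in-memory word count. This is a plain-Python sketch, not Hadoop code: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) and the sample splits are assumptions made for illustration.

```python
from collections import defaultdict

def map_task(split):
    """Map: emit (word, 1) for every word in one input split."""
    for line in split:
        for word in line.split():
            yield word, 1

def shuffle(map_outputs):
    """Shuffle: group intermediate values by key across all map outputs."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)

def run_job(splits):
    # One map task per split; each runs independently of the others.
    map_outputs = [list(map_task(split)) for split in splits]
    grouped = shuffle(map_outputs)
    return dict(reduce_task(k, v) for k, v in sorted(grouped.items()))

# Two hypothetical input splits, as if stored on two different nodes.
splits = [["big data", "hadoop"], ["big hadoop hadoop"]]
word_counts = run_job(splits)
```

Each call to `map_task` sees only its own split, which mirrors the isolation of map tasks; only the shuffle step brings values for the same key together.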


Annie’s Question
Can you use the Map-Reduce algorithm to perform a relational join
on two large tables sharing a key? Assume that the two tables are
formatted as comma-separated files in HDFS:
a. Yes
b. No


Annie’s Answer

Ans. Yes. Using Map-Reduce, we can perform join algorithms such as
map-side, reduce-side, and in-memory joins.


Annie’s Question
What types of algorithms are difficult to express in Map-Reduce?

a. Algorithms that require global, shared state


b. Algorithms that require application of the same mathematical
function to large numbers of individual binary records


Annie’s Answer
Ans. The correct option is ‘a’. The Map-Reduce paradigm works in a
massively parallel system. Map and Reduce tasks execute in isolation
on chunks of data (input splits), so algorithms that require a
global shared state to be maintained aren’t suitable for the Map-Reduce
framework.


Annie’s Question
At what stage of Map-Reduce task execution does a reducer's reduce
function start?

a. When at least one mapper is ready with its output
b. map() and reduce() start simultaneously
c. After processing for all the map tasks is completed


Annie’s Answer
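Ans. The correct option is ‘c’. The reduce function works on the output
of the Map tasks, and the output from all the Mappers is required before
the reduce function can start. (Reducers may begin copying map output
earlier, but reduce() itself runs only after all map tasks have finished.)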


Annie’s Question
You want to reduce the traffic between mapper and reducer. Which
interface should your class implement?
a. Partitioner
b. Combiner
c. Writable
d. WritableComparable


Annie’s Answer
Ans. The correct option is ‘b’. Combiners are essentially mini-reducers:
they run on each mapper's output and aggregate it locally, which reduces
the amount of data passed on to the reducers.
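
To make the effect of a combiner concrete, the following plain-Python sketch counts how many intermediate records one mapper would hand to the shuffle with and without local aggregation. It does not use the Hadoop API; `map_words`, `combine` and the sample line are hypothetical names and data chosen for illustration.

```python
from collections import Counter

def map_words(split):
    """Mapper: emit (word, 1) for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output locally (a 'mini-reducer')."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

split = ["to be or not to be"]
raw = map_words(split)      # what the mapper emits
combined = combine(raw)     # what is shuffled when a combiner is used

records_without_combiner = len(raw)
records_with_combiner = len(combined)
```

In this sample the mapper emits six records, while the combined output holds four, one per distinct word. The saving grows with the number of repeated keys per mapper, and the approach is only valid when the aggregation is associative and commutative, as summation is.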


Map and Reduce Side Joins
▪ Fragment (large table): divided into Split 1, Split 2, Split 3 and Split 4, one per map task
▪ Duplicate (small table): a copy is provided to every map task

Blog: http://www.edureka.in/blog/map-side-vs-join/


Map and Reduce Side Joins
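Map-side join flow:
▪ Task a (MapReduce Local Task): reads the Small Table Data and builds Hash Table Files
▪ The Hash Table Files are compressed and archived, then placed in the Distributed Cache
▪ Task b (Mappers): each Mapper reads the hash table from the Distributed Cache and matches it against Records of the Big Table Data
▪ The Mappers write the joined Output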

Demo: Joins in Map-Reduce
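
As a rough illustration of the two join strategies, the sketch below joins a small and a large table in plain Python. It is a simulation under stated assumptions, not the demo code: the tables, the function names `map_side_join` and `reduce_side_join`, and the inner-join semantics are all hypothetical choices.

```python
from collections import defaultdict

# Hypothetical sample tables of (key, value) rows.
departments = [(1, "Sales"), (2, "Engineering")]                # small table
employees = [(1, "Ann"), (2, "Bob"), (1, "Cid"), (3, "Dee")]    # large table

def map_side_join(large_split, small_table):
    """Join inside the mapper using an in-memory hash table of the small table."""
    lookup = dict(small_table)
    return [(k, name, lookup[k]) for k, name in large_split if k in lookup]

def reduce_side_join(left, right):
    """Tag rows by source, group them by key (the shuffle), then join per key."""
    grouped = defaultdict(lambda: {"L": [], "R": []})
    for k, v in left:
        grouped[k]["L"].append(v)
    for k, v in right:
        grouped[k]["R"].append(v)
    joined = []
    for k in sorted(grouped):
        for l_val in grouped[k]["L"]:
            for r_val in grouped[k]["R"]:
                joined.append((k, l_val, r_val))
    return joined

# Map-side: the large table is fragmented into splits, the small one is duplicated.
splits = [employees[:2], employees[2:]]
map_side = sorted(row for s in splits for row in map_side_join(s, departments))

# Reduce-side: both tables pass through the shuffle.
reduce_side = reduce_side_join(employees, departments)
```

Both strategies produce the same rows here. The map-side variant avoids the shuffle entirely but requires the small table to fit in each mapper's memory, while the reduce-side variant handles two large tables at the cost of moving both across the network.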


Input Format
Input file → Input Split → Record Reader → Mapper → (Intermediates)

▪ The Input Format divides each input file into Input Splits
▪ A Record Reader turns each Input Split into records
▪ Each Mapper processes the records of one split and emits intermediates


Input Format – Class Hierarchy
InputFormat<K,V>
  FileInputFormat<K,V>
    CombineFileInputFormat<K,V>
    TextInputFormat
    KeyValueTextInputFormat
    NLineInputFormat
    SequenceFileInputFormat<K,V>
      SequenceFileAsBinaryInputFormat
      SequenceFileAsTextInputFormat
      SequenceFileInputFilter<K,V>
  ComposableInputFormat<K,V> <<interface>>
    CompositeInputFormat<K,V>
  DBInputFormat<T>


Output Format

Reducer → RecordWriter → Output file

▪ The Output Format provides a RecordWriter for each Reducer
▪ Each RecordWriter writes its Reducer's output to a separate output file


Output Format – Class Hierarchy
OutputFormat<K,V>
  FileOutputFormat<K,V>
    TextOutputFormat<K,V>
    SequenceFileOutputFormat<K,V>
      SequenceFileAsBinaryOutputFormat
  NullOutputFormat<K,V>
  DBOutputFormat<K,V>
  FilterOutputFormat<K,V>
    LazyOutputFormat<K,V>


Demo: Custom Input Format
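
The core job of an input format is to decide how raw bytes become (key, value) records for the mapper. The plain-Python sketch below mimics that idea without the Hadoop API: `text_records` yields (byte offset, line) pairs in the style of a line-oriented reader, and `key_value_records` is a hypothetical custom reader that splits each line at the first tab. Both names and the sample data are assumptions.

```python
def text_records(data: bytes):
    """Yield (byte offset, line) pairs, as a line-oriented record reader would."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

def key_value_records(data: bytes, sep: str = "\t"):
    """Hypothetical custom reader: split each line into (key, value) at the first separator."""
    for _, line in text_records(data):
        key, _, value = line.partition(sep)
        yield key, value

# Hypothetical input: two tab-separated lines.
data = b"id1\tfirst\nid2\tsecond\n"
by_offset = list(text_records(data))
by_key = list(key_value_records(data))
```

The same bytes reach the mapper as different records depending on the reader: offsets and whole lines in the first case, parsed keys and values in the second. A custom input format makes the same kind of choice for a specific file layout.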


MRUnit Testing Framework
▪ Provides 4 drivers for separately testing MapReduce code
• MapDriver
• ReduceDriver
• MapReduceDriver
• PipelineMapReduceDriver

▪ Helps in filling the gap between MapReduce programs and JUnit*

▪ Better control on log messages with JUnit Integration

*JUnit is a simple framework to write repeatable tests.


Demo: MRUnit Testing Framework
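
The driver idea behind MRUnit (feed known input to a mapper in isolation, declare the expected output, then run the comparison) can be sketched in plain Python. The `MapDriver` class below is a hypothetical, heavily simplified stand-in written for this illustration; it is not MRUnit and does not reproduce its Java API.

```python
class MapDriver:
    """Hypothetical, simplified stand-in for the idea behind a mapper test driver."""

    def __init__(self, mapper):
        self.mapper = mapper
        self.inputs = []
        self.expected = []

    def with_input(self, key, value):
        self.inputs.append((key, value))
        return self

    def with_output(self, key, value):
        self.expected.append((key, value))
        return self

    def run_test(self):
        # Run the mapper on every input and compare with the declared output.
        actual = [out for k, v in self.inputs for out in self.mapper(k, v)]
        assert actual == self.expected, f"expected {self.expected}, got {actual}"
        return actual

def word_mapper(offset, line):
    """Mapper under test: emit (word, 1) for each word in the line."""
    return [(word, 1) for word in line.split()]

result = (MapDriver(word_mapper)
          .with_input(0, "cat dog cat")
          .with_output("cat", 1)
          .with_output("dog", 1)
          .with_output("cat", 1)
          .run_test())
```

The point of the pattern is that the mapper is exercised without a cluster, a job submission or any input files, so a failing assertion points directly at the map logic.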


Counters
▪ Counters are lightweight objects in Hadoop that allow you to keep track of system progress in both the map and reduce
stages of processing.

▪ Counters are used to gather information about the data we are analysing, like how many types of records were processed,
how many invalid records were found while running the job, etc.


Demo: Counters
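
A counter is essentially a named tally that tasks increment as a side effect of processing records. The plain-Python sketch below counts valid and invalid records while mapping; it does not use the Hadoop counter API, and the group name `RecordQuality`, the counter names and the `name,age` record layout are assumptions made for illustration.

```python
from collections import Counter

# Tallies keyed by (group, name); both are hypothetical labels.
counters = Counter()

def map_record(line):
    """Parse a 'name,age' line; count valid and invalid records as a side effect."""
    parts = line.split(",")
    if len(parts) != 2 or not parts[1].strip().isdigit():
        counters[("RecordQuality", "INVALID")] += 1
        return []
    counters[("RecordQuality", "VALID")] += 1
    return [(parts[0].strip(), int(parts[1]))]

lines = ["ann,31", "bob,unknown", "malformed", "cid,27"]
output = [pair for line in lines for pair in map_record(line)]
```

The job output contains only the valid records, while the counters report how much of the input was rejected, which is the kind of data-quality information the slide describes.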


Distributed Cache
Distributed Cache is a facility provided by the Map-Reduce
framework to cache files (text, archives, jars etc.) needed by
applications.

Files are copied only once per job and should not be modified
by the application or externally while the job is executing.

Distributed Cache can be used to distribute simple, read-only
data/text files and/or more complex types such as archives, jars
etc. via the JobConf.

[Figure: MapReduce tasks reading cached files from HDFS – Hadoop Distributed Cache]


Demo: Distributed Cache
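
The usage pattern for cached files is to load the read-only side data once per task, before any records are processed, and then consult it for every record. The plain-Python sketch below imitates that pattern with a temporary file standing in for a cached file; the class `StopWordMapper`, the file name `stopwords.txt` and its contents are hypothetical, and no Hadoop API is involved.

```python
import os
import tempfile

class StopWordMapper:
    """Mapper that loads a read-only side file once, before processing records."""

    def setup(self, cached_path):
        # Runs once per task: read the side file into memory.
        with open(cached_path, encoding="utf-8") as f:
            self.stop_words = {line.strip() for line in f if line.strip()}

    def map(self, line):
        # Runs once per record: only consults the in-memory copy.
        return [(w, 1) for w in line.split() if w not in self.stop_words]

# Stand-in for a file the framework would have copied to the task's node.
with tempfile.TemporaryDirectory() as workdir:
    cached_path = os.path.join(workdir, "stopwords.txt")
    with open(cached_path, "w", encoding="utf-8") as f:
        f.write("the\na\nof\n")

    mapper = StopWordMapper()
    mapper.setup(cached_path)
    emitted = mapper.map("the design of a cluster")
```

Because the side file is read only in `setup`, the per-record cost is a set lookup, and the file itself is never modified, which matches the read-only requirement stated above.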


Sequence File
▪ Hadoop is not restricted to processing plain text data. For custom binary data types, one can use the SequenceFile
▪ SequenceFile is a flat file consisting of binary key/value pairs
▪ Used in MapReduce as input/output formats
▪ Outputs of map tasks are stored using SequenceFile
▪ Provides
• A Writer
• A Reader
• A Sorter


Sequence File
Three different SequenceFile formats:
▪ Uncompressed key/value records
▪ Record compressed key/value records
• Only ‘values’ are compressed here
▪ Block compressed key/value records
• Both keys and values are collected in ‘blocks’ separately and compressed
▪ The other objective of using SequenceFile is to 'pack' many small files into a single large SequenceFile for the
MapReduce computation, since the design of Hadoop prefers large files (remember that the default HDFS block
size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later).


Sequence File – Record Compression
File layout: Header | Record | Record | Sync | Record | Record | Record | Sync | Record

▪ No compression: Record length (4 bytes) | Key length (4 bytes) | Key | Value
▪ Record compression: Record length (4 bytes) | Key length (4 bytes) | Key | Compressed Value


Sequence File – Block Compression

File layout: Header | Sync | Block | Sync | Block | Sync | Block | Sync | Block

▪ Block compression: Number of records (1-5 bytes) | Compressed key lengths | Compressed keys | Compressed value lengths | Compressed values


Demo: SequenceFile
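
The record layout of a binary key/value file (record length, key length, key, value) can be tried out with a toy encoder. The plain-Python sketch below is an assumption-laden simplification: it omits the header and sync markers, uses zlib as a stand-in codec, and is not byte-compatible with real SequenceFiles. The function names and the file-name keys are hypothetical; the keys illustrate packing several small files into one container.

```python
import struct
import zlib

def write_record(key: bytes, value: bytes, compress_value: bool = False) -> bytes:
    """Encode one record: record length (4) | key length (4) | key | value."""
    if compress_value:
        value = zlib.compress(value)  # record compression: only the value is compressed
    # Toy convention: record length covers the key and value bytes.
    return struct.pack(">ii", len(key) + len(value), len(key)) + key + value

def read_records(buf: bytes, compressed_value: bool = False):
    """Decode records written by write_record, yielding (key, value) pairs."""
    pos = 0
    while pos < len(buf):
        record_len, key_len = struct.unpack_from(">ii", buf, pos)
        pos += 8
        key = buf[pos:pos + key_len]
        value = buf[pos + key_len:pos + record_len]
        pos += record_len
        if compressed_value:
            value = zlib.decompress(value)
        yield key, value

# Pack two hypothetical small files into one byte stream, keyed by file name.
pairs = [(b"file1.txt", b"a" * 100), (b"file2.txt", b"b" * 100)]
plain = b"".join(write_record(k, v) for k, v in pairs)
packed = b"".join(write_record(k, v, compress_value=True) for k, v in pairs)
```

Reading either stream back with `read_records` returns the original pairs; the record-compressed stream is smaller because the repetitive values compress well while the keys stay readable without decompression.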


Assignment
Practice “Advance MR Codes” present in the LMS in the Cloud Lab


Pre-work
Review the following PIG blogs:
http://www.edureka.in/blog/pig-programming-create-your-first-apache-pig-script/


Agenda for Next Class
• PIG and its need
• Difference between PIG and MapReduce
• PIG features and programming structure
• PIG running modes
• PIG components and data model
• Basic operations in PIG
• UDF in PIG
