How To Make The Best Use of Live Sessions

• Please login on time
• Please do a check on your network connection and audio before the class to have a smooth session
• All participants will be on mute, by default. You will be unmuted when requested or as needed
• Please use the “Questions” panel on your webinar tool to interact with the instructor at any point during the class
• Ask and answer questions to make your learning interactive
• Please keep the support phone number handy (US : 1855 818 0063 (toll free), India : +91 90191 17772) and raise tickets from the LMS in case of any issues with the tool
• Most often, logging off and rejoining will resolve tool-related issues
Copyright © 2017, edureka and/or its affiliates. All rights reserved.

Big Data & Hadoop Certification Training

Course Outline
• Understanding Big Data and Hadoop
• Hadoop Architecture and HDFS
• Hadoop MapReduce Framework
• Advance MapReduce
• Pig
• Hive
• Advance Hive and HBase
• Advance HBase
• Processing Distributed Data with Apache Spark
• Apache Oozie and Hadoop Project
• Kafka Monitoring & Stream Processing
• Integration of Kafka with Hadoop & Storm
• Integration of Kafka with Spark & Flume
• Kafka Producer
• Kafka Consumer
• Kafka Operation and Performance Tuning
• Kafka Cluster Architectures & Administering Kafka
• Kafka Project

Module 4: Advance MapReduce

Objectives
At the end of this module, you will be able to:
• Implement Counters in MapReduce
• Understand Map and Reduce Side joins
• Test MapReduce Programs
• Implement Distributed Cache Concept in MapReduce
• Implement Custom Input Format in MapReduce
• Implement Sequence Input Format in MapReduce

Let’s Revise
[Figure: input data flowing through Map and Reduce tasks on Node 1 and Node 2]
• Input data is distributed to nodes
• Each map task works on a “split” of data
• Mapper outputs intermediate data
• Data exchange between nodes in a “shuffle” process
• Intermediate data of the same key goes to the same reducer
• Reducer output is stored
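The flow above can be traced in a few lines of plain Python. This is only an in-memory sketch of the map, shuffle, and reduce steps for a word count, not Hadoop code; the function names (map_fn, partition, reduce_fn, run_job) and the toy partitioner are assumptions made for illustration.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def partition(key, num_reducers):
    """Toy partitioner (an assumption): the same key always goes to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers

def reduce_fn(key, values):
    """Reduce: sum all counts collected for one key."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    # Map phase: each split is processed independently.
    intermediate = []
    for split in splits:
        for line in split:
            intermediate.extend(map_fn(line))
    # Shuffle phase: route each pair to a reducer and group values by key.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        reducers[partition(key, num_reducers)][key].append(value)
    # Reduce phase: each reducer sees its keys in sorted order.
    output = {}
    for groups in reducers:
        for key in sorted(groups):
            out_key, total = reduce_fn(key, groups[key])
            output[out_key] = total
    return output

# Two splits, as if the input were distributed to two nodes.
splits = [["big data big"], ["data hadoop"]]
result = run_job(splits)
```

Because the partitioner depends only on the key, every occurrence of a word reaches the same reducer, which is what makes the per-key sum correct.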

Annie’s Question
Can you use the Map-Reduce algorithm to perform a relational join
on two large tables sharing a key? Assume that the two tables are
formatted as comma-separated files in HDFS:
a. Yes
b. No

Annie’s Answer
Ans. Yes. Using Map-Reduce we can perform join algorithms such as Map-side, Reduce-side, and In-Memory joins.
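As a rough illustration of a reduce-side join, the sketch below tags each record with its source table in the map step, groups by the shared key, and pairs the two sides in the reduce step. It is a plain-Python simulation rather than Hadoop code; the table layouts (customers as `id,name`, orders as `order_id,customer_id,amount`) and all function names are assumptions.

```python
from collections import defaultdict

def map_customers(line):
    """Emit (join key, tagged value) for a customers record: id,name."""
    cust_id, name = line.split(",")
    return cust_id, ("C", name)

def map_orders(line):
    """Emit (join key, tagged value) for an orders record: order_id,cust_id,amount."""
    order_id, cust_id, amount = line.split(",")
    return cust_id, ("O", (order_id, float(amount)))

def reduce_join(key, tagged_values):
    """Pair every customer name with every order sharing the key (inner join)."""
    names = [value for tag, value in tagged_values if tag == "C"]
    orders = [value for tag, value in tagged_values if tag == "O"]
    return [(key, name, order_id, amount)
            for name in names
            for order_id, amount in orders]

def reduce_side_join(customer_lines, order_lines):
    # Shuffle stand-in: group tagged values from both tables by join key.
    groups = defaultdict(list)
    for line in customer_lines:
        key, value = map_customers(line)
        groups[key].append(value)
    for line in order_lines:
        key, value = map_orders(line)
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined

customers = ["1,Ann", "2,Bob", "3,Cy"]
orders = ["100,1,25.0", "101,2,10.5", "102,1,7.25"]
joined = reduce_side_join(customers, orders)
```

Customer 3 has no orders, so it produces no output rows; that is the inner-join behaviour of this particular sketch.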

Annie’s Question
What types of algorithms are difficult to express in Map-Reduce?
a. Algorithms that require global, shared state

b. Algorithms that require application of the same mathematical
function to large numbers of individual binary records

Annie’s Answer
Ans. The correct option is ‘a’. The Map-Reduce paradigm works in a
massively parallel system. Map and Reduce tasks execute in isolation,
each on its own chunk of data (an input split), so algorithms that require a
global shared state to be maintained are not well suited to the Map-Reduce
framework.

Annie’s Question
At what stage of Map-Reduce task execution does a reducer's reduce
function start?
a. At least one mapper is ready with its output

b. map() and reduce() start simultaneously
c. After processing for all the map tasks is completed

Annie’s Answer
Ans. The correct option is ‘c’. Reduce tasks work on the output
of the Map tasks. Reducers may begin copying map output as soon as individual
mappers finish, but the reduce function itself is called only once the output
from all the Mappers is available.

Annie’s Question
You want to reduce the traffic between mapper and reducer. Your class
should implement which interface?
a. Partitioner
b. Combiner
c. Writable
d. WritableComparable

Annie’s Answer
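Ans. The correct option is ‘b’. Combiners are basically mini-reducers.
They aggregate map output locally, which lessens the data passed on to the
reducers. In Hadoop's Java API a combiner is written as a Reducer and
registered with job.setCombinerClass(), rather than by implementing a separate
Combiner interface.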
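To see why a combiner reduces mapper-to-reducer traffic, the following plain-Python sketch counts how many key/value pairs cross the shuffle with and without local aggregation. It simulates the idea only; the function names and the word-count workload are assumptions, not Hadoop APIs.

```python
from collections import Counter

def map_words(split):
    """Map: emit (word, 1) for every word in every line of one split."""
    return [(word, 1) for line in split for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle_and_reduce(map_outputs):
    """Sum values per key and count how many pairs crossed the shuffle."""
    totals = Counter()
    shuffled_pairs = 0
    for pairs in map_outputs:
        for key, value in pairs:
            totals[key] += value
            shuffled_pairs += 1
    return dict(totals), shuffled_pairs

splits = [["a b a a"], ["b b a"]]
map_outputs = [map_words(split) for split in splits]

plain_totals, plain_traffic = shuffle_and_reduce(map_outputs)
combined_totals, combined_traffic = shuffle_and_reduce(
    [combine(pairs) for pairs in map_outputs]
)
```

Both runs give the same totals, but the combined run shuffles one pair per distinct word per mapper instead of one pair per word occurrence. This works here because addition can be applied in any grouping; a combiner is safe only for operations with that property.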

Map and Reduce Side Joins
[Figure: the large table is fragmented into Split 1 to Split 4, one per map task; the small table is duplicated to every map task]
Blog: http://www.edureka.in/blog/map-side-vs-join/

Map and Reduce Side Joins
[Figure: (a) a MapReduce local task reads the small table data and builds hash table files, which are compressed, archived, and placed in the Distributed Cache; (b) each mapper reads records of the big table data, looks them up in the hash table, and writes the output]
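A map-side join along the lines of the figure can be sketched in plain Python: the small table is turned into an in-memory hash table that every map task loads, and each task then joins its own split of the big table without any shuffle. The record layouts and function names below are assumptions for illustration; this is not Hadoop code.

```python
def build_hash_table(small_table_lines):
    """Build the lookup each map task loads: customer id -> name."""
    return dict(line.split(",", 1) for line in small_table_lines)

def map_join(record, lookup):
    """Join one big-table record (order_id,cust_id,amount) against the lookup."""
    order_id, cust_id, amount = record.split(",")
    name = lookup.get(cust_id)
    if name is None:
        return None  # no match in the small table: drop (inner join)
    return (order_id, name, float(amount))

def map_side_join(big_table_splits, small_table_lines):
    output = []
    for split in big_table_splits:
        # Each map task gets its own copy of the small table.
        lookup = build_hash_table(small_table_lines)
        for record in split:
            joined = map_join(record, lookup)
            if joined is not None:
                output.append(joined)
    return output

small_table = ["1,Ann", "2,Bob"]
big_table_splits = [["100,1,25.0", "101,3,4.0"], ["102,2,10.5"]]
joined_rows = map_side_join(big_table_splits, small_table)
```

The join finishes inside the map step, so no reducer is needed; the trade-off is that the small table must fit in each task's memory.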

Demo: Joins in Map-Reduce

Input Format
[Figure: each input file is divided by the Input Format into Input Splits; each Input Split is read by a Record Reader, which feeds records to a Mapper; each Mapper emits (Intermediates)]
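The split-to-record step can be made concrete with a small simulation of a line-oriented record reader. The sketch assumes text input where each record is a line keyed by its byte offset, and it follows the usual convention that a split which does not start at offset 0 skips its first (partial) line, while each split reads past its end to finish its last line. The function name read_split is hypothetical.

```python
def read_split(data, start, length):
    """Return (byte offset, line) records belonging to one split of `data`."""
    end = start + length
    pos = start
    if start != 0:
        # The line containing `start` is read by the previous split.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A line that begins at or before `end` belongs to this split,
    # even if it finishes beyond the split boundary.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append((pos, data[pos:line_end].decode("utf-8")))
        pos = line_end + 1
    return records

data = b"alpha\nbeta\ngamma\ndelta\n"
# Three splits of at most 8 bytes; the boundaries cut through lines.
split_records = [read_split(data, start, 8) for start in (0, 8, 16)]
all_records = [record for records in split_records for record in records]
```

Even though the split boundaries fall in the middle of lines, every line is delivered to exactly one mapper.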

Input Format – Class Hierarchy
InputFormat<K,V>
    FileInputFormat<K,V>
        CombineFileInputFormat<K,V>
        TextInputFormat
        KeyValueTextInputFormat
        NLineInputFormat
        SequenceFileInputFormat<K,V>
            SequenceFileAsBinaryInputFormat
            SequenceFileAsTextInputFormat
            SequenceFileInputFilter<K,V>
    ComposableInputFormat<K,V> <<interface>>
        CompositeInputFormat<K,V>
    DBInputFormat<T>

Output Format
[Figure: each Reducer passes its output through the Output Format to a RecordWriter, which writes one Output file per Reducer]

Output Format – Class Hierarchy
OutputFormat<K,V>
    FileOutputFormat<K,V>
        TextOutputFormat<K,V>
        SequenceFileOutputFormat<K,V>
            SequenceFileAsBinaryOutputFormat
    NullOutputFormat<K,V>
    DBOutputFormat<K,V>
    FilterOutputFormat<K,V>
        LazyOutputFormat<K,V>

Demo: Custom Input Format

MRUnit Testing Framework
▪ Provides 4 drivers for separately testing MapReduce code
• MapDriver
• ReduceDriver
• MapReduceDriver
• PipelineMapReduceDriver
▪ Helps in filling the gap between MapReduce programs and JUnit*
▪ Better control on log messages with JUnit Integration
*JUnit is a simple framework to write repeatable tests.
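The driver idea is that a mapper or reducer can be fed a few records and checked against expected output without starting a cluster. MRUnit itself is a Java library; the Python class below is only a hypothetical analogue of a map driver, and its method names (with_input, with_output, run_test) are assumptions chosen for this sketch.

```python
class MapDriver:
    """Hypothetical stand-in for a map driver: feed inputs, check outputs."""

    def __init__(self, mapper):
        self.mapper = mapper
        self.inputs = []
        self.expected = []

    def with_input(self, key, value):
        self.inputs.append((key, value))
        return self  # returning self allows chained calls

    def with_output(self, key, value):
        self.expected.append((key, value))
        return self

    def run_test(self):
        actual = []
        for key, value in self.inputs:
            actual.extend(self.mapper(key, value))
        assert actual == self.expected, (
            f"expected {self.expected}, got {actual}"
        )

def word_count_mapper(offset, line):
    """Mapper under test: emit (word, 1) for each word in the line."""
    return [(word, 1) for word in line.split()]

# The mapper is exercised in isolation, with no job and no cluster.
(MapDriver(word_count_mapper)
    .with_input(0, "cat dog cat")
    .with_output("cat", 1)
    .with_output("dog", 1)
    .with_output("cat", 1)
    .run_test())
```

A reducer can be tested the same way by feeding it a key with a list of values and comparing the emitted pairs.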

Demo: MRUnit Testing Framework

Counters
▪ Counters are lightweight objects in Hadoop that allow you to keep track of system progress in both the map and reduce stages of processing.
▪ Counters are used to gather information about the data we are analysing, like how many types of records were processed, how many invalid records were found while running the job, etc.
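As a small illustration of counting valid and invalid records, the sketch below lets every simulated map task increment its own counters and then sums them into job-level totals. It is plain Python, not the Hadoop counter API; the counter names (VALID, MISSING_FIELD, BAD_NUMBER) and the `station,temperature` record layout are assumptions.

```python
from collections import Counter

def map_record(line, counters):
    """Parse one 'station,temperature' record, counting good and bad input."""
    fields = line.split(",")
    if len(fields) != 2:
        counters["MISSING_FIELD"] += 1
        return None
    try:
        temperature = int(fields[1])
    except ValueError:
        counters["BAD_NUMBER"] += 1
        return None
    counters["VALID"] += 1
    return fields[0], temperature

def run_map_tasks(splits):
    outputs = []
    job_counters = Counter()
    for split in splits:
        task_counters = Counter()  # each task keeps its own counters
        for line in split:
            record = map_record(line, task_counters)
            if record is not None:
                outputs.append(record)
        job_counters.update(task_counters)  # totals are summed across tasks
    return outputs, dict(job_counters)

splits = [["s1,20", "s2,abc"], ["s3", "s4,31"]]
records, counters = run_map_tasks(splits)
```

The counters summarise data quality for the whole job without changing the job's output.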

Demo: Counters

Distributed Cache
[Figure: MapReduce tasks reading cached files from HDFS through the Hadoop Distributed Cache]
Distributed Cache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars, etc.) needed by applications.
Files are copied only once per job and should not be modified by the application or externally while the job is executing.
Distributed Cache can be used to distribute simple, read-only data/text files and/or more complex types such as archives and jars via the JobConf.
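One common use of a cached read-only file is a lookup list that every task loads once before processing its records. The plain-Python sketch below imitates that pattern with a local temporary file standing in for the cached file; the class name StopWordMapper, its setup/map methods, and the stop-word use case are assumptions for illustration, not Hadoop APIs.

```python
import os
import tempfile

class StopWordMapper:
    """Simulated map task that loads a read-only side file once in setup()."""

    def setup(self, cache_path):
        # Read the cached file once per task, not once per record.
        with open(cache_path, encoding="utf-8") as handle:
            self.stop_words = {line.strip() for line in handle if line.strip()}

    def map(self, line):
        return [word for word in line.lower().split()
                if word not in self.stop_words]

def run_tasks(splits, cache_path):
    output = []
    for split in splits:
        mapper = StopWordMapper()
        mapper.setup(cache_path)  # every task sees the same unmodified file
        for line in split:
            output.extend(mapper.map(line))
    return output

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a file that would be shipped to each node by the cache.
    cache_path = os.path.join(tmp_dir, "stopwords.txt")
    with open(cache_path, "w", encoding="utf-8") as handle:
        handle.write("the\nof\n")
    kept_words = run_tasks([["The power of Hadoop"], ["the cache"]], cache_path)
```

The file is treated as read-only throughout, in line with the rule that cached files must not change while the job runs.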

Demo: Distributed Cache

Sequence File
▪ Hadoop is not restricted to processing plain text data. For custom binary data types, one can use the SequenceFile
▪ SequenceFile is a flat file consisting of binary key/value pairs
▪ Used in MapReduce as input/output formats
▪ The temporary outputs of maps are stored internally using SequenceFile
▪ Provides
▪ A Writer
▪ A Reader
▪ A Sorter

Sequence File
Three different SequenceFile formats:
▪ Uncompressed key/value records
▪ Record compressed key/value records
• Only ‘values’ are compressed here
▪ Block compressed key/value records
• Both keys and values are collected in ‘blocks’ separately and compressed
▪ The other objective of using SequenceFile is to 'pack' many small files into a single large SequenceFile for the MapReduce computation, since the design of Hadoop prefers large files (remember that the default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x).
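The packing idea can be sketched with a toy container of length-prefixed key/value records, where the key is a file name and the value is the file's bytes. This is a deliberately simplified layout for illustration, not the real SequenceFile on-disk format: it has no header and no sync markers, and the function names are assumptions. The optional flag compresses only the values, mirroring the record-compressed variant described above.

```python
import io
import struct
import zlib

def write_records(stream, records, compress_values=False):
    """Append (key, value bytes) pairs as length-prefixed binary records."""
    for key, value in records:
        key_bytes = key.encode("utf-8")
        if compress_values:
            value = zlib.compress(value)  # keys stay uncompressed
        # Two 4-byte big-endian integers: record length, then key length.
        stream.write(struct.pack(">II", len(key_bytes) + len(value), len(key_bytes)))
        stream.write(key_bytes)
        stream.write(value)

def read_records(stream, compressed_values=False):
    """Yield (key, value bytes) pairs back from the container."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of container
        record_length, key_length = struct.unpack(">II", header)
        key = stream.read(key_length).decode("utf-8")
        value = stream.read(record_length - key_length)
        if compressed_values:
            value = zlib.decompress(value)
        yield key, value

# Many small "files" packed into one container.
small_files = [("a.txt", b"hello"), ("b.txt", b"world" * 100)]

plain = io.BytesIO()
write_records(plain, small_files)
plain.seek(0)
unpacked = list(read_records(plain))

packed = io.BytesIO()
write_records(packed, small_files, compress_values=True)
packed.seek(0)
unpacked_compressed = list(read_records(packed, compressed_values=True))
```

Packing this way means a job reads one large file of records instead of opening many tiny files.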

Sequence File – Record Compression
File layout: Header | Record | Record | Sync | Record | Record | Record | Sync | Record
Record without compression: Record length (4 bytes) | Key length (4 bytes) | Key | Value
Record with record compression: Record length (4 bytes) | Key length (4 bytes) | Key | Compressed value

Sequence File – Block Compression
File layout: Header | Sync | Block | Sync | Block | Sync | Block | Sync | Block
Block with block compression: Number of records (1-5 bytes) | Compressed key lengths | Compressed keys | Compressed value lengths | Compressed values

Demo: SequenceFile

Assignment
Practice “Advance MR Codes” present in the LMS in the Cloud Lab

Pre-work
Review the following PIG blogs:
http://www.edureka.in/blog/pig-programming-create-your-first-apache-pig-script/

Agenda for Next Class
• PIG and its need
• Difference between PIG and MapReduce
• PIG features and programming structure
• PIG running modes
• PIG components and data model
• Basic operations in PIG
• UDF in PIG

