
BIG DATA ANALYTICS
UNIT-3 HADOOP CONCEPTS
SYLLABUS
UNIT-3

HADOOP CONCEPTS
History of Hadoop – The Hadoop Distributed File System (HDFS) –
Components of Hadoop – Analyzing the Data with Hadoop – Scaling Out
– Hadoop Streaming – Design of HDFS – Java Interfaces to HDFS
Basics.
COURSE OUTCOMES
After successful completion of this course, the students should be able to:
CO1: Identify the need for big data analytics for a domain.
CO2: Use the Hadoop and MapReduce framework.
CO3: Apply big data analytics for a given problem.
CO4: Suggest areas to apply big data to increase business outcomes.
HDFS WRITE MECHANISM
HDFS READ MECHANISM
PARALLEL PROCESSING?
“FAST COMPUTING”
HADOOP MAPREDUCE
- The processing component of Apache Hadoop.
- Processes the data in parallel in the distributed environment.
CONCEPT BEHIND HADOOP MAPREDUCE:

“Instead of moving the data to the processing unit, the processing unit is moved to the data.”
[Because the size of the code is far smaller than the size of the data]
DAEMONS
MapReduce follows a master-slave architecture, with two associated daemons:
1. Job Tracker
2. Task Tracker
Job Tracker
- The master daemon; there is a single Job Tracker per cluster.
- Establishes the communication between Hadoop and the client application.
- Handles the allocation of nodes for tasks.
- Monitors the execution of tasks.
- Responsible for re-execution when a task fails.
Task Tracker
- The slave daemon; it executes the tasks assigned to it by the Job Tracker.
- Frequently communicates its status to the Job Tracker (heartbeats).
PHASES OF MAPREDUCE
MapReduce works by breaking the processing into two phases, both written by the programmer:
1. Map Phase
2. Reduce Phase
MapReduce works on the concept of key-value pairs: it accepts its input as key-value pairs.
There are two more phases, handled by the Hadoop framework itself:
1. Shuffle Phase
2. Sort Phase
Understanding Key-Value Pairs
SAMPLE:
One key can have more than one value.
BROAD STEPS
- The Map phase takes its input as key-value pairs.
- It produces its output in the form of key-value pairs (the output key-value types can differ from the input ones).
- The outputs of the various map tasks are grouped together on the basis of the key (done in the shuffle and sort phases).
- Each key and its associated values are sent to the Reduce phase for processing.
- The output of the Reduce phase is written to HDFS. (A minimal sketch of these steps follows below.)
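A minimal pure-Python sketch of these broad steps, assuming toy data and no Hadoop cluster (the helper name simulate_mapreduce and the sample records are hypothetical): the mapper emits key-value pairs, a framework-style shuffle groups the values by key, and the reducer processes each key together with its list of values.

```python
from collections import defaultdict

def simulate_mapreduce(records, mapper, reducer):
    """Toy, single-process imitation of the MapReduce phases described above."""
    # Map phase: each input record may emit zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle and sort phase (done by the Hadoop framework in a real job):
    # group all values that share a key, then visit the keys in sorted order.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, with the full list of values for that key.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# One key can end up with more than one value after shuffling.
records = [("2023", 31), ("2024", 28), ("2023", 35)]
result = simulate_mapreduce(records, lambda rec: [rec], lambda key, values: max(values))
print(result)   # {'2023': 35, '2024': 28}
```

The shuffle and sort step is what turns many scattered (key, value) pairs into one (key, list-of-values) group per key; the programmer only writes the mapper and the reducer.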
Visualization: How MapReduce Works
Goal: find the maximum temperature per year (Year, Maximum Temperature).
The dataset contains the date (dd-mm-yyyy), zipcode and temperature.
Mapper Phase
The expected final output is the year and the maximum temperature for that year.
So the mapper function extracts only the necessary data from each record (the year and the temperature) and emits it as a key-value pair.
SHUFFLE PHASE
- Groups the key-value pairs emitted by the mappers based on the key.
SORT PHASE
- Sorts the grouped key-value pairs by key before they are handed to the reducers.
REDUCE PHASE
The reduce method receives a key and the list of values for that key. It iterates over the list and can perform any kind of operation on it, such as:
- Minimum temperature
- Average temperature
- Maximum temperature
- Count of records
Here it computes the maximum temperature.
OVERVIEW
Pseudocode / Algorithm: for better understanding, try it yourself.
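A short Python sketch of this worked example, assuming a handful of made-up records in the dd-mm-yyyy,zipcode,temperature format described above; it mirrors the mapper, shuffle/sort and reduce phases just discussed.

```python
from collections import defaultdict

# Hypothetical input records in the "dd-mm-yyyy,zipcode,temperature" format.
records = [
    "10-01-1990,600001,25",
    "23-07-1990,600042,39",
    "05-06-1991,641001,41",
    "14-12-1991,641001,22",
]

def mapper(line):
    """Map phase: keep only the necessary data and emit (year, temperature)."""
    date, _zipcode, temperature = line.split(",")
    year = date.split("-")[2]
    yield year, int(temperature)

def reducer(year, temperatures):
    """Reduce phase: iterate over the values for one key; here, take the maximum."""
    return year, max(temperatures)

# Shuffle and sort phase: group the mapper output by key (the year).
groups = defaultdict(list)
for record in records:
    for year, temperature in mapper(record):
        groups[year].append(temperature)

for year in sorted(groups):
    print(reducer(year, groups[year]))   # ('1990', 39) then ('1991', 41)
```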
Hadoop Scaling Out
Scaling Out
Scaling out is a growth architecture or method that focuses on horizontal growth, i.e. the addition of new resources (more nodes) instead of increasing the capacity of the current resources (known as scaling up).
Terminologies for understanding
- Split => a fixed chunk of data.
- Data blocks and splits are different:
- Data blocks belong to HDFS; splits belong to MapReduce.
[Splits are the input to the MapReduce job; the MapReduce job processes that input and produces the output]
Terminologies for understanding (Cont’d)
- Hadoop runs a MapReduce job by dividing it into two kinds of tasks:
1. Map tasks
2. Reduce tasks
- The results of the map tasks are merged and shuffled, which forms the input for the reduce tasks.
- The reduce tasks process that input and produce the final result.
Data Flow: Map Job
- A map task runs on the node where the data is present (data locality).
- For fast processing the data should be on the local node; if it is not, it has to be fetched over the network, which adds latency before the input can be processed.
- The optimal split size is therefore equal to the data block size.
- The map output is stored on the local disk, not in HDFS [it is only an intermediate result].
- If the map output is lost before the reduce task has consumed it, Hadoop reschedules and re-runs the map task.
- The Job Tracker cleans up the map output only after the reduce task has completed.
- If there are zero reducers, the map output is written directly to HDFS.
Data Flow: Reduce Job
- No data locality: a reduce task usually consumes output from many map tasks across the cluster.
- The number of reducers is decided independently (it is not determined by the input size).
- The reducer output is written to HDFS (where, like any HDFS file, it is read-only once written).
Data Flow with Multiple Mappers and a Single Reducer
Data Flow with Multiple Mappers and Multiple Reducers
Data Flow with Multiple Mappers and Zero Reducers
Previous Example with Multiple Mappers and Reducers:
Hadoop Streaming
- A core ideology of Hadoop:
“Data processing should be independent of the programming language”
- Data processing methods can therefore be written in languages other than Java.
- Hadoop Streaming is the ability to interface with map and reduce functions, which may even be written in two different languages. Ex: Map program: Ruby, Reduce program: Python.
- Hadoop Streaming uses Unix standard streams (stdin and stdout) as the interface between Hadoop and the program, as sketched below.
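A sketch of what such streaming programs might look like, assuming two hypothetical Python scripts named mapper.py and reducer.py that reuse the maximum-temperature example: each one reads lines from standard input and writes tab-separated key-value pairs to standard output, which is the interface Hadoop Streaming expects.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical): reads "dd-mm-yyyy,zipcode,temperature" records from stdin
# and writes "year<TAB>temperature" key-value pairs to stdout.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    date, _zipcode, temperature = line.split(",")
    year = date.split("-")[2]
    print(f"{year}\t{temperature}")
```

```python
#!/usr/bin/env python3
# reducer.py (hypothetical): stdin arrives grouped and sorted by key, so all the
# temperatures for one year are adjacent; keep a running maximum per year.
import sys

current_year, current_max = None, None
for line in sys.stdin:
    year, temperature = line.strip().split("\t")
    temperature = int(temperature)
    if year != current_year:
        if current_year is not None:
            print(f"{current_year}\t{current_max}")
        current_year, current_max = year, temperature
    else:
        current_max = max(current_max, temperature)

if current_year is not None:
    print(f"{current_year}\t{current_max}")
```

On a real cluster these scripts would be handed to the Hadoop Streaming jar through its -input, -output, -mapper and -reducer options (the exact jar path depends on the installation); locally they can be tested with a plain Unix pipeline such as cat data.txt | python3 mapper.py | sort | python3 reducer.py.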
