
UNIT 4 – Part 2

COURSE OUTCOMES
After successful completion of this course, the students should be able to:
CO1: Identify the need for big data analytics in a domain.
CO2: Use the Hadoop and MapReduce frameworks.
CO3: Apply big data analytics to a given problem.
CO4: Suggest areas where big data can be applied to improve business outcomes.
Shuffle and Sort
Shuffle Phase
Sort Phase
Sorting and merging the intermediate map outputs is done to increase disk I/O efficiency.
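The spill-and-merge behaviour behind the sort phase is tunable per job. A minimal sketch of two commonly cited knobs (the property names below are the Hadoop 2 names and vary across versions, so treat this as an illustration, not a definitive list):

import org.apache.hadoop.conf.Configuration;

public class SortTuning {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // In-memory sort buffer for map outputs, in MB (larger = fewer spills to disk).
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Max number of spill files merged in one pass (wider merges = fewer passes).
        conf.setInt("mapreduce.task.io.sort.factor", 64);
        return conf;
    }
}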
TASK EXECUTION
Speculative Execution
Task JVM Reuse
Skipping Bad Records
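All three features are switched on or off through job configuration. A hedged sketch of how that might look (property names vary across Hadoop versions; the JVM-reuse property is MRv1-era and is not honored under YARN):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

public class TaskExecutionConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Speculative execution: launch backup copies of straggler tasks.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        // Task JVM reuse (MRv1-era property; -1 means reuse without limit).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        // Skipping bad records: tolerate up to 1 bad record per map task.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
        return conf;
    }
}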
MAPREDUCE TYPES AND FORMATS

MAPREDUCE INPUT AND OUTPUT
General form of Map/Reduce functions:

map: (K1, V1) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)

General form with Combiner function:

map: (K1, V1) -> list(K2, V2)

combiner: (K2, list(V2)) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)

Partition function:

partition: (K2, V2) -> integer
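As a concrete illustration of these signatures, the classic word-count job instantiates K1 = LongWritable (byte offset in the file), V1 = Text (a line of input), K2 = K3 = Text (a word), and V2 = V3 = IntWritable (a count). A minimal sketch against Hadoop's Java MapReduce API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (K1, V1) -> list(K2, V2)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);          // emit (K2, V2)
            }
        }
    }
}

// reduce: (K2, list(V2)) -> list(K3, V3)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();                        // fold the list(V2)
        }
        context.write(word, new IntWritable(sum)); // emit (K3, V3)
    }
}

For reference, Hadoop's default partitioner (HashPartitioner) implements the partition function above roughly as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so equal keys always reach the same reducer.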


SERIALIZATION
- When two processes communicate (e.g., the Mapper and the Reducer), objects are transferred between them.
- Serialization is the process of converting structured objects into a byte stream for transmission over a network or for writing to persistent storage.

DESERIALIZATION
- Takes place at the receiving end.
- The process of converting a byte stream back into a series of structured objects.

INTERPROCESS COMMUNICATION in Hadoop is done using RPC (Remote Procedure Calls).
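To make the round trip concrete, a minimal sketch using Hadoop's built-in IntWritable (the serialize/deserialize helper names are our own, for illustration):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class SerializationDemo {
    // Serialization: structured object -> byte stream
    static byte[] serialize(Writable w) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        w.write(out);
        out.close();
        return bytes.toByteArray();
    }

    // Deserialization: byte stream -> structured object
    static void deserialize(Writable w, byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        w.readFields(in);
        in.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = serialize(new IntWritable(42)); // 4 bytes on the wire: compact
        IntWritable copy = new IntWritable();
        deserialize(copy, wire);
        System.out.println(copy.get());               // prints 42
    }
}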


Features of a good serialization framework:
1. Compact – the smaller the data transferred, the better the efficiency.
2. Fast – the processes of serialization and deserialization need to be fast.
3. Extensible – protocols need to change over time to meet newer requirements.
4. Interoperable – able to communicate between processes written in different languages.
Drawbacks of Java's built-in serialization:
1. It is not compact and carries a lot of overhead: Java serialization sends metadata, such as the class definition, along with the data, which increases the data size.
2. The increase in data size in turn increases the time spent on processing and serialization.

So Hadoop has its own serialization format and Writables.


Avro Serialization:
- Apache Avro is a language-neutral data serialization system.
- Data in Avro format can be processed by many languages.
- Avro allows the data to potentially outlive the language used to read and write it.
- Avro assumes that the schema is always present at both read and write time.
- Avro schemas are written in JSON.
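For illustration, a minimal hypothetical Avro schema in JSON (the record and field names are invented here):

{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}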
Writable Interface
Hadoop's Writable interface declares two methods, one for deserialization and one for serialization:
void readFields(DataInput in)
– Deserializes the fields of the given object.
void write(DataOutput out)
– Serializes the fields of the given object.
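A minimal sketch of a hypothetical custom type implementing Writable (the class name and fields are invented for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
    private int x;
    private int y;

    // No-arg constructor: needed so Hadoop can instantiate the object before readFields().
    public PointWritable() { }

    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);   // serialize fields in a fixed order
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();  // deserialize in exactly the same order they were written
        y = in.readInt();
    }
}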
WritableComparable Interface
- A combination of two interfaces: Hadoop's Writable and Java's Comparable.
- It extends both the Comparable interface of Java and the Writable interface of Hadoop.
- Hence it offers methods for data serialization, deserialization, and comparison.
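Since intermediate keys are sorted during the shuffle, key types in MapReduce must implement WritableComparable. A hypothetical sketch (the class name is invented for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearWritable implements WritableComparable<YearWritable> {
    private int year;

    public YearWritable() { }
    public YearWritable(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {   // from Writable
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // from Writable
        year = in.readInt();
    }

    @Override
    public int compareTo(YearWritable other) {                // from Comparable
        return Integer.compare(year, other.year);
    }
}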
Common Writable Data Types: Java vs. Writables at a Glance
Java type    Hadoop Writable
boolean      BooleanWritable
byte         ByteWritable
int          IntWritable (VIntWritable for variable-length encoding)
long         LongWritable (VLongWritable for variable-length encoding)
float        FloatWritable
double       DoubleWritable
String       Text
