
BSc In Information Technology

(Data Science)
SLIIT – 2019 (Semester 2)

Massive or BIG Data Processing


J.Alosius
Introduction to Map Reduce
BIG Data Processing and Abstraction
MapReduce Overview
• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases
• Map
• Reduce

MapReduce Features
• Automatic parallelization and distribution
• Fault-Tolerance
• Provides a clean abstraction for programmers to use
MapReduce Algorithm
• MAP
  • Iterate over a large number of records
  • Extract something of interest from each
• Shuffle and sort intermediate results
• REDUCE
  • Aggregate intermediate results
  • Generate final output
Key idea: provide a functional abstraction for these two operations

Programmers specify two functions:


map (k1, v1) → [(k2, v2)]
reduce (k2, [v2]) → [(k3, v3)]
• All values with the same key are sent to the same reducer
The execution framework handles everything else…
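The two functions and the shuffle between them can be sketched in plain Java using word count. This is a Hadoop-free simulation for illustration only; the class and method names (WordCountSim, mapPhase, shufflePhase, reducePhase) are assumptions, not Hadoop APIs.

```java
import java.util.*;

public class WordCountSim {

    // map(k1, v1) -> [(k2, v2)]: emit (word, 1) for every token in the line
    static List<Map.Entry<String, Integer>> mapPhase(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) out.add(Map.entry(token, 1));
        }
        return out;
    }

    // Shuffle and sort: group all values by key (normally the framework's job)
    static SortedMap<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce(k2, [v2]) -> [(k3, v3)]: sum the counts for each word
    static Map<String, Integer> reducePhase(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((word, ones) ->
            out.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    static Map<String, Integer> run(String text) {
        return reducePhase(shufflePhase(mapPhase(text)));
    }

    public static void main(String[] args) {
        // prints {Bear=1, Car=2, Deer=1, River=2}
        System.out.println(run("Deer Bear River Car Car River"));
    }
}
```

In real Hadoop only mapPhase and reducePhase correspond to user code; the grouping step is handled entirely by the execution framework.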
MapReduce Algorithm
Runtime
• Handles scheduling
• Assigns workers to map and
reduce tasks
• Handles “data distribution”
• Moves processes to data
• Handles synchronization
• Gathers, sorts, and shuffles
intermediate data
• Handles errors and faults
• Detects worker failures and
restarts
• Everything happens on top of a
distributed FS (HDFS)
MapReduce Algorithm

The Mapper
• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs

Shuffle and Sort
• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to the same machine

The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as input
• Outputs zero or more final key/value pairs
• Usually just one output per input key

The Combiner
• Called once for each unique key on the mapper's local output
• Gets a list of all values associated with a key as input
• Outputs zero or more key/value pairs
• Usually just one output per input key
• Example: local counting for Word Count:
  def combiner(key, values):
      output(key, sum(values))

Partition
• In MapReduce, intermediate output values are not usually reduced together
• All values with the same key are presented to a single Reducer together
• More specifically, a different subset of the intermediate key space is assigned to each Reducer
• These subsets are known as partitions
MapReduce Algorithm
• Programmers specify two functions:
• map (k1, v1) → [(k2, v2)]
• reduce (k2, [v2]) → [(k3, v3)]
• All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite…usually, programmers also specify:
• partition (k2, number of partitions) → partition for
k2
• Often a simple hash of the key, e.g., hash(k2) mod n
• Divides up key space for parallel reduce operations
• combine (k2, [v2]) → [(k2, v2’)]
• Mini-reducers that run in memory after the map
phase
• Used as an optimization to reduce network traffic
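The default partitioning rule hash(k2) mod n can be sketched in plain Java. This mirrors what Hadoop's HashPartitioner does; the class and method names here are illustrative, and the & Integer.MAX_VALUE masks the sign bit so negative hash codes still map to a valid partition.

```java
public class PartitionSketch {

    // partition(k2, numPartitions) -> partition for k2
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int n = 4; // number of reducers
        for (String word : new String[] {"Deer", "Bear", "River", "Car"}) {
            // Every occurrence of the same word maps to the same partition,
            // so exactly one reducer sees all of its intermediate counts.
            System.out.println(word + " -> partition " + partitionFor(word, n));
        }
    }
}
```

Because the function is deterministic, the same key always lands in the same partition, which is what guarantees that one reducer receives all values for a key.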
Word Count Example
MapReduce Program
A MapReduce program consists of the following 3 parts:

• Driver (the main method, which triggers the map and reduce tasks)


• Mapper
• Reducer

It is better to include the map, reduce and main methods in 3 different classes.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}

Input:
The key is the byte offset of each line in the text file: LongWritable
The value is each individual line: Text
Output:
The key is the tokenized words: Text
We have the hardcoded value in our case which is 1: IntWritable
Example – Dear 1, Bear 1, etc.
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Both the input and the output of the Reducer is a key-value pair.
Input:
The key is a unique word produced by the sorting and shuffling phase: Text
The value is a list of integers, one per occurrence of that key: IntWritable
Example – Bear, [1, 1], etc.
Output:
The key is all the unique words present in the input text file: Text
The value is the number of occurrences of each of the unique words: IntWritable
Example – Bear, 2; Car, 3, etc. 
We have aggregated the values in the list corresponding to each key to produce the final answer.
In the driver class, we set the configuration of our MapReduce job to run in Hadoop:

Configuration conf = new Configuration();
Job job = new Job(conf, "My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);

//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

In the driver we:
• specify the name of the job,
• specify the data types of the input/output of the mapper and reducer,
• specify the names of the mapper and reducer classes,
• specify the paths of the input and output folders.

The method setInputFormatClass() specifies how a Mapper will read the input data, i.e., what the unit of work will be. Here, we have chosen TextInputFormat so that a single line is read by the mapper at a time from the input text file.

The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job.

To run the job:

hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
HDFS Architecture
Distributed File System
• Don’t move data to workers… move workers to
the data!
• Store data on the local disks of nodes in the
cluster
• Start up the workers on the node that has
the data local
• Why DFS?
• Not enough RAM to hold all the data in
memory
• Disk access is slow, but disk throughput is reasonable
DFS - Features
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 128 MB block size
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
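The block and replication figures above lend themselves to quick storage arithmetic. The sketch below uses the 128 MB block size stated above and a replication factor of 3 (the HDFS default); the class and method names are illustrative.

```java
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, per the slide

    // How many blocks a file of the given size occupies (ceiling division)
    static long numBlocks(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw cluster storage consumed once every block is replicated
    static long rawStorage(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        // A 1 GB file splits into 8 blocks of 128 MB each
        System.out.println(numBlocks(oneGb));             // prints 8
        // With 3 replicas it consumes 3 GB of raw storage
        System.out.println(rawStorage(oneGb, 3) / oneGb); // prints 3
    }
}
```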
NameNode – Metadata
• Meta-data in Memory
  • The entire metadata is in main memory
  • No demand paging of meta-data
• Types of Metadata
  • List of files
  • List of Blocks for each file
  • List of DataNodes for each block
  • File attributes, e.g., creation time, replication factor
• A Transaction Log
  • Records file creations, file deletions, etc.

NameNode – Responsibilities
• Managing the file system namespace:
  • Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
• Coordinating file operations:
  • Directs clients to datanodes for reads and writes
  • No data is moved through the namenode
• Maintaining overall health:
  • Periodic communication with the datanodes
  • Block re-replication and rebalancing
  • Garbage collection
Datanode
• A Block Server
  • Stores data in the local file system
  • Stores meta-data of a block
  • Serves data and meta-data to Clients
• Block Report
  • Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
  • Forwards data to other specified DataNodes

Block Placement
• Current Strategy
  • One replica on local node
  • Second replica on a remote rack
  • Third replica on same remote rack
  • Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
Data Correctness
• Use Checksums to validate data
• Use CRC32
• File Creation
• Client computes checksum per 512 byte
• DataNode stores the checksum
• File access
• Client retrieves the data and checksum from DataNode
• If Validation fails, Client tries other replicas
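The per-chunk checksum scheme above can be sketched with the JDK's built-in java.util.zip.CRC32. The 512-byte chunk size and CRC32 algorithm match the slide; the ChecksumSketch class itself is an illustrative simulation, not HDFS code.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksumSketch {
    static final int CHUNK = 512; // checksum granularity, per the slide

    // On write: compute one CRC32 per 512-byte chunk (the DataNode stores these)
    static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int from = i * CHUNK;
            int to = Math.min(from + CHUNK, data.length);
            crc.update(data, from, to - from);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read: recompute and compare; a mismatch means the client
    // should try another replica
    static boolean verify(byte[] data, long[] stored) {
        return Arrays.equals(checksums(data), stored);
    }

    public static void main(String[] args) {
        byte[] block = new byte[1500]; // spans 3 chunks
        Arrays.fill(block, (byte) 'x');
        long[] stored = checksums(block);
        System.out.println(verify(block, stored));  // prints true
        block[600] ^= 1;                            // simulate corruption
        System.out.println(verify(block, stored));  // prints false
    }
}
```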

NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
• A directory on the local file system
• A directory on a remote file system (NFS/CIFS)
Writing a file:
1. Client consults Name Node
2. Name Node replies with the location of the Data Node
3. Client writes block directly to one Data Node
   • Data Nodes replicate the block
   • Cycle repeats for next block
4. Data node replies with acknowledgement
5. Client sends to the Name Node a request to close the file

Reading a file:
1. An application client wishing to read a file must first contact the Name Node to determine where the actual data is stored.
2. In response to the client request, the Name Node returns:
   • The relevant block ids
   • The locations where the blocks are held
3. The client then contacts the Data Nodes to retrieve the data.

Important features of the design:
• Data is never moved through the Name Node
• All data transfer occurs directly between clients and the Data Nodes
• Communications with the Name Node only involve transfer of metadata
Introduction to YARN
MapReduce Vs YARN
MapReduce v1 Limitations

• Limited scalability
• Maximum cluster size: 4,000 nodes
• Maximum concurrent tasks: 40,000
• Availability – Job Tracker is SPOF (Single Point of Failure)
• Problem with Resource Utilization
• Predefined number of map slots and reduce slots for each TaskTracker
• Underutilization when the workload is skewed toward either map tasks or reduce tasks
• Runs only MapReduce applications
Advantages of YARN

• YARN utilizes cluster resources efficiently
• Centralized resource management
• Multiple applications in Hadoop, all sharing a common resource
• No more fixed map-reduce slots
• Supports applications that do not follow the MapReduce model
  • Apache Spark, Apache Giraph, Tez
• Most JobTracker functions moved to the Application Master
  • One cluster can have many Application Masters
Components of YARN
Resource Manager (RM)
• Runs on Master Node
• Global resource scheduler
• Arbitrates system resources between competing applications

Node Manager (NM)
• Runs on slave nodes
• Communicates with RM

Container
• Created by the RM upon request
• Allocates a certain amount of resources (memory, CPU) on a slave node
• Applications run in one or more containers

Application Master
• One per application
• Framework/application specific
• Runs in a container
• Requests more containers to run application tasks
Fault Tolerance
• Task (Container) – Handled just like MRv1
• MR AppMaster will re-attempt tasks that complete with exceptions or stop responding (4 times by default)
• Applications with too many failed tasks are considered failed

• Application Master
• If application fails or if AM stops sending heartbeats, RM will re-attempt the whole application (2 times by
default)
• MR AppMaster optional setting: Job Recovery
• If false, all tasks will re-run
• If true, MR AppMaster retrieves the state of tasks when it restarts; only incomplete tasks will be re-run

• NodeManager
• If NM stops sending heartbeats to RM, it is removed from list of active nodes
• Tasks on the node will be treated as failed by MR AppMaster
• If the AppMaster node fails, it will be treated as a failed application

• Resource Manager

• No application or tasks can be launched if RM is unavailable


• Can be configured with High Availability
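The task re-attempt policy described above can be sketched as a simple retry loop. This is a simulation: Task and runWithRetries are illustrative stand-ins, not YARN or MapReduce APIs, and 4 is the default maximum task attempts mentioned above.

```java
public class RetrySketch {

    interface Task { void run() throws Exception; }

    // Re-attempt a task up to maxAttempts times; true if it ever succeeds
    static boolean runWithRetries(Task task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.run();
                return true;  // task completed successfully
            } catch (Exception e) {
                // the AppMaster would log the failure and schedule a new attempt
            }
        }
        return false;         // too many failures: the task is considered failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky task: fails twice, then succeeds on the third attempt
        Task flaky = () -> { if (++calls[0] < 3) throw new Exception("fail"); };
        System.out.println(runWithRetries(flaky, 4)); // prints true
    }
}
```

An application whose tasks exhaust their attempts too often is, as stated above, considered failed as a whole.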
MapReduce Program
 
• Every mapper class must extend the MapReduceBase class and implement the Mapper interface.
• The main part of Mapper class is
a 'map()' method which accepts four
arguments.
• At every call to 'map()' method, a key-value pair
('key' and 'value' in this code) is passed.
• 'map()' method begins by splitting input text
which is received as an argument. It uses the
tokenizer to split these lines into words.
• After this, a pair is formed using the record at the 7th index of the array 'SingleCountryData' and a value '1'.
• We select the 7th index because the Country data is located at the 7th index in the array 'SingleCountryData'.
• Please note that the input data is in the below format (where Country is at the 7th index, with 0 as the starting index):
• Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
• The output of the mapper is again a key-value pair, which is emitted using the 'collect()' method of 'OutputCollector'.
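The mapper logic above can be sketched in plain Java: split the CSV record on commas and emit (country, 1), with Country at index 7. No Hadoop classes are used, and the sample record below is a made-up row that merely follows the field layout stated above.

```java
public class CountryMapperSketch {

    // Returns the (key, value) pair the mapper would emit for one record
    static String[] mapRecord(String line) {
        String[] singleCountryData = line.split(",");
        // Country is the 8th field, i.e., index 7 with 0 as the starting index
        String country = singleCountryData[7];
        return new String[] { country, "1" };
    }

    public static void main(String[] args) {
        // Hypothetical sample record following the documented column order
        String record = "01/02/2009,Product1,1200,Mastercard,Jane,Dubai,Dubai,"
                      + "United Arab Emirates,1/2/09,1/3/09,25.20,55.27";
        String[] kv = mapRecord(record);
        System.out.println(kv[0] + " -> " + kv[1]); // United Arab Emirates -> 1
    }
}
```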
MapReduce Program
• An input to the reduce() method is a key with a list
of multiple values.
• For example, in our case, it will be-
<United Arab Emirates, 1>, <United Arab
Emirates, 1>, <United Arab Emirates, 1>,<United
Arab Emirates, 1>, <United Arab Emirates, 1>,
<United Arab Emirates, 1>.
• This is given to reducer as <United Arab Emirates,
{1,1,1,1,1,1}>
• So, to accept arguments of this form, first two data
types are used,
viz., Text and Iterator<IntWritable>. Text is a data
type of key and Iterator<IntWritable> is a data type
for list of values for that key.
• The next argument is of
type OutputCollector<Text,IntWritable> which
collects the output of reducer phase.
• The reduce() method begins by copying the key value and initializing the frequency count to 0.
• Then, a 'while' loop is used to iterate through the list of values associated with the key and calculate the final frequency by summing up all the values.
• Finally, the result is pushed to the output collector in the form of the key and the obtained frequency count.