1. MapReduce: MapReduce is a software framework for
easily writing applications that process vast amounts of
data (multi-terabyte datasets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
MapReduce is both a programming model for expressing
distributed computations on massive amounts of data and
an execution framework for large-scale data processing on
clusters of commodity servers.
In short, MapReduce is a programming model for data processing.
Its main characteristics are batch processing, no limit on
the number of passes over the data or on processing time, and
no memory constraints.
Fig: MapReduce Logical Data Flow
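To make the logical data flow concrete, here is a small walk-through (the readings are invented for illustration, in the style of the maximum-temperature example used later in these notes):

map output:    (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
shuffle/sort:  (1949, [111, 78]), (1950, [0, 22, -11])
reduce output: (1949, 111), (1950, 22)

The shuffle groups all values for a key together, so each reduce call sees one key with the full list of its values.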
History of MapReduce: Developed by researchers at Google
around 2003, building on principles of parallel and
distributed processing.
MapReduce provides a clear separation between what to compute
and how to compute it on a cluster.
5. Hadoop Pipes.
1) Analyzing the Data with UNIX Tools: Before bringing Hadoop into
the picture, data can be analyzed with standard UNIX tools such as awk.
The wider data-analysis ecosystem includes Hadoop, Cloudera, Datameer,
Splunk, Mahout, Hive, HBase, LucidWorks, R, MapR, and Linux flavors
such as Ubuntu.
Ex: A program for finding the maximum recorded temperature by year from
NASA weather records.
Program:
#!/usr/bin/env bash
# For each compressed year file: print the year, then scan the records
# with awk for the maximum valid temperature.
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then processing
each file using awk. The awk script extracts two fields from the data: the air temperature and
the quality code. The END block is executed after all the lines in the file have been
processed, and it prints the maximum value.
2) Analyzing the Data with Hadoop: Analyzing the data with
Hadoop rests mainly on MapReduce and HDFS.
To take advantage of the parallel processing that Hadoop provides,
we need to express our query as a MapReduce job.
MapReduce works by breaking the processing into two phases, the
map phase and the reduce phase; each phase has key-value pairs
as input and output. The map() method is passed a key and a value,
and is also given an instance of Context to write the output to.
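As a minimal sketch of what the map phase might look like for the maximum-temperature example (assuming the new org.apache.hadoop.mapreduce API; the class name and field offsets follow the record layout used in the UNIX script above, and are illustrative rather than definitive):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: turns each input line into a (year, temperature) pair.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999; // sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);  // assumed year field of the record
    // Same offsets as the awk script: temperature at columns 88-92, quality code at 93.
    int airTemperature = Integer.parseInt(line.substring(87, 92).trim());
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

A matching reducer would then receive each year together with the list of its temperatures and emit the maximum.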
It consists of
1. Features of MapReduce
2. Counters
3. Sorting
4. Joins
5. Side Data Distribution
6. MapReduce Library Classes
1. Features of MapReduce: MapReduce is a software framework
for easily writing applications that process vast amounts of data
in parallel on large clusters of commodity hardware in a reliable,
fault-tolerant manner.
Features of MapReduce include counters, sorting, and joining
datasets.
It consists of
Scale-out Architecture: Add servers to increase processing power
Security & Authentication: Works with HDFS and HBase security to
make sure that only approved users can operate against the data in the
system
Resource Manager: Employs data locality and server resources to
determine optimal computing operations
Optimized Scheduling: Completes jobs according to prioritization
Flexibility: Procedures can be written in virtually any programming
language
Resiliency & High Availability: Multiple job and task trackers ensure
that jobs fail independently and restart automatically.
Fig: MapReduce Logical Data Flow
2. Counters: The MapReduce framework provides Counters as an
efficient mechanism for tracking the occurrences of global events
within the map and reduce phases of jobs.
Counters are a useful channel for gathering statistics about a job,
whether for quality control or for application-level statistics.
They are also useful for problem diagnosis.
Hadoop maintains built-in counters for every job, which report
various metrics for your job; for example, there are counters
for the number of input files and records processed.
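As a hedged sketch of how such built-in metrics can be read once a job has completed (the helper method name is illustrative; TaskCounter.MAP_INPUT_RECORDS is one of Hadoop's built-in counters):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
  // Illustrative helper: prints one built-in metric for a completed job.
  static void printMapInputRecords(Job job) throws IOException {
    long records = job.getCounters()
        .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    System.out.println("Map input records processed: " + records);
  }
}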
Ex: A typical MapReduce job will kick off several mapper instances,
one for each block of the input data, all running the same code.
These instances are part of the same job, but run independently of
one another.
User-Defined Java Counters: MapReduce allows user-defined Java counters,
created with the Java “enum” keyword.
A job may define an arbitrary number of enums, each with an arbitrary number of
fields.
The name of the enum is the group name, and the enum’s fields are the counter
names.
Ex: TOTAL_LAUNCHED_MAPS counts the number of map tasks that were
launched over the course of a job.
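A minimal sketch of a user-defined counter in a mapper (the enum Temperature and its fields are illustrative names; the record layout follows the earlier example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapperWithCounters
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Group name "Temperature" with counter names MISSING and MALFORMED.
  enum Temperature { MISSING, MALFORMED }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String temp = line.substring(87, 92);
    if (temp.equals("+9999")) {               // assumed missing-value sentinel
      context.getCounter(Temperature.MISSING).increment(1);
      return;
    }
    try {
      context.write(new Text(line.substring(15, 19)),
          new IntWritable(Integer.parseInt(temp.trim())));
    } catch (NumberFormatException e) {
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}

After the job finishes, these counters appear in its summary grouped under the enum's name.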
Joins: Reduce-side joins are simpler than map-side joins, since the
– input datasets don't have to be structured in any particular way
– they are less efficient, however, as both datasets have to go through the
MapReduce shuffle
Multiple inputs: The input sources for the datasets often have different
formats.
Use the MultipleInputs class to separate the logic for parsing
and tagging each source, as in the sketch below.
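A sketch of what that wiring might look like in the job driver (the input paths and the two mapper classes, NcdcMapper and MetOfficeMapper, are hypothetical placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class JoinJobSetup {
  // Each source gets its own input format and its own mapper, keeping
  // per-source parsing and tagging logic separate.
  static void configureInputs(Job job) {
    MultipleInputs.addInputPath(job, new Path("ncdc/input"),
        TextInputFormat.class, NcdcMapper.class);       // hypothetical mapper
    MultipleInputs.addInputPath(job, new Path("metoffice/input"),
        TextInputFormat.class, MetOfficeMapper.class);  // hypothetical mapper
  }
}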
IntSumReducer, LongSumReducer: reducers that sum integer (or long) values
to produce a total for every key.
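For instance, a driver for a word-count style job could plug the library reducer in directly (a hedged sketch; it assumes a mapper that emits (word, 1) pairs):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountSetup {
  // IntSumReducer sums all IntWritable values per key, giving per-word totals.
  static void configureReduce(Job job) {
    job.setReducerClass(IntSumReducer.class);
    job.setCombinerClass(IntSumReducer.class); // summing is associative, so it doubles as a combiner
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}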