
MapReduce

A programming model for processing large datasets
Prof. Tahar Kechadi
School of Computer Science


What is MapReduce?

A data-parallel programming model designed for high scalability and resiliency
- Pioneered by Google, which also designed Percolator for incrementally processing updates
- Implementation in C++
- Popularised by the open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon, …

MapReduce Applications

At Google
- Index construction for Google Search, article clustering for Google News, statistical machine translation
At Yahoo!
- "Web map" powering Yahoo! Search, spam detection for Yahoo! Mail
At Facebook
- Ad optimisation, spam detection
In research
- Astronomical image analysis, bioinformatics, analysing Wikipedia conflicts, natural language processing, particle physics, ocean climate simulation
Among others…


Parallel Programming

The process of developing programs that express which computations should be executed in parallel.
Parallel computing
- The use of two or more processors (computers), usually within a single system, working simultaneously to solve a single problem
Distributed computing
- Any computing that involves multiple computers, remote from each other, that each have a role in a computation problem or information processing

Parallel Programming Approach

Partitioning
- Computation and data are decomposed into small tasks
Communication
- Coordinate task execution
Agglomeration
- Tasks are combined into larger tasks
Mapping
- Each task is assigned to a processor


Parallel vs. Distributed Computing

[Figure: the work is split into tasks that are mapped to processors and run concurrently, and the partial results are joined into the final result]

Parallel Programming Paradigms

Data parallelism
- All tasks apply the same set of operations to different data
- Example:
    for i ← 0 to 99 do
        a[i] ← b[i] + c[i]
    endfor
- Operations may be performed either synchronously or asynchronously
- Fine-grained parallelism
- Grain size is the average number of computations performed between communication or synchronisation steps


Parallel Programming Paradigms

Task parallelism
- Independent tasks apply different operations to different data elements
- Example:
    a ← 2
    b ← 3
    m ← (a + b) / 2
    s ← (a² + b²) / 2
    v ← s − m²
- Concurrent execution of tasks, not statements
- Problem is divided into different tasks
- Tasks are divided between the processors
- Coarse-grained parallelism
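
To make the data-parallel loop above concrete, here is a minimal Python sketch (not from the slides): the element-wise addition a[i] ← b[i] + c[i] is partitioned into chunks that worker processes compute independently; the chunk size and the use of multiprocessing are illustrative choices.

    from multiprocessing import Pool

    def add_chunk(args):
        # Each worker applies the same operation to its own slice of the data
        b_chunk, c_chunk = args
        return [x + y for x, y in zip(b_chunk, c_chunk)]

    if __name__ == "__main__":
        n = 100
        b = list(range(n))
        c = list(range(n, 2 * n))

        # Partition the data into 4 chunks, one per worker (data parallelism)
        chunks = [(b[i:i + 25], c[i:i + 25]) for i in range(0, n, 25)]
        with Pool(4) as pool:
            parts = pool.map(add_chunk, chunks)

        # Agglomerate the partial results into the final array
        a = [x for part in parts for x in part]
        print(a[:5])   # [100, 102, 104, 106, 108]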

Parallel Programming Examples

Adding a sequence of n numbers
- Sum = A[0] + A[1] + A[2] + … + A[n-1]
- Code:
    sum ← 0
    for i = 0 to n-1 do
        sum ← sum + A[i]
    end for
- Sequential or parallel?


Parallel Programming Example I

Adding a sequence of n numbers
- Sum = A[0] + A[1] + A[2] + … + A[n-1]
- Code:
    sum ← 0
    for i = 0 to n-1 do
        sum ← sum + A[i]
    end for
- Example: 6 4 16 10 16 14 2 8
    sum ← 0 + 6
    sum ← 6 + 4
    sum ← 10 + 16
    …

Parallel Programming Example I

Adding a sequence of n numbers
- Sum = A[0] + A[1] + A[2] + … + A[n-1]
- In parallel:
    Add pairs of values, producing 1st-level results
    Add pairs of 1st-level results, producing 2nd-level results
    Sum pairs of 2nd-level results
    …
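
As an illustration of this pairwise scheme, here is a minimal Python sketch (not from the slides) that sums the example sequence level by level; each level's additions are independent, so a real parallel version would evaluate them concurrently.

    def tree_sum(values):
        # Repeatedly add adjacent pairs until a single value remains.
        level = list(values)
        while len(level) > 1:
            if len(level) % 2:            # carry an odd element to the next level
                level.append(0)
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        return level[0]

    print(tree_sum([6, 4, 16, 10, 16, 14, 2, 8]))   # 76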
Parallel Programming Example II

Computing the prefix sums
- A[i] becomes the sum of the first i+1 elements
- Code:
    for i = 1 to n-1 do
        A[i] ← A[i] + A[i-1]
    end for
- So:
    A[0] is unchanged
    A[1] = A[1] + A[0]
    A[2] = A[2] + (A[1] + A[0])
    …
    A[n-1] = A[n-1] + (A[n-2] + (… (A[1] + A[0]) …))
- Example:
    Input:  6 4 16 10 16 14 2 8
    Output: 6 10 26 36 52 66 68 76
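
A minimal Python version of the prefix-sum loop above, run on the example input (the in-place update mirrors the pseudocode; a parallel scan would instead combine partial sums level by level):

    def prefix_sums(A):
        # In-place prefix sums: after the loop, A[i] holds A[0] + ... + A[i]
        for i in range(1, len(A)):
            A[i] = A[i] + A[i - 1]
        return A

    print(prefix_sums([6, 4, 16, 10, 16, 14, 2, 8]))
    # [6, 10, 26, 36, 52, 66, 68, 76]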

Parallel Programming Example II

Computing the prefix sums
- A[i] becomes the sum of the first i+1 elements
- Example:
    Input:  6 4 16 10 16 14 2 8
    Output: 6 10 26 36 52 66 68 76


Traditional HPC systems

- CPU-intensive computations
- Relatively small amount of data
- Tightly-coupled applications
- Highly concurrent I/O requirements
- Complex message-passing paradigms such as MPI, PVM, …
- Developers might need to spend some time designing for failure

Challenges

Data and storage
- Locality: computation close to the data
In large-scale systems, nodes fail
- MTBF (mean time between failures) for 1 node = 3 years
- MTBF for 1,000 nodes = 1 day (3 years ≈ 1,095 days, so with 1,000 independent nodes a failure somewhere in the cluster is expected roughly daily)
- Solution: built-in fault tolerance
Commodity network = low bandwidth
Distributed programming is hard
- Solution: a simple data-parallel programming model; users structure the application in "map" & "reduce" functions, and the system distributes the data/work and handles faults
- Not all applications can be parallelised: tightly-coupled computations


What requirements?

A simple data-parallel programming model, designed for high scalability and resiliency
- Scalability to large-scale data volumes
- Automated fault tolerance at the application level rather than relying on high-availability hardware
- Simplified I/O and task monitoring
- All based on cost-efficient commodity machines (cheap, but unreliable) and a commodity network

Core concepts

Data is distributed in advance, persistent (in terms of locality), and replicated
No inter-dependencies / shared-nothing architecture
Applications are written as two pieces of code
- Developers do not have to worry about the underlying issues of networking, job interdependencies, scheduling, etc.


The model

A map function processes a key/value pair to generate a set of intermediate key/value pairs
- It divides the problem into smaller 'intermediate key/value' pairs
The reduce function merges all intermediate values associated with the same intermediate key
The run-time system takes care of:
- Partitioning the input data across nodes (blocks/chunks, typically 64 MB to 128 MB)
- Scheduling the data and execution
- Node failures, replication, re-submissions
- Coordination among nodes

Map function

A map function processes a key/value pair to generate a set of intermediate key/value pairs
- It divides the problem into smaller 'intermediate key/value' pairs
Map: (key1, val1) → (key2, val2)
Ex:
- (line-id, text) → (word, 1)
- (2, "the apple is an apple") → (the, 1), (apple, 1), (is, 1), (an, 1), (apple, 1)


Reduce function

The reduce function merges all intermediate values associated with the same intermediate key
Reduce: (key2, [val2]) → [results]
Ex:
- (word, [val1, val2, …]) → (word, Σ vali)
- (the, 1), (apple, 1), (is, 1), (an, 1), (apple, 1) → (an, 1), (apple, 2), (is, 1), (the, 1)
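
A minimal Python sketch (not from the slides) of the two functions for the example above; the function names and the way pairs are returned as lists are illustrative assumptions.

    def map_fn(line_id, text):
        # (line-id, text) -> list of (word, 1) intermediate pairs
        return [(word, 1) for word in text.split()]

    def reduce_fn(word, counts):
        # (word, [1, 1, ...]) -> (word, sum of the counts)
        return (word, sum(counts))

    print(map_fn(2, "the apple is an apple"))
    # [('the', 1), ('apple', 1), ('is', 1), ('an', 1), ('apple', 1)]
    print(reduce_fn("apple", [1, 1]))
    # ('apple', 2)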

Map/Reduce

Map
- Iterate over a large number of records
- Extract something of interest from each
Reduce
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output


MapReduce execution

[Figure: a MapReduce job. The Master assigns input splits (Split 0–4) to map Workers; each map Worker reads its split and writes intermediate files to its local disk (local write). Reduce Workers remotely read the intermediate files and write the final output files. MAPPER phase on the left, REDUCER phase on the right.]

Simple example: Word count

[Figure: eight input KV-pairs, (1, "the apple"), (2, "is an apple"), (3, "not an orange"), (4, "because the"), (5, "orange"), (6, "unlike the apple"), (7, "is orange"), (8, "not green"), are spread over four Mappers by key range (1-2, 3-4, 5-6, 7-8). Each mapper emits (word, 1) pairs, which are shuffled to the Reducer responsible for that word's key range ((A-G), (H-N), (O-U), (V-Z)), grouped, e.g. (apple, {1, 1, 1}), and summed. Final output: (apple, 3), (an, 2), (because, 1), (green, 1), (is, 2), (not, 2), (orange, 3), (the, 3), (unlike, 1).]

1. Each mapper receives some of the KV-pairs as input
2. The mappers process the KV-pairs one by one
3. Each KV-pair output by the mapper is sent to the reducer that is responsible for it
4. The reducers sort their input by key and group it
5. The reducers process their input one group at a time


Simple Word Count

    # key: offset, value: line
    def mapper():
        for line in open("doc"):
            for word in line.split():
                output(word, 1)

    # key: a word, value: iterator over counts
    def reducer():
        output(key, sum(value))
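
The slide code is framework-style pseudocode (the implicit key/value arguments, the output() call). The following self-contained Python sketch (not from the slides) simulates the same pipeline locally: map each line, group the intermediate pairs by word, then reduce each group; the in-memory grouping stands in for the distributed shuffle and sort.

    from collections import defaultdict

    def mapper(offset, line):
        # key: offset, value: line -> emit (word, 1) for every word
        return [(word, 1) for word in line.split()]

    def reducer(word, counts):
        # key: a word, value: list of counts -> emit (word, total)
        return (word, sum(counts))

    lines = ["the apple", "is an apple", "not an orange", "because the",
             "orange", "unlike the apple", "is orange", "not green"]

    # Map phase
    intermediate = []
    for offset, line in enumerate(lines, start=1):
        intermediate.extend(mapper(offset, line))

    # Shuffle/sort phase: group values by key
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # Reduce phase
    result = dict(reducer(word, counts) for word, counts in sorted(groups.items()))
    print(result)   # {'an': 2, 'apple': 3, 'because': 1, 'green': 1, ...}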

K-Means

Given a set of objects, group these objects into k "clusters"
The k-means algorithm is implemented in 4 steps:
- Step 1: Given k, partition the objects into k non-empty subsets
- Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. mean point, of the cluster)
- Step 3: Assign each object to the cluster with the nearest seed point
- Step 4: Go back to Step 2 until there are no more new assignments


K-Means in MapReduce

Mapper:
- Input: <nodeID, [centroidIDs]>
- Emit: <centroidID, nodeID>
Reducer:
- Input: <centroidID, nodeID>
- Reduce → <centroidID, [nodeIDs]>; recompute the centroid position from the positions of the nodes in it
- Emit: <centroidID, [nodeIDs]>
- <centroidID, [nodeIDs]> → <nodeID, [centroidIDs]> (input for the next iteration)
Repeat until no change
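
A minimal Python sketch (not from the slides) of one k-means iteration expressed as a map and a reduce. Note the simplification: the slide's map/reduce works over node and centroid IDs, while this sketch carries 1-D point coordinates directly so the centroid can be recomputed locally; the driver loop and the data are illustrative assumptions.

    from collections import defaultdict

    def kmeans_mapper(point, centroids):
        # Assign the point to its nearest centroid: emit (centroid_id, point)
        nearest = min(centroids, key=lambda cid: abs(point - centroids[cid]))
        return (nearest, point)

    def kmeans_reducer(centroid_id, points):
        # Recompute the centroid as the mean of the points assigned to it
        return (centroid_id, sum(points) / len(points))

    points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
    centroids = {"c0": 0.0, "c1": 5.0}

    for _ in range(10):                      # driver: repeat until no change (capped here)
        groups = defaultdict(list)
        for p in points:                     # map phase
            cid, val = kmeans_mapper(p, centroids)
            groups[cid].append(val)
        new = dict(kmeans_reducer(cid, pts) for cid, pts in groups.items())
        if new == centroids:
            break
        centroids = new

    print(centroids)                         # {'c0': 1.5, 'c1': 10.0}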
Optimisation: The Combiner

A combiner is a local aggregation function for repeated keys produced by the map
Works for associative functions like sum, count, max
Decreases the size of the intermediate data / communications
Map-side aggregation for Word Count:

    def combiner(key, values):
        output(key, sum(values))
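
To see why this helps, here is a minimal Python sketch (not from the slides) of applying such a combiner to one mapper's local output before the shuffle; the sample pairs are an illustrative assumption.

    from collections import defaultdict

    def combiner(key, values):
        # Local, map-side aggregation: safe because sum is associative
        return (key, sum(values))

    # One mapper's raw output (hypothetical example)
    mapper_output = [("the", 1), ("apple", 1), ("the", 1), ("apple", 1), ("unlike", 1)]

    # Group the mapper's own output by key, then combine each group locally
    local_groups = defaultdict(list)
    for k, v in mapper_output:
        local_groups[k].append(v)
    combined = [combiner(k, vs) for k, vs in local_groups.items()]

    print(combined)   # [('the', 2), ('apple', 2), ('unlike', 1)] -- 3 pairs shuffled instead of 5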

Some examples…

Distributed Grep:
- Map function emits a line if it matches a supplied pattern
- Reduce function is an identity function that copies the supplied intermediate data to the output
Count of URL accesses:
- Map function processes logs of web page requests and outputs <URL, 1>
- Reduce function adds together all values for the same URL, emitting <URL, total count> pairs
Reverse Web-Link graph:
- Map function outputs <tgt, src> for each link to a tgt in a page named src
- Reduce concatenates the list of all src URLs associated with a given tgt URL and emits the pair <tgt, list(src)>
Inverted Index:
- Map function parses each document, emitting a sequence of <word, doc_ID>
- Reduce accepts all pairs for a given word, sorts the corresponding doc_IDs and emits a <word, list(doc_ID)> pair
- The set of all output pairs forms a simple inverted index
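
A minimal Python sketch (not from the slides) of the inverted-index pair of functions, run locally on two toy documents; the document IDs and the in-memory grouping are illustrative assumptions.

    from collections import defaultdict

    def index_mapper(doc_id, text):
        # Parse one document: emit (word, doc_ID) for every word occurrence
        return [(word, doc_id) for word in text.split()]

    def index_reducer(word, doc_ids):
        # Sort and de-duplicate the document IDs for this word
        return (word, sorted(set(doc_ids)))

    docs = {"d1": "the apple is an apple", "d2": "the orange is not green"}

    groups = defaultdict(list)
    for doc_id, text in docs.items():          # map phase
        for word, d in index_mapper(doc_id, text):
            groups[word].append(d)

    index = dict(index_reducer(w, ds) for w, ds in groups.items())   # reduce phase
    print(index["the"])     # ['d1', 'd2']
    print(index["apple"])   # ['d1']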

Exercise: host size

Suppose we have a large web corpus
The metadata file has lines of the form (URL, size, date, …)
For each host, find the total number of bytes, i.e. the sum of the page sizes for all URLs from that host
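
One possible solution sketch in Python (not from the slides): the mapper extracts the host from each URL and emits (host, size); the reducer sums the sizes per host. The urlparse-based host extraction and the sample records are illustrative assumptions.

    from collections import defaultdict
    from urllib.parse import urlparse

    def host_mapper(url, size):
        # Emit (host, page size) for every metadata record
        return (urlparse(url).netloc, size)

    def host_reducer(host, sizes):
        # Total number of bytes for all URLs from this host
        return (host, sum(sizes))

    records = [("http://example.com/a.html", 1200),
               ("http://example.com/b.html", 800),
               ("http://other.org/index.html", 500)]

    groups = defaultdict(list)
    for url, size in records:                 # map phase
        host, s = host_mapper(url, size)
        groups[host].append(s)

    totals = dict(host_reducer(h, s) for h, s in groups.items())   # reduce phase
    print(totals)   # {'example.com': 2000, 'other.org': 500}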
Execution

A distributed file system providing a global namespace (GFS, HDFS, …)
Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of splits / chunks
Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a hash function
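
A common form of this partitioning is hash(key) mod R. A minimal Python sketch (not from the slides) of routing intermediate keys to one of R reducers; Python's built-in hash is only a stand-in for the framework's partitioning hash.

    R = 4   # number of reduce tasks

    def partition(key, R):
        # Route an intermediate key to one of the R reducers
        return hash(key) % R

    for key in ["apple", "orange", "the"]:
        print(key, "-> reducer", partition(key, R))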

Execution

A single master controls job execution on multiple slaves
Mappers are preferentially placed on the same node or the same rack as their input block
- Locality, to minimise network usage
Mappers save outputs to local disk before serving them to reducers
- Allows recovery if a reducer crashes
Redundant execution & storage (usually 3+ copies of the data)


Distributed file systems

A distributed implementation of the classical time-sharing model of a file system, where multiple users / locations share files and storage resources
Key aspects: dispersion & multiplicity
Primary issues: naming, transparency, & fault tolerance
Examples: NFS, AFS, GFS, HDFS
