Warm-up task:
We have a huge text document
Sample applications:
▪ Analyze web server logs to find popular URLs
▪ Find popular hashtags
▪ Term statistics for search
Case 1:
▪ File too large for memory, but all <word, count> pairs fit in memory
Case 2:
Count occurrences of words:
▪ words(doc.txt) | sort | uniq -c
▪ where words takes a file and outputs the words in it, one per line
Case 2 captures the essence of MapReduce
▪ The great thing is that it is naturally parallelizable
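The shell pipeline above can be sketched in plain Python. The `words` helper is an assumption here (the slide does not define it); this version simply tokenizes on whitespace:

```python
from collections import Counter

def words(path):
    """Assumed helper: emit the words of a file, one at a time."""
    with open(path) as f:
        for line in f:
            yield from line.split()

def word_count(path):
    """Equivalent of: words(doc.txt) | sort | uniq -c"""
    return Counter(words(path))
```

Counter does the grouping and counting that `sort | uniq -c` performs in the shell version.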
words(doc.txt) | sort | uniq -c
Map:
▪ Scan the input file record-at-a-time
▪ Extract something you care about from each record (keys)
Group by key:
▪ Sort and shuffle
Reduce:
▪ Aggregate, summarize, filter, or transform
▪ Write the result
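The three phases above, for word counting, can be sketched directly (a minimal in-memory version; whitespace tokenization is an assumption):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit (word, 1) for each word in each record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def group_by_key(pairs):
    """Group by key: sort, then collect all values with the same key."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: aggregate (here, sum) the values for each key."""
    for key, values in grouped:
        yield key, sum(values)
```

In a real MapReduce run the grouping step is distributed across machines; the structure of the computation is the same.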
[Diagram: map tasks turn input key-value pairs (k, v) into intermediate key-value pairs]
[Diagram: intermediate key-value pairs are grouped by key into key-value groups, which reduce tasks turn into output key-value pairs]
Input: a set of key-value pairs
Programmer specifies two methods:
▪ Map(k, v) → <k’, v’>*
▪ Takes a key-value pair and outputs a set of key-value pairs
▪ E.g., the key is the filename and the value is the text of the document
▪ There is one Map call for every (k, v) pair
▪ Reduce(k’, <v’>*) → <k’, v’’>*
▪ All values v’ with the same key k’ are reduced together
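Word count as a concrete instance of these two signatures (a sketch; the framework code that groups intermediate pairs and drives the calls is assumed):

```python
def map_fn(key, value):
    """Map(k, v) -> <k', v'>*: key is a filename, value is the document text."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce(k', <v'>*) -> <k', v''>*: all counts for one word arrive together."""
    yield (key, sum(values))
```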
MAP (provided by the programmer):
▪ Read input and produce a set of key-value pairs
Group by key (done by the framework):
▪ Collect all pairs with the same key
Reduce (provided by the programmer):
▪ Collect all values belonging to the key and output the result
• Shuffle and Sort
➢ Shuffling: the process of transferring data from the mappers to the reducers
➢ Sorting: the process of sorting intermediate key-value pairs by their keys
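Sorting matters because it lets each reducer aggregate in a single streaming pass: once the key changes, that key is finished. A sketch of this pattern:

```python
def streaming_reduce(sorted_pairs):
    """Sum values per key in one pass; correct only because the input
    is sorted by key, so all pairs for a key are adjacent."""
    current_key, total = None, 0
    for key, value in sorted_pairs:
        if key != current_key:
            if current_key is not None:
                yield (current_key, total)   # key boundary: emit finished key
            current_key, total = key, 0
        total += value
    if current_key is not None:
        yield (current_key, total)           # emit the last key
```

This is exactly the structure a Hadoop Streaming reducer script follows when reading its sorted stdin.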
• Reduce operation #01 (done by the programmer)
• Example of key-value pairs from the mapper
[Code listing and sample pairs shown on the original slide]
• Reduce operation #02
• Example of key-value pairs from the mapper
[Code listing and sample pairs shown on the original slide]
• Running in Hadoop
Syntax:
hadoop jar <hadoop streaming jar file> -input <input folder in HDFS> -output <output folder in HDFS> -mapper <mapper.py> -reducer <reducer.py>
Example:
hadoop jar /home/bigdata/hadoop-2.8.5/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar -input /input -output /output1 -mapper /home/bigdata/project-folder/mapper.py -reducer /home/bigdata/project-folder/reducer.py
Result: [terminal output shown on the original slide]
Notes:
1. It will process all the files inside the HDFS input folder
2. Before running the command, the HDFS output folder must not already exist in HDFS
3. The output can be viewed with the hdfs “-cat” command, or copied to the local filesystem for viewing
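The mapper.py and reducer.py referenced above are not reproduced on this slide. A minimal word-count pair for Hadoop Streaming could look like the sketch below (the word-count task is an assumption; the logic is shown as functions for readability, while each real script would read sys.stdin line by line and print its output):

```python
def run_mapper(lines):
    """mapper.py body: emit "word<TAB>1" for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(lines):
    """reducer.py body: Streaming delivers input sorted by key, so all
    lines for one word are adjacent; sum them in a single pass."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

Between the two scripts, Hadoop performs the shuffle and sort, playing the role of the `sort` in the shell pipeline from earlier.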
Given the sample of purchase data below, with 6 columns separated by tabs. The columns are: date, time, store city, product item, price, payment method
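The mapper and reducer on the following slides appear as screenshots in the original. A sketch of one plausible job over this data is given below; the specific task (total sales per store city) and the function names are assumptions, not taken from the slides:

```python
def purchase_mapper(lines):
    """Emit "city<TAB>price" from each 6-column tab-separated record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 6:                 # skip malformed records
            date, time, city, item, price, payment = fields
            yield f"{city}\t{price}"

def purchase_reducer(lines):
    """Sum prices per city; input must be sorted by city."""
    current, total = None, 0.0
    for line in lines:
        city, price = line.split("\t")
        if city != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = city, 0.0
        total += float(price)
    if current is not None:
        yield f"{current}\t{total}"
```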
• Mapper [code listing shown on the original slide]
• Reducer, with example mapper output [shown on the original slide]
• Running the MapReduce program in Hadoop
Note:
When executing the script above, the folder “/purchases” in HDFS contains the file “purchases.txt” that is used in Week 09
Suppose we have a large web corpus
Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host
Map
▪ ?
Reduce
▪ ?
Suppose we have a large web corpus
Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host
Map
▪ For each record, output (hostname(URL), size)
Reduce
▪ Sum the sizes for each host
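A runnable sketch of this job (extracting the hostname with urllib.parse is an implementation choice, not fixed by the slide; the group-and-sum step stands in for the framework's shuffle):

```python
from itertools import groupby
from operator import itemgetter
from urllib.parse import urlparse

def map_record(record):
    """record is (URL, size, date, ...); emit (hostname(URL), size)."""
    url, size = record[0], record[1]
    yield (urlparse(url).hostname, size)

def total_bytes_per_host(records):
    """Group intermediate pairs by host and sum the sizes."""
    pairs = sorted(p for r in records for p in map_record(r))
    return {host: sum(size for _, size in group)
            for host, group in groupby(pairs, key=itemgetter(0))}
```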
Statistical machine translation:
▪ Need to count the number of times every 5-word sequence occurs in a large corpus of documents
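A sketch of the 5-gram counting job: Map emits each consecutive 5-word window with count 1, and Reduce sums the counts per window (Counter stands in for the group-and-sum step here):

```python
from collections import Counter

def map_five_grams(doc):
    """Emit (5-word window, 1) for every consecutive window in one document."""
    tokens = doc.split()
    for i in range(len(tokens) - 4):
        yield (tuple(tokens[i:i + 5]), 1)

def count_five_grams(docs):
    """Reduce: sum the counts of each 5-gram across all documents."""
    counts = Counter()
    for doc in docs:
        for gram, one in map_five_grams(doc):
            counts[gram] += one
    return counts
```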