
Modified from Mining of Massive Datasets (Stanford)

Warm-up task:
• We have a huge text document
• Count the number of times each distinct word appears in the file
• Sample applications:
▪ Analyze web server logs to find popular URLs
▪ Find popular hashtags
▪ Compute term statistics for search

Case 1:
▪ File too large for memory, but all <word, count>
pairs fit in memory
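
A minimal sketch of Case 1, assuming words are separated by whitespace: the file is streamed line by line so it never has to fit in memory, while the <word, count> pairs are kept in an in-memory hash table. The function name `count_words` and the file name in the comment are illustrative, not from the slides.

```python
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Count words from a stream of lines, holding only the counts in memory."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace.
        counts.update(line.split())
    return counts


# A file object is itself an iterable of lines, so a large file can be
# streamed without loading it whole (the file name is hypothetical):
#     with open("doc.txt", encoding="utf-8") as f:
#         counts = count_words(f)
sample = ["the crew of the space shuttle", "the space station"]
counts = count_words(sample)
print(counts.most_common(2))
```

This only works while the number of distinct words stays small enough for the table to fit in memory.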

Case 2:
• Count occurrences of words:
▪ words(doc.txt) | sort | uniq -c
▪ where words takes a file and outputs the words in it, one per line
• Case 2 captures the essence of MapReduce
▪ The great thing is that it is naturally parallelizable
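
The pipeline can be mimicked in Python as a rough sketch. Here `words` stands in for the `words(doc.txt)` step and `sort_uniq_count` for `sort | uniq -c`; both names are illustrative. One assumption to note: `sorted()` works in memory, whereas the Unix `sort` command can spill to disk for inputs larger than memory.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def words(lines: Iterable[str]) -> Iterator[str]:
    """Yield the words of the input one at a time."""
    for line in lines:
        yield from line.split()


def sort_uniq_count(items: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mimic `sort | uniq -c`: sort, then count each run of equal items."""
    for key, run in groupby(sorted(items)):
        yield key, sum(1 for _ in run)


doc = ["the crew of the space shuttle", "the space station"]
result = list(sort_uniq_count(words(doc)))
print(result)
```

Sorting brings equal words next to each other, so counting becomes a single pass over adjacent runs. The same idea reappears in the "Group by key" step of MapReduce.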

words(doc.txt) | sort | uniq -c
• Map:
▪ Scan the input file one record at a time
▪ Extract something you care about from each record (keys)
• Group by key:
▪ Sort and shuffle
• Reduce:
▪ Aggregate, summarize, filter, or transform
▪ Write the result
Figure: The Map step. Each input key-value pair is passed to map, which outputs intermediate key-value pairs.

Figure: The Reduce step. Intermediate key-value pairs are grouped by key into key-value groups; each group is passed to reduce, which outputs key-value pairs.

• Input: a set of key-value pairs
• Programmer specifies two methods:
▪ Map(k, v) → <k’, v’>*
    ▪ Takes a key-value pair and outputs a set of key-value pairs
    ▪ E.g., key is the filename, value is the text of the file
    ▪ There is one Map call for every (k, v) pair
▪ Reduce(k’, <v’>*) → <k’, v’’>*
    ▪ All values v’ with the same key k’ are reduced together
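
The two signatures can be sketched in a single Python process. This is only a stand-in for illustration, assuming string keys and integer values; `map_reduce`, `wc_map`, and `wc_reduce` are hypothetical names, and a real framework would distribute the Map and Reduce calls across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

KV = Tuple[str, int]


def map_reduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[KV]],
    reduce_fn: Callable[[str, List[int]], Iterable[KV]],
) -> List[KV]:
    """Single-process stand-in for the Map, Group by key, Reduce pipeline."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for k, v in inputs:                 # one Map call per (k, v) pair
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)       # group by intermediate key k'
    output: List[KV] = []
    for k2 in sorted(groups):           # all values of one k' reduced together
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def wc_map(filename: str, text: str) -> Iterator[KV]:
    """Map(k, v): key is the filename, value is the text of the file."""
    for word in text.split():
        yield word, 1


def wc_reduce(word: str, counts: List[int]) -> Iterator[KV]:
    """Reduce(k', <v'>*): sum all counts emitted for one word."""
    yield word, sum(counts)


docs = [("a.txt", "the crew of the shuttle"), ("b.txt", "the shuttle")]
result = map_reduce(docs, wc_map, wc_reduce)
print(result)
```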

• MAP (provided by the programmer): reads the input and produces a set of key-value pairs
• Group by key: collects all pairs with the same key
• Reduce (provided by the programmer): collects all values belonging to each key and outputs the result

Big document:
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need …

Map output (key, value):
(The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) …

After group by key (key, value):
(crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) …

Reduce output (key, value):
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
• Map operation (done by the programmer)

• Example running in native Python

Note:
Make sure the .py file has the “Allow executing file as program” permission enabled (equivalently, chmod +x on the command line).
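
The mapper code from the slide is not reproduced in this text. As a stand-in, here is a minimal sketch of what a word-count mapper for Hadoop Streaming could look like, assuming it emits one tab-separated `word<TAB>1` line per word. The function name `map_lines` is hypothetical; the sample input replaces standard input so the block runs on its own.

```python
#!/usr/bin/env python3
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"


# In a streaming script the input would come from standard input:
#     import sys
#     for out in map_lines(sys.stdin):
#         print(out)
mapped = list(map_lines(["the crew of the space shuttle\n"]))
print("\n".join(mapped))
```

The shebang on the first line is what lets the executable file be run directly as a program.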

• Shuffle and Sort
➢ Shuffling: the process of transferring data from the mappers to the reducers
➢ Sorting: the process of sorting intermediate key-value pairs by their keys

Note: Both processes are performed automatically by Hadoop (not by the programmer)

• Equivalent in native Python
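
The slide's own example is not reproduced in this text. On a single machine, piping the mapper output through the Unix `sort` command plays the same role; the sketch below shows the idea in Python, assuming `key<TAB>value` lines. The function name `shuffle_and_sort` is illustrative.

```python
from typing import Iterable, List


def shuffle_and_sort(mapped_lines: Iterable[str]) -> List[str]:
    """Order 'key<TAB>value' lines by key so equal keys become adjacent."""
    # sorted() is stable, so values keep their arrival order within a key.
    return sorted(mapped_lines, key=lambda line: line.split("\t", 1)[0])


mapped = ["the\t1", "crew\t1", "of\t1", "the\t1", "space\t1"]
grouped = shuffle_and_sort(mapped)
print(grouped)
```

There is no real shuffling here because everything runs in one process; only the sorting half of the step is simulated.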

• Reduce operation #01 (done by the programmer)

• Example of key-value pairs from the mapper

• Reduce operation #02

• Example of key-value pairs from the mapper

• Example running in native Python
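
The reducer code from the slides is not reproduced in this text. One possible shape for a word-count reducer, sketched under the assumption that its input is `word<TAB>count` lines already sorted by word (as the shuffle and sort step guarantees), is shown here. The name `reduce_lines` is hypothetical, and malformed lines are not handled.

```python
from itertools import groupby
from typing import Iterable, Iterator


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# In a streaming script the input would come from standard input:
#     import sys
#     for out in reduce_lines(sys.stdin):
#         print(out)
from_mapper = ["crew\t1", "crew\t1", "space\t1", "the\t1", "the\t1", "the\t1"]
reduced = list(reduce_lines(from_mapper))
print(reduced)
```

Because equal keys are adjacent, the reducer needs to remember only one running total at a time.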

• Running in Hadoop
Syntax:
hadoop jar <hadoop streaming jar file> -input <input folder in HDFS> -output <output folder in HDFS> -mapper <mapper.py> -reducer <reducer.py>

Example:
hadoop jar /home/bigdata/hadoop-2.8.5/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar \
    -input /input \
    -output /output1 \
    -mapper /home/bigdata/project-folder/mapper.py \
    -reducer /home/bigdata/project-folder/reducer.py

Result:

Notes:
1. The job processes all the files inside the HDFS input folder
2. Before running the job, the HDFS output folder must not already exist in HDFS
3. The output can be viewed with the hdfs “-cat” command, or by copying the file to the local file system
Given sample purchase data with 6 tab-separated columns: date, time, store city, product item, price, payment method.

Task: show the total purchases for each store city, i.e., pairs of “city_store <tab> total_purchases”

• Mapper

• Before running on the full data in Hadoop, it is suggested to run the code in native Python first on a small sample of the data
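
The mapper code from the slide is not reproduced in this text. A minimal sketch of a mapper for this task, assuming the six tab-separated columns described in the task statement, could look as follows. The function name `map_purchases` and the sample rows are made up for illustration.

```python
from typing import Iterable, Iterator


def map_purchases(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'store_city<TAB>price' for every well-formed purchase record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip malformed records instead of failing the job
        date, time, city, item, price, payment = fields
        yield f"{city}\t{price}"


# Made-up rows in the column order: date, time, store city, item, price, payment.
sample = [
    "2012-01-01\t09:00\tSan Jose\tMusic\t214.05\tAmex\n",
    "2012-01-01\t09:00\tFort Worth\tToys\t153.57\tVisa\n",
    "broken line without tabs\n",
]
mapped = list(map_purchases(sample))
print(mapped)
```

Testing on a handful of rows like this catches parsing mistakes before a long Hadoop job is launched.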

• Reducer

• Mapper output

• Test the code in native Python
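
The reducer code from the slide is not reproduced in this text. A possible reducer for the purchase totals is sketched here, assuming `city<TAB>price` lines sorted by city and a two-decimal output format (the format is an assumption). The name `reduce_purchases` and the sample values are illustrative.

```python
from itertools import groupby
from typing import Iterable, Iterator


def reduce_purchases(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the prices of each city; the input must be sorted by city."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for city, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(price) for _, price in group)
        yield f"{city}\t{total:.2f}"


# Made-up mapper output; sorted() stands in for Hadoop's shuffle and sort.
mapper_output = ["San Jose\t214.05", "Fort Worth\t153.57", "San Jose\t10.00"]
totals = list(reduce_purchases(sorted(mapper_output)))
print(totals)
```

Floating-point sums can drift on very large inputs; `decimal.Decimal` would be a safer choice where exact currency totals matter.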

• Running the MapReduce program in Hadoop

• Show the first 10 rows of the result

Note:
When the script is executed, the folder “/purchases” in HDFS contains the file “purchases.txt” that was used in Week 09

• Suppose we have a large web corpus
• Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host
• Map
▪ ?
• Reduce
▪ ?
• Suppose we have a large web corpus
• Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host
• Map
▪ For each record, output (hostname(URL), size)
• Reduce
▪ Sum the sizes for each host
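
A minimal sketch of this Map and Reduce pair, assuming each metadata record is a tab-separated line whose first two fields are the URL and the size in bytes (the slide does not specify the separator). The function names and sample records are hypothetical; `urlparse` from the standard library plays the role of hostname(URL).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple
from urllib.parse import urlparse


def map_host_size(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: for each record, output (hostname(URL), size)."""
    for record in records:
        fields = record.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not enough columns to hold URL and size
        host = urlparse(fields[0]).hostname
        if host is not None:
            yield host, int(fields[1])


def reduce_host_sizes(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: sum the sizes for each host."""
    totals: Dict[str, int] = defaultdict(int)
    for host, size in pairs:
        totals[host] += size
    return dict(totals)


records = [
    "http://example.com/a.html\t1200\t2020-01-01",
    "http://example.com/b.html\t800\t2020-01-02",
    "https://example.org/\t500\t2020-01-03",
]
totals = reduce_host_sizes(map_host_size(records))
print(totals)
```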
• Statistical machine translation:
▪ Need to count the number of times every 5-word sequence occurs in a large corpus of documents

• Very easy with MapReduce:
▪ Map:
    ▪ ?
▪ Reduce:
    ▪ ?

• Statistical machine translation:
▪ Need to count the number of times every 5-word sequence occurs in a large corpus of documents

• Very easy with MapReduce:
▪ Map:
    ▪ Extract (5-word sequence, count) from document
▪ Reduce:
    ▪ Combine the counts
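
A small sketch of the two steps, assuming whitespace tokenization and a count of 1 emitted per occurrence. The names `map_five_grams` and `reduce_counts` are illustrative, and the whole thing runs in one process rather than on a cluster.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

NGram = Tuple[str, ...]


def map_five_grams(text: str, n: int = 5) -> Iterator[Tuple[NGram, int]]:
    """Map: emit (n-word sequence, 1) for every window of n consecutive words."""
    tokens = text.split()
    # The range is empty when the document has fewer than n words.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n]), 1


def reduce_counts(pairs: Iterable[Tuple[NGram, int]]) -> Counter:
    """Reduce: combine the counts of identical sequences."""
    totals: Counter = Counter()
    for gram, count in pairs:
        totals[gram] += count
    return totals


docs = ["a b c d e f", "a b c d e"]
pairs = (pair for doc in docs for pair in map_five_grams(doc))
totals = reduce_counts(pairs)
print(totals)
```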
