Warm-up task:
We have a huge text document
Sample applications:
▪ Analyze web server logs to find popular URLs
▪ Find popular hashtags
▪ Term statistics for search
Case 1:
▪ File too large for memory, but all <word, count> pairs fit in memory
Case 2:
Count occurrences of words:
▪ words(doc.txt) | sort | uniq -c
▪ where words takes a file and outputs the words in it, one per line
Case 2 captures the essence of MapReduce
▪ The great thing is that it is naturally parallelizable
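The shell pipeline above can be sketched in plain Python. The `words` helper is an assumption here (the slide does not define it); this version simply tokenizes on whitespace:

```python
from collections import Counter

def words(path):
    """Assumed helper: emit the words of a file, one at a time."""
    with open(path) as f:
        for line in f:
            yield from line.split()

def word_count(path):
    """Equivalent of: words(doc.txt) | sort | uniq -c"""
    return Counter(words(path))
```

Counter does the grouping and counting that `sort | uniq -c` performs in the shell version.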
words(doc.txt) | sort | uniq -c
Map:
▪ Scan the input file record-at-a-time
▪ Extract something you care about from each record (keys)
Group by key:
▪ Sort and shuffle
Reduce:
▪ Aggregate, summarize, filter, or transform
▪ Write the result
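The three phases above, for word counting, can be sketched directly (a minimal in-memory version; whitespace tokenization is an assumption):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit (word, 1) for each word in each record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def group_by_key(pairs):
    """Group by key: sort, then collect all values with the same key."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: aggregate (here, sum) the values for each key."""
    for key, values in grouped:
        yield key, sum(values)
```

In a real MapReduce run the grouping step is distributed across machines; the structure of the computation is the same.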
[Diagram: map tasks turn input key-value pairs (k, v) into intermediate key-value pairs]
[Diagram: intermediate key-value pairs are grouped by key into key-value groups, which reduce tasks turn into output key-value pairs]
Input: a set of key-value pairs
Programmer specifies two methods:
▪ Map(k, v) → <k’, v’>*
▪ Takes a key-value pair and outputs a set of key-value pairs
▪ E.g., the key is the filename and the value is the text of the document
▪ There is one Map call for every (k, v) pair
▪ Reduce(k’, <v’>*) → <k’, v’’>*
▪ All values v’ with the same key k’ are reduced together
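Word count as a concrete instance of these two signatures (a sketch; the framework code that groups intermediate pairs and drives the calls is assumed):

```python
def map_fn(key, value):
    """Map(k, v) -> <k', v'>*: key is a filename, value is the document text."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce(k', <v'>*) -> <k', v''>*: all counts for one word arrive together."""
    yield (key, sum(values))
```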
MAP (provided by the programmer):
▪ Read input and produce a set of key-value pairs
Group by key (done by the framework):
▪ Collect all pairs with the same key
Reduce (provided by the programmer):
▪ Collect all values belonging to the key and output the result
• Shuffle and Sort
➢ Shuffling: the process of transferring data from the mappers to the reducers
➢ Sorting: the process of sorting intermediate key-value pairs by their keys
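Sorting matters because it lets each reducer aggregate in a single streaming pass: once the key changes, that key is finished. A sketch of this pattern:

```python
def streaming_reduce(sorted_pairs):
    """Sum values per key in one pass; correct only because the input
    is sorted by key, so all pairs for a key are adjacent."""
    current_key, total = None, 0
    for key, value in sorted_pairs:
        if key != current_key:
            if current_key is not None:
                yield (current_key, total)   # key boundary: emit finished key
            current_key, total = key, 0
        total += value
    if current_key is not None:
        yield (current_key, total)           # emit the last key
```

This is exactly the structure a Hadoop Streaming reducer script follows when reading its sorted stdin.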
• Reduce operation #01 (done by the programmer)
• Example of key-value pairs from the mapper
[Code listing and sample pairs shown on the original slide]
• Reduce operation #02
• Example of key-value pairs from the mapper
[Code listing and sample pairs shown on the original slide]
• Running in Hadoop
Syntax:
hadoop jar <hadoop streaming jar file> -input <input folder in HDFS> -output <output folder in HDFS> -mapper <mapper.py> -reducer <reducer.py>
Example:
hadoop jar /home/bigdata/hadoop-2.8.5/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar -input /input -output /output1 -mapper /home/bigdata/project-folder/mapper.py -reducer /home/bigdata/project-folder/reducer.py
Result: [terminal output shown on the original slide]
Notes:
1. It will process all the files inside the HDFS input folder
2. Before running the command, the HDFS output folder must not already exist in HDFS
3. The output can be viewed with the hdfs “-cat” command, or copied to the local filesystem for viewing
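The mapper.py and reducer.py referenced above are not reproduced on this slide. A minimal word-count pair for Hadoop Streaming could look like the sketch below (the word-count task is an assumption; the logic is shown as functions for readability, while each real script would read sys.stdin line by line and print its output):

```python
def run_mapper(lines):
    """mapper.py body: emit "word<TAB>1" for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(lines):
    """reducer.py body: Streaming delivers input sorted by key, so all
    lines for one word are adjacent; sum them in a single pass."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

Between the two scripts, Hadoop performs the shuffle and sort, playing the role of the `sort` in the shell pipeline from earlier.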
Given the sample of purchase data below, with 6 columns separated by tabs. The columns are: date, time, store city, product item, price, payment method
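The mapper and reducer on the following slides appear as screenshots in the original. A sketch of one plausible job over this data is given below; the specific task (total sales per store city) and the function names are assumptions, not taken from the slides:

```python
def purchase_mapper(lines):
    """Emit "city<TAB>price" from each 6-column tab-separated record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 6:                 # skip malformed records
            date, time, city, item, price, payment = fields
            yield f"{city}\t{price}"

def purchase_reducer(lines):
    """Sum prices per city; input must be sorted by city."""
    current, total = None, 0.0
    for line in lines:
        city, price = line.split("\t")
        if city != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = city, 0.0
        total += float(price)
    if current is not None:
        yield f"{current}\t{total}"
```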
• Mapper [code listing shown on the original slide]
• Reducer, with example mapper output [shown on the original slide]
• Running the MapReduce program in Hadoop
Note:
When executing the script above, the folder “/purchases” in HDFS contains the file “purchases.txt” that is used in Week 09
Suppose we have a large web corpus
Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host
Map
▪ ?
Reduce
▪ ?
Suppose we have a large web corpus
Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host
Map
▪ For each record, output (hostname(URL), size)
Reduce
▪ Sum the sizes for each host
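A runnable sketch of this job (extracting the hostname with urllib.parse is an implementation choice, not fixed by the slide; the group-and-sum step stands in for the framework's shuffle):

```python
from itertools import groupby
from operator import itemgetter
from urllib.parse import urlparse

def map_record(record):
    """record is (URL, size, date, ...); emit (hostname(URL), size)."""
    url, size = record[0], record[1]
    yield (urlparse(url).hostname, size)

def total_bytes_per_host(records):
    """Group intermediate pairs by host and sum the sizes."""
    pairs = sorted(p for r in records for p in map_record(r))
    return {host: sum(size for _, size in group)
            for host, group in groupby(pairs, key=itemgetter(0))}
```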
Statistical machine translation:
▪ Need to count the number of times every 5-word sequence occurs in a large corpus of documents
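A sketch of the 5-gram counting job: Map emits each consecutive 5-word window with count 1, and Reduce sums the counts per window (Counter stands in for the group-and-sum step here):

```python
from collections import Counter

def map_five_grams(doc):
    """Emit (5-word window, 1) for every consecutive window in one document."""
    tokens = doc.split()
    for i in range(len(tokens) - 4):
        yield (tuple(tokens[i:i + 5]), 1)

def count_five_grams(docs):
    """Reduce: sum the counts of each 5-gram across all documents."""
    counts = Counter()
    for doc in docs:
        for gram, one in map_five_grams(doc):
            counts[gram] += one
    return counts
```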