Map Reduce

BigData
Relational database vs BigData

• Structured data vs semi-structured data, graph data
• Data from a single enterprise
• BigData requires a high degree of parallelism (storage and processing)
• Sharding, key-value storage systems and document stores

Map-reduce
Map Reduce algorithms
• Used in parallel processing.
• Fault tolerant.
• Programming paradigm (model) → framework.
• Examples: Hadoop, Google MapReduce.
• Allows processing large volumes of data.
• Input can be in different formats.

Map Reduce example
• Counting the products bought by clients entering a store.
• Input collected by multiple machines in parallel.
• Data processed by multiple machines.

Map 1 → (laptop, 10), (usb, 13)
Map 2 → (laptop, 50), (usb, 57)
Map 3 → (laptop, 78), (mouse, 25)
Map 4 → (phone, 49), (mouse, 67)
Map Reduce example
• MAP phase
• The map function, provided by the developer, runs on multiple nodes in parallel and processes the input data.
• Each map call emits (key, value) pairs; the key is the reduce key.

input record → map → (key, value), where key = reduce key
Map Reduce example
• REDUCE phase
• The reduce function, provided by the developer, aggregates the output produced by the map functions.
• Each call of the reduce function handles a single reduce key.

(laptop, 10) (usb, 13) | (laptop, 50) (usb, 57) | (laptop, 78) (mouse, 25) | (phone, 49) (mouse, 67)
→ Shuffle → Sort → Reduce → output
Map reduce
input → map (four tasks in parallel) → reduce (two tasks) → output
Map reduce
input → map 1 → (laptop, 10), (usb, 13)
input → map 2 → (laptop, 50), (usb, 57)
input → map 3 → (laptop, 78), (mouse, 25)
input → map 4 → (phone, 49), (mouse, 67)
Map reduce
Map output:
map 1 → (laptop, 10), (usb, 13)
map 2 → (laptop, 50), (usb, 57)
map 3 → (laptop, 78), (mouse, 25)
map 4 → (phone, 49), (mouse, 67), (laptop, 5)

After shuffle, sort (each key goes to one reducer):
M1: (laptop, 10), (laptop, 50), (laptop, 78), (laptop, 5)
M2: (mouse, 25), (mouse, 67), (phone, 49), (usb, 13), (usb, 57)
Map reduce
input → map → shuffle, sort → reduce

After shuffle, sort, the values are grouped per key; each reducer then aggregates them:
M1: (laptop, [10, 50, 78, 5]) → (laptop, 143)
M2: (mouse, [25, 67]) → (mouse, 92)
    (phone, [49]) → (phone, 49)
    (usb, [13, 57]) → (usb, 70)
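The product-count walkthrough above can be sketched in a few lines of plain Python. This is only a single-process simulation of the three steps (map, shuffle/sort, reduce), not Hadoop code; the names `map_fn`, `shuffle_sort`, `reduce_fn` and `run_job` are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict

# Input splits as in the slides: one list of (product, quantity) records per mapper.
splits = [
    [("laptop", 10), ("usb", 13)],
    [("laptop", 50), ("usb", 57)],
    [("laptop", 78), ("mouse", 25)],
    [("phone", 49), ("mouse", 67), ("laptop", 5)],
]

def map_fn(record):
    """Emit one (key, value) pair per input record; the product is the reduce key."""
    product, quantity = record
    yield product, quantity

def shuffle_sort(pairs):
    """Group all values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """Called once per key: aggregate all values collected for that key."""
    return key, sum(values)

def run_job(splits):
    # In a real framework each split is mapped on a different node in parallel.
    mapped = [pair for split in splits for record in split for pair in map_fn(record)]
    return dict(reduce_fn(key, values) for key, values in shuffle_sort(mapped))

totals = run_job(splits)
print(totals)  # {'laptop': 143, 'mouse': 92, 'phone': 49, 'usb': 70}
```

In a real cluster the grouped keys would additionally be partitioned across reducers (M1 and M2 in the slides); the sketch keeps them in one sorted list for brevity.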
MapReduce Hadoop
• Open source from Apache: https://hadoop.apache.org/
  https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
• Written in Java; map and reduce functions can also be implemented in C++/Python.
• Components
  • MapReduce
  • Hadoop Distributed File System (HDFS)
    • Each file is stored as a sequence of blocks.
    • Fault tolerant: each block is replicated.
    • Master-slave architecture: NameNode (master), DataNodes (slaves).

MapReduce Hadoop
• map, reduce and combine functions.
• combine performs partial aggregation before map sends its results to reduce.
• combine reduces the amount of data sent over the network.
• combine decreases the shuffling cost.
• A MapReduce job can be configured to run only the map phase.
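The effect of combine can be illustrated with a small single-process sketch. It assumes the aggregation is a sum (which is associative and commutative, so partial sums are safe); the `combine` helper and the sample split are hypothetical and not part of the Hadoop API.

```python
from collections import Counter

def map_fn(record):
    """Emit one (product, quantity) pair per input record."""
    product, quantity = record
    yield product, quantity

def combine(pairs):
    """Partial aggregation on the mapper's node: sum the values per key locally."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

# Hypothetical split handled by a single mapper.
split = [("laptop", 10), ("usb", 13), ("laptop", 5), ("usb", 2), ("laptop", 1)]

mapped = [pair for record in split for pair in map_fn(record)]
combined = combine(mapped)

print(len(mapped), "pairs sent to reduce without combine")
print(len(combined), "pairs sent to reduce with combine:", combined)
```

Here five pairs shrink to two before the shuffle, which is where the saving in network traffic comes from.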

Inverted index
MapReduce Inverted index
• Web search engines (including Google).
• Maps content to location.
• Fast text search.
• PageRank-ing
Inverted index
Input documents:
D1: data base systems
D2: economic base analysis
D3: distributed systems
D4: data analysis

Map output (word, position), then tagged with the document id:
map D1 → (data, 1), (base, 6), (systems, 11) → (data, D1:1), (base, D1:6), (systems, D1:11)
map D2 → (economic, 1), (base, 10), (analysis, 15) → (economic, D2:1), (base, D2:10), (analysis, D2:15)
map D3 → (distributed, 1), (systems, 14) → (distributed, D3:1), (systems, D3:14)
map D4 → (data, 1), (analysis, 6) → (data, D4:1), (analysis, D4:6)

These pairs are the input of shuffle, sort (reducers M1 and M2).
After shuffle, sort:
M1: (analysis, D2:15), (analysis, D4:6), (base, D1:6), (base, D2:10), (data, D1:1), (data, D4:1)
M2: (distributed, D3:1), (economic, D2:1), (systems, D1:11), (systems, D3:14)
Reduce output (one posting list per word):
M1: (analysis, [D2:15, D4:6])
    (base, [D1:6, D2:10])
    (data, [D1:1, D4:1])
M2: (distributed, [D3:1])
    (economic, [D2:1])
    (systems, [D1:11, D3:14])
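A minimal sketch of the inverted-index job, again as a single-process simulation. It assumes positions are 1-based character offsets of each word, recomputed from the strings as written here, so a value can differ from the slide if the original text had different spacing. `map_fn` and `reduce_fn` are hypothetical names.

```python
import re
from collections import defaultdict

documents = {
    "D1": "data base systems",
    "D2": "economic base analysis",
    "D3": "distributed systems",
    "D4": "data analysis",
}

def map_fn(doc_id, text):
    """Emit (word, (doc_id, position)) with 1-based character positions."""
    for match in re.finditer(r"\w+", text):
        yield match.group(), (doc_id, match.start() + 1)

def reduce_fn(word, postings):
    """Called once per word: build its sorted posting list."""
    return word, sorted(postings)

# Shuffle: group the postings emitted by all map calls by word.
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, posting in map_fn(doc_id, text):
        groups[word].append(posting)

# Sort by key, then reduce each group.
index = dict(reduce_fn(word, postings) for word, postings in sorted(groups.items()))
for word, postings in index.items():
    print(word, postings)
```

A text search for a word then becomes a single dictionary lookup, for example `index["base"]`.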
SQL operators
MapReduce: SQL operators
• Selection
• Group by
• Join
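Selection and group by can be sketched as follows. The sketch assumes that selection is run as a map-only job (a filter in the map function) and that group by uses the grouping column as the reduce key. The sample rows are taken from the EMPLOYEES table of the slides; the function names and the `dep_id = 30` predicate are hypothetical.

```python
from collections import defaultdict

# (emp_id, name, dep_id) rows from the EMPLOYEES table.
employees = [
    (100, "Steven King", 90),
    (102, "Lex De Hann", 90),
    (108, "Nancy Greenberg", 100),
    (116, "Shelli Baida", 30),
    (117, "Sigal Tobias", 30),
]

def select_map(row):
    """Selection (WHERE dep_id = 30): emit the row only if it matches; no reduce needed."""
    if row[2] == 30:
        yield row[0], row

def group_map(row):
    """GROUP BY dep_id: emit the grouping column as the key."""
    yield row[2], 1

def count_reduce(dep_id, ones):
    """COUNT(*) for one group."""
    return dep_id, sum(ones)

# Map-only job: the map output is the final result.
selected = [value for row in employees for _, value in select_map(row)]

# Map, shuffle/sort, reduce.
groups = defaultdict(list)
for row in employees:
    for key, value in group_map(row):
        groups[key].append(value)
counts = dict(count_reduce(key, values) for key, values in sorted(groups.items()))

print(selected)  # rows of department 30
print(counts)    # {30: 2, 90: 2, 100: 1}
```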
EMPLOYEES
emp_id  name             dep_id
100     Steven King      90
102     Lex De Hann      90
108     Nancy Greenberg  100
116     Shelli Baida     30
117     Sigal Tobias     30

DEPARTMENTS
dep_id  dep_name
30      Purchasing
90      Executive
100     Finance
20      Marketing
map (EMPLOYEES)
key  value
90   (Emp, 100, Steven King, 90)
90   (Emp, 102, Lex De Hann, 90)
100  (Emp, 108, Nancy Greenberg, 100)
30   (Emp, 116, Shelli Baida, 30)
30   (Emp, 117, Sigal Tobias, 30)

map (DEPARTMENTS)
key  value
30   (Dep, 30, Purchasing)
90   (Dep, 90, Executive)
100  (Dep, 100, Finance)
20   (Dep, 20, Marketing)
shuffle
key  value
20   (Dep, 20, Marketing)
30   (Dep, 30, Purchasing)
30   (Emp, 116, Shelli Baida, 30)
30   (Emp, 117, Sigal Tobias, 30)
90   (Dep, 90, Executive)
90   (Emp, 100, Steven King, 90)
90   (Emp, 102, Lex De Hann, 90)
100  (Dep, 100, Finance)
100  (Emp, 108, Nancy Greenberg, 100)
reduce
key  value
30   [(Dep, 30, Purchasing), (Emp, 116, Shelli Baida, 30), (Emp, 117, Sigal Tobias, 30)]
90   [(Dep, 90, Executive), (Emp, 100, Steven King, 90), (Emp, 102, Lex De Hann, 90)]
100  [(Dep, 100, Finance), (Emp, 108, Nancy Greenberg, 100)]
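The join walked through above can be sketched as a reduce-side join: both tables are mapped to the join key dep_id, each value is tagged with its source table (Emp or Dep), and the reducer pairs the records that share a key. The sketch assumes inner-join semantics, so a key with no employees (20, Marketing) produces no output; all function names are hypothetical.

```python
from collections import defaultdict

employees = [
    (100, "Steven King", 90),
    (102, "Lex De Hann", 90),
    (108, "Nancy Greenberg", 100),
    (116, "Shelli Baida", 30),
    (117, "Sigal Tobias", 30),
]
departments = [(30, "Purchasing"), (90, "Executive"), (100, "Finance"), (20, "Marketing")]

def map_employee(row):
    """Key each employee by dep_id and tag the value with its source table."""
    emp_id, name, dep_id = row
    yield dep_id, ("Emp", emp_id, name, dep_id)

def map_department(row):
    """Key each department by dep_id and tag the value with its source table."""
    dep_id, dep_name = row
    yield dep_id, ("Dep", dep_id, dep_name)

def reduce_join(dep_id, values):
    """Called once per dep_id: pair every employee with every department record."""
    deps = [v for v in values if v[0] == "Dep"]
    emps = [v for v in values if v[0] == "Emp"]
    for _, emp_id, name, _ in emps:
        for _, _, dep_name in deps:
            yield emp_id, name, dep_id, dep_name

# Shuffle: group the tagged values from both map functions by key.
groups = defaultdict(list)
for row in employees:
    for key, value in map_employee(row):
        groups[key].append(value)
for row in departments:
    for key, value in map_department(row):
        groups[key].append(value)

joined = [out for key in sorted(groups) for out in reduce_join(key, groups[key])]
for row in joined:
    print(row)
```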
