
BigData

Relational database vs BigData


• Structured data vs semi-structured data, graph data

• Data from a single enterprise

• BigData requires a high degree of parallelism (storage and processing)

• Sharding, key-value storage systems and document stores
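
To make the sharding idea concrete, here is a minimal sketch of a key-value store whose keys are spread over several shards by hashing. The names `shard_for` and `ShardedStore` are hypothetical, and the shards are plain in-memory dictionaries standing in for separate machines.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard index from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store; each dict stands in for one storage node."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
store.put("customer:42", {"name": "Ada"})
print(store.get("customer:42"))
```

Because the shard is derived from the key alone, any node can locate a value without a central lookup. Changing the number of shards moves most keys, which this sketch does not address.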


Map-reduce
Map Reduce algorithms
• Used in parallel processing.

• Fault tolerant.

• Programming paradigm (model) → framework

• Examples: Hadoop, Google MapReduce

• Allows processing of large volumes of data.

• Input in different formats.


Map Reduce example
• Counting the products bought by customers who enter a store.

• Input collected by multiple machines in parallel.


• Data processed by multiple machines.

Map 1 → (laptop, 10), (usb, 13)
Map 2 → (laptop, 50), (usb, 57)
Map 3 → (laptop, 78), (mouse, 25)
Map 4 → (phone, 49), (mouse, 67)
Map Reduce example
• MAP phase

• The map function, provided by the developer, runs on multiple nodes in parallel and processes the input data.

• Each map call emits (key, value) pairs; the key is the reduce key.
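
The map phase for the product-counting example can be sketched as below. The input format (one `product,quantity` line per purchase) and the function name `map_purchases` are assumptions made for illustration. The splits are processed one after another here, whereas a real framework would run one map task per split in parallel.

```python
from typing import Iterable, Iterator, Tuple


def map_purchases(split: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map function: turn each input line into a (product, quantity) pair."""
    for line in split:
        product, quantity = line.split(",")
        yield product.strip(), int(quantity)


# Four input splits, one per map task (hypothetical input format).
splits = [
    ["laptop,10", "usb,13"],
    ["laptop,50", "usb,57"],
    ["laptop,78", "mouse,25"],
    ["phone,49", "mouse,67"],
]

# Sequential stand-in for the parallel map tasks.
map_outputs = [list(map_purchases(split)) for split in splits]
for task_id, pairs in enumerate(map_outputs, start=1):
    print(f"Map {task_id} ->", pairs)
```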
Map Reduce example
• REDUCE phase
• The reduce function, provided by the developer, aggregates the output produced by the map functions.
• Each call to the reduce function handles a single reduce key and all the values emitted for it.

Map output:
(laptop, 10), (usb, 13)
(laptop, 50), (usb, 57)
(laptop, 78), (mouse, 25)
(phone, 49), (mouse, 67)

Map output → shuffle → sort → reduce → output
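
A minimal single-process sketch of shuffle, sort and reduce over the eight pairs above. Sorting by key stands in for the shuffle and sort steps, and `reduce_sum` is a hypothetical reduce function that is called once per key.

```python
from itertools import groupby
from operator import itemgetter

map_output = [
    ("laptop", 10), ("usb", 13),
    ("laptop", 50), ("usb", 57),
    ("laptop", 78), ("mouse", 25),
    ("phone", 49), ("mouse", 67),
]


def reduce_sum(key, values):
    """Reduce function: one call per key, aggregating all of its values."""
    return key, sum(values)


# Shuffle and sort: bring all pairs with the same key next to each other.
shuffled = sorted(map_output, key=itemgetter(0))

# Reduce: one call per distinct key.
results = [
    reduce_sum(key, [value for _, value in group])
    for key, group in groupby(shuffled, key=itemgetter(0))
]
print(results)
# [('laptop', 138), ('mouse', 92), ('phone', 49), ('usb', 70)]
```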
Map reduce

input → map (4 tasks in parallel) → reduce (2 tasks) → output
Map reduce

input → map 1 → (laptop, 10), (usb, 13)
input → map 2 → (laptop, 50), (usb, 57)
input → map 3 → (laptop, 78), (mouse, 25)
input → map 4 → (phone, 49), (mouse, 67), (laptop, 5)
Map reduce

input → map 1 → (laptop, 10), (usb, 13)
input → map 2 → (laptop, 50), (usb, 57)
input → map 3 → (laptop, 78), (mouse, 25)
input → map 4 → (phone, 49), (mouse, 67), (laptop, 5)

shuffle, sort:
M1 ← (laptop, 10), (laptop, 50), (laptop, 78), (laptop, 5)
M2 ← (mouse, 25), (mouse, 67), (usb, 13), (usb, 57)
Map reduce

shuffle, sort, then reduce:
M1: (laptop, [10, 50, 78, 5]) → (laptop, 143)
M2: (mouse, [25, 67]) → (mouse, 92)
M2: (usb, [13, 57]) → (usb, 70)
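
The step that decides which reducer (M1 or M2) receives a key can be sketched with a hash partitioner. This is an illustrative assumption, not the assignment used on the slide: with a hash, the keys may land on different reducers than shown above, but all pairs with the same key always reach the same reducer.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reducer using a stable hash (hypothetical partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


map_output = [
    ("laptop", 10), ("usb", 13),
    ("laptop", 50), ("usb", 57),
    ("laptop", 78), ("mouse", 25),
    ("phone", 49), ("mouse", 67), ("laptop", 5),
]

num_reducers = 2
# Shuffle: each reducer collects the values of the keys assigned to it.
reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
for key, value in map_output:
    reducer_inputs[partition(key, num_reducers)][key].append(value)

# Reduce: each reducer processes its keys in sorted order.
totals = {}
for grouped in reducer_inputs:
    for key in sorted(grouped):
        totals[key] = sum(grouped[key])

print(totals)
```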
MapReduce Hadoop
• Open source, from Apache: https://hadoop.apache.org/
Tutorial: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

• Written in Java; map and reduce functions can also be written in other languages, for example C++ (Hadoop Pipes) or Python (Hadoop Streaming).

• Components
  • MapReduce
  • Hadoop Distributed File System (HDFS)
    • Each file is stored as a sequence of blocks.
    • Fault tolerant: each block is replicated.
    • Master-slave architecture: NameNode (master), DataNodes (slaves).
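
The block and replication ideas can be illustrated with a toy model. The block size (128 MB), the replication factor (3), the DataNode names and the round-robin placement are all assumptions for this sketch; the real HDFS placement policy is different and takes racks into account. The `placement` dictionary plays the role of the NameNode metadata.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size, configurable in HDFS
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_index, length_in_bytes) for each block of a file."""
    num_blocks = math.ceil(file_size / block_size)
    return [
        (i, min(block_size, file_size - i * block_size))
        for i in range(num_blocks)
    ]


def place_replicas(num_blocks: int, datanodes, replication: int = REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    return {
        i: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        for i in range(num_blocks)
    }


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```

If one DataNode fails, every block it held still has two other copies, which is the fault tolerance the slide refers to.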


MapReduce Hadoop
• map, reduce and combine functions.

• combine performs partial aggregation on each map's output before the result is sent to reduce.

• combine reduces the amount of data sent over the network.

• combine decreases the shuffling cost.

• A MapReduce job can be configured to run the map phase only (no reduce phase).
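
A minimal sketch of what a combine function does for the product-counting job, assuming the aggregation is a sum. A combiner can be used here because addition is associative and commutative, so summing partial sums gives the same total. The function name `combine` and the sample pairs are hypothetical.

```python
from collections import Counter


def combine(pairs):
    """Combiner: locally sum the values per key on the map side."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())


# Output of a single map task before combining (hypothetical sample).
map_task_output = [("laptop", 10), ("usb", 13), ("laptop", 50), ("laptop", 5)]

combined = combine(map_task_output)
print(len(map_task_output), "pairs before combine,", len(combined), "after")
print(combined)
# [('laptop', 65), ('usb', 13)]
```

Only two pairs instead of four leave this map task, which is where the savings in network traffic and shuffling come from.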


Inverted index
MapReduce Inverted index
• Web search engines (including Google).

• Maps content (words) to its location (documents and positions).

• Fast text search.

• Page ranking (PageRank)
Inverted index
D1: data base systems
D2: economic base analysis
D3: distributed systems
D4: data analysis

map: each word is emitted with its position, then tagged with its document
map(D1) → (data, 1), (base, 6), (systems, 11) → (data, D1:1), (base, D1:6), (systems, D1:11)
map(D2) → (economic, 1), (base, 10), (analysis, 15) → (economic, D2:1), (base, D2:10), (analysis, D2:15)
map(D3) → (distributed, 1), (systems, 13) → (distributed, D3:1), (systems, D3:13)
map(D4) → (data, 1), (analysis, 6) → (data, D4:1), (analysis, D4:6)
Inverted index

shuffle, sort:
M1 ← (analysis, D2:15), (analysis, D4:6), (base, D1:6), (base, D2:10), (data, D1:1), (data, D4:1)
M2 ← (distributed, D3:1), (economic, D2:1), (systems, D1:11), (systems, D3:13)
Inverted index

reduce:
M1: (analysis, [D2:15, D4:6])
M1: (base, [D1:6, D2:10])
M1: (data, [D1:1, D4:1])
M2: (distributed, [D3:1])
M2: (economic, [D2:1])
M2: (systems, [D1:11, D3:13])
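
The inverted-index job can be sketched in a single process as follows. It assumes that the positions on the slides are 1-based character offsets of each word within its document; `map_document` and `reduce_postings` are hypothetical names, and a dictionary stands in for the shuffle.

```python
import re
from collections import defaultdict

documents = {
    "D1": "data base systems",
    "D2": "economic base analysis",
    "D3": "distributed systems",
    "D4": "data analysis",
}


def map_document(doc_id, text):
    """Map: emit (word, (doc_id, position)) with 1-based character offsets."""
    for match in re.finditer(r"\w+", text):
        yield match.group().lower(), (doc_id, match.start() + 1)


def reduce_postings(word, postings):
    """Reduce: collect all locations of one word into a sorted list."""
    return word, sorted(postings)


# Shuffle: group the map output by word.
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for word, posting in map_document(doc_id, text):
        grouped[word].append(posting)

index = dict(reduce_postings(word, postings)
             for word, postings in sorted(grouped.items()))

for word, postings in index.items():
    print(word, postings)
```

A text search then becomes a dictionary lookup: `index["base"]` gives every document and offset where the word occurs.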
SQL operators
MapReduce: SQL operators
• Selection

• Group by

• Join
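
Selection and group by can be sketched on the EMPLOYEES rows used in the join example of this section. Selection needs only a map function that filters rows, so it can run as a map-only job; group by uses the grouping column as the reduce key. The function names and the chosen predicate (`dep_id = 90`) are assumptions for illustration.

```python
from collections import defaultdict

# (emp_id, name, dep_id)
employees = [
    (100, "Steven King", 90),
    (102, "Lex De Hann", 90),
    (108, "Nancy Greenberg", 100),
    (116, "Shelli Baida", 30),
    (117, "Sigal Tobias", 30),
]


def map_select(row):
    """SELECT * FROM employees WHERE dep_id = 90: emit only matching rows."""
    if row[2] == 90:
        yield row[0], row


def map_group(row):
    """SELECT dep_id, COUNT(*) ... GROUP BY dep_id: key is the grouping column."""
    yield row[2], 1


def reduce_count(dep_id, ones):
    return dep_id, sum(ones)


# Selection: map-only job, no reduce phase needed.
selected = [value for row in employees for _, value in map_select(row)]

# Group by: shuffle on dep_id, then one reduce call per department.
groups = defaultdict(list)
for row in employees:
    for key, value in map_group(row):
        groups[key].append(value)
counts = dict(reduce_count(key, values) for key, values in sorted(groups.items()))

print(selected)
print(counts)
# {30: 2, 90: 2, 100: 1}
```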
EMPLOYEES
emp_id  name             dep_id
100     Steven King      90
102     Lex De Hann      90
108     Nancy Greenberg  100
116     Shelli Baida     30
117     Sigal Tobias     30

DEPARTMENTS
dep_id  dep_name
30      Purchasing
90      Executive
100     Finance
20      Marketing

map (EMPLOYEES)
key  value
90   (Emp, 100, Steven King, 90)
90   (Emp, 102, Lex De Hann, 90)
100  (Emp, 108, Nancy Greenberg, 100)
30   (Emp, 116, Shelli Baida, 30)
30   (Emp, 117, Sigal Tobias, 30)

map (DEPARTMENTS)
key  value
30   (Dep, 30, Purchasing)
90   (Dep, 90, Executive)
100  (Dep, 100, Finance)
20   (Dep, 20, Marketing)
shuffle
key  value
20   (Dep, 20, Marketing)
30   (Dep, 30, Purchasing)
30   (Emp, 116, Shelli Baida, 30)
30   (Emp, 117, Sigal Tobias, 30)
90   (Dep, 90, Executive)
90   (Emp, 100, Steven King, 90)
90   (Emp, 102, Lex De Hann, 90)
100  (Dep, 100, Finance)
100  (Emp, 108, Nancy Greenberg, 100)
shuffle → reduce
key  value
20   [(Dep, 20, Marketing)]
30   [(Dep, 30, Purchasing), (Emp, 116, Shelli Baida, 30), (Emp, 117, Sigal Tobias, 30)]
90   [(Dep, 90, Executive), (Emp, 100, Steven King, 90), (Emp, 102, Lex De Hann, 90)]
100  [(Dep, 100, Finance), (Emp, 108, Nancy Greenberg, 100)]
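
The join walk-through can be sketched as a reduce-side join. Each row is tagged with its table name and keyed by `dep_id`; the reducer then pairs every department row with every employee row that shares the key. The function names are hypothetical, and the result is an inner join, so a department without employees produces no output row.

```python
from collections import defaultdict

employees = [
    (100, "Steven King", 90),
    (102, "Lex De Hann", 90),
    (108, "Nancy Greenberg", 100),
    (116, "Shelli Baida", 30),
    (117, "Sigal Tobias", 30),
]
departments = [(30, "Purchasing"), (90, "Executive"), (100, "Finance"), (20, "Marketing")]


def map_employee(row):
    emp_id, name, dep_id = row
    return dep_id, ("Emp", emp_id, name, dep_id)


def map_department(row):
    dep_id, dep_name = row
    return dep_id, ("Dep", dep_id, dep_name)


def reduce_join(dep_id, values):
    """Pair each department row with each employee row of the same key."""
    deps = [v for v in values if v[0] == "Dep"]
    emps = [v for v in values if v[0] == "Emp"]
    for _, _, dep_name in deps:
        for _, emp_id, name, _ in emps:
            yield emp_id, name, dep_id, dep_name


# Shuffle: group the tagged rows of both tables by dep_id.
groups = defaultdict(list)
for row in employees:
    key, value = map_employee(row)
    groups[key].append(value)
for row in departments:
    key, value = map_department(row)
    groups[key].append(value)

joined = [
    result
    for dep_id in sorted(groups)
    for result in reduce_join(dep_id, groups[dep_id])
]
for row in joined:
    print(row)
```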
