
MINISTRY OF EDUCATION AND TRAINING

UNIVERSITY OF SCIENCE, HO CHI MINH CITY

FACULTY OF INFORMATION TECHNOLOGY

A Study of MapReduce
Report for the course Advanced Database Systems

Supervisor: Dr. Nguyễn Trần Minh Thư

Group 07:
1. 19C11015 - Đỗ Huy Gia Cát
2. 21C12003 - Đào Thanh Danh
3. 21C11026 - Nguyễn Thành Thái
CONTENTS

• Overview of MapReduce
o Motivation
o History
o Applications
• MapReduce definition
o How does MapReduce work?
o Example
• MapReduce extensions
• Conclusion

What is MapReduce

• Motivation – the real-world problem


• History of MapReduce

What is MapReduce

• What MapReduce offers
• Programs are automatically parallelized and executed on a large cluster of machines
• MapReduce and database management systems: competing or complementary paradigms?

What is MapReduce

• Use cases of MapReduce at Google


• Distributed grep
• Distributed sort
• Web link-graph reversal
• Term-vector per host
• Web access log stats
• Inverted index construction
• Document clustering
• Machine learning
• Statistical machine translation

How MapReduce Works

• Defining MapReduce
• Data is modeled as key-value pairs
• map
• Input: an input key/value pair
• Output: a set of intermediate key/value pairs
• reduce
• Input: an intermediate key and the set of values for that key
• Output: output key/value pairs
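
The shape of the two functions can be sketched in Python. This is only an illustration: the names `map_fn` and `reduce_fn` and the log-line format `host path status` are assumptions made for the example, loosely following the "web access log stats" use case.

```python
from typing import Iterable, Iterator, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """map: one input key/value pair -> intermediate key/value pairs.

    Assumed input: key is a log file name, value is one line of the form
    "host path status".
    """
    host, path, status = value.split()
    # Emit the requested path as the intermediate key, with a count of 1.
    yield path, 1


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """reduce: one intermediate key and all of its values -> output pair."""
    return key, sum(values)
```

Here `map_fn` sees a single record at a time and `reduce_fn` sees a single key at a time, which is what makes both steps easy to run in parallel.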

How MapReduce Works
• Input splits -> the input is divided into fixed-size pieces (splits), which are read as key-value pairs
• Mapping -> each split is passed to the mapping function
• Shuffling -> consolidates the relevant records by grouping intermediate values under their key
• Reducing -> the grouped values for each key are aggregated and a single output value is returned
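
The four phases can be sketched as a single-process Python program. This is a toy illustration under stated assumptions, not how a cluster framework is implemented: the function names and the `"year,temperature"` record format are hypothetical, and every phase runs sequentially here.

```python
from collections import defaultdict


def split_input(records, split_size):
    """Input splits: cut the input into fixed-size pieces."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_phase(split):
    """Mapping: turn each record of one split into a (key, value) pair."""
    pairs = []
    for record in split:
        year, temp = record.split(",")
        pairs.append((year, int(temp)))
    return pairs


def shuffle_phase(mapped_splits):
    """Shuffling: group all intermediate values under their key."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducing: aggregate each key's values into a single output value."""
    return {key: max(values) for key, values in groups.items()}


records = ["1949,11", "1950,22", "1949,7", "1950,-3", "1951,15"]
splits = split_input(records, 2)
mapped = [map_phase(s) for s in splits]   # one map call per split
grouped = shuffle_phase(mapped)
result = reduce_phase(grouped)            # maximum temperature per year
```

On a real cluster, each call to `map_phase` could run on a different machine, and the shuffle would move data over the network.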
Example: Word Count Problem
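
A minimal word count sketch in Python, assuming three short documents as input. The names `map_words`, `reduce_counts` and `word_count`, as well as the sample documents, are made up for illustration; the grouping step stands in for the shuffle that a real framework would perform.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Sum all the 1s emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    # Group intermediate values by word (the shuffle step).
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_words(doc_id, text):
            intermediate[word].append(one)
    # Reduce each word's list of counts to a single total.
    return dict(reduce_counts(w, c) for w, c in intermediate.items())


documents = {
    "doc1": "Deer Bear River",
    "doc2": "Car Car River",
    "doc3": "Deer Car Bear",
}
counts = word_count(documents)
```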

MapReduce Extends

• MapReduce trades off flexibility in structuring a computation for a model that parallelizes the computation over a cluster => computation constraints exist
• Within a map task, you can only work on one aggregate
• Within a reduce task, you can only work on a single key
• These constraints call for different approaches to structuring a computation

Multiple stages approach

• As a computation becomes more complex, it is better to divide it into several smaller map-reduce stages

• Advantages:
• Easier to write and maintain
• Reusability
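
A two-stage computation can be sketched as follows. This is a hypothetical example, assuming sales records of the form `(product, month, amount)`: stage 1 totals sales per product and month, and stage 2 picks each product's best month. The helper `run_stage` and all function names are illustrative only.

```python
from collections import defaultdict


def run_stage(records, map_fn, reduce_fn):
    """Run one map-reduce stage in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]


# Stage 1: total amount per (product, month).
def map_sale(sale):
    product, month, amount = sale
    yield (product, month), amount


def reduce_total(key, amounts):
    product, month = key
    return (product, month, sum(amounts))


# Stage 2: best month per product, computed from the stage 1 output.
def map_monthly(total):
    product, month, amount = total
    yield product, (amount, month)


def reduce_best(product, candidates):
    amount, month = max(candidates)
    return (product, month, amount)


sales = [
    ("tea", "2023-01", 5), ("tea", "2023-01", 7), ("tea", "2023-02", 9),
    ("coffee", "2023-01", 4), ("coffee", "2023-02", 6), ("coffee", "2023-02", 1),
]
monthly_totals = run_stage(sales, map_sale, reduce_total)
best_months = run_stage(monthly_totals, map_monthly, reduce_best)
```

The intermediate `monthly_totals` could be saved and reused by other second stages, which is where the reusability advantage comes from.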

Incremental MapReduce approach

• Suitable for data that is updated constantly
• Recomputes only the affected part of the data instead of restarting from scratch
• Requires persisting the current results and combining them with new data
• Map stages are easy to make incremental, while reduce stages are more complex
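
One way to picture the incremental approach is the sketch below, which assumes the reduce step is a sum, so that a stored result can simply be merged with the result for the new data. That assumption does not hold for every reducer, which is one reason the reduce side is harder. All names are hypothetical.

```python
from collections import Counter


def map_record(line):
    """Emit (word, 1) for each word of one input line."""
    for word in line.split():
        yield word, 1


def reduce_batch(lines):
    """Map and reduce one batch of lines into per-word counts."""
    counts = Counter()
    for line in lines:
        for word, n in map_record(line):
            counts[word] += n
    return counts


def incremental_update(stored, new_lines):
    """Combine the persisted result with counts from the new lines only."""
    updated = Counter(stored)          # copy, so the stored result is kept
    updated.update(reduce_batch(new_lines))
    return updated


stored = reduce_batch(["a b a", "b c"])            # persisted earlier result
updated = incremental_update(stored, ["c c d"])    # only new data is mapped
full = reduce_batch(["a b a", "b c", "c c d"])     # full recomputation
```

In this sketch `updated` equals `full`, but only the new line was mapped to produce it.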

Conclusion
• MapReduce allows computations to be parallelized over a cluster, but it has high latency.
• A map task reads data from an aggregate and boils it down to relevant key-value pairs. It reads only a single record at a time and can thus be parallelized.
• Reduce tasks take the many values that map tasks output for a single key and summarize them into a single output. They are parallelized by key.
• Reducers whose input and output have the same form can be combined into pipelines, which improves parallelism and reduces the data to be transferred.
• Map-reduce operations can be composed into pipelines of multiple map-reduce stages (map -> reduce -> map -> reduce ...).
• The result of a map-reduce computation can be stored as a materialized view, which can then be updated through incremental map-reduce operations that recompute only what has changed.
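
The point about combinable reducers can be illustrated with a small sketch. It assumes a summing reducer, whose output has the same form as its input, so a local combine step can run on each map node before anything is transferred. The function names are hypothetical, and the "transfer" is only simulated by counting pairs.

```python
from collections import Counter


def map_split(lines):
    """Map one split: emit (word, 1) for every word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Local pre-aggregation on the map side (a combiner)."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_all(pair_lists):
    """Final reduce over the pairs received from all map nodes."""
    totals = Counter()
    for pairs in pair_lists:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


splits = [["a a a b"], ["a b b b"]]
raw = [map_split(s) for s in splits]
combined = [combine(p) for p in raw]

pairs_without_combiner = sum(len(p) for p in raw)
pairs_with_combiner = sum(len(p) for p in combined)
```

Both paths give the same totals, while the combined path sends fewer pairs to the reducer.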
