Pietro Michiardi
Eurecom
Tutorial: MapReduce
Introduction
What is MapReduce
A programming model
An execution framework
Architecture internals
Software components
Cluster deployments
Motivations
Big Data
Vast repositories of data:
- Web-scale processing
- Behavioral data
- Physics
- Astronomy
- Finance
Big Ideas
1 HDD = 75 MB/sec
1000 HDDs = 75 GB/sec
Sharing is difficult:
- Synchronization, deadlocks
- Finite bandwidth to access data from a SAN
- Temporal dependencies are complicated (restarts)
Cascading effect
Implications of Failures
Sources of failures:
- Hardware / Software
- Electrical, cooling, ...
- Unavailability of a resource due to overload
Failure types:
- Permanent
- Transient
Batch processing, involving (mostly) full scans of the dataset
Auxiliary components:
- Hadoop Pig
- Hadoop Hive
- Cascading/Scalding
- ... and many more!
Seamless Scalability
Part One
MapReduce Framework
Preliminaries
Data locality
Resource availability
Programming Model
map phase
fold phase
The Framework
Data Structures
Keys and values can be integers, floats, strings, or raw bytes; they can also be arbitrary data structures.
E.g., for a collection of Web pages, input keys may be URLs and values may be the HTML content.
In some algorithms input keys are not used; in others they uniquely identify a record.
Keys can be combined in complex ways to design various algorithms.
A MapReduce job
From mapper to reducer: the framework guarantees that all values associated with the same key (the word) are brought to the same reducer.
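The word-count flow can be sketched in plain Java, independent of the Hadoop API; the class and method names below are illustrative, not Hadoop's actual Mapper/Reducer interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the word-count logic: the mapper emits (word, 1) pairs,
// the framework groups values by key, and the reducer sums them.
public class WordCountSketch {
    // Map step: one (word, 1) pair per word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // Shuffle + reduce: group by key, then sum the values per key.
    static Map<String, Integer> reduceAll(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = map("the quick fox jumps over the lazy dog the end");
        Map<String, Integer> counts = reduceAll(pairs);
        System.out.println(counts.get("the")); // prints 3
    }
}
```

In real Hadoop code the two functions live in separate Mapper and Reducer classes, and the grouping step is performed by the framework's shuffle, not in user code.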
Side effects
Scheduling
Each job is broken into tasks
The scheduler component can be customized.
Job execution time depends on the slowest map and reduce tasks.
Speculative execution can help with slow machines.
Data/code co-location
The scheduler starts the task on the node that holds a particular block of data required by the task.
If this is not possible, tasks are started elsewhere, and data will cross the network.
Synchronization
In MapReduce, synchronization is achieved by the shuffle and sort barrier.
Hardware failures
Software failures:
- Exceptions, bugs
Partitioners
Partitioners are responsible for:
- Dividing up the intermediate key space
- Assigning intermediate key-value pairs to reducers
- Specifying the task to which an intermediate key-value pair must be copied
Hash-based partitioner
When dealing with complex keys, even the base partitioner may need customization.
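Hadoop's default hash-based partitioner assigns a key to a reducer by hashing it modulo the number of reduce tasks; a minimal, self-contained sketch of that rule (the class name here is illustrative):

```java
// Sketch of Hadoop's default hash-based partitioning rule:
// partition = (hash(key) & Integer.MAX_VALUE) % numReduceTasks.
// Masking the sign bit keeps the result non-negative for any hashCode.
public class HashPartitionSketch {
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands on the same reducer.
        int p1 = getPartition("mapreduce", 4);
        int p2 = getPartition("mapreduce", 4);
        System.out.println(p1 == p2 && p1 >= 0 && p1 < 4); // prints true
    }
}
```

This determinism is what lets the framework guarantee that all values for a given key reach the same reducer.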
Combiners
Combiners are an (optional) optimization.
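As a sketch of why a combiner helps: for an associative and commutative reduce function (like the sum in word count), aggregating map output locally shrinks what the shuffle must move, without changing the final result. Class and method names below are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of what a combiner buys: local aggregation of map output
// before the shuffle. For an associative and commutative reduce
// function (here: sum), combining on the map side reduces the number
// of key-value pairs sent over the network without changing the result.
public class CombinerSketch {
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : mapOutput) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = List.of(
            Map.entry("the", 1), Map.entry("dog", 1), Map.entry("the", 1), Map.entry("the", 1));
        List<Map.Entry<String, Integer>> combined = combine(raw);
        System.out.println(raw.size() + " -> " + combined.size()); // prints 4 -> 2
    }
}
```

Because the framework may run a combiner zero, one, or several times, only functions where partial aggregation is safe (associative and commutative) qualify.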
Distributed filesystems
HDFS
Divide user data into blocks
Master-slave architecture
HDFS, an Illustration
HDFS I/O
A typical read from a client involves several steps.
HDFS Replication
Replication policy
Part Two
Hadoop MapReduce
Preliminaries
HDFS in details
Hadoop I/O
Hadoop MapReduce:
- Implementation details
- Types and Formats
- Features in Hadoop
Hadoop Deployments
Terminology
MapReduce
Task Attempts
HDFS in details
HDFS Blocks
(Big) files are broken into block-sized chunks.
NOTE: A file that is smaller than a single block does not occupy a full block's worth of underlying storage.
Secondary NameNode
DataNode
External clients
MapReduce clients
Details on replication
Replica placement:
- First copy on the same node as the client; second replica is off-rack; third replica is on the same rack as the second, but on a different node
- Since Hadoop 0.21, replica placement can be customized
Hadoop I/O
From/to HDFS
From/to local disk drives
Across machines (inter-process communication)
What's next
Data Integrity
Every I/O operation on disks or the network may corrupt data
Compression
Splittable files, Example 2 (gzip):
- Use bzip2
- Otherwise, use SequenceFiles
- See Chapter 4 (page 84) of [11]
Serialization
Transforms structured objects into a byte stream
Sequence Files
Specialized data structure to hold custom input data
Job Submission
JobClient class
Job Initialization
The JobTracker is responsible for job initialization.
Task Assignment
Heartbeat-based mechanism
Selecting a task
Data locality
Task Execution
Task assignment is done; now the TaskTrackers can execute the tasks.
Handling Failures
In the real world, code is buggy, processes crash, and machines fail.
Task failure
TaskTracker failure
JobTracker failure:
- Multiple JobTrackers
- Use ZooKeeper as a coordination mechanism
Scheduling
FIFO Scheduler (default behavior)
Fair Scheduler:
- Every user gets a fair share of the cluster capacity over time
- Jobs are placed into pools, one for each user:
  - Users that submit more jobs have no more resources than others
  - Can guarantee minimum capacity per pool
- Supports preemption
- Contrib module, requires manual installation
Capacity Scheduler
The process by which the system sorts and transfers map outputs to the reducers is known as the shuffle.
In-memory buffering and pre-sorting:
- 100 MB by default
- Threshold-based mechanism to spill buffer content to disk
- Map output is written to the buffer while spilling to disk
- If the buffer fills up while spilling, the map task is blocked
Disk spills
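The threshold-based spill mechanism can be sketched as a small simulation. The 100 MB buffer and 0.80 spill threshold mirror the classic io.sort.mb and io.sort.spill.percent defaults; the class itself is an illustrative model, not Hadoop code.

```java
// Sketch of the threshold-based spill mechanism: map output accumulates
// in a fixed-size memory buffer (io.sort.mb, 100 MB by default) and
// content is spilled to disk once a usage threshold
// (io.sort.spill.percent, 0.80 by default) is crossed.
public class SpillBufferSketch {
    final long capacity;       // buffer size in bytes
    final double threshold;    // fraction of the buffer that triggers a spill
    long used = 0;
    int spills = 0;

    SpillBufferSketch(long capacityBytes, double spillThreshold) {
        this.capacity = capacityBytes;
        this.threshold = spillThreshold;
    }

    // Write a record of `bytes` length; spill (here: empty the buffer
    // and count) when usage crosses the threshold.
    void write(long bytes) {
        used += bytes;
        if (used >= capacity * threshold) {
            spills++;
            used = 0; // spilled content now lives in a sorted file on disk
        }
    }

    public static void main(String[] args) {
        SpillBufferSketch buf = new SpillBufferSketch(100L << 20, 0.80);
        for (int i = 0; i < 1000; i++) buf.write(1L << 20); // 1000 x 1 MB records
        System.out.println(buf.spills); // spill every 80 MB -> prints 12
    }
}
```

In real Hadoop the spill happens on a background thread while the mapper keeps writing, which is why a full buffer during an in-flight spill blocks the map task.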
There is a small number (5) of copy threads that can fetch map outputs in parallel.
Input consolidation:
- When all map outputs have been copied, a merge phase starts
- All map outputs are sorted, maintaining their sort ordering, in rounds
MapReduce Types
What is a Writable?
Why comparable?
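A Writable is an object that knows how to serialize itself to a DataOutput and to repopulate its fields from a DataInput; keys additionally need to be comparable so the framework can sort them during the shuffle (Hadoop's WritableComparable). The sketch below mimics that contract with plain JDK streams so it runs without Hadoop on the classpath; the class and its fields are hypothetical examples.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of the Writable contract: an object serializes itself to a
// DataOutput and repopulates its fields from a DataInput. Hadoop's real
// interface is org.apache.hadoop.io.Writable; this class mimics it with
// JDK streams so the example is self-contained.
public class PageViewSketch {
    String url;   // hypothetical fields for illustration
    long views;

    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(views);
    }

    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        views = in.readLong();
    }

    // Serialize to bytes and back, as the framework does around the shuffle.
    static PageViewSketch roundTrip(PageViewSketch p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PageViewSketch copy = new PageViewSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageViewSketch a = new PageViewSketch();
        a.url = "http://example.org";
        a.views = 42;
        PageViewSketch b = roundTrip(a);
        System.out.println(b.url + " " + b.views); // fields survive the round trip
    }
}
```

A key type used in a real job would also implement compareTo, since sorted keys are what make the reduce-side merge possible.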
Reading Data
Datasets are specified by InputFormats.
Each split is divided into records, and the map processes each record (a key-value pair) in turn.
Splits and records are logical; they are not physically bound to a file.
KeyValueTextInputFormat
SequenceFileInputFormat
SequenceFileAsTextInputFormat
Record Readers
KeyValueRecordReader:
- Used by KeyValueTextInputFormat
WritableComparator
Configured through JobConf.setOutputValueGroupingComparator()
Partitioner
The Reducer
Analogous to InputFormat:
- TextOutputFormat writes "key value <newline>" strings to the output file
- SequenceFileOutputFormat uses a binary format to pack key-value pairs
- NullOutputFormat discards output
Preliminaries
Writing a program in MapReduce has a certain flow to it.
Configuration
Before writing a MapReduce program, we need to set up and configure the development environment:
- In the IDE, create a new project and add all the JAR files from the top level of the distribution and from the lib directory
- For Eclipse, plugins are also available
- Commercial IDEs also exist (Karmasphere)
Alternatives
Local Execution
Use the GenericOptionsParser, Tool, and ToolRunner classes.
Cluster Execution
Packaging
Launching a Job
The WebUI
Hadoop Logs
Running Dependent Jobs, and Oozie
Hadoop Deployments
Cluster deployment:
- Private cluster
- Cloud-based cluster
- AWS Elastic MapReduce
Outlook:
- Cluster specification:
  - Hardware
  - Network topology
- Hadoop configuration:
  - Memory considerations
Cluster Specification
Commodity hardware:
- Commodity ≠ low-end
- Commodity ≠ high-end
A 2010 specification:
- 2 quad-core CPUs
- 16-24 GB ECC RAM
- 4 × 1 TB SATA disks
- Gigabit Ethernet
Example:
- Assume your data grows by 1 TB per week
- Assume you have three-way replication in HDFS
- You need an additional 3 TB of raw storage per week
- Allow for some overhead (temporary files, logs)
- This is a new machine per week
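The sizing arithmetic in the example can be made explicit; the 30% overhead figure below is an illustrative assumption, not a number from the slides.

```java
// Back-of-the-envelope cluster sizing from the example above:
// 1 TB/week of new data, three-way HDFS replication, plus some
// headroom for temporary files and logs (the 30% figure is an
// illustrative assumption).
public class StorageSizingSketch {
    static double rawTbPerWeek(double dataTbPerWeek, int replication, double overhead) {
        return dataTbPerWeek * replication * (1.0 + overhead);
    }

    public static void main(String[] args) {
        double raw = rawTbPerWeek(1.0, 3, 0.30);
        System.out.printf("%.1f TB/week%n", raw); // prints "3.9 TB/week"
    }
}
```

With roughly 4 TB of usable disk per node (as in the 2010 specification), that is indeed about one new machine per week.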
Typical configuration
Features
Hadoop Configuration
There are a handful of files for controlling the operation of a Hadoop cluster.
hadoop-env.sh (Bash script): Environment variables that are used in the scripts to run Hadoop.
core-site.xml (Hadoop configuration XML): I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml (Hadoop configuration XML): Settings for the namenode, the secondary namenode, and the datanodes.
mapred-site.xml (Hadoop configuration XML): Settings for the jobtracker and the tasktrackers.
masters (plain text): A list of machines that each run a secondary namenode.
slaves (plain text): A list of machines that each run a datanode and a tasktracker.
DataNode: 1 GB
TaskTracker: 1 GB
Child JVM map task: 2 × 200 MB
Child JVM reduce task: 2 × 200 MB
This holds for the cluster configuration, but also for job-specific configurations.
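In the classic (0.20/1.x) configuration, the figures above correspond to the per-TaskTracker slot counts and the child JVM heap size; a sketch of the relevant mapred-site.xml fragment, assuming the classic property names:

```xml
<!-- mapred-site.xml fragment: the task-memory knobs behind the
     "2 x 200 MB" figures above (classic Hadoop 0.20/1.x names). -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value> <!-- concurrent map slots per TaskTracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value> <!-- concurrent reduce slots per TaskTracker -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value> <!-- heap for each child task JVM -->
  </property>
</configuration>
```

Total task memory per node is then slots × heap, which is what makes the per-daemon accounting above add up.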
Hadoop on EC2
Launch instances of a cluster on demand, paying by the hour.
Example: launch a cluster test-hadoop-cluster, with one master node (JobTracker and NameNode) and 5 worker nodes (DataNodes and TaskTrackers):

  hadoop-ec2 launch-cluster test-hadoop-cluster 5

See the project webpage and Chapter 9, page 290 of [11].
Hadoop as a service
References
[5] James Hamilton. Cooperative expendable micro-slice servers (CEMS): Low cost, low power servers for internet-scale services. In Proc. of the 4th Biennial Conference on Innovative Data Systems Research (CIDR), 2009.