
Adam Kawa

Data Engineer @ Spotify

Apache Hadoop
In Theory And Practice
Get insights to offer a better product
More data usually beats better algorithms
Get insights to make better decisions
Avoid guesstimates
Gain a competitive advantage
Why Data?
Store data reliably
Analyze data quickly
Cost-effective way
Use an expressive, high-level language
What Is Challenging?
A big system of machines, not a big machine
Failures will happen
Move computation to data, not data to computation
Write complex code only once, but right
Fundamental Ideas
A system of multiple animals
An open-source Java software
Storing and processing of very large data sets
A cluster of commodity machines
A simple programming model
Apache Hadoop
Two main components:
HDFS - a distributed file system
MapReduce - a distributed processing layer
Apache Hadoop
Many other tools belong to Apache Hadoop Ecosystem
Store large datasets in a distributed, scalable and fault-tolerant way
High throughput
Very large files
Streaming reads and writes (no edits)
Write once, read many times
The Purpose Of HDFS
It is like a big truck
to move heavy stuff
(not a Ferrari)
Do NOT use, if you have
Low-latency requests
Random reads and writes
Lots of small files
Then better to consider
File servers,
HBase or Cassandra...
HDFS Mis-Usage
Split a very large file into smaller (but still large) blocks
Store them redundantly on a set of machines
Splitting Files And Replicating Blocks
The default block size is 64MB
Minimize the overhead of a disk seek operation (less than 1%)
A file is simply sliced into chunks every 64MB (or so)
It does NOT matter whether it is text/binary, compressed or not
It does matter later (when reading the data)

Splitting Files Into Blocks
Today, 128MB or 256MB is recommended
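The byte-based slicing above can be sketched in a few lines of Python. This is a toy model, not HDFS code; the function name is made up:

```python
def split_into_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Return (offset, length) pairs for the blocks of a file.

    HDFS slices files purely by byte offset -- whether the content is
    text, binary or compressed does not matter at write time.
    """
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size_bytes, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file becomes 3 full 64 MB blocks plus one 8 MB tail block
blocks = split_into_blocks(200 * 1024 * 1024)
```

Note that the last block of a file is usually smaller than the block size; it occupies only as much local disk as it actually holds.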
The default replication factor is 3
It can be changed per file or directory
It can be increased for hot datasets (temporarily or permanently)
Trade-off between
Reliability, availability, performance
Disk space
Replicating Blocks
The Master node keeps and manages all metadata information
The Slave nodes store blocks of data and serve them to the client
Master And Slaves
Master node (called NameNode)
Slave nodes (called DataNodes)
*no NameNode HA, no HDFS Federation
Classical* HDFS Cluster
Manages metadata
Does some operations for the client
Stores and retrieves
blocks of data
Performs all the metadata-related operations
Keeps information in RAM (for fast lookup)
The filesystem tree
Metadata for all files/directories (e.g. ownership, permissions)
Names and locations of blocks
Metadata (not all) is additionally stored on disk(s) (for reliability)
The filesystem snapshot (fsimage) + editlog (edits) files
HDFS NameNode
Stores and retrieves blocks of data
Data is stored as regular files on a local filesystem (e.g. ext4)
e.g. blk_-992391354910561645 (+ checksums in a separate file)
A block itself does not know which file it belongs to!
Sends a heartbeat message to the NN to say that it is still alive
Sends a block report to the NN periodically
HDFS DataNode
NOT a failover NameNode
Periodically merges a prior snapshot (fsimage) and editlog(s) (edits)
Fetches current fsimage and edits files from the NameNode
Applies edits to fsimage to create the up-to-date fsimage
Then sends the up-to-date fsimage back to the NameNode
We can configure frequency of this operation

Reduces the NameNode startup time
Prevents edits from becoming too large
HDFS Secondary NameNode
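The checkpoint described above (merge fsimage + edits) can be modelled as replaying a list of operations over a snapshot. A toy Python sketch, with a made-up dict-based fsimage and made-up edit tuples:

```python
def apply_edits(fsimage, edits):
    """Replay an edit log on top of a filesystem snapshot (fsimage).

    Hypothetical toy model: fsimage is a dict of path -> metadata,
    each edit is an (op, path, value) tuple. Real fsimage/edits files
    are binary, but the idea is the same.
    """
    image = dict(fsimage)  # work on a copy, like the Secondary NN does
    for op, path, value in edits:
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
        elif op == "chown":
            image[path] = {**image[path], "owner": value}
    return image

fsimage = {"/toplist": {"owner": "hdfs"}}
edits = [("create", "/toplist/poland.txt", {"owner": "kawaa"}),
         ("chown", "/toplist", "kawaa")]
checkpoint = apply_edits(fsimage, edits)  # the up-to-date fsimage
```

After the merge, the up-to-date snapshot replaces the old one and the edit log can be truncated, which is what keeps NameNode restarts fast.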
hadoop fs -ls -R /user/kawaa
hadoop fs -cat /toplist/2013-05-15/poland.txt | less
hadoop fs -put logs.txt /incoming/logs/user
hadoop fs -count /toplist
hadoop fs -chown kawaa:kawaa /toplist/2013-05-15/poland.avro
Exemplary HDFS Commands
It is distributed, but it gives you a beautiful abstraction!
Block data is never sent through the NameNode
The NameNode redirects a client to an appropriate DataNode
The NameNode chooses a DataNode that is as close as possible
Reading A File From HDFS
$ hadoop fs -cat /toplist/2013-05-15/poland.txt
Blocks' locations
Lots of data comes
from DataNodes
to a client
Network topology defined by an administrator in a supplied script
Converts an IP address into a path to a rack (e.g. /dc1/rack1)
A path is used to calculate distance between nodes

HDFS Network Topology
Image source: Hadoop: The Definitive Guide by Tom White
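The distance calculation mentioned above is essentially counting hops to the closest common ancestor in the topology tree. A minimal sketch, assuming paths like /dc1/rack1/node1 (the function name is illustrative):

```python
def distance(path_a, path_b):
    """Hops between two nodes in the topology tree.

    Same node -> 0, same rack -> 2, same data center -> 4,
    different data centers -> 6 (with a /dc/rack/node layout).
    """
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0  # length of the shared prefix (common ancestor depth)
    while common < min(len(a), len(b)) and a[common] == b[common]:
        common += 1
    # hops up from each node to the common ancestor
    return (len(a) - common) + (len(b) - common)
```

The NameNode uses such distances to point a reading client at the closest replica.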
Pluggable (default in BlockPlacementPolicyDefault.java)

1st replica on the same node where a writer is located
Otherwise random (but not too busy or almost full) node is used
2nd and 3rd replicas on two different nodes in a different rack

The rest are placed on random nodes
No DataNode with more than one replica of any block
No rack with more than two replicas of the same block (if possible)
HDFS Block Placement
Moves block from over-utilized DNs to under-utilized DNs
Stops when HDFS is balanced
Maintains the block placement policy
HDFS Balancer
the utilization of every DN
differs from the utilization
of the cluster by no
more than a given threshold
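The stop condition can be written down directly. A toy Python check, where utilization is a fraction of disk space used (names are made up):

```python
def is_balanced(dn_utilization, threshold=0.10):
    """Balancer's stop condition: every DataNode's disk utilization is
    within `threshold` of the cluster-wide average utilization."""
    cluster = sum(dn_utilization) / len(dn_utilization)
    return all(abs(u - cluster) <= threshold for u in dn_utilization)

# Cluster average is 0.5; the node at 0.9 exceeds a 10% threshold,
# so the balancer would keep moving blocks off it
skewed = [0.3, 0.5, 0.9, 0.3]
even = [0.45, 0.5, 0.55]
```

The real balancer also respects the block placement policy while moving replicas, so balancing never reduces reliability.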
Why a block itself does NOT know
which file it belongs to?
HDFS Block
Why a block itself does NOT know
which file it belongs to?
Design decision: simplicity, performance
Filename, permissions, ownership etc might change
It would require updating all block replicas that belong to a file
HDFS Block
Why NN does NOT store information
about block locations on disks?
HDFS Metadata
Why NN does NOT store information
about block locations on disks?
Design decision: simplicity
They are sent by DataNodes as block reports periodically
Locations of block replicas may change over time
A change in IP address or hostname of DataNode
Balancing the cluster (moving blocks between nodes)
Moving disks between servers (e.g. failure of a motherboard)
HDFS Metadata
How many files represent a block replica in HDFS?
HDFS Replication
How many files represent a block replica in HDFS?
Actually, two files:
The first file for data itself
The second file for the block's metadata
Checksums for the block data
The block's generation stamp
HDFS Replication
by default less than 1%
of the actual data
Why does the default block placement strategy NOT take the disk
space utilization (%) into account?
HDFS Block Placement
It only checks if a node
a) has enough disk space to write a block, and
b) does not serve too many clients ...
Why does the default block placement strategy NOT take the disk
space utilization (%) into account?
Some DataNodes might become overloaded by incoming data
e.g. a newly added node to the cluster
HDFS Block Placement
Runs on top of a native file system (e.g. ext3, ext4, xfs)
HDFS is simply a Java application that uses a native file system
HDFS And Local File System
HDFS detects corrupted blocks
When writing
Client computes the checksums for each block
Client sends checksums to a DN together with data
When reading
Client verifies the checksums when reading a block
If verification fails, NN is notified about the corrupt replica
Then a DN fetches a different replica from another DN
HDFS Data Integrity
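The write/read checksum handshake can be sketched as follows. HDFS uses CRC-32 over fixed-size chunks (io.bytes.per.checksum, 512 bytes by default); the helper names here are illustrative, not Hadoop API:

```python
import zlib

CHUNK = 512  # HDFS checksums every io.bytes.per.checksum bytes (default 512)

def checksums(block_data):
    """Write path: the client computes one CRC-32 per 512-byte chunk
    and sends them to the DataNode together with the data."""
    return [zlib.crc32(block_data[i:i + CHUNK])
            for i in range(0, len(block_data), CHUNK)]

def verify(block_data, expected):
    """Read path: the client recomputes the checksums and compares.
    A mismatch means the replica is corrupt (and the NN is notified)."""
    return checksums(block_data) == expected

data = b"x" * 2048
good = checksums(data)                      # stored alongside the block
corrupted = data[:100] + b"?" + data[101:]  # a single flipped byte
```

Because checksums cover small chunks, a single bad byte is caught without re-reading or re-transferring the whole 64MB block from scratch.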
Stats based on Yahoo! Clusters
An average file ≈ 1.5 blocks (block size = 128 MB)
An average file ≈ 600 bytes in RAM (1 file and 2 block objects)
100M files ≈ 60 GB of metadata
1 GB of metadata ≈ 1 PB of physical storage (but usually less*)
*Sadly, based on practical observations, the block-to-file ratio tends to
decrease during the lifetime of the cluster
Dekel Tankel, Yahoo!
HDFS NameNode Scalability
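The Yahoo! rule of thumb above is simple arithmetic. A sketch (decimal GB; the 600-bytes-per-average-file figure is the one quoted on the slide):

```python
def metadata_ram_gb(num_files, bytes_per_file=600):
    """Back-of-the-envelope NameNode RAM estimate: an average file
    costs ~600 bytes of metadata (1 file object + ~2 block objects,
    per the Yahoo! stats). Uses decimal GB (1e9 bytes)."""
    return num_files * bytes_per_file / 1e9

# 100M files -> 60 GB of metadata, which must all fit in one JVM heap
ram = metadata_ram_gb(100_000_000)
```

This is why lots of small files hurt: metadata cost is per object, not per byte, so the single NameNode's RAM, not disk space, becomes the scaling limit.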
Read/write operations throughput limited by one machine
~120K read ops/sec
~6K write ops/sec
MapReduce tasks are also HDFS clients
Internal load increases as the cluster grows
More block reports and heartbeats from DataNodes
More MapReduce tasks sending requests
Bigger snapshots transferred from/to Secondary NameNode
HDFS NameNode Performance
Single NameNode
Keeps all metadata in RAM
Performs all metadata operations
Becomes a single point of failure (SPOF)
HDFS Main Limitations
Introduce multiple NameNodes in the form of:
HDFS Federation
HDFS High Availability (HA)
HDFS Main Improvements
"ind More>
In practice
A DataNode cannot start on a server for some reason
Usually it means some kind of disk failure
$ ls /disk/hd12/
ls: reading directory /disk/hd12/: Input/output error
Too many failed volumes - current valid volumes: 11,
volumes configured: 12, volumes failed: 1, Volume failures tolerated: 0
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
Increase dfs.datanode.failed.volumes.tolerated
to avoid expensive block replication when a disk fails
(and just monitor failed disks)
It was exciting to see stuff breaking!
In practice
A user can not run resource-intensive Hive queries
It happened immediately after expanding the cluster
The queries are valid
The queries are resource-intensive
The queries run successfully on a small dataset
But they fail on a large dataset
Surprisingly, they run successfully through other user accounts
The user has the right permissions to HDFS directories and Hive tables
The NameNode is throwing thousands of warnings and exceptions
14592 times during only 8 min (4768/min at peak)
Hadoop is a very trusty elephant
The username comes from the client machine (and is not verified)
The groupname is resolved on the NameNode server
Using the shell command ''id -Gn <username>''
If a user does not have an account on the NameNode server
The ExitCodeException exception is thrown
Possible Fixes
Create a user account on the NameNode server (dirty, insecure)
Use AD/LDAP for user-group resolution
If you also need the full-authentication, deploy Kerberos
Aur "i6
We decided to use LDAP for user-group resolution
However, LDAP settings in Hadoop did not work for us
Because posixGroup is not a supported group filter class in Hadoop
We found a workaround using nsswitch.conf
Lessons Learned
Know who is going to use your cluster
Know who is abusing the cluster (HDFS access and MR jobs)
Parse the NameNode logs regularly
Look for FATAL, ERROR, Exception messages
Especially before and after expanding the cluster
Programming model inspired by functional programming
map() and reduce() functions processing <key, value> pairs
Useful for processing very large datasets in a distributed way
Simple, but very expressive
MapReduce Model

Map And Reduce Functions

Map And Reduce Functions -
Counting Words

MapReduce Job
Input data is divided into splits and converted into <key, value> pairs
Invokes the map() function multiple times
Keys are sorted, values are not (but could be)
Invokes the reduce() function multiple times

MapReduce Example: ArtistCount
Artist, Song, Timestamp, User
Key is the offset of the line
from the beginning
of the file
We could specify which artist
goes to which reducer
(HashPartitioner is the default one)
map(Integer key, EndSong value, Context context):
    context.write(value.artist, 1)

reduce(String key, Iterator<Integer> values, Context context):
    int count = 0
    for each v in values:
        count += v
    context.write(key, count)
MapReduce Example: ArtistCount
Pseudo-code in a
non-existing language ;)
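The same ArtistCount job can be simulated end to end in regular Python: a map step, a shuffle that groups values by key (with sorted keys), and a reduce step. A toy in-memory model with made-up sample data, not Hadoop code:

```python
from collections import defaultdict

def map_fn(offset, end_song):
    """One (offset, line) pair in -> one (artist, 1) pair out."""
    artist = end_song.split(",")[0]
    yield artist, 1

def reduce_fn(artist, counts):
    """All values for one key in -> one (artist, total) pair out."""
    yield artist, sum(counts)

def run_job(lines):
    """Tiny in-memory MapReduce: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    offset = 0
    for line in lines:
        for key, value in map_fn(offset, line):
            groups[key].append(value)       # shuffle: group values by key
        offset += len(line) + 1             # +1 for the newline
    return dict(kv for key in sorted(groups)  # keys sorted, values not
                for kv in reduce_fn(key, groups[key]))

plays = ["Coldplay,Yellow,1368606000,kawaa",
         "Coldplay,Paradise,1368606100,adam",
         "ABBA,Waterloo,1368606200,kawaa"]
result = run_job(plays)
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network, but the dataflow is exactly this.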

MapReduce Combiner
Make sure that the Combiner
combines quickly and reduces enough data
(otherwise it only adds overhead)
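The savings a Combiner gives can be quantified on a toy example: it pre-aggregates a single map task's own output, so far fewer <key, value> pairs cross the network during the shuffle. A sketch with made-up data and names:

```python
from collections import Counter

def emitted_pairs(map_output, use_combiner):
    """How many (artist, 1) pairs one map task ships to the reducers.

    Without a combiner, every raw pair is shuffled; with one, the
    task's output is pre-summed per key, so only one pair per
    distinct key leaves the node.
    """
    if not use_combiner:
        return len(map_output)
    return len(Counter(artist for artist, _ in map_output))

# One map task that saw 1000 Coldplay plays and 500 ABBA plays:
one_task = [("Coldplay", 1)] * 1000 + [("ABBA", 1)] * 500
```

Here 1500 pairs shrink to 2, which is why a combiner pays off for associative, commutative operations like counting, but only when keys actually repeat within a task.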

Data Locality in HDFS and MapReduce
Ideally, Mapper code is sent
to a node that has
a replica of this block
By default, three replicas
should be available somewhere
on the cluster
Batch processing system
Automatic parallelization and distribution of computation
Deals with all messy details related to distributed processing
Relatively easy to use for programmers
Java API, Streaming (Python, Perl, Ruby )
Apache Pig, Apache Hive, (S)Crunch
Status and monitoring tools
MapReduce Implementation

Classical MapReduce Daemons
Keeps track of TTs,
schedules jobs and tasks executions
Runs map and reduce tasks,
reports to the JobTracker
Manages the computational resources
Available TaskTrackers, map and reduce slots
Schedules all user jobs
Schedules all tasks that belong to a job
Monitors tasks executions
Restarts failed tasks and speculatively runs slow ones
Calculates job counters totals
JobTracker Responsibilities
Runs map and reduce tasks
Reports to JobTracker
Heartbeats saying that it is still alive
Number of free map and reduce slots
Task progress, status, counters etc
TaskTracker Responsibilities
It can consist of 1, 5, 100 or 4000 nodes
Apache Hadoop Cluster

MapReduce Job Submission
Image source: Hadoop: The Definitive Guide by Tom White
Tasks are started
in a separate JVM
to isolate user code
from Hadoop code
They are copied with
a higher replication
(by default, 10)

MapReduce: Sort And Shuffle Phase
Other map tasks
Other reduce tasks
Map phase / Reduce phase
Image source: Hadoop: The Definitive Guide by Tom White
Specifies which Reducer should get a given <key, value> pair
Aim for an even distribution of the intermediate data
Skewed data may overload a single reducer
And make a job run much longer

MapReduce: Partitioner
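A hash partitioner simply maps each key to one of the reducers. A sketch using a stable hash (Python's built-in hash() is randomized per process, so CRC-32 stands in here for Java's hashCode()):

```python
import zlib

def partition(key, num_reducers):
    """Sketch of a hash partitioner: the same key always lands on the
    same reducer (Hadoop's default HashPartitioner uses
    key.hashCode() mod numReduceTasks)."""
    return zlib.crc32(key.encode()) % num_reducers

# Every (Coldplay, 1) pair from every map task goes to one reducer,
# so that reducer sees all of Coldplay's counts together
r = partition("Coldplay", 4)
```

A custom partitioner is worth writing exactly when the default hash gives a skewed split, e.g. one very popular key overloading a single reducer.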
Scheduling a redundant copy of the remaining, long-running task
The output from the one that finishes first is used
The other one is killed, since it is no longer needed
An optimization, not a feature to make jobs run more reliably

Speculative Execution
Enable, if tasks often experience external problems e.g. hardware
degradation (disk, network card), system problems, memory
Speculative execution can reduce overall throughput
Redundant tasks run with similar speed as non-redundant ones
Might help one job, all the others have to wait longer for slots
Redundantly running reduce tasks will
transfer over the network all intermediate data
write their output redundantly (for a moment) to directory in HDFS

Speculative Execution
Very customizable
Input/Output Format, RecordReader/Writer,
Partitioner, Writable, Comparator
Unit testing with MRUnit
HPROF profiler can give a lot of insights
Reuse objects (especially keys and values) when possible
Split Strings efficiently, e.g. StringUtils instead of String.split
More about Hadoop Java API
Java API
Tons of configuration parameters
Input split size (~implies the number of map tasks)
Number of reduce tasks
Available memory for tasks
Compression settings
and more...
MapReduce Job Configuration
Why is each line in a text file, by default, converted to
<offset, line> instead of <line_number, line>?
MapReduce Input <Key, Value> Pairs
Why is each line in a text file, by default, converted to
<offset, line> instead of <line_number, line>?
If your lines are not fixed-width,
you need to read file from the beginning to the end
to find line_number of each line (thus it is not parallelized).
MapReduce Input <Key, Value> Pairs
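This is also why a mapper can start reading its split mid-file: it seeks straight to the split's start offset, skips the partial first line (the previous split owns it), and reads whole lines from there. A simplified sketch; real TextInputFormat handles more edge cases at split boundaries:

```python
def records_from_split(data, start, end):
    """Return (offset, line) records for one input split of `data`.

    Toy model of a line record reader: seek to `start`, skip the
    possibly-partial first line unless this is the first split, then
    read whole lines, finishing the line that crosses `end`.
    """
    records = []
    pos = start
    if start > 0:                         # skip the partial first line
        nl = data.find(b"\n", start)
        if nl == -1:
            return records
        pos = nl + 1
    while pos < len(data) and pos <= end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        records.append((pos, data[pos:nl]))
        pos = nl + 1
    return records

data = b"first\nsecond\nthird\n"
# Two splits over the same bytes: every line is read exactly once,
# and no split ever had to scan from byte 0 to number its lines
split_a = records_from_split(data, 0, 9)
split_b = records_from_split(data, 9, len(data))
```

With <line_number, line> keys this trick would be impossible: numbering lines requires a sequential scan from the start of the file, which kills parallelism.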
In practice
I noticed that a bottleneck seems to be coming from the map tasks.
Is there any reason that we can't open any of the allocated reduce slots
to map tasks?
How to hard-code the number of map and reduce slots?
We initially started with 60/40
But today we are closer to something like 70/30
Occupied Map And Reduce Slots / Time Spent In Occupied Slots
This may change again soon ...
We are currently introducing a new feature to Luigi
Automatic settings of
Maximum input split size (~implies the number of map tasks)
Number of reduce tasks
More settings soon (e.g. size of map output buffer)
The goal is each task running 5-15 minutes on average
Because even a perfect manual setting
may become outdated
as input size grows
The current PoC ;)
It should help in extreme cases
short-living maps
short-living and long-living reduces
type # map # reduce avg map time avg reduce time job execution time
old_1 4826 25 46sec 1hrs, 52mins, 14sec 2hrs, 52mins, 16sec
new_1 391 294 4mins, 46sec 8mins, 24sec 23mins, 12sec
type # map # reduce avg map time avg reduce time job execution time
old_2 4936 800 7mins, 30sec 22mins, 18sec 5hrs, 20mins, 1sec
new_2 4936 1893 8mins, 52sec 7mins, 35sec 1hrs, 18mins, 29sec
In practice
Surprisingly, Hive queries are running extremely long
Thousands of tasks are constantly being killed
Only 1 task failed,
2x more tasks were killed
than were completed
Logs show that the JobTracker gets a request to kill the tasks
Who actually can send a kill request?
User (using e.g. mapred job -kill-task)
JobTracker (a speculative duplicate, or when a whole job fails)
Fair Scheduler
Diplomatically, it's called preemption
Key Observations
Killed tasks came from ad-hoc and resource-intensive Hive queries
Tasks are usually killed quickly after they start
Surviving tasks are running fine for a long time
Hive queries are running in their own Fair Scheduler's pool
FairScheduler has a license to kill!
Preempt the newest tasks
in an over-share pool
to forcibly make some room
for starving pools
Hive pool was running over its minimum and fair shares
Other pools were running under their minimum and fair shares
So that
Fair Scheduler was (legally) killing Hive tasks from time to time
"air -cheduler can kill to be KI,.555
?ossible "i6es
Disable the preemption
Tune minimum shares based on your workload
Tune preemption timeouts based on your workload
Limit the number of map/reduce tasks in a pool
Limit the number of jobs in a pool
Switch to Capacity Scheduler
Lessons Learned
A scheduler should NOT be treated as a ''black box''
It is so easy to implement a long-running Hive query
More About
Hadoop Adventures
In reality
Hadoop is fun!
Stockholm and Sweden
Check out or @Spotifyjobs
for more information
Want to join the band?

Thank you!