Dhruba Borthakur
Apache Hadoop Developer
Facebook Data Infrastructure
dhruba@apache.org, dhruba@facebook.com
Condor Week, April 22, 2009
Outline
• Architecture of Hadoop Distributed File System
• Hadoop usage at Facebook
• Ideas for Hadoop related research
Who Am I?
• Hadoop Developer
– Core contributor since Hadoop’s infancy
– Project Lead for Hadoop Distributed File System
• Facebook (Hadoop, Hive, Scribe)
• Yahoo! (Hadoop in Yahoo Search)
• Veritas (San Point Direct, Veritas File System)
• IBM Transarc (Andrew File System)
• UW Computer Science Alumnus (Condor Project)
Hadoop, Why?
• Need to process multi-petabyte datasets
• Expensive to build reliability in each application.
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need common infrastructure (see the sketch after this list)
– Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor's, but
– Workloads are IO-bound rather than CPU-bound
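Why common infrastructure pays off: in the classic word-count job sketched below, the application supplies only the map and reduce logic, while the framework handles scheduling, retries, and node failures. A minimal sketch, assuming a Hadoop 2.x-style org.apache.hadoop.mapreduce API; class names and paths are illustrative, not from the talk.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every token; if the node running this
      // task fails, the framework re-runs it elsewhere.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation saves IO
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Run with something like: hadoop jar wordcount.jar WordCount <input dir> <output dir> (paths hypothetical). Note that everything about failure handling lives in the framework, not in this code.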
Hive, Why?
• Need a multi-petabyte warehouse
• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular (see the query sketch below)
• Need an open data format
– RDBMS data formats are closed
– A flexible schema is needed
• Hive is a Hadoop subproject!
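To illustrate the table-and-SQL abstraction Hive layers over files in Hadoop, the sketch below runs an aggregation over a hypothetical docs table through Hive's JDBC driver. The HiveServer2 endpoint, database, table, and column names are all assumptions for illustration; Hive compiles the SQL into MapReduce jobs on the cluster.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        // Register Hive's JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint; host, port, and database are assumptions.
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Hive turns this SQL into MapReduce jobs over files stored in HDFS;
        // the docs table and its columns are made up for illustration.
        ResultSet rs = stmt.executeQuery(
            "SELECT word, COUNT(*) AS cnt FROM docs GROUP BY word");
        while (rs.next()) {
          System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
        }

        rs.close();
        stmt.close();
        conn.close();
      }
    }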
Hadoop & Hive History
• Dec 2004 – Google MapReduce paper published
• Jul 2005 – Nutch uses MapReduce
• Feb 2006 – Hadoop becomes a Lucene subproject
• Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster
• Jan 2008 – Hadoop becomes an Apache top-level project
• Jul 2008 – A 4000-node test cluster
• Sep 2008 – Hive becomes a Hadoop subproject
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Commodity Hardware
[HDFS architecture diagram: the Client sends (1) a filename to the NameNode, receives (2) block ids and DataNode locations, then (3) reads data directly from the DataNodes. A Secondary NameNode and cluster membership are also shown; the read path is sketched below.]
• Test cluster
– 800 cores, 16GB each
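A minimal sketch of the read path in the diagram above, using Hadoop's public FileSystem client API: opening a path performs steps 1 and 2 (the client asks the NameNode for block ids and DataNode locations), and the returned stream performs step 3 (reading block data directly from the DataNodes). The command-line path argument is illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        // Picks up the cluster address from the standard Hadoop config files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: the client sends the filename to the NameNode and gets
        // back block ids plus the DataNodes that hold each block.
        Path path = new Path(args[0]);  // e.g. an HDFS path passed on the command line
        try (FSDataInputStream in = fs.open(path)) {
          // Step 3: bytes stream directly from the DataNodes; the NameNode
          // is never on the data path.
          BufferedReader reader = new BufferedReader(new InputStreamReader(in));
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }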
Data Flow
[Diagram: data flow across the network and storage layers.]