Dhruba Borthakur
Apache Hadoop Developer
Facebook Data Infrastructure
dhruba@apache.org, dhruba@facebook.com
Condor Week, April 22, 2009
Outline
Architecture of Hadoop Distributed File System
Hadoop usage at Facebook
Ideas for Hadoop related research
Who Am I?
Hadoop Developer
Core contributor since Hadoop's infancy
Project Lead for Hadoop Distributed File System
Facebook (Hadoop, Hive, Scribe)
Yahoo! (Hadoop in Yahoo Search)
Veritas (San Point Direct, Veritas File System)
IBM Transarc (Andrew File System)
UW Computer Science Alumni (Condor Project)
Hadoop, Why?
Need to process Multi Petabyte Datasets
Expensive to build reliability in each application.
Nodes fail every day
Failure is expected, rather than exceptional.
The number of nodes in a cluster is not constant.
Need common infrastructure
Efficient, reliable, Open Source Apache License
The above goals are the same as Condor's, but
Workloads are IO-bound, not CPU-bound
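Hadoop addresses these goals with the MapReduce programming model running over a distributed file system. As a rough illustration of that model (a local, single-process Python simulation, not Hadoop's actual Java API), word count looks like:

```python
from collections import defaultdict

def map_phase(line):
    # A mapper emits intermediate (key, value) pairs -- here, (word, 1).
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # A reducer combines all values for one key.
    return word, sum(counts)

def run_job(lines):
    # "Shuffle" step: group intermediate pairs by key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    # Each key is reduced independently; on a real cluster these
    # tasks run in parallel and are re-run if a node fails.
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_job(["hadoop condor hadoop", "condor"]))
```

The point of the abstraction is exactly the bullet above: the framework, not each application, handles node failure, retries, and data placement.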
Hive, Why?
Need a Multi Petabyte Warehouse
Files are insufficient data abstractions
Need tables, schemas, partitions, indices
SQL is highly popular
Need an open data format with a flexible schema
RDBMS have closed data formats
Hive is a Hadoop subproject!
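Hive layers those abstractions (tables, partitions, SQL) on top of files in Hadoop. A sketch of what that looks like in HiveQL (table and column names are hypothetical, for illustration only):

```sql
-- Hypothetical table, partitioned by date.
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING);

-- Partition pruning: only the dt='2009-04-22' partition is scanned,
-- instead of the whole multi-petabyte table.
SELECT url, COUNT(1) AS views
FROM page_views
WHERE dt = '2009-04-22'
GROUP BY url;
```

Hive compiles queries like this into MapReduce jobs, which is why it can stay an open format on ordinary HDFS files.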
Hadoop & Hive History
Dec 2004 Google GFS paper published
July 2005 Nutch uses MapReduce
Feb 2006 Hadoop becomes a Lucene subproject
Apr 2007 Yahoo! on 1000-node cluster
Jan 2008 An Apache Top Level Project
Jul 2008 A 4000 node test cluster
Sept 2008 Hive becomes a Hadoop subproject
Who uses Hadoop?
Amazon/A9
Facebook
Google
IBM
Joost
Last.fm
New York Times
PowerSet
Veoh
Yahoo!
Commodity Hardware
[Architecture diagram: NameNode, Secondary NameNode, Client, Cluster Membership]
Test cluster: 800 cores, 16GB each
Data Flow
[Diagram: data flow between client, network, and storage]
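The data-flow picture for HDFS reads splits control from data: the client asks the NameNode only for metadata (which blocks, on which nodes), then streams the blocks directly from the storage nodes over the network. A toy Python sketch of that split (class and method names are illustrative, not HDFS's real RPC interface):

```python
# Toy model of an HDFS read path: metadata from the NameNode,
# block bytes streamed directly from DataNodes.

class DataNode:
    def __init__(self):
        self.blocks = {}          # block_id -> bytes stored on this node

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    def __init__(self):
        self.block_map = {}       # filename -> [(block_id, datanode), ...]

    def locations(self, filename):
        # Metadata only: no file data ever flows through the NameNode.
        return self.block_map[filename]

def client_read(namenode, filename):
    # 1) Ask the NameNode where the file's blocks live.
    # 2) Fetch each block straight from the DataNode that holds it.
    return b"".join(dn.read_block(bid)
                    for bid, dn in namenode.locations(filename))

# Wiring up one file split across two DataNodes:
nn, d1, d2 = NameNode(), DataNode(), DataNode()
d1.blocks["b0"] = b"hello "
d2.blocks["b1"] = b"hdfs"
nn.block_map["/logs/day1"] = [("b0", d1), ("b1", d2)]
print(client_read(nn, "/logs/day1"))
```

Keeping bulk data off the NameNode is what lets a single metadata server coordinate thousands of IO-bound DataNodes.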