An Architectural Hybrid of MapReduce & DBMS Technologies for Analytical Workloads
• Motivation
• Introduction
• Desired Properties
• Background & Shortfalls
• HadoopDB
• Benchmarks
• Fault Tolerance
• Conclusion
• Related Work
• References
• Analyzing massive structured data on 1000s of shared-nothing nodes
• Shared-nothing architecture:
  • A collection of independent, possibly virtual machines, each with local disk and local main memory, connected together on a high-speed network
• Parallel databases
• Map/Reduce systems
Desired Properties
• Performance
  • A primary characteristic that commercial database systems use to distinguish themselves
• Fault tolerance
• Heterogeneous environments
  • Increasing number of nodes
  • Homogeneity is difficult to maintain
• Flexible query interface
  • Usually JDBC or ODBC
  • UDF mechanism
  • Both SQL and non-SQL interfaces are desirable
Background-PDBMS
• Standard relational tables and SQL
  • Indexing, compression, caching, I/O sharing
  • Tables partitioned over nodes
  • Transparent to the user
• Meets the performance requirement
  • Requires a highly skilled DBA
• Flexible query interfaces
  • UDF support varies across implementations
• Fault tolerance
  • Does not score so well
  • Assumption: failures are rare
  • Assumption: dozens of nodes in clusters
Background-MapReduce
• Satisfies fault tolerance
• Works in heterogeneous environments
• Drawback: performance
  • No performance-enhancing techniques
• Interfaces
  • Write M/R jobs in multiple languages
  • SQL not supported directly (excluding e.g. Hive)
MapReduce (Hadoop)
• MapReduce is a programming model which specifies:
  • A map function that processes a key/value pair to generate a set of intermediate key/value pairs.
  • A reduce function that merges all intermediate values associated with the same intermediate key.
• Hadoop
  • Is a MapReduce implementation for processing large data sets over 1000s of nodes.
  • Maps and Reduces run independently of each other over blocks of data distributed across a cluster
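The map and reduce functions described on this slide can be illustrated with a small single-process sketch. This is not Hadoop code: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the word-count task is an assumed example chosen only to show the key/value flow.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (line number, text) pair and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all values that share the same intermediate key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


lines = [(0, "Hadoop runs map tasks"), (1, "Hadoop runs reduce tasks")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
```

In a real cluster the grouping step is the shuffle between map and reduce tasks, and each map call runs on whichever node holds the input block.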
Background-MapReduce
Differences between Parallel Databases and MapReduce?
HadoopDB
HadoopDB
• Hadoop as communication layer above multiple nodes running single-node DBMS instances
• Full open-source solution:
  • PostgreSQL as DB layer
  • Hadoop as communication layer
  • Hive as translation layer
HadoopDB
RDBMS
• Careful layout of data
• Indexing
• Sorting
• Query optimization
• Compression
Hadoop
• Job scheduling
• Task coordination
• Parallelization
Ideas
• Main goal: achieve the properties described before
• Connect multiple single-datanode systems
  • Hadoop as the task coordination & network communication layer
  • Queries parallelized across the nodes using the MapReduce framework
• Fault tolerant and works on heterogeneous nodes
• Parallel database performance
  • Query processing in the database engine
Architecture Background
• Data storage layer (HDFS)
  • Block-structured file system managed by a central NameNode
  • Files broken into blocks and distributed
• Data processing layer (Map/Reduce framework)
  • Master/slave architecture
  • Job and Task trackers
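As a rough illustration of "files broken into blocks and distributed", the sketch below splits a file into fixed-size blocks and assigns each block to nodes. The helper names are hypothetical and the round-robin placement is an assumption made for brevity; real HDFS placement is decided by the NameNode with its own policy.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_blocks(num_blocks, nodes, replication):
    """Assumed round-robin placement: block b goes to `replication` consecutive nodes."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 600 MB file with 256 MB blocks yields two full blocks and one 88 MB block.
blocks = split_into_blocks(600 * MB, 256 * MB)
placement = place_blocks(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
```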
HadoopDB Components
• Database Connector
• Catalog
• Data Loader
• Planner (SMS)
Database Connector
• Interface between DBMS and TaskTracker
• Responsibilities
  • Connect to the database
  • Execute the SQL query
  • Return the results as key-value pairs
• Achieved goal
  • Data sources are similar to data blocks in HDFS
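The three connector responsibilities (connect, execute, return key-value pairs) can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` as a stand-in for a PostgreSQL node reached over JDBC, and the function name, table, and columns are hypothetical.

```python
import sqlite3


def run_query_as_pairs(conn, sql, params=()):
    """Execute sql and yield (key, value) pairs.

    The first result column is used as the key and the remaining
    columns form the value tuple.
    """
    for row in conn.execute(sql, params):
        yield row[0], row[1:]


# Stand-in for one single-node database in the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (page_url TEXT, page_rank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.html", 12), ("b.html", 3), ("c.html", 25)],
)

pairs = list(
    run_query_as_pairs(
        conn,
        "SELECT page_url, page_rank FROM rankings "
        "WHERE page_rank > ? ORDER BY page_url",
        (10,),
    )
)
conn.close()
```

Because the output has the same key-value shape as records read from an HDFS block, a map task can consume it without knowing that a database produced it.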
Catalog
• Maintains information about databases
  • Database location, driver class
  • Datasets in cluster, replica or partitioning
• Catalog stored as XML file in HDFS
• Plan to deploy as a separate service
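To make the catalog contents concrete, here is a sketch that parses a small XML document and answers "which hosts hold this partition?". The element and attribute names are invented for illustration and are not the actual HadoopDB catalog schema; only the kinds of information (location, driver class, partitions, replicas) come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout, not the real HadoopDB schema.
CATALOG_XML = """<catalog>
  <node host="node1" driver="org.postgresql.Driver" url="jdbc:postgresql://node1/db">
    <partition table="uservisits" id="0" replica="false"/>
  </node>
  <node host="node2" driver="org.postgresql.Driver" url="jdbc:postgresql://node2/db">
    <partition table="uservisits" id="1" replica="false"/>
    <partition table="uservisits" id="0" replica="true"/>
  </node>
</catalog>"""


def load_catalog(xml_text):
    """Return one dict per node with its connection info and partitions."""
    nodes = []
    for node in ET.fromstring(xml_text).findall("node"):
        nodes.append({
            "host": node.get("host"),
            "url": node.get("url"),
            "driver": node.get("driver"),
            "partitions": [
                (p.get("table"), int(p.get("id")), p.get("replica") == "true")
                for p in node.findall("partition")
            ],
        })
    return nodes


def hosts_for(catalog, table, partition_id):
    """List the hosts that store the given partition (primary or replica)."""
    return [
        node["host"]
        for node in catalog
        if any(t == table and pid == partition_id for t, pid, _ in node["partitions"])
    ]


catalog = load_catalog(CATALOG_XML)
```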
Data Loader
• Responsibilities:
  • Globally partition the data on a given key
  • Break single-node data into chunks
  • Bulk-load chunks into single-node databases
• Two main components:
  • Global Hasher
    • Map/Reduce job reads from HDFS and repartitions
  • Local Hasher
    • Copies from HDFS to local file system
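The two-level partitioning done by the Data Loader might look like the sketch below: a global hash assigns each row to a node, then a local hash splits each node's rows into chunks. The function names, the salts, and the use of `zlib.crc32` are assumptions for illustration; the slides do not specify the hash functions used.

```python
import zlib
from collections import defaultdict


def stable_hash(key, salt):
    """Deterministic hash of a key; the salt keeps the two levels independent."""
    return zlib.crc32((salt + str(key)).encode("utf-8"))


def global_hash(rows, key_index, num_nodes):
    """Global Hasher step: assign every row to a node by its partition key."""
    parts = defaultdict(list)
    for row in rows:
        parts[stable_hash(row[key_index], "global:") % num_nodes].append(row)
    return dict(parts)


def local_hash(node_rows, key_index, num_chunks):
    """Local Hasher step: split one node's rows into smaller chunks."""
    chunks = defaultdict(list)
    for row in node_rows:
        chunks[stable_hash(row[key_index], "local:") % num_chunks].append(row)
    return dict(chunks)


rows = [(f"10.0.0.{i}", f"page{i % 3}.html") for i in range(12)]
by_node = global_hash(rows, key_index=0, num_nodes=3)
by_chunk = {
    node: local_hash(node_rows, key_index=0, num_chunks=2)
    for node, node_rows in by_node.items()
}
```

Each resulting chunk would then be bulk-loaded into its own database on the node, so rows with the same key always end up in the same place.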
SMS Planner
• Extends Hive
• Steps
  • Parser transforms query into an abstract syntax tree (AST)
  • Get table schema information from catalog
  • Logical plan generator creates query plan
  • Optimizer breaks up plan into Map or Reduce phases
  • Executable plan generated for one or more MapReduce jobs
  • SMS tries to push maximum work to database layer
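The last step, pushing as much work as possible into the database layer, can be illustrated with an aggregation. In this sketch each node's database runs the GROUP BY itself and the reduce step only merges partial sums. `sqlite3` stands in for the per-node PostgreSQL instances, and the table, columns, and function names are assumptions, not output of the real SMS planner.

```python
import sqlite3
from collections import defaultdict

# SQL that would be pushed down to each node's database (illustrative).
PUSHED_DOWN_SQL = (
    "SELECT source_ip, SUM(ad_revenue) FROM uservisits GROUP BY source_ip"
)


def make_node(rows):
    """Create an in-memory database standing in for one cluster node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE uservisits (source_ip TEXT, ad_revenue REAL)")
    conn.executemany("INSERT INTO uservisits VALUES (?, ?)", rows)
    return conn


def map_phase(conn):
    """Map task: the database does the scan and local aggregation."""
    return conn.execute(PUSHED_DOWN_SQL).fetchall()


def reduce_phase(partials):
    """Reduce task: merge per-node partial sums for each key."""
    totals = defaultdict(float)
    for source_ip, partial_sum in partials:
        totals[source_ip] += partial_sum
    return dict(totals)


nodes = [
    make_node([("1.1.1.1", 2.0), ("2.2.2.2", 1.5)]),
    make_node([("1.1.1.1", 3.0), ("3.3.3.3", 4.0)]),
]
partials = [pair for conn in nodes for pair in map_phase(conn)]
totals = reduce_phase(partials)
```

The map tasks ship only one row per group instead of every raw record, which is the benefit of processing the query inside the database engine.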
Benchmarking
• Environment
  • Amazon EC2 “large” instances
  • Each instance
    • 7.5 GB memory, 2 virtual cores, 850 GB storage, 64-bit Linux Fedora 8
• Systems
  • Hadoop
    • 256 MB data blocks, 1024 MB heap size, 200 MB sort buffer
  • HadoopDB
    • Similar to Hadoop configuration, PostgreSQL 8.2.5, no data compression
  • Vertica
    • Used a cloud edition
    • All data is compressed
  • DBMS-X
    • Commercial parallel row-oriented DBMS
    • Run on EC2 (no cloud edition available)
Benchmarking
• Data used
  • HTTP log files, HTML pages, rankings
• Sizes (per node):
  • 155 million user visits (~20 GB)
  • 18 million rankings (~1 GB)
• Stored as plain text in HDFS
Evaluating HadoopDB
• Compare HadoopDB to
  1. Hadoop
  2. Parallel databases (Vertica, DBMS-X)
• Features:
  1. Performance:
    • We expected HadoopDB to approach the performance of parallel databases
  2. Scalability:
    • We expected HadoopDB to scale as well as Hadoop
We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of 10, 50, and 100 nodes.
Benchmark tasks
• Data loading
• Grep task
• Selection task
• Aggregation task
• Join task
• UDF Aggregation task
• Fault tolerance and heterogeneous environment
Data Load
Query Results
• Load
  • Data loads are slower than Hadoop, but faster than parallel databases
• Runtime
  • Structured data: HadoopDB is faster than Hadoop but slower than parallel databases (HadoopDB’s performance is close to parallel databases)
  • Unstructured data: HadoopDB’s performance matches Hadoop
Scalability: Setup
• Simple aggregation task, full table scan
• Data replicated across 10 nodes
• Fault tolerance: kill a node halfway
• Fluctuation tolerance: slow down a node for the entire experiment
Scalability: Results
• HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks
• Parallel databases restart the entire query on node failure or wait for the slowest node
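The difference between the two behaviours can be sketched with a toy scheduler: work is split into chunks, each chunk has replicas on several nodes, and when a node fails only its unfinished chunks move to another replica holder. All names here are hypothetical, and the model assumes that results of chunks finished before the failure have already been handed off and remain usable.

```python
def schedule(chunk_replicas, failed_node=None):
    """Assign each chunk to the first live node that holds a replica of it."""
    plan = {}
    for chunk, holders in chunk_replicas.items():
        live = [node for node in holders if node != failed_node]
        if not live:
            raise RuntimeError(f"no live replica for chunk {chunk}")
        plan[chunk] = live[0]
    return plan


def run_with_failure(chunk_replicas, failed_node, finished):
    """Return (final chunk-to-node plan, chunks that had to be rescheduled)."""
    initial = schedule(chunk_replicas)
    # Chunks that finished before the failure keep their results.
    final = {c: initial[c] for c in chunk_replicas if c in finished}
    pending = {c: h for c, h in chunk_replicas.items() if c not in finished}
    # Only unfinished chunks that were running on the failed node move.
    rescheduled = [c for c in pending if initial[c] == failed_node]
    final.update(schedule(pending, failed_node))
    return final, rescheduled


chunk_replicas = {
    "c0": ["n1", "n2"],
    "c1": ["n1", "n3"],
    "c2": ["n2", "n3"],
    "c3": ["n3", "n1"],
}
final_plan, rescheduled = run_with_failure(
    chunk_replicas, failed_node="n1", finished={"c0"}
)
```

Here one of the four chunks is redone after the failure, whereas a system without chunk-level scheduling would have to rerun all four.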
To Summarize
• HadoopDB: a hybrid of DBMS and MapReduce
• HadoopDB is close in performance to parallel databases
• HadoopDB is able to operate in truly heterogeneous environments and has the fault tolerance of Hadoop
• Is free and open-source: http://hadoopdb.sourceforge.net
Related Work
• Pig project at Yahoo
• SCOPE project at Microsoft
• Hive project
Future Work
• Integration with other open-source databases
• Full automation of the loading and replication process
• Dynamically adjusting fault-tolerance levels based on failure rate
Thank You!