
How Drill achieves Flexibility with Performance

© 2015 MapR Technologies
Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data



Drill’s Data Model is Flexible

Formats by schema flexibility (fixed to dynamic) and structure (flat to complex):
• Complex, fixed schema: Parquet, Avro
• Complex, dynamic schema: JSON, BSON
• Flat, fixed schema: CSV, TSV
• Flat, dynamic schema: HBase

Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
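
The two Drill-table records above share some fields (name, hobbies) but not others (district, preschool). As a rough illustration of what discovering a schema on the fly can mean, the following Python sketch walks the records and collects every field path it sees. The function name `discover_schema` and the dotted-path notation are assumptions made for this example; this is not how Drill itself is implemented.

```python
from typing import Any

# Two records shaped like the ones on the slide; the values are quoted here
# so that the literals are valid Python.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"],
     "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"],
     "preschool": "CCLC"},
]


def discover_schema(rows: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each dotted field path to the set of Python type names seen for it."""
    schema: dict[str, set[str]] = {}

    def walk(prefix: str, value: Any) -> None:
        if isinstance(value, dict):
            # Nested object: recurse, extending the dotted path.
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        else:
            schema.setdefault(prefix, set()).add(type(value).__name__)

    for row in rows:
        walk("", row)
    return schema


schema = discover_schema(records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

No schema is declared before reading: a field such as `preschool` simply appears in the result once a record that contains it has been seen.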
Drill enables ‘SQL on Everything’

SELECT * FROM dfs.yelp.`business.json`

The table name has three parts:

Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop

Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database

Table (`business.json`)
- Pathnames
- Hive table
- HBase table
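
To make the three-part naming in the query concrete, here is a small Python sketch that splits a reference such as dfs.yelp.`business.json` into storage plugin instance, workspace and table. The `TableRef` type, the `parse_table_ref` helper and the regular expression are hypothetical and cover only this simple shape; Drill's own SQL parser is not shown here.

```python
import re
from typing import NamedTuple


class TableRef(NamedTuple):
    plugin: str      # storage plugin instance, e.g. "dfs"
    workspace: str   # workspace, e.g. "yelp"
    table: str       # table, e.g. "business.json"


# Assumed shape: two dot-separated identifiers, then a table name that is
# either backtick-quoted (so it may contain dots) or a plain identifier.
_PATTERN = re.compile(r"^(\w+)\.(\w+)\.(?:`([^`]+)`|(\w+))$")


def parse_table_ref(text: str) -> TableRef:
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a plugin.workspace.table reference: {text!r}")
    plugin, workspace, quoted, plain = match.groups()
    return TableRef(plugin, workspace, quoted or plain)


ref = parse_table_ref("dfs.yelp.`business.json`")
print(ref)
```

The backticks matter: without them the dot in `business.json` would be indistinguishable from the separators between plugin, workspace and table.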



Drill is a Distributed SQL query engine

[Diagram: drillbits co-located with DataNodes/RegionServers, plus ZooKeeper nodes]

• Scale out
• Columnar and vectorized execution
• Optimistic and pipelined execution (no MR, Spark, Tez)
• Late binding
• Extensible
Drill allows reuse of existing SQL Tools and Skills

• Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
• Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data



Drill is Designed For A Wide Set Of Use Cases

Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …
Data sources: Files ({JSON}, Parquet, Text Files, …), Directories, Hive, HBase



MapR Optimized Data Architecture

[Diagram: MapR Optimized Data Architecture]
• Sources: relational, SaaS, mainframe; documents, emails; blogs, tweets, link data; log files, clickstreams; sensors
• Data movement: data transformation, enrichment and integration
• MapR Distribution for Hadoop: batch (MapReduce, Spark, Hive, Pig), interactive (Drill, Impala) and streaming (Spark Streaming, Storm) processing on the MapR Data Platform (MapR-DB, MapR-FS)
• Data access: machine learning, operational apps (recommendations, fraud detection, logistics), analytics, search, schema-less data exploration, BI and reporting, ad-hoc integrated analytics
• Data warehouse
Architecture – Under the hood

High Level Architecture
Cluster of commodity servers
– Daemon (drillbit) on each node

ZooKeeper maintains ephemeral cluster membership information
– Drillbit uses ZooKeeper to find other drillbits in the cluster
– Client uses ZooKeeper to find drillbits

Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
– Better performance and manageability

Data processing unit is columnar record batches
– Enables schema flexibility with negligible performance impact
Basic Process
1. Query comes to any Drillbit (JDBC, ODBC, CLI, REST)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node

[Diagram: a query arrives at one of several Drillbits, each running over DFS/HBase/Hive, with ZooKeeper alongside]
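
The four steps of the basic process can be sketched as a scatter-gather loop. In this Python sketch the "nodes" are just dictionary entries and the "fragments" are partial sums run on a thread pool; the node names, the data and the function names are all made up for illustration and say nothing about Drill's actual planner or RPC layer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data layout: each node holds one slice of a table.
node_data = {
    "node1": [3, 8, 15],
    "node2": [4, 16],
    "node3": [23, 42, 7],
}


def run_fragment(node: str) -> int:
    """Work done on one node: a partial sum over that node's local rows."""
    return sum(node_data[node])


def run_query() -> int:
    """Driving node: farm one fragment out per node, then combine the results."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partial_sums = list(pool.map(run_fragment, node_data))
    return sum(partial_sums)


total = run_query()
print(total)
```

Each fragment reads only the rows stored on its own node, which is the locality the plan in step 2 is meant to exploit; only the small partial results travel back to the driving node.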



Core Modules within drillbit
• RPC Endpoint
• SQL Parser, which produces the Logical Plan
• Optimizer, which produces the Physical Plan
• Execution
• Storage Plugins: DFS, Hive, HBase, MongoDB



A Query engine that is…

• Columnar/Vectorized

• Optimistic/pipelined

• Runtime compilation

• Late binding

• Extensible



Columnar representation
[Figure: a table with columns A, B, C, D, E and the on-disk layout of column A]



Columnar Encoding

• Values in a column are stored next to one another
– Better compression
– Range-map: save min-max, can skip if not present

• Only retrieve columns participating in query

• Drill optimizes for BOTH columnar storage and execution

[Figure: columns A, B, C, D, E stored separately on disk]
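
The range-map idea (save min and max, skip when the value cannot be present) can be illustrated with a few lines of Python. The `ColumnChunk` class and the `scan_equals` function are invented for this sketch and do not mirror any particular file format's metadata layout.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """A run of values from one column, stored together with its min and max."""
    values: list[int]
    min_value: int
    max_value: int


def make_chunk(values: list[int]) -> ColumnChunk:
    return ColumnChunk(values, min(values), max(values))


def scan_equals(chunks: list[ColumnChunk], target: int) -> tuple[list[int], int]:
    """Return the matching values and the number of chunks actually read."""
    matches: list[int] = []
    chunks_read = 0
    for chunk in chunks:
        # The saved range proves the target is absent, so skip the chunk.
        if target < chunk.min_value or target > chunk.max_value:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk.values if v == target)
    return matches, chunks_read


chunks = [make_chunk([1, 4, 9]), make_chunk([10, 12, 19]), make_chunk([20, 25, 31])]
matches, chunks_read = scan_equals(chunks, 12)
print(matches, chunks_read)
```

Only the middle chunk is read here; the other two are ruled out by their min-max ranges alone. A range that contains the target does not guarantee a match, so the values in a surviving chunk still have to be checked.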



Vectorization
Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions (GCC, LLVM and JVM all do various optimizations
automatically)
– Manually code algorithms

Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
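
As a toy illustration of bitmap-based null checks, the Python sketch below packs one validity bit per value into an integer and then combines two columns' bitmaps with a single AND instead of testing each row. The function names are made up for this example, and a Python integer only stands in for the word-sized operations the slide refers to; it gives no indication of real-world speed.

```python
def build_validity_bitmap(values: list) -> int:
    """Pack one bit per value into an int: bit i is 1 when values[i] is not None."""
    bitmap = 0
    for i, value in enumerate(values):
        if value is not None:
            bitmap |= 1 << i
    return bitmap


def count_non_null_in_both(bitmap_a: int, bitmap_b: int) -> int:
    """One AND over the whole bitmaps replaces a per-row 'is it null?' branch."""
    return bin(bitmap_a & bitmap_b).count("1")


col_a = [10, None, 30, 40, None, 60]
col_b = [1, 2, None, 4, None, 6]

bitmap_a = build_validity_bitmap(col_a)
bitmap_b = build_validity_bitmap(col_b)
both = count_non_null_in_both(bitmap_a, bitmap_b)
print(bin(bitmap_a), bin(bitmap_b), both)
```

Rows 0, 3 and 5 are non-null in both columns, so the combined bitmap has three bits set.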



Optimistic Execution
With a short time horizon, failures are infrequent
– Don’t spend energy and time creating boundaries and checkpoints to
minimize recovery time
– Rerun entire query in face of failure

No barriers

No persistence unless memory overflow
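
A minimal Python sketch of the optimistic approach: no checkpoints are taken, and a failure simply causes the whole query to run again. `NodeFailure`, `run_optimistically` and the retry limit are hypothetical names and choices for this example, not part of Drill.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class NodeFailure(Exception):
    """Hypothetical error raised when a node drops out mid-query."""


def run_optimistically(query: Callable[[], T], max_attempts: int = 3) -> T:
    """Run the query with no checkpoints; on failure, rerun it from the start."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return query()
        except NodeFailure:
            # No partial state was saved, so there is nothing to resume from.
            if attempt == max_attempts - 1:
                raise


calls = {"count": 0}


def flaky_query() -> int:
    """Fails on the first call only, to simulate a rare mid-query failure."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise NodeFailure("node lost during first attempt")
    return 42


result = run_optimistically(flaky_query)
print(result, calls["count"])
```

The trade-off is the one the slide states: when queries are short and failures rare, paying for a full rerun occasionally is cheaper than paying for checkpoints on every query.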



Pipelining
Record batch is the unit of work for Drill
– Operators work on a record batch

Record batches are pipelined between nodes
– ~256kB usually

Operator reconfiguration happens at batch boundaries

[Diagram: record batches flowing between DrillBits]
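
Pipelining over record batches can be sketched with Python generators: each operator pulls one batch from its input, processes it and passes it on, so no operator waits for the full input. The operator names, the batch size of two rows and the third sample row are assumptions for this example; real batches are sized in bytes and move between nodes.

```python
from typing import Iterable, Iterator

Batch = list[dict]


def scan(rows: list[dict], batch_size: int) -> Iterator[Batch]:
    """Source operator: emit the rows in fixed-size record batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_op(batches: Iterable[Batch], min_age: int) -> Iterator[Batch]:
    """Keep rows with age >= min_age, one batch at a time."""
    for batch in batches:
        kept = [row for row in batch if row["age"] >= min_age]
        if kept:
            yield kept


def project_op(batches: Iterable[Batch], column: str) -> Iterator[list]:
    """Keep a single column from each batch."""
    for batch in batches:
        yield [row[column] for row in batch]


rows = [
    {"name": "Michael", "age": 6},
    {"name": "Jennifer", "age": 3},
    {"name": "Alex", "age": 9},   # made-up row so that there is a second batch
]

pipeline = project_op(filter_op(scan(rows, batch_size=2), min_age=5), "name")
result = list(pipeline)
print(result)
```

Because generators are lazy, the first batch passes through the filter and the projection before the scan produces the second one. There is no barrier between operators.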



Runtime Compilation is Faster
[Bar chart: time for 1 million evaluations (ms), y-axis 0 to 500; legend labels: Janino, interpreted; categories: trivial, simple, moderate]
Source: http://bit.ly/16Xk32x
Drill compiler
[Diagram: CodeModel generates code → Janino compiles runtime byte-code → merge byte-code of the two classes (runtime byte-code and precompiled byte-code templates) → loaded class]
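
The difference between interpreting an expression and compiling it at runtime can be shown with a Python analogue. This is not Janino or CodeModel, and it omits the step of merging with precompiled templates; it only contrasts walking an expression tree for every row with generating source once and compiling it into a callable. The tuple-based tree format and all function names are assumptions for this sketch.

```python
import operator

# Expression tree for (a + b) * 2 as nested tuples:
# (op, left, right), ("col", name) or ("lit", value).
expr = ("mul", ("add", ("col", "a"), ("col", "b")), ("lit", 2))

OPS = {"add": operator.add, "mul": operator.mul}
SYMBOLS = {"add": "+", "mul": "*"}


def interpret(node, row):
    """Walk the tree on every evaluation (the interpreted approach)."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "lit":
        return node[1]
    return OPS[kind](interpret(node[1], row), interpret(node[2], row))


def generate_source(node) -> str:
    """Turn the tree into Python source text (the code-generation step)."""
    kind = node[0]
    if kind == "col":
        return f"row[{node[1]!r}]"
    if kind == "lit":
        return repr(node[1])
    left, right = generate_source(node[1]), generate_source(node[2])
    return f"({left} {SYMBOLS[kind]} {right})"


def compile_expr(node):
    """Compile the generated source once; the result is a plain function."""
    source = f"lambda row: {generate_source(node)}"
    return eval(compile(source, "<generated>", "eval"))


evaluate = compile_expr(expr)
row = {"a": 3, "b": 4}
print(generate_source(expr), interpret(expr, row), evaluate(row))
```

Both paths return the same value; the compiled function avoids re-dispatching on node types for every row, which is the general reason runtime compilation tends to win on repeated evaluations.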



Cost-based Optimization
Pluggable rules and cost model

Rules for distributed plan generation
- Insert Exchange operator into physical plan
- Parallel query plans

Pluggable cost model
- CPU, IO, memory, network cost (data locality)
- Storage engine features (HDFS vs Hive vs HBase)

[Diagram: query optimizer with pluggable rules]
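
A pluggable cost model can be pictured as a function from a candidate plan's estimated resource use to a single number, with the optimizer picking the cheapest candidate. In the Python sketch below, the plan names, the estimates and the weights are all invented; it shows only how swapping the cost model changes the chosen plan, not how Drill's optimizer computes costs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PlanStats:
    """Estimated resource use of one candidate physical plan (made-up units)."""
    name: str
    cpu: float
    io: float
    memory: float
    network: float


CostModel = Callable[[PlanStats], float]


def weighted_cost(cpu_w: float, io_w: float, mem_w: float, net_w: float) -> CostModel:
    """Build a cost model as a weighted sum of the four resource estimates."""
    def cost(plan: PlanStats) -> float:
        return (cpu_w * plan.cpu + io_w * plan.io
                + mem_w * plan.memory + net_w * plan.network)
    return cost


def choose_plan(candidates: list[PlanStats], cost_model: CostModel) -> PlanStats:
    """Pick the candidate with the lowest cost under the supplied model."""
    return min(candidates, key=cost_model)


candidates = [
    PlanStats("broadcast_join", cpu=10, io=20, memory=30, network=50),
    PlanStats("local_join", cpu=15, io=25, memory=30, network=5),
]

network_heavy = weighted_cost(1, 1, 1, 4)   # network transfer is expensive
network_free = weighted_cost(1, 1, 1, 0)    # network transfer is ignored

print(choose_plan(candidates, network_heavy).name)
print(choose_plan(candidates, network_free).name)
```

With network traffic weighted heavily, the plan that keeps work local wins; with the network weight set to zero, the other plan is cheaper. That is the sense in which data locality enters the cost.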



Integration and extensibility points
Support UDFs
– UDFs/UDAFs using high performance Java API

Not Hadoop centric
– Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.
– Build one distributed query engine together rather than one per technology

Built-in classpath scanning and plugin concept to add additional storage engines, functions and operators with zero configuration

Support direct execution of strongly specified JSON-based logical and physical plans
– Simplifies testing
– Enables integration of alternative query languages



Additional Resources

• Download Apache Drill
• Tutorial: Apache Drill in 10 Minutes
• Whiteboard Video with Tomer Shiran

