Drill: High Performance SQL Engine With JSON Data Model

How Drill achieves Flexibility with Performance
© 2015 MapR Technologies
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
SCHEMA ON WRITE | SCHEMA BEFORE READ | SCHEMA ON THE FLY

Drill’s Data Model is Flexible
Flexibility (schema): Fixed schema to Dynamic schema
Flexibility (structure): Flat to Complex
Formats shown: Parquet, JSON, Avro, BSON, CSV, TSV, HBase

Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
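To make the contrast with a fixed table concrete, here is a minimal Python sketch of discovering a schema from self-describing records. It is an illustration only, not Drill's implementation: the function names `discover_schema` and `merge_schemas` are hypothetical, and the two records are the ones from the slide with values quoted so they parse as JSON.

```python
import json

# The two self-describing records from the slide, quoted to make valid JSON.
RECORDS = [
    '{"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["sing"], "preschool": "CCLC"}',
]


def discover_schema(value, prefix=""):
    """Return {dotted.path: type_name} for one parsed JSON value (hypothetical helper)."""
    if isinstance(value, dict):
        schema = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            schema.update(discover_schema(child, path))
        return schema
    if isinstance(value, list):
        element_types = sorted({type(item).__name__ for item in value})
        return {prefix: f"list<{'|'.join(element_types)}>"}
    return {prefix: type(value).__name__}


def merge_schemas(schemas):
    """Union the per-record schemas; a column absent from a record is simply missing there."""
    merged = {}
    for schema in schemas:
        for path, type_name in schema.items():
            merged.setdefault(path, set()).add(type_name)
    return merged


per_record = [discover_schema(json.loads(text)) for text in RECORDS]
table_schema = merge_schemas(per_record)
```

Note that `district` appears only in the first record and `preschool` only in the second; nothing had to be declared in advance for either column to show up in `table_schema`.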
Drill enables ‘SQL on Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs in the query above)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- Hive database
- HBase namespace
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
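The three-part name in the query can be read as storage plugin, workspace, table. The Python sketch below only illustrates that naming convention; `split_table_path` and `resolve` are hypothetical helpers, not part of Drill, and real identifier parsing handles more cases than this.

```python
def split_table_path(identifier):
    """Split a dotted identifier into parts, keeping backtick-quoted parts whole."""
    parts, current, quoted = [], [], False
    for char in identifier:
        if char == "`":
            quoted = not quoted          # toggle quoting; the backtick itself is dropped
        elif char == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts


def resolve(identifier):
    """Map plugin.workspace.table onto named roles (hypothetical helper)."""
    plugin, workspace, table = split_table_path(identifier)
    return {"storage_plugin": plugin, "workspace": workspace, "table": table}


resolved = resolve("dfs.yelp.`business.json`")
```

The backticks matter: without them the dot in `business.json` would be read as another path separator.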

Drill is a Distributed SQL query engine
[Diagram: a drillbit on each DataNode/RegionServer, coordinated by a ZooKeeper ensemble]
• Scale out
• Columnar and Vectorized execution
• Optimistic and pipelined execution (no MR, Spark, Tez)
• Late binding
• Extensible
Drill allows reuse of existing SQL Tools and Skills
• Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
• Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Drill is Designed For A Wide Set Of Use Cases
• Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …
• Data formats: {JSON}, Parquet, Text Files, …
• Data locations: Files, Directories, Hive, HBase

MapR Optimized Data Architecture
[Diagram: Optimized Data Architecture. Labels recovered from the figure:]
• Sources: relational, SaaS, mainframe; documents, emails; blogs, tweets, link data; log files, clickstreams; sensors; data warehouse
• Data Movement and Data Access into the MapR Distribution for Hadoop
• Processing: Batch (MapReduce, Spark, Hive, Pig), Interactive (Drill, Impala), Streaming (Spark Streaming, Storm)
• MapR Data Platform: MapR-DB, MapR-FS
• Data Transformation, Enrichment and Integration
• Machine Learning
• Operational Apps: Recommendations, Fraud Detection, Logistics
• Analytics: Search, Schema-less data exploration, BI, reporting, Ad-hoc integrated analytics
Architecture – Under the hood
High Level Architecture
• Cluster of commodity servers
– Daemon (drillbit) on each node
• ZooKeeper maintains ephemeral cluster membership information
– Drillbit uses ZooKeeper to find other drillbits in the cluster
– Client uses ZooKeeper to find drillbits
• Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
– Better performance and manageability
• Data processing unit is columnar record batches
– Enables schema flexibility with negligible performance impact
Basic Process
1. Query comes to any Drillbit (JDBC, ODBC, CLI, REST)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
[Diagram: three Drillbits coordinated by ZooKeeper, each running over DFS/HBase/Hive]
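The four steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Drill's planner: the block-to-node map, the row data and the function names (`plan_fragments`, `run_fragment`, `run_query`) are all made up for illustration, and the "query" is simply a sum over every row.

```python
from collections import defaultdict

# Hypothetical cluster state: which node stores which data block, and the rows in each block.
BLOCK_LOCATIONS = {"block-0": "node-a", "block-1": "node-b", "block-2": "node-a", "block-3": "node-c"}
BLOCK_ROWS = {"block-0": [1, 2], "block-1": [3], "block-2": [4, 5, 6], "block-3": [7]}


def plan_fragments(blocks):
    """Step 2: group blocks by the node that stores them (a locality-based plan)."""
    fragments = defaultdict(list)
    for block in blocks:
        fragments[BLOCK_LOCATIONS[block]].append(block)
    return dict(fragments)


def run_fragment(blocks):
    """Step 3: a node computes a partial result over its local blocks."""
    return sum(sum(BLOCK_ROWS[block]) for block in blocks)


def run_query(blocks):
    """Steps 1 to 4 as seen by the driving node: plan, farm out, combine partial results."""
    fragments = plan_fragments(blocks)
    partials = {node: run_fragment(local) for node, local in fragments.items()}
    return sum(partials.values()), fragments


total, fragments = run_query(sorted(BLOCK_LOCATIONS))
```

Each node only touches blocks it already holds, so the only data crossing the network in this sketch is one partial sum per node.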

Core Modules within drillbit
• RPC Endpoint
• SQL Parser → Logical Plan → Optimizer → Physical Plan → Execution
• Storage Plugins: DFS, Hive, HBase, MongoDB

A Query engine that is…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible

Columnar representation
[Diagram: a table with columns A, B, C, D, E; on disk, each column is stored contiguously]

Columnar Encoding
• Values in a col. stored next to one-another
– Better compression
– Range-map: save min-max, can skip if not present
• Only retrieve columns participating in query
• Drill optimizes for BOTH columnar storage and Execution
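A small Python sketch of the two ideas above: reading only the column a query needs, and using per-chunk min/max statistics to skip chunks that cannot contain the value. The chunk layout and the names `make_chunks` and `scan_equals` are assumptions made for this example; real columnar formats store these statistics in their own metadata structures.

```python
def make_chunks(values, chunk_size):
    """Store a column as chunks, each carrying min/max statistics (the range-map)."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        data = values[start:start + chunk_size]
        chunks.append({"min": min(data), "max": max(data), "data": data})
    return chunks


# A tiny columnar table: one list of chunks per column.
TABLE = {
    "age": make_chunks([3, 6, 7, 41, 45, 52, 60, 61, 70], 3),
    "name": make_chunks(["a", "b", "c", "d", "e", "f", "g", "h", "i"], 3),
}


def scan_equals(table, column, target):
    """Read only `column`; skip chunks whose [min, max] range cannot contain `target`."""
    matches, chunks_read = [], 0
    for chunk in table[column]:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # the range-map says the value cannot be present
        chunks_read += 1
        matches.extend(v for v in chunk["data"] if v == target)
    return matches, chunks_read


matches, chunks_read = scan_equals(TABLE, "age", 45)
```

Here the scan never touches the `name` column, and only one of the three `age` chunks is actually read.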

Vectorization
• Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions (GCC, LLVM and JVM all do various optimizations automatically)
– Manually code algorithms
• Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
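The bitmap idea can be sketched in Python, with the caveat that Python integers only imitate the word-sized operations a JVM or native engine would use; no real SIMD happens here. The helper names are hypothetical.

```python
def build_validity_bitmap(values):
    """Pack one bit per value into an int: bit i is 1 when values[i] is not None."""
    bitmap = 0
    for index, value in enumerate(values):
        if value is not None:
            bitmap |= 1 << index
    return bitmap


def count_non_null(bitmap):
    """A single popcount over the word replaces a per-record null test."""
    return bin(bitmap).count("1")


def masked_sum(values, bitmap):
    """Branch-free style: multiply each slot by its validity bit instead of testing it."""
    data = [0 if v is None else v for v in values]   # null slots hold a placeholder value
    return sum(data[i] * ((bitmap >> i) & 1) for i in range(len(data)))


column = [10, None, 30, 40, None, 60]
validity = build_validity_bitmap(column)
```

Keeping nullness in a separate bitmap means the data vector stays dense and uniformly typed, which is what makes word-at-a-time processing possible in the first place.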

Optimistic Execution
• With a short time horizon, failures infrequent
– Don’t spend energy and time creating boundaries and checkpoints to minimize recovery time
– Rerun entire query in face of failure
• No barriers
• No persistence unless memory overflow
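As a rough illustration of "rerun entire query in face of failure", the Python sketch below keeps no checkpoints and simply starts over when an attempt fails. `NodeFailure`, `run_optimistically` and the retry limit are assumptions made for this example, not Drill APIs.

```python
class NodeFailure(Exception):
    """Raised by the hypothetical `execute` callable when a node dies mid-query."""


def run_optimistically(execute, max_attempts=3):
    """Run with no checkpoints; on failure, discard everything and rerun from scratch."""
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(), attempt
        except NodeFailure:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


calls = {"count": 0}


def flaky_query():
    """Fails on the first call only, to imitate an infrequent failure."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise NodeFailure("node lost")
    return [("Michael", 6), ("Jennifer", 3)]


rows, attempts = run_optimistically(flaky_query)
```

The trade-off is the one the slide states: the common, failure-free path pays nothing for recovery, and the rare failure costs one full rerun.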

Pipelining
• Record batch is the unit of work for Drill
– Operators work on a record batch
• Record batches are pipelined between nodes
– ~256kB usually
• Operator reconfiguration happens at batch boundaries
[Diagram: record batches flowing between DrillBits]
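Python generators give a compact way to picture batch-at-a-time pipelining within one process. This is only a sketch of the idea; the operator names and the batch size of 4 rows are arbitrary choices, and it does not model batches moving between nodes.

```python
def scan(rows, batch_size):
    """Source operator: emit the input as fixed-size record batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_batches(batches, predicate):
    """Filter operator: consumes one batch, emits one (possibly smaller) batch."""
    for batch in batches:
        yield [row for row in batch if predicate(row)]


def project_batches(batches, column):
    """Project operator: keeps a single column of each batch."""
    for batch in batches:
        yield [row[column] for row in batch]


rows = [{"name": f"user{i}", "age": i} for i in range(10)]
pipeline = project_batches(filter_batches(scan(rows, 4), lambda r: r["age"] % 2 == 0), "name")

first_batch = next(pipeline)      # only the first input batch has been read at this point
remaining = list(pipeline)
```

Because each operator hands a batch downstream as soon as it is ready, the first results are available before the scan has finished reading its input.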

Runtime Compilation is Faster
[Chart: Time for 1 million evaluations (ms), axis 0 to 500; legend: Janino, interpreted; expression complexity: Trivial, Simple, Moderate]
Source: http://bit.ly/16Xk32x

Drill compiler
• CodeModel generates code
• Janino compiles runtime byte-code
• Precompiled byte-code templates
• Merge byte-code of the two classes
• Loaded class
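The same generate-then-compile idea can be shown with Python's built-in `compile()`, used here as a stand-in for the CodeModel and Janino steps; it is an analogy, not how Drill does it. The tuple-based expression tree and the function names are invented for the example.

```python
def interpret(node, row):
    """Walk the expression tree for every row (the interpreted path)."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "const":
        return node[1]
    left, right = interpret(node[1], row), interpret(node[2], row)
    return left + right if kind == "add" else left * right


def generate_source(node):
    """Turn the same tree into Python source text (the code-generation step)."""
    kind = node[0]
    if kind == "col":
        return f"row[{node[1]!r}]"
    if kind == "const":
        return repr(node[1])
    op = "+" if kind == "add" else "*"
    return f"({generate_source(node[1])} {op} {generate_source(node[2])})"


def compile_expression(node):
    """Compile once with the built-in compile(); reuse the resulting function for every row."""
    code = compile(f"lambda row: {generate_source(node)}", "<generated>", "eval")
    return eval(code)


# The expression (a + 2) * b as a tree.
EXPR = ("mul", ("add", ("col", "a"), ("const", 2)), ("col", "b"))
compiled = compile_expression(EXPR)
row = {"a": 3, "b": 4}
```

Both paths return the same value for `row`; the compiled version avoids re-walking the tree on every evaluation, which is where the saving in the chart would come from.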

Cost-based Optimization
• Pluggable rules, and cost model
• Rules for distributed plan generation
- Insert Exchange operator into physical plan
- Parallel query plans
• Pluggable cost model
- CPU, IO, memory, network cost (data locality)
- Storage engine features (HDFS vs HIVE vs HBase)
[Diagram: Query Optimizer with pluggable rules]
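A pluggable cost model can be pictured as a function passed to the optimizer. In the Python sketch below, the `Plan` fields mirror the cost dimensions listed above, but the plan names, the numbers and the weights in `default_cost` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    cpu: float
    io: float
    memory: float
    network: float


def default_cost(plan):
    """One possible cost model: a weighted sum that penalizes network most (weights are made up)."""
    return plan.cpu + 2.0 * plan.io + 0.5 * plan.memory + 4.0 * plan.network


def choose_plan(candidates, cost_model=default_cost):
    """The cost model is a plain callable, so a different one can be plugged in."""
    return min(candidates, key=cost_model)


# Hypothetical candidate plans for the same query.
CANDIDATES = [
    Plan("broadcast_join", cpu=10, io=5, memory=8, network=6),
    Plan("local_hash_join", cpu=12, io=5, memory=10, network=1),
]

best = choose_plan(CANDIDATES)
best_cpu_only = choose_plan(CANDIDATES, cost_model=lambda plan: plan.cpu)
```

Swapping the cost model changes the winner: the default model prefers the plan that moves less data, while a CPU-only model prefers the other one.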

Integration and extensibility points
• Support UDFs
– UDFs/UDAFs using high performance Java API
• Not Hadoop centric
– Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.
– Build one distributed query engine together rather than one per technology
• Built-in classpath scanning and plugin concept to add additional storage engines, functions and operators with zero configuration
• Support direct execution of strongly specified JSON-based logical and physical plans
– Simplifies testing
– Enables integration of alternative query languages

Additional Resources
• Download Apache Drill
• Tutorial: Apache Drill in 10 Minutes
• Whiteboard Video with Tomer Shiran
