Apache Druid
[Link]
Sudhindra Tirupati Nagaraj
History and Trivia
● Started in 2011 within an ad-tech company called Metamarkets
● Initially considered HBase, which was too slow for aggregate queries
● A real-time and batch analytics data store
● Open sourced in 2012
● Apache Druid project in 2015
● Used by thousands of companies such as Netflix, Lyft, Twitter, and Cisco
Architecture
Caching
● Brokers keep per-segment caches (in either local heap memory or memcached), as sketched below
● Historicals cache segments loaded from deep storage
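A minimal Python sketch of the segment-level caching idea, not Druid's actual code; the class name, key scheme, and the in-memory dict (standing in for heap memory or a memcached client) are illustrative assumptions.

import hashlib
import json

class SegmentResultCache:
    # Caches per-segment results keyed by (segment id, query fingerprint).
    def __init__(self):
        self._store = {}  # stands in for local heap memory or a memcached client

    def _key(self, segment_id, query):
        fingerprint = hashlib.sha1(json.dumps(query, sort_keys=True).encode()).hexdigest()
        return segment_id + ":" + fingerprint

    def get(self, segment_id, query):
        return self._store.get(self._key(segment_id, query))

    def put(self, segment_id, query, result):
        self._store[self._key(segment_id, query)] = result

# Broker-side flow: answer cached segments locally, fan out only for misses.
cache = SegmentResultCache()
query = {"queryType": "timeseries", "dataSource": "wikipedia"}
segment_id = "wikipedia_2013-01-01_2013-01-02_v1"  # hypothetical segment id
if cache.get(segment_id, query) is None:
    result = {"rows": 393298}  # pretend this came back from a historical node
    cache.put(segment_id, query, result)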
Load Balancing
● Coordinators periodically read segment metadata and assign segments to historicals (discovered via ZooKeeper)
● Query patterns feed a cost-based optimization that decides whether to spread or co-locate segments from different data sources (sketched below)
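A rough Python sketch of cost-based placement, under the assumption that the cost simply penalizes stacking segments of the same data source and time chunk on one historical; the real coordinator's cost model is more elaborate and also weighs query patterns.

def placement_cost(candidate, loaded_segments):
    # Illustrative cost: co-locating segments of the same data source (and
    # especially the same interval) concentrates query load on one node.
    cost = 0.0
    for seg in loaded_segments:
        if seg["dataSource"] == candidate["dataSource"]:
            cost += 1.0
            if seg["interval"] == candidate["interval"]:
                cost += 1.0
    return cost

def assign(segment, historicals):
    # historicals: node name -> list of segments already loaded on that node
    best = min(historicals, key=lambda node: placement_cost(segment, historicals[node]))
    historicals[best].append(segment)
    return best

historicals = {"historical-1": [], "historical-2": []}
segments = [
    {"dataSource": "wikipedia", "interval": "2013-01-01/2013-01-02"},
    {"dataSource": "wikipedia", "interval": "2013-01-02/2013-01-03"},
    {"dataSource": "ads", "interval": "2013-01-01/2013-01-02"},
]
for seg in segments:
    print(seg["dataSource"], seg["interval"], "->", assign(seg, historicals))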
Availability
● Real-time nodes periodically persist their in-memory data to disk, and the disks are backed up
● A ZooKeeper or MySQL failure does not affect data availability for querying
Rules
● Hot tier vs. cold tier in historicals; rules configure query SLAs (example below)
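A hedged example of what such tiering rules look like, expressed as the JSON payload (built in Python) posted to the coordinator's rules endpoint; the "hot" tier name, host, and data source are assumptions, and field names should be checked against your Druid version.

import json
import urllib.request

rules = [
    # Last 30 days: extra replicas on hot-tier historicals for tight query SLAs.
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"hot": 2, "_default_tier": 1}},
    # Everything older: a single replica on the default (cold) tier.
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 1}},
]

req = urllib.request.Request(
    "http://coordinator:8081/druid/coordinator/v1/rules/wikipedia",
    data=json.dumps(rules).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to apply against a live cluster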
Why Druid does not do joins
● Scaling joins is hard
● The gains of supporting joins are offset by the problems of handling high-throughput, join-heavy workloads
● It is possible to materialize columns into streams and perform a hash-based or sort-merge join, but that requires a lot of computation (see the sketch below)
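For illustration only (this is not something Druid does), a minimal hash join in Python shows the cost being described: the build side must be fully materialized in memory before the probe side can stream through it.

def hash_join(probe_rows, build_rows, key):
    # Materialize the (smaller) build side into a hash table.
    build = {}
    for r in build_rows:
        build.setdefault(r[key], []).append(r)
    # Stream the probe side and emit matching combined rows.
    for p in probe_rows:
        for r in build.get(p[key], []):
            yield {**p, **r}

users = [{"user_id": 1, "city": "San Francisco"}, {"user_id": 2, "city": "New York"}]
events = [{"user_id": 1, "expenses": 1000}, {"user_id": 1, "expenses": 500}]
print(list(hash_join(events, users, "user_id")))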
Storage Format
● Columnar storage (similar to HBase)
● Data tables are called data sources (similar to Rockset collections). A table is partitioned into “segments”; a segment typically holds 5-10 million rows spanning a period of time, and segments are immutable.
● Multiple column types in a segment => different encoding/compression techniques
Example: string columns use dictionary encoding; numeric columns store raw values. After encoding, columns are compressed with LZF (see the sketch after the table below).
John Smith -> 0
Jane Doe -> 1
Name column: [0, 1, 0]
Timestamp | Name | City | Expenses
T0 | John Smith | San Francisco | 1000
T1 | Jane Doe | San Francisco | 2000
T2 | John Smith | San Francisco | 500
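A small Python sketch of the dictionary encoding shown above (illustrative, not Druid's implementation): each distinct string is assigned an integer id, and the column stores only the ids, which then compress well (Druid follows with LZF).

def dictionary_encode(values):
    dictionary, encoded = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # first occurrence gets the next id
        encoded.append(dictionary[v])
    return dictionary, encoded

names = ["John Smith", "Jane Doe", "John Smith"]
dictionary, column = dictionary_encode(names)
print(dictionary)  # {'John Smith': 0, 'Jane Doe': 1}
print(column)      # [0, 1, 0]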
Filtering
● Filters restrict the rows that feed an aggregation (e.g., sum of all expenses for San Francisco)
● Binary bitmaps serve as indices: for each city, a bitmap marks the rows containing that city. Example: San Francisco -> rows [0, 1, 2] -> [1, 1, 1, 0]; New York -> row [3] -> [0, 0, 0, 1]. The bitmaps can be compressed further with bitmap compression algorithms (Druid uses Concise). Two bitmaps can also be combined: for example, the sum of all expenses in San Francisco and New York is obtained from [1, 1, 1, 0] OR [0, 0, 0, 1] -> [1, 1, 1, 1] (see the sketch after the table below)
Timestamp | Name | City | Expenses
T0 | John Smith | San Francisco | 1000
T1 | Jane Doe | San Francisco | 2000
T2 | John Smith | San Francisco | 500
T3 | John Smith | New York | 200
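A Python sketch of the bitmap-index idea above (uncompressed, for clarity; Druid additionally compresses the bitmaps, e.g. with Concise): one bitmap per city marks the matching rows, filters on several values become bitwise ORs, and the surviving rows drive the aggregation.

rows = [
    {"city": "San Francisco", "expenses": 1000},  # T0
    {"city": "San Francisco", "expenses": 2000},  # T1
    {"city": "San Francisco", "expenses": 500},   # T2
    {"city": "New York",      "expenses": 200},   # T3
]

# Build one bitmap per distinct city value.
bitmaps = {}
for i, row in enumerate(rows):
    bitmaps.setdefault(row["city"], [0] * len(rows))[i] = 1

sf, ny = bitmaps["San Francisco"], bitmaps["New York"]
combined = [a | b for a, b in zip(sf, ny)]  # [1, 1, 1, 0] OR [0, 0, 0, 1]
total = sum(r["expenses"] for r, bit in zip(rows, combined) if bit)
print(combined, total)  # [1, 1, 1, 1] 3700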
Query Language
● No SQL; queries are JSON objects sent over HTTP
● HTTP POST:
Request:
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": "2013-01-01/2013-01-08",
  "filter": {
    "type": "selector",
    "dimension": "page",
    "value": "Ke$ha"
  },
  "granularity": "day",
  "aggregations": [
    {
      "type": "count",
      "name": "rows"
    }
  ]
}
Response:
[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "rows": 393298 }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { "rows": 382932 }
  },
  ...
  {
    "timestamp": "2012-01-07T00:00:00.000Z",
    "result": { "rows": 1337 }
  }
]
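For completeness, a hedged Python example of issuing the query above over HTTP POST; the broker host, its default port (8082), and the /druid/v2 native query endpoint are the usual defaults and should be adjusted for a given deployment.

import json
import urllib.request

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "intervals": "2013-01-01/2013-01-08",
    "filter": {"type": "selector", "dimension": "page", "value": "Ke$ha"},
    "granularity": "day",
    "aggregations": [{"type": "count", "name": "rows"}],
}

req = urllib.request.Request(
    "http://broker:8082/druid/v2/",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # run against a live broker
#     print(json.loads(resp.read()))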
Query Performance
Test data (query mix):
● 30% standard aggregate queries
● 60% group by over aggregates
● 10% search
Results: average ~550 ms, 95th percentile 2 seconds, 99th percentile 10 seconds
Cardinality of a dimension matters a lot!
Query Performance with scaling
Linear scaling helped mostly simple aggregate queries.
Ingest Performance
● Ingest rate depends more on the data source than on the number of dimensions/metrics
● Achieves a peak rate of 800k events/s/core with only timestamp data
● Once the source is discounted, the cost is mainly deserialization
Thank You