Apache Druid
[Link]
Sudhindra Tirupati Nagaraj
History and Trivia
● Started in 2011 within an ad-tech company called Metamarkets
● Initially considered HBase, which was too slow for aggregate queries
● A real-time and batch analytics data store
● Open sourced in 2012
● Apache Druid project in 2015
● Used by thousands of companies such as Netflix, Lyft, Twitter, and Cisco
Architecture
Caching
● Brokers keep per-segment caches (in either local heap memory or memcached), as sketched below
● Historicals cache segments loaded from deep storage
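A minimal Python sketch of the segment-level caching idea, not Druid's actual code; the class name, key scheme, and the in-memory dict (standing in for heap memory or a memcached client) are illustrative assumptions.

import hashlib
import json

class SegmentResultCache:
    # Caches per-segment results keyed by (segment id, query fingerprint).
    def __init__(self):
        self._store = {}  # stands in for local heap memory or a memcached client

    def _key(self, segment_id, query):
        fingerprint = hashlib.sha1(json.dumps(query, sort_keys=True).encode()).hexdigest()
        return segment_id + ":" + fingerprint

    def get(self, segment_id, query):
        return self._store.get(self._key(segment_id, query))

    def put(self, segment_id, query, result):
        self._store[self._key(segment_id, query)] = result

# Broker-side flow: answer cached segments locally, fan out only for misses.
cache = SegmentResultCache()
query = {"queryType": "timeseries", "dataSource": "wikipedia"}
segment_id = "wikipedia_2013-01-01_2013-01-02_v1"  # hypothetical segment id
if cache.get(segment_id, query) is None:
    result = {"rows": 393298}  # pretend this came back from a historical node
    cache.put(segment_id, query, result)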
Load Balancing
● Coordinators periodically read segment metadata and assign segments to historicals (discovered via ZooKeeper)
● Query patterns feed a cost-based optimization that decides whether to spread or co-locate segments from different data sources (sketched below)
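A rough Python sketch of cost-based placement, under the assumption that the cost simply penalizes stacking segments of the same data source and time chunk on one historical; the real coordinator's cost model is more elaborate and also weighs query patterns.

def placement_cost(candidate, loaded_segments):
    # Illustrative cost: co-locating segments of the same data source (and
    # especially the same interval) concentrates query load on one node.
    cost = 0.0
    for seg in loaded_segments:
        if seg["dataSource"] == candidate["dataSource"]:
            cost += 1.0
            if seg["interval"] == candidate["interval"]:
                cost += 1.0
    return cost

def assign(segment, historicals):
    # historicals: node name -> list of segments already loaded on that node
    best = min(historicals, key=lambda node: placement_cost(segment, historicals[node]))
    historicals[best].append(segment)
    return best

historicals = {"historical-1": [], "historical-2": []}
segments = [
    {"dataSource": "wikipedia", "interval": "2013-01-01/2013-01-02"},
    {"dataSource": "wikipedia", "interval": "2013-01-02/2013-01-03"},
    {"dataSource": "ads", "interval": "2013-01-01/2013-01-02"},
]
for seg in segments:
    print(seg["dataSource"], seg["interval"], "->", assign(seg, historicals))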
Availability
● Real-time nodes periodically persist their in-memory data to disk, and the disks are backed up
● A ZooKeeper or MySQL failure does not affect data availability for querying
Rules
● Hot tier vs. cold tier in historicals; rules configure query SLAs (example below)
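A hedged example of what such tiering rules look like, expressed as the JSON payload (built in Python) posted to the coordinator's rules endpoint; the "hot" tier name, host, and data source are assumptions, and field names should be checked against your Druid version.

import json
import urllib.request

rules = [
    # Last 30 days: extra replicas on hot-tier historicals for tight query SLAs.
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"hot": 2, "_default_tier": 1}},
    # Everything older: a single replica on the default (cold) tier.
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 1}},
]

req = urllib.request.Request(
    "http://coordinator:8081/druid/coordinator/v1/rules/wikipedia",
    data=json.dumps(rules).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to apply against a live cluster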
Why Druid does not do joins
● Scaling joins is hard
● The gains of supporting joins are offset by the problems of handling high-throughput, join-heavy workloads
● It is possible to materialize columns into streams and perform a hash-based or sort-merge join, but that requires a lot of computation (see the sketch below)
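For illustration only (this is not something Druid does), a minimal hash join in Python shows the cost being described: the build side must be fully materialized in memory before the probe side can stream through it.

def hash_join(probe_rows, build_rows, key):
    # Materialize the (smaller) build side into a hash table.
    build = {}
    for r in build_rows:
        build.setdefault(r[key], []).append(r)
    # Stream the probe side and emit matching combined rows.
    for p in probe_rows:
        for r in build.get(p[key], []):
            yield {**p, **r}

users = [{"user_id": 1, "city": "San Francisco"}, {"user_id": 2, "city": "New York"}]
events = [{"user_id": 1, "expenses": 1000}, {"user_id": 1, "expenses": 500}]
print(list(hash_join(events, users, "user_id")))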
Storage Format
● Columnar storage (similar to HBase)
● Data tables are called data sources (similar to Rockset collections). A table is partitioned into “segments”; a segment typically holds 5-10 million rows spanning a period of time, and segments are immutable.
● Multiple column types in a segment => different encoding/compression techniques
Example: string columns use dictionary encoding; numeric columns store raw values. After encoding, columns are compressed with LZF (see the sketch after the table below).
John Smith -> 0
Jane Doe -> 1
Name column: [0, 1, 0]
Timestamp | Name | City | Expenses
T0 | John Smith | San Francisco | 1000
T1 | Jane Doe | San Francisco | 2000
T2 | John Smith | San Francisco | 500
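A small Python sketch of the dictionary encoding shown above (illustrative, not Druid's implementation): each distinct string is assigned an integer id, and the column stores only the ids, which then compress well (Druid follows with LZF).

def dictionary_encode(values):
    dictionary, encoded = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # first occurrence gets the next id
        encoded.append(dictionary[v])
    return dictionary, encoded

names = ["John Smith", "Jane Doe", "John Smith"]
dictionary, column = dictionary_encode(names)
print(dictionary)  # {'John Smith': 0, 'Jane Doe': 1}
print(column)      # [0, 1, 0]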
Filtering
● Filters restrict the rows that feed an aggregation (e.g., sum of all expenses for San Francisco)
● Binary bitmaps serve as indices: for each city, a bitmap marks the rows containing that city. Example: San Francisco -> rows [0, 1, 2] -> [1, 1, 1, 0]; New York -> row [3] -> [0, 0, 0, 1]. The bitmaps can be compressed further with bitmap compression algorithms (Druid uses Concise). Two bitmaps can also be combined: for example, the sum of all expenses in San Francisco and New York is obtained from [1, 1, 1, 0] OR [0, 0, 0, 1] -> [1, 1, 1, 1] (see the sketch after the table below)
Timestamp | Name | City | Expenses
T0 | John Smith | San Francisco | 1000
T1 | Jane Doe | San Francisco | 2000
T2 | John Smith | San Francisco | 500
T3 | John Smith | New York | 200
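A Python sketch of the bitmap-index idea above (uncompressed, for clarity; Druid additionally compresses the bitmaps, e.g. with Concise): one bitmap per city marks the matching rows, filters on several values become bitwise ORs, and the surviving rows drive the aggregation.

rows = [
    {"city": "San Francisco", "expenses": 1000},  # T0
    {"city": "San Francisco", "expenses": 2000},  # T1
    {"city": "San Francisco", "expenses": 500},   # T2
    {"city": "New York",      "expenses": 200},   # T3
]

# Build one bitmap per distinct city value.
bitmaps = {}
for i, row in enumerate(rows):
    bitmaps.setdefault(row["city"], [0] * len(rows))[i] = 1

sf, ny = bitmaps["San Francisco"], bitmaps["New York"]
combined = [a | b for a, b in zip(sf, ny)]  # [1, 1, 1, 0] OR [0, 0, 0, 1]
total = sum(r["expenses"] for r, bit in zip(rows, combined) if bit)
print(combined, total)  # [1, 1, 1, 1] 3700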
Query Language
● No SQL; queries are JSON objects sent over HTTP
● HTTP POST:
Request:
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": "2013-01-01/2013-01-08",
  "filter": {
    "type": "selector",
    "dimension": "page",
    "value": "Ke$ha"
  },
  "granularity": "day",
  "aggregations": [
    {
      "type": "count",
      "name": "rows"
    }
  ]
}
Response:
[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "rows": 393298 }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { "rows": 382932 }
  },
  ...
  {
    "timestamp": "2012-01-07T00:00:00.000Z",
    "result": { "rows": 1337 }
  }
]
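For completeness, a hedged Python example of issuing the query above over HTTP POST; the broker host, its default port (8082), and the /druid/v2 native query endpoint are the usual defaults and should be adjusted for a given deployment.

import json
import urllib.request

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "intervals": "2013-01-01/2013-01-08",
    "filter": {"type": "selector", "dimension": "page", "value": "Ke$ha"},
    "granularity": "day",
    "aggregations": [{"type": "count", "name": "rows"}],
}

req = urllib.request.Request(
    "http://broker:8082/druid/v2/",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # run against a live broker
#     print(json.loads(resp.read()))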
Query Performance
Test data (query mix):
● 30% standard aggregate queries
● 60% group by over aggregates
● 10% search
Results: average ~550 ms, 95th percentile 2 seconds, 99th percentile 10 seconds
Cardinality of a dimension matters a lot!
Query Performance with scaling
Linear scaling helped mostly simple aggregate queries.
Ingest Performance
● Ingest rate depends more on the data source than on the number of dimensions/metrics
● Achieves a peak rate of 800k events/s/core with only timestamp data
● Once the source is discounted, the cost is mainly deserialization
Thank You