Data Warehouses
[Diagram: ETL feeds the warehouse; SQL and BI tools query it]
Workloads often limited to SQL and BI tools
Data in proprietary formats
Hard to integrate streaming, ML, and AI workloads
Performance is expensive
Scaling up/out usually comes at a high cost

Data Lakes
[Diagram: ETL feeds the data lake; SQL and other workloads query it]
Evolution of a Cutting-Edge Data Pipeline
[Diagram: events feed a streaming analytics path (1 λ-arch) and, in parallel, a partitioned data lake used for reporting, with 2 validation between the short-term and long-term stores and 3 reprocessing back into the lake]
Challenge #4: Query Performance?
[Diagram: the same pipeline annotated with all four pain points: 1 λ-arch, 2 validation, 3 reprocessing, and 4 compaction of the data lake's small files, with jobs scheduled to avoid compaction windows]
Challenges with Data Lakes: Reliability
Failed production jobs leave data in a corrupt state, requiring tedious recovery
DELTA = Scalable storage + Transactional log

Scalable storage
table data stored as Parquet files on HDFS, AWS S3, Azure Blob Stores

Transactional log
sequence of metadata files that track operations made on the table,
stored in scalable storage along with the table

pathToTable/
+---- 000.parquet
+---- 001.parquet
+---- 002.parquet
+---- ...
+---- _delta_log/
      +---- 000.json
      +---- 001.json
      ...
Log Structured Storage

Changes to the table are stored as ordered, atomic commits

Each commit is a set of actions written as a file in the _delta_log directory

_delta_log/
+---- 000.json    INSERT actions: Add 001.parquet, Add 002.parquet
+---- 001.json    UPDATE actions: Remove 001.parquet, Remove 002.parquet, Add 003.parquet
...
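Concretely, each commit file is a list of JSON actions; a simplified sketch of the two commits above (the real Delta protocol records more fields, e.g. commitInfo, file sizes, and column stats):

_delta_log/000.json
{"add": {"path": "001.parquet", "dataChange": true}}
{"add": {"path": "002.parquet", "dataChange": true}}

_delta_log/001.json
{"remove": {"path": "001.parquet", "dataChange": true}}
{"remove": {"path": "002.parquet", "dataChange": true}}
{"add": {"path": "003.parquet", "dataChange": true}}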
Log Structured Storage

Readers read the log in atomic units, thus reading consistent snapshots

_delta_log/
+---- 000.json    INSERT actions: Add 001.parquet, Add 002.parquet
+---- 001.json    UPDATE actions: Remove 001.parquet, Remove 002.parquet, Add 003.parquet
...

Readers will see either [001+002].parquet or 003.parquet, and nothing in-between
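A minimal PySpark sketch of snapshot reads, assuming a Delta-enabled Spark session (spark) and a hypothetical table path:

# Reads the latest committed snapshot; an in-flight commit is never
# partially visible to this query.
df = spark.read.format("delta").load("/data/conns")

# Time travel: read the snapshot as of an earlier commit in _delta_log.
df_v1 = (spark.read.format("delta")
         .option("versionAsOf", 1)
         .load("/data/conns"))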
Mutual Exclusion

Concurrent writers need to agree on the order of changes

New commit files must be created mutually exclusively:
if Writer 1 and Writer 2 both try to concurrently write 002.json
after 000.json and 001.json, only one of them can succeed
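When a writer loses this race against a conflicting transaction, Delta surfaces it as an exception the job can catch and retry; a sketch using the delta-spark Python package (exception name taken from that package, path hypothetical):

from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, path, attempts=3):
    # Only one writer can create the next NNN.json; a loser whose
    # changes conflict gets an exception and retries on the new log state.
    for _ in range(attempts):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except ConcurrentAppendException:
            continue
    raise RuntimeError("commit kept conflicting; giving up")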
Challenges with cloud storage
Challenge:
Different cloud storage systems have different semantics for providing atomic guarantees
Solution:
Failed write jobs do not update the commit log, so partial / corrupt files are never visible to readers
Challenges solved: Reliability
Challenge:
Lack of consistency makes it almost impossible to mix appends, deletes, and upserts while getting consistent reads
Solution:
All reads have full snapshot consistency
All successful writes are consistent
In practice, most writes don't conflict
Tunable isolation levels
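Upserts, for example, can run against the table while readers keep consistent snapshots; a sketch using Delta's MERGE INTO (table and column names hypothetical):

MERGE INTO conns AS t
USING updates AS s
ON t.connId = s.connId
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *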
Challenges solved: Reliability
Challenge:
Lack of schema enforcement creates inconsistent and low-quality data
Solution:
Schema is recorded in the log
Attempts to commit data with an incorrect schema fail
Explicit schema evolution is allowed
Invariant and constraint checks ensure high data quality
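A small PySpark sketch of both behaviors (path and DataFrame hypothetical; mergeSchema is Delta's standard opt-in for schema evolution):

# Appending data whose schema doesn't match fails the commit:
df_extra.write.format("delta").mode("append").save("/data/conns")
# -> AnalysisException: the schema mismatch is rejected before commit

# Explicit, opt-in schema evolution adds the new columns instead:
(df_extra.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/conns"))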
Challenges solved: Performance
Challenge:
Too many small files increase resource usage significantly
Solution:
Transactionally performed compaction using OPTIMIZE

OPTIMIZE table WHERE date = '2019-04-04'
Challenges solved: Performance
Challenge:
Partitioning breaks down with many dimensions and/or high-cardinality columns
Solution:
Optimize using multi-dimensional clustering on multiple columns

OPTIMIZE conns WHERE date = '2019-04-04'
ZORDER BY (srcIP, destIP)
Querying connection data at Apple
Ad-hoc queries of connection data filter on different columns:

SELECT count(*) FROM conns
WHERE date = '2019-04-04'
AND srcIp = '1.1.1.1'

SELECT count(*) FROM conns
WHERE date = '2019-04-04'
AND dstIp = '1.1.1.1'

Multidimensional Sorting
[Diagram: an 8x8 grid of files over srcIp (rows 1-8) and dstIp (columns 1-8), sorted primarily by srcIp]
The srcIp query reads 2 files; the dstIp query reads 8 files
Great for the major sorting dimension, not for the others
Multidimensional Clustering
[Diagram: the same 8x8 grid clustered along a Z-order space-filling curve]
The srcIp query reads 4 files; the dstIp query reads 4 files
Reasonably good for all dimensions
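The Z-order curve behind ZORDER BY is plain bit interleaving; an illustrative Python sketch (not Delta's internal implementation) of how the 8x8 grid above gets its ordering:

def z_order_key(x: int, y: int, bits: int = 3) -> int:
    """Interleave the bits of x and y into a single Morton code."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Cells sorted by the interleaved key stay clustered in both
# dimensions, so a filter on either column prunes most files.
cells = sorted(((s, d) for s in range(8) for d in range(8)),
               key=lambda c: z_order_key(c[0], c[1]))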
Data Pipeline @ Apple
Detect signals across user, application, and network logs
Quickly analyze the blast radius with ad-hoc queries
Respond quickly in an automated fashion

Data sources (> 100 TB of new data/day):
Security Infra: IDS/IPS, DLP, antivirus, load balancers, proxy servers
Cloud Infra & Apps: AWS, Azure, Google Cloud
Network Infra: routers, switches, WAPs, databases, LDAP
[Diagram: raw data dump → STRUCTURED STREAMING ETL → DELTA tables → SQL, ML, and STRUCTURED STREAMING jobs for alerting, complex reports, and reporting]
1 λ-arch: Not needed; Delta handles both short-term and long-term data
2 Validation: Easy, as short-term and long-term data are in one location
3 Reprocessing: Easy and seamless with Delta's transactional guarantees
4 Compaction: Easy, performed transactionally with OPTIMIZE

Easy to use Delta with Spark APIs
Instead of parquet...

dataframe
  .write
  .format("parquet")
  .save("/data")

… simply say delta

dataframe
  .write
  .format("delta")
  .save("/data")
Accelerate Innovation with Databricks
Unified Analytics Platform

Increases data science productivity by 5x
Open, extensible APIs
Databricks Collaborative Notebooks: Explore Data, Train Models, Serve Models