Onehouse Introduction
Supercharge performance and automate
your Apache Hudi™ data platform
Today’s Agenda
• Introductions
• CCC Intelligent Solutions Goals
• How Onehouse can help
• Next steps
Tell us more about your platform
Questions (capture key observations):
• What is your business use case?
• What are your data sources?
• How is data ingested into Hudi?
• At what scale?
• What query engines do you use?
• Where do you have current challenges?
Goals and Pains
Any of these examples?
Cost Savings
1. Overprovisioned AWS infra?
2. Any vendor costs growing unexpectedly?
3. Spark cluster or storage costs?

Performance
1. Any slow queries?
2. Difficult Hudi perf tuning?
3. Slow writes / freshness latency?

Eng Productivity
1. How much time on-call?
2. What routine ops do you want to automate?
3. Hudi monitoring metrics?

Reliability
1. Hudi pipeline stability?
2. Any internal SLA goals to hit?
3. Disaster recovery?

Architecture Changes
1. Iceberg interop?
2. Warehouse migrations?
3. Database ingestion?
4. New query engines or catalogs?

Hudi Expertise
1. How many Hudi pros on your team?
2. Hudi version upgrades?
3. Any special tuning projects planned?
Example impact with our customers
Cost Savings
• 40% total infra reduction vs OSS Spark on bare-metal EC2 (TBs/day, Hudi experts on the team)

Performance
• 30x query performance in 1 day, after their team had spent months tuning
• 6h write latency -> minutes

Eng Productivity
• 4-6 FTEs managing Spark -> 0.5 FTE, by building ETL pipeline templates

Reliability
• ~75% estimated downtime reduction

Architecture Changes
• Hudi + Iceberg interop
• Snowflake migrations
• Glue + DataHub multi-catalog

Hudi Expertise
• Hudi v0.8 -> 0.14 upgrades
• Professional services for issue resolution and perf tuning
How Onehouse Can Help:
The Universal Data Lakehouse
Supercharge performance and automate
your Apache Hudi™ data platform
“Onehouse is going to make broadly accessible what has to-date been a tightly held secret used by only the most advanced data teams.”
Aaron Schildkrout, Investor - Addition
Delivering an Open Data Lake for the Enterprise
Take your data lakehouse to the next level
Infra Cost + Performance
  DIY/OSS: continuous analysis & DIY tuning
  Onehouse: 10x faster innovations on top of OSS
  - Multiplexed ingestion
  - Optimized record-level index (RLI)
  - Column-level partial overwrites (coming soon)
  - Analytics- and data science-ready
  - Auto perf-tuned file sizes, sorting, and partitioning; operates low-latency pipelines

Developer Experience
  DIY/OSS: deep knowledge of Spark, Scala, cloud infra ++
  Onehouse: no-knobs tuning, with deep automatic database optimizations

Time to production
  DIY/OSS: months
  Onehouse: days to weeks

Maintenance
  DIY/OSS: DIY on-call, monitoring, version upgrades, security patches, backfilling
  Onehouse: Onehouse handles everything
What is Onehouse?
Fully-Managed Universal Data Lakehouse
Fully managed, securely in your account
[Architecture diagram: data warehouses, data streams, databases, and cloud storage flow through the Onehouse control plane (LakeView & control plane, table format interop & catalog sync, continuous ingestion, table optimizations, SQL transformations, Spark/Python jobs) out to query engines, real-time engines, data engineering, and data science, all running on the Onehouse Compute Runtime.]
Onehouse Compute Runtime (OCR)
A specialized runtime in your VPC

Serverless Compute Manager
○ Elastic cluster scaling - for handling data workload spikes and swings
○ Multi-cluster management - for flexible resource allocation and isolation
○ BYOC infrastructure - for security, sovereignty, and flexibility

Adaptive Workload Optimizations
○ Multiplexed job scheduler - to minimize compute footprint
○ Lag-aware scheduling - to enforce latency SLAs (see the sketch below)
○ Performance profiles - to balance write vs query performance

High-performance Lakehouse I/O
○ Vectorized columnar merging - for fast writes
○ Parallel pipelined execution - to maximize CPU efficiency
○ Optimized storage access - reduces network requests vs OSS Parquet readers
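To make "lag-aware scheduling" concrete, here is a hypothetical sketch (not Onehouse's implementation) of the policy idea: pending pipeline runs are ordered by how close each table's freshness lag is to its latency SLA, so tables about to breach their SLA run first.

```python
# Hypothetical sketch of a lag-aware scheduling policy (illustration only).
from dataclasses import dataclass

@dataclass
class Pipeline:
    name: str
    lag_seconds: float  # time since the table's last successful commit
    sla_seconds: float  # freshness SLA for this table

def schedule(pipelines: list[Pipeline]) -> list[Pipeline]:
    # Highest lag-to-SLA ratio first; a ratio above 1.0 means the SLA
    # is already breached, so that pipeline jumps the queue.
    return sorted(pipelines, key=lambda p: p.lag_seconds / p.sla_seconds,
                  reverse=True)

runs = schedule([
    Pipeline("orders", lag_seconds=540, sla_seconds=600),   # at 90% of SLA
    Pipeline("clicks", lag_seconds=120, sla_seconds=3600),  # at 3% of SLA
])
print([p.name for p in runs])  # ['orders', 'clicks']
```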
4 ways to engage

Onehouse Cloud: Ingestion/ETL
● 10x faster vs existing OSS Hudi, Iceberg, Delta pipelines
● CDC ingestion from popular databases to any engine
● Auto-scaling infra with a serverless experience in your VPC

Onehouse Table Optimizer
● No-knobs tuning for Apache Hudi pipelines
● Automates advanced optimizations for 10x faster perf
● No change to your existing pipelines

Onehouse LakeView (free!)
● Advanced Hudi dashboards to monitor OSS pipelines
● Monitoring + tuning insights for your existing data pipelines

Enterprise Hudi Support + Services
● Enterprise SLA (24x7, guaranteed response time)
● Hudi professional services (e.g., Hudi 1.0 version upgrades)
Differentiation vs AWS EMR

Autoscaling
  Onehouse: advanced auto-scaling tuned for Hudi pipelines
  AWS EMR: generic Spark autoscaling

Compute backbone
  Onehouse: multiplexed Spark compute with a serverless experience inside your secure VPC
  AWS EMR: provision, tune, and manage your own Spark clusters

Hudi table services
  Onehouse: automated Apache Hudi services continuously optimize tables
  AWS EMR: DIY custom code

Ingestion
  Onehouse: fully managed ingest services that beat cost even vs OSS Spark on raw EC2
  AWS EMR: DIY custom code

ETL
  Onehouse: low-code transformers, extensible Spark code, SQL endpoint
  AWS EMR: custom Spark, PySpark, Spark SQL code

Observability
  Onehouse: rich dashboards/alerts for pipeline health, perf, quality, lag, cost, and optimizations, plus the Hudi timeline
  AWS EMR: basic Spark monitors lack the detail to debug and tune Hudi pipelines

Support
  Onehouse: 24x7 Hudi and Spark experts
  AWS EMR: 24x7 generic AWS Spark support
Examples and scale

Industry Vertical: Retail
Use Case: Near real-time lakehouse and ingestion for IoT, orders, and inventory management
Onehouse Impact: Cost reduction from BigQuery; faster, fresher lakehouse ingestion; moved from OSS to managed table services

Industry Vertical: Utilities
Use Case: Power-meter sensor IoT; predictive analytics for grid failures
Onehouse Impact: Unlocked previously untapped 10-year historical data that was estimated to take more than 6 months just to process

Industry Vertical: Finance
Use Case: Credit card issuing, transaction processing, fraud detection, and more
Onehouse Impact: Snowflake migration; streaming ingestion of raw data into Hudi and XTable/Iceberg tables; table management for downstream pipelines

Sample customer workload:
• 1.7PB+ data processed
• 1,000+ active pipelines
• 700k+ compute hours
• 1.5 TB/min peak burst rate for a single pipeline
EXAMPLE Customer TCO vs EMR
(Example only; actual $ estimates will vary.)

OSS Hudi - Total Cost: $447K
• Engineering: 1.5 FTE(1), $375K
• SavePoint/Rollback*: $12K
• Monitoring: $12K
• Observability: $12K
• Hudi Table Services: $12K
• AWS: $24K(2)

Onehouse Platform - Total Cost: $131K
• Engineering: 0.25 FTE(1), $60K
• Onehouse(5) (Foundation + Table Optimizer): $53K
• AWS: $18K(2)

Impact: ~20% infra cost reduction | ~1,200 query hours gained | ~75% downtime avoidance | ~2,400 engineering hours saved
Proven industry traction for Onehouse & Hudi

• 50% reduction in data pipeline engineering expenses
• 98% reduction in CDC sync frequency
• 10X reduction in data processing times, after reducing backfill time from half a year to days
• <1s query latency at data lake scale
• 80% reduction in compute costs
• 80% reduction in compute costs and 90% reduction in storage costs, over millions of vCores
• 5X+ faster ingestion and 2-10X typical query performance, at over 100TB/day ingested
• 1.25M saved per year vs Fivetran and Snowflake, while powering Notion AI
Peek at the feature menu 👀

Ingestion
• Stream ingest from Kafka, RDBMS, or S3
• Incremental change data capture
• Data quality quarantine

Transformations
• Low-code transformations
• Extensible Spark-code transforms
• SQL endpoint

Table Services
• Clustering, cleaning, compaction
• Multi-catalog sync
• Table format interop
• More coming soon ++

Infrastructure
• Serverless in-VPC dataplane
• Auto-scaling Spark clusters
• Infra-as-code APIs
• Monitoring/alerting
• Multiplexed compute
• Hudi enterprise support
Next Steps
Next Steps Example
Identify Use Case → Technical Trial Setup → Production
(Executive alignment; free trial)
Finding the Use Case (example)

Current State (Q3 2023)
• Primarily an append-only workload, in shadow mode in production.
• Only a few small tables with mutable workloads.
• Old version of Apache Hudi, without the metadata table enabled.

Future State
• Improved Data Accessibility & Analytics: support for multiple formats doesn't mean more pipelines and duplicate data.
• Scalability & Flexibility: easily accommodates future data growth and the integration of new data sources. Leverage any query or data processing engine.
• Cost Optimization: leverages scalable cloud-native storage and elastic compute.

Required Capabilities
● Support ingestion, backfill, and historical upserts.
● Highly available infrastructure with minimized downtime.
● Store the data in an open-standards, future-proof format on cloud-native technology.
● Out-of-the-box data automation to enable data freshness, reduce data corruption, and enhance query performance through data-skew reduction and data "defragging".
● Stored securely in AWS?

Success Metrics
• 20% reduced infrastructure costs?
• Enable interoperability with Apache Iceberg and Delta Lake?
• 30% query performance boost?
• Autoscale ingestion?
NEXT STEPS
Your Universal Data Lakehouse is Waiting For You
● Identify Use Case
● Free 1-month trial to get started
● Invite your team to learn more:
⁃ Custom demo
⁃ Detailed feature overview
⁃ Case studies
⁃ Pricing
⁃ Hudi details
⁃ Anything around data
Email gtm@onehouse.ai for exclusive access
Thank You
Your ETL transformations, your way
Managed Incremental Pipelines (Preview*)
No-code, low-code, or BYO-code to ingest data in real time and run ELT workloads.

Spark Jobs (Preview*)
Easy, fast experience over OSS Spark/AWS EMR, plus open table formats.

SQL Endpoint (Preview*)
Execute SQL pipelines using popular tools: dbt, Metabase, Airflow, Hex, notebooks, etc.

SQL Editor (Preview*)
Onehouse SQL editor for quick exploration.
Spark Jobs (Preview*)
● Submit Spark jobs (Scala/Python) to the Onehouse Jobs API or via the UI (see the sketch below)
● Executes on the Onehouse Compute Runtime (OCR), 100% Spark-compatible
● Isolate workloads using auto-scaling serverless clusters that lower compute costs
● Job monitoring and history, log tracking
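To ground the bullets above, here is a minimal PySpark job sketch of the kind you might submit. It is plain Spark code (the slide states OCR is 100% Spark-compatible); the bucket paths and app name are hypothetical placeholders, and the Jobs API/UI submission step itself is not shown.

```python
# Minimal PySpark job sketch; paths and names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-orders-rollup").getOrCreate()

# Read a bronze table and produce a daily rollup.
orders = spark.read.parquet("s3://bucket/bronze/orders/")
daily = (orders
         .groupBy(F.to_date("order_ts").alias("order_date"))
         .agg(F.count("*").alias("order_count"),
              F.sum("amount").alias("revenue")))

# Write the silver aggregate back to cloud storage.
daily.write.mode("overwrite").parquet("s3://bucket/silver/daily_orders/")
spark.stop()
```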
Onehouse SQL Endpoint (Preview*)
● Endpoint for popular external SQL tools:
○ dbt, Airflow (SQL pipelines)
○ Metabase, notebooks (data exploration)
● Seamless Onehouse integration
○ Observability in the Onehouse console
○ Managed, async optimizations on your
SQL-generated tables
○ Multi-catalog sync for your SQL-generated
tables to Snowflake, Glue, DataHub, etc.
● Reduces spend on premium cloud warehouse compute
○ SQL functions to author pipelines running merges/loads incrementally (see the sketch below)
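To illustrate the incremental merge/load point, a hedged sketch of the kind of SQL an external tool (dbt, Airflow, a notebook) might send to the endpoint. Table and column names are hypothetical, the watermark subquery is just one way to pick up new rows, and the connection/submission mechanism is tool-specific and not shown.

```python
# Hypothetical incremental MERGE a SQL tool might run against the endpoint.
INCREMENTAL_MERGE = """
MERGE INTO silver.orders t
USING (
    SELECT *                                   -- new/changed rows only
    FROM bronze.orders_changes
    WHERE _change_ts > (SELECT MAX(_change_ts) FROM silver.orders)
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
```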
Onehouse SQL Editor (Preview*)
● Author and run SQL queries directly through the
Onehouse SQL Editor
● Get quick analysis at a glance
● Backfill or update critical data
● Create new tables and experiment
Announcing the Apache Hudi 1.0 release
1. Secondary Indexing
   a. 95% faster TPC-DS with secondary indexes you can create/drop asynchronously with simple SQL (see the sketch below)
2. Partial Updates
   a. 2.6x perf and 85% less write amplification, with MERGE INTO modifying only the changed fields of a record
3. Logical Partitioning
   a. Postgres-style expression indexes that treat partitions as coarse-grained indexes
4. Merge Modes
   a. First-class support for commit_time_ordering and event_time_ordering
5. Non-blocking Concurrency
   a. Multiple writers and compaction of the same record, without blocking any involved processes
6. LSM Timeline
   a. Revamped timeline, allowing users to retain a large amount of table history

https://hudi.apache.org/blog/2024/12/16/announcing-hudi-1-0-0
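A short Spark SQL sketch of the secondary-index and partial-update features above. The CREATE/DROP INDEX syntax follows the Hudi 1.0 announcement; the table, index, and column names are hypothetical, and a Spark session already configured with the Hudi 1.0 bundle is assumed.

```python
# Sketch of Hudi 1.0 secondary indexing + partial updates via Spark SQL.
# Assumes `spark` is a SparkSession configured for Hudi 1.0; the table,
# index, and column names are hypothetical.

# Secondary index on a non-key column, created/dropped with plain SQL.
spark.sql("CREATE INDEX idx_city ON hudi_table USING secondary_index(city)")

# Point lookups on `city` can now prune file groups via the index.
spark.sql("SELECT * FROM hudi_table WHERE city = 'sunnyvale'").show()

# Partial update: MERGE INTO writes only the changed field of each record.
spark.sql("""
    MERGE INTO hudi_table t
    USING updates u ON t.record_key = u.record_key
    WHEN MATCHED THEN UPDATE SET t.city = u.city
""")

# Drop the index when it is no longer needed.
spark.sql("DROP INDEX idx_city ON hudi_table")
```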
Record Index performance

1. Write performance
   a. Boosts index lookups 4-10x for large-scale datasets
2. Read performance
   a. Applies to "EqualTo" and "IN" predicates on record key columns in Spark
   b. 98% reduction in query latency on a 400GB dataset with 20,000 file groups
3. Efficient storage
   a. Stored as a separate partition under the metadata table
   b. Maps record key to record location in HFile, for fast updates and lookups
   c. Avoids the cost of gathering data from table data files
   d. Touches fewer file groups instead of all data files
   e. No linear increase with table size

(A write-side config sketch follows the chart captions below.)
[Chart: index and write latency speedup on a 1TB dataset, 200MB batches, random updates, Spark datasource]
[Chart: reduced SQL point-lookup latency on a TPC-DS 10TB dataset, store_sales table, Spark]
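As referenced above, a hedged sketch of enabling the record-level index on a Hudi write path. The config keys are standard Hudi options (available since 0.14); the input path, table name, and field names are hypothetical.

```python
# Enabling the Hudi record-level index (RLI) on a write path (sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rli-demo").getOrCreate()
updates = spark.read.parquet("s3://bucket/updates/")  # hypothetical input

(updates.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    # Build the record-level index inside the metadata table...
    .option("hoodie.metadata.record.index.enable", "true")
    # ...and use it to tag incoming records to their file groups.
    .option("hoodie.index.type", "RECORD_INDEX")
    .mode("append")
    .save("s3://bucket/tables/orders"))
```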
Onehouse Table Optimizer
Operating open table formats is hard…
● Apache Hudi requires maintenance for cleaning, clustering, compaction, file sizing, etc.
● Many configs and parameters to tune (a sample below):
  ○ Frequency, budgets, triggers, partition spread, parallelism, retention, concurrency, etc.
● Inline services compete with write operations
● Getting your table service configs right can mean 2-100x perf swings
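To make the tuning surface concrete, a hedged sample of real OSS Hudi table-service knobs that otherwise need per-table hand-tuning. The keys are standard Hudi configs; the values are arbitrary examples, not recommendations.

```python
# A sample of the Hudi table-service knobs behind the bullets above.
# Keys are standard Hudi configs; values are arbitrary examples.
hudi_table_service_opts = {
    # Cleaning: how much commit history to retain.
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.commits.retained": "10",
    # Compaction (MoR): when to merge log files into base files.
    "hoodie.compact.inline": "false",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: trigger cadence and target file size.
    "hoodie.clustering.inline": "false",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024**3),
    # File sizing: small-file handling on ingest.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
}
```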
Operating Hudi pipelines
[Diagram: existing bronze → silver → gold pipelines, with inline resource contention or complex operational burden]

Onehouse Table Optimizer
Managed Hudi Table Services
[Diagram: the same existing Hudi bronze → silver → gold pipelines, with Onehouse optimizing them alongside]
Hands-free optimizations
● Async + on-demand
● Auto-scaling + spot nodes
● Off-peak-hours execution
Table Optimizer

→ Auto file-sizing: never worry about small files again
→ Auto cleaning: enforce retention and clean versions and failed commits
→ Zero-config compaction: set-it-and-forget-it continuous merging of Hudi MoR Parquet base and Avro log files
→ Adaptive clustering: incremental + global clustering with several sorting algorithms, including linear, Z-order, and Hilbert curves (config sketch below)
→ Partition monitoring: monitoring and dashboards to identify partition skew over time
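For reference, the OSS clustering knobs that adaptive clustering manages for you: a hedged sketch with real Hudi config keys, and arbitrary example columns and sizes.

```python
# The OSS clustering knobs Table Optimizer tunes automatically (sketch).
# Keys are standard Hudi configs; columns/sizes are arbitrary examples.
clustering_opts = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Sort layout strategy: "linear", "z-order", or "hilbert".
    "hoodie.layout.optimize.strategy": "z-order",
    "hoodie.clustering.plan.strategy.sort.columns": "city,event_ts",
    # Only rewrite file groups smaller than this threshold.
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
}
```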
Table Optimizer Architecture

Table Services: clustering, cleaning, compaction, multi-catalog sync, table format interop, more coming soon ++
Infrastructure: serverless in-VPC dataplane, auto-scaling Spark clusters, infra-as-code APIs, monitoring/alerting, multiplexed compute, Hudi enterprise support
Table Services Impact

20-30% storage savings: cleaning enforces retention and cleans Hudi timeline versions
10-50% faster writes: async table services eliminate writer compute contention
2-30x faster queries: clustering sorts data to accelerate query access
Universal Table Services Interoperability

1. Multi-catalog sync makes data analytics-ready in any engine
2. XTable conversion produces tables as Hudi, Iceberg, or Delta

[Diagram: table services (clustering, cleaning, compaction, multi-catalog sync, table format interop, more coming soon ++)]
Infrastructure Impact

~80% engineering time savings
1. Hands-free infrastructure
   a. Removes complex DIY tuning, configuring, and troubleshooting of Spark clusters and Hudi table services
2. Enterprise monitoring and alerting
   a. Dashboards with the Hudi timeline, job performance, and resource usage
   b. Configurable alerts to monitor pipeline health
3. 24/7 production support with Onehouse expert guidance on table optimization strategies

10-70% compute savings
1. Advanced auto-scaling: push table maintenance to smaller SKUs or spot nodes for cost savings

[Diagram: infrastructure (serverless in-VPC dataplane, auto-scaling Spark clusters, infra-as-code APIs, monitoring/alerting, multiplexed compute, Hudi enterprise support)]
Advanced Monitoring/Alerting
● Per-job cost attribution tracking
● Advanced lag and latency detection with configurable alert thresholds
● Detailed stats on bytes written per operation type, cluster utilization, failures, and much more
● In-depth timeline of table-level operations to audit all writers and services that touch the table
● Critical Hudi metrics, including partition skew, file sizes, compaction backlog, and concurrent operations
Table Optimizer Roadmap

Now
● Auto file-sizing: never worry about small files again
● Auto cleaning: enforce retention and clean versions and failed commits
● Zero-config compaction: set-it-and-forget-it continuous merging of Hudi MoR Parquet base and Avro log files
● Adaptive clustering: incremental + global clustering with several sorting algorithms, including linear, Z-order, and Hilbert curves
● Partition monitoring: monitoring and dashboards to identify partition skew over time

Next
● Compaction Accelerator: out-of-the-box 2-10x perf improvements over OSS
● TTL management: set and enforce data expiry lifetimes
● Compaction Balancer: balances compaction patterns across recent partitions for cost/performance
● Intelligent Clustering: monitors storage access patterns to automatically select clustering strategies

Later
● Auto-indexing subsystem: given write patterns, applies indexes for acceleration
● Fully managed lock manager: hosted and auto-configured locking service
● Savepoint/Restore: automatically takes savepoints, with an on-demand restore service
● Iceberg/Delta: native optimizations
LIVE DEMO
Thank You