
Onehouse Introduction
Supercharge performance and automate your Apache Hudi™ data platform
Today’s Agenda
• Introductions
• CCC Intelligent Solutions goals
• How Onehouse can help
• Next steps
Tell us more about your platform
Questions to capture key observations:
• What is your business use case?
• What are your data sources?
• How is data ingestion done to Hudi?
• What is your scale?
• What query engines do you use?
• Where do you have current challenges?

Goals and Pains
Any of these examples?

Cost Savings:
1. Overprovisioned AWS infra?
2. Any vendor costs growing unexpectedly?
3. Spark cluster or storage costs?

Performance:
1. Any slow queries?
2. Difficult Hudi perf tuning?
3. Slow writes / freshness latency?

Eng Productivity:
1. How much time on-call?
2. What routine ops do you want to automate?
3. Hudi monitoring metrics?

Reliability:
1. Hudi pipeline stability?
2. Any internal SLA goals to hit?
3. Disaster recovery?

Architecture Changes:
1. Iceberg interop?
2. Warehouse migrations?
3. Database ingestion?
4. New query engines or catalogs?

Hudi Expertise:
1. How many Hudi pros on your team?
2. Hudi version upgrades?
3. Any special tuning projects planned?
Example impact with our customers

Cost Savings: 40% total infra reduction vs OSS Spark on bare-metal EC2, at TBs/day scale, for a team with Hudi experts.
Performance: 30x query performance in 1 day, after their team had spent months tuning; 6h write latency reduced to minutes.
Eng Productivity: 4-6 FTEs managing Spark reduced to 0.5 FTE by building ETL pipeline templates.
Reliability: ~75% estimated downtime reduction.
Architecture Changes: Hudi + Iceberg interop; Snowflake migrations; Glue + DataHub multi-catalog.
Hudi Expertise: Hudi v0.8 to v0.14 upgrades; professional services for issue resolution and perf tuning.
How Onehouse Can Help: The Universal Data Lakehouse
Supercharge performance and automate your Apache Hudi™ data platform

"Onehouse is going to make broadly accessible what has to-date been a tightly held secret used by only the most advanced data teams."
- Aaron Schildkrout, Investor, Addition

Delivering an Open Data Lake for the Enterprise
Take your data lakehouse to the next level

Infra Cost + Performance
- DIY: Continuous analysis & DIY tuning
- Onehouse: 10x faster innovations on top of OSS: multiplexed ingestion; optimized record-level index (RLI); column-level partial overwrites (coming soon); auto perf-tuned file sizes, sorting, and partitioning; operates low-latency pipelines; analytics- and data science-ready

Developer Experience
- DIY: Deep knowledge of Spark, Scala, cloud infra ++
- Onehouse: No-knobs tuning, with deep automatic database optimizations

Time to Production
- DIY: Months
- Onehouse: Days to weeks

Maintenance
- DIY: On-call, monitoring, version upgrades, security patches, backfilling
- Onehouse: Onehouse handles everything
What is Onehouse?
Fully-Managed Universal Data Lakehouse: fully managed, running securely in your account.

[Architecture diagram: data warehouses, data streams, and databases feed continuous ingestion, table optimizations, SQL transformations, and Spark/Python jobs on the Onehouse Compute Runtime over cloud storage; LakeView & Control Plane and table format interop & catalog sync connect the lakehouse to query engines, real-time engines, data engineering, and data science.]
Onehouse Compute Runtime (OCR)
A specialized compute runtime in your VPC.

Serverless Compute Manager
○ Elastic cluster scaling - for handling data workload spikes and swings
○ Multi-cluster management - for flexible resource allocation and isolation
○ BYOC infrastructure - for security, sovereignty, and flexibility

Adaptive Workload Optimizations
○ Multiplexed job scheduler - to minimize compute footprint
○ Lag-aware scheduling - to enforce latency SLAs
○ Performance profiles - to balance write vs query performance

High-performance Lakehouse I/O
○ Vectorized columnar merging - for fast writes
○ Parallel pipelined execution - to maximize CPU efficiency
○ Optimized storage access - fewer network requests vs OSS Parquet readers
4 ways to engage

Onehouse Cloud: Ingestion/ETL
● 10x faster vs existing OSS Hudi, Iceberg, Delta pipelines
● CDC ingestion from popular databases to any engines
● Auto-scaling infra w/ serverless experience in customer VPC

Onehouse Table Optimizer
● No-knobs tuning for Apache Hudi pipelines
● Automates advanced optimizations for 10x faster perf
● No change to your existing pipelines

Onehouse LakeView (Free!)
● Advanced Hudi dashboards to monitor OSS pipelines
● Monitoring + tuning insights for your existing data pipelines

Enterprise Hudi Support + Services
● Enterprise SLA (24x7, guaranteed response time)
● Hudi professional services (e.g., Hudi 1.0 version upgrade)
Differentiation vs AWS EMR

Autoscaling
- Onehouse: Advanced auto-scaling tuned for Hudi pipelines
- EMR: Generic Spark autoscaling

Compute backbone
- Onehouse: Multiplexed Spark compute w/ serverless experience inside your secure VPC
- EMR: Provision, tune, and manage your own Spark clusters

Hudi table services
- Onehouse: Automated Apache Hudi services continuously optimize tables
- EMR: DIY custom code

Ingestion
- Onehouse: Fully managed ingest services that beat cost even vs OSS Spark on raw EC2
- EMR: DIY custom code

ETL
- Onehouse: Low-code transformers, extensible Spark code, SQL endpoint
- EMR: Custom Spark, PySpark, SparkSQL code

Observability
- Onehouse: Rich dashboards/alerts for pipeline health, perf, quality, lag, cost, optimizations, + Hudi timeline
- EMR: Basic Spark monitors lack the details to debug and tune Hudi pipelines

Support
- Onehouse: 24x7 Hudi and Spark experts
- EMR: Generic 24x7 AWS Spark support
Examples and scale

Industry Vertical: Retail
Use Case: Near real-time lakehouse and ingestion for IoT, orders, and inventory management.
Onehouse Impact: Cost reduction from BigQuery; faster, fresher data lakehouse ingestion; move from OSS to managed table service management.

Industry Vertical: Utilities
Use Case: Power meter utility sensor IoT; predictive analytics for grid failures.
Onehouse Impact: Reached previously untapped 10-year historical data that was estimated to take more than 6 months just to process.

Industry Vertical: Finance
Use Case: Credit card issuing, transaction processing, fraud detection, and more.
Onehouse Impact: Snowflake migration; streaming ingestion of raw data into Hudi and XTable/Iceberg tables; table management for downstream pipelines.

Sample Customer Workload:
- 1.7PB+ data processed
- 1,000+ active pipelines
- 700k+ compute hours
- 1.5 TB/min peak burst rate for a single pipeline
EXAMPLE Customer TCO vs EMR
(Example only: actual $ estimates will vary.)

OSS Hudi: Total cost $447K
- Engineering: 1.5 FTE(1), $375K
- Foundation: SavePoint / Rollback* $12K; Monitoring $12K; Observability $12K; Hudi Table Services $12K
- AWS: $24K(2)

Onehouse Platform: Total cost $131K
- Engineering: 0.25 FTE(1), $60K
- Onehouse Table Optimizer(5): $53K
- AWS: $18K(2)

Impact: ~20% infra cost reduction; ~1,200 query hours gained; ~75% downtime avoidance; ~2,400 engineering hours saved.
Proven industry traction for Onehouse & Hudi

- 50% reduction in data pipeline engineering expenses
- 98% reduction in CDC sync frequency
- 10X reduction in data processing times, after reducing backfill time from half a year to days
- <1s query latency at data lake scale
- 80% reduction in compute costs over millions of vCores
- 90% reduction in storage costs, with over 100TB/day ingested
- 80% faster ingestion
- 2-10X typical query performance
- 5X+ reduction in compute costs
- $1.25M saved per year vs Fivetran and Snowflake, while powering Notion AI
Peek at the feature menu 👀

Ingestion: Stream ingest from Kafka, RDBMS, or S3; Incremental change data capture; Data quality quarantine
Transformations: Low-code transformations; Extensible Spark-code transforms; SQL endpoint
Table Services: Clustering; Cleaning; Compaction; Multi-catalog sync; Table format interop; More coming soon ++
Infrastructure: Serverless in-VPC dataplane; Auto-scaling Spark clusters; Infra-as-Code APIs; Monitoring/alerting; Multiplexed compute; Hudi Enterprise Support
Next Steps
Example path:
1. Identify use case + executive alignment
2. Technical trial setup (free trial)
3. Production
Finding the Use Case (example)

Current State (Q3 2023):
• Primarily an append-only workload in shadow mode in production.
• Only a few small tables with mutable workloads.
• Old version of Apache Hudi without the metadata table enabled.

Future State:
• Improved data accessibility & analytics: support for multiple formats doesn't mean more pipelines and duplicate data.
• Scalability & flexibility: easily accommodates future data growth and integration of new data sources; leverage any query engine or data processing engine.
• Cost optimization: leverages scalable cloud-native storage and elastic compute.

Required Capabilities:
● Support ingestion, backfill, and historical upserts.
● Highly available infrastructure with minimized downtime.
● Store the data in an open-standards, future-proof format on cloud-native technology.
● Out-of-the-box data automation to enable data freshness, reduce data corruption, and enhance query performance through data skew reduction and data "defragging".
● Stored securely in AWS.

Success Metrics:
• 20% reduced infrastructure costs?
• Enable interoperability with Apache Iceberg and Delta Lake?
• 30% query performance boost?
• Autoscale ingestion?
NEXT STEPS
Your Universal Data Lakehouse is waiting for you:
● Identify use case
● Free 1-month trial to get started
● Invite your team to learn more:
⁃ Custom demo
⁃ Detailed feature overview
⁃ Case studies
⁃ Pricing
⁃ Hudi details
⁃ Anything around data

Email gtm@onehouse.ai for exclusive access

Thank You
Your ETL transformations, your way

Managed Incremental Pipelines (Preview*): No-code, low-code, or BYO-code to ingest data in real time and run ELT workloads.
Spark Jobs (Preview*): An easy, fast experience over OSS Spark / AWS EMR + open table formats.
SQL Endpoint (Preview*): Execute SQL pipelines using popular tools: dbt, Metabase, Airflow, Hex, notebooks, etc.
SQL Editor (Preview*): Onehouse SQL editor for quick exploration.
Spark Jobs (Preview*)
● Submit Spark jobs (Scala/Python) to the Onehouse Jobs API or via the UI
● Jobs execute on the Onehouse Compute Runtime (OCR), which is 100% Spark compatible
● Isolate workloads using auto-scaling serverless clusters that lower compute costs
● Job monitoring and history, log tracking

Onehouse SQL Endpoint (Preview*)
● Endpoint for popular external SQL tools:
○ dbt, Airflow (SQL pipelines)
○ Metabase, notebooks (data exploration)
● Seamless Onehouse integration:
○ Observability in the Onehouse console
○ Managed, async optimizations on your SQL-generated tables
○ Multi-catalog sync for your SQL-generated tables to Snowflake, Glue, DataHub, etc.
● Reduces spend on premium cloud warehouse compute:
○ SQL functions to author pipelines that run merges/loads incrementally
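For a flavor of such an incremental pipeline, here is a hypothetical MERGE that a tool like dbt or Airflow might issue against a Hudi table (shown via spark.sql; the table and column names are assumptions):

```python
# Hypothetical incremental merge issued through a SQL endpoint
# (shown via spark.sql). Table and column names are assumptions.
spark.sql("""
    MERGE INTO lake.orders AS t
    USING staging.order_updates AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

Run on a schedule, a statement like this processes only the newly staged changes rather than rewriting the whole table, which is what keeps warehouse-style compute spend down.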
Onehouse SQL Editor (Preview*)

● Author and run SQL queries directly through the


Onehouse SQL Editor
● Get quick analysis at a glance
● Backfill or update critical data
● Create new tables and experiment
Announcing the Hudi 1.0 release
1. Secondary Indexing: 95% faster TPC-DS with secondary indexes you can create/drop asynchronously with simple SQL.
2. Partial Updates: 2.6x perf and 85% less write amplification, with MERGE INTO modifying only the changed fields of a record.
3. Logical Partitioning: Postgres-style expression indexes to treat partitions like coarse-grained indexes.
4. Merge Modes: First-class support for commit_time_ordering and event_time_ordering.
5. Non-blocking Concurrency: Multiple writers and compaction of the same record without blocking any involved processes.
6. LSM Timeline: A revamped timeline that lets users retain a large amount of table history.

https://hudi.apache.org/blog/2024/12/16/announcing-hudi-1-0-0
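As a sketch of the simple-SQL workflow in item 1, creating and dropping a secondary index in Hudi 1.0 looks roughly like the following; the table, index, and column names are hypothetical, and the release blog linked above has the exact syntax:

```python
# Sketch of Hudi 1.0's asynchronous secondary-index workflow via Spark SQL.
# Table, index, and column names are hypothetical.
spark.sql("CREATE INDEX idx_rider ON orders USING secondary_index(rider)")

# Point lookups on the indexed column can now prune file groups:
spark.sql("SELECT * FROM orders WHERE rider = 'rider-123'").show()

# Drop the index when it is no longer useful:
spark.sql("DROP INDEX idx_rider ON orders")
```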
Record Index performance
1. Write performance: Boosts index lookups 4-10x for large-scale datasets.
2. Read performance: On "EqualTo" or "IN" predicates over record key columns in Spark, 98% reduction in query latency on a 400GB dataset with 20,000 file groups.
3. Efficient storage:
a. Stored as a separate partition under the metadata table.
b. Maps record key to record location in HFile for fast updates and lookups.
c. Avoids the cost of gathering data from table data files.
d. Uses fewer file groups instead of all data files.
e. No linear increase with table size.

[Charts: index and write latency speedup on a 1TB dataset with 200MB batches of random updates via the Spark datasource; SQL point-lookup latency reduction on the 10TB TPC-DS store_sales table in Spark.]
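In concrete terms, the record-level index ships in OSS Hudi (0.14+) and is enabled through write configs along these lines; the DataFrame, path, and field names below are assumptions:

```python
# Enabling the record-level index (available in OSS Hudi 0.14+) on writes.
# DataFrame, path, and field names are hypothetical.
(df.write.format("hudi")
   .option("hoodie.table.name", "orders")
   .option("hoodie.datasource.write.recordkey.field", "order_id")
   .option("hoodie.datasource.write.precombine.field", "updated_at")
   # Maintain the record index inside the metadata table ...
   .option("hoodie.metadata.record.index.enable", "true")
   # ... and use it to locate records during upserts.
   .option("hoodie.index.type", "RECORD_INDEX")
   .mode("append")
   .save("s3://my-bucket/lake/orders"))
```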
Onehouse Table Optimizer
Operating open table formats is hard…

● Apache Hudi requires maintenance for cleaning, clustering, compaction, file-sizing, etc.
● There are many config parameters to tune: frequency, budgets, triggers, partition spread, parallelism, retention, concurrency, etc. (a sample follows below).
● Inline services compete with write operations.
● Getting your table service configs right can result in 2-100x perf swings.
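For a flavor of the knobs involved, here is a small sample of the OSS Hudi table service configs a team would otherwise tune by hand; the values and sort columns shown are illustrative assumptions, not recommendations:

```python
# A small sample of the OSS Hudi table service knobs that otherwise need
# hand tuning. Values and column names are illustrative, not recommendations.
hudi_service_opts = {
    # Frequency: schedule compaction every N delta commits (MoR tables).
    "hoodie.compact.inline.max.delta.commits": "5",
    # Retention: how many commits the cleaner keeps.
    "hoodie.cleaner.commits.retained": "10",
    # File sizing: files below this size are candidates for bin-packing.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    # Clustering: run asynchronously, sorted by common query predicates.
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "customer_id,order_ts",
    # Target file size produced by clustering.
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
}
```

Each of these interacts with write latency and query performance, which is what makes DIY tuning costly and the perf swings so wide.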
Operating Hudi pipelines

[Diagram: existing pipelines (Bronze → Silver → Gold) face inline resource contention or a complex operational burden. With Onehouse Table Optimizer, managed Hudi table services run alongside the existing Hudi pipelines.]

Onehouse hands-free optimizations:
● Async + on-demand
● Auto-scaling + spot nodes
● Off-peak-hours execution
Table Optimizer

→ Auto file-sizing: never worry about small files again.
→ Auto cleaning: enforce retention and clean old versions and failed commits.
→ Zero-config compaction: set-it-and-forget-it continuous merging of Hudi MoR Parquet base and Avro log files.
→ Adaptive clustering: incremental + global clustering with several sorting algorithms, including linear, z-order, and Hilbert curves.
→ Partition monitoring: monitoring and dashboards to identify partition skew over time.
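In OSS Hudi, the equivalent one-off operations can be triggered by hand through Spark SQL procedures, roughly as below (procedure names per the OSS Hudi Spark procedures; exact options vary by version, and the table name and sort columns are hypothetical). Table Optimizer's value is scheduling and tuning these continuously rather than running them ad hoc:

```python
# Running Hudi table services by hand with Spark SQL procedures: the
# operations Table Optimizer schedules and tunes automatically.
# Table name and sort columns are hypothetical.

# Compaction: merge MoR Parquet base files with their Avro log files.
spark.sql("CALL run_compaction(op => 'scheduleandexecute', table => 'orders')")

# Clustering: rewrite file groups sorted by the given columns.
spark.sql("CALL run_clustering(table => 'orders', order => 'customer_id,order_ts')")

# Cleaning: remove file versions beyond the configured retention.
spark.sql("CALL run_clean(table => 'orders')")
```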
Table Optimizer Architecture

Table Services: Clustering; Cleaning; Compaction; Multi-catalog sync; Table format interop; More coming soon ++
Infrastructure: Serverless in-VPC dataplane; Auto-scaling Spark clusters; Infra-as-Code APIs; Monitoring/alerting; Multiplexed compute; Hudi Enterprise Support
Table Services Impact

20-30% storage savings: Cleaning enforces retention and cleans Hudi timeline versions.
10-50% faster writes: Async table services eliminate writer compute contention.
2-30x faster queries: Clustering sorts data to accelerate query access.

Universal Table Services interoperability:
1. Multi-catalog sync makes data analytics-ready in any engine.
2. XTable conversion produces tables as Hudi, Iceberg, or Delta.
Infrastructure Impact

~80% engineering time savings:
1. Hands-free infrastructure: removes complex DIY tuning, configuring, and troubleshooting of Spark clusters and Hudi table services.
2. Enterprise monitoring and alerting: dashboards with the Hudi timeline, job performance, and resource usage; configurable alerts to monitor pipeline health.
3. 24/7 production support, with Onehouse expert guidance on table optimization strategies.

10-70% compute savings:
1. Advanced auto-scaling: pushes table maintenance to smaller SKUs or spot nodes for cost savings.
Advanced Monitoring/Alerting
● Per-job cost attribution tracking
● Advanced lag and latency detection with configurable alert thresholds
● Detailed stats on bytes written per operation type, cluster utilization, failures, and much more
● In-depth timeline of table-level operations to audit all writers and services that touch the table
● Critical Hudi metrics, including partition skew, file sizes, compaction backlog, and concurrent operations
Table Optimizer Roadmap

Now:
● Auto file-sizing: never worry about small files again
● Auto cleaning: enforce retention and clean versions and failed commits
● Zero-config compaction: set-it-and-forget-it continuous merging of Hudi MoR Parquet base and Avro log files
● Adaptive clustering: incremental + global clustering with several sorting algorithms, including linear, z-order, and Hilbert curves
● Partition monitoring: monitoring and dashboards to identify partition skew over time

Next:
● Compaction Accelerator: out-of-the-box 2-10x perf improvements over OSS
● TTL management: set and enforce data expiry lifetimes
● Compaction Balancer: balances compaction patterns across recent partitions for cost/performance
● Intelligent Clustering: monitors storage access patterns to automatically select clustering strategies
● Auto-indexing subsystem: given write patterns, applies indexes for acceleration
● Fully-managed lock manager: a hosted, auto-configured locking service
● Savepoint/Restore: automatically takes savepoints, with an on-demand restore service
● Iceberg/Delta: native optimizations

LIVE DEMO
Thank You
