
Storage used to be so basic—

basically available, soft-state, eventually-consistent
Adi Polak & Einat Orr
Agenda
1 Introduction
2 CI/CD for data
3 Open sourcing lakeFS & the power of community
1 Introduction
Speakers
Einat Orr, PhD.
CEO & Co-founder
Treeverse
@EinatOrr

Adi Polak
Senior Cloud Advocate
Microsoft
@AdiPolak
Object storage is the present
and future of data lakes
The data lake advantage
Scalability and cost effectiveness

Accessibility and ease of use

High throughput

Rich application ecosystem


The data lake challenge



Inability to experiment, compare and reproduce

Difficult to enforce data best practices

Hard to ensure high quality data


2 CI/CD for data


IN A PERFECT WORLD


We would manage data from dev to production the way we manage code
Open-source atomic versioned
data lake on top of object storage
Delivering true CI/CD for data
Data development environment
Experimentation: try tools and code in isolation

Reproducibility: go back to any point in time for both your code and your data

Compare: tools, code or different versions of your data

[Diagram: branches "main" and "experiment-1", with a "revert" label]
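
The branch, compare and revert workflow on this slide can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not the lakeFS API: `ToyRepo`, `Commit` and all method names are hypothetical, and `revert` here simply moves the branch pointer back to its parent commit, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot: sorted (path, content_id) pairs plus a parent link."""
    snapshot: tuple
    parent: "Commit | None" = None


class ToyRepo:
    """Hypothetical toy model of branches pointing at immutable commits."""

    def __init__(self):
        self.branches = {"main": Commit(snapshot=())}

    def create_branch(self, name, source):
        # Branching copies a pointer, not the data itself.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        # Build a new snapshot from the current one plus the changes.
        head = self.branches[branch]
        merged = dict(head.snapshot)
        merged.update(changes)
        self.branches[branch] = Commit(tuple(sorted(merged.items())), parent=head)

    def revert(self, branch):
        # Simplification: step the branch back to its parent commit.
        parent = self.branches[branch].parent
        if parent is None:
            raise ValueError("nothing to revert")
        self.branches[branch] = parent

    def diff(self, left, right):
        # Paths whose content differs between the two branch heads.
        a = dict(self.branches[left].snapshot)
        b = dict(self.branches[right].snapshot)
        return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


repo = ToyRepo()
repo.commit("main", {"events/001.parquet": "v1"})
repo.create_branch("experiment-1", source="main")
repo.commit("experiment-1", {"events/001.parquet": "v2"})
changed = repo.diff("main", "experiment-1")   # experiment differs from main
repo.revert("experiment-1")
after_revert = repo.diff("main", "experiment-1")  # back in sync with main
```

Because the experiment lives on its own branch, `main` never sees the change until someone decides to merge it.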
Continuous data integration
Ingest new data safely

Enforce best practices

Metadata validation: prevent breaking changes from entering your production data environment

merge changeset:
✓ 001.parquet
✓ 002.parquet
x random.csv
[Diagram: branches "new-data-1" and "main"]
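
The merge changeset shown on this slide can be read as a pre-merge check. Below is a minimal sketch of such a check, assuming the policy is "only Parquet files may enter main"; the function names and the `ALLOWED_SUFFIXES` policy are hypothetical and are not part of any lakeFS API.

```python
from pathlib import PurePosixPath

# Assumed policy for illustration: only Parquet files are allowed.
ALLOWED_SUFFIXES = {".parquet"}


def validate_changeset(paths):
    """Split a changeset into accepted and rejected object paths."""
    accepted, rejected = [], []
    for path in paths:
        if PurePosixPath(path).suffix.lower() in ALLOWED_SUFFIXES:
            accepted.append(path)
        else:
            rejected.append(path)
    return accepted, rejected


def can_merge(paths):
    """A merge is allowed only if no object in the changeset is rejected."""
    _, rejected = validate_changeset(paths)
    return not rejected


changeset = ["001.parquet", "002.parquet", "random.csv"]
accepted, rejected = validate_changeset(changeset)
for path in accepted:
    print("ok      ", path)
for path in rejected:
    print("rejected", path)
print("merge allowed:", can_merge(changeset))
```

Rejecting the whole merge when a single object fails keeps the target branch in a known-good state, which is the point of validating before data reaches production.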
Continuous data deployment
Prevent data quality issues by testing production data before exposing it to users

Test DAG intermediate results: avoid cascading quality issues

Instantly revert changes to data

Commit metadata
topic_name = events
topic_offset = 1761348
job_git_commit = 60c3fa

[Diagram: branches "stream" and "main"]
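
One way to read the commit metadata on this slide: if every commit records the stream position and the job version that produced it, a revert also tells you where to resume consuming. The sketch below assumes that `topic_offset` is the last offset included in a commit and that metadata values are stored as strings; the helper names, the earlier offset and the earlier git hash are made up for illustration.

```python
def build_commit_metadata(topic_name, topic_offset, job_git_commit):
    """Key/value metadata attached to a data commit (values kept as strings)."""
    return {
        "topic_name": topic_name,
        "topic_offset": str(topic_offset),
        "job_git_commit": job_git_commit,
    }


def resume_offset_after_revert(history):
    """Given commit metadata ordered oldest to newest, return the offset to
    resume from once the newest commit has been reverted.

    Assumption: topic_offset is the last offset included in a commit,
    so consumption resumes at the following offset.
    """
    if len(history) < 2:
        raise ValueError("no earlier commit to fall back to")
    previous = history[-2]
    return int(previous["topic_offset"]) + 1


history = [
    # Earlier commit: offset and hash are illustrative values only.
    build_commit_metadata("events", 1761000, "5b1e2d"),
    # Latest commit: the values shown on the slide.
    build_commit_metadata("events", 1761348, "60c3fa"),
]
resume_at = resume_offset_after_revert(history)
print("re-consume", history[-1]["topic_name"], "from offset", resume_at)
```

Recording `job_git_commit` alongside the offset ties each version of the data to the exact code that wrote it.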
Demo
Einat Orr
3 Open sourcing lakeFS & the power of community
Resources

Check out the docs: docs.lakefs.io

Join the Slack Channel: lakefs.io/slack

Contribute & star repo: treeverse/lakeFS

Follow on Twitter: @lakeFS


© Copyright Microsoft Corporation. All rights reserved.
