
Governance with Unity Catalog for data and AI

©2024 Databricks Inc. — All rights reserved


Gopala Raju
Governance Product Specialist, APJ

• 2+ years with Databricks, 19+ years of experience
• 10+ years of experience in Data and Analytics
• Governance lead for APJ
Agenda

▪ Introduction
▪ Unity Catalog overview
▪ Key capabilities & Lakehouse Federation
▪ Sharing and Collaboration
▪ Unity Catalog on different clouds
▪ Upgrading to Unity Catalog
▪ Demo



Housekeeping

▪ This presentation will be recorded and we will share these materials after
the session, within 48 hours
▪ There are no hands-on components so you only need to take notes
▪ Use the Q&A function to ask questions
▪ If we do not answer your question during the event, we will follow up with
you afterwards to get you the information you need!
▪ Please fill out the survey at the end of the session so that we can improve
our future sessions



Data and AI governance
drives business value
“Organizations are finally realizing the value of data as an asset that needs
to be protected, managed and maintained to increase asset value”

IDC

“Organizations seeing the highest returns from AI have a framework for
AI governance to cover every step of the model development process”

The State of AI in 2022, McKinsey & Co

“AI is now an enterprise essential, and as such, AI governance
will join cybersecurity and compliance as a board-level topic”

Forrester, 2023 AI Predictions report


Today, data and AI governance is complex

Data consumers (data analysts, data engineers, ML engineers) ask:
▪ “Where to discover the datasets, models, notebooks, dashboards?”
▪ “Can I trust the data and ML models?”

The data governance team manages permissions on files (data lake), on tables,
rows and columns (data warehouse), on ML models and features, and on reports
and dashboards (applications, BI dashboards), and asks:
▪ “Who is accessing these assets and how?”
▪ “Are we meeting the regulatory compliance?”


Today, data and AI governance is complex

The same data consumers and governance team face:
▪ A fragmented view of the data and AI estate
▪ Disjoint tools for access management and observability
▪ Incomplete monitoring
▪ Lack of cross-platform secure data sharing

The consequences:
▪ Reduced pace of innovation
▪ Increased data breach risk, operational expenses
▪ Non-compliance risk, reputational harm
▪ Costly data sharing, untapped monetization


Databricks Unity Catalog
Unified governance for data and AI
Unified visibility into data and AI
Single permission model for data and AI
AI-powered monitoring and observability
Open data sharing

Databricks Unity Catalog capabilities:
Access Controls | Lineage | Discovery | Monitoring | Auditing | Sharing

Metadata Management
(Files | Tables | ML Models | Notebooks | Dashboards)


Lakehouse Federation
Discover, query, and govern all your data - no matter where it lives

● Build a unified view of your data estate
● Query and combine all data efficiently with a single engine
● Safeguard data across data sources


Databricks Lakehouse unifies data and AI governance

BI & Data Warehousing | Data Engineering | Data Streaming | Data Science & ML
Open Interfaces | External Compute Platforms

Databricks Unity Catalog
One governance model for structured and unstructured data + AI

Cloud Data Lake | External Catalogs | External Data Sources
All structured, semi-structured, and unstructured data


Unified visibility into data and AI
■ Discover and classify structured and unstructured data, files, notebooks,
ML models, and dashboards in one place

■ Consolidate and query data from other databases and data warehouses using
a single point of access, without moving or copying the data

■ Build a better understanding of your data estate with automated lineage,
tags, and auto-generated data insights

■ Boost productivity by searching, understanding, and gaining insights from
your data and AI assets, using natural language
Open data sharing
■ Avoid vendor lock-in with open source Delta Sharing for seamless data
sharing across clouds, regions, and platforms, without replication

■ Share more than just data: notebooks, ML models, dashboards, applications

■ Explore and monetize data products through an open marketplace

■ Collaborate securely on sensitive data with scalable data clean rooms


Unity Catalog Powers Databricks Data Intelligence Platform

▪ Data Science & AI: Mosaic AI (create, tune, and serve custom LLMs)
▪ ETL & Real-time Analytics: Delta Live Tables (automated data quality)
▪ Orchestration: Workflows (job cost optimized based on past runs)
▪ Data Warehousing: Databricks SQL (Text-to-SQL, Text-to-Viz)

Data Intelligence Engine
Use generative AI to understand the semantics of your data

Unity Catalog: unified security, governance, and cataloging; securely get
insights in natural language

Delta Lake: unified data storage for reliability and sharing; data layout is
automatically optimized based on usage patterns

Open Data Lake: all raw data (logs, texts, audio, video, images)
Unlock the full Databricks experience
Unity Catalog is central to the Data Intelligence Platform

▪ Lakehouse Monitoring: end-to-end monitoring of data and ML models for
quality and drift
▪ Databricks Assistant: a context-aware AI assistant that integrates
throughout the platform to improve productivity via a conversational interface
▪ Lakehouse IQ: a knowledge engine that learns the unique nuances of your
business and data to power natural language access
▪ AI Accelerated Performance: automated organization of data, performance
improvements, predictive I/O, and serverless


Unity Catalog Overview



Centralized metadata and controls
One metadata layer across file and database sources superpowers governance

Without Unity Catalog: each Databricks workspace carries its own user
management, metastore, and access controls alongside its clusters and SQL
warehouses, duplicated per workspace.

With Unity Catalog: user management, the metastore, access controls, and
foreign databases are managed once in Unity Catalog and shared across
Databricks workspaces; clusters and SQL warehouses remain per workspace.


Fundamental Concepts

Working with file based data sources
● Credentials
○ Cloud provider credential to connect to storage
● External Locations
○ Storage location used for external tables, external volumes, or arbitrary
files, or default managed location for a catalog or schema
● Managed / External Tables
○ Tabular data stored in managed or external locations
● Managed / External Volumes
○ Arbitrary file container inside a managed or external location

Working with databases
● Connections
○ Credential and connection information to connect to an external database
● Foreign Catalogs
○ A catalog that represents an external database in UC and can be queried
alongside managed data sources and file sources


Governed namespace across file and database sources
Access the legacy metastore and foreign databases, powered by Lakehouse
Federation

Unity Catalog spans:
● hive_metastore (legacy), e.g. default (database) containing customers (table)
● Catalog 1, containing Schema with external tables, managed / external
volumes, models / functions, views, and managed tables
● Foreign Catalog, containing Foreign Schema with Foreign Tables

SELECT * FROM main.paul.red_wine; -- <catalog>.<database>.<table>

SELECT * FROM hive_metastore.default.customers;

SELECT * FROM snowflake_warehouse.some_schema.some_table;
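As a sketch of how three-level name resolution works, the following Python helper (hypothetical, not a Databricks API; the default catalog/schema values are assumptions for illustration) splits a fully or partially qualified name into its catalog, schema, and table parts:

```python
def parse_three_level_name(name: str) -> tuple:
    """Split a Unity Catalog style name into (catalog, schema, table).

    Shorter names are resolved against assumed session defaults,
    mirroring how a current catalog/schema would apply. Illustrative
    only; real resolution is done by the query engine.
    """
    parts = name.split(".")
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:   # schema.table -> assume current catalog
        return ("hive_metastore", parts[0], parts[1])
    if len(parts) == 1:   # table -> assume current catalog and schema
        return ("hive_metastore", "default", parts[0])
    raise ValueError(f"too many name parts: {name}")

print(parse_three_level_name("main.paul.red_wine"))
# ('main', 'paul', 'red_wine')
```

The same convention covers legacy Hive tables and foreign catalogs, which is why one SELECT syntax works across all three examples above.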


Centralized Access Controls
Centrally grant and manage access permissions across workloads and foreign
databases

Using ANSI SQL DCL:

GRANT <privilege> ON <securable_type> <securable_name> TO `<principal>`

GRANT SELECT ON iot.events TO engineers

Using the UI: choose the permission level (a ‘Table’ is a collection of files
in S3/ADLS) and sync groups from your identity provider.
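To show how the DCL template composes at scale, here is a small Python sketch (a hypothetical helper, not part of any Databricks SDK) that renders GRANT statements from a principal-to-privileges map:

```python
def render_grants(securable_type: str, securable_name: str,
                  grants: dict) -> list:
    """Render ANSI-style GRANT statements, one per privilege.

    `grants` maps each principal to the privileges it should receive.
    Principals are backtick-quoted, matching the slide's template.
    """
    statements = []
    for principal, privileges in sorted(grants.items()):
        for privilege in privileges:
            statements.append(
                f"GRANT {privilege} ON {securable_type} "
                f"{securable_name} TO `{principal}`"
            )
    return statements

for stmt in render_grants("TABLE", "iot.events", {"engineers": ["SELECT"]}):
    print(stmt)
# GRANT SELECT ON TABLE iot.events TO `engineers`
```

Generated statements like these can then be run through a cluster or SQL warehouse, keeping grants reviewable in version control.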


Row Level Security and Column Level Masking (In Preview)

Provide differential fine grained access to file based datasets and foreign
tables

Only show specific rows:

CREATE FUNCTION <name> (<parameter_name> <parameter_type>, ...)
RETURN {filter clause whose output must be a boolean}

CREATE FUNCTION us_filter(region STRING)
RETURN IF(IS_MEMBER('admin'), true, region = 'US');

ALTER TABLE sales SET ROW FILTER us_filter ON (region);

Test for group membership, specify filter predicates, and assign the reusable
filter to a table.

Mask or redact sensitive columns:

CREATE FUNCTION <name> (<parameter_name> <parameter_type> [, <column>...])
RETURN {expression with the same type as the first parameter}

CREATE FUNCTION ssn_mask(ssn STRING)
RETURN IF(IS_MEMBER('admin'), ssn, '****');

ALTER TABLE users ALTER COLUMN ssn SET MASK ssn_mask;

Test for group membership, specify the mask function, and assign the reusable
mask to a column.
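The semantics of the two SQL functions above can be mimicked in plain Python as a toy model (IS_MEMBER is stubbed with a set of group names; this is an illustration of the behavior, not how the engine evaluates it):

```python
def is_member(group: str, user_groups: set) -> bool:
    """Stand-in for SQL IS_MEMBER(): is the current user in `group`?"""
    return group in user_groups

def us_filter(region: str, user_groups: set) -> bool:
    """Row filter: admins see every row, others only region == 'US'."""
    return True if is_member("admin", user_groups) else region == "US"

def ssn_mask(ssn: str, user_groups: set) -> str:
    """Column mask: admins see the real SSN, others see '****'."""
    return ssn if is_member("admin", user_groups) else "****"

rows = [{"region": "US", "ssn": "123-45-6789"},
        {"region": "EU", "ssn": "987-65-4321"}]

analyst = set()  # a user who is not in the 'admin' group
visible = [{**r, "ssn": ssn_mask(r["ssn"], analyst)}
           for r in rows if us_filter(r["region"], analyst)]
print(visible)  # [{'region': 'US', 'ssn': '****'}]
```

An admin (a user whose group set contains "admin") would instead see both rows with unmasked SSNs, which is exactly the branch the IF() in each SQL function encodes.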


Access data from specified environments only
Restrict catalog access by environment or purpose

Access to data and the availability of data can be isolated across workspaces
and groups. Within one metastore, catalogs can be bound to workspaces and
groups, for example:

Catalog        Workspace        Group
dev            dev_ws           Developers
staging        staging_ws       Testers
prod           prod_ws          Analysts
bu_1_dev       bu_dev_stg_ws    BU Developers
bu_1_staging   bu_dev_stg_ws    BU Testers
bu_1_prod      bu_prod_ws       BU Users


High Leverage Governance with Terraform & APIs
Use data-sec-ops, policy-as-code patterns to scale your efforts

• Privileges for UC objects can be managed programmatically using the
Databricks Terraform provider, especially for teams already using Terraform
• This pairs naturally with managing the UC objects themselves (metastore,
catalogs, assignments, etc.)
• If not already using Terraform, maybe now is a good time!

resource "databricks_grants" "sandbox" {
  provider = databricks.workspace
  catalog  = databricks_catalog.sandbox.name
  grant {
    principal  = "Data Scientists"
    privileges = ["USAGE", "CREATE"]
  }
  grant {
    principal  = "Data Engineers"
    privileges = ["USAGE"]
  }
}
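For teams driving grants through APIs rather than Terraform, a minimal Python sketch of the same policy-as-code idea is to build the grant changes as data. The {"changes": [{"principal": ..., "add": [...]}]} shape below is an assumption for illustration; check the Unity Catalog REST API reference for the exact request schema before using it:

```python
def grants_payload(changes: dict) -> dict:
    """Build a grants update payload mirroring the Terraform resource.

    `changes` maps principals to the privileges to add. The payload
    shape here is assumed, not taken from official API docs; verify
    against the Unity Catalog permissions API reference.
    """
    return {"changes": [{"principal": p, "add": privs}
                        for p, privs in sorted(changes.items())]}

payload = grants_payload({
    "Data Scientists": ["USAGE", "CREATE"],
    "Data Engineers": ["USAGE"],
})
print(payload["changes"][0]["principal"])  # Data Engineers
```

Keeping the desired grants in a reviewed data structure, and diffing it against the current state before applying, is the core of the data-sec-ops pattern the slide describes.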


Governance for file-based and database sources



Query Federation (In Preview)
Unify your entire data estate with the lakehouse

Query Federation provides one single point of secure access to all your data -
no matter where it lives - and one way to access, catalog, govern, and query
all your data, with no ingestion required.

● Unified permission controls
● Intelligent pushdown optimizations
● Accelerated query performance with Materialized Views
● Support for read-only operations today

CREATE FOREIGN CATALOG <catalog_name>
USING CONNECTION <connection_name>
OPTIONS (database '<remote_database>')

SELECT * FROM <catalog_name>.<schema_name>.<table_name>


Volumes in Unity Catalog (In Preview)
Access, store, organize and process files with Unity Catalog governance

- Volumes can be accessed with POSIX-style commands:

  dbutils.fs.ls("s3://my_external_location/Volumes/catalog/schema/volume123")
  ls /Volumes/catalog/schema/volume123

- Volumes are created under Managed or External Locations (in cloud storage:
S3, ADLS, GCS) and show up in UC Lineage
- Volumes add governance over non-tabular data sets:
  - Unstructured data, e.g., image, audio, video, or PDF files, used for ML
  - Semi-structured training, validation, and test data sets used in ML model
    training
  - Raw data files used for ad-hoc or early stage data exploration, or saved
    outputs
  - Library or config files used across workspaces
  - Operational data, e.g., logging or checkpointing output files
- Tables are registered in Managed / External Locations, not in Volumes
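The /Volumes path convention above lends itself to a small helper. This Python sketch (hypothetical, illustration only) builds a POSIX-style volume path from its Unity Catalog parts:

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path.

    Rejects empty or slash-containing identifiers so the three-level
    structure stays intact; trailing `parts` are file path segments.
    """
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid identifier: {name!r}")
    return "/".join(["/Volumes", catalog, schema, volume, *parts])

print(volume_path("catalog", "schema", "volume123", "raw", "img001.png"))
# /Volumes/catalog/schema/volume123/raw/img001.png
```

Because the path embeds catalog and schema, the same grants model that governs tables applies to the files addressed this way.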
Governed namespace across file and database sources
Access the legacy metastore and foreign databases, powered by Query Federation

Unity Catalog spans:
● hive_metastore (legacy), e.g. default (database) containing customers (table)
● Catalog 1, containing Schema with managed / external tables, managed /
external volumes, models / functions, and views
● Foreign Catalog, containing Foreign Schema with Foreign Tables

SELECT * FROM main.paul.red_wine; -- <catalog>.<database>.<table>

SELECT * FROM hive_metastore.default.customers;

SELECT * FROM snowflake_warehouse.some_schema.some_table;


Defining file based data sources in Unity
Simplify data access management across clouds

Unity Catalog holds External Locations & Credentials and enforces access
control over cloud storage (S3, ADLS, GCS):

● Managed: managed tables live in the managed location defined on the schema
or catalog
● External: external tables and volumes map to paths inside an External
Location
● Users access both through a cluster or SQL warehouse, governed by Unity
Catalog


Querying file based data sources with Unity

Setup (admin):
● Creates an IAM role (AWS) / Managed Identity (Azure) / Service Account (GCP)
● Creates storage credentials / external locations in Unity Catalog
● Defines access policies in Unity Catalog

Query flow:
1. User sends a query (SQL, Python, R, Scala) to a cluster or SQL warehouse
2. Unity Catalog checks namespace, metadata, and grants, and writes an audit
log
3. Unity Catalog assumes the IAM role / Managed Identity / Service Account
4. Unity Catalog returns a list of paths / data files and scoped-down
temporary tokens
5. The cluster or warehouse requests / ingests data from the paths / data
files with the temporary tokens
6. Cloud storage (S3, ADLS) returns the data
7. The cluster enforces policies
8. The result is sent to the user
Querying database sources with Unity

Setup (admin):
● Defines a Connection object with JDBC connection information to the database
● Registers a foreign catalog
● Defines access policies in Unity Catalog

Query flow:
1. User sends a query (SQL, Python, R, Scala) to a cluster or SQL warehouse
2. Unity Catalog checks namespace, metadata, and grants, enforces policies,
and writes an audit log
3. Unity Catalog returns encrypted credential information to the cluster or
warehouse
4. The cluster or warehouse requests / ingests data with predicate pushdown
by directly connecting to the database (Snowflake, SQL Server, Postgres,
MySQL, etc.)
5. Metadata changes in the data source are pushed to the UC control plane
6. The database returns the data
7. The result is sent to the user


Isolation between file based data sources
Use managed data sources for data isolation or cost allocation

Managed storage can be defined at three levels, each mapping to its own
container / bucket in cloud storage (S3, ADLS, GCS):
1) Store at the metastore (default for the whole metastore)
2) Store at the catalog (e.g. Catalog1, Catalog2)
3) Store at the schema (e.g. Schema3)


Multiple ACL trees for flexible governance
Govern external tables and file based data source access separately

● READ/WRITE on an External Location (and its Volumes) is governed separately
from table privileges (e.g. SELECT only) in cloud storage (S3, ADLS, GCS)
● Writes to a table, or to table data via a path, do not use the WRITE
permission from the External Location
Discover your data with search and lineage


Why is data lineage important?

Compliance
● Regulatory requirements to verify data lineage
● Track the spread of sensitive data across datasets

Discovery
● Understand context and trustworthiness of data before using it in analytics
● Prevent duplicative work

Observability
● Track down issues / discrepancies in reports by tracing back the data
● Analyze impact of proposed data changes to downstream reports, e.g. column
deprecation
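The impact-analysis use case above is essentially a graph traversal. A minimal Python sketch, assuming a hypothetical map of table-to-downstream-table edges (such as one extracted from lineage data), finds everything affected by a change:

```python
from collections import deque

def downstream_impact(edges: dict, start: str) -> set:
    """Breadth-first search for every asset downstream of `start`.

    `edges` maps a table name to the list of assets that read from it.
    Visited-set bookkeeping keeps the walk safe on shared descendants.
    """
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

lineage = {
    "login_data_bronze": ["login_data_silver"],
    "login_data_silver": ["churn_features", "weekly_report"],
}
print(sorted(downstream_impact(lineage, "login_data_bronze")))
# ['churn_features', 'login_data_silver', 'weekly_report']
```

Swapping the direction of the edges gives upstream tracing for debugging report discrepancies, the observability case in the last column.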


Automated lineage for all workloads
End-to-end visibility into how data flows and is consumed in your organization

● Auto-capture runtime data lineage on a Databricks cluster or SQL warehouse
● Leverage the common permission model from Unity Catalog
● Lineage across tables, columns, dashboards, workflows, notebooks, files,
external sources, and models


Built-in search and discovery
Accelerate time to value with low latency data discovery

● Unified UI to search for data assets stored in Unity Catalog
● Leverage the common permission model from Unity Catalog
● Tag Column, Table, Schema, and Catalog objects in UC
● Search for objects by tags

Recommendation: use comments and tag your data assets on ingest.
Audit your data



System Tables: Object Metadata
Answer questions about the state of objects in the catalog

What tables are in the sales catalog?
SELECT table_name
FROM system.information_schema.tables
WHERE table_catalog = 'sales'
AND table_schema != 'information_schema';

Who last updated the gold tables and when?
SELECT table_name, last_altered_by, last_altered
FROM system.information_schema.tables
WHERE table_schema = 'churn_gold'
ORDER BY 1, 3 DESC;

Who owns this gold table?
SELECT table_owner
FROM system.information_schema.tables
WHERE table_catalog = 'retail_prod' AND table_schema = 'churn_gold'
AND table_name = 'churn_features';

Who has access to this table?
SELECT grantee, table_name, privilege_type
FROM system.information_schema.table_privileges
WHERE table_name = 'login_data_silver';


System Tables: Audit Logs (In Preview)
Near-real time: see who accessed what, and when

Who accesses this table the most?
SELECT user_identity.email, count(*)
FROM system.operational_data.audit_logs
WHERE request_params.table_full_name = 'main.uc_deep_dive.login_data_silver'
AND service_name = 'unityCatalog'
AND action_name = 'generateTemporaryTableCredential'
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;

What has this user accessed in the last 24 hours?
SELECT request_params.table_full_name
FROM system.operational_data.audit_logs
WHERE user_identity.email = 'ifi.derekli@databricks.com'
AND service_name = 'unityCatalog'
AND action_name = 'generateTemporaryTableCredential'
AND datediff(now(), created_at) < 1;

Who deleted this table?
SELECT user_identity.email
FROM system.operational_data.audit_logs
WHERE request_params.full_name_arg = 'main.uc_deep_dive.login_data_silver'
AND service_name = 'unityCatalog'
AND action_name = 'deleteTable';

What tables does this user access most frequently?
SELECT request_params.table_full_name, count(*)
FROM system.operational_data.audit_logs
WHERE user_identity.email = 'ifi.derekli@databricks.com'
AND service_name = 'unityCatalog'
AND action_name = 'generateTemporaryTableCredential'
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;


System Tables: Billing Logs (In Preview)
Understand cost allocation across your data estate

What is the daily trend in DBU consumption?
SELECT date(created_on) AS `Date`, sum(dbus) AS `DBUs Consumed`
FROM system.operational_data.billing_logs
GROUP BY date(created_on)
ORDER BY date(created_on) ASC;

Which 10 users consumed the most DBUs?
SELECT tags.creator AS `User`, sum(dbus) AS `DBUs`
FROM system.operational_data.billing_logs
GROUP BY tags.creator
ORDER BY `DBUs` DESC
LIMIT 10;

How many DBUs of each SKU have been used so far this month?
SELECT sku AS `SKU`, sum(dbus) AS `DBUs`
FROM system.operational_data.billing_logs
WHERE month(created_on) = month(CURRENT_DATE)
GROUP BY sku
ORDER BY `DBUs` DESC;

Which jobs consumed the most DBUs?
SELECT tags.JobId AS `Job ID`, sum(dbus) AS `DBUs`
FROM system.operational_data.billing_logs
GROUP BY `Job ID`;


System Tables: Lineage Data (In Preview)
Query upstream and downstream sources in one place

What tables are sourced from this table?
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = 'login_data_bronze';

What user queries read from this table?
SELECT DISTINCT entity_type, entity_id, source_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = 'login_data_silver';


Open Collaboration Powered by Unity Catalog



Delta Sharing
An open standard for secure sharing of tables, views, files, models, and more

Data provider: a Delta Lake table / view / file / model (and more) is exposed
through a Delta Sharing server.
Data consumer: any compatible client reads it over the open Delta Sharing
protocol.

● Share cross-platform with an open protocol
● Share data with no replication


Databricks Marketplace
An open marketplace for data, analytics, and AI

● Data sets, notebooks, ML models and applications from top data & solution
providers
● Public marketplace, private exchanges
● Open for Databricks & non-Databricks users

Assets available: data files, data tables, dashboards, solution accelerators,
ML models, notebooks.
Clean rooms
Secure environments to run computations on joint data

Example: Collaborator 1 (e.g. a publisher) contributes hashed_user_id, age,
income, ad_id, impressions, and clicks; Collaborator 2 (e.g. an advertiser)
contributes hashed_user_id and conversion_event. The clean room answers "How
did my campaign do for our common users?" while each collaborator's sensitive
data stays owned by that collaborator, in a secure, privacy preserving
environment.


Unity Catalog and cloud providers



Databricks Accounts and Cloud Providers

Azure: the Databricks account (managed through the account console) maps to
an AAD tenant; Azure subscriptions under the tenant contain the Databricks
workspaces.


Databricks Accounts and Cloud Providers

AWS: the Databricks account (account console) maps to an AWS Organizational
Unit; AWS accounts under it contain the Databricks workspaces.

GCP: the Databricks account (account console) maps to a Marketplace account
entitlement; GCP projects contain the Databricks workspaces.


Unity Catalog and Cloud Constructs

                     AWS          Azure              GCP
Databricks Account   Accounts     Tenant             Marketplace Account
Metastore            Region       Region             Region
Catalog              Account*     Subscription*      Project*
Storage Location     S3 Bucket    ADLS Account       GCS Bucket
Credential           IAM Role     Managed Identity   Service Account

* Minimum one; more are optional


Upgrade to Unity Catalog



How to upgrade to Unity Catalog
Steps to consider for a full upgrade

Planning / Setup
1. UC Design: catalogs, workspaces, account groups, default roles
2. One Time Setup: create metastore, identity federation, join workspaces

Data / Workload Upgrade
3. Create UC Objects: storage credentials, external locations, catalogs, set
owners
4. Upgrade Legacy Metadata: SYNC external tables/schemas, migrate managed
tables and files
5. Grant Access: catalogs, schemas, tables, files
6. Upgrade Workloads: create clusters, create jobs, update notebooks,
downstream tools

Cleanup
7. Decommissioning: old pipelines, old clusters, hive_metastores, mounts
An example transition
Bring your readers

Old jobs keep writing to the Hive metastore while SYNC mirrors the table
metadata into Unity Catalog over the same cloud storage (S3, ADLS, GCS).
Reader 1, Reader 2, and new jobs move to Unity Catalog first; old jobs
continue against Hive until they are upgraded.


Upgrading Hive tables to Unity
Managed & External tables - use the SYNC command

• Run multiple times to pull changes from the Hive/Glue database into Unity
Catalog over time
• Use a job for long term synchronization
• Use the DRY RUN option to test the sync without making any changes to the
target table
• Works on Hive managed tables where schema locations are defined

SYNC SCHEMA hive_metastore.my_db TO SCHEMA main.my_db_uc DRY RUN

SYNC TABLE hive_metastore.my_db.my_tbl TO TABLE main.my_db_uc.my_tbl
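Upgrades across many tables are easy to script around SYNC. A Python sketch (hypothetical helper; unlike the slide's example, it keeps the source schema name in the target) that emits one statement per table, with an optional DRY RUN pass:

```python
def sync_statements(tables: list, target_catalog: str,
                    dry_run: bool = False) -> list:
    """Emit SYNC TABLE statements for hive_metastore tables.

    Each entry in `tables` is '<db>.<table>'; targets reuse the same
    schema and table names under `target_catalog` (an assumption for
    this sketch; adjust the mapping for renamed target schemas).
    """
    suffix = " DRY RUN" if dry_run else ""
    return [
        f"SYNC TABLE hive_metastore.{t} TO TABLE {target_catalog}.{t}{suffix}"
        for t in tables
    ]

for s in sync_statements(["my_db.my_tbl"], "main", dry_run=True):
    print(s)
# SYNC TABLE hive_metastore.my_db.my_tbl TO TABLE main.my_db.my_tbl DRY RUN
```

Running the generated statements first with dry_run=True, then for real from a scheduled job, matches the long-term synchronization pattern in the bullets above.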


Demo



Thank you!


