Databricks Unity Catalog - Jan 2024
Unity Catalog for data and AI
©2024 Databricks Inc. — All rights reserved
Agenda
▪ Introduction
▪ Unity Catalog overview
▪ Key capabilities & Lakehouse Federation
▪ Sharing and Collaboration
▪ Unity Catalog on different clouds
▪ Upgrading to Unity Catalog
▪ Demo
▪ This presentation will be recorded and we will share these materials after
the session, within 48 hours
▪ There are no hands-on components so you only need to take notes
▪ Use the Q&A function to ask questions
▪ If we do not answer your question during the event, we will follow up with
you afterwards to get you the information you need!
▪ Please fill out the survey at the end of the session so that we can improve
our future sessions
“Organizations seeing the highest returns from AI have a framework for
AI governance to cover every step of the model development process”
—
The State of AI in 2022, McKinsey & Co
[Diagram: governance challenges across personas and assets. An ML engineer asks "Are we meeting the regulatory compliance?"; permissions must be managed on ML models, reports, dashboards, applications, and BI dashboards]
Access Controls | Lineage | Discovery | Monitoring | Auditing | Sharing
Metadata Management
(Files | Tables | ML Models | Notebooks | Dashboards)
Open Interfaces
The Databricks platform:
▪ Data Science & AI (Mosaic AI): create, tune, and serve custom LLMs
▪ ETL & Real-time Analytics (Delta Live Tables): automated data quality
▪ Orchestration (Workflows): job cost optimized based on past runs
▪ Data Warehousing (Databricks SQL): Text-to-SQL, Text-to-Viz
Use generative AI to understand the semantics of your data
[Diagram: one Unity Catalog metastore, with centralized user management and access controls, shared by multiple Databricks workspaces, each running its own clusters and SQL warehouses]
● Credentials
  ○ Cloud provider credential to connect to storage
● Connections
  ○ Credential and connection information to connect to external database systems
[Diagram: the hive_metastore (legacy) catalog containing the default database and its schemas, alongside a foreign catalog (Foreign Catalog 1) containing Foreign Schema 1]
Provide differentiated, fine-grained access to file-based datasets and foreign tables

Row filters: test for group membership, assign a reusable filter to a table, and specify filter predicates
ALTER TABLE sales SET ROW FILTER us_filter ON region;

Column masks: test for group membership, assign a reusable mask to a column, and specify a mask function
ALTER TABLE users ALTER COLUMN ssn SET MASK ssn_mask;
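The filter and mask referenced in the ALTER TABLE statements are ordinary SQL UDFs registered in Unity Catalog. A minimal sketch of what us_filter and ssn_mask might look like (the group names 'admins' and 'hr_team' are hypothetical):

```sql
-- Hypothetical row filter: admins see every row, everyone else only US rows
CREATE OR REPLACE FUNCTION us_filter(region STRING)
RETURN IF(is_account_group_member('admins'), TRUE, region = 'US');

-- Hypothetical column mask: hr_team sees real SSNs, everyone else a redacted value
CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURN CASE WHEN is_account_group_member('hr_team') THEN ssn
            ELSE '***-**-****' END;
```

Because the functions are reusable securables, the same filter or mask can be attached to many tables and governed centrally.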
Access to data and availability of data can be isolated across workspaces and groups
[Diagram: metastore catalogs bound to specific workspaces and groups: the staging catalog to staging_ws (Testers), prod to prod_ws (Analysts), and bu_1_dev / bu_1_staging to bu_dev_stg_ws (BU Developers / BU Testers)]
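Group-scoped access of this kind is expressed with standard Unity Catalog grants; a hedged sketch reusing the catalog and group names from the isolation example:

```sql
-- Let the Testers group use the staging catalog and read its tables
-- (privileges granted at catalog level are inherited by schemas and tables)
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG staging TO `Testers`;

-- Analysts get read access to prod
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG prod TO `Analysts`;
```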
Query Federation
Access, store, organize and process files with Unity Catalog governance

- Volumes can be accessed by some POSIX commands, backed by a Managed / External Location in cloud storage (S3, ADLS, GCS):
dbutils.fs.ls("s3://my_external_location/Volumes/catalog/schema/volume123")
ls /Volumes/catalog/schema/volume123

Typical contents of a Volume:
- Unstructured data, e.g., image, audio, video, or PDF files, used for ML
- Semi-structured training, validation, and test data sets, used in ML model training
- Raw data files used for ad-hoc or early-stage data exploration, or saved outputs
- Library or config files used across workspaces

Tables are registered in Managed / External Locations, not in Volumes.
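Volumes are created like other Unity Catalog securables; a minimal sketch (the catalog, schema, volume, and bucket names are illustrative):

```sql
-- Managed volume: files live in the schema's managed storage location
CREATE VOLUME catalog.schema.volume123;

-- External volume: files stay in an existing external location
CREATE EXTERNAL VOLUME catalog.schema.raw_files
LOCATION 's3://my_external_location/raw_files';
```

Once created, the volume is addressable at /Volumes/catalog/schema/volume123 from notebooks and SQL.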
Governed namespace across file and database sources
Access the legacy metastore and foreign databases, powered by Query Federation
[Diagram: Unity Catalog spanning the hive_metastore (legacy) catalog with its default database, and Foreign Catalog 1 with Foreign Schema 1]
[Diagram: Unity Catalog governs access to cloud storage (S3, ADLS, GCS) through External Locations & Credentials; access-control policies are enforced before data is returned]
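An external location pairs a storage credential with a cloud storage path; a hedged sketch (the credential name and bucket path are made up, and the storage credential is assumed to already exist):

```sql
CREATE EXTERNAL LOCATION my_external_location
URL 's3://my-company-bucket/landing'
WITH (STORAGE CREDENTIAL my_storage_cred);

-- File-level access is then governed by Unity Catalog grants
GRANT READ FILES ON EXTERNAL LOCATION my_external_location TO `data_engineers`;
```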
Querying database sources with Unity Catalog

● Define a Connection object with JDBC connection information to the database
● Register a foreign catalog

1. User sends a query (SQL, Python, R, Scala) to a cluster or SQL warehouse
2. Unity Catalog checks the namespace, metadata and grants, enforces policies, and writes an audit log
3. Unity Catalog returns encrypted credential information to the cluster or warehouse
4. The cluster or warehouse requests / ingests data with predicate pushdown by connecting directly to the JDBC database (Snowflake, SQL Server, Postgres, MySQL, etc.)
5. Metadata changes in the data source are pushed to the UC Control Plane
6. The database returns data
7. The result is sent back to the user
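The Connection object and foreign catalog described above can be created in SQL; a minimal sketch (the host, database name, and secret scope/keys are hypothetical):

```sql
-- Connection holding JDBC details for a Postgres instance,
-- with credentials pulled from a secret scope
CREATE CONNECTION pg_conn TYPE postgresql
OPTIONS (
  host 'db.example.com',
  port '5432',
  user secret('jdbc_scope', 'pg_user'),
  password secret('jdbc_scope', 'pg_password')
);

-- Foreign catalog mirroring one database behind that connection
CREATE FOREIGN CATALOG pg_sales USING CONNECTION pg_conn
OPTIONS (database 'sales');
```

After this, tables in the remote database are queryable as pg_sales.schema.table under Unity Catalog governance.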
Managed storage can be assigned at three levels, with the most specific level taking precedence:
1) Store at the metastore: a metastore-level managed container / bucket
2) Store at the catalog: e.g. Catalog2 with its own managed container / bucket
3) Store at the schema: e.g. Schema3 with its own managed container / bucket
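Levels 2 and 3 correspond to the MANAGED LOCATION clause; an illustrative sketch (the bucket paths are made up):

```sql
-- Catalog-level managed storage
CREATE CATALOG catalog2
MANAGED LOCATION 's3://bucket-catalog2/managed';

-- Schema-level managed storage overrides the catalog default
CREATE SCHEMA catalog2.schema3
MANAGED LOCATION 's3://bucket-schema3/managed';
```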
Access Control
[Diagram: a user holding SELECT only on a table; a WRITE OPERATION against the table, or against the table data via its path, is denied]
Writes to the table, or to table data via a path, do not use the WRITE permission from the External Location.
Discover your data with search and lineage

What tables are in the sales catalog?
SELECT table_name
FROM system.information_schema.tables
WHERE table_catalog = "sales"
  AND table_schema != "information_schema";

Who last updated the gold tables and when?
SELECT table_name, last_altered_by, last_altered
FROM system.information_schema.tables
WHERE table_schema = "churn_gold"
ORDER BY 1, 3 DESC;
Who accesses this table the most?
SELECT user_identity.email, count(*)
FROM system.operational_data.audit_logs
WHERE request_params.table_full_name = "main.uc_deep_dive.login_data_silver"
  AND service_name = "unityCatalog"
  AND action_name = "generateTemporaryTableCredential"
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;

What has this user accessed in the last 24 hours?
SELECT request_params.table_full_name
FROM system.operational_data.audit_logs
WHERE user_identity.email = "ifi.derekli@databricks.com"
  AND service_name = "unityCatalog"
  AND action_name = "generateTemporaryTableCredential"
  AND datediff(now(), created_at) < 1;
Who deleted this table?
SELECT user_identity.email
FROM system.operational_data.audit_logs
WHERE request_params.full_name_arg = "main.uc_deep_dive.login_data_silver"
  AND service_name = "unityCatalog"
  AND action_name = "deleteTable";

What tables does this user access most frequently?
SELECT request_params.table_full_name, count(*)
FROM system.operational_data.audit_logs
WHERE user_identity.email = "ifi.derekli@databricks.com"
  AND service_name = "unityCatalog"
  AND action_name = "generateTemporaryTableCredential"
GROUP BY 1 ORDER BY 2 DESC LIMIT 1;
What is the daily trend in DBU consumption?
SELECT date(created_on) as `Date`, sum(dbus) as `DBUs Consumed`
FROM system.operational_data.billing_logs
GROUP BY date(created_on)
ORDER BY date(created_on) ASC;

Which 10 users consumed the most DBUs?
SELECT tags.creator as `User`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
GROUP BY tags.creator
ORDER BY `DBUs` DESC
LIMIT 10;

How many DBUs of each SKU have been used so far this month?
SELECT sku as `SKU`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
WHERE month(created_on) = month(CURRENT_DATE)
GROUP BY sku
ORDER BY `DBUs` DESC;

Which Jobs consumed the most DBUs?
SELECT tags.JobId as `Job ID`, sum(dbus) as `DBUs`
FROM system.operational_data.billing_logs
GROUP BY `Job ID`;
What tables are sourced from this table?
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = "login_data_bronze";

What user queries read from this table?
SELECT DISTINCT entity_type, entity_id, source_table_full_name
FROM system.access.table_lineage
WHERE source_table_name = "login_data_silver";
Databricks Marketplace: a public marketplace and private exchanges for data products such as datasets, dashboards, and solution accelerators.

Data Clean Rooms:
[Diagram: Collaborator 1 (e.g. a publisher) brings a table of Hashed_user_id, age, income, ad_id, imp, clicks; Collaborator 2 (e.g. an advertiser) brings Hashed_user_id, conversion_event; the two are joined inside a Data Clean Room]
[Diagram: account setup per cloud. Azure: the Databricks account console is tied to an AAD tenant, with workspaces deployed into Azure subscriptions. AWS: the Databricks account console (created via the Marketplace or directly) is tied to AWS accounts. GCP: the Databricks account console is tied to GCP projects]
One Time Setup (PLANNING/SETUP)
● Create metastore
● Identity federation
● Join workspaces

Upgrade Legacy Metadata (DATA UPGRADE)
● SYNC external tables/schemas
● Migrate managed tables, files

Upgrade Workloads (WORKLOAD UPGRADE)
● Create clusters
● Create jobs
● Update notebooks
● Downstream tools
An example transition
Bring your readers
[Diagram: cloud storage (S3, ADLS, GCS) shared by the legacy Hive metastore and Unity Catalog during the transition]
• Run SYNC multiple times to pull changes from the Hive/Glue database into Unity Catalog over time
• Use a job for long-term synchronization
• Use the DRY RUN option to test the sync without making any changes to the target table
• Works on Hive managed tables where schema locations are defined
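The synchronization described above uses the SYNC command; a minimal sketch (the schema names are illustrative):

```sql
-- Preview what would be upgraded, without changing anything
SYNC SCHEMA main.sales_db FROM hive_metastore.sales_db DRY RUN;

-- Upgrade the schema's tables into Unity Catalog
SYNC SCHEMA main.sales_db FROM hive_metastore.sales_db;
```

Re-running the same statement picks up new or changed tables, which is why scheduling it as a job works well for long-term synchronization.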