
Contents

Azure Databricks Documentation


Overview
What is Azure Databricks?
What is Databricks Data Science & Engineering?
What is Databricks Machine Learning?
What is Databricks SQL?
Quickstarts
Create a workspace
Portal
Azure PowerShell
ARM template
Virtual network
Concepts
Lakehouse overview
Data objects in the Databricks Lakehouse
Azure Databricks concepts
Azure Databricks architecture
Databricks Data Science & Engineering
Data Science & Engineering documentation
Tutorials
Run jobs using a service principal
Query SQL Server running in Docker container
Access storage using Azure Key Vault
Use Cosmos DB service endpoint
Perform ETL operations
Stream data using Event Hubs
How-to guides
User guide
Databricks runtimes
Runtime overview
Databricks Runtime
Databricks Runtime for Machine Learning
Databricks Runtime for Genomics
Databricks Light
Photon-enabled runtimes
Workspace
Explore the Databricks workspace
Workspace assets
Work with workspace objects
Get workspace, cluster, notebook, and job identifiers
Per-workspace URLs
DataFrames and Datasets
DataFrames and Datasets overview
Introduction to Python DataFrames
Introduction to Scala DataFrames
Introduction to Datasets
Complex and nested data
Aggregators
Dates and timestamps
Structured Streaming
Structured Streaming overview
Structured Streaming in production
Production considerations for Structured Streaming
Recover from query failure
Monitoring Structured Streaming queries
Configure scheduler pools
Structured Streaming batch size
Structured Streaming triggers
Optimize performance of stateful streaming queries
Configure RocksDB state store
Enable asynchronous state checkpointing
Control late data threshold
Set initial state for mapGroupsWithState
Test state update function for mapGroupsWithState
Message queues and pub/sub systems
Working with pub/sub and message queues
Apache Kafka
Azure Event Hubs
Read and write streaming Avro data with DataFrames
Streaming examples
Write to arbitrary data sinks
Delta Lake tables
Optimized Azure Blob storage with Azure Queue storage
Clusters
Clusters overview
Create a cluster
Manage clusters
Configure clusters
Best practices for cluster configuration
Task preemption
Custom containers
Cluster init scripts
GPU-enabled clusters
Single node clusters
Web terminal
Debugging with the Apache Spark UI
Pools
Pools overview
Display pools
Create a pool
Configure a pool
Edit a pool
Delete a pool
Use a pool
Best practices for pools
Notebooks
Notebooks overview
Manage notebooks
Use notebooks
Visualizations
Visualize data
Migrate deprecated line charts
Visualization deep dive in Python
Visualization deep dive in Scala
HTML, D3, and SVG in notebooks
Bokeh in Python notebooks
Matplotlib
Plotly
htmlwidgets
ggplot2
Legacy visualizations
Dashboards
ipywidgets
Widgets
Notebook workflows
Package cells
IPython kernel
bamboolib
Best practices
Repos
Repos overview
Set up Git integration
Git providers
Version control with GitHub
Version control with Azure DevOps
Version control with Bitbucket Cloud
Version control with GitLab
Version control with AWS CodeCommit
Version control with GitHub AE
Work with notebooks and project files
Sync a remote Git repo
CI/CD workflows with Repos
Limitations and FAQ
Errors and troubleshooting
Libraries
Libraries overview
Workspace libraries
Cluster libraries
Notebook-scoped Python libraries
Notebook-scoped R libraries
Databricks File System (DBFS)
FileStore
Mounting object storage
Metastores
External Hive metastore
Migration
Migrate production workloads
Migrate single node
Migrate workloads to Delta Lake
Data ingestion
Ingest data into the Databricks Lakehouse
Create table in Databricks SQL
COPY INTO
COPY INTO overview
Use temporary credentials to load data with COPY INTO
Common data loading patterns with COPY INTO
COPY INTO notebook tutorial
COPY INTO Databricks SQL tutorial
Auto Loader
Auto Loader overview
DLT quick start
Structured Streaming quick start
Schema inference and evolution
Choosing between file notification and directory listing modes
Configuring for production
Options
Examples
Common data loading patterns
Ingest CSV
Ingest JSON
Ingest Parquet
Ingest Avro
Ingest images
Auto Loader Tutorial
FAQ
Data sources
SQL databases using JDBC
SQL databases using the Apache Spark connector
Azure Storage
Azure Data Lake Storage Gen2 and Blob Storage
Azure Data Lake Storage Gen2 frequently asked questions
Azure Data Lake Storage Gen1
Azure Blob Storage with WASB
Azure Cosmos DB
Azure Synapse Analytics
Binary file
Cassandra
Couchbase
ElasticSearch
Image
Hive tables
MLflow experiment
MongoDB
Neo4j
Avro files
CSV files
JSON files
LZO compressed files
Parquet files
Redis
Snowflake
ZIP files
Partner data integrations
Import data
File metadata column
Workflows
Workflows overview
Delta Live Tables
Overview
Quickstart
Concepts
User interface guide
Python language reference
SQL language reference
Data quality
Data sources
Publish data
Streaming data processing
Change data capture (CDC)
API
Settings
Event log
Workflow orchestration
Cookbook
Frequently asked questions
Upgrades
Jobs
Jobs overview
Jobs quickstart
Jobs user interface
Jobs API updates
Managing dependencies in data pipelines
Delta Lake and Delta Engine guide
Delta Lake overview
Introduction to Delta Lake
Get started with Delta Lake
Introductory notebooks
Ingest data into Delta Lake
Table batch reads and writes
Table streaming reads and writes
Table delete, update, and merge
Change data feed
Table utility commands
API reference
Concurrency control
Best practices
Frequently asked questions
Resources
Delta Engine overview
Optimize performance with file management
Auto optimize
Optimize performance with caching
Dynamic file pruning
Isolation levels
Bloom filter indexes
Optimize join performance
Optimize join performance overview
Range Join optimization
Skew Join optimization
Optimized data transformation
Optimized data transformation overview
Higher-order functions
Transform complex data types
Table versioning
Optimization examples
Genomics guide
Genomics overview
Tertiary analytics with Apache Spark
Tertiary analytics overview
Adam
Hail
Glow
Secondary analysis
Secondary analysis overview
DNASeq pipeline
RNASeq pipeline
Tumor/Normal pipeline
Variant annotation using Pipe Transformer
Variant annotation methods
SnpEff pipeline
Vep pipeline
Joint genotyping
Joint genotyping overview
Joint genotyping pipeline
Security guide
Security articles
Security overview
Security baseline
Access Azure Storage with Azure Active Directory
Access control
Access control overview
Workspace access control
Cluster access control
Pool access control
Jobs access control
Enable Delta Live Tables access control
Secret access control
Secrets
Keep data secure with secrets
Secret scopes
Secrets
Secret redaction
Secret workflow example
GDPR best practices
IP access list
Domain name firewall rules
Secure cluster connectivity
Encrypt traffic between cluster worker nodes
Customer-managed keys
Customer-managed keys overview
Enable customer-managed keys for managed services
Enable customer-managed keys for DBFS root
Customer-managed keys for DBFS overview
Enable using Azure portal
Enable using Azure CLI
Enable using PowerShell
Configure double encryption for DBFS root
Data Governance guide
Overview
Data governance best practices
Unity Catalog (Preview)
Overview
Get started using Unity Catalog
Key concepts
Data permissions
Access storage using managed identities
Create a metastore
Create compute resources
Create catalogs
Create schemas
Manage account-level identities
Create tables
Create views
Manage access to data
Manage external locations and storage credentials
Query data
Train a machine-learning model using data in Unity Catalog
Connect to BI tools
Audit access and activity for Unity Catalog resources
Upgrade workspace tables and views to Unity Catalog
Automate Unity Catalog setup using Terraform
Limitations
Table access control
Table access control overview
Enable table access control for a cluster
Set privileges on a data object
Authenticate to Azure Data Lake Storage using Azure Active Directory credentials
Data sharing
Overview
Delta Sharing (Preview)
Share data with Delta Sharing
Access data shared with you
Manage IP access lists
Developer tools and guidance
Overview
Use IDEs
dbx by Databricks Labs
Use an IDE
Databricks Connect
Use connector or driver
Databricks SQL Connector for Python
Databricks SQL Driver for Go
Databricks SQL Driver for Node.js
pyodbc
Use command line or notebook
Databricks CLI
Databricks SQL CLI
Databricks Utilities
Call Databricks REST APIs
REST API (latest)
Overview
Clusters API 2.0
Cluster Policies API 2.0
DBFS API 2.0
Databricks SQL Warehouses API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.1
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM APIs 2.0
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
API 1.2
Authentication
Examples
REST API 2.1
Overview
Authentication
Jobs API 2.1
REST API 2.0
Overview
Authentication
Clusters API 2.0
Cluster Policies API 2.0
Databricks SQL Warehouses API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
DBFS API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.0
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM APIs 2.0
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
REST API 1.2
Provision infrastructure
Databricks Terraform Provider
CI/CD
Azure DevOps
GitHub Actions
Service principals for CI/CD
Jenkins
Use SQL database tools
Databricks SQL CLI
DataGrip
DBeaver
Use service principals
Languages
Python
Python overview
Pandas API on Spark
Koalas
R
R overview
SparkR overview
SparkR ML tutorials
SparkR ML tutorials overview
Use glm
SparkR function reference
sparklyr
RStudio on Azure Databricks
Shiny on hosted RStudio Server
Shiny in an Azure Databricks notebook
renv on Azure Databricks
Scala
SQL
Spark SQL overview
Databricks Runtime 7.x SQL language manual
Spark SQL general reference
Databricks Runtime 6.x and 5.5 LTS SQL language manual
Alter Database
Alter Table or View
Alter Table Partition
Analyze Table
Cache
Cache Table
Clear Cache
Clone (Delta Lake on Azure Databricks)
Convert To Delta (Delta Lake on Azure Databricks)
Copy Into (Delta Lake on Azure Databricks)
Create Bloomfilter Index
Create Database
Create Function
Create Table
Create View
Delete From (Delta Lake on Azure Databricks)
Deny
Describe Database
Describe Function
Describe History (Delta Lake on Azure Databricks)
Describe Table
Drop Bloomfilter Index
Drop Database
Drop Function
Drop Table
Drop View
Explain
Fsck Repair Table (Delta Lake on Azure Databricks)
Functions
Grant
Insert
Load Data
Merge Into (Delta Lake on Azure Databricks)
Msck
Optimize (Delta Lake on Azure Databricks)
Refresh Table
Reset
Revoke
Select
Set
Show Columns
Show Create Table
Show Databases
Show Functions
Show Grant
Show Partitions
Show Table Properties
Show Tables
Truncate Table
Uncache Table
Update (Delta Lake on Azure Databricks)
Use Database
Vacuum
Spark SQL examples
Cost-based optimizer
Data skipping index
Transactional writes to cloud storage with DBIO
Handle bad records and files
Handle large queries in interactive workflows
Adaptive query execution
Query semi-structured data in SQL
Optimize conversion between Spark and pandas DataFrames
User defined aggregate functions - Scala
User-defined functions - Python
pandas user-defined functions
pandas function APIs
Apache Hive compatibility
Troubleshooting
Common questions
Troubleshooting overview
Administration
Who deleted a workspace in Azure
Who deleted a cluster in Azure
Business intelligence
Configure Simba JDBC driver using Azure AD
Configure Simba ODBC driver with a proxy in Windows
JDBC and ODBC connections
Power BI proxy and SSL configuration
Azure infrastructure
Unable to mount ADLS Gen1 account
ADLException - Error getting info for file
Assign a single public IP for VNet workspaces using Azure Firewall
Analyze user interface performance issues
Configure custom DNS settings using dnsmasq
Jobs are not progressing in the workspace
SAS requires current ABFS client
Clusters
Auto termination is disabled when starting a job cluster
Calculate number of cores in a cluster
Cluster Apache Spark configuration not applied
Cluster failed to launch
Cluster fails to start with dummy does not exist error
Cluster manager core instance request limit
Cannot apply updated cluster policy
Custom Docker image requires root
Custom garbage collection prevents cluster launch
Enable OpenJSSE and TLS 1.3
Enable retries in init script
Admin user cannot restart cluster
Configuration setting overwrites default settings
Enable GCM cipher suites
Failed to create cluster with invalid tag value
Install a private PyPI repo
IP access list update returns INVALID_STATE
Overwrite log4j configurations
Persist Apache Spark CSV metrics to a DBFS location
Replay Apache Spark events in a cluster
Set Apache Hadoop core-site.xml properties
Set executor log level
Unexpected cluster termination
UnknownHostException on cluster launch
Apache Spark executor memory allocation
Apache Spark UI show less than total node memory
CPU core limit prevents cluster creation
IP address limit prevents cluster creation
Slow cluster launch and missing nodes
Cluster slowdown due to Ganglia metrics filling root partition
Configure a cluster to use a custom NTP server
SSH to the cluster driver node
Validate environment variable behavior
Data management
Access files written by Apache Spark on ADLS Gen1
Append to a DataFrame
Spark 2.0.0 cluster slow to append data
Improve performance with bucketing
Simplify chained transformations
Dump tables in different formats
Get and set Spark configuration properties in a notebook
Hive UDFs
Prevent duplicated columns on DataFrame joins
List and delete files faster
Handle corrupted Parquet files with different schema
No USAGE permission on database
Nulls and empty strings in partitioned columns save as nulls
Revoke all user privileges
Behavior of randomSplit method
Generate a schema from Case class
Specify skew hints in Dataset and DataFrame join commands
Update nested columns
Incompatible schema in some files
Data sources
ABFS client hangs if incorrect client ID or wrong path used
Create tables on JSON datasets
Error reading data from ADLS Gen1 with Sparklyr
Blob Storage mount and access failures
Inconsistent timestamp results with JDBC applications
JDBC/ODBC access to ADLS Gen2
Kafka client terminated with OffsetOutOfRangeException
Long jobs fail when accessing ADLS
Unable to access ADLS Gen1 with firewall
CosmosDB connector library conflict
Failure to detect encoding in JSON
Unable to read files in WASB
Optimize read performance from JDBC data sources
ADLS and WASB writes throttled
Use cx_Oracle to connect to an Oracle server
Databricks File System (DBFS)
Can't read objects stored in DBFS root
How to specify the DBFS path
Operation not supported during append
Parallelize filesystem operations
Remount a storage account after rotating access keys
Databricks SQL
Null column values display as NaN
Retrieve queries owned by a disabled user
Delta Lake
A file referenced in the transaction log cannot be found
Compare two versions of a Delta table
Converting from Parquet to Delta Lake fails
FileReadException when reading a Delta table
Populate or update columns in Delta Lake table
Delete your streaming query checkpoint and restart
Delta Cache behavior on autoscaling cluster
Delta Merge cannot resolve nested field
Identify duplicate data on append operations
Improve MERGE INTO performance with partition pruning
Optimize a Delta sink in a structured streaming application
Write job fails
Drop a managed Delta Lake table
Unable to cast string to varchar
UPDATE query fails with IllegalStateException
Vacuuming with zero retention results in data loss
Z-Ordering will be ineffective, not collecting stats
Developer tools
Apache Spark session is null in DBConnect
Cannot access workspace notebook in Azure Data Factory
Common Azure Data Factory errors
Failed to create process error with Databricks CLI in Windows
GeoSpark undefined function error with DBConnect
Get Apache Spark config in DBConnect
Invalid Access Token with Airflow
ProtoSerializer stack overflow error in DBConnect
Use tcpdump to create pcap files
Job execution
Increase number of tasks per stage
Maximum execution context and notebook attachment limits
Serialized task is too large
Jobs
Active vs Dead jobs
ADLS CREATE limits
Driver is temporarily unavailable
Delete all jobs using the REST API
Identify less used jobs
Job cluster limits on notebook output
Job fails, but Apache Spark tasks finish
Job fails with invalid access token
Library not installed
Job rate limit
Jobs failing with shuffle fetch failures
Create table in overwrite mode fails when interrupted
Spark job hangs due to custom UDF
Spark job fails with `Failed to parse byte string`
Spark UI shows wrong number of jobs
Job fails with maxResultSize error
Job fails with atypical errors message
Streaming job has degraded performance
SQLAlchemy package error causes job failure
Ensure idempotency
SQLAlchemy package error
Task deserialization time is high
Monitor running jobs with a Job Run dashboard
Libraries
Can't uninstall library from UI
Cannot import module in egg library
Cannot import TabularPrediction from AutoGluon
Error installing pyodbc on a cluster
Error when installing Cartopy on a cluster
Init script fails to download Maven JAR
Install package using previous CRAN snapshot
Install PyGraphViz
Install TensorFlow 2.1 on Databricks Runtime 6.5 ML GPU clusters
Install Turbodbc via init script
Latest PyStan fails to install on Databricks Runtime 6.4
Libraries fail with dependency exception
Libraries failing due to transient Maven issue
Library unavailability job failure
Reading .xlsx files with xlrd fails
Replace a default library jar
Remove Log4j 1.x JMSAppender and SocketServer classes from classpath
TensorFlow fails to import
Update a Maven library
Python command fails with AssertionError wrong color format
PyPMML fails with Could not find py4j jar error
Verify the version of Log4j on your cluster
Machine learning
Conda fails to download packages from Anaconda
Download artifacts from MLflow
Error when importing OneHotEncoderEstimator
Extract feature information for tree-based SparkML pipeline models
SparkML model fit error
Group K-fold cross-validation
H2O.ai Sparkling Water cluster not reachable
Hyperopt fails with maxNumConcurrentTasks error
MLflow project fails to access an Apache Hive table
Speed up cross-validation
Incorrect results when using documents as inputs
Errors when accessing MLflow artifacts without using the MLflow client
Experiment warning when custom artifact storage location is used
Experiment warning when legacy artifact storage location is used
KNN model using pyfunc returns ModuleNotFoundError or FileNotFoundError
OSError when accessing MLflow experiment artifacts
PERMISSION_DENIED error when accessing MLflow experiment artifacts
Python commands fail on Machine Learning clusters
Runs are not nested when SparkTrials is enabled in Hyperopt
X13PATH environmental variable not found
Metastores
Autoscaling is slow with an external metastore
Create table DDLs to import into an external metastore
Data too long for column error
Drop database without deletion
Drop tables with corrupted metadata
Error in CREATE TABLE with external Hive metastore
Japanese character support in external metastore
Azure metastore drop table AnalysisException
Common Hive metastore issues
List table names
Parquet timestamp requires Hive metastore 1.2 or above
Set up embedded Hive metastore
Set up Hive metastore on SQL Server
Metrics
Explore Spark metrics with Spark listeners
Use Apache Spark metrics
Notebooks
Check if a Spark property can be modified
Common notebook errors
display() does not show microseconds correctly
Error - Received command c on object id p0
Failure when accessing or mounting storage
Item was too large to export
JSON reader parses values as null
Notebook autosave fails due to file size limits
Can't run notebook after cancelling streaming cell
Troubleshoot cancel command
Update job permissions for multiple users
Access notebooks owned by a deleted user
Security and permissions
Table creation failure with security exception
Troubleshoot key vault access issues
Python
AttributeError function object has no attribute
Convert Python datetime object to string
Create a cluster with Conda
Install and compile Cython
Read large DBFS-mounted files
Command fails after installing Bokeh
Commands fail on high concurrency clusters
Command cancelled due to library conflict
Command fails with AttributeError
Display file and directory timestamp details
Job remains idle before starting
List all workspace objects
Load special characters with Spark-XML
Python REPL fails to start in Docker
Run C++ code
Run SQL queries
Use the HDFS API to read files in Python
Python 2 sunset status
Import a custom CA certificate
R with Spark
Change version of R
Install rJava and RJDBC libraries
Resolve package or namespace load error
Persist and share code in RStudio
Fix the version of R packages
Fail to render R markdown file containing sparklyr
Parallelize R code with gapply
Parallelize R code with spark.lapply
RStudio server backend connection error
Verify R packages installed via init script
Spark
Apache Spark job fails with Parquet column cannot be converted error
Apache Spark read fails with Corrupted parquet page error
Apache Spark UI is not in sync with job
Best practice for cache(), count(), and take()
Cannot import timestamp_millis or unix_millis
Cannot modify the value of an Apache Spark config
Convert flattened DataFrame to nested JSON
Convert nested JSON to a flattened DataFrame
Create a DataFrame from a JSON string or Python dictionary
Decimal$DecimalIsFractional assertion error
from_json returns null in Apache Spark 3.0
Intermittent NullPointerException when AQE is enabled
Manage the size of Delta tables
Trouble reading JDBC tables after upgrading from Databricks Runtime 5.5
Run C++ code in Scala
Select files using a pattern match
Concurrent Spark JAR jobs fail
SQL
Broadcast join exceeds threshold, returns out of memory error
Cannot grow BufferHolder; exceeds size limitation
Disable broadcast when query plan has BroadcastNestedLoopJoin
Duplicate columns in the metadata error
Error when downloading full results after join
Error when running MSCK REPAIR TABLE in parallel
Find the size of a table
Generate unique increasing numeric values
Inner join drops records in result
JDBC write fails with a PrimaryKeyViolation error
Table or view not found
Date functions only accept int values in Apache Spark 3.0
Query does not skip header row on external table
SHOW DATABASES command returns unexpected column name
Streaming
Append output is not supported without a watermark
Apache Spark DStream is not supported
Checkpoint files not being deleted when using display()
Checkpoint files not being deleted when using foreachBatch()
Conflicting directory structures error
Get the path of files consumed by Auto Loader
Kafka error No resolvable bootstrap urls
readStream() is not whitelisted error when running a query
Recovery after checkpoint or output directory change
Restart a structured Streaming query from last written offset
RocksDB fails to acquire a lock
Stream XML files using an auto-loader
Streaming job gets stuck writing to checkpoint
Visualizations
Save Plotly files and display from DBFS
Reference
API Reference guides
Databricks REST API
Overview
Authentication
Azure Databricks personal access tokens
Azure Active Directory tokens
Azure AD token authentication overview
Get an Azure AD token using the Azure AD Authentication Library
Get an Azure AD token using a service principal
Troubleshoot Azure AD access tokens
Examples
REST API (latest)
Overview
Clusters API 2.0
Cluster Policies API 2.0
Databricks SQL Warehouses API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
DBFS API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.1
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM APIs 2.0
Overview
SCIM API 2.0 (Me)
SCIM API 2.0 (Users)
SCIM API 2.0 (Groups)
SCIM API 2.0 (ServicePrincipals)
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
REST API 2.1
Overview
Jobs API 2.1
REST API 2.0
Overview
Clusters API 2.0
Cluster Policies API 2.0
Databricks SQL Warehouses API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
DBFS API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.0
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM APIs 2.0
Overview
SCIM API 2.0 (Me)
SCIM API 2.0 (Users)
SCIM API 2.0 (Groups)
SCIM API 2.0 (ServicePrincipals)
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
REST API 1.2
Databricks Workspace Utilities
Databricks Workspace CLI
Databricks CLI overview
Clusters
Cluster policies
DBFS
Delta Live Tables
Groups
Instance pools
Jobs
Libraries
Repos
Runs
Secrets
Stack
Tokens
Unity Catalog
Workspace
MLflow API
Apache Spark API
Delta Lake API
Azure Databricks Workspace Management REST API
Resource Manager template
Azure CLI
Databricks Machine Learning
Machine Learning documentation
Tutorials
Machine learning tutorial
10-minute tutorials
How-to guides
User guides
Machine learning home page
Prepare data
Load data
Load data overview
Prepare data for distributed training
Load data using Petastorm
Save DataFrames to TFRecord files and load with TensorFlow
Preprocess data
Preprocess data overview
Feature engineering with scikit-learn
Feature engineering with Spark MLlib
Featurization for transfer learning
Environment setup
Databricks AutoML
Train models
Train models overview
Machine learning
scikit-learn
Spark MLlib
XGBoost
Deep learning
Best practices
TensorFlow Keras tutorial
TensorFlow
PyTorch
Distributed training
Deep learning pipelines
Hyperparameter tuning
Overview of hyperparameter tuning
Hyperparameter tuning with Hyperopt
Automated MLflow tracking
Track model development
Inference and deployment
Deploy models for inference
Deep learning model inference workflow
Deep learning model inference performance tuning
Productionize models
Manage models
Share models across workspaces
Export and import models
Reference solutions
Reference solutions overview
Reference solution for image applications
Reference solution for recommender system
MLflow guide
MLflow overview
Get started with MLflow
Get started with MLflow Java and Scala
Get started with MLflow Python
Get started with MLflow R
End-to-end example notebook
Track machine learning training runs
Train a scikit-learn model
Train PyTorch model
Train a PySpark model and save in MLeap formats
Tracking ML model training data with Delta Lake
Access tracking externally
Build dashboards with the MLflow Search API
Save, load, and deploy models
MLflow model example
scikit-learn model deployment on Azure
Reproduce runs with MLflow projects
Manage the Lifecycle of MLflow Models in MLflow Model Registry
MLflow Model Registry example
MLflow Model Registry webhooks
MLflow Model Serving
Databricks Autologging
Copy MLflow objects between workspaces
GraphFrames
GraphFrames overview
Graph Analysis tutorial with GraphFrames
GraphFrames - Python
GraphFrames - Scala
Machine learning and Unity Catalog
Feature guides
Databricks Feature Store
Feature Store overview
Feature Store concepts
Feature Store Python API
Feature Store workflow and example notebook
Work with feature tables
Train models using the Databricks Feature Store
Work with online stores
Work with time series feature tables
Share feature tables across workspaces
Feature Store UI
Control access to feature tables
Feature Store limitations and troubleshooting
Experiments
Models
Databricks SQL
Databricks SQL documentation
How-to guides
Get started
Get started with Databricks SQL
Learn about Databricks SQL by importing the sample dashboards
Complete the admin onboarding tasks
Get started as a Databricks SQL administrator
Get started as a Databricks SQL user
Databricks SQL concepts
User guide
Favorites and tags
Data explorer
Data explorer overview
Explore databases
Explore tables
Queries
Queries overview
Queries tasks
Queries filters
Queries parameters
Queries snippets
Schedule a query
Visualizations
Visualizations overview
Visualization tasks
Visualization types
Format numeric types
Tables
Charts
Histograms
Cohorts
Funnel visualization
Map visualization
Heatmap visualization
Boxplot visualization
Pivot tables
Dashboards
Dashboards
Alerts
Alerts
Administration guide
SQL warehouses
Query history
Query profile
Data access configuration
SQL configuration parameters
Manage users and groups
Alert destinations
Transfer ownership of Databricks SQL objects
General settings
Workspace colors
Security
Security overview
Data access overview
Configure access to cloud storage
Map Data Science & Engineering security models to Databricks SQL
Access control
Access control overview
Alert access control
Dashboard access control
Data access control
Query access control
SQL warehouse access control
Personal access tokens
Encrypt queries
Reference
SQL reference
SQL reference overview
Data types
Data type rules
Datetime patterns
Expression
JSON path expressions
Partitions
Principals
Privileges and securable objects
External locations and storage credentials
Delta Sharing
Reserved words
Built-in functions
Alphabetic list of built-in functions
Lambda functions
Window functions
Data types
Array type
Bigint type
Binary type
Boolean type
Date type
Decimal type
Double type
Float type
Int type
Interval type
Map type
Void type
Smallint type
String type
Struct type
Timestamp type
Tinyint type
Special floating point values
Functions
abs function
acos function
acosh function
add_months function
aes_decrypt function
aes_encrypt function
aggregate function
ampersand sign operator
and operator
any function
approx_count_distinct function
approx_percentile function
approx_top_k function
array function
array_agg function
array_contains function
array_distinct function
array_except function
array_intersect function
array_join function
array_max function
array_min function
array_position function
array_remove function
array_repeat function
array_size function
array_sort function
array_union function
arrays_overlap function
arrays_zip function
ascii function
asin function
asinh function
assert_true function
asterisksign operator
atan function
atan2 function
atanh function
avg function
bangeqsign operator
bangsign operator
base64 function
between operator
bigint function
bin function
binary function
bit_and function
bit_count function
bit_length function
bit_or function
bit_xor function
bool_and function
bool_or function
boolean function
bround function
cardinality function
caretsign operator
case function
cast function
cbrt function
ceil function
ceiling function
char function
char_length function
character_length function
chr function
coalesce function
collect_list function
collect_set function
coloncolonsign operator
concat function
concat_ws function
contains function
conv function
corr function
cos function
cosh function
cot function
count function
count_if function
count_min_sketch function
covar_pop function
covar_samp function
crc32 function
csc function
cube function
cume_dist function
current_catalog function
current_database function
current_date function
current_schema function
current_timestamp function
current_timezone function
current_user function
current_version function
date function
date_add function
date_format function
date_from_unix_date function
date_part function
date_sub function
date_trunc function
dateadd function
datediff function
datediff (timestamp) function
day function
dayofmonth function
dayofweek function
dayofyear function
decimal function
decode function
decode (character set) function
degrees function
dense_rank function
div operator
double function
e function
element_at function
elt function
encode function
endswith function
eqeqsign operator
eqsign operator
every function
exists function
exp function
explode function
explode_outer function
expm1 function
extract function
factorial function
filter function
find_in_set function
first function
first_value function
flatten function
float function
floor function
forall function
format_number function
format_string function
from_csv function
from_json function
from_unixtime function
from_utc_timestamp function
get_json_object function
greatest function
grouping function
grouping_id function
gteqsign operator
gtsign operator
hash function
hex function
hour function
hypot function
if function
ifnull function
in function
initcap function
inline function
inline_outer function
input_file_block_length function
input_file_block_start function
input_file_name function
instr function
int function
isdistinct operator
isfalse operator
isnan function
isnotnull function
isnull function
isnullop operator
istrue operator
java_method function
json_array_length function
json_object_keys function
json_tuple function
kurtosis function
lag function
last function
last_day function
last_value function
lcase function
lead function
least function
left function
length function
levenshtein function
like operator
ln function
locate function
log function
log10 function
log1p function
log2 function
lower function
lpad function
lteqgtsign operator
lteqsign operator
ltgtsign operator
ltrim function
ltsign operator
make_date function
make_interval function
make_timestamp function
map function
map_concat function
map_contains_key function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
max function
max_by function
md5 function
mean function
min function
min_by function
minussign operator
minussign unary operator
minute function
mod function
monotonically_increasing_id function
month function
months_between function
named_struct function
nanvl function
negative function
next_day function
not operator
now function
nth_value function
ntile function
nullif function
nvl function
nvl2 function
octet_length function
or operator
overlay function
parse_url function
percent_rank function
percentile function
percentile_approx function
percentile_cont function
percentile_disc function
percentsign operator
pi function
pipepipesign operator
pipesign operator
plussign operator
plussign unary operator
pmod function
posexplode function
posexplode_outer function
position function
positive function
pow function
power function
printf function
quarter function
radians function
raise_error function
rand function
randn function
random function
range function
rank function
reflect function
regexp operator
regexp_extract function
regexp_extract_all function
regexp_replace function
regr_avgx function
regr_avgy function
regr_count function
regr_r2 function
regr_sxx function
regr_sxy function
regr_syy function
repeat function
replace function
reverse function
right function
rint function
rlike operator
round function
row_number function
rpad function
rtrim function
schema_of_csv function
schema_of_json function
sec function
second function
sentences function
sequence function
sha function
sha1 function
sha2 function
shiftleft function
shiftright function
shiftrightunsigned function
shuffle function
sign function
signum function
sin function
sinh function
size function
skewness function
slashsign operator
slice function
smallint function
some function
sort_array function
soundex function
space function
spark_partition_id function
split function
split_part function
sqrt function
stack function
startswith function
std function
stddev function
stddev_pop function
stddev_samp function
str_to_map function
string function
struct function
substr function
substring function
substring_index function
sum function
tan function
tanh function
tildesign operator
timestamp function
timestamp_micros function
timestamp_millis function
timestamp_seconds function
timestampadd function
timestampdiff function
tinyint function
to_csv function
to_date function
to_json function
to_number function
to_timestamp function
to_unix_timestamp function
to_utc_timestamp function
transform function
transform_keys function
transform_values function
translate function
trim function
trunc function
try_add function
try_avg function
try_cast function
try_divide function
try_element_at function
try_multiply function
try_subtract function
try_sum function
try_to_number function
typeof function
ucase function
unbase64 function
unhex function
unix_date function
unix_micros function
unix_millis function
unix_seconds function
unix_timestamp function
upper function
uuid function
var_pop function
var_samp function
variance function
version function
weekday function
weekofyear function
width_bucket function
window function
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xxhash64 function
year function
zip_with function
Configuration parameters
ANSI_MODE
ENABLE_PHOTON
LEGACY_TIME_PARSER_POLICY
MAX_PARTITION_BYTES
READ_ONLY_EXTERNAL_METASTORE
TIMEZONE
USE_CACHE_RESULT
Identifiers
Names
Null semantics
Information schema
INFORMATION_SCHEMA
CATALOG_PRIVILEGES
CATALOGS
CHECK_CONSTRAINTS
COLUMNS
INFORMATION_SCHEMA_CATALOG_NAME
REFERENTIAL_CONSTRAINTS
SCHEMA_PRIVILEGES
SCHEMATA
TABLE_PRIVILEGES
TABLES
VIEWS
Syntax diagram
ALTER CATALOG
ALTER CREDENTIAL
ALTER DATABASE
ALTER LOCATION
ALTER SCHEMA
ALTER SHARE
ALTER TABLE
ALTER VIEW
CREATE CATALOG
CREATE DATABASE
CREATE FUNCTION
CREATE LOCATION
CREATE RECIPIENT
CREATE SCHEMA
CREATE SHARE
CREATE TABLE
CREATE TABLE USING
CREATE TABLE LIKE
CREATE VIEW
DROP CATALOG
DROP CREDENTIAL
DROP DATABASE
DROP FUNCTION
DROP LOCATION
DROP RECIPIENT
DROP SCHEMA
DROP SHARE
DROP TABLE
DROP VIEW
REPAIR TABLE
TRUNCATE TABLE
USE CATALOG
USE DATABASE
USE SCHEMA
Table properties and table options
INSERT INTO
EXPLAIN
CLUSTER BY clause
Common table expression
DISTRIBUTE BY clause
GROUP BY clause
HAVING clause
Hints
JOIN
LATERAL VIEW clause
LIMIT clause
ORDER BY clause
PIVOT clause
Query
Sampling queries
SELECT
Set operations
SORT BY clause
Table valued functions
WHERE clause
WINDOW clause
WINDOW frame clause
ANALYZE TABLE
DESCRIBE CATALOG
DESCRIBE CREDENTIAL
DESCRIBE DATABASE
DESCRIBE FUNCTION
DESCRIBE LOCATION
DESCRIBE QUERY
DESCRIBE RECIPIENT
DESCRIBE SCHEMA
DESCRIBE SHARE
DESCRIBE TABLE
LIST
SHOW ALL IN SHARE
SHOW CATALOGS
SHOW COLUMNS
SHOW CREATE TABLE
SHOW CREDENTIALS
SHOW DATABASES
SHOW FUNCTIONS
SHOW GROUPS
SHOW LOCATIONS
SHOW PARTITIONS
SHOW RECIPIENTS
SHOW SCHEMAS
SHOW SHARES
SHOW TABLE EXTENDED
SHOW TABLES
SHOW TBLPROPERTIES
SHOW USERS
SHOW VIEWS
RESET
SET
SET TIMEZONE
CACHE (Delta Lake on Azure Databricks)
CLONE (Delta Lake on Azure Databricks)
CONVERT TO DELTA (Delta Lake on Azure Databricks)
COPY INTO (Delta Lake on Azure Databricks)
CREATE BLOOMFILTER INDEX (Delta Lake on Azure Databricks)
DELETE FROM (Delta Lake on Azure Databricks)
DESCRIBE HISTORY (Delta Lake on Azure Databricks)
DROP BLOOMFILTER INDEX (Delta Lake on Azure Databricks)
FSCK (Delta Lake on Azure Databricks)
MERGE INTO (Delta Lake on Azure Databricks)
OPTIMIZE (Delta Lake on Azure Databricks)
REORG TABLE (Delta Lake on Azure Databricks)
RESTORE (Delta Lake on Azure Databricks)
UPDATE (Delta Lake on Azure Databricks)
VACUUM (Delta Lake on Azure Databricks)
ALTER GROUP
CREATE GROUP
DROP GROUP
DENY
GRANT
GRANT SHARE
REVOKE
REVOKE SHARE
SHOW GRANT
SHOW GRANT ON SHARE
SHOW GRANT TO RECIPIENT
MSCK
REST API
Overview
Authentication
SQL Warehouses API
Query History API
Queries and Dashboards API
Resources
Release notes
Databricks Integrations
Integrations documentation
Databricks partners
Overview
Anomalo
Arcion
dbt Cloud
dbt Core
Fivetran
Hex
Hightouch
InfoWorks
John Snow Labs
Labelbox
Lightup
Looker
Matillion
MicroStrategy
Mode
Power BI
Preset
Prophecy
Qlik Replicate
Qlik Sense
Rivery
SQL Workbench/J
StreamSets
Syncsort
Tableau
TIBCO Spotfire
JDBC and ODBC drivers and configuration
Databricks Partner Connect
Partner Connect overview
Connect to a data ingestion partner
Connect to a data preparation and transformation partner
Connect to a machine learning partner
Connect to a BI and visualization partner
Connect to a data quality partner
Connect to Fivetran walkthrough
Manage connections as an administrator
Troubleshoot connections
Partner Connect partner list
Administration guide
Administration overview
Admin console
Manage your Azure Databricks account
Account management overview
Manage your subscription
Diagnostic logging in Azure Databricks
Monitor usage using tags
Manage users and groups
Users and groups overview
Manage users
Manage service principals
Manage groups
Set up single sign-on
Provision users and groups using SCIM
Configure SCIM provisioning for Azure AD
Manage access control
Access control overview
Enable workspace access control
Enable cluster access control
Enable pool access control
Enable jobs access control
Enable table access control
Enable token-based authentication
Conditional access
Manage workspace objects and behavior
Manage workspaces overview
Manage workspace storage
Manage workspace security headers
Manage access to notebook features
Manage access to file upload interface
Manage access to DBFS browser
Increase the jobs limit in a workspace
Manage clusters
Manage cluster configuration options
Manage cluster policies
Best practices for cluster policies
Enable Container Services
Enable Databricks Runtime for Genomics
Enable web terminal
Manage virtual networks
Virtual networks overview
Peer virtual networks
Upgrade your preview workspace to GA
Deploy Azure Databricks in your VNet
Connect a workspace to an on-premises network
User-defined route settings
Disaster recovery
Resources
Best practices
Error messages
Common error conditions
DIVIDE_BY_ZERO error
UNSUPPORTED_DESERIALIZER error
UNSUPPORTED_FEATURE error
UNSUPPORTED_SAVE_MODE error
Apache Spark
Apache Spark overview
Get started with Apache Spark
DataFrames
Datasets
Machine learning
Structured streaming
What's next
Release notes
Platform
Platform release notes
July 2022
June 2022
May 2022
April 2022
March 2022
February 2022
January 2022
December 2021
November 2021
October 2021
September 2021
August 2021
July 2021
June 2021
May 2021
April 2021
March 2021
February 2021
January 2021
December 2020
November 2020
October 2020
September 2020
August 2020
July 2020
June 2020
May 2020
April 2020
March 2020
February 2020
January 2020
December 2019
November 2019
October 2019
September 2019
August 2019
July 2019
June 2019
May 2019
April 2019
March 2019
February 2019
January 2019
December 2018
November 2018
October 2018
September 2018
August 2018
July 2018
June 2018
May 2018
April 2018
March 2018
February 2018
January 2018
Databricks Runtime
Runtime releases
Runtime release notes
Databricks Runtime 11.1 (Beta)
Databricks Runtime 11.1 ML (Beta)
Databricks Runtime 11.0
Databricks Runtime 11.0 ML
Databricks Runtime 10.5
Databricks Runtime 10.5 ML
Databricks Runtime 10.4 LTS
Databricks Runtime 10.4 LTS ML
Databricks Runtime 10.3
Databricks Runtime 10.3 ML
Databricks Runtime 10.2
Databricks Runtime 10.2 ML
Databricks Runtime 10.1
Databricks Runtime 10.1 ML
Databricks Runtime 10.0
Databricks Runtime 10.0 ML
Databricks Runtime 9.1 LTS
Databricks Runtime 9.1 LTS ML
Databricks Runtime 9.0
Databricks Runtime 9.0 ML
Databricks Runtime 8.4
Databricks Runtime 8.4 ML
Databricks Runtime 8.3
Databricks Runtime 8.3 ML
Databricks Runtime 8.2
Databricks Runtime 8.2 ML
Databricks Runtime 8.1
Databricks Runtime 8.1 ML
Databricks Runtime 8.0
Databricks Runtime 8.0 ML
Databricks Runtime 7.6
Databricks Runtime 7.6 ML
Databricks Runtime 7.5
Databricks Runtime 7.5 ML
Databricks Runtime 7.5 Genomics
Databricks Runtime 7.4
Databricks Runtime 7.4 ML
Databricks Runtime 7.4 Genomics
Databricks Runtime 7.3 LTS
Databricks Runtime 7.3 LTS ML
Databricks Runtime 7.3 LTS Genomics
Databricks Runtime 7.2
Databricks Runtime 7.2 ML
Databricks Runtime 7.2 Genomics
Databricks Runtime 7.1
Databricks Runtime 7.1 ML
Databricks Runtime 7.1 Genomics
Databricks Runtime 7.0
Databricks Runtime 7.0 ML
Databricks Runtime 7.0 Genomics
Databricks Runtime 6.6
Databricks Runtime 6.6 ML
Databricks Runtime 6.6 Genomics
Databricks Runtime 6.5
Databricks Runtime 6.5 ML
Databricks Runtime 6.5 Genomics
Databricks Runtime 6.4 Extended Support
Databricks Runtime 6.4
Databricks Runtime 6.4 ML
Databricks Runtime 6.4 Genomics
Databricks Runtime 6.3
Databricks Runtime 6.3 ML
Databricks Runtime 6.3 Genomics
Databricks Runtime 6.2
Databricks Runtime 6.2 ML
Databricks Runtime 6.2 Genomics
Databricks Runtime 6.1
Databricks Runtime 6.1 ML
Databricks Runtime 6.0
Databricks Runtime 6.0 ML
Databricks Runtime 5.5 Extended Support
Databricks Runtime 5.5 ML Extended Support
Databricks Runtime 5.5
Databricks Runtime 5.5 ML
Databricks Light 2.4 Extended Support
Databricks Light 2.4
Databricks Runtime 5.4
Databricks Runtime 5.4 ML
Databricks Runtime 5.3
Databricks Runtime 5.3 ML
Databricks Runtime 5.2
Databricks Runtime 5.1
Databricks Runtime 5.0
Databricks Runtime 4.3
Databricks Runtime 4.2
Databricks Runtime 4.1
Databricks Runtime 4.0
Databricks Runtime 3.5 LTS
Databricks Runtime 3.4
Runtime migration guides
Databricks Runtime 7.x migration
Databricks Runtime 7.3 LTS migration
Databricks Runtime 9.1 LTS migration
Databricks Runtime 10.x migration
Runtime maintenance updates
Runtime support lifecycle
Databricks Connect
Release types
R developer's guide
Shared resources
Azure Databricks datasets
Training
Supported browsers
Azure Databricks status page
Limits
Azure Roadmap
Platform release process
Pricing
Ask a question - Microsoft Q&A question page
Stack Overflow
Region availability
Support options
What is Azure Databricks?

Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure
Databricks offers three environments for developing data-intensive applications: Databricks SQL, Databricks
Data Science & Engineering, and Databricks Machine Learning.
Databricks SQL provides an easy-to-use platform for analysts who want to run SQL queries on their data lake,
create multiple visualization types to explore query results from different perspectives, and build and share
dashboards.
Databricks Data Science & Engineering provides an interactive workspace that enables collaboration
between data engineers, data scientists, and machine learning engineers. For a big data pipeline, the data (raw
or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob
Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from
multiple data sources and turn it into breakthrough insights using Spark.
Databricks Machine Learning is an integrated end-to-end machine learning environment incorporating
managed services for experiment tracking, model training, feature development and management, and feature
and model serving.
To select an environment, launch an Azure Databricks workspace and use the persona switcher in the sidebar:

Next steps
Learn more about Databricks Data Science & Engineering
Learn more about Databricks Machine Learning
Learn more about Databricks SQL
What is Databricks Data Science & Engineering?

Databricks Data Science & Engineering (sometimes called simply "Workspace") is an analytics platform based
on Apache Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an
interactive workspace that enables collaboration between data engineers, data scientists, and machine learning
engineers.

For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in
batches, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics
workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data
Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into breakthrough insights using
Spark.

Apache Spark analytics platform


Databricks Data Science & Engineering comprises the complete open-source Apache Spark cluster technologies
and capabilities. Spark in Databricks Data Science & Engineering includes the following components:
Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with HDFS, Flume, and Kafka.
MLlib: Machine Learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data exploration.
Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
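As an illustration of the Spark SQL and DataFrames component above, the following minimal Python sketch creates a DataFrame and queries it with Spark SQL. It assumes it runs in an Azure Databricks notebook, where the spark SparkSession is predefined; the sample data and view name are hypothetical.

```python
# Runs in an Azure Databricks notebook, where `spark` (a SparkSession) is predefined.
data = [("Alice", 34), ("Bob", 28), ("Carol", 41)]
df = spark.createDataFrame(data, ["name", "age"])

# A DataFrame can also be queried through Spark SQL by registering a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```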

Apache Spark in Azure Databricks


Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that
includes:
Fully managed Spark clusters
An interactive workspace for exploration and visualization
A platform for powering your favorite Spark applications
Fully managed Apache Spark clusters in the cloud
Azure Databricks has a secure and reliable production environment in the cloud, managed and supported by
Spark experts. You can:
Create clusters in seconds.
Dynamically autoscale clusters up and down and share them across teams.
Use clusters programmatically by invoking REST APIs.
Use secure data integration capabilities built on top of Spark that enable you to unify your data without
centralization.
Get instant access to the latest Apache Spark features with each release.
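For example, clusters can be managed programmatically through the Databricks REST API. The sketch below calls the Clusters API 2.0 to list the clusters in a workspace; the workspace URL and personal access token are hypothetical placeholders, and the Python requests library is assumed to be available.

```python
import requests

# Hypothetical workspace URL and personal access token -- replace with your own.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# Clusters API 2.0: list the clusters defined in the workspace.
response = requests.get(
    f"{HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()

for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```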
Databricks Runtime
Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud.
Azure Databricks completely abstracts out the infrastructure complexity and the need for specialized expertise to
set up and configure your data infrastructure.
For data engineers who care about the performance of production jobs, Azure Databricks provides a Spark engine that is faster and more performant through various optimizations at the I/O layer and processing layer (Databricks I/O).
Workspace for collaboration
Through a collaborative and integrated environment, Databricks Data Science & Engineering streamlines the
process of exploring data, prototyping, and running data-driven applications in Spark.
Determine how to use data with easy data exploration.
Document your progress in notebooks in R, Python, Scala, or SQL.
Visualize data in a few clicks, and use familiar tools like Matplotlib, ggplot, or d3.
Use interactive dashboards to create dynamic reports.
Use Spark and interact with the data simultaneously.

Enterprise security
Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-
based controls, and SLAs that protect your data and your business.
Integration with Azure Active Directory enables you to run complete Azure-based solutions using Azure
Databricks.
Azure Databricks role-based access enables fine-grained user permissions for notebooks, clusters, jobs, and
data.
Enterprise-grade SLAs.

IMPORTANT
Azure Databricks is a Microsoft Azure first-party service that is deployed on the Global Azure Public Cloud infrastructure.
All communications between components of the service, including between the public IPs in the control plane and the
customer data plane, remain within the Microsoft Azure network backbone. See also Microsoft global network.

Integration with Azure services


Databricks Data Science & Engineering integrates deeply with Azure databases and stores: Synapse Analytics,
Cosmos DB, Data Lake Store, and Blob storage.

Integration with Power BI


Through rich integration with Power BI, Databricks Data Science & Engineering allows you to discover and share
your impactful insights quickly and easily. You can use other BI tools as well, such as Tableau Software.

Next steps
Quickstart: Create an Azure Databricks workspace and run a Spark job
Work with Spark clusters
Work with notebooks
Create Spark jobs
What is Databricks Machine Learning?

Databricks Machine Learning is an integrated end-to-end machine learning platform incorporating managed
services for experiment tracking, model training, feature development and management, and feature and model
serving. The diagram shows how the capabilities of Databricks map to the steps of the model development and
deployment process.

With Databricks Machine Learning, you can:


Train models either manually or with AutoML.
Track training parameters and models using experiments with MLflow tracking.
Create feature tables and access them for model training and inference.
Share, manage, and serve models using Model Registry.
For machine learning applications, Databricks provides Databricks Runtime for Machine Learning, a variation of
Databricks Runtime that includes many popular machine learning libraries.

Databricks Machine Learning features


Feature store
Feature Store enables you to catalog ML features and make them available for training and serving, increasing
reuse. With a data-lineage–based feature search that leverages automatically-logged data sources, you can
make features available for training and serving with simplified model deployment that doesn’t require changes
to the client application.
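As a minimal sketch, assuming the Databricks Feature Store Python client is available in a Databricks Runtime ML notebook (where spark is predefined), a feature table can be created from a Spark DataFrame roughly as follows; the database, table, and column names are hypothetical.

```python
from databricks.feature_store import FeatureStoreClient

# Hypothetical feature DataFrame with one row per customer.
features_df = spark.createDataFrame(
    [(1, 120.5, 3), (2, 86.0, 1)],
    ["customer_id", "avg_spend", "orders_last_30d"],
)

fs = FeatureStoreClient()

# Register the DataFrame as a feature table keyed by customer_id.
fs.create_table(
    name="recommender.customer_features",  # hypothetical database.table name
    primary_keys=["customer_id"],
    df=features_df,
    description="Aggregated customer features",
)
```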
Experiments
MLflow experiments let you visualize, search for, and compare runs, as well as download run artifacts and
metadata for analysis in other tools. The Experiments page gives you quick access to MLflow experiments across
your organization. You can track machine learning model development by logging to these experiments from
Azure Databricks notebooks and jobs.
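For example, logging to an experiment from a notebook requires only the MLflow tracking API; in an Azure Databricks notebook, runs are recorded against the notebook's experiment by default. The parameter, metric, and tag names below are hypothetical.

```python
import mlflow

# Start a run and log a parameter, a metric, and a tag; the run then appears
# in the notebook's MLflow experiment on the Experiments page.
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.91)
    mlflow.set_tag("stage", "prototype")
```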
Models
Azure Databricks provides a hosted version of MLflow Model Registry to help you to manage the full lifecycle of
MLflow Models. Model Registry provides chronological model lineage (which MLflow experiment and run
produced the model at a given time), model versioning, stage transitions (for example, from staging to
production or archived), and email notifications of model events. You can also create and view model
descriptions and leave comments.
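As a brief illustration, a model logged by an MLflow run can be registered in the Model Registry through the MLflow API; the run ID placeholder and model name below are hypothetical.

```python
import mlflow

# Register the model artifact logged under a run; this creates the registered
# model if it does not exist and adds a new version otherwise.
result = mlflow.register_model("runs:/<run-id>/model", "churn-classifier")
print(result.name, result.version)
```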
AutoML
AutoML enables you to automatically generate machine learning models from data and accelerate the path to
production. It prepares the dataset for model training and then performs and records a set of trials, creating,
tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source
code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary
statistics on your dataset and saves this information in a notebook that you can review later.
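As a minimal sketch, assuming a Databricks Runtime ML cluster and an existing Spark DataFrame with a column to predict, an AutoML classification experiment can be started from Python roughly as follows; the DataFrame name, target column, and timeout are hypothetical.

```python
from databricks import automl

# `train_df` is assumed to be an existing Spark DataFrame containing the
# hypothetical `label` column to predict.
summary = automl.classify(train_df, target_col="label", timeout_minutes=30)

# The returned summary links back to the best trial and its generated notebook.
print(summary.best_trial.model_path)
```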
Databricks Runtime for Machine Learning
Databricks Runtime for Machine Learning (Databricks Runtime ML) automates the creation of a cluster
optimized for machine learning. Databricks Runtime ML clusters include the most popular machine learning
libraries, such as TensorFlow, PyTorch, Keras, and XGBoost, and also include libraries required for distributed
training such as Horovod. Using Databricks Runtime ML speeds up cluster creation and ensures that the
installed library versions are compatible.

Next steps
Run the machine learning quickstart tutorial
Run the 10-minute tutorials using your favorite ML libraries
Learn more about Databricks Machine Learning
What is Databricks SQL?

Databricks SQL allows you to run quick ad-hoc SQL queries on your data lake. Queries support multiple
visualization types to help you explore your query results from different perspectives.

NOTE
Databricks SQL is not supported in Azure China regions.

Fully managed SQL warehouses in the cloud


SQL queries run on fully managed SQL warehouses sized according to query latency and number of concurrent
users. To help you get started quickly, every workspace comes pre-configured with a small starter SQL
warehouse. Learn more.
Dashboards for sharing insights
Dashboards let you combine visualizations and text to share insights drawn from your queries. Learn more.
Alerts help you monitor and integrate
Alerts notify you when a field returned by a query meets a threshold. Use alerts to monitor your business or
integrate them with tools to start workflows such as user onboarding or support tickets. Learn more.

Enterprise security
Databricks SQL provides enterprise-grade Azure security, including Azure Active Directory integration, role-
based controls, and SLAs that protect your data and your business.
Integration with Azure Active Directory enables you to run complete Azure-based solutions using Databricks
SQL.
Role-based access enables fine-grained user permissions for alerts, dashboards, SQL warehouses, queries,
and data.
Enterprise-grade SLAs.
For details, see Data access overview.

IMPORTANT
Azure Databricks is a Microsoft Azure first-party service that is deployed on the Global Azure Public Cloud infrastructure.
All communications between components of the service, including between the public IPs in the control plane and the
customer data plane, remain within the Microsoft Azure network backbone. See also Microsoft global network.

Integration with Azure services


Databricks SQL integrates with Azure databases and stores: Synapse Analytics, Cosmos DB, Data Lake Store, and
Blob storage.

Integration with Power BI


Through rich integration with Power BI, Databricks SQL allows you to discover and share your impactful insights
quickly and easily. You can use other BI tools as well, such as Tableau Software.

Next steps
Quickstart: Learn about Databricks SQL by importing the sample dashboards
Quickstart: Complete the admin onboarding tasks
Quickstart: Enable users and create a SQL warehouse
Quickstart: Run a query and create a dashboard
Work with queries
Create dashboards
Quickstart: Run a Spark job on Azure Databricks
Workspace using the Azure portal
7/21/2022 • 7 minutes to read

In this quickstart, you use the Azure portal to create an Azure Databricks workspace with an Apache Spark
cluster. You run a job on the cluster and use custom charts to produce real-time reports from Seattle safety data.

Prerequisites
Portal
Azure CLI

Azure subscription - create one for free. This tutorial cannot be carried out using an Azure Free Trial
Subscription. If you have a free account, go to your profile and change your subscription to pay-as-you-go.
For more information, see Azure free account. Then, remove the spending limit, and request a quota
increase for vCPUs in your region. When you create your Azure Databricks workspace, you can select the
Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure
Databricks DBUs for 14 days.
Sign in to the Azure portal.

NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.

Create an Azure Databricks workspace


In this section, you create an Azure Databricks workspace using the Azure portal or the Azure CLI.

Portal
Azure CLI

1. In the Azure portal, select Create a resource > Analytics > Azure Databricks .
2. Under Azure Databricks Service , provide the values to create a Databricks workspace.
Provide the following values:

Workspace name: Provide a name for your Databricks workspace.

Subscription: From the drop-down, select your Azure subscription.

Resource group: Specify whether you want to create a new resource group or use an existing one. A resource
group is a container that holds related resources for an Azure solution. For more information, see Azure
Resource Group overview.

Location: Select West US 2. For other available regions, see Azure services available by region.

Pricing Tier: Choose between Standard, Premium, or Trial. For more information on these tiers, see the
Databricks pricing page.

3. Select Review + Create , and then Create . The workspace creation takes a few minutes. During
workspace creation, you can view the deployment status in Notifications . Once this process is finished,
your user account is automatically added as an admin user in the workspace.
When a workspace deployment fails, the workspace is still created in a failed state. Delete the failed
workspace and create a new workspace that resolves the deployment errors. When you delete the failed
workspace, the managed resource group and any successfully deployed resources are also deleted.

Create a Spark cluster in Databricks


NOTE
To use a free account to create the Azure Databricks cluster, before creating the cluster, go to your profile and change
your subscription to pay-as-you-go . For more information, see Azure free account.

1. In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace .
2. You are redirected to the Azure Databricks portal. From the portal, click New Cluster .

3. In the New cluster page, provide the values to create a cluster.


Accept all other default values other than the following:
Enter a name for the cluster.
For this article, create a cluster with the 10.4 LTS runtime.
Make sure you select the Terminate after __ minutes of inactivity checkbox. Provide a
duration (in minutes) after which the cluster is terminated if it is not being used.
Select Create cluster . Once the cluster is running, you can attach notebooks to the cluster and
run Spark jobs.
For more information on creating clusters, see Create a Spark cluster in Azure Databricks.

Run a Spark SQL job


Perform the following tasks to create a notebook in Databricks, configure the notebook to read data from
Azure Open Datasets, and then run a Spark SQL job on the data.
1. In the left pane, select Azure Databricks . From the Common Tasks , select New Notebook .
2. In the Create Notebook dialog box, enter a name, select Python as the language, and select the Spark
cluster that you created earlier.

Select Create .
3. In this step, create a Spark DataFrame with Seattle Safety Data from Azure Open Datasets, and use SQL to
query the data.
The following command sets the Azure storage access information. Paste this PySpark code into the first
cell and use Shift+Enter to run the code.
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Seattle"
blob_sas_token = r"?st=2019-02-26T02%3A34%3A32Z&se=2119-02-27T02%3A34%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=XlJVWA7fMXCSxCKqJm8psMOh0W4h7cSYO28coRqF2fs%3D"

The following command allows Spark to read from Blob storage remotely. Paste this PySpark code into
the next cell and use Shift+Enter to run the code.

wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)

The following command creates a DataFrame. Paste this PySpark code into the next cell and use
Shift+Enter to run the code.

df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')

4. Run a SQL statement to return the top 10 rows of data from the temporary view called source . Paste this
PySpark code into the next cell and use Shift+Enter to run the code.

print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))

5. You see a tabular output like the one shown in the following screenshot (only some columns are shown):

6. You now create a visual representation of this data to show how many safety events are reported using
the Citizens Connect App and City Worker App instead of other sources. From the bottom of the tabular
output, select the Bar chart icon, and then click Plot Options .

7. In Customize Plot , drag-and-drop values as shown in the screenshot.


Set Keys to source .
Set Values to <\id> .
Set Aggregation to COUNT .
Set Display type to Pie chart .
Click Apply .
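If you prefer to compute the same breakdown in code instead of through Plot Options, a roughly equivalent aggregation over the source temporary view might look like the following sketch; the source column name is assumed from the dataset schema:

# Count safety events by reporting source, similar to the chart configured above.
display(spark.sql("""
    SELECT source, COUNT(*) AS event_count
    FROM source
    GROUP BY source
    ORDER BY event_count DESC
"""))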

Clean up resources
After you have finished the article, you can terminate the cluster. To do so, from the Azure Databricks workspace,
select Clusters in the left pane. For the cluster you want to terminate, move the cursor over the ellipsis under the
Actions column, and select the Terminate icon.

If you do not manually terminate the cluster, it stops automatically, provided you selected the Terminate after
__ minutes of inactivity checkbox when you created the cluster and the cluster has been inactive for the
specified time.

Next steps
In this article, you created a Spark cluster in Azure Databricks and ran a Spark job using data from Azure Open
Datasets. You can also look at Spark data sources to learn how to import data from other data sources into
Azure Databricks. Advance to the next article to learn how to perform an ETL operation (extract, transform, and
load data) using Azure Databricks.
Extract, transform, and load data using Azure Databricks
Quickstart: Create an Azure Databricks workspace
using PowerShell
7/21/2022 • 3 minutes to read

This quickstart describes how to use PowerShell to create an Azure Databricks workspace. You can use
PowerShell to create and manage Azure resources interactively or in scripts.

Prerequisites
If you don't have an Azure subscription, create a free account before you begin.
If you choose to use PowerShell locally, this article requires that you install the Az PowerShell module and
connect to your Azure account using the Connect-AzAccount cmdlet. For more information about installing the
Az PowerShell module, see Install Azure PowerShell.

IMPORTANT
While the Az.Databricks PowerShell module is in preview, you must install it separately from the Az PowerShell module
using the following command: Install-Module -Name Az.Databricks -AllowPrerelease . Once the Az.Databricks
PowerShell module is generally available, it becomes part of future Az PowerShell module releases and is
available natively from within Azure Cloud Shell.

NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.

If this is your first time using Azure Databricks, you must register the Microsoft.Databricks resource provider.

Register-AzResourceProvider -ProviderNamespace Microsoft.Databricks

Use Azure Cloud Shell


Azure hosts Azure Cloud Shell, an interactive shell environment that you can use through your browser. You can
use either Bash or PowerShell with Cloud Shell to work with Azure services. You can use the Cloud Shell
preinstalled commands to run the code in this article without having to install anything on your local
environment.
To start Azure Cloud Shell:

Select Try It in the upper-right corner of a code block. Selecting Try It doesn't automatically copy the code to
Cloud Shell.
Go to https://shell.azure.com, or select the Launch Cloud Shell button to open Cloud Shell in your browser.
Select the Cloud Shell button on the menu bar at the upper right in the Azure portal.

To run the code in this article in Azure Cloud Shell:


1. Start Cloud Shell.
2. Select the Copy button on a code block to copy the code.
3. Paste the code into the Cloud Shell session by selecting Ctrl+Shift+V on Windows and Linux or by
selecting Cmd+Shift+V on macOS.
4. Select Enter to run the code.
If you have multiple Azure subscriptions, choose the appropriate subscription in which the resources should be
billed. Select a specific subscription ID using the Set-AzContext cmdlet.

Set-AzContext -SubscriptionId 00000000-0000-0000-0000-000000000000

Create a resource group


Create an Azure resource group using the New-AzResourceGroup cmdlet. A resource group is a logical
container in which Azure resources are deployed and managed as a group.
The following example creates a resource group named myresourcegroup in the West US 2 region.

New-AzResourceGroup -Name myresourcegroup -Location westus2

Create an Azure Databricks workspace


In this section, you create an Azure Databricks workspace using PowerShell.

New-AzDatabricksWorkspace -Name mydatabricksws -ResourceGroupName myresourcegroup -Location westus2 -ManagedResourceGroupName databricks-group -Sku standard

Provide the following values:

Name: Provide a name for your Databricks workspace.

ResourceGroupName: Specify an existing resource group name.

Location: Select West US 2. For other available regions, see Azure services available by region.

ManagedResourceGroupName: Specify whether you want to create a new managed resource group or use an
existing one.

Sku: Choose between Standard, Premium, or Trial. For more information on these tiers, see the Databricks
pricing page.

The workspace creation takes a few minutes. Once this process is finished, your user account is automatically
added as an admin user in the workspace.
When a workspace deployment fails, the workspace is still created in a failed state. Delete the failed workspace
and create a new workspace that resolves the deployment errors. When you delete the failed workspace, the
managed resource group and any successfully deployed resources are also deleted.

Determine the provisioning state of a Databricks workspace


To determine if a Databricks workspace was provisioned successfully, you can use the
Get-AzDatabricksWorkspace cmdlet.

Get-AzDatabricksWorkspace -Name mydatabricksws -ResourceGroupName myresourcegroup |
    Select-Object -Property Name, SkuName, Location, ProvisioningState

Name           SkuName  Location ProvisioningState
----           -------  -------- -----------------
mydatabricksws standard westus2  Succeeded

Clean up resources
If the resources created in this quickstart aren't needed for another quickstart or tutorial, you can delete them by
running the following example.
Caution

The following example deletes the specified resource group and all resources contained within it. If resources
outside the scope of this quickstart exist in the specified resource group, they will also be deleted.

Remove-AzResourceGroup -Name myresourcegroup

To delete only the Azure Databricks workspace created in this quickstart without deleting the resource group,
use the Remove-AzDatabricksWorkspace cmdlet.

Remove-AzDatabricksWorkspace -Name mydatabricksws -ResourceGroupName myresourcegroup

Next steps
Create a Spark cluster in Databricks
Quickstart: Create an Azure Databricks workspace
by using an ARM template
7/21/2022 • 3 minutes to read

In this quickstart, you use an Azure Resource Manager template (ARM template) to create an Azure Databricks
workspace. Once the workspace is created, you validate the deployment.
An ARM template is a JavaScript Object Notation (JSON) file that defines the infrastructure and configuration for
your project. The template uses declarative syntax, which lets you state what you intend to deploy without
having to write the sequence of programming commands to create it.
If your environment meets the prerequisites and you're familiar with using ARM templates, select the Deploy to
Azure button. The template will open in the Azure portal.

Prerequisites
To complete this article, you need to:
Have an Azure subscription - create one for free

NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.

Review the template


The template used in this quickstart is from Azure Quickstart Templates.
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"metadata": {
"_generator": {
"name": "bicep",
"version": "0.5.6.12127",
"templateHash": "14509124136721506545"
}
},
"parameters": {
"disablePublicIp": {
"type": "bool",
"defaultValue": false,
"metadata": {
"description": "Specifies whether to deploy Azure Databricks workspace with Secure Cluster
Connectivity (No Public IP) enabled or not"
}
},
"workspaceName": {
"type": "string",
"metadata": {
"description": "The name of the Azure Databricks workspace to create."
}
},
"pricingTier": {
"type": "string",
"defaultValue": "premium",
"allowedValues": [
"standard",
"premium"
],
"metadata": {
"description": "The pricing tier of workspace."
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location for all resources."
}
}
},
"variables": {
"managedResourceGroupName": "[format('databricks-rg-{0}-{1}', parameters('workspaceName'),
uniqueString(parameters('workspaceName'), resourceGroup().id))]"
},
"resources": [
{
"type": "Microsoft.Databricks/workspaces",
"apiVersion": "2018-04-01",
"name": "[parameters('workspaceName')]",
"location": "[parameters('location')]",
"sku": {

The Azure resource defined in the template is Microsoft.Databricks/workspaces: create an Azure Databricks
workspace.

Deploy the template


In this section, you create an Azure Databricks workspace using an ARM template.
1. Select the following image to sign in to Azure and open a template. The template creates an Azure
Databricks workspace.
2. Provide the required values to create your Azure Databricks workspace

Provide the following values:

Subscription: From the drop-down, select your Azure subscription.

Resource group: Specify whether you want to create a new resource group or use an existing one. A resource
group is a container that holds related resources for an Azure solution. For more information, see Azure
Resource Group overview.

Location: Select East US 2. For other available regions, see Azure services available by region.

Workspace name: Provide a name for your Databricks workspace.

Pricing Tier: Choose between Standard or Premium. For more information on these tiers, see the Databricks
pricing page.

3. Select Review + Create , then Create .


4. The workspace creation takes a few minutes. When a workspace deployment fails, the workspace is still
created in a failed state. Delete the failed workspace and create a new workspace that resolves the
deployment errors. When you delete the failed workspace, the managed resource group and any
successfully deployed resources are also deleted.

Review deployed resources


You can either use the Azure portal to check the Azure Databricks workspace or use the following Azure CLI or
Azure PowerShell script to list the resource.
Azure CLI

echo "Enter your Azure Databricks workspace name:" &&


read databricksWorkspaceName &&
echo "Enter the resource group where the Azure Databricks workspace exists:" &&
read resourcegroupName &&
az databricks workspace show -g $resourcegroupName -n $databricksWorkspaceName

Azure PowerShell

$resourceGroupName = Read-Host -Prompt "Enter the resource group name where your Azure Databricks workspace exists"
(Get-AzResource -ResourceType "Microsoft.Databricks/workspaces" -ResourceGroupName $resourceGroupName).Name
Write-Host "Press [ENTER] to continue..."

Clean up resources
If you plan to continue on to subsequent tutorials, you may wish to leave these resources in place. When no
longer needed, delete the resource group, which deletes the Azure Databricks workspace and the related
managed resources. To delete the resource group by using Azure CLI or Azure PowerShell:
Azure CLI

echo "Enter the Resource Group name:" &&


read resourceGroupName &&
az group delete --name $resourceGroupName &&
echo "Press [ENTER] to continue ..."

Azure PowerShell

$resourceGroupName = Read-Host -Prompt "Enter the Resource Group name"
Remove-AzResourceGroup -Name $resourceGroupName
Write-Host "Press [ENTER] to continue..."

Limitations
The configuration of the storage account deployed in the resource group cannot be modified. To use locally
redundant storage (LRS) instead of geo-redundant storage (GRS), create a new storage account and mount it in
the existing workspace.

Next steps
In this quickstart, you created an Azure Databricks workspace by using an ARM template and validated the
deployment. Advance to the next article to learn how to perform an ETL operation (extract, transform, and load
data) using Azure Databricks.
Extract, transform, and load data using Azure Databricks
Quickstart: Create an Azure Databricks workspace
in your own Virtual Network
7/21/2022 • 5 minutes to read

The default deployment of Azure Databricks creates a new virtual network that is managed by Databricks. This
quickstart shows how to create an Azure Databricks workspace in your own virtual network instead. You also
create an Apache Spark cluster within that workspace.
For more information about why you might choose to create an Azure Databricks workspace in your own virtual
network, see Deploy Azure Databricks in your Azure Virtual Network (VNet Injection).
If you don't have an Azure subscription, create a free account. This tutorial cannot be carried out using Azure
Free Trial Subscription . If you have a free account, go to your profile and change your subscription to pay-
as-you-go . For more information, see Azure free account. Then, remove the spending limit, and request a quota
increase for vCPUs in your region. When you create your Azure Databricks workspace, you can select the Trial
(Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure Databricks
DBUs for 14 days.

Sign in to the Azure portal


Sign in to the Azure portal.

NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.

Create a virtual network


1. From the Azure portal menu, select Create a resource . Then select Networking > Virtual network .
2. Under Create virtual network , apply the following settings:

Subscription (<Your subscription>): Select the Azure subscription that you want to use.

Resource group (databricks-quickstart): Select Create New and enter a new resource group name for your
account.

Name (databricks-quickstart): Select a name for your virtual network.

Region (<Select the region that is closest to your users>): Select a geographic location where you can host your
virtual network. Use the location that's closest to your users.

3. Select Next: IP Addresses > and apply the following settings. Then select Review + create .

IPv4 address space (10.2.0.0/16): The virtual network's address range in CIDR notation. The CIDR range must
be between /16 and /24.

Subnet name (default): Select a name for the default subnet in your virtual network.

Subnet Address range (10.2.0.0/24): The subnet's address range in CIDR notation. It must be contained by the
address space of the virtual network. The address range of a subnet which is in use can't be edited.

4. On the Review + create tab, select Create to deploy the virtual network. Once the deployment is
complete, navigate to your virtual network and select Address space under Settings . In the box that
says Add additional address range, insert 10.179.0.0/16 and select Save .
Create an Azure Databricks workspace
1. From the Azure portal menu, select Create a resource . Then select Analytics > Databricks .

2. Under Azure Databricks Service , apply the following settings:


Workspace name (databricks-quickstart): Select a name for your Azure Databricks workspace.

Subscription (<Your subscription>): Select the Azure subscription that you want to use.

Resource group (databricks-quickstart): Select the same resource group you used for the virtual network.

Location (<Select the region that is closest to your users>): Choose the same location as your virtual network.

Pricing Tier (Choose between Standard or Premium): For more information on pricing tiers, see the Databricks
pricing page.

3. Once you've finished entering settings on the Basics page, select Next: Networking > and apply the
following settings:

Deploy Azure Databricks workspace in your Virtual Network (VNet) (Yes): This setting allows you to deploy an
Azure Databricks workspace in your virtual network.

Virtual Network (databricks-quickstart): Select the virtual network you created in the previous section.

Public Subnet Name (public-subnet): Use the default public subnet name.

Public Subnet CIDR Range (10.179.64.0/18): Use a CIDR range up to and including /26.

Private Subnet Name (private-subnet): Use the default private subnet name.

Private Subnet CIDR Range (10.179.0.0/18): Use a CIDR range up to and including /26.

4. Once the deployment is complete, navigate to the Azure Databricks resource. Notice that virtual network
peering is disabled. Also notice the resource group and managed resource group in the overview page.
The managed resource group is not modifiable, and it is not used to create virtual machines. You can only
create virtual machines in the resource group you manage.

When a workspace deployment fails, the workspace is still created in a failed state. Delete the failed
workspace and create a new workspace that resolves the deployment errors. When you delete the failed
workspace, the managed resource group and any successfully deployed resources are also deleted.

Create a cluster
NOTE
To use a free account to create the Azure Databricks cluster, before creating the cluster, go to your profile and change
your subscription to pay-as-you-go . For more information, see Azure free account.

1. Return to your Azure Databricks service and select Launch Workspace on the Overview page.
2. Select Clusters > + Create Cluster . Then create a cluster name, like databricks-quickstart-cluster, and
accept the remaining default settings. Select Create Cluster .

3. Once the cluster is running, return to the managed resource group in the Azure portal. Notice the new
virtual machines, disks, IP Address, and network interfaces. A network interface is created in each of the
public and private subnets with IP addresses.

4. Return to your Azure Databricks workspace and select the cluster you created. Then navigate to the
Executors tab on the Spark UI page. Notice that the addresses for the driver and the executors are in
the private subnet range. In this example, the driver is 10.179.0.6 and executors are 10.179.0.4 and
10.179.0.5. Your IP addresses could be different.

Clean up resources
After you have finished the article, you can terminate the cluster. To do so, from the Azure Databricks workspace,
select Clusters in the left pane. For the cluster you want to terminate, move the cursor over the ellipsis under the
Actions column, and select the Terminate icon. This stops the cluster.
If you do not manually terminate the cluster, it stops automatically, provided you selected the Terminate after
__ minutes of inactivity checkbox when you created the cluster and the cluster has been inactive for the
specified time.
If you do not wish to reuse the cluster, you can delete the resource group you created in the Azure portal.

Next steps
In this article, you created a Spark cluster in Azure Databricks that you deployed to a virtual network. Advance to
the next article to learn how to query a SQL Server Linux Docker container in the virtual network using JDBC
from an Azure Databricks notebook.
Query a SQL Server Linux Docker container in a virtual network from an Azure Databricks notebook
What is the Databricks Lakehouse?
7/21/2022 • 2 minutes to read

The Databricks Lakehouse combines the ACID transactions and data governance of data warehouses with the
flexibility and cost-efficiency of data lakes to enable business intelligence (BI) and machine learning (ML) on all
data. The Databricks Lakehouse keeps your data in your massively scalable cloud object storage in open source
data standards, allowing you to use your data however and wherever you want.

Components of the Databricks Lakehouse


The primary components of the Databricks Lakehouse are:
Delta tables :
ACID transactions
Data versioning
ETL
Indexing
Unity Catalog :
Data governance
Data sharing
Data auditing
By storing data with Delta Lake, you enable downstream data scientists, analysts, and machine learning
engineers to leverage the same production data supporting your core ETL workloads as soon as data is
processed.
Unity Catalog ensures that you have complete control over who gains access to which data and provides a
centralized mechanism for managing all data governance and access controls without needing to replicate your
data.

Delta tables
Tables created on Azure Databricks use the Delta Lake protocol by default. When you create a new Delta table:
Metadata used to reference the table is added to the metastore in the declared schema or database.
Data and table metadata are saved to a directory in cloud object storage.
The metastore reference to a Delta table is technically optional; you can create Delta tables by directly interacting
with directory paths using Spark APIs. Some new features that build upon Delta Lake will store additional
metadata in the table directory, but all Delta tables have:
A directory containing table data in the Parquet file format.
A sub-directory /_delta_log that contains metadata about table versions in JSON and Parquet format.
Learn more about Data objects in the Databricks Lakehouse.
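As a small illustration, the following PySpark sketch creates a Delta table in the default database and lists its directory, which contains Parquet data files and the _delta_log sub-directory; the table name is hypothetical:

# Write a tiny DataFrame as a Delta table registered in the metastore.
df = spark.range(5).withColumnRenamed("id", "value")
df.write.format("delta").mode("overwrite").saveAsTable("numbers_delta")

# Inspect the table directory: Parquet data files plus _delta_log metadata.
location = spark.sql("DESCRIBE DETAIL numbers_delta").first()["location"]
display(dbutils.fs.ls(location))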

Data lakehouse vs. data warehouse vs. data lake


Data warehouses have powered business intelligence (BI) decisions for about 30 years, having evolved as a set of
design guidelines for systems controlling the flow of data. Data warehouses optimize queries for BI reports, but
can take minutes or even hours to generate results. Designed for data that is unlikely to change with high
frequency, data warehouses seek to prevent conflicts between concurrently running queries. Many data
warehouses rely on proprietary formats, which often limit support for machine learning.
Powered by technological advances in data storage and driven by exponential increases in the types and volume
of data, data lakes have come into widespread use over the last decade. Data lakes store and process data
cheaply and efficiently. Data lakes are often defined in opposition to data warehouses: A data warehouse
delivers clean, structured data for BI analytics, while a data lake permanently and cheaply stores data of any
nature in any format. Many organizations use data lakes for data science and machine learning, but not for BI
reporting, due to the unvalidated nature of the data.
The data lakehouse replaces the current dependency on data lakes and data warehouses for modern data
companies that desire:
Open, direct access to data stored in standard data formats.
Indexing protocols optimized for machine learning and data science.
Low query latency and high reliability for BI and advanced analytics.
By combining an optimized metadata layer with validated data stored in standard formats in cloud object
storage, the data lakehouse allows data scientists and ML engineers to build models from the same data driving
BI reports.
Data objects in the Databricks Lakehouse
7/21/2022 • 7 minutes to read

The Databricks Lakehouse organizes data stored with Delta Lake in cloud object storage with familiar relations
like databases, tables, and views. This model combines many of the benefits of a data warehouse with the
scalability and flexibility of a data lake. Learn more about how this model works, and the relationship between
object data and metadata so that you can apply best practices when designing and implementing Databricks
Lakehouse for your organization.

What data objects are in the Databricks Lakehouse?


The Databricks Lakehouse architecture combines data stored with the Delta Lake protocol in cloud object
storage with metadata registered to a metastore. There are five primary objects in the Databricks Lakehouse:
Catalog : a grouping of databases.
Database or schema: a grouping of objects in a catalog. Databases contain tables, views, and functions.
Table : a collection of rows and columns stored as data files in object storage.
View : a saved query typically against one or more tables or data sources.
Function : saved logic that returns a scalar value or set of rows.

For information on securing objects with Unity Catalog, see securable objects model.

What is a metastore?
The metastore contains all of the metadata that defines data objects in the lakehouse. Azure Databricks provides
the following metastore options:
Unity Catalog : you can create a metastore to store and share metadata across multiple Azure Databricks
workspaces. Unity Catalog is managed at the account level.
Hive metastore : Azure Databricks stores all the metadata for the built-in Hive metastore as a managed
service. An instance of the metastore deploys to each cluster and securely accesses metadata from a central
repository for each customer workspace.
External metastore : you can also bring your own metastore to Azure Databricks.
Regardless of the metastore used, Azure Databricks stores all data associated with tables in object storage
configured by the customer in their cloud account.

What is a catalog?
A catalog is the highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model. Every
database will be associated with a catalog. Catalogs exist as objects within a metastore.
Before the introduction of Unity Catalog, Azure Databricks used a two-tier namespace. Catalogs are the third tier
in the Unity Catalog namespacing model:

catalog_name.database_name.table_name

The built-in Hive metastore only supports a single catalog, hive_metastore .

What is a database?
A database is a collection of data objects, such as tables or views (also called “relations”), and functions. In Azure
Databricks, the terms “schema” and “database” are used interchangeably (whereas in many relational systems, a
database is a collection of schemas).
Databases will always be associated with a location on cloud object storage. You can optionally specify a
LOCATION when registering a database, keeping in mind that:

The LOCATION associated with a database is always considered a managed location.


Creating a database does not create any files in the target location.
The LOCATION of a database will determine the default location for data of all tables registered to that
database.
Successfully dropping a database will recursively drop all data and files stored in a managed location.
This interaction between locations managed by database and data files is very important. To avoid accidentally
deleting data:
Do not share database locations across multiple database definitions.
Do not register a database to a location that already contains data.
To manage data life cycle independently of database, save data to a location that is not nested under any
database locations.
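For example, a hedged sketch of registering a database with an explicit managed location; the storage path is a hypothetical placeholder and should point to an empty directory that is not shared with other databases:

# Create a database whose managed LOCATION is a dedicated, empty directory.
# Tables registered to this database without their own LOCATION store data
# under this path, and dropping the database drops those files recursively.
spark.sql("""
    CREATE DATABASE IF NOT EXISTS sales_db
    LOCATION 'abfss://data@<storage-account>.dfs.core.windows.net/databases/sales_db'
""")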

What is a table?
An Azure Databricks table is a collection of structured data. A Delta table stores data as a directory of files on
cloud object storage and registers table metadata to the metastore within a catalog and schema. As Delta Lake is
the default storage provider for tables created in Azure Databricks, all tables created in Databricks are Delta
tables, by default. Because Delta tables store data in cloud object storage and provide references to data through
a metastore, users across an organization can access data using their preferred APIs; on Databricks, this includes
SQL, Python, PySpark, Scala, and R.
Note that it is possible to create tables on Databricks that are not Delta tables. These tables are not backed by
Delta Lake, and will not provide the ACID transactions and optimized performance of Delta tables. Tables falling
into this category include tables registered against data in external systems and tables registered against other
file formats in the data lake.
There are two kinds of tables in Databricks, managed and unmanaged (or external) tables.
NOTE
The Delta Live Tables distinction between live tables and streaming live tables is not enforced from the table perspective.

What is a managed table?


Azure Databricks manages both the metadata and the data for a managed table; when you drop a table, you also
delete the underlying data. Data analysts and other users that mostly work in SQL may prefer this behavior.
Managed tables are the default when creating a table. The data for a managed table resides in the LOCATION of
the database it is registered to. This managed relationship between the data location and the database means
that in order to move a managed table to a new database, you must rewrite all data to the new location.
There are a number of ways to create managed tables, including:

CREATE TABLE table_name AS SELECT * FROM another_table

CREATE TABLE table_name (field_name1 INT, field_name2 STRING)

df.write.saveAsTable("table_name")

What is an unmanaged table?


Azure Databricks only manages the metadata for unmanaged (external) tables; when you drop a table, you do
not affect the underlying data. Unmanaged tables will always specify a LOCATION during table creation; you can
either register an existing directory of data files as a table or provide a path when a table is first defined. Because
data and metadata are managed independently, you can rename a table or register it to a new database without
needing to move any data. Data engineers often prefer unmanaged tables and the flexibility they provide for
production data.
There are a number of ways to create unmanaged tables, including:

CREATE TABLE table_name


USING DELTA
LOCATION '/path/to/existing/data'

CREATE TABLE table_name


(field_name1 INT, field_name2 STRING)
LOCATION '/path/to/empty/directory'

df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")

What is a view?
A view stores the text for a query typically against one or more data sources or tables in the metastore. In
Databricks, a view is equivalent to a Spark DataFrame persisted as an object in a database. Unlike DataFrames,
you can query views from any part of the Databricks product, assuming you have permission to do so. Creating
a view does not process or write any data; only the query text is registered to the metastore in the associated
database.

What is a temporary view?


A temporary view has a limited scope and persistence and is not registered to a schema or catalog. The lifetime
of a temporary view differs based on the environment you’re using:
In notebooks and jobs, temporary views are scoped to the notebook or script level. They cannot be
referenced outside of the notebook in which they are declared, and will no longer exist when the notebook
detaches from the cluster.
In Databricks SQL, temporary views are scoped to the query level. Multiple statements within the same
query can use the temp view, but it cannot be referenced in other queries, even within the same dashboard.
Global temporary views are scoped to the cluster level and can be shared between notebooks or jobs that
share computing resources. Databricks recommends using views with appropriate table ACLs instead of
global temporary views.
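A short PySpark sketch of the difference between the two scopes, using a hypothetical DataFrame:

df = spark.range(10)

# Notebook-scoped temporary view: visible only to this notebook session.
df.createOrReplaceTempView("recent_ids")

# Global temporary view: shared across notebooks attached to the same cluster,
# and queried through the global_temp schema.
df.createOrReplaceGlobalTempView("recent_ids_global")
display(spark.sql("SELECT * FROM global_temp.recent_ids_global"))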

What is a function?
Functions allow you to associate user-defined logic with a database. Functions can return either scalar values or
sets of rows. You can use functions to provide managed access to custom logic across a variety of contexts on
the Databricks product.

How do relational objects work in Delta Live Tables?


Delta Live Tables uses declarative syntax to define and manage DDL, DML, and infrastructure deployment. Delta
Live Tables uses the concept of a “virtual schema” during logic planning and execution. Delta Live Tables can
interact with other databases in your Databricks environment, and Delta Live Tables can publish and persist
tables for querying elsewhere by specifying a target database in the pipeline configuration settings. All tables
created in Delta Live Tables are Delta tables, and can be declared as either managed or unmanaged tables.
While views can be declared in Delta Live Tables, these should be thought of as temporary views scoped to the
pipeline. Temporary tables in Delta Live Tables are a unique concept: these tables persist data to storage but do
not publish data to the target database.
Some operations, such as APPLY CHANGES INTO , will register both a table and view to the database; the table
name will begin with an underscore ( _ ) and the view will have the table name declared as the target of the
APPLY CHANGES INTO operation. The view queries the corresponding hidden table to materialize the results.
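As an illustration only, a minimal Delta Live Tables Python sketch that declares one view and one table; the source path is a hypothetical placeholder, and the target database that the table publishes to is set in the pipeline configuration rather than in this code:

import dlt
from pyspark.sql import functions as F

@dlt.view
def raw_orders():
    # Hypothetical cloud storage path containing JSON order files.
    return spark.read.format("json").load("/mnt/raw/orders")

@dlt.table(comment="Orders cleaned for downstream consumption.")
def clean_orders():
    return dlt.read("raw_orders").where(F.col("order_id").isNotNull())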
Azure Databricks concepts
7/21/2022 • 8 minutes to read

This article introduces the set of fundamental concepts you need to understand in order to use Azure Databricks
effectively.
Some concepts are general to Azure Databricks, and others are specific to the persona-based Azure Databricks
environment you are using:
Databricks Data Science & Engineering
Databricks Machine Learning
Databricks SQL

General concepts
This section describes concepts and terms that apply across all Azure Databricks persona-based environments.
Workspaces
In Azure Databricks, workspace has two meanings:
1. An Azure Databricks deployment in the cloud that functions as the unified environment that your team
uses for accessing all of their Databricks assets. Your organization can choose to have multiple
workspaces or just one: it depends on your needs.
2. The UI for the Databricks Data Science & Engineering and Databricks Machine Learning persona-based
environments. This is as opposed to the Databricks SQL environment.
When we talk about the “workspace browser,” for example, we are talking about the UI that lets you
browse notebooks, libraries, and other files in the Data Science & Engineering and Databricks Machine
Learning environments—a UI that isn’t part of the Databricks SQL environment. But Data Science &
Engineering, Databricks Machine Learning, and Databricks SQL are all included in your deployed Azure
Databricks workspace.
Billing
DBU
Azure Databricks bills based on Databricks units (DBUs), units of processing capability per hour based on VM
instance type.
See the Azure Databricks pricing page.
Authentication and authorization
This section describes concepts that you need to know when you manage Azure Databricks users and their
access to Azure Databricks assets.
User
A unique individual who has access to the system.
Group
A collection of users.
Access control list (ACL)
A list of permissions attached to the workspace, cluster, job, table, or experiment. An ACL specifies which users
or system processes are granted access to the objects, as well as what operations are allowed on the assets. Each
entry in a typical ACL specifies a subject and an operation.

Databricks Data Science & Engineering


Databricks Data Science & Engineering is the classic Azure Databricks environment for collaboration among
data scientists, data engineers, and data analysts. This section describes the fundamental concepts you need to
understand in order to work effectively in the Databricks Data Science & Engineering environment.
Workspace
A workspace is an environment for accessing all of your Azure Databricks assets. A workspace organizes objects
(notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and
computational resources.
This section describes the objects contained in the Azure Databricks workspace folders.
Notebook
A web-based interface to documents that contain runnable commands, visualizations, and narrative text.
Dashboard
An interface that provides organized access to visualizations.
Library
A package of code available to the notebook or job running on your cluster. Databricks runtimes include many
libraries and you can add your own.
Repo
A folder whose contents are co-versioned together by syncing them to a remote Git repository.
Experiment
A collection of MLflow runs for training a machine learning model.
Data Science & Engineering interface
This section describes the interfaces that Azure Databricks supports for accessing your assets: UI, API, and
command-line (CLI).
UI
The Azure Databricks UI provides an easy-to-use graphical interface to workspace folders and their contained
objects, data objects, and computational resources.
REST API
There are three versions of the REST API: 2.1, 2.0, and 1.2. The REST APIs 2.1 and 2.0 support most of the
functionality of the REST API 1.2, as well as additional functionality, and are preferred.
CLI
An open source project hosted on GitHub. The CLI is built on top of the REST API (latest).
Data management in Data Science & Engineering
This section describes the objects that hold the data on which you perform analytics and feed into machine
learning algorithms.
Databricks File System (DBFS)
A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files,
libraries, and images), and other directories. DBFS is automatically populated with some datasets that you can
use to learn Azure Databricks.
Database
A collection of information that is organized so that it can be easily accessed, managed, and updated.
Table
A representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs.
Metastore
The component that stores all the structure information of the various tables and partitions in the data
warehouse including column and column type information, the serializers and deserializers necessary to read
and write data, and the corresponding files where the data is stored. Every Azure Databricks deployment has a
central Hive metastore accessible by all clusters to persist table metadata. You also have the option to use an
existing external Hive metastore.
Computation management in Data Science & Engineering
This section describes concepts that you need to know to run computations in Databricks Data Science &
Engineering.
Cluster
A set of computation resources and configurations on which you run notebooks and jobs. There are two types of
clusters: all-purpose and job.
You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an
all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and
terminates the cluster when the job is complete. You cannot restart a job cluster.
Pool
A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a
cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to
accommodate the cluster’s request, the pool expands by allocating new instances from the instance provider.
When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a
different cluster.
Databricks runtime
The set of core components that run on the clusters managed by Azure Databricks. Azure Databricks offers
several types of runtimes:
Databricks Runtime includes Apache Spark but also adds a number of components and updates that
substantially improve the usability, performance, and security of big data analytics.
Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go
environment for machine learning and data science. It contains multiple popular libraries, including
TensorFlow, Keras, PyTorch, and XGBoost.
Databricks Runtime for Genomics is a version of Databricks Runtime optimized for working with genomic
and biomedical data.
Databricks Light is the Azure Databricks packaging of the open source Apache Spark runtime. It provides a
runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits
provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR,
Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or
notebook job workloads.
Workflows
Frameworks to develop and run data processing pipelines:
Workflows with jobs: A non-interactive mechanism for running a notebook or library either immediately or
on a scheduled basis.
Delta Live Tables: A framework for building reliable, maintainable, and testable data processing pipelines.
Workload
Azure Databricks identifies two types of workloads subject to different pricing schemes: data engineering (job)
and data analytics (all-purpose).
Data engineering An (automated) workload runs on a job cluster which the Azure Databricks job scheduler
creates for each workload.
Data analytics An (interactive) workload runs on an all-purpose cluster. Interactive workloads typically run
commands within an Azure Databricks notebook. However, running a job on an existing all-purpose cluster is
also treated as an interactive workload.
Execution context
The state for a REPL environment for each supported programming language. The languages supported are
Python, R, Scala, and SQL.

Databricks Machine Learning


The Databricks Machine Learning environment starts with the features provided in the Data Science &
Engineering workspace and adds functionality. Important concepts include:
Experiments
The main unit of organization for tracking machine learning model development. Experiments organize, display,
and control access to individual logged runs of model training code.
Feature Store
A centralized repository of features. Databricks Feature Store enables feature sharing and discovery across your
organization and also ensures that the same feature computation code is used for model training and inference.
Models
A trained machine learning or deep learning model that has been registered in Model Registry.

Databricks SQL
Databricks SQL is geared toward data analysts who work primarily with SQL queries and BI tools. It provides an
intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake. Its
UI is quite different from that of the Data Science & Engineering and Databricks Machine Learning
environments. This section describes the fundamental concepts you need to understand in order to use
Databricks SQL effectively.
Databricks SQL interface
This section describes the interfaces that Azure Databricks supports for accessing your Databricks SQL assets: UI
and API.
UI : A graphical interface to dashboards and queries, SQL warehouses, query history, and alerts.
REST API An interface that allows you to automate tasks on Databricks SQL objects.
Data management in Databricks SQL
Visualization : A graphical presentation of the result of running a query.
Dashboard : A presentation of query visualizations and commentary.
Alert : A notification that a field returned by a query has reached a threshold.
Computation management in Databricks SQL
This section describes concepts that you need to know to run SQL queries in Databricks SQL.
Query : A valid SQL statement.
SQL warehouse : A computation resource on which you execute SQL queries.
Query history : A list of executed queries and their performance characteristics.
Authentication and authorization in Databricks SQL
This section describes concepts that you need to know when you manage Databricks SQL users and groups and
their access to assets.
User and group : A user is a unique individual who has access to the system. A group is a collection of users.
Personal access token : An opaque string used to authenticate to the REST API and by tools in the Databricks
integrations to connect to SQL warehouses.
Access control list : A set of permissions attached to a principal that requires access to an object. An ACL entry
specifies the object and the actions allowed on the object. Each entry in an ACL specifies a principal, action type,
and object.
Azure Databricks architecture overview
7/21/2022 • 2 minutes to read

The Databricks Unified Data Analytics Platform, from the original creators of Apache Spark, enables data teams
to collaborate in order to solve some of the world’s toughest problems.

High-level architecture
Azure Databricks is structured to enable secure cross-functional team collaboration while keeping a significant
amount of backend services managed by Azure Databricks so you can stay focused on your data science, data
analytics, and data engineering tasks.
Azure Databricks operates out of a control plane and a data plane.
The control plane includes the backend services that Azure Databricks manages in its own Azure account.
Notebook commands and many other workspace configurations are stored in the control plane and
encrypted at rest.
The data plane is managed by your Azure account and is where your data resides. This is also where data is
processed. You can use Azure Databricks connectors so that your clusters can connect to external data
sources outside of your Azure account to ingest data or for storage. You can also ingest data from external
streaming data sources, such as events data, streaming data, IoT data, and more.
Although architectures can vary depending on custom configurations (such as when you've deployed an Azure
Databricks workspace to your own virtual network, also known as VNet injection), the following architecture
diagram represents the most common structure and flow of data for Azure Databricks.
For more architecture information, see Manage virtual networks.
Your data is stored at rest in your Azure account in the data plane and in your own data sources, not the control
plane, so you maintain control and ownership of your data.
Job results reside in storage in your account.
Interactive notebook results are stored in a combination of the control plane (partial results for presentation in
the UI) and your Azure storage. If you want interactive notebook results stored only in your cloud account
storage, you can ask your Databricks representative to enable interactive notebook results in the customer
account for your workspace. Note that some metadata about results, such as chart column names, continues to
be stored in the control plane. This feature is in Public Preview.
Tutorial: Run a job with an Azure service principal
7/21/2022 • 8 minutes to read

Jobs provide a non-interactive way to run applications in an Azure Databricks cluster, for example, an ETL job or
data analysis task that should run on a scheduled basis. Typically these jobs run as the user that created them,
but this can have some limitations:
Creating and running jobs is dependent on the user having appropriate permissions.
Only the user that created the job has access to the job.
The user might be removed from the Azure Databricks workspace.
Using a service account—an account associated with an application rather than a specific user—is a common
method to address these limitations. In Azure, you can use an Azure Active Directory (Azure AD) application and
service principal to create a service account.
An example of where this is important is when service principals control access to data stored in an Azure Data
Lake Storage Gen2 account. Running jobs with those service principals allows the jobs to access data in the
storage account and provides control over data access scope.
This tutorial describes how to create an Azure AD application and service principal and make that service
principal the owner of a job. You’ll also learn how to give job run permissions to other groups that don’t own the
job. The following is a high-level overview of the tasks this tutorial walks through:
1. Create a service principal in Azure Active Directory.
2. Create a personal access token (PAT) in Azure Databricks. You’ll use the PAT to authenticate to the Databricks
REST API.
3. Add the service principal as a non-administrative user to Azure Databricks using the Databricks SCIM API.
4. Create an Azure Key Vault-backed secret scope in Azure Databricks.
5. Grant the service principal read access to the secret scope.
6. Create a job in Azure Databricks and configure the job cluster to read secrets from the secret scope.
7. Transfer ownership of the job to the service principal.
8. Test the job by running it as the service principal.
If you don’t have an Azure subscription, create a free account before you begin.

Requirements
You’ll need the following for this tutorial:
A user account with the permissions required to register an application in your Azure AD tenant.
Administrative privileges in the Azure Databricks workspace where you’ll run jobs.
A tool for making API requests to Azure Databricks. This tutorial uses cURL, but you can use any tool that
allows you to submit REST API requests.

Create a service principal in Azure Active Directory


A service principal is the identity of an Azure AD application. To create the service principal that will be used to
run jobs:
1. In the Azure portal, select Azure Active Directory > App Registrations > New Registration. Enter a
name for the application and click Register.
2. Go to Certificates & secrets, click New client secret, and generate a new client secret. Copy and save the
secret in a secure place.
3. Go to Overview and note the Application (client) ID and Directory (tenant) ID.

Create the Azure Databricks personal access token


You’ll use an Azure Databricks personal access token (PAT) to authenticate against the Databricks REST API. To
create a PAT that can be used to make API requests:
1. Go to your Azure Databricks workspace.
2. Click the user icon in the top-right corner of the screen and click User Settings .
3. Click Access Tokens > Generate New Token .
4. Copy and save the token value.

TIP
This example uses a personal access token, but you can use an Azure Active Directory token for most APIs. As a best
practice, use a PAT for administrative configuration tasks such as this one, and prefer Azure AD tokens for production
workloads.
You can restrict the generation of PATs to administrators only for security purposes. See Manage personal access tokens
for more details.
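If you prefer to script your REST calls in Python rather than cURL, the following sketch (not part of the original tutorial) shows how the PAT is passed as a Bearer token. The workspace URL, the token value, and the choice of the Clusters API as a test endpoint are placeholders and assumptions you can adapt.

import requests

# Placeholder values; replace with your workspace URL and the PAT you just created.
DATABRICKS_HOST = "https://<per-workspace-url>"
DATABRICKS_PAT = "<personal-access-token>"

# Call a read-only endpoint (list clusters) to confirm the token authenticates.
response = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {DATABRICKS_PAT}"},
)
response.raise_for_status()
print(response.json())
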

Add the service principal to the Azure Databricks workspace


You add the Azure AD service principal to a workspace using the SCIM API 2.0. You must also give the service
principal permission to launch automated job clusters. You can grant this through the allow-cluster-create
permission. Open a terminal and run the following command to add the service principal and grant the required
permissions:

curl -X POST 'https://<per-workspace-url>/api/2.0/preview/scim/v2/ServicePrincipals' \
--header 'Content-Type: application/scim+json' \
--header 'Authorization: Bearer <personal-access-token>' \
--data-raw '{
  "schemas":[
    "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"
  ],
  "applicationId":"<application-id>",
  "displayName": "test-sp",
  "entitlements":[
    {
      "value":"allow-cluster-create"
    }
  ]
}'

Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
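If you would rather add the service principal with Python than cURL, a rough equivalent of the command above might look like the following sketch; it assumes the requests library is available and reuses the same placeholders.

import requests

DATABRICKS_HOST = "https://<per-workspace-url>"
DATABRICKS_PAT = "<personal-access-token>"

# Same SCIM payload as the cURL command above.
payload = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
    "applicationId": "<application-id>",
    "displayName": "test-sp",
    "entitlements": [{"value": "allow-cluster-create"}],
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={
        "Authorization": f"Bearer {DATABRICKS_PAT}",
        "Content-Type": "application/scim+json",
    },
    json=payload,
)
response.raise_for_status()
print(response.json())
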

Create an Azure Key Vault-backed secret scope in Azure Databricks


Secret scopes provide secure storage and management of secrets. You’ll store the secret associated with the
service principal in a secret scope. You can store secrets in an Azure Databricks secret scope or an Azure Key
Vault-backed secret scope. These instructions describe the Azure Key Vault-backed option:
1. Create an Azure Key Vault instance in the Azure portal.
2. Create the Azure Databricks secret scope backed by the Azure Key Vault instance.
Step 1: Create an Azure Key Vault instance
1. In the Azure portal, select Key Vaults > + Add and give the key vault a name.
2. Click Review + create .
3. After validation completes, click Create .
4. After creating the key vault, go to the Properties page for the new key vault.
5. Copy and save the Vault URI and Resource ID.

Step 2: Create an Azure Key Vault-backed secret scope


Azure Databricks resources can reference secrets stored in an Azure key vault by creating a Key Vault-backed
secret scope. To create the Azure Databricks secret scope:
1. Go to the Azure Databricks Create Secret Scope page at
https://<per-workspace-url>/#secrets/createScope . Replace per-workspace-url with the unique per-
workspace URL for your Azure Databricks workspace.
2. Enter a Scope Name .
3. Enter the Vault URI and Resource ID values for the Azure key vault you created in Step 1: Create an
Azure Key Vault instance.
4. Click Create .
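If you prefer to create the scope programmatically instead of through the UI, the following sketch calls the Secrets API. The payload shape shown here is an assumption based on the Secrets API reference, and creating a Key Vault-backed scope generally requires an Azure AD token rather than a PAT, so treat this as a starting point only.

import requests

DATABRICKS_HOST = "https://<per-workspace-url>"
AAD_TOKEN = "<azure-ad-access-token>"   # Azure AD token; a PAT is typically not accepted for this call

payload = {
    "scope": "<scope-name>",
    "scope_backend_type": "AZURE_KEYVAULT",
    "backend_azure_keyvault": {
        "resource_id": "<key-vault-resource-id>",   # the Resource ID you saved in Step 1
        "dns_name": "<key-vault-vault-uri>",        # the Vault URI you saved in Step 1
    },
    "initial_manage_principal": "users",
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/secrets/scopes/create",
    headers={"Authorization": f"Bearer {AAD_TOKEN}"},
    json=payload,
)
response.raise_for_status()
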
Save the client secret in Azure Key Vault
1. In the Azure portal, go to the Key vaults service.
2. Select the key vault created in Step 1: Create an Azure Key Vault instance.
3. Under Settings > Secrets, click Generate/Import.
4. Select the Manual upload option and enter the client secret in the Value field.
5. Click Create .

Grant the service principal read access to the secret scope


You’ve created a secret scope and stored the service principal’s client secret in that scope. Now you’ll give the
service principal access to read the secret from the secret scope.
Open a terminal and run the following command:

curl -X POST 'https://<per-workspace-url>/api/2.0/secrets/acls/put' \
--header 'Authorization: Bearer <personal-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "scope": "<scope-name>",
  "principal": "<application-id>",
  "permission": "READ"
}'

Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <scope-name> with the name of the Azure Databricks secret scope that contains the client secret.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
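As an optional check (a sketch, not part of the original steps), you can list the ACLs on the scope and confirm that the service principal now has READ permission:

import requests

DATABRICKS_HOST = "https://<per-workspace-url>"
DATABRICKS_PAT = "<personal-access-token>"

# List the ACLs on the scope; expect an entry for the application ID with permission READ.
response = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/secrets/acls/list",
    headers={"Authorization": f"Bearer {DATABRICKS_PAT}"},
    params={"scope": "<scope-name>"},
)
response.raise_for_status()
print(response.json())
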

Create a job in Azure Databricks and configure the cluster to read secrets from the secret scope


You’re now ready to create a job that can run as the new service principal. You’ll use a notebook created in the
Azure Databricks UI and add the configuration to allow the job cluster to retrieve the service principal’s secret.
1. Go to your Azure Databricks landing page and select Create Blank Notebook . Give your notebook a
name and select SQL as the default language.
2. Enter SELECT 1 in the first cell of the notebook. This is a simple command that just displays 1 if it
succeeds. If you have granted your service principal access to particular files or paths in Azure Data Lake
Storage Gen 2, you can read from those paths instead.
3. Go to Workflows and click the + Create Job button. Give the job a name, click Select Notebook , and
select the notebook you just created.
4. Click Edit next to the Cluster information.
5. On the Configure Cluster page, click Advanced Options .
6. On the Spark tab, enter the following Spark Config:

fs.azure.account.auth.type.acmeadls.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.acmeadls.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.acmeadls.dfs.core.windows.net <application-id>
fs.azure.account.oauth2.client.secret.acmeadls.dfs.core.windows.net {{secrets/<secret-scope-name>/<secret-name>}}
fs.azure.account.oauth2.client.endpoint.acmeadls.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token

Replace <secret-scope-name> with the name of the Azure Databricks secret scope that contains the
client secret.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
Replace <secret-name> with the name associated with the client secret value in the secret scope.
Replace <directory-id> with the Directory (tenant) ID for the Azure AD application registration.
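If you chose Python instead of SQL for the notebook and your service principal has access to files in Azure Data Lake Storage Gen2, a minimal read might look like the following sketch. The storage account name matches the acmeadls example used in the Spark config above; the container and file path are placeholders.

# Minimal sketch: read a file the service principal can access. Assumes the cluster's
# Spark config above supplies the OAuth credentials for the storage account.
storage_account = "acmeadls"        # example account name used in the Spark config
container = "<container-name>"
file_path = "<path-to-file>"

df = spark.read.text(f"abfss://{container}@{storage_account}.dfs.core.windows.net/{file_path}")
display(df.limit(10))
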

Transfer ownership of the job to the service principal


A job can have exactly one owner, so you’ll need to transfer ownership of the job from yourself to the service
principal. To ensure that other users can manage the job, you can also grant Can Manage permissions to a group.
In this example, we use the Permissions API to set these permissions.
Open a terminal and run the following command:

curl -X PUT 'https://<per-workspace-url>/api/2.0/permissions/jobs/<job-id>' \
--header 'Authorization: Bearer <personal-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "access_control_list": [
    {
      "service_principal_name": "<application-id>",
      "permission_level": "IS_OWNER"
    },
    {
      "group_name": "admins",
      "permission_level": "CAN_MANAGE"
    }
  ]
}'

Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <job-id> with the numeric ID of the job you created. You can find it in the job's URL or on the job details page in Workflows.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.

The job will also need read permissions to the notebook. Run the following command to grant the required
permissions:
curl -X PUT 'https://<per-workspace-url>/api/2.0/permissions/notebooks/<notebook-id>' \
--header 'Authorization: Bearer <personal-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
"access_control_list": [
{
"service_principal_name": "<application-id>",
"permission_level": "CAN_READ"
}
]
}'

Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <notebook-id> with the ID of the notebook associated with the job. To find the ID, go to the notebook
in the Azure Databricks workspace and look for the numeric ID that follows notebook/ in the notebook’s
URL.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.

Test the job


You run jobs with a service principal the same way you run jobs as a user, either through the UI, API, or CLI. To
test the job using the Azure Databricks UI:
1. Go to Workflows in the Azure Databricks UI and select the job.
2. Click Run Now .
You’ll see a status of Succeeded for the job if everything runs correctly. You can select the job in the UI to verify
the output.
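You can also trigger the run from the REST API instead of the UI. The following sketch uses the Jobs run-now endpoint; the job ID is a placeholder you can read from the job's URL in the Workflows UI.

import requests

DATABRICKS_HOST = "https://<per-workspace-url>"
DATABRICKS_PAT = "<personal-access-token>"
JOB_ID = 0   # replace with the numeric ID of the job you created

# Trigger the job and print the run ID returned by the API.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {DATABRICKS_PAT}"},
    json={"job_id": JOB_ID},
)
response.raise_for_status()
print(response.json())
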

Learn more
To learn more about creating and running jobs, see Jobs.
Tutorial: Query a SQL Server Linux Docker
container in a virtual network from an Azure
Databricks notebook

This tutorial teaches you how to integrate Azure Databricks with a SQL Server Linux Docker container in a
virtual network.
In this tutorial, you learn how to:
Deploy an Azure Databricks workspace to a virtual network
Install a Linux virtual machine in a public network
Install Docker
Install a Microsoft SQL Server on Linux Docker container
Query the SQL Server using JDBC from a Databricks notebook

Prerequisites
Create a Databricks workspace in a virtual network.
Install Ubuntu for Windows.
Download SQL Server Management Studio.

Create a Linux virtual machine


1. In the Azure portal, select the icon for Virtual Machines. Then, select + Add.
2. On the Basics tab, choose Ubuntu Server 18.04 LTS and change the VM size to B2s. Choose an
administrator username and password.

3. Navigate to the Networking tab. Choose the virtual network and the public subnet that includes your
Azure Databricks cluster. Select Review + create , then Create to deploy the virtual machine.

4. When the deployment is complete, navigate to the virtual machine. Notice the Public IP address and
Virtual network/subnet in the Overview. Select the Public IP Address.
5. Change the Assignment to Static and enter a DNS name label. Select Save, and restart the virtual
machine.

6. Select the Networking tab under Settings. Notice that the network security group that was created
during the Azure Databricks deployment is associated with the virtual machine. Select Add inbound
port rule.
7. Add a rule to open port 22 for SSH. Use the following settings:

SETTING | SUGGESTED VALUE | DESCRIPTION
Source | IP Addresses | IP Addresses specifies that incoming traffic from a specific source IP address will be allowed or denied by this rule.
Source IP addresses | <your public ip> | Enter your public IP address. You can find your public IP address by visiting bing.com and searching for "my IP".
Source port ranges | * | Allow traffic from any port.
Destination | IP Addresses | IP Addresses specifies that outgoing traffic to a specific destination IP address will be allowed or denied by this rule.
Destination IP addresses | <your vm public ip> | Enter your virtual machine's public IP address. You can find this on the Overview page of your virtual machine.
Destination port ranges | 22 | Open port 22 for SSH.
Priority | 290 | Give the rule a priority.
Name | ssh-databricks-tutorial-vm | Give the rule a name.


8. Add a rule to open port 1433 for SQL with the following settings:

SETTING | SUGGESTED VALUE | DESCRIPTION
Source | Any | Source specifies that incoming traffic from a specific source IP address will be allowed or denied by this rule.
Source port ranges | * | Allow traffic from any port.
Destination | IP Addresses | IP Addresses specifies that outgoing traffic to a specific destination IP address will be allowed or denied by this rule.
Destination IP addresses | <your vm public ip> | Enter your virtual machine's public IP address. You can find this on the Overview page of your virtual machine.
Destination port ranges | 1433 | Open port 1433 for SQL Server.
Priority | 300 | Give the rule a priority.
Name | sql-databricks-tutorial-vm | Give the rule a name.


Run SQL Server in a Docker container
1. Open Ubuntu for Windows, or any other tool that will allow you to SSH into the virtual machine.
Navigate to your virtual machine in the Azure portal and select Connect to get the SSH command you
need to connect.

2. Enter the command in your Ubuntu terminal and enter the admin password you created when you
configured the virtual machine.
3. Use the following command to install Docker on the virtual machine.

sudo apt-get install docker.io

Verify the install of Docker with the following command:

sudo docker --version

4. Install the image.

sudo docker pull mcr.microsoft.com/mssql/server:2017-latest

Check the images.

sudo docker images

5. Run the container from the image.

sudo docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Password1234' -p 1433:1433 --name sql1 -d mcr.microsoft.com/mssql/server:2017-latest

Verify that the container is running.

sudo docker ps -a

Create a SQL database


1. Open SQL Server Management Studio and connect to the server using the server name and SQL
Authentication. The sign-in username is SA and the password is the password set in the Docker
command. The password in the example command is Password1234.
2. Once you've successfully connected, select New Query and enter the following code snippet to create a
database, a table, and insert some records in the table.

CREATE DATABASE MYDB;
GO
USE MYDB;
CREATE TABLE states(Name VARCHAR(20), Capitol VARCHAR(20));
INSERT INTO states VALUES ('Delaware','Dover');
INSERT INTO states VALUES ('South Carolina','Columbia');
INSERT INTO states VALUES ('Texas','Austin');
SELECT * FROM states
GO

Query SQL Server from Azure Databricks


1. Navigate to your Azure Databricks workspace and verify that you created a cluster as part of the
prerequisites. Then, select Create a Notebook . Give the notebook a name, select Python as the
language, and select the cluster you created.
2. Use the following command to ping the internal IP Address of the SQL Server virtual machine. This ping
should be successful. If not, verify that the container is running, and review the network security group
(NSG) configuration.

%sh
ping 10.179.64.4

You can also use the nslookup command to review.

%sh
nslookup databricks-tutorial-vm.westus2.cloudapp.azure.com

3. Once you've successfully pinged the SQL Server, you can query the database and tables. Run the
following Python code:

jdbcHostname = "10.179.64.4"
jdbcDatabase = "MYDB"
userName = 'SA'
password = 'Password1234'
jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname,
jdbcPort, jdbcDatabase, userName, password)

df = spark.read.jdbc(url=jdbcUrl, table='states')
display(df)
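If you only need a subset of the rows, you can push a query down to SQL Server instead of reading the whole table. The following sketch reuses the jdbcUrl defined above and wraps a query in a subquery alias; the filter shown is just an illustration.

# Sketch: push a filter down to SQL Server by passing a subquery as the table argument.
pushdown_query = "(SELECT Name, Capitol FROM states WHERE Name LIKE 'T%') AS states_subset"
df_filtered = spark.read.jdbc(url=jdbcUrl, table=pushdown_query)
display(df_filtered)
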

Clean up resources
When no longer needed, delete the resource group, the Azure Databricks workspace, and all related resources.
Deleting these resources avoids unnecessary billing. If you're planning to use the Azure Databricks workspace in the future,
you can stop the cluster and restart it later. If you are not going to continue to use this Azure Databricks
workspace, delete all resources you created in this tutorial by using the following steps:
1. From the left-hand menu in the Azure portal, click Resource groups and then click the name of the
resource group you created.
2. On your resource group page, select Delete , type the name of the resource to delete in the text box, and
then select Delete again.

Next steps
Advance to the next article to learn how to extract, transform, and load data using Azure Databricks.
Tutorial: Extract, transform, and load data by using Azure Databricks
Tutorial: Access Azure Blob Storage from Azure
Databricks using Azure Key Vault

This tutorial describes how to access Azure Blob Storage from Azure Databricks using secrets stored in a key
vault.
In this tutorial, you learn how to:
Create a storage account and blob container
Create an Azure Key Vault and add a secret
Create an Azure Databricks workspace and add a secret scope
Access your blob container from Azure Databricks

Prerequisites
Azure subscription - create one for free

Sign in to the Azure portal


Sign in to the Azure portal.

NOTE
This tutorial cannot be carried out using Azure Free Trial Subscription . If you have a free account, go to your profile
and change your subscription to pay-as-you-go . For more information, see Azure free account. Then, remove the
spending limit, and request a quota increase for vCPUs in your region. When you create your Azure Databricks
workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free
Premium Azure Databricks DBUs for 14 days.

Create a storage account and blob container


1. In the Azure portal, select Create a resource > Storage . Then select Storage account .
2. Select your subscription and resource group, or create a new resource group. Then enter a storage
account name, and choose a location. Select Review + Create .
3. If the validation is unsuccessful, address the issues and try again. If the validation is successful, select
Create and wait for the storage account to be created.
4. Navigate to your newly created storage account and select Blobs under Services on the Overview
page. Then select + Container and enter a container name. Select OK.

5. Locate a file you want to upload to your blob storage container. If you don't have a file, use a text editor to
create a new text file with some information. In this example, a file named hw.txt contains the text "hello
world." Save your text file locally and upload it to your blob storage container.
6. Return to your storage account and select Access keys under Settings . Copy Storage account name
and key 1 to a text editor for later use in this tutorial.

Create an Azure Key Vault and add a secret


1. In the Azure portal, select Create a resource and enter Key Vault in the search box.

2. The Key Vault resource is automatically selected. Select Create .


3. On the Create key vault page, enter the following information, and keep the default values for the
remaining fields:

PROPERTY | DESCRIPTION
Name | A unique name for your key vault.
Subscription | Choose a subscription.
Resource group | Choose a resource group or create a new one.
Location | Choose a location.


4. After providing the information above, select Create .
5. Navigate to your newly created key vault in the Azure portal and select Secrets. Then, select +
Generate/Import.

6. On the Create a secret page, provide the following information, and keep the default values for the
remaining fields:
PROPERTY | VALUE
Upload options | Manual
Name | Friendly name for your storage account key.
Value | key1 from your storage account.

7. Save the key name in a text editor for use later in this tutorial, and select Create. Then, navigate to the
Properties menu. Copy the DNS Name and Resource ID to a text editor for use later in the tutorial.
Create an Azure Databricks workspace and add a secret scope
1. In the Azure portal, select Create a resource > Analytics > Azure Databricks .

2. Under Azure Databricks Service, provide the following values to create a Databricks workspace.

PROPERTY | DESCRIPTION
Workspace name | Provide a name for your Databricks workspace.
Subscription | From the drop-down, select your Azure subscription.
Resource group | Select the same resource group that contains your key vault.
Location | Select the same location as your Azure Key Vault. For all available regions, see Azure services available by region.
Pricing Tier | Choose between Standard or Premium. For more information on these tiers, see Databricks pricing page.

Select Create.
3. Navigate to your newly created Azure Databricks resource in the Azure portal and select Launch
Workspace .

4. Once your Azure Databricks workspace is open in a separate window, append #secrets/createScope to
the URL. The URL should have the following format:
https://<location>.azuredatabricks.net/?o=<orgID>#secrets/createScope.
5. Enter a scope name, and enter the Azure Key Vault DNS name and Resource ID you saved earlier. Save the
scope name in a text editor for use later in this tutorial. Then, select Create .
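Once you have a cluster and notebook running (you create these in the next section), you can optionally confirm the scope with the dbutils secrets utilities, as in the following sketch. The scope and key names are the ones you chose above, and secret values are redacted if you try to print them.

# List the secret scopes visible to this workspace; your new scope should appear.
print([scope.name for scope in dbutils.secrets.listScopes()])

# List the keys in the scope and fetch one value (the value is redacted in notebook output).
print([item.key for item in dbutils.secrets.list("<scope-name>")])
storage_key = dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
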

Access your blob container from Azure Databricks


1. From the home page of your Azure Databricks workspace, select New Cluster under Common Tasks .
2. Enter a cluster name and select Create cluster . The cluster creation takes a few minutes to complete.
3. Once the cluster is created, navigate to the home page of your Azure Databricks workspace, select New
Notebook under Common Tasks .
4. Enter a notebook name, and set the language to Python. Set the cluster to the name of the cluster you
created in the previous step.
5. Run the following command to mount your blob storage container. Remember to change the values for
the following properties:
your-container-name
your-storage-account-name
mount-name
conf-key
scope-name
key-name

dbutils.fs.mount(
source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})

mount-name is a DBFS path representing where the Blob Storage container or a folder inside the
container (specified in source) will be mounted. A directory is created using the mount-name you
provide.
conf-key can be either fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net or
fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net.
scope-name is the name of the secret scope you created in the previous section.
key-name is the name of the secret you created for the storage account key in your key vault.

6. Run the following command to read the text file in your blob storage container to a dataframe. Change
the values in the command to match your mount name and file name.

df = spark.read.text("/mnt/<mount-name>/<file-name>")

7. Use the following command to display the contents of your file.

df.show()

8. To unmount your blob storage, run the following command:

dbutils.fs.unmount("/mnt/<mount-name>")

9. Notice that once the mount has been unmounted, you can no longer read from your blob storage
account.
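To confirm the state of your mounts at any point, you can inspect them from a notebook. The following sketch lists the current mount points and the files under the mount; the mount name is the placeholder you chose above.

# Show every mount point currently defined in the workspace and its source.
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)

# List the files under your mount (this fails once the mount has been removed).
display(dbutils.fs.ls("/mnt/<mount-name>"))
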
Clean up resources
If you're not going to continue to use this application, delete your entire resource group with the following steps:
1. From the left-hand menu in Azure portal, select Resource groups and navigate to your resource group.
2. Select Delete resource group and type your resource group name. Then select Delete .

Next steps
Advance to the next article to learn how to implement a VNet injected Databricks environment with a Service
Endpoint enabled for Cosmos DB.
Tutorial: Implement Azure Databricks with a Cosmos DB endpoint
Tutorial: Implement Azure Databricks with a Cosmos
DB endpoint

This tutorial describes how to implement a VNet injected Databricks environment with a Service Endpoint
enabled for Cosmos DB.
In this tutorial you learn how to:
Create an Azure Databricks workspace in a virtual network
Create a Cosmos DB service endpoint
Create a Cosmos DB account and import data
Create an Azure Databricks cluster
Query Cosmos DB from an Azure Databricks notebook

Prerequisites
Before you start, do the following:
Create an Azure Databricks workspace in a virtual network.
Download the Spark connector.
Download sample data from the NOAA National Centers for Environmental Information. Select a state or
area and select Search . On the next page, accept the defaults and select Search . Then select CSV
Download on the left side of the page to download the results.
Download the pre-compiled binary of the Azure Cosmos DB Data Migration Tool.

Create a Cosmos DB service endpoint


1. Once you have deployed an Azure Databricks workspace to a virtual network, navigate to the virtual
network in the Azure portal. Notice the public and private subnets that were created through the
Databricks deployment.
2. Select the public-subnet and create a Cosmos DB service endpoint. Then Save .

Create a Cosmos DB account


1. Open the Azure portal. On the upper-left side of the screen, select Create a resource > Databases >
Azure Cosmos DB .
2. Fill out the Instance Details on the Basics tab with the following settings:

SETTING | VALUE
Subscription | your subscription
Resource Group | your resource group
Account Name | db-vnet-service-endpoint
API | Core (SQL)
Location | West US
Geo-Redundancy | Disable
Multi-region Writes | Enable


3. Select the Network tab and configure your virtual network.
a. Choose the virtual network you created as a prerequisite, and then select public-subnet. Notice that
private-subnet shows the note "'Microsoft.AzureCosmosDB' endpoint is missing". This is because you only
enabled the Cosmos DB service endpoint on the public-subnet.
b. Ensure you have Allow access from Azure portal enabled. This setting allows you to access your
Cosmos DB account from the Azure portal. If this option is set to Deny, you will receive errors when
attempting to access your account.

NOTE
It is not necessary for this tutorial, but you can also enable Allow access from my IP if you want the ability to
access your Cosmos DB account from your local machine. For example, if you are connecting to your account
using the Cosmos DB SDK, you need to enable this setting. If it is disabled, you will receive "Access Denied" errors.
4. Select Review + Create , and then Create to create your Cosmos DB account inside the virtual network.
5. Once your Cosmos DB account has been created, navigate to Keys under Settings . Copy the primary
connection string and save it in a text editor for later use.
6. Select Data Explorer and New Container to add a new database and container to your Cosmos DB
account.
Upload data to Cosmos DB
1. Open the graphical interface version of the data migration tool for Cosmos DB, Dtui.exe .
2. On the Source Information tab, select CSV File(s) in the Import from dropdown. Then select Add
Files and add the storm data CSV you downloaded as a prerequisite.

3. On the Target Information tab, input your connection string. The connection string format is
AccountEndpoint=<URL>;AccountKey=<key>;Database=<database> . The AccountEndpoint and AccountKey are
included in the primary connection string you saved in the previous section. Append
Database=<your database name> to the end of the connection string, and select Verify . Then, add the
Container name and partition key.

4. Select Next until you get to the Summary page. Then, select Import.

Create a cluster and add library


1. Navigate to your Azure Databricks service in the Azure portal and select Launch Workspace .
2. Create a new cluster. Choose a Cluster Name and accept the remaining default settings.
3. After your cluster is created, navigate to the cluster page and select the Libraries tab. Select Install New
and upload the Spark connector jar file to install the library.

You can verify that the library was installed on the Libraries tab.
Query Cosmos DB from a Databricks notebook
1. Navigate to your Azure Databricks workspace and create a new Python notebook.

2. Run the following Python code to set the Cosmos DB connection configuration. Change the Endpoint,
Masterkey, Database, and Container accordingly.
connectionConfig = {
"Endpoint" : "https://<your Cosmos DB account name.documents.azure.com:443/",
"Masterkey" : "<your Cosmos DB primary key>",
"Database" : "<your database name>",
"preferredRegions" : "West US 2",
"Container": "<your container name>",
"schema_samplesize" : "1000",
"query_pagesize" : "200000",
"query_custom" : "SELECT * FROM c"
}

3. Use the following Python code to load the data and create a temporary view.

users = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**connectionConfig).load()
users.createOrReplaceTempView("storm")

4. Use the following magic command to execute a SQL statement that returns data.

%sql
select * from storm

You have successfully connected your VNet-injected Databricks workspace to a service-endpoint enabled
Cosmos DB resource. To read more about how to connect to Cosmos DB, see Azure Cosmos DB
Connector for Apache Spark.
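If you prefer to stay in Python rather than use the %sql magic, a rough equivalent of the query above is shown below; it reuses the storm temporary view created earlier, and the row limit is just an illustration.

# Query the temporary view from Python and show a sample of the results.
storm_df = spark.sql("SELECT * FROM storm")
print(storm_df.count())        # total number of imported records
display(storm_df.limit(10))
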

Clean up resources
When no longer needed, delete the resource group, the Azure Databricks workspace, and all related resources.
Deleting these resources avoids unnecessary billing. If you're planning to use the Azure Databricks workspace in the future,
you can stop the cluster and restart it later. If you are not going to continue to use this Azure Databricks
workspace, delete all resources you created in this tutorial by using the following steps:
1. From the left-hand menu in the Azure portal, click Resource groups and then click the name of the
resource group you created.
2. On your resource group page, select Delete , type the name of the resource to delete in the text box, and
then select Delete again.

Next steps
In this tutorial, you've deployed an Azure Databricks workspace to a virtual network, and used the Cosmos DB
Spark connector to query Cosmos DB data from Databricks. To learn more about working with Azure Databricks
in a virtual network, continue to the tutorial for using SQL Server with Azure Databricks.
Tutorial: Query a SQL Server Linux Docker container in a virtual network from an Azure Databricks notebook
Tutorial: Extract, transform, and load data by using
Azure Databricks

In this tutorial, you perform an ETL (extract, transform, and load data) operation by using Azure Databricks. You
extract data from Azure Data Lake Storage Gen2 into Azure Databricks, run transformations on the data in Azure
Databricks, and load the transformed data into Azure Synapse Analytics.
The steps in this tutorial use the Azure Synapse connector for Azure Databricks to transfer data to Azure
Databricks. This connector, in turn, uses Azure Blob Storage as temporary storage for the data being transferred
between an Azure Databricks cluster and Azure Synapse.
The following illustration shows the application flow:

This tutorial covers the following tasks:


Create an Azure Databricks service.
Create a Spark cluster in Azure Databricks.
Create a file system in the Data Lake Storage Gen2 account.
Upload sample data to the Azure Data Lake Storage Gen2 account.
Create a service principal.
Extract data from the Azure Data Lake Storage Gen2 account.
Transform data in Azure Databricks.
Load data into Azure Synapse.
If you don't have an Azure subscription, create a free account before you begin.

NOTE
This tutorial cannot be carried out using Azure Free Trial Subscription . If you have a free account, go to your profile
and change your subscription to pay-as-you-go . For more information, see Azure free account. Then, remove the
spending limit, and request a quota increase for vCPUs in your region. When you create your Azure Databricks
workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free
Premium Azure Databricks DBUs for 14 days.

Prerequisites
Complete these tasks before you begin this tutorial:
Create an Azure Synapse SQL pool, create a server-level firewall rule, and connect to the server as a server admin.
See Quickstart: Create and query a Synapse SQL pool using the Azure portal.
Create a database master key for the Azure Synapse SQL pool. See Create a database master key.
Create an Azure Blob storage account, and a container within it. Also, retrieve the access key to access the
storage account. See Quickstart: Upload, download, and list blobs with the Azure portal.
Create an Azure Data Lake Storage Gen2 storage account. See Quickstart: Create an Azure Data Lake
Storage Gen2 storage account.
Create a service principal. See How to: Use the portal to create an Azure AD application and service
principal that can access resources.
There are a couple of specific things that you'll have to do as you perform the steps in that article.
When performing the steps in the Assign the application to a role section of the article, make sure
to assign the Storage Blob Data Contributor role to the service principal in the scope of the
Data Lake Storage Gen2 account. If you assign the role to the parent resource group or
subscription, you'll receive permissions-related errors until those role assignments propagate to
the storage account.
If you'd prefer to use an access control list (ACL) to associate the service principal with a specific
file or directory, reference Access control in Azure Data Lake Storage Gen2.
When performing the steps in the Get values for signing in section of the article, paste the tenant
ID, app ID, and secret values into a text file.
Sign in to the Azure portal.

Gather the information that you need


Make sure that you complete the prerequisites of this tutorial.
Before you begin, you should have these items of information:
The database name, database server name, user name, and password of your Azure Synapse.
The access key of your blob storage account.
The name of your Data Lake Storage Gen2 storage account.
The tenant ID of your subscription.
The application ID of the app that you registered with Azure Active Directory (Azure AD).
The authentication key for the app that you registered with Azure AD.

Create an Azure Databricks service


In this section, you create an Azure Databricks service by using the Azure portal.
1. From the Azure portal menu, select Create a resource .
Then, select Analytics > Azure Databricks .

2. Under Azure Databricks Service, provide the following values to create a Databricks service:
PROPERTY | DESCRIPTION
Workspace name | Provide a name for your Databricks workspace.
Subscription | From the drop-down, select your Azure subscription.
Resource group | Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview.
Location | Select West US 2. For other available regions, see Azure services available by region.
Pricing Tier | Select Standard.

3. The account creation takes a few minutes. To monitor the operation status, view the progress bar at the
top.
4. Select Pin to dashboard and then select Create .

Create a Spark cluster in Azure Databricks


1. In the Azure portal, go to the Databricks service that you created, and select Launch Workspace .
2. You're redirected to the Azure Databricks portal. From the portal, select Cluster .

3. In the New cluster page, provide the values to create a cluster.


4. Fill in values for the following fields, and accept the default values for the other fields:
Enter a name for the cluster.
Make sure you select the Terminate after __ minutes of inactivity check box. If the cluster isn't
being used, provide a duration (in minutes) to terminate the cluster.
Select Create cluster . After the cluster is running, you can attach notebooks to the cluster and run
Spark jobs.

Create a file system in the Azure Data Lake Storage Gen2 account
In this section, you create a notebook in the Azure Databricks workspace and then run code snippets to configure
the storage account.
1. In the Azure portal, go to the Azure Databricks service that you created, and select Launch Workspace .
2. On the left, select Workspace . From the Workspace drop-down, select Create > Notebook .
3. In the Create Notebook dialog box, enter a name for the notebook. Select Scala as the language, and
then select the Spark cluster that you created earlier.

4. Select Create .
5. The following code block sets default service principal credentials for any ADLS Gen 2 account accessed
in the Spark session. The second code block appends the account name to the setting to specify
credentials for a specific ADLS Gen 2 account. Copy and paste either code block into the first cell of your
Azure Databricks notebook.
Session configuration

val appID = "<appID>"


val secret = "<secret>"
val tenantID = "<tenant-id>"

spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<appID>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<tenant-
id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")

Account configuration
val storageAccountName = "<storage-account-name>"
val appID = "<app-id>"
val secret = "<secret>"
val fileSystemName = "<file-system-name>"
val tenantID = "<tenant-id>"

spark.conf.set("fs.azure.account.auth.type." + storageAccountName + ".dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type." + storageAccountName + ".dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id." + storageAccountName + ".dfs.core.windows.net", "" + appID + "")
spark.conf.set("fs.azure.account.oauth2.client.secret." + storageAccountName + ".dfs.core.windows.net", "" + secret + "")
spark.conf.set("fs.azure.account.oauth2.client.endpoint." + storageAccountName + ".dfs.core.windows.net", "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

6. In this code block, replace the <app-id> , <secret> , <tenant-id> , and <storage-account-name>
placeholder values with the values that you collected while completing the
prerequisites of this tutorial. Replace the <file-system-name> placeholder value with whatever name you
want to give the file system.
The <app-id> , and <secret> are from the app that you registered with active directory as part of
creating a service principal.
The <tenant-id> is from your subscription.
The <storage-account-name> is the name of your Azure Data Lake Storage Gen2 storage account.
7. Press the SHIFT + ENTER keys to run the code in this block.

Ingest sample data into the Azure Data Lake Storage Gen2 account
Before you begin this section, make sure you've completed the prerequisites of this tutorial.
Enter the following code into a notebook cell:

%sh wget -P /tmp https://raw.githubusercontent.com/Azure/usql/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json

In the cell, press SHIFT + ENTER to run the code.


Now in a new cell below this one, enter the following code, and replace the values that appear in brackets with
the same values you used earlier:

dbutils.fs.cp("file:///tmp/small_radio_json.json", "abfss://" + fileSystemName + "@" + storageAccountName +


".dfs.core.windows.net/")

In the cell, press SHIFT + ENTER to run the code.

Extract data from the Azure Data Lake Storage Gen2 account
1. You can now load the sample json file as a data frame in Azure Databricks. Paste the following code in a
new cell. Replace the placeholders shown in brackets with your values.
val df = spark.read.json("abfss://" + fileSystemName + "@" + storageAccountName +
".dfs.core.windows.net/small_radio_json.json")

2. Press the SHIFT + ENTER keys to run the code in this block.
3. Run the following code to see the contents of the data frame:

df.show()

You see an output similar to the following snippet:

+---------------------+---------+---------+------+-------------+----------+---------+-------+--------
------------+------+--------+-------------+---------+--------------------+------+-------------+------
+
| artist| auth|firstName|gender|itemInSession| lastName| length| level|
location|method| page| registration|sessionId| song|status| ts|userId|
+---------------------+---------+---------+------+-------------+----------+---------+-------+--------
------------+------+--------+-------------+---------+--------------------+------+-------------+------
+
| El Arrebato |Logged In| Annalyse| F| 2|Montgomery|234.57914| free |
Killeen-Temple, TX| PUT|NextSong|1384448062332| 1879|Quiero Quererte Q...| 200|1409318650332|
309|
| Creedence Clearwa...|Logged In| Dylann| M| 9| Thomas|340.87138| paid |
Anchorage, AK| PUT|NextSong|1400723739332| 10| Born To Move| 200|1409318653332|
11|
| Gorillaz |Logged In| Liam| M| 11| Watts|246.17751| paid |New
York-Newark-J...| PUT|NextSong|1406279422332| 2047| DARE| 200|1409318685332|
201|
...
...

You have now extracted the data from Azure Data Lake Storage Gen2 into Azure Databricks.

Transform data in Azure Databricks


The raw sample data small_radio_json.json file captures the audience for a radio station and has a variety of
columns. In this section, you transform the data to only retrieve specific columns from the dataset.
1. First, retrieve only the columns firstName , lastName , gender , location , and level from the dataframe
that you created.

val specificColumnsDf = df.select("firstname", "lastname", "gender", "location", "level")


specificColumnsDf.show()

You receive output as shown in the following snippet:


+---------+----------+------+--------------------+-----+
|firstname| lastname|gender| location|level|
+---------+----------+------+--------------------+-----+
| Annalyse|Montgomery| F| Killeen-Temple, TX| free|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Tess| Townsend| F|Nashville-Davidso...| free|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
|Gabriella| Shelton| F|San Jose-Sunnyval...| free|
| Elijah| Williams| M|Detroit-Warren-De...| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Tess| Townsend| F|Nashville-Davidso...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Elijah| Williams| M|Detroit-Warren-De...| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
+---------+----------+------+--------------------+-----+

2. You can further transform this data to rename the column level to subscription_type .

val renamedColumnsDF = specificColumnsDf.withColumnRenamed("level", "subscription_type")


renamedColumnsDF.show()

You receive output as shown in the following snippet.

+---------+----------+------+--------------------+-----------------+
|firstname| lastname|gender| location|subscription_type|
+---------+----------+------+--------------------+-----------------+
| Annalyse|Montgomery| F| Killeen-Temple, TX| free|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Tess| Townsend| F|Nashville-Davidso...| free|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
|Gabriella| Shelton| F|San Jose-Sunnyval...| free|
| Elijah| Williams| M|Detroit-Warren-De...| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Tess| Townsend| F|Nashville-Davidso...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Elijah| Williams| M|Detroit-Warren-De...| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
+---------+----------+------+--------------------+-----------------+

Load data into Azure Synapse


In this section, you upload the transformed data into Azure Synapse. You use the Azure Synapse connector for
Azure Databricks to directly upload a dataframe as a table in an Azure Synapse SQL pool.
As mentioned earlier, the Azure Synapse connector uses Azure Blob storage as temporary storage to upload
data between Azure Databricks and Azure Synapse. So, you start by providing the configuration to connect to
the storage account. You must have already created the account as part of the prerequisites for this
article.
1. Provide the configuration to access the Azure Storage account from Azure Databricks.

val blobStorage = "<blob-storage-account-name>.blob.core.windows.net"


val blobContainer = "<blob-container-name>"
val blobAccessKey = "<access-key>"

2. Specify a temporary folder to use while moving data between Azure Databricks and Azure Synapse.

val tempDir = "wasbs://" + blobContainer + "@" + blobStorage +"/tempDirs"

3. Run the following snippet to store Azure Blob storage access keys in the configuration. This action
ensures that you don't have to keep the access key in the notebook in plain text.

val acntInfo = "fs.azure.account.key."+ blobStorage


sc.hadoopConfiguration.set(acntInfo, blobAccessKey)

4. Provide the values to connect to the Azure Synapse instance. You must have created an Azure Synapse
Analytics service as a prerequisite. Use the fully qualified server name for dwServer. For example,
<servername>.database.windows.net.

//Azure Synapse related settings
val dwDatabase = "<database-name>"
val dwServer = "<database-server-name>"
val dwUser = "<user-name>"
val dwPass = "<password>"
val dwJdbcPort = "1433"
val dwJdbcExtraOptions = "encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
val sqlDwUrl = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass + ";" + dwJdbcExtraOptions
val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass

5. Run the following snippet to load the transformed dataframe, renamedColumnsDF , as a table in Azure
Synapse. This snippet creates a table called SampleTable in the SQL database.

spark.conf.set(
"spark.sql.parquet.writeLegacyFormat",
"true")

renamedColumnsDF.write.format("com.databricks.spark.sqldw").option("url",
sqlDwUrlSmall).option("dbtable", "SampleTable") .option(
"forward_spark_azure_storage_credentials","True").option("tempdir", tempDir).mode("overwrite").save()
NOTE
This sample uses the forward_spark_azure_storage_credentials flag, which causes Azure Synapse to access
data from blob storage using an Access Key. This is the only supported method of authentication.
If your Azure Blob Storage is restricted to select virtual networks, Azure Synapse requires Managed Service
Identity instead of Access Keys. This will cause the error "This request is not authorized to perform this operation."

6. Connect to the SQL database and verify that you see a table named SampleTable.

7. Run a select query to verify the contents of the table. The table should have the same data as the
renamedColumnsDF dataframe.

Clean up resources
After you finish the tutorial, you can terminate the cluster. From the Azure Databricks workspace, select Clusters
on the left. For the cluster to terminate, under Actions , point to the ellipsis (...) and select the Terminate icon.

If you don't manually terminate the cluster, it automatically stops, provided you selected the Terminate after __
minutes of inactivity check box when you created the cluster. In such a case, the cluster automatically stops if
it's been inactive for the specified time.
Next steps
In this tutorial, you learned how to:
Create an Azure Databricks service
Create a Spark cluster in Azure Databricks
Create a notebook in Azure Databricks
Extract data from a Data Lake Storage Gen2 account
Transform data in Azure Databricks
Load data into Azure Synapse
Advance to the next tutorial to learn about streaming real-time data into Azure Databricks using Azure Event
Hubs.
Stream data into Azure Databricks using Event Hubs
Tutorial: Stream data into Azure Databricks using
Event Hubs

In this tutorial, you connect a data ingestion system with Azure Databricks to stream data into an Apache Spark
cluster in near real-time. You set up a data ingestion system using Azure Event Hubs and then connect it to Azure
Databricks to process the messages coming through. To access a stream of data, you use Twitter APIs to ingest
tweets into Event Hubs. Once you have the data in Azure Databricks, you can run analytical jobs to further
analyze the data.
By the end of this tutorial, you will have streamed tweets from Twitter (that have the term "Azure" in them)
and read the tweets in Azure Databricks.
The following illustration shows the application flow:

This tutorial covers the following tasks:


Create an Azure Databricks workspace
Create a Spark cluster in Azure Databricks
Create a Twitter app to access streaming data
Create notebooks in Azure Databricks
Attach libraries for Event Hubs and Twitter API
Send tweets to Event Hubs
Read tweets from Event Hubs
If you don't have an Azure subscription, create a free account before you begin.

NOTE
This tutorial cannot be carried out using Azure Free Trial Subscription . If you have a free account, go to your profile
and change your subscription to pay-as-you-go . For more information, see Azure free account. Then, remove the
spending limit, and request a quota increase for vCPUs in your region. When you create your Azure Databricks
workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free
Premium Azure Databricks DBUs for 14 days.

Prerequisites
Before you start with this tutorial, make sure to meet the following requirements:
An Azure Event Hubs namespace.
An Event Hub within the namespace.
Connection string to access the Event Hubs namespace. The connection string should have a format similar to
Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key name>;SharedAccessKey=<key value>.
Shared access policy name and policy key for Event Hubs.
You can meet these requirements by completing the steps in the article, Create an Azure Event Hubs namespace
and event hub.

Sign in to the Azure portal


Sign in to the Azure portal.

Create an Azure Databricks workspace


In this section, you create an Azure Databricks workspace using the Azure portal.
1. In the Azure portal, select Create a resource > Data + Analytics > Azure Databricks .

2. Under Azure Databricks Service, provide the values to create a Databricks workspace.
Provide the following values:

PROPERTY | DESCRIPTION
Workspace name | Provide a name for your Databricks workspace.
Subscription | From the drop-down, select your Azure subscription.
Resource group | Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview.
Location | Select East US 2. For other available regions, see Azure services available by region.
Pricing Tier | Choose between Standard or Premium. For more information on these tiers, see Databricks pricing page.

Select Pin to dashboard and then select Create.


3. The account creation takes a few minutes. During account creation, the portal displays the Submitting
deployment for Azure Databricks tile on the right side. You may need to scroll right on your
dashboard to see the tile. There is also a progress bar displayed near the top of the screen. You can watch
either area for progress.
Create a Spark cluster in Databricks
1. In the Azure portal, go to the Databricks workspace that you created, and then select Launch
Workspace .
2. You are redirected to the Azure Databricks portal. From the portal, select Cluster .

3. In the New cluster page, provide the values to create a cluster.


Accept all other default values other than the following:
Enter a name for the cluster.
For this article, create a cluster with the Databricks Runtime 6.0.
Make sure you select the Terminate after __ minutes of inactivity checkbox. Provide a duration (in
minutes) to terminate the cluster, if the cluster is not being used.
Select cluster worker and driver node size suitable for your technical criteria and budget.
Select Create cluster . Once the cluster is running, you can attach notebooks to the cluster and run Spark
jobs.

Create a Twitter application


To receive a stream of tweets, you create an application in Twitter. Follow the instructions below to create a Twitter
application and record the values that you need to complete this tutorial.
1. From a web browser, go to Twitter For Developers, and select Create an app . You might see a message
saying that you need to apply for a Twitter developer account. Feel free to do so, and after your
application has been approved you should see a confirmation email. It could take several days to be
approved for a developer account.
2. In the Create an application page, provide the details for the new app, and then select Create your
Twitter application .
3. In the application page, select the Keys and Tokens tab and copy the values for Consumer API Key
and Consumer API Secret Key . Also, select Create under Access Token and Access Token Secret
to generate the access tokens. Copy the values for Access Token and Access Token Secret .

Save the values that you retrieved for the Twitter application. You need the values later in the tutorial.

Attach libraries to Spark cluster


In this tutorial, you use the Twitter APIs to send tweets to Event Hubs. You also use the Apache Spark Event Hubs
connector to read and write data into Azure Event Hubs. To use these APIs as part of your cluster, add them as
libraries to Azure Databricks and associate them with your Spark cluster. The following instructions show how to
add a library.
1. In the Azure Databricks workspace, select Clusters , and choose your existing Spark cluster. Within the
cluster menu, choose Libraries and click Install New .
2. In the New Library page, for Source select Maven . Individually enter the following coordinates for the
Spark Event Hubs connector and the Twitter API into Coordinates .
Spark Event Hubs connector - com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12
Twitter API - org.twitter4j:twitter4j-core:4.0.7
3. Select Install .
4. In the cluster menu, make sure both libraries are installed and attached properly.

5. Repeat these steps for the Twitter package, twitter4j-core:4.0.7 .

Create notebooks in Databricks


In this section, you create two notebooks in Databricks workspace with the following names:
SendTweetsToEventHub - A producer notebook you use to get tweets from Twitter and stream them to
Event Hubs.
ReadTweetsFromEventHub - A consumer notebook you use to read the tweets from Event Hubs.
1. In the left pane, select Workspace . From the Workspace drop-down, select Create > Notebook .

2. In the Create Notebook dialog box, enter SendTweetsToEventHub , select Scala as the language, and
select the Spark cluster that you created earlier.

Select Create .
3. Repeat the steps to create the ReadTweetsFromEventHub notebook.

Send tweets to Event Hubs


In the SendTweetsToEventHub notebook, paste the following code, and replace the placeholders with values
for your Event Hubs namespace and Twitter application that you created earlier. This notebook streams tweets
with the keyword "Azure" into Event Hubs in real time.

NOTE
The Twitter API has certain request restrictions and quotas. If the standard rate limiting in the Twitter API does not
meet your needs, you can generate text content without using the Twitter API in this example. To do that, set the
variable dataSource to test instead of twitter and populate the list testSource with your preferred test input.

import scala.collection.JavaConverters._
import com.microsoft.azure.eventhubs._
import java.util.concurrent._
import scala.collection.immutable._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
val namespaceName = "<EVENT HUBS NAMESPACE>"
val eventHubName = "<EVENT HUB NAME>"
val sasKeyName = "<POLICY NAME>"
val sasKey = "<POLICY KEY>"
val connStr = new ConnectionStringBuilder()
.setNamespaceName(namespaceName)
.setEventHubName(eventHubName)
.setSasKeyName(sasKeyName)
.setSasKey(sasKey)

val pool = Executors.newScheduledThreadPool(1)


val eventHubClient = EventHubClient.createFromConnectionString(connStr.toString(), pool)

def sleep(time: Long): Unit = Thread.sleep(time)

def sendEvent(message: String, delay: Long) = {


sleep(delay)
val messageData = EventData.create(message.getBytes("UTF-8"))
eventHubClient.get().send(messageData)
System.out.println("Sent event: " + message + "\n")
}

// Add your own values to the list


val testSource = List("Azure is the greatest!", "Azure isn't working :(", "Azure is okay.")

// Specify 'test' if you prefer not to use the Twitter API and instead loop through the values you define in `testSource`.
// Otherwise specify 'twitter'.
val dataSource = "test"

if (dataSource == "twitter") {

import twitter4j._
import twitter4j.TwitterFactory
import twitter4j.Twitter
import twitter4j.conf.ConfigurationBuilder

// Twitter configuration!
// Replace the values below with your Twitter application credentials.

val twitterConsumerKey = "<CONSUMER API KEY>"


val twitterConsumerSecret = "<CONSUMER API SECRET>"
val twitterOauthAccessToken = "<ACCESS TOKEN>"
val twitterOauthTokenSecret = "<TOKEN SECRET>"

val cb = new ConfigurationBuilder()


cb.setDebugEnabled(true)
.setOAuthConsumerKey(twitterConsumerKey)
.setOAuthConsumerSecret(twitterConsumerSecret)
.setOAuthAccessToken(twitterOauthAccessToken)
.setOAuthAccessTokenSecret(twitterOauthTokenSecret)

val twitterFactory = new TwitterFactory(cb.build())


val twitter = twitterFactory.getInstance()

// Getting tweets with keyword "Azure" and sending them to the Event Hub in realtime!
val query = new Query(" #Azure ")
query.setCount(100)
query.lang("en")
var finished = false
while (!finished) {
val result = twitter.search(query)
val statuses = result.getTweets()
var lowestStatusId = Long.MaxValue
for (status <- statuses.asScala) {
if(!status.isRetweet()){
sendEvent(status.getText(), 5000)
}
lowestStatusId = Math.min(status.getId(), lowestStatusId)
}
query.setMaxId(lowestStatusId - 1)
}

} else if (dataSource == "test") {


// Loop through the list of test input data
while (true) {
testSource.foreach {
sendEvent(_,5000)
}
}

} else {
System.out.println("Unsupported Data Source. Set 'dataSource' to \"twitter\" or \"test\"")
}

// Closing connection to the Event Hub


eventHubClient.get().close()

To run the notebook, press SHIFT + ENTER . You see output like the snippet below. Each event in the output
is a tweet containing the term "Azure" that is ingested into Event Hubs.

Sent event: @Microsoft and @Esri launch Geospatial AI on Azure https://t.co/VmLUCiPm6q via @geoworldmedia
#geoai #azure #gis #ArtificialIntelligence

Sent event: Public preview of Java on App Service, built-in support for Tomcat and OpenJDK
https://t.co/7vs7cKtvah
#cloudcomputing #Azure

Sent event: 4 Killer #Azure Features for #Data #Performance https://t.co/kpIb7hFO2j by @RedPixie

Sent event: Migrate your databases to a fully managed service with Azure SQL Managed Instance | #Azure |
#Cloud https://t.co/sJHXN4trDk

Sent event: Top 10 Tricks to #Save Money with #Azure Virtual Machines https://t.co/F2wshBXdoz #Cloud

...
...

Read tweets from Event Hubs


In the ReadTweetsFromEventHub notebook, paste the following code, and replace the placeholders with the
values for the Azure Event Hubs namespace that you created earlier. This notebook reads the tweets that you
earlier streamed into Event Hubs using the SendTweetsToEventHub notebook.
import org.apache.spark.eventhubs._
import com.microsoft.azure.eventhubs._

// Build connection string with the above information


val namespaceName = "<EVENT HUBS NAMESPACE>"
val eventHubName = "<EVENT HUB NAME>"
val sasKeyName = "<POLICY NAME>"
val sasKey = "<POLICY KEY>"
val connStr = new com.microsoft.azure.eventhubs.ConnectionStringBuilder()
.setNamespaceName(namespaceName)
.setEventHubName(eventHubName)
.setSasKeyName(sasKeyName)
.setSasKey(sasKey)

val customEventhubParameters =
EventHubsConf(connStr.toString())
.setMaxEventsPerTrigger(5)

val incomingStream = spark.readStream.format("eventhubs").options(customEventhubParameters.toMap).load()

incomingStream.printSchema

// Sending the incoming stream into the console.


// Data comes in batches!
incomingStream.writeStream.outputMode("append").format("console").option("truncate", false).start().awaitTermination()

You get the following output:

root
|-- body: binary (nullable = true)
|-- offset: long (nullable = true)
|-- seqNumber: long (nullable = true)
|-- enqueuedTime: long (nullable = true)
|-- publisher: string (nullable = true)
|-- partitionKey: string (nullable = true)

-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+--------------+---------------+---------+------------+
|body |offset|sequenceNumber|enqueuedTime |publisher|partitionKey|
+------+------+--------------+---------------+---------+------------+
|[50 75 62 6C 69 63 20 70 72 65 76 69 65 77 20 6F 66 20 4A 61 76 61 20 6F 6E 20 41 70 70 20 53 65 72 76 69
63 65 2C 20 62 75 69 6C 74 2D 69 6E 20 73 75 70 70 6F 72 74 20 66 6F 72 20 54 6F 6D 63 61 74 20 61 6E 64 20
4F 70 65 6E 4A 44 4B 0A 68 74 74 70 73 3A 2F 2F 74 2E 63 6F 2F 37 76 73 37 63 4B 74 76 61 68 20 0A 23 63 6C
6F 75 64 63 6F 6D 70 75 74 69 6E 67 20 23 41 7A 75 72 65] |0 |0
|2018-03-09 05:49:08.86 |null |null |
|[4D 69 67 72 61 74 65 20 79 6F 75 72 20 64 61 74 61 62 61 73 65 73 20 74 6F 20 61 20 66 75 6C 6C 79 20 6D
61 6E 61 67 65 64 20 73 65 72 76 69 63 65 20 77 69 74 68 20 41 7A 75 72 65 20 53 51 4C 20 44 61 74 61 62 61
73 65 20 4D 61 6E 61 67 65 64 20 49 6E 73 74 61 6E 63 65 20 7C 20 23 41 7A 75 72 65 20 7C 20 23 43 6C 6F 75
64 20 68 74 74 70 73 3A 2F 2F 74 2E 63 6F 2F 73 4A 48 58 4E 34 74 72 44 6B] |168 |1
|2018-03-09 05:49:24.752|null |null |
+------+------+--------------+---------------+---------+------------+

-------------------------------------------
Batch: 1
-------------------------------------------
...
...

Because the output is in binary format, use the following snippet to convert it into a string.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Event Hub message format is JSON and contains "body" field


// Body is binary, so we cast it to string to see the actual content of the message
val messages =
incomingStream
.withColumn("Offset", $"offset".cast(LongType))
.withColumn("Time (readable)", $"enqueuedTime".cast(TimestampType))
.withColumn("Timestamp", $"enqueuedTime".cast(LongType))
.withColumn("Body", $"body".cast(StringType))
.select("Offset", "Time (readable)", "Timestamp", "Body")

messages.printSchema

messages.writeStream.outputMode("append").format("console").option("truncate", false).start().awaitTermination()

The output now resembles the following snippet:

root
|-- Offset: long (nullable = true)
|-- Time (readable): timestamp (nullable = true)
|-- Timestamp: long (nullable = true)
|-- Body: string (nullable = true)

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----------------+----------+-------+
|Offset|Time (readable) |Timestamp |Body
+------+-----------------+----------+-------+
|0 |2018-03-09 05:49:08.86 |1520574548|Public preview of Java on App Service, built-in support for
Tomcat and OpenJDK
https://t.co/7vs7cKtvah
#cloudcomputing #Azure |
|168 |2018-03-09 05:49:24.752|1520574564|Migrate your databases to a fully managed service with Azure SQL
Managed Instance | #Azure | #Cloud https://t.co/sJHXN4trDk |
|0 |2018-03-09 05:49:02.936|1520574542|@Microsoft and @Esri launch Geospatial AI on Azure
https://t.co/VmLUCiPm6q via @geoworldmedia #geoai #azure #gis #ArtificialIntelligence|
|176 |2018-03-09 05:49:20.801|1520574560|4 Killer #Azure Features for #Data #Performance
https://t.co/kpIb7hFO2j by @RedPixie |
+------+-----------------+----------+-------+
-------------------------------------------
Batch: 1
-------------------------------------------
...
...

That's it! Using Azure Databricks, you have successfully streamed data into Azure Event Hubs in near real-time.
You then consumed the stream data using the Event Hubs connector for Apache Spark. For more information on
how to use the Event Hubs connector for Spark, see the connector documentation.

Clean up resources
After you have finished running the tutorial, you can terminate the cluster. To do so, from the Azure Databricks
workspace, from the left pane, select Clusters . For the cluster you want to terminate, move the cursor over the
ellipsis under Actions column, and select the Terminate icon.
If you do not manually terminate the cluster, it stops automatically, provided you selected the Terminate
after __ minutes of inactivity checkbox when you created the cluster and the cluster has been inactive for
the specified time.

Next steps
In this tutorial, you learned how to:
Create an Azure Databricks workspace
Create a Spark cluster in Azure Databricks
Create a Twitter app to generate streaming data
Create notebooks in Azure Databricks
Add libraries for Event Hubs and Twitter API
Send tweets to Event Hubs
Read tweets from Event Hubs
Databricks runtimes
7/21/2022 • 2 minutes to read

Databricks runtimes are the set of core components that run on Azure Databricks clusters. Azure Databricks
offers several types of runtimes.
Databricks Runtime
Databricks Runtime includes Apache Spark but also adds a number of components and updates that
substantially improve the usability, performance, and security of big data analytics.
Databricks Runtime for Machine Learning
Databricks Runtime ML is a variant of Databricks Runtime that adds multiple popular machine learning
libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
Photon runtime
Photon is the Azure Databricks native vectorized query engine that runs SQL workloads faster and
reduces your total cost per workload.
Databricks Light
Databricks Light provides a runtime option for jobs that don’t need the advanced performance, reliability,
or autoscaling benefits provided by Databricks Runtime.
Databricks Runtime for Genomics (Deprecated)
Databricks Runtime for Genomics is a variant of Databricks Runtime optimized for working with genomic
and biomedical data.
You can choose from among the supported runtime versions when you create a cluster.
For information about the contents of each runtime variant, see the release notes.
Databricks Runtime
7/21/2022 • 2 minutes to read

Databricks Runtime includes Apache Spark but also adds a number of components and updates that
substantially improve the usability, performance, and security of big data analytics:
Delta Lake, a next-generation storage layer built on top of Apache Spark that provides ACID transactions,
optimized layouts and indexes, and execution engine improvements for building data pipelines.
Installed Java, Scala, Python, and R libraries
Ubuntu and its accompanying system libraries
GPU libraries for GPU-enabled clusters
Databricks services that integrate with other components of the platform, such as notebooks, jobs, and
cluster manager
For information about the contents of each runtime version, see the release notes.

Runtime versioning
Databricks Runtime versions are released on a regular basis:
Major versions are represented by an increment to the version number that precedes the decimal point (the
jump from 3.5 to 4.0, for example). They are released when there are major changes, some of which may not
be backwards-compatible.
Feature versions are represented by an increment to the version number that follows the decimal point (the
jump from 3.4 to 3.5, for example). Each major release includes multiple feature releases. Feature releases are
always backwards compatible with previous releases within their major release.
Long Term Support versions are represented by an LTS qualifier (for example, 3.5 LTS ). For each major
release, we declare a “canonical” feature version, for which we provide two full years of support. See
Databricks runtime support lifecycle for more information.
Databricks Runtime for Machine Learning
7/21/2022 • 3 minutes to read

Databricks Runtime for Machine Learning (Databricks Runtime ML) automates the creation of a cluster
optimized for machine learning. Databricks Runtime ML clusters include the most popular machine learning
libraries, such as TensorFlow, PyTorch, Keras, and XGBoost, and also include libraries required for distributed
training such as Horovod. Using Databricks Runtime ML speeds up cluster creation and ensures that the
installed library versions are compatible.
For complete information about using Azure Databricks for machine learning and deep learning, see Databricks
Machine Learning guide.
For information about the contents of each Databricks Runtime ML version, see the release notes.
Databricks Runtime ML is built on Databricks Runtime. For example, Databricks Runtime 7.3 LTS for Machine
Learning is built on Databricks Runtime 7.3 LTS. The libraries included in the base Databricks Runtime are listed
in the Databricks Runtime release notes.

Introduction to Databricks Runtime for Machine Learning


This tutorial is designed for new users of Databricks Runtime ML. It takes about 10 minutes to work through,
and shows a complete end-to-end example of loading tabular data, training a model, distributed
hyperparameter tuning, and model inference. It also illustrates how to use the MLflow API and MLflow Model
Registry.
Databricks tutorial notebook
Get notebook

Libraries included in Databricks Runtime ML


The Databricks Runtime ML includes a variety of popular ML libraries. The libraries are updated with each
release to include new features and fixes.
Azure Databricks has designated a subset of the supported libraries as top-tier libraries. For these libraries,
Azure Databricks provides a faster update cadence, updating to the latest package releases with each runtime
release (barring dependency conflicts). Azure Databricks also provides advanced support, testing, and
embedded optimizations for top-tier libraries.
For a full list of top-tier and other provided libraries, see the following articles for each available runtime:
Databricks Runtime 11.1 for Machine Learning (Beta)
Databricks Runtime 11.0 for Machine Learning
Databricks Runtime 10.5 for Machine Learning
Databricks Runtime 10.4 LTS for Machine Learning
Databricks Runtime 10.3 for Machine Learning
Databricks Runtime 10.2 for Machine Learning (Unsupported)
Databricks Runtime 10.1 for Machine Learning (Unsupported)
Databricks Runtime 10.0 for Machine Learning (Unsupported)
Databricks Runtime 9.1 LTS for Machine Learning
Databricks Runtime 9.0 for Machine Learning (Unsupported)
Databricks Runtime 8.4 for Machine Learning (Unsupported)
Databricks Runtime 8.3 for Machine Learning (Unsupported)
Databricks Runtime 8.2 for Machine Learning (Unsupported)
Databricks Runtime 8.1 for Machine Learning (Unsupported)
Databricks Runtime 8.0 for Machine Learning (Unsupported)
Databricks Runtime 7.6 for Machine Learning (Unsupported)
Databricks Runtime 7.5 for Machine Learning (Unsupported)
Databricks Runtime 7.3 LTS for Machine Learning
Databricks Runtime 5.5 LTS for Machine Learning (Unsupported)

How to use Databricks Runtime ML


In addition to the pre-installed libraries, Databricks Runtime ML differs from Databricks Runtime in the cluster
configuration and in how you manage Python packages.
Create a cluster using Databricks Runtime ML
When you create a cluster, select a Databricks Runtime ML version from the Databricks Runtime Version drop-
down. Both CPU and GPU-enabled ML runtimes are available.

If you select a GPU-enabled ML runtime, you are prompted to select a compatible Driver Type and Worker
Type . Incompatible instance types are grayed out in the drop-downs. GPU-enabled instance types are listed
under the GPU-Accelerated label.

IMPORTANT
Libraries in your workspace that automatically install into all clusters can conflict with the libraries included in
Databricks Runtime ML. Before you create a cluster with Databricks Runtime ML, clear the Install automatically on
all clusters checkbox for conflicting libraries. See the release notes for a list of libraries that are included with each
version of Databricks Runtime ML.
To access data in Unity Catalog for machine learning workflows, you must use a Single User cluster. User Isolation
clusters are not compatible with Databricks Runtime ML.

Manage Python packages


In Databricks Runtime 9.0 ML and above, the virtualenv package manager is used to install Python packages. All
Python packages are installed inside a single environment: /databricks/python3 .
In Databricks Runtime 8.4 ML and below, the Conda package manager is used to install Python packages. All
Python packages are installed inside a single environment: /databricks/python2 on clusters using Python 2 and
/databricks/python3 on clusters using Python 3. Switching (or activating) Conda environments is not
supported.
For information on managing Python libraries, see Libraries.
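
In addition to the cluster-wide environments described above, notebook-scoped Python libraries let you install packages that are visible only to the current notebook. A minimal sketch, assuming a Databricks notebook cell; beautifulsoup4 is an arbitrary example package:

%pip install beautifulsoup4

# In a later cell of the same notebook, the package is importable
# without affecting other notebooks attached to the cluster.
import bs4
print(bs4.__version__)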

Support for automated machine learning


Databricks Runtime ML includes tools to automate the model development process and help you efficiently find
the best performing model.
AutoML automatically creates, tunes, and evaluates a set of models and creates a Python notebook with the
source code for each run so you can review, reproduce, and modify the code.
Managed MLflow manages the end-to-end model lifecycle, including tracking experimental runs, deploying
and sharing models, and maintaining a centralized model registry (a minimal tracking sketch follows this list).
Hyperopt, augmented with the SparkTrials class, automates and distributes ML model parameter tuning.
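
A minimal MLflow tracking sketch, assuming a notebook attached to a Databricks Runtime ML cluster (where the mlflow package is pre-installed); the parameter and metric names are arbitrary illustrations:

import mlflow

# Log an example hyperparameter and metric to an MLflow run.
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)          # arbitrary example parameter
    mlflow.log_metric("val_accuracy", 0.91)   # arbitrary example metric

Runs logged this way can then be reviewed in the MLflow experiment UI.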
Databricks Runtime for Genomics (Deprecated)
7/21/2022 • 2 minutes to read

Databricks Runtime for Genomics (Databricks Runtime Genomics) is a version of Databricks Runtime optimized
for working with genomic and biomedical data. It is a component of the Azure Databricks Unified Analytics
Platform for Genomics. For more information on developing genomics applications, see Genomics guide.

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

What’s in Databricks Runtime for Genomics?


An optimized version of the Databricks-Regeneron open-source library Glow with all its functionalities as
well as:
Spark SQL support for reading and writing variant data
Functions for common workflow elements
Optimizations for common query patterns
Turn-key pipelines parallelized with Apache Spark:
DNASeq
RNASeq
Tumor Normal Sequencing (MutSeq)
Joint Genotyping
SnpEff variant annotation
Hail 0.2 integration
Popular open source libraries, optimized for performance and reliability:
ADAM
GATK
Hadoop-bam
Popular command line tools:
samtools
Reference data (grch37 or 38, known SNP sites)
See the Databricks Runtime for Genomics release notes for a complete list of included libraries and versions.

Requirements
Your Azure Databricks workspace must have Databricks Runtime for Genomics enabled.

Create a cluster using Databricks Runtime for Genomics


When you create a cluster, select a Databricks Runtime for Genomics version from the Databricks Runtime
Version drop-down.
Databricks Light
7/21/2022 • 2 minutes to read

Databricks Light is the Databricks packaging of the open source Apache Spark runtime. It provides a runtime
option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by
Databricks Runtime. In particular, Databricks Light does not support:
Delta Lake
Autopilot features such as autoscaling
Highly concurrent, all-purpose clusters
Notebooks, dashboards, and collaboration features
Connectors to various data sources and BI tools
Databricks Light is a runtime environment for jobs (or “automated workloads”). When you run jobs on
Databricks Light clusters, they are subject to lower Jobs Light Compute pricing. You can select Databricks Light
only when you create or schedule a JAR, Python, or spark-submit job and attach a cluster to that job; you cannot
use Databricks Light to run notebook jobs or interactive workloads.
Databricks Light can be used in the same workspace with clusters running on other Databricks runtimes and
pricing tiers. You don’t need to request a separate workspace to get started.

What’s in Databricks Light?


The release schedule of Databricks Light runtime follows the Apache Spark runtime release schedule. Any
Databricks Light version is based on a specific version of Apache Spark. See the following release notes for
more information:
Databricks Light 2.4 Extended Support

Create a cluster using Databricks Light


When you create a job cluster, select a Databricks Light version from the Databricks Runtime Version drop-
down.

IMPORTANT
Support for Databricks Light on pool-backed job clusters is in Public Preview.
Photon runtime
7/21/2022 • 2 minutes to read

Photon is the native vectorized query engine on Azure Databricks, written to be directly compatible with Apache
Spark APIs so it works with your existing code. It is developed in C++ to take advantage of modern hardware,
and uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level
parallelism in CPUs, enhancing performance on real-world data and applications, all natively on your data
lake. Photon is part of a high-performance runtime that runs your existing SQL and DataFrame API calls faster
and reduces your total cost per workload. Photon is used by default in Databricks SQL warehouses.

Azure Databricks clusters


IMPORTANT
Photon runtime on Azure Databricks clusters is in Public Preview.

To access Photon on Azure Databricks clusters you must explicitly select a runtime containing Photon when you
create the cluster, either using the UI or the APIs (Clusters API 2.0 and Jobs API 2.1, specifying spark_version
using the syntax <databricks-runtime-version>-photon-scala2.12 ). Photon is available for clusters running
Databricks Runtime 9.1 LTS and above.
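
For example, the following sketch creates a Photon-enabled cluster through the Clusters API 2.0 with the Python requests library. The host, token, node type, and runtime version string are placeholders you would adjust for your workspace; the node type must be one of the Photon-supported instance types.

import requests

host = "https://<databricks-instance>"   # per-workspace URL (placeholder)
token = "<personal-access-token>"        # placeholder

payload = {
    "cluster_name": "photon-example",
    "spark_version": "<databricks-runtime-version>-photon-scala2.12",  # for example, 9.1.x-photon-scala2.12
    "node_type_id": "<photon-supported-node-type>",
    "num_workers": 2,
    "autotermination_minutes": 60,
}

response = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()
print(response.json()["cluster_id"])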
Photon supports a limited set of instance types on the driver and worker nodes. Photon instance types consume
DBUs at a different rate than the same instance type running the non-Photon runtime. For more information
about Photon instances and DBU consumption, see the Azure Databricks pricing page.

Photon advantages
Supports SQL and equivalent DataFrame operations against Delta and Parquet tables.
Expected to accelerate queries that process a significant amount of data (100GB+) and include aggregations
and joins.
Faster performance when data is accessed repeatedly from the Delta cache.
More robust scan performance on tables with many columns and many small files.
Faster Delta and Parquet writing using UPDATE , DELETE , MERGE INTO , INSERT , and CREATE TABLE AS SELECT ,
especially for wide tables (hundreds to thousands of columns).
Replaces sort-merge joins with hash-joins.

Limitations
Works on Delta and Parquet tables only for both read and write.
Does not support window and sort operators.
Does not support Spark Structured Streaming.
Does not support UDFs.
Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of
data.
Features not supported by Photon run the same way they would with Databricks Runtime; there is no
performance advantage for those features.
Navigate the workspace
7/21/2022 • 2 minutes to read

An Azure Databricks workspace is an environment for accessing all of your Azure Databricks assets. The
workspace organizes objects (notebooks, libraries, and experiments) into folders, and provides access to data
and computational resources such as clusters and jobs.

You can manage the workspace using the workspace UI, the Databricks CLI, and the Databricks REST API
reference. Most of the articles in the Azure Databricks documentation focus on performing tasks using the
workspace UI.

Use the sidebar


You can access all of your Azure Databricks assets using the sidebar. The sidebar’s contents depend on the
selected persona: Data Science & Engineering , Machine Learning , or SQL .
By default, the sidebar appears in a collapsed state and only the icons are visible. Move your cursor over
the sidebar to expand to the full view.

To change the persona, click the icon below the Databricks logo , and select a persona.
To pin a persona so that it appears the next time you log in, click next to the persona. Click it again to
remove the pin.
Use Menu options at the bottom of the sidebar to set the sidebar mode to Auto (default behavior),
Expand , or Collapse .
When you open a machine learning-related page, the persona automatically switches to Machine
Learning .

Switch to a different workspace


If you have access to more than one workspace in the same account, you can quickly switch among them.

1. Click in the lower left corner of your Azure Databricks workspace.


2. Under Workspaces , select a workspace to switch to it.

Search the workspace


To search the workspace, click Search in the sidebar. See Search workspace for an object for details.

Get help
To get help:

1. Click the Help in the lower left corner:

2. Select one of the following options:


Help Center : Submit a help ticket or search across Azure Databricks documentation, Azure
Databricks Knowledge Base articles, Apache Spark documentation, and Databricks forums.
Release Notes : View Azure Databricks Release notes.
Documentation : View Azure Databricks Documentation.
Knowledge Base : View Azure Databricks Knowledge Base.
Databricks Status : View Azure Databricks status by region.
Feedback : Provide Azure Databricks product feedback.

Work with the browser and workspace objects


The following articles give an overview of workspace assets, how to work with workspace folders and other
objects, and how to find IDs for your workspace and assets:
Workspace assets
Work with workspace objects
Get workspace, cluster, notebook, folder, model, and job identifiers
Per-workspace URLs
Workspace assets
7/21/2022 • 2 minutes to read

This article provides a high-level introduction to Azure Databricks workspace assets.

Clusters
Azure Databricks Data Science & Engineering and Databricks Machine Learning clusters provide a unified
platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics,
and machine learning. A cluster is a type of Azure Databricks compute resource. Other compute resource types
include Azure Databricks SQL warehouses.
For detailed information on managing and using clusters, see Clusters.

Notebooks
A notebook is a web-based interface to documents containing a series of runnable cells (commands) that
operate on files and tables, visualizations, and narrative text. Commands can be run in sequence, referring to the
output of one or more previously run commands.
Notebooks are one mechanism for running code in Azure Databricks. The other mechanism is jobs.
For detailed information on managing and using notebooks, see Notebooks.

Jobs
Jobs are one mechanism for running code in Azure Databricks. The other mechanism is notebooks.
For detailed information on managing and using jobs, see Jobs.

Libraries
A library makes third-party or locally-built code available to notebooks and jobs running on your clusters.
For detailed information on managing and using libraries, see Libraries.

Data
You can import data into a distributed file system mounted into an Azure Databricks workspace and work with it
in Azure Databricks notebooks and clusters. You can also use a wide variety of Apache Spark data sources to
access data.
For detailed information on loading data, see Ingest data into the Azure Databricks Lakehouse.

Repos
Repos are Azure Databricks folders whose contents are co-versioned together by syncing them to a remote Git
repository. Using an Azure Databricks repo, you can develop notebooks in Azure Databricks and use a remote Git
repository for collaboration and version control.
For detailed information on using repos, see Git integration with Databricks Repos.
Models
Model refers to a model registered in MLflow Model Registry. Model Registry is a centralized model store that
enables you to manage the full lifecycle of MLflow models. It provides chronological model lineage, model
versioning, stage transitions, and model and model version annotations and descriptions.
For detailed information on managing and using models, see MLflow Model Registry on Azure Databricks.

Experiments
An MLflow experiment is the primary unit of organization and access control for MLflow machine learning
model training runs; all MLflow runs belong to an experiment. Each experiment lets you visualize, search, and
compare runs, as well as download run artifacts or metadata for analysis in other tools.
For detailed information on managing and using experiments, see Experiments.
Work with workspace objects
7/21/2022 • 3 minutes to read

This article explains how to work with folders and other workspace objects.

Folders
Folders contain all static assets within a workspace: notebooks, libraries, experiments, and other folders. Icons
indicate the type of the object contained in a folder. Click a folder name to open or close the folder and view its
contents.

To perform an action on a folder, click the at the right side of a folder and select a menu item.

Special folders
An Azure Databricks workspace has three special folders: Workspace, Shared, and Users. You cannot rename or
move a special folder.
Workspace root folder
To navigate to the Workspace root folder:

1. Click Workspace .

2. Click the icon.


The Workspace root folder is a container for all of your organization’s Azure Databricks static assets.
Within the Workspace root folder:
Shared is for sharing objects across your organization. All users have full permissions for all objects in
Shared.
Users contains a folder for each user.
By default, the Workspace root folder and all of its contained objects are available to all users. You can control
who can manage and access objects by enabling workspace access control and setting permissions.
To sort all objects alphabetically or by type across all folders, click the to the right of the Workspace folder
and select Sort > [Alphabetical | Type] :

User home folders


Each user has a home folder for their notebooks and libraries:

If workspace access control is enabled, by default objects in this folder are private to that user.

NOTE
When you remove a user, the user’s home folder is retained.

Workspace object operations


The objects stored in the Workspace root folder are folders, notebooks, libraries, and experiments. To perform an

action on a Workspace object, right-click the object or click the at the right side of an object.
From the drop-down menu you can:
If the object is a folder:
Create a notebook, library, MLflow experiment, or folder.
Import a Databricks archive.
Clone the object.
Rename the object.
Move the object to another folder.
Move the object to Trash. See Delete an object.
Export a folder or notebook as a Databricks archive.
If the object is a notebook, copy the notebook’s file path.
If you have Workspace access control enabled, set permissions on the object.
Search workspace for an object

IMPORTANT
This feature is in Public Preview.

NOTE
The search behavior described in this section is not supported on workspaces that use customer-managed keys for
encryption. In those workspaces, you can click Search in the sidebar and type a search string in the Search
Workspace field. As you type, objects whose name contains the search string are listed. Click a name from the list to
open that item in the workspace.

To search the workspace for an object, click Search in the sidebar. The Search dialog appears.
To search for a text string, type it into the search field and press Enter. The system searches the names of all
notebooks, folders, files, libraries, and Repos in the workspace that you have access to. It also searches notebook
commands, but not text in non-notebook files.
You can also search for items by type (file, folder, notebooks, libraries, or repo). A text string is not required.
When you press Enter, workspace objects that match the search criteria appear in the dialog. Click a name from
the list to open that item in the workspace.
Access recently used objects
You can access recently used objects by clicking Recents in the sidebar or the Recents column on the
workspace landing page.
NOTE
The Recents list is cleared after deleting the browser cache and cookies.

Move an object

To move an object, you can drag-and-drop the object or click the or at the right side of the object and
select Move :

To move all the objects inside a folder to another folder, select the Move action on the source folder and select
the Move all items in ‘’ rather than the folder itself checkbox.

Delete an object

To delete a folder, notebook, library or experiment, click the or at the right side of the object and select
Move to Trash . The Trash folder is automatically emptied (purged) after 30 days .

You can permanently delete an object in the Trash by selecting the to the right of the object and selecting
Delete Immediately .

You can permanently delete all objects in the Trash by selecting the to the right of the Trash folder and
selecting Empty Trash .

Restore an object
You restore an object by dragging it from the Trash folder to another folder.
Get workspace, cluster, notebook, folder, model, and
job identifiers
7/21/2022 • 3 minutes to read

This article explains how to get workspace, cluster, directory, model, notebook, and job identifiers and URLs in
Azure Databricks.

Workspace instance names, URLs, and IDs


A unique instance name, also known as a per-workspace URL, is assigned to each Azure Databricks deployment.
It is the fully-qualified domain name used to log into your Azure Databricks deployment and make API requests.
An Azure Databricks workspace is where the Azure Databricks platform runs and where you can create Spark
clusters and schedule workloads. A workspace has a unique numerical workspace ID.
Per-workspace URL
The unique per-workspace URL has the format adb-<workspace-id>.<random-number>.azuredatabricks.net . The
workspace ID appears immediately after adb- and before the “dot” (.). For the per-workspace URL
https://adb-5555555555555555.19.azuredatabricks.net/ :

The instance name is adb-5555555555555555.19.azuredatabricks.net .


The workspace ID is 5555555555555555 .
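
As a small illustration of this format, the following sketch extracts the workspace ID from a per-workspace URL using only standard-library string handling:

from urllib.parse import urlparse

def workspace_id_from_url(per_workspace_url):
    """Return the workspace ID from a URL of the form
    https://adb-<workspace-id>.<random-number>.azuredatabricks.net/."""
    hostname = urlparse(per_workspace_url).hostname   # adb-5555555555555555.19.azuredatabricks.net
    first_label = hostname.split(".")[0]              # adb-5555555555555555
    return first_label.split("-", 1)[1]               # 5555555555555555

print(workspace_id_from_url("https://adb-5555555555555555.19.azuredatabricks.net/"))
# 5555555555555555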
Determine per-workspace URL
You can determine the per-workspace URL for your workspace:
In your browser when you are logged in:

In the Azure portal, by selecting the resource and noting the value in the URL field:
Using the Azure API. See Get a per-workspace URL using the Azure API.
Legacy regional URL

IMPORTANT
Avoid using legacy regional URLs. They may not work for new workspaces, are less reliable, and exhibit lower performance
than per-workspace URLs.

The legacy regional URL is composed of the region where the Azure Databricks workspace is deployed plus the
domain azuredatabricks.net , for example, https://westus.azuredatabricks.net/ .
If you log in to a legacy regional URL like https://westus.azuredatabricks.net/ , the instance name is
westus.azuredatabricks.net .
The workspace ID appears in the URL only after you have logged in using a legacy regional URL. It appears
after the o= . In the URL https://<databricks-instance>/?o=6280049833385130 , the workspace ID is
6280049833385130 .

Cluster URL and ID


An Azure Databricks cluster provides a unified platform for various use cases such as running production ETL
pipelines, streaming analytics, ad-hoc analytics, and machine learning. Each cluster has a unique ID called the
cluster ID. This applies to both all-purpose and job clusters. To get the details of a cluster using the REST API, the
cluster ID is essential.
To get the cluster ID, click the Clusters tab in the sidebar and then select a cluster name. The cluster ID is the
number after the /clusters/ component in the URL of this page:

https://<databricks-instance>/#/setting/clusters/<cluster-id>

In the following screenshot, the cluster ID is 0831-211914-clean632 .
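
For example, a minimal sketch that retrieves a cluster's details through the Clusters API 2.0 with the Python requests library once you have the cluster ID; the host, token, and cluster ID values are placeholders:

import requests

host = "https://<databricks-instance>"    # per-workspace URL (placeholder)
token = "<personal-access-token>"         # placeholder
cluster_id = "<cluster-id>"               # for example, 0831-211914-clean632

response = requests.get(
    f"{host}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"cluster_id": cluster_id},
)
response.raise_for_status()
details = response.json()
print(details["cluster_name"], details["state"])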


Notebook URL and ID
A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative
text. Notebooks are one interface for interacting with Azure Databricks. Each notebook has a unique ID. The
notebook URL includes the notebook ID, so the notebook URL is unique to each notebook. It can be shared with
anyone on the Azure Databricks platform who has permission to view and edit the notebook. In addition, each notebook
command (cell) has a different URL.
To get to a notebook URL, open a notebook. To get to a cell URL, click the contents of the command.

In this notebook:
The notebook URL is:

https://westus.azuredatabricks.net/?o=6280049833385130#notebook/1940481404050342

The notebook ID is 1940481404050342 .


The command (cell) URL is

https://westus.azuredatabricks.net/?
o=6280049833385130#notebook/1940481404050342/command/2432220274659491

Folder ID
A folder is a directory used to store files that can be used in the Azure Databricks workspace. These files can be
notebooks, libraries, or subfolders. A specific ID is associated with each folder and each individual subfolder.
The Permissions API refers to this ID as a directory_id , which is used in setting and updating permissions for
a folder.
To retrieve the directory_id , use the Workspace API:

curl -n -X GET -H 'Content-Type: application/json' -d '{"path": "/Users/me@example.com/MyFolder"}' \
https://<databricks-instance>/api/2.0/workspace/get-status

This is an example of the API call response:

{
"object_type": "DIRECTORY",
"path": "/Users/me@example.com/MyFolder",
"object_id": 123456789012345
}

Model ID
A model refers to an MLflow registered model, which lets you manage MLflow Models in production through
stage transitions and versioning. The registered model ID is required for changing the permissions on the model
programmatically through the Permissions API 2.0.
To get the ID of a registered model, you can use the REST API (latest) endpoint
mlflow/databricks/registered-models/get . For example, the following code returns the registered model object
with its properties, including its ID:

curl -n -X GET -H 'Content-Type: application/json' -d '{"name": "model_name"}' \
https://<databricks-instance>/api/2.0/mlflow/databricks/registered-models/get

The returned value has the format:

{
"registered_model_databricks": {
"name":"model_name",
"id":"ceb0477eba94418e973f170e626f4471"
}
}

Job URL and ID


A job is a way of running a notebook or JAR either immediately or on a scheduled basis.
To get a job URL, click Workflows in the sidebar and click a job name. The job ID is after the text #job/ in the
URL. The job URL is required to troubleshoot the root cause of failed job runs.
In the following screenshot, the job URL is:

https://westus.azuredatabricks.net/?o=6280049833385130#job/1

In this example, the job ID is 1 .


Per-workspace URLs
7/21/2022 • 2 minutes to read

In April 2020, Azure Databricks added a new unique per-workspace URL for each workspace. This per-
workspace URL has the format
adb-<workspace-id>.<random-number>.azuredatabricks.net

The per-workspace URL replaces the deprecated regional URL ( <region>.azuredatabricks.net ) to access
workspaces.

IMPORTANT
Avoid using legacy regional URLs. They may not work for new workspaces, are less reliable, and exhibit lower performance
than per-workspace URLs.

Launch a workspace using the per-workspace URL


In the Azure portal, go to the Azure Databricks service resource page for your workspace and either click
Launch Workspace or copy the per-workspace URL as displayed on the resource page and paste it into your
browser address bar.

Get a per-workspace URL using the Azure API


Use the Azure API Workspaces - Get endpoint to get workspace details, including per-workspace URL. The per-
workspace URL is returned in the properties.workspaceUrl field in the response object.
Migrate your scripts to use per-workspace URLs
Azure Databricks users typically write scripting or other automation that references workspaces in one of two
ways:
You create all workspaces in the same region and hardcode the legacy regional URL in the script.
Because you need an API token for each workspace, you also have a list of tokens either stored in the
script itself or in some other database. If this is the case, we recommend that you store a list of
<per-workspace-url, api-token> pairs and remove any hardcoded regional URLs (see the sketch after this list).

You create workspaces in one or more regions and have a list of <regional-url, api-token> pairs either
stored in the script itself or in a database. If this is the case, we recommend that you store the per-
workspace URL instead of the regional URL in the list.
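
A minimal sketch of the recommended approach: keep a list of per-workspace URL and token pairs and iterate over it, rather than hardcoding regional URLs. The URLs and tokens below are placeholders, and in practice tokens belong in a secret store rather than in the script itself.

# Placeholder per-workspace URLs and tokens; store real tokens in a secret store.
workspaces = [
    ("https://adb-1111111111111111.1.azuredatabricks.net", "<token-for-workspace-1>"),
    ("https://adb-2222222222222222.2.azuredatabricks.net", "<token-for-workspace-2>"),
]

for per_workspace_url, api_token in workspaces:
    # Call whichever REST endpoints your automation needs against each workspace,
    # using the per-workspace URL instead of a hardcoded regional URL.
    print(f"Calling {per_workspace_url}/api/2.0/... with its workspace-specific token")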

NOTE
Because both regional URLs and per-workspace URLs are supported, any existing automation that uses regional URLs to
reference workspaces that were created before the introduction of per-workspace URLs will continue to work. Although
Databricks recommends that you update any automation to use per-workspace URLs, doing so is not required in this
case.

Find the legacy regional URL for a workspace


If you need to find the legacy regional URL for a workspace, run nslookup on the per-workspace URL.

$ nslookup adb-<workspace-id>.<random-number>.azuredatabricks.net
Server: 192.168.50.1
Address: 192.168.50.1#53

Non-authoritative answer:
adb-<workspace-id>.<random-number>.azuredatabricks.net canonical name = eastus-c3.azuredatabricks.net.
Name: eastus-c3.azuredatabricks.net
Address: 20.42.4.211
DataFrames and Datasets
7/21/2022 • 2 minutes to read

This section gives an introduction to Apache Spark DataFrames and Datasets using Azure Databricks notebooks.
Introduction to DataFrames - Python
Create DataFrames
Work with DataFrames
DataFrame FAQs
Introduction to DataFrames - Scala
Create DataFrames
Work with DataFrames
Frequently asked questions (FAQ)
Introduction to Datasets
Create a Dataset
Work with Datasets
Convert a Dataset to a DataFrame
Complex and nested data
Complex nested data notebook
Aggregators
Dataset aggregator notebook
Dates and timestamps
Dates and calendars
Timestamps and time zones
Construct dates and timestamps
Collect dates and timestamps
For reference information about DataFrames and Datasets, Azure Databricks recommends the following Apache
Spark API reference:
Python API
Scala API
Java API
Introduction to DataFrames - Python
7/21/2022 • 11 minutes to read

This article provides several coding examples of common PySpark DataFrame APIs that use Python.
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can
think of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects. For more information and
examples, see the Quickstart on the Apache Spark documentation website.

Create DataFrames
This example uses the Row class from Spark SQL to create several DataFrames. The contents of a few of these
DataFrames are then printed.

# import pyspark class Row from module sql


from pyspark.sql import *

# Create Example Data - Departments and Employees

# Create the Departments


department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')
department3 = Row(id='345678', name='Theater and Drama')
department4 = Row(id='901234', name='Indoor Recreation')

# Create the Employees


Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
employee2 = Employee('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)
employee3 = Employee('matei', None, 'no-reply@waterloo.edu', 140000)
employee4 = Employee(None, 'wendell', 'no-reply@berkeley.edu', 160000)
employee5 = Employee('michael', 'jackson', 'no-reply@neverla.nd', 80000)

# Create the DepartmentWithEmployees instances from Departments and Employees


departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee5, employee4])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

print(department1)
print(employee2)
print(departmentWithEmployees1.employees[0].email)

Output:

Row(id='123456', name='Computer Science')


Row(firstName='xiangrui', lastName='meng', email='no-reply@stanford.edu', salary=120000)
no-reply@berkeley.edu

See DataFrame Creation in the PySpark documentation.


Create DataFrames from a list of the rows
This example uses the createDataFrame method of the SparkSession (which is represented by the Azure
Databricks-provided spark variable) to create a DataFrame from a list of rows from the previous example.
departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
df1 = spark.createDataFrame(departmentsWithEmployeesSeq1)

df1.show(truncate=False)

departmentsWithEmployeesSeq2 = [departmentWithEmployees3, departmentWithEmployees4]


df2 = spark.createDataFrame(departmentsWithEmployeesSeq2)

df2.show(truncate=False)

Output:

+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|department |employees
|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|{123456, Computer Science} |[{michael, armbrust, no-reply@berkeley.edu, 100000}, {xiangrui, meng, no-
reply@stanford.edu, 120000}]|
|{789012, Mechanical Engineering}|[{matei, null, no-reply@waterloo.edu, 140000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
+--------------------------------+--------------------------------------------------------------------------
---------------------------+

+---------------------------+-------------------------------------------------------------------------------
-----------------+
|department |employees
|
+---------------------------+-------------------------------------------------------------------------------
-----------------+
|{345678, Theater and Drama}|[{michael, jackson, no-reply@neverla.nd, 80000}, {null, wendell, no-
reply@berkeley.edu, 160000}]|
|{901234, Indoor Recreation}|[{xiangrui, meng, no-reply@stanford.edu, 120000}, {matei, null, no-
reply@waterloo.edu, 140000}] |
+---------------------------+-------------------------------------------------------------------------------
-----------------+

Work with DataFrames


Union two DataFrames
This example uses the union method to combine the rows of the two DataFrames from the previous example
into a new DataFrame.

unionDF = df1.union(df2)
unionDF.show(truncate=False)

Output:
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|department |employees
|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|{123456, Computer Science} |[{michael, armbrust, no-reply@berkeley.edu, 100000}, {xiangrui, meng, no-
reply@stanford.edu, 120000}]|
|{789012, Mechanical Engineering}|[{matei, null, no-reply@waterloo.edu, 140000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{345678, Theater and Drama} |[{michael, jackson, no-reply@neverla.nd, 80000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{901234, Indoor Recreation} |[{xiangrui, meng, no-reply@stanford.edu, 120000}, {matei, null, no-
reply@waterloo.edu, 140000}] |
+--------------------------------+--------------------------------------------------------------------------
---------------------------+

Write the unioned DataFrame to a Parquet file


This example uses the rm command (dbutils.fs.rm) of the File system utility (dbutils.fs) in Databricks
Utilities to remove the specified Parquet file, if it exists. It then uses the DataFrame's write method to get a
DataFrameWriter and save the DataFrame from the previous example to the specified location in the
Azure Databricks workspace in Parquet format.

# Remove the file if it exists


dbutils.fs.rm("/tmp/databricks-df-example.parquet", True)
unionDF.write.format("parquet").save("/tmp/databricks-df-example.parquet")

Read a DataFrame from the Parquet file


This example uses the read method to get a DataFrameReader, loads the Parquet file from the specified location
into a DataFrame, and then displays the DataFrame's contents.

parquetDF = spark.read.format("parquet").load("/tmp/databricks-df-example.parquet")
parquetDF.show(truncate=False)

Output:

+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|department |employees
|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|{789012, Mechanical Engineering}|[{matei, null, no-reply@waterloo.edu, 140000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{901234, Indoor Recreation} |[{xiangrui, meng, no-reply@stanford.edu, 120000}, {matei, null, no-
reply@waterloo.edu, 140000}] |
|{345678, Theater and Drama} |[{michael, jackson, no-reply@neverla.nd, 80000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{123456, Computer Science} |[{michael, armbrust, no-reply@berkeley.edu, 100000}, {xiangrui, meng, no-
reply@stanford.edu, 120000}]|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+

Explode the employees column


This example uses the select method of the preceding DataFrame to project a set of expressions into a new
DataFrame. In this case, the explode function returns a new row for each of the employees items. The alias
method uses e as a shorthand for the column. The selectExpr method of the new DataFrame projects a set of
SQL expressions into a new DataFrame.

from pyspark.sql.functions import explode

explodeDF = unionDF.select(explode("employees").alias("e"))
flattenDF = explodeDF.selectExpr("e.firstName", "e.lastName", "e.email", "e.salary")

flattenDF.show(truncate=False)

Output:

+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
|null |wendell |no-reply@berkeley.edu|160000|
|michael |jackson |no-reply@neverla.nd |80000 |
|null |wendell |no-reply@berkeley.edu|160000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+

Use filter() to return the rows that match a predicate


This example uses the filter method of the preceding DataFrame to display only those rows where the
firstName field’s value is xiangrui . It then uses the sort method to sort the results by the value of the rows’
lastName field.

filterDF = flattenDF.filter(flattenDF.firstName == "xiangrui").sort(flattenDF.lastName)


filterDF.show(truncate=False)

Output:

+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|xiangrui |meng |no-reply@stanford.edu|120000|
|xiangrui |meng |no-reply@stanford.edu|120000|
+---------+--------+---------------------+------+

This example is similar to the previous one, except that it displays only those rows where the firstName field’s
value is xiangrui or michael .

from pyspark.sql.functions import col, asc

# Use `|` instead of `or`


filterDF = flattenDF.filter((col("firstName") == "xiangrui") | (col("firstName") == "michael")).sort(asc("lastName"))
filterDF.show(truncate=False)

Output:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|michael |jackson |no-reply@neverla.nd |80000 |
|xiangrui |meng |no-reply@stanford.edu|120000|
|xiangrui |meng |no-reply@stanford.edu|120000|
+---------+--------+---------------------+------+

The where() clause is equivalent to filter()

This example is equivalent to the preceding example, except that it uses the where method instead of the filter
method.

whereDF = flattenDF.where((col("firstName") == "xiangrui") | (col("firstName") == "michael")).sort(asc("lastName"))
whereDF.show(truncate=False)

Output:

+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|michael |jackson |no-reply@neverla.nd |80000 |
|xiangrui |meng |no-reply@stanford.edu|120000|
|xiangrui |meng |no-reply@stanford.edu|120000|
+---------+--------+---------------------+------+

Replace null values with -- using DataFrame Na function


This example uses the fillna method of the previous flattenDF DataFrame to replace all null values with the
characters -- .

nonNullDF = flattenDF.fillna("--")
nonNullDF.show(truncate=False)

Before:

+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
|null |wendell |no-reply@berkeley.edu|160000|
|michael |jackson |no-reply@neverla.nd |80000 |
|null |wendell |no-reply@berkeley.edu|160000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+

After:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |-- |no-reply@waterloo.edu|140000|
|-- |wendell |no-reply@berkeley.edu|160000|
|michael |jackson |no-reply@neverla.nd |80000 |
|-- |wendell |no-reply@berkeley.edu|160000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |-- |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+

Retrieve only rows with missing firstName or lastName

This example uses the filter method of the previous flattenDF DataFrame along with the isNull method of the
Column class to display all rows where the firstName or lastName field has a null value.

filterNonNullDF = flattenDF.filter(col("firstName").isNull() | col("lastName").isNull()).sort("email")


filterNonNullDF.show(truncate=False)

Output:

+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|null |wendell |no-reply@berkeley.edu|160000|
|null |wendell |no-reply@berkeley.edu|160000|
|matei |null |no-reply@waterloo.edu|140000|
|matei |null |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+

Example aggregations using agg() and countDistinct()

This example uses the select, groupBy, and agg methods of the previous nonNullDF DataFrame to select only
the rows’ firstName and lastName fields, group the results by the firstName field’s values, and then display
the number of distinct lastName field values for each of those first names. For each first name, only one distinct
last name is found, except for michael , which has both michael armbrust and michael jackson .

from pyspark.sql.functions import countDistinct

countDistinctDF = nonNullDF.select("firstName", "lastName") \
  .groupBy("firstName") \
  .agg(countDistinct("lastName").alias("distinct_last_names"))

countDistinctDF.show()

Output:

+---------+-------------------+
|firstName|distinct_last_names|
+---------+-------------------+
| null| 1|
| xiangrui| 1|
| matei| 0|
| michael| 2|
+---------+-------------------+
Compare the DataFrame and SQL query physical plans
This example uses the explain method of the preceding example’s DataFrame to print the physical plan for
debugging purposes.

TIP
They should be the same.

countDistinctDF.explain()

This example uses the createOrReplaceTempView method of the preceding example’s DataFrame to create a
local temporary view with this DataFrame. This temporary view exists until the related Spark session goes out of
scope. This example then uses the Spark session’s sql method to run a query on this temporary view. The
physical plan for this query is then displayed. The results of this explain call should be the same as the
previous explain call.

# Register the DataFrame as a temporary view so that we can query it by using SQL.
nonNullDF.createOrReplaceTempView("databricks_df_example")

# Perform the same query as the preceding DataFrame and then display its physical plan.
countDistinctDF_sql = spark.sql('''
SELECT firstName, count(distinct lastName) AS distinct_last_names
FROM databricks_df_example
GROUP BY firstName
''')

countDistinctDF_sql.explain()

Sum up all the salaries


This example uses the agg method of the previous nonNullDF DataFrame to display the sum of all of the rows’
salaries.

salarySumDF = nonNullDF.agg({"salary" : "sum"})
salarySumDF.show()

Output:

+-----------+
|sum(salary)|
+-----------+
| 1020000|
+-----------+

This example displays the underlying data type of the salary field for the preceding DataFrame, which is a
bigint .

match = 'salary'

for key, value in nonNullDF.dtypes:
    if key == match:
        print(f"Data type of '{match}' is '{value}'.")

Output:
Data type of 'salary' is 'bigint'.

Print the summary statistics for the salaries


This example uses the describe method of the previous nonNullDF DataFrame to display basic statistics for the
salary field.

nonNullDF.describe("salary").show()

Output:

+-------+------------------+
|summary| salary|
+-------+------------------+
| count| 8|
| mean| 127500.0|
| stddev|28157.719063467175|
| min| 80000|
| max| 160000|
+-------+------------------+

An example using pandas and Matplotlib integration


This example uses the pandas and Matplotlib libraries to display the previous nonNullDF DataFrame’s
information graphically. This example uses the toPandas method of the DataFrame to output the DataFrame’s
contents as a pandas DataFrame, and it uses the clf and plot methods of matplotlib.pyplot in Matplotlib to clear
the plotting surface and then create the actual plot.

import pandas as pd
import matplotlib.pyplot as plt
plt.clf()
pdDF = nonNullDF.toPandas()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()

Output: a bar chart of each row’s salary, labeled by first name.

Cleanup: remove the Parquet file


This example uses the rm command (dbutils.fs.rm) method of the File system utility (dbutils.fs) in Databricks
Utilities to delete the Parquet file that was initially written toward the beginning of this article.

dbutils.fs.rm("/tmp/databricks-df-example.parquet", True)
DataFrame FAQs
This FAQ addresses common use cases and example usage of the available APIs. For more detailed API
descriptions, see the PySpark documentation.
How can I get better performance with DataFrame UDFs?
If the functionality you need exists in the built-in functions, using them performs better than a UDF because
they compile and run in the JVM. Example usage follows. Also see the PySpark Functions API reference. Use the
built-in functions and the withColumn() API to add new columns. You can also use withColumnRenamed() to
replace an existing column after the transformation.

from pyspark.sql import functions as F
from pyspark.sql.types import *

# Build an example DataFrame dataset to work with.
dbutils.fs.rm("/tmp/dataframe_sample.csv", True)
dbutils.fs.put("/tmp/dataframe_sample.csv", """id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY
4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY
5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD
""", True)

df = spark.read.format("csv").options(header='true', delimiter='|').load("/tmp/dataframe_sample.csv")
df.printSchema()

# Instead of registering a UDF, call the builtin functions to perform operations on the columns.
# This will provide a performance improvement as the builtins compile and run in the platform's JVM.

# Convert to a Date type
df = df.withColumn('date', F.to_date(df.end_date))

# Parse out the date only
df = df.withColumn('date_only', F.regexp_replace(df.end_date,' (\d+)[:](\d+)[:](\d+).*$', ''))

# Split a string and index a field
df = df.withColumn('city', F.split(df.location, '-')[1])

# Perform a date diff function
df = df.withColumn('date_diff', F.datediff(F.to_date(df.end_date), F.to_date(df.start_date)))

df.createOrReplaceTempView("sample_df")
display(spark.sql("select * from sample_df"))

I want to convert the DataFrame back to JSON strings to send back to Kafka.
There is an underlying toJSON() function that returns an RDD of JSON strings using the column names and
schema to produce the JSON records.

rdd_json = df.toJSON()
rdd_json.take(2)

My UDF takes a parameter including the column to operate on. How do I pass this parameter?
There is a function available called lit() that creates a constant column.
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, BooleanType

add_n = udf(lambda x, y: x + y, IntegerType())

# We register a UDF that adds a column to the DataFrame, and we cast the id column to an Integer type.
df = df.withColumn('id_offset', add_n(F.lit(1000), df.id.cast(IntegerType())))
display(df)

# Any constants used by the UDF are automatically passed through to the workers.
N = 90
last_n_days = udf(lambda x: x < N, BooleanType())

df_filtered = df.filter(last_n_days(df.date_diff))
display(df_filtered)

I have a table in the Hive metastore and I’d like to access the table as a DataFrame. What’s the best
way to define this?
There are multiple ways to define a DataFrame from a registered table. Call spark.table(tableName) or select and
filter specific columns using a SQL query:

# Both return DataFrame types
df_1 = spark.table("sample_df")
df_2 = spark.sql("select * from sample_df")

I’d like to clear all the cached tables on the current cluster.
There’s an API available to do this at a global level or per table.

spark.catalog.clearCache()
spark.catalog.cacheTable("sample_df")
spark.catalog.uncacheTable("sample_df")

I’d like to compute aggregates on columns. What’s the best way to do this?
The agg(*exprs) method takes a list of column names and expressions for the type of aggregation you’d like to
compute. See pyspark.sql.DataFrame.agg. You can use built-in functions in the expressions for each column.

# Provide the min, count, and avg and groupBy the location column. Display the results.
agg_df = df.groupBy("location").agg(F.min("id"), F.count("id"), F.avg("date_diff"))
display(agg_df)

I’d like to write out the DataFrames to Parquet, but would like to partition on a particular column.
You can use the following APIs to accomplish this. Ensure the code does not create a large number of partition
columns with the datasets, otherwise the overhead of the metadata can cause significant slowdowns. If there is
a SQL table backed by this directory, you will need to call refresh table <table-name> to update the metadata
prior to the query.

df = df.withColumn('end_month', F.month('end_date'))
df = df.withColumn('end_year', F.year('end_date'))
dbutils.fs.rm("/tmp/sample_table", True)
df.write.partitionBy("end_year", "end_month").format("parquet").save("/tmp/sample_table")
display(dbutils.fs.ls("/tmp/sample_table"))

How do I properly handle cases where I want to filter out NULL data?
You can use filter() and provide similar syntax as you would with a SQL query.

null_item_schema = StructType([StructField("col1", StringType(), True),
                               StructField("col2", IntegerType(), True)])

null_df = spark.createDataFrame([("test", 1), (None, 2)], null_item_schema)
display(null_df.filter("col1 IS NOT NULL"))

How do I infer the schema using the CSV or spark-avro libraries?

There is an inferSchema option flag. Providing a header ensures appropriate column naming.

adult_df = spark.read.\
  format("csv").\
  option("header", "false").\
  option("inferSchema", "true").load("dbfs:/databricks-datasets/adult/adult.data")
adult_df.printSchema()

You have a delimited string dataset that you want to convert to their data types. How would you
accomplish this?
Use the RDD APIs to filter out the malformed rows and map the values to the appropriate types, as in the
sketch below. We define a function that filters the items using regular expressions.
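
The following is a minimal sketch (not part of the original article) that assumes the pipe-delimited file
/tmp/dataframe_sample.csv written earlier in this FAQ. It drops the header and any rows that don’t match a
regular expression, then maps the surviving fields to typed values:

import re
from pyspark.sql import Row

# Keep only lines shaped like: id|timestamp|timestamp|XX-YY
line_re = re.compile(r"^\d+\|[\d\- :]+\|[\d\- :]+\|[A-Z]{2}-[A-Z]{2}$")

raw = sc.textFile("/tmp/dataframe_sample.csv")
header = raw.first()

typed_rdd = (raw
  .filter(lambda line: line != header and line_re.match(line))   # drop the header and malformed rows
  .map(lambda line: line.split("|"))
  .map(lambda f: Row(id=int(f[0]), end_date=f[1], start_date=f[2], location=f[3])))

display(spark.createDataFrame(typed_rdd))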
Introduction to DataFrames - Scala
7/21/2022 • 6 minutes to read

This article demonstrates a number of common Spark DataFrame functions using Scala.

Create DataFrames
// Create the case classes for our domain
case class Department(id: String, name: String)
case class Employee(firstName: String, lastName: String, email: String, salary: Int)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

// Create the Departments


val department1 = new Department("123456", "Computer Science")
val department2 = new Department("789012", "Mechanical Engineering")
val department3 = new Department("345678", "Theater and Drama")
val department4 = new Department("901234", "Indoor Recreation")

// Create the Employees


val employee1 = new Employee("michael", "armbrust", "no-reply@berkeley.edu", 100000)
val employee2 = new Employee("xiangrui", "meng", "no-reply@stanford.edu", 120000)
val employee3 = new Employee("matei", null, "no-reply@waterloo.edu", 140000)
val employee4 = new Employee(null, "wendell", "no-reply@princeton.edu", 160000)
val employee5 = new Employee("michael", "jackson", "no-reply@neverla.nd", 80000)

// Create the DepartmentWithEmployees instances from Departments and Employees


val departmentWithEmployees1 = new DepartmentWithEmployees(department1, Seq(employee1, employee2))
val departmentWithEmployees2 = new DepartmentWithEmployees(department2, Seq(employee3, employee4))
val departmentWithEmployees3 = new DepartmentWithEmployees(department3, Seq(employee5, employee4))
val departmentWithEmployees4 = new DepartmentWithEmployees(department4, Seq(employee2, employee3))

Create DataFrames from a list of the case classes

val departmentsWithEmployeesSeq1 = Seq(departmentWithEmployees1, departmentWithEmployees2)


val df1 = departmentsWithEmployeesSeq1.toDF()
display(df1)

val departmentsWithEmployeesSeq2 = Seq(departmentWithEmployees3, departmentWithEmployees4)


val df2 = departmentsWithEmployeesSeq2.toDF()
display(df2)

Work with DataFrames


Union two DataFrames

val unionDF = df1.union(df2)


display(unionDF)

Write the unioned DataFrame to a Parquet file

// Remove the file if it exists


dbutils.fs.rm("/tmp/databricks-df-example.parquet", true)
unionDF.write.format("parquet").save("/tmp/databricks-df-example.parquet")
Read a DataFrame from the Parquet file

val parquetDF = spark.read.format("parquet").load("/tmp/databricks-df-example.parquet")

Explode the employees column

import org.apache.spark.sql.functions._

val explodeDF = parquetDF.select(explode($"employees"))


display(explodeDF)

Flatten the fields of the employee class into columns

val flattenDF = explodeDF.select($"col.*")


flattenDF.show()

+---------+--------+--------------------+------+
|firstName|lastName| email|salary|
+---------+--------+--------------------+------+
| matei| null|no-reply@waterloo...|140000|
| null| wendell|no-reply@princeto...|160000|
| michael|armbrust|no-reply@berkeley...|100000|
| xiangrui| meng|no-reply@stanford...|120000|
| michael| jackson| no-reply@neverla.nd| 80000|
| null| wendell|no-reply@princeto...|160000|
| xiangrui| meng|no-reply@stanford...|120000|
| matei| null|no-reply@waterloo...|140000|
+---------+--------+--------------------+------+

Use filter() to return the rows that match a predicate

val filterDF = flattenDF
  .filter($"firstName" === "xiangrui" || $"firstName" === "michael")
  .sort($"lastName".asc)
display(filterDF)

The where() clause is equivalent to filter()

val whereDF = flattenDF
  .where($"firstName" === "xiangrui" || $"firstName" === "michael")
  .sort($"lastName".asc)
display(whereDF)

Replace null values with -- using DataFrame Na function

val nonNullDF = flattenDF.na.fill("--")


display(nonNullDF)

Retrieve rows with missing firstName or lastName

val filterNonNullDF = nonNullDF.filter($"firstName" === "--" || $"lastName" === "--").sort($"email".asc)


display(filterNonNullDF)

Example aggregations using agg() and countDistinct()


// Find the distinct last names for each first name
val countDistinctDF = nonNullDF.select($"firstName", $"lastName")
.groupBy($"firstName")
.agg(countDistinct($"lastName") as "distinct_last_names")
display(countDistinctDF)

Compare the DataFrame and SQL query physical plans

TIP
They should be the same.

countDistinctDF.explain()

// register the DataFrame as a temp view so that we can query it using SQL
nonNullDF.createOrReplaceTempView("databricks_df_example")

spark.sql("""
SELECT firstName, count(distinct lastName) as distinct_last_names
FROM databricks_df_example
GROUP BY firstName
""").explain

Sum up all the salaries

val salarySumDF = nonNullDF.agg("salary" -> "sum")


display(salarySumDF)

Print the summary statistics for the salaries

nonNullDF.describe("salary").show()

Cleanup: remove the Parquet file

dbutils.fs.rm("/tmp/databricks-df-example.parquet", true)

Frequently asked questions (FAQ)


This FAQ addresses common use cases and example usage of the available APIs. For more detailed API
descriptions, see the DataFrameReader and DataFrameWriter documentation.
How can I get better performance with DataFrame UDFs?
If the functionality exists in the available built-in functions, using these will perform better.
We use the built-in functions and the withColumn() API to add new columns. We could have also used
withColumnRenamed() to replace an existing column after the transformation.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Build an example DataFrame dataset to work with.


dbutils.fs.rm("/tmp/dataframe_sample.csv", true)
dbutils.fs.put("/tmp/dataframe_sample.csv", """
id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY
4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY
5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-LA
""", true)

val conf = new Configuration


conf.set("textinputformat.record.delimiter", "\n")
val rdd = sc.newAPIHadoopFile("/tmp/dataframe_sample.csv", classOf[TextInputFormat], classOf[LongWritable],
classOf[Text], conf).map(_._2.toString).filter(_.nonEmpty)

val header = rdd.first()


// Parse the header line
val rdd_noheader = rdd.filter(x => !x.contains("id"))
// Convert the RDD[String] to an RDD[Rows]. Create an array using the delimiter and use Row.fromSeq()
val row_rdd = rdd_noheader.map(x => x.split('|')).map(x => Row.fromSeq(x))

val df_schema =
StructType(
header.split('|').map(fieldName => StructField(fieldName, StringType, true)))

var df = spark.createDataFrame(row_rdd, df_schema)


df.printSchema

// Instead of registering a UDF, call the builtin functions to perform operations on the columns.
// This will provide a performance improvement as the builtins compile and run in the platform's JVM.

// Convert to a Date type


val timestamp2datetype: (Column) => Column = (x) => { to_date(x) }
df = df.withColumn("date", timestamp2datetype(col("end_date")))

// Parse out the date only


val timestamp2date: (Column) => Column = (x) => { regexp_replace(x," (\\d+)[:](\\d+)[:](\\d+).*$", "") }
df = df.withColumn("date_only", timestamp2date(col("end_date")))

// Split a string and index a field


val parse_city: (Column) => Column = (x) => { split(x, "-")(1) }
df = df.withColumn("city", parse_city(col("location")))

// Perform a date diff function


val dateDiff: (Column, Column) => Column = (x, y) => { datediff(to_date(y), to_date(x)) }
df = df.withColumn("date_diff", dateDiff(col("start_date"), col("end_date")))

df.createOrReplaceTempView("sample_df")
display(spark.sql("select * from sample_df"))

I want to convert the DataFrame back to JSON strings to send back to Kafka.
There is a toJSON() function that returns an RDD of JSON strings using the column names and schema to
produce the JSON records.
val rdd_json = df.toJSON
rdd_json.take(2).foreach(println)

My UDF takes a parameter including the column to operate on. How do I pass this parameter?
There is a function available called lit() that creates a static column.

val add_n = udf((x: Integer, y: Integer) => x + y)

// We register a UDF that adds a column to the DataFrame, and we cast the id column to an Integer type.
df = df.withColumn("id_offset", add_n(lit(1000), col("id").cast("int")))
display(df)

val last_n_days = udf((x: Integer, y: Integer) => {


if (x < y) true else false
})

//last_n_days = udf(lambda x, y: True if x < y else False, BooleanType())

val df_filtered = df.filter(last_n_days(col("date_diff"), lit(90)))


display(df_filtered)

I have a table in the Hive metastore and I’d like to access the table as a DataFrame. What’s the best
way to define this?
There are multiple ways to define a DataFrame from a registered table. Call spark.table(tableName) or select and
filter specific columns using a SQL query:

// Both return DataFrame types
val df_1 = spark.table("sample_df")
val df_2 = spark.sql("select * from sample_df")

I’d like to clear all the cached tables on the current cluster.
There’s an API available to do this at the global or per table level.

spark.catalog.clearCache()
spark.catalog.cacheTable("sample_df")
spark.catalog.uncacheTable("sample_df")

I’d like to compute aggregates on columns. What’s the best way to do this?
There’s an API named agg(*exprs) that takes a list of column names and expressions for the type of
aggregation you’d like to compute. You can leverage the built-in functions mentioned above as part of the
expressions for each column.

// Provide the min, count, and avg and groupBy the location column. Display the results.
var agg_df = df.groupBy("location").agg(min("id"), count("id"), avg("date_diff"))
display(agg_df)

I’d like to write out the DataFrames to Parquet, but would like to partition on a particular column.
You can use the following APIs to accomplish this. Ensure the code does not create a large number of partition
columns with the datasets, otherwise the overhead of the metadata can cause significant slowdowns. If there is
a SQL table backed by this directory, you will need to call refresh table <table-name> to update the metadata
prior to the query.

df = df.withColumn("end_month", month(col("end_date")))
df = df.withColumn("end_year", year(col("end_date")))
dbutils.fs.rm("/tmp/sample_table", true)
df.write.partitionBy("end_year", "end_month").format("parquet").save("/tmp/sample_table")
display(dbutils.fs.ls("/tmp/sample_table"))

How do I properly handle cases where I want to filter out NULL data?
You can use filter() and provide similar syntax as you would with a SQL query.

val null_item_schema = StructType(Array(StructField("col1", StringType, true),
                                        StructField("col2", IntegerType, true)))

val null_dataset = sc.parallelize(Array(("test", 1), (null, 2))).map(x => Row.fromTuple(x))
val null_df = spark.createDataFrame(null_dataset, null_item_schema)
display(null_df.filter("col1 IS NOT NULL"))

How do I infer the schema using the csv or spark-avro libraries?


There is an inferSchema option flag. Providing a header allows you to name the columns appropriately.

val adult_df = spark.read.


format("csv").
option("header", "false").
option("inferSchema", "true").load("dbfs:/databricks-datasets/adult/adult.data")
adult_df.printSchema()

You have a delimited string dataset that you want to convert to their data types. How would you
accomplish this?
Use the RDD APIs to filter out the malformed rows and map the values to the appropriate types.
Introduction to Datasets
7/21/2022 • 3 minutes to read

The Datasets API provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with
the benefits of Spark SQL’s optimized execution engine. You can define a Dataset of JVM objects and then
manipulate them using functional transformations ( map , flatMap , filter , and so on) similar to an RDD. The
benefit is that, unlike with RDDs, these transformations are applied on a structured and strongly typed
distributed collection that allows Spark to leverage Spark SQL’s execution engine for optimization.

Create a Dataset
To convert a sequence to a Dataset, call .toDS() on the sequence.

val dataset = Seq(1, 2, 3).toDS()


dataset.show()

If you have a sequence of case classes, calling .toDS() provides a Dataset with all the necessary fields.

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()


personDS.show()

Create a Dataset from an RDD


To convert an RDD into a Dataset, call rdd.toDS() .

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks")))


val integerDS = rdd.toDS()
integerDS.show()

Create a Dataset from a DataFrame


You can call df.as[SomeCaseClass] to convert the DataFrame to a Dataset.

case class Company(name: String, foundingYear: Int, numEmployees: Int)


val inputSeq = Seq(Company("ABC", 1998, 310), Company("XYZ", 1983, 904), Company("NOP", 2005, 83))
val df = sc.parallelize(inputSeq).toDF()

val companyDS = df.as[Company]


companyDS.show()

You can also deal with tuples while converting a DataFrame to Dataset without using a case class .

val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks"), (3, "Notebook")))


val df = rdd.toDF("Id", "Name")

val dataset = df.as[(Int, String)]


dataset.show()
Work with Datasets
Word Count Example

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val groupedDataset = wordsDataset.flatMap(_.toLowerCase.split(" "))
  .filter(_ != "")
  .groupBy("value")
val countsDataset = groupedDataset.count()
countsDataset.show()

Join Datasets
The following example demonstrates how to:
Union multiple Datasets
Do an inner join on a condition
Group by a specific column
Do a custom aggregation (average) on the grouped Dataset
The example uses only the Datasets API to demonstrate all the operations available. In reality, using DataFrames for
doing aggregation would be simpler and faster than doing custom aggregation with mapGroups . The next
section covers the details of converting Datasets to DataFrames and using the DataFrames API for doing
aggregations.

case class Employee(name: String, age: Int, departmentId: Int, salary: Double)
case class Department(id: Int, name: String)

case class Record(name: String, age: Int, salary: Double, departmentId: Int, departmentName: String)
case class ResultSet(departmentId: Int, departmentName: String, avgSalary: Double)

val employeeDataSet1 = sc.parallelize(Seq(Employee("Max", 22, 1, 100000.0), Employee("Adam", 33, 2, 93000.0), Employee("Eve", 35, 2, 89999.0), Employee("Muller", 39, 3, 120000.0))).toDS()
val employeeDataSet2 = sc.parallelize(Seq(Employee("John", 26, 1, 990000.0), Employee("Joe", 38, 3, 115000.0))).toDS()
val departmentDataSet = sc.parallelize(Seq(Department(1, "Engineering"), Department(2, "Marketing"), Department(3, "Sales"))).toDS()

val employeeDataset = employeeDataSet1.union(employeeDataSet2)

def averageSalary(key: (Int, String), iterator: Iterator[Record]): ResultSet = {


val (total, count) = iterator.foldLeft(0.0, 0.0) {
case ((total, count), x) => (total + x.salary, count + 1)
}
ResultSet(key._1, key._2, total/count)
}

val averageSalaryDataset = employeeDataset.joinWith(departmentDataSet, $"departmentId" === $"id", "inner")
  .map(record => Record(record._1.name, record._1.age, record._1.salary, record._1.departmentId, record._2.name))
  .filter(record => record.age > 25)
  .groupBy($"departmentId", $"departmentName")
  .avg()

averageSalaryDataset.show()

Convert a Dataset to a DataFrame


The preceding two examples used the pure Datasets API. You can also easily move from Datasets to
DataFrames and leverage the DataFrames API. The following example shows the word count example that uses
both the Datasets and DataFrames APIs.
import org.apache.spark.sql.functions._

val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val result = wordsDataset
.flatMap(_.split(" ")) // Split on whitespace
.filter(_ != "") // Filter empty words
.map(_.toLowerCase())
.toDF() // Convert to DataFrame to perform aggregation / sorting
.groupBy($"value") // Count number of occurrences of each word
.agg(count("*") as "numOccurances")
.orderBy($"numOccurances" desc) // Show most common words first
result.show()
Complex and nested data
7/21/2022 • 2 minutes to read

Here’s a notebook showing you how to work with complex and nested data.

Complex nested data notebook


Get notebook
Aggregators
7/21/2022 • 2 minutes to read

Here’s a notebook showing you how to work with Dataset aggregators.

Dataset aggregator notebook


Get notebook
Dates and timestamps
7/21/2022 • 19 minutes to read

The Date and Timestamp datatypes changed significantly in Databricks Runtime 7.0. This article describes:
The Date type and the associated calendar.
The Timestamp type and how it relates to time zones. It also explains the details of time zone offset resolution
and the subtle behavior changes in the new time API in Java 8, used by Databricks Runtime 7.0.
APIs to construct date and timestamp values.
Common pitfalls and best practices for collecting date and timestamp objects on the Apache Spark driver.

Dates and calendars


A Date is a combination of the year, month, and day fields, like (year=2012, month=12, day=31). However, the
values of the year, month, and day fields have constraints to ensure that the date value is a valid date in the real
world. For example, the value of month must be from 1 to 12, the value of day must be from 1 to 28, 29, 30, or 31
(depending on the year and month), and so on. The Date type does not consider time zones.
Calendars
Constraints on Date fields are defined by one of many possible calendars. Some, like the Lunar calendar, are
used only in specific regions. Some, like the Julian calendar, are used only in history. The de facto international
standard is the Gregorian calendar which is used almost everywhere in the world for civil purposes. It was
introduced in 1582 and was extended to support dates before 1582 as well. This extended calendar is called the
Proleptic Gregorian calendar.
Databricks Runtime 7.0 uses the Proleptic Gregorian calendar, which is already being used by other data
systems like pandas, R, and Apache Arrow. Databricks Runtime 6.x and below used a combination of the Julian
and Gregorian calendar: for dates before 1582, the Julian calendar was used, for dates after 1582 the Gregorian
calendar was used. This is inherited from the legacy java.sql.Date API, which was superseded in Java 8 by
java.time.LocalDate , which uses the Proleptic Gregorian calendar.

Timestamps and time zones


The Timestamp type extends the Date type with new fields: hour, minute, second (which can have a fractional
part), together with a global (session scoped) time zone. It defines a concrete time instant. For example,
(year=2012, month=12, day=31, hour=23, minute=59, second=59.123456) with session time zone UTC+01:00.
When writing timestamp values out to non-text data sources like Parquet, the values are just instants (like
timestamp in UTC) that have no time zone information. If you write and read a timestamp value with a different
session time zone, you may see different values of the hour, minute, and second fields, but they are the same
concrete time instant.
The hour, minute, and second fields have standard ranges: 0–23 for hours and 0–59 for minutes and seconds.
Spark supports fractional seconds with up to microsecond precision. The valid range for fractions is from 0 to
999,999 microseconds.
At any concrete instant, depending on the time zone, you can observe many different wall clock values.
Conversely, a wall clock value can represent many different time instants.
The time zone offset allows you to unambiguously bind a local timestamp to a time instant. Usually, time zone
offsets are defined as offsets in hours from Greenwich Mean Time (GMT) or UTC+0 (Coordinated Universal
Time). This representation of time zone information eliminates ambiguity, but it is inconvenient. Most people
prefer to point out a location such as America/Los_Angeles or Europe/Paris . This additional level of abstraction
from zone offsets makes life easier but brings complications. For example, you now have to maintain a special
time zone database to map time zone names to offsets. Since Spark runs on the JVM, it delegates the mapping
to the Java standard library, which loads data from the Internet Assigned Numbers Authority Time Zone
Database (IANA TZDB). Furthermore, the mapping mechanism in Java’s standard library has some nuances that
influence Spark’s behavior.
Since Java 8, the JDK exposed a different API for date-time manipulation and time zone offset resolution and
Databricks Runtime 7.0 uses this API. Although the mapping of time zone names to offsets has the same source,
IANA TZDB, it is implemented differently in Java 8 and above compared to Java 7.
For example, take a look at a timestamp before the year 1883 in the America/Los_Angeles time zone:
1883-11-10 00:00:00 . This year stands out from others because on November 18, 1883, all North American
railroads switched to a new standard time system. Using the Java 7 time API, you can obtain a time zone offset
at the local timestamp as -08:00 :

java.time.ZoneId.systemDefault

res0:java.time.ZoneId = America/Los_Angeles

java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 60.0

res1: Double = 8.0

The equivalent Java 8 API returns a different result:

java.time.ZoneId.of("America/Los_Angeles").getRules.getOffset(java.time.LocalDateTime.parse("1883-11-
10T00:00:00"))

res2: java.time.ZoneOffset = -07:52:58

Prior to November 18, 1883, time of day in North America was a local matter, and most cities and towns used
some form of local solar time, maintained by a well-known clock (on a church steeple, for example, or in a
jeweler’s window). That’s why you see such a strange time zone offset.
The example demonstrates that Java 8 functions are more precise and take into account historical data from
IANA TZDB. After switching to the Java 8 time API, Databricks Runtime 7.0 benefited from the improvement
automatically and became more precise in how it resolves time zone offsets.
Databricks Runtime 7.0 also switched to the Proleptic Gregorian calendar for the Timestamp type. The ISO
SQL:2016 standard declares the valid range for timestamps is from 0001-01-01 00:00:00 to
9999-12-31 23:59:59.999999 . Databricks Runtime 7.0 fully conforms to the standard and supports all
timestamps in this range. Compared to Databricks Runtime 6.x and below, note the following sub-ranges:
0001-01-01 00:00:00..1582-10-03 23:59:59.999999 . Databricks Runtime 6.x and below uses the Julian calendar
and doesn’t conform to the standard. Databricks Runtime 7.0 fixes the issue and applies the Proleptic
Gregorian calendar in internal operations on timestamps such as getting year, month, day, etc. Due to
different calendars, some dates that exist in Databricks Runtime 6.x and below don’t exist in Databricks
Runtime 7.0. For example, 1000-02-29 is not a valid date because 1000 isn’t a leap year in the Gregorian
calendar. Also, Databricks Runtime 6.x and below resolves time zone name to zone offsets incorrectly for this
timestamp range.
1582-10-04 00:00:00..1582-10-14 23:59:59.999999 . This is a valid range of local timestamps in Databricks
Runtime 7.0, in contrast to Databricks Runtime 6.x and below where such timestamps didn’t exist.
1582-10-15 00:00:00..1899-12-31 23:59:59.999999 . Databricks Runtime 7.0 resolves time zone offsets
correctly using historical data from IANA TZDB. Compared to Databricks Runtime 7.0, Databricks Runtime 6.x
and below might resolve zone offsets from time zone names incorrectly in some cases, as shown in the
preceding example.
1900-01-01 00:00:00..2036-12-31 23:59:59.999999 . Both Databricks Runtime 7.0 and Databricks Runtime 6.x
and below conform to the ANSI SQL standard and use Gregorian calendar in date-time operations such as
getting the day of the month.
2037-01-01 00:00:00..9999-12-31 23:59:59.999999 . Databricks Runtime 6.x and below can resolve time zone
offsets and daylight saving time offsets incorrectly. Databricks Runtime 7.0 does not.
One more aspect of mapping time zone names to offsets is the overlapping of local timestamps that can happen
due to daylight saving time (DST) or switching to another standard time zone offset. For instance, on
November 3, 2019, at 02:00:00, most states in the USA turned clocks back 1 hour to 01:00:00. The local
timestamp 2019-11-03 01:30:00 America/Los_Angeles can be mapped either to 2019-11-03 01:30:00 UTC-08:00 or
2019-11-03 01:30:00 UTC-07:00 . If you don’t specify the offset and just set the time zone name (for example,
2019-11-03 01:30:00 America/Los_Angeles ), Databricks Runtime 7.0 takes the earlier offset, typically
corresponding to “summer”. The behavior diverges from Databricks Runtime 6.x and below, which takes the
“winter” offset. In the case of a gap, where clocks jump forward, there is no valid offset. For a typical one-hour
daylight saving time change, Spark moves such timestamps to the next valid timestamp corresponding to
“summer” time.
As you can see from the preceding examples, the mapping of time zone names to offsets is ambiguous, and is
not one to one. In the cases when it is possible, when constructing timestamps we recommend specifying exact
time zone offsets, for example 2019-11-03 01:30:00 UTC-07:00 .
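
As a minimal sketch (not part of the original article, and assuming you are free to change the session time
zone), the following shows both forms side by side: the explicit offset pins the instant, while the name-only
form is resolved to one of the two possible offsets during the DST overlap:

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("""
  SELECT timestamp '2019-11-03 01:30:00 UTC-07:00' AS with_explicit_offset,
         timestamp '2019-11-03 01:30:00'           AS name_only_ambiguous
""").show(truncate=False)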
ANSI SQL and Spark SQL timestamps
The ANSI SQL standard defines two types of timestamps:
TIMESTAMP WITHOUT TIME ZONE or TIMESTAMP : Local timestamp as ( YEAR , MONTH , DAY , HOUR , MINUTE , SECOND
). These timestamps are not bound to any time zone, and are wall clock timestamps.
TIMESTAMP WITH TIME ZONE : Zoned timestamp as ( YEAR , MONTH , DAY , HOUR , MINUTE , SECOND , TIMEZONE_HOUR ,
TIMEZONE_MINUTE ). These timestamps represent an instant in the UTC time zone + a time zone offset (in hours
and minutes) associated with each value.
The time zone offset of a TIMESTAMP WITH TIME ZONE does not affect the physical point in time that the timestamp
represents, as that is fully represented by the UTC time instant given by the other timestamp components.
Instead, the time zone offset only affects the default behavior of a timestamp value for display, date/time
component extraction (for example, EXTRACT ), and other operations that require knowing a time zone, such as
adding months to a timestamp.
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE , which is a combination of the fields
( YEAR , MONTH , DAY , HOUR , MINUTE , SECOND , SESSION TZ ) where the YEAR through SECOND fields identify a time
instant in the UTC time zone, and where SESSION TZ is taken from the SQL config spark.sql.session.timeZone.
The session time zone can be set as:
Zone offset (+|-)HH:mm . This form allows you to unambiguously define a physical point in time.
Time zone name in the form of region ID area/city , such as America/Los_Angeles . This form of time zone
info suffers from some of the problems described previously like overlapping of local timestamps. However,
each UTC time instant is unambiguously associated with one time zone offset for any region ID, and as a
result, each timestamp with a region ID based time zone can be unambiguously converted to a timestamp
with a zone offset. By default, the session time zone is set to the default time zone of the Java virtual
machine.
Spark TIMESTAMP WITH SESSION TIME ZONE is different from:
TIMESTAMP WITHOUT TIME ZONE , because a value of this type can map to multiple physical time instants, but any
value of TIMESTAMP WITH SESSION TIME ZONE is a concrete physical time instant. The SQL type can be emulated
by using one fixed time zone offset across all sessions, for instance UTC+0. In that case, you could consider
timestamps at UTC as local timestamps.
TIMESTAMP WITH TIME ZONE , because according to the SQL standard column values of the type can have
different time zone offsets. That is not supported by Spark SQL.
You should notice that timestamps that are associated with a global (session scoped) time zone are not
something newly invented by Spark SQL. RDBMSs such as Oracle provide a similar type for timestamps:
TIMESTAMP WITH LOCAL TIME ZONE .
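
As a minimal sketch (not part of the original article), the following shows how the same UTC instant is
rendered differently under two session time zones:

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT timestamp '2020-07-01 00:00:00 UTC' AS ts").show()   # 2020-06-30 17:00:00

spark.conf.set("spark.sql.session.timeZone", "Europe/Paris")
spark.sql("SELECT timestamp '2020-07-01 00:00:00 UTC' AS ts").show()   # 2020-07-01 02:00:00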

Construct dates and timestamps


Spark SQL provides a few methods for constructing date and timestamp values:
Default constructors without parameters: CURRENT_TIMESTAMP() and CURRENT_DATE() .
From other primitive Spark SQL types, such as INT , LONG , and STRING
From external types like Python datetime or Java classes java.time.LocalDate / Instant .
Deserialization from data sources such as CSV, JSON, Avro, Parquet, ORC, and so on.
The function MAKE_DATE introduced in Databricks Runtime 7.0 takes three parameters— YEAR , MONTH , and DAY
—and constructs a DATE value. All input parameters are implicitly converted to the INT type whenever
possible. The function checks that the resulting dates are valid dates in the Proleptic Gregorian calendar,
otherwise it returns NULL . For example:

spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)], ['Y', 'M', 'D']).createTempView('YMD')

df = spark.sql('select make_date(Y, M, D) as date from YMD')
df.printSchema()

root
|-- date: date (nullable = true)
To print DataFrame content, call the show() action, which converts dates to strings on executors and transfers
the strings to the driver to output them on the console:

df.show()

+-----------+
| date|
+-----------+
| 2020-06-26|
| null|
|-0044-01-01|
+-----------+

Similarly, you can construct timestamp values using the MAKE_TIMESTAMP function. Like MAKE_DATE , it performs
the same validation for date fields, and additionally accepts the time fields HOUR (0-23), MINUTE (0-59), and
SECOND (0-60). SECOND has the type Decimal(precision = 8, scale = 6) because seconds can be passed with
a fractional part up to microsecond precision. For example:

df = spark.createDataFrame([(2020, 6, 28, 10, 31, 30.123456),
                            (1582, 10, 10, 0, 1, 2.0001),
                            (2019, 2, 29, 9, 29, 1.0)],
                           ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND'])
df.show()

+----+-----+---+----+------+---------+
|YEAR|MONTH|DAY|HOUR|MINUTE| SECOND|
+----+-----+---+----+------+---------+
|2020| 6| 28| 10| 31|30.123456|
|1582| 10| 10| 0| 1| 2.0001|
|2019| 2| 29| 9| 29| 1.0|
+----+-----+---+----+------+---------+

ts = df.selectExpr("make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND) as MAKE_TIMESTAMP")
ts.printSchema()

root
|-- MAKE_TIMESTAMP: timestamp (nullable = true)

As for dates, print the content of the ts DataFrame using the show() action. In a similar way, show() converts
timestamps to strings but now it takes into account the session time zone defined by the SQL config
spark.sql.session.timeZone .

ts.show(truncate=False)

+--------------------------+
|MAKE_TIMESTAMP |
+--------------------------+
|2020-06-28 10:31:30.123456|
|1582-10-10 00:01:02.0001 |
|null |
+--------------------------+

Spark cannot create the last timestamp because this date is not valid: 2019 is not a leap year.
You might notice that there is no time zone information in the preceding example. In that case, Spark takes a
time zone from the SQL configuration spark.sql.session.timeZone and applies it to function invocations. You
can also pick a different time zone by passing it as the last parameter of MAKE_TIMESTAMP . Here is an example:

df = spark.createDataFrame([(2020, 6, 28, 10, 31, 30, 'UTC'),
                            (1582, 10, 10, 0, 1, 2, 'America/Los_Angeles'),
                            (2019, 2, 28, 9, 29, 1, 'Europe/Moscow')],
                           ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'TZ'])
df = df.selectExpr('make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, TZ) as MAKE_TIMESTAMP')
df = df.selectExpr("date_format(MAKE_TIMESTAMP, 'yyyy-MM-dd HH:mm:ss VV') AS TIMESTAMP_STRING")
df.show(truncate=False)

+---------------------------------+
|TIMESTAMP_STRING |
+---------------------------------+
|2020-06-28 13:31:00 Europe/Moscow|
|1582-10-10 10:24:00 Europe/Moscow|
|2019-02-28 09:29:00 Europe/Moscow|
+---------------------------------+

As the example demonstrates, Spark takes into account the specified time zones but adjusts all local timestamps
to the session time zone. The original time zones passed to the MAKE_TIMESTAMP function are lost because the
TIMESTAMP WITH SESSION TIME ZONE type assumes that all values belong to one time zone, and it doesn’t even
store a time zone per every value. According to the definition of the TIMESTAMP WITH SESSION TIME ZONE , Spark
stores local timestamps in the UTC time zone, and uses the session time zone while extracting date-time fields or
converting the timestamps to strings.
Also, timestamps can be constructed from the LONG type using casting. If a LONG column contains the number
of seconds since the epoch 1970-01-01 00:00:00Z, it can be cast to a Spark SQL TIMESTAMP :

select CAST(-123456789 AS TIMESTAMP);


1966-02-02 05:26:51

Unfortunately, this approach doesn’t allow you to specify the fractional part of seconds.
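One possible workaround (a sketch that is not part of the original article, and that assumes the default
non-ANSI casting behavior) is to cast a fractional number of seconds instead:

# The fractional part of the numeric value becomes the sub-second part of the timestamp.
spark.sql("SELECT CAST(-123456789.123456 AS TIMESTAMP) AS ts").show(truncate=False)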
Another way is to construct dates and timestamps from values of the STRING type. You can make literals using
special keywords:

select timestamp '2020-06-28 22:17:33.123456 Europe/Amsterdam', date '2020-07-01';


2020-06-28 23:17:33.123456 2020-07-01

Alternatively, you can use casting that you can apply for all values in a column:

select cast('2020-06-28 22:17:33.123456 Europe/Amsterdam' as timestamp), cast('2020-07-01' as date);


2020-06-28 23:17:33.123456 2020-07-01

The input timestamp strings are interpreted as local timestamps in the specified time zone or in the session time
zone if a time zone is omitted in the input string. Strings with unusual patterns can be converted to timestamp
using the to_timestamp() function. The supported patterns are described in Datetime Patterns for Formatting
and Parsing:

select to_timestamp('28/6/2020 22.17.33', 'dd/M/yyyy HH.mm.ss');


2020-06-28 22:17:33
If you don’t specify a pattern, the function behaves similarly to CAST .
For usability, Spark SQL recognizes special string values in all methods that accept a string and return a
timestamp or date:
epoch is an alias for date 1970-01-01 or timestamp 1970-01-01 00:00:00Z .
now is the current timestamp or date at the session time zone. Within a single query it always produces the
same result.
today is the beginning of the current date for the TIMESTAMP type or just current date for the DATE type.
tomorrow is the beginning of the next day for timestamps or just the next day for the DATE type.
yesterday is the day before current one or its beginning for the TIMESTAMP type.

For example:

select timestamp 'yesterday', timestamp 'today', timestamp 'now', timestamp 'tomorrow';


2020-06-27 00:00:00 2020-06-28 00:00:00 2020-06-28 23:07:07.18 2020-06-29 00:00:00
select date 'yesterday', date 'today', date 'now', date 'tomorrow';
2020-06-27 2020-06-28 2020-06-28 2020-06-29

Spark allows you to create Datasets from existing collections of external objects at the driver side and create
columns of corresponding types. Spark converts instances of external types to semantically equivalent internal
representations. For example, to create a Dataset with DATE and TIMESTAMP columns from Python collections,
you can use:

import datetime
df = spark.createDataFrame([(datetime.datetime(2020, 7, 1, 0, 0, 0), datetime.date(2020, 7, 1))],
['timestamp', 'date'])
df.show()

+-------------------+----------+
| timestamp| date|
+-------------------+----------+
|2020-07-01 00:00:00|2020-07-01|
+-------------------+----------+

PySpark converts Python’s date-time objects to internal Spark SQL representations at the driver side using the
system time zone, which can be different from Spark’s session time zone setting spark.sql.session.timeZone .
The internal values don’t contain information about the original time zone. Future operations over the
parallelized date and timestamp values take into account only the Spark SQL session time zone according to the
TIMESTAMP WITH SESSION TIME ZONE type definition.

In a similar way, Spark recognizes the following types as external date-time types in Java and Scala APIs:
java.sql.Date and java.time.LocalDate as external types for the DATE type
java.sql.Timestamp and java.time.Instant for the TIMESTAMP type.

There is a difference between java.sql.* and java.time.* types. java.time.LocalDate and java.time.Instant
were added in Java 8, and the types are based on the Proleptic Gregorian calendar–the same calendar that is
used by Databricks Runtime 7.0 and above. java.sql.Date and java.sql.Timestamp have another calendar
underneath–the hybrid calendar (Julian + Gregorian since 1582-10-15), which is the same as the legacy
calendar used by Databricks Runtime 6.x and below. Due to different calendar systems, Spark has to perform
additional operations during conversions to internal Spark SQL representations, and rebase input
dates/timestamp from one calendar to another. The rebase operation has a little overhead for modern
timestamps after the year 1900, and it can be more significant for old timestamps.
The following example shows how to make timestamps from Scala collections. The first example constructs a
java.sql.Timestamp object from a string. The valueOf method interprets the input strings as a local timestamp
in the default JVM time zone, which can be different from Spark’s session time zone. If you need to construct
instances of java.sql.Timestamp or java.sql.Date in a specific time zone, have a look at
java.text.SimpleDateFormat (and its method setTimeZone ) or java.util.Calendar.

Seq(java.sql.Timestamp.valueOf("2020-06-29 22:41:30"), new java.sql.Timestamp(0)).toDF("ts").show(false)

+-------------------+
|ts |
+-------------------+
|2020-06-29 22:41:30|
|1970-01-01 03:00:00|
+-------------------+

Seq(java.time.Instant.ofEpochSecond(-12219261484L), java.time.Instant.EPOCH).toDF("ts").show

+-------------------+
| ts|
+-------------------+
|1582-10-15 11:12:13|
|1970-01-01 03:00:00|
+-------------------+

Similarly, you can make a DATE column from collections of java.sql.Date or java.time.LocalDate .
Parallelization of java.time.LocalDate instances is fully independent of either Spark’s session or JVM default time
zones, but the same is not true for parallelization of java.sql.Date instances. There are nuances:
1. java.sql.Date instances represent local dates at the default JVM time zone on the driver.
2. For correct conversions to Spark SQL values, the default JVM time zone on the driver and executors must be
the same.

Seq(java.time.LocalDate.of(2020, 2, 29), java.time.LocalDate.now).toDF("date").show

+----------+
| date|
+----------+
|2020-02-29|
|2020-06-29|
+----------+

To avoid any calendar and time zone related issues, we recommend Java 8 types java.time.LocalDate / Instant
as external types in parallelization of Java/Scala collections of timestamps or dates.

Collect dates and timestamps


The reverse operation of parallelization is collecting dates and timestamps from executors back to the driver and
returning a collection of external types. For the example above, you can pull the DataFrame back to the driver
using the collect() action:
df.collect()

[Row(timestamp=datetime.datetime(2020, 7, 1, 0, 0), date=datetime.date(2020, 7, 1))]

Spark transfers internal values of date and timestamp columns as time instants in the UTC time zone from
executors to the driver, and performs conversions to Python datetime objects in the system time zone at the
driver, not using the Spark SQL session time zone. collect() is different from the show() action described in the
previous section. show() uses the session time zone while converting timestamps to strings, and collects the
resulting strings on the driver.
In Java and Scala APIs, Spark performs the following conversions by default:
Spark SQL DATE values are converted to instances of java.sql.Date .
Spark SQL TIMESTAMP values are converted to instances of java.sql.Timestamp .

Both conversions are performed in the default JVM time zone on the driver. In this way, to get the same date-
time fields from Date.getDay() , getHour() , and so on, as from the Spark SQL functions DAY and HOUR ,
the default JVM time zone on the driver and the session time zone on the executors should be the same.

Similarly to making dates/timestamps from java.sql.Date / Timestamp , Databricks Runtime 7.0 performs
rebasing from the Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian). This operation is
almost free for modern dates (after the year 1582) and timestamps (after the year 1900), but it could bring
some overhead for ancient dates and timestamps.
You can avoid such calendar-related issues by asking Spark to return java.time types, which were added in
Java 8. If you set the SQL config spark.sql.datetime.java8API.enabled to true, the Dataset.collect() action
returns:
java.time.LocalDate for Spark SQL DATE type
java.time.Instant for Spark SQL TIMESTAMP type

Now the conversions don’t suffer from the calendar-related issues because Java 8 types and Databricks Runtime
7.0 and above are both based on the Proleptic Gregorian calendar. The collect() action doesn’t depend on the
default JVM time zone. The timestamp conversions don’t depend on time zone at all. Date conversions use the
session time zone from the SQL config spark.sql.session.timeZone . For example, consider a Dataset with
DATE and TIMESTAMP columns, with the default JVM time zone set to Europe/Moscow and the session time
zone set to America/Los_Angeles .

java.util.TimeZone.getDefault

res1: java.util.TimeZone = sun.util.calendar.ZoneInfo[id="Europe/Moscow",...]

spark.conf.get("spark.sql.session.timeZone")

res2: String = America/Los_Angeles

df.show
+-------------------+----------+
| timestamp| date|
+-------------------+----------+
|2020-07-01 00:00:00|2020-07-01|
+-------------------+----------+

The show() action prints the timestamp at the session time America/Los_Angeles , but if you collect the Dataset ,
it is converted to java.sql.Timestamp and the toString method prints Europe/Moscow :

df.collect()

res16: Array[org.apache.spark.sql.Row] = Array([2020-07-01 10:00:00.0,2020-07-01])

df.collect()(0).getAs[java.sql.Timestamp](0).toString

res18: java.sql.Timestamp = 2020-07-01 10:00:00.0

Actually, the local timestamp 2020-07-01 00:00:00 is 2020-07-01T07:00:00Z at UTC. You can observe that if you
enable Java 8 API and collect the Dataset:

df.collect()

res27: Array[org.apache.spark.sql.Row] = Array([2020-07-01T07:00:00Z,2020-07-01])

You can convert a java.time.Instant object to any local timestamp independently from the global JVM time
zone. This is one of the advantages of java.time.Instant over java.sql.Timestamp . The latter requires
changing the global JVM setting, which influences other timestamps on the same JVM. Therefore, if your
applications process dates or timestamps in different time zones, and the applications should not clash with
each other while collecting data to the driver using the Java or Scala Dataset.collect() API, we recommend
switching to the Java 8 API using the SQL config spark.sql.datetime.java8API.enabled .
What is Apache Spark Structured Streaming?
7/21/2022 • 2 minutes to read

Apache Spark Structured Streaming is a near-real-time processing engine that offers end-to-end fault tolerance
with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express
computation on streaming data in the same way you express a batch computation on static data. The Structured
Streaming engine performs the computation incrementally and continuously updates the result as streaming
data arrives. For an overview of Structured Streaming, see the Apache Spark Structured Streaming
Programming Guide.
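
For instance, here is a minimal sketch (not from this article) that applies an ordinary batch-style transformation
to the built-in rate source and writes the result to an in-memory table; the source, sink, and names are chosen
only for illustration:

events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
doubled = events.selectExpr("timestamp", "value", "value * 2 AS doubled")   # same API as a batch DataFrame

query = (doubled.writeStream
  .format("memory")          # in-memory sink, for demos only
  .queryName("rate_doubled")
  .outputMode("append")
  .start())

# spark.sql("select * from rate_doubled").show()
# query.stop()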

How is Structured Streaming used on Azure Databricks?


Structured Streaming pairs tightly with Delta Lake to offer enhanced functionality for incremental data
processing at scale in the Databricks Lakehouse. Structured Streaming is the core technology at the heart of
Databricks Auto Loader, as well as Delta Live Tables.
Streaming with Delta Lake
Auto Loader
Delta Live Tables

What streaming sources and sinks does Azure Databricks support?


Databricks recommends using Auto Loader to ingest supported file types from cloud object storage into Delta
Lake. For ETL pipelines, Databricks recommends using Delta Live Tables (which uses Delta tables and Structured
Streaming). You can also configure incremental ETL workloads by streaming to and from Delta Lake tables.
In addition to Delta Lake and Auto Loader, Structured Streaming can connect to messaging services such as
Apache Kafka.
You can also Perform streaming writes to arbitrary data sinks with Structured Streaming and foreachBatch.
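
As a minimal sketch (not from this article) of the Auto Loader recommendation above, a stream from cloud
object storage into a Delta table might look like the following; the paths are placeholders:

raw = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/tmp/example/_schemas")   # placeholder path for the inferred schema
  .load("/tmp/example/landing/"))

(raw.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/example/_checkpoints/bronze")
  .trigger(once=True)   # process the files that are currently available, then stop
  .start("/tmp/example/bronze"))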

What are best practices for Structured Streaming in production?


Azure Databricks supports a number of features not found in Apache Spark to help customers get the best
performance out of Structured Streaming. Learn more about these features and other recommendations in
Production considerations for Structured Streaming applications on Azure Databricks.

Examples
For introductory notebooks and notebooks demonstrating example use cases, see Examples for working with
Structured Streaming on Azure Databricks.

API reference
For reference information about Structured Streaming, Azure Databricks recommends the following Apache
Spark API reference:
Python
Scala
Java
Production considerations for Structured Streaming
applications on Azure Databricks
7/21/2022 • 2 minutes to read

You can easily configure production incremental processing workloads with Structured Streaming on Azure
Databricks to fulfill latency and cost requirements for real-time or batch applications. Understanding key
concepts of Structured Streaming on Azure Databricks can help you avoid common pitfalls as you scale up the
volume and velocity of data and move from development to production.
Azure Databricks has introduced Delta Live Tables to reduce the complexities of managing production
infrastructure for Structured Streaming workloads. Databricks recommends using Delta Live Tables for new
Structured Streaming pipelines; see Delta Live Tables.

Using notebooks for Structured Streaming workloads


Interactive development with Databricks notebooks requires that you attach your notebooks to a cluster in order
to execute queries manually. You can schedule Databricks notebooks for automated deployment and automatic
recovery from query failure using Workflows.
Recover from Structured Streaming query failures
Monitoring Structured Streaming queries on Azure Databricks
Configure scheduler pools for multiple Structured Streaming workloads on a cluster
You can visualize Structured Streaming queries in notebooks during interactive development, or for interactive
monitoring of production workloads. You should only visualize a Structured Streaming query in production if a
human will regularly monitor the output of the notebook. While the trigger and checkpointLocation
parameters are optional, as a best practice Databricks recommends that you always specify them in production.
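For example, a production write might look like the following minimal sketch; the query name, trigger interval, and paths are illustrative assumptions rather than values from this article.

# Hypothetical production writeStream with an explicit trigger and checkpoint.
(df.writeStream
   .format("delta")
   .queryName("events_bronze")                                    # easier to identify when monitoring
   .trigger(processingTime="1 minute")                            # explicit trigger interval
   .option("checkpointLocation", "/mnt/tables/events/_checkpoint")
   .start("/mnt/tables/events"))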

Controlling batch size and frequency for Structured Streaming on Azure Databricks
Structured Streaming on Azure Databricks has enhanced options for helping to control costs and latency while
streaming with Auto Loader and Delta Lake.
Configure Structured Streaming batch size on Azure Databricks
Configure Structured Streaming trigger intervals on Azure Databricks

What is stateful streaming?


A stateful Structured Streaming query requires incremental updates to intermediate state information, whereas
a stateless Structured Streaming query only tracks information about which rows have been processed from the
source to the sink.
Stateful operations include streaming aggregation, streaming dropDuplicates , stream-stream joins,
mapGroupsWithState , and flatMapGroupsWithState .
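As a minimal sketch, the following windowed aggregation is a stateful query; the DataFrame name and columns are assumptions for illustration only.

# Hypothetical streaming DataFrame `events` with columns `event_time` and `user_id`.
from pyspark.sql.functions import window

stateful_counts = (
    events
    .withWatermark("event_time", "10 minutes")                 # bounds how long state is retained
    .groupBy(window("event_time", "5 minutes"), "user_id")     # streaming aggregation keeps state
    .count()
)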

The intermediate state information required for stateful Structured Streaming queries can lead to unexpected
latency and production problems if not configured properly.
Optimize performance of stateful Structured Streaming queries on Azure Databricks
Configure RocksDB state store on Azure Databricks
Enable asynchronous state checkpointing for Structured Streaming
Control late data threshold for Structured Streaming with multiple watermark policy
Specify initial state for Structured Streaming mapGroupsWithState
Test state update function for Structured Streaming mapGroupsWithState
Recover from Structured Streaming query failures

Structured Streaming provides fault-tolerance and data consistency for streaming queries; using Azure
Databricks workflows, you can easily configure your Structured Streaming queries to automatically restart on
failure. By enabling checkpointing for a streaming query, you can restart the query after a failure. The restarted
query continues where the failed one left off.

Enable checkpointing for Structured Streaming queries


Databricks recommends that you always specify the checkpointLocation option as a cloud storage path before you
start the query. For example:

streamingDataFrame.writeStream
.format("parquet")
.option("path", "/path/to/table")
.option("checkpointLocation", "/path/to/table/_checkpoint")
.start()

This checkpoint location preserves all of the essential information that identifies a query. Each query must have
a different checkpoint location. Multiple queries should never have the same location. For more information, see
the Structured Streaming Programming Guide.

NOTE
While checkpointLocation is required for most types of output sinks, some sinks, such as memory sink, may
automatically generate a temporary checkpoint location when you do not provide checkpointLocation . These
temporary checkpoint locations do not ensure any fault tolerance or data consistency guarantees and may not get
cleaned up properly. Avoid potential pitfalls by always specifying a checkpointLocation .

Configure Structured Streaming jobs to restart streaming queries on failure
You can create an Azure Databricks job with the notebook or JAR that has your streaming queries and configure
it to:
Always use a new cluster.
Always retry on failure.
Jobs have tight integration with Structured Streaming APIs and can monitor all streaming queries active in a run.
This configuration ensures that if any part of the query fails, jobs automatically terminate the run (along with all the other queries) and start a new run in a new cluster. This re-runs the notebook or JAR code and restarts all of the queries. This is the safest way to return to a good state.

WARNING
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook
workflows in your streaming jobs.
NOTE
Failure in any of the active streaming queries causes the active run to fail and terminate all the other streaming
queries.
You do not need to use streamingQuery.awaitTermination() or spark.streams.awaitAnyTermination() at the
end of your notebook. Jobs automatically prevent a run from completing when a streaming query is active.

The following is an example of a recommended job configuration.


Cluster : Always set this to use a new cluster and use the latest Spark version (or at least version 2.1). Queries started in Spark 2.1 and above are recoverable after query and Spark version upgrades.
Alerts : Set this if you want email notification on failures.
Schedule : Do not set a schedule.
Timeout : Do not set a timeout. Streaming queries run for an indefinitely long time.
Maximum concurrent runs : Set to 1 . There must be only one instance of each query concurrently active.
Retries : Set to Unlimited .
See Jobs to understand these configurations.

Recover after changes in a Structured Streaming query


There are limitations on what changes in a streaming query are allowed between restarts from the same
checkpoint location. Here are a few kinds of changes that are either not allowed, or the effect of the change is
not well-defined. For all of them:
The term allowed means you can do the specified change but whether the semantics of its effect is well-
defined depends on the query and the change.
The term not allowed means you should not do the specified change as the restarted query is likely to fail
with unpredictable errors.
sdf represents a streaming DataFrame/Dataset generated with sparkSession.readStream .

Types of changes in Structured Streaming queries


Changes in the number or type (that is, different source) of input sources : This is not allowed.
Changes in the parameters of input sources : Whether this is allowed and whether the semantics of the
change are well-defined depends on the source and the query. Here are a few examples.
Addition, deletion, and modification of rate limits is allowed:

spark.readStream.format("kafka").option("subscribe", "article")

to
spark.readStream.format("kafka").option("subscribe", "article").option("maxOffsetsPerTrigger",
...)

Changes to subscribed articles and files are generally not allowed as the results are unpredictable:
spark.readStream.format("kafka").option("subscribe", "article") to
spark.readStream.format("kafka").option("subscribe", "newarticle")

Changes in the type of output sink : Changes between a few specific combinations of sinks are allowed.
This needs to be verified on a case-by-case basis. Here are a few examples.
File sink to Kafka sink is allowed. Kafka will see only the new data.
Kafka sink to file sink is not allowed.
Kafka sink changed to foreach, or vice versa is allowed.
Changes in the parameters of output sink : Whether this is allowed and whether the semantics of the
change are well-defined depends on the sink and the query. Here are a few examples.
Changes to output directory of a file sink is not allowed:
sdf.writeStream.format("parquet").option("path", "/somePath") to
sdf.writeStream.format("parquet").option("path", "/anotherPath")
Changes to the output topic are allowed:
sdf.writeStream.format("kafka").option("topic", "someTopic") to
sdf.writeStream.format("kafka").option("topic", "anotherTopic")
Changes to the user-defined foreach sink (that is, the ForeachWriter code) is allowed, but the
semantics of the change depends on the code.
Changes in projection / filter / map-like operations : Some cases are allowed. For example:
Addition / deletion of filters is allowed: sdf.selectExpr("a") to
sdf.where(...).selectExpr("a").filter(...) .
Changes in projections with same output schema is allowed:
sdf.selectExpr("stringColumn AS json").writeStream to
sdf.select(to_json(...).as("json")).writeStream .
Changes in projections with different output schema are conditionally allowed:
sdf.selectExpr("a").writeStream to sdf.selectExpr("b").writeStream is allowed only if the output
sink allows the schema change from "a" to "b" .
Changes in stateful operations : Some operations in streaming queries need to maintain state data in
order to continuously update the result. Structured Streaming automatically checkpoints the state data to
fault-tolerant storage (for example, DBFS, Azure Blob storage) and restores it after restart. However, this
assumes that the schema of the state data remains same across restarts. This means that any changes (that
is, additions, deletions, or schema modifications) to the stateful operations of a streaming query are not
allowed between restarts. Here is the list of stateful operations whose schema should not be changed
between restarts in order to ensure state recovery:
Streaming aggregation : For example, sdf.groupBy("a").agg(...) . Any change in number or type of
grouping keys or aggregates is not allowed.
Streaming deduplication : For example, sdf.dropDuplicates("a") . Any change in number or type of
grouping keys or aggregates is not allowed.
Stream-stream join : For example, sdf1.join(sdf2, ...) (i.e. both inputs are generated with
sparkSession.readStream ). Changes in the schema or equi-joining columns are not allowed. Changes in join type (outer or inner) are not allowed. Other changes in the join condition are ill-defined.
Arbitrar y stateful operation : For example, sdf.groupByKey(...).mapGroupsWithState(...) or
sdf.groupByKey(...).flatMapGroupsWithState(...) . Any change to the schema of the user-defined state
and the type of timeout is not allowed. Changes within the user-defined state-mapping function are allowed, but the semantic effect of the change depends on the user-defined logic. If you really want to
support state schema changes, then you can explicitly encode/decode your complex state data
structures into bytes using an encoding/decoding scheme that supports schema migration. For
example, if you save your state as Avro-encoded bytes, then you can change the Avro-state-schema
between query restarts as this restores the binary state.
Monitoring Structured Streaming queries on Azure
Databricks

Azure Databricks provides built-in monitoring for Structured Streaming applications through the Spark UI under
the Streaming tab.

Distinguish Structured Streaming queries in the Spark UI


Give each stream a unique query name by adding .queryName(<query_name>) to your writeStream code so that you can easily distinguish which metrics belong to which stream in the Spark UI.
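For example (a minimal sketch; the query name and paths are placeholders):

# Tag the stream so its metrics are easy to find under the Spark UI Streaming tab.
(df.writeStream
   .queryName("orders_silver")
   .format("delta")
   .option("checkpointLocation", "/mnt/tables/orders_silver/_checkpoint")
   .start("/mnt/tables/orders_silver"))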

Push Structured Streaming metrics to external services


The streaming metrics can be pushed to external services for alerting or dashboarding use cases by using
Apache Spark’s Streaming Query Listener interface. In Databricks Runtime 11.0 and above, the Streaming Query
Listener is available in Python and Scala:
Scala

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val myListener = new StreamingQueryListener {

/**
* Called when a query is started.
* @note This is called synchronously with
* [[org.apache.spark.sql.streaming.DataStreamWriter `DataStreamWriter.start()`]].
* `onQueryStart` calls on all listeners before
* `DataStreamWriter.start()` returns the corresponding [[StreamingQuery]].
* Do not block this method, as it blocks your query.
*/
def onQueryStarted(event: QueryStartedEvent): Unit = {}

/**
* Called when there is some status update (ingestion rate updated, etc.)
*
* @note This method is asynchronous. The status in [[StreamingQuery]] returns the
* latest status, regardless of when this method is called. The status of [[StreamingQuery]]
* may change before or when you process the event. For example, you may find [[StreamingQuery]]
* terminates when processing `QueryProgressEvent`.
*/
def onQueryProgress(event: QueryProgressEvent): Unit = {}

/**
* Called when a query is stopped, with or without error.
*/
def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
}

Python
class MyListener(StreamingQueryListener):
def onQueryStarted(self, event):
"""
Called when a query is started.

Parameters
----------
event: :class:`pyspark.sql.streaming.listener.QueryStartedEvent`
The properties are available as the same as Scala API.

Notes
-----
This is called synchronously with
meth:`pyspark.sql.streaming.DataStreamWriter.start`,
that is, ``onQueryStart`` will be called on all listeners before
``DataStreamWriter.start()`` returns the corresponding
:class:`pyspark.sql.streaming.StreamingQuery`.
Do not block in this method as it will block your query.
"""
pass

def onQueryProgress(self, event):


"""
Called when there is some status update (ingestion rate updated, etc.)

Parameters
----------
event: :class:`pyspark.sql.streaming.listener.QueryProgressEvent`
The properties are available as the same as Scala API.

Notes
-----
This method is asynchronous. The status in
:class:`pyspark.sql.streaming.StreamingQuery` returns the
most recent status, regardless of when this method is called. The status
of :class:`pyspark.sql.streaming.StreamingQuery`
may change before or when you process the event.
For example, you may find :class:`StreamingQuery`
terminates when processing `QueryProgressEvent`.
"""
pass

def onQueryTerminated(self, event):


"""
Called when a query is stopped, with or without error.

Parameters
----------
event: :class:`pyspark.sql.streaming.listener.QueryTerminatedEvent`
The properties are available as the same as Scala API.
"""
pass

my_listener = MyListener()

Defining observable metrics in Structured Streaming


Observable metrics are named arbitrary aggregate functions that can be defined on a query (DataFrame). As
soon as the execution of a DataFrame reaches a completion point (that is, finishes a batch query or reaches a
streaming epoch), a named event is emitted that contains the metrics for the data processed since the last
completion point.
You can observe these metrics by attaching a listener to the Spark session. The listener depends on the
execution mode:
Batch mode : Use QueryExecutionListener .
QueryExecutionListener is called when the query completes. Access the metrics using the
QueryExecution.observedMetrics map.

Streaming, or micro-batch : Use StreamingQueryListener .


StreamingQueryListener is called when the streaming query completes an epoch. Access the metrics
using the StreamingQueryProgress.observedMetrics map. Azure Databricks does not support continuous
execution streaming.
For example:
Scala

// Observe row count (rc) and error row count (erc) in the streaming Dataset
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import spark.implicits._

val observed_ds = ds.observe("my_event", count(lit(1)).as("rc"), count($"error").as("erc"))
observed_ds.writeStream.format("...").start()

// Monitor the metrics using a listener (the unused callbacks are left empty)
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(event: QueryStartedEvent): Unit = {}
override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
override def onQueryProgress(event: QueryProgressEvent): Unit = {
event.progress.observedMetrics.get("my_event").foreach { row =>
// Trigger if the number of errors exceeds 5 percent
val num_rows = row.getAs[Long]("rc")
val num_error_rows = row.getAs[Long]("erc")
val ratio = num_error_rows.toDouble / num_rows
if (ratio > 0.05) {
// Trigger alert
}
}
}
})

Python

# Observe metric
from pyspark.sql.functions import col, count, lit
from pyspark.sql.streaming import StreamingQueryListener

observed_df = df.observe("metric", count(lit(1)).alias("cnt"), count(col("error")).alias("malformed"))
observed_df.writeStream.format("...").start()

# Define my listener.
class MyListener(StreamingQueryListener):
def onQueryStarted(self, event):
print(f"'{event.name}' [{event.id}] got started!")
def onQueryProgress(self, event):
row = event.progress.observedMetrics.get("metric")
if row is not None:
if row.malformed / row.cnt > 0.5:
print("ALERT! Ouch! there are too many malformed "
f"records {row.malformed} out of {row.cnt}!")
else:
print(f"{row.cnt} rows processed!")
def onQueryTerminated(self, event):
print(f"{event.id} got terminated!")

# Add my listener.
spark.streams.addListener(MyListener())
Configure scheduler pools for multiple Structured
Streaming workloads on a cluster

To enable multiple streaming queries to execute jobs concurrently on a shared cluster, you can configure queries
to execute in separate scheduler pools.

How do scheduler pools work?


By default, all queries started in a notebook run in the same fair scheduling pool. Jobs generated by triggers
from all of the streaming queries in a notebook run one after another in first in, first out (FIFO) order. This can
cause unnecessary delays in the queries, because they are not efficiently sharing the cluster resources.
Scheduler pools allow you to declare which Structured Streaming queries share compute resources.
The following example assigns query1 to a dedicated pool, while query2 and query3 share a scheduler pool.

# Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("delta").start(path1)

# Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("delta").start(path2)

# Run streaming query3 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query3").format("delta").start(path3)

NOTE
The local property configuration must be in the same notebook cell where you start your streaming query.

See Apache fair scheduler documentation for more details.


Configure Structured Streaming batch size on Azure
Databricks

Limiting the input rate for Structured Streaming queries helps to maintain a consistent batch size and prevents
large batches from leading to spill and cascading micro-batch processing delays.
Azure Databricks provides the same options to control Structured Streaming batch sizes for both Delta Lake and
Auto Loader.

Limit input rate with maxFilesPerTrigger


Setting maxFilesPerTrigger (or cloudFiles.maxFilesPerTrigger for Auto Loader) specifies an upper-bound for
the number of files processed in each micro-batch. For both Delta Lake and Auto Loader the default is 1000.
(Note that this option is also present in Apache Spark for other file sources, where there is no max by default.)

Limit input rate with maxBytesPerTrigger


Setting maxBytesPerTrigger (or cloudFiles.maxBytesPerTrigger for Auto Loader) sets a “soft max” for the
amount of data processed in each micro-batch. This means that a batch processes approximately this amount of
data and may process more than the limit in order to make the streaming query move forward in cases when
the smallest input unit is larger than this limit. There is no default for this setting.
For example, if you specify a byte string such as 10g to limit each microbatch to 10 GB of data and you have
files that are 3 GB each, Azure Databricks processes 12 GB in a microbatch.

Setting multiple input rates together


If you use maxBytesPerTrigger in conjunction with maxFilesPerTrigger , the micro-batch processes data until
reaching the lower limit of either maxFilesPerTrigger or maxBytesPerTrigger .
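As a minimal sketch (values and path are illustrative), the two limits can be combined on a Delta Lake source like this:

# The micro-batch stops adding input once either limit is reached.
df = (spark.readStream
      .format("delta")
      .option("maxFilesPerTrigger", 500)     # upper bound on files per micro-batch
      .option("maxBytesPerTrigger", "10g")   # soft upper bound on bytes per micro-batch
      .load("/mnt/tables/source"))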

Limiting input rates for other Structured Streaming sources


Streaming sources such as Apache Kafka each have custom input limits, such as maxOffsetsPerTrigger . For more
details, see Working with pub/sub and message queues on Azure Databricks.
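For example, a hedged sketch of the Kafka equivalent (the broker address and topic name are placeholders):

# maxOffsetsPerTrigger caps how many Kafka offsets are consumed per micro-batch.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", 10000)
      .load())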
Configure Structured Streaming trigger intervals on
Azure Databricks

Apache Spark Structured Streaming processes data incrementally; controlling the trigger interval for batch
processing allows you to use Structured Streaming for workloads including near-real time processing,
refreshing databases every 5 minutes or once per hour, or batch processing all new data for a day or week.
Because Databricks Auto Loader uses Structured Streaming to load data, understanding how triggers work
provides you with the greatest flexibility to control costs while ingesting data with the desired frequency.

Specifying time-based trigger intervals


Structured Streaming refers to time-based trigger intervals as “fixed interval micro-batches”. Using the
processingTime keyword, specify a time duration as a string, such as .trigger(processingTime='10 seconds') .

When you specify a trigger interval that is too small (less than tens of seconds), the system may perform
unnecessary checks to see if new data arrives. Configure your processing time to balance latency requirements
and the rate that data arrives in the source.

Configuring incremental batch processing


Apache Spark provides the .trigger(once=True) option to process all new data from the source directory as a
single micro-batch. This trigger once pattern ignores all settings that control streaming input size, which can lead to massive spill or out-of-memory errors.
Azure Databricks supports trigger(availableNow=True) in Databricks Runtime 10.2 and above for Delta Lake
and Auto Loader sources. This functionality combines the batch processing approach of trigger once with the
ability to configure batch size, resulting in multiple parallelized batches that give greater control for right-sizing
batches and the resultant files.
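The following sketch contrasts the trigger options discussed above; the DataFrame, paths, and intervals are illustrative assumptions, and a real query would use only one of these alternatives.

# Assumed placeholders for this sketch.
checkpoint = "/mnt/tables/events/_checkpoint"
target = "/mnt/tables/events"

# Fixed-interval micro-batches every 10 seconds.
df.writeStream.trigger(processingTime="10 seconds") \
  .option("checkpointLocation", checkpoint).start(target)

# Trigger once: process everything available as a single batch (ignores batch-size settings).
df.writeStream.trigger(once=True) \
  .option("checkpointLocation", checkpoint).start(target)

# Available now (Databricks Runtime 10.2+, Delta Lake and Auto Loader sources):
# process everything available, split into batches that respect batch-size settings.
df.writeStream.trigger(availableNow=True) \
  .option("checkpointLocation", checkpoint).start(target)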

What is the default trigger interval?


Structured Streaming defaults to fixed interval micro-batches of 500ms. Databricks recommends you always
specify a tailored trigger to minimize costs associated with checking if new data has arrived and processing
undersized batches.

What is continuous processing mode?


Apache Spark supports an additional trigger interval known as Continuous Processing. This mode has been
classified as experimental since Spark 2.3; consult with your Azure Databricks representative to make sure you
understand the trade-offs of this processing model.
Note that this continuous processing mode does not relate at all to continuous processing as applied in Delta
Live Tables.
Optimize performance of stateful Structured
Streaming queries on Azure Databricks

Managing the intermediate state information of stateful Structured Streaming queries can help prevent
unexpected latency and production problems.

Preventing slowdown from garbage collection (GC) pauses in stateful streaming
If you have stateful operations in your streaming query (such as streaming aggregation) and you want to
maintain millions of keys in the state, then you may face issues related to large JVM garbage collection (GC)
pauses. This causes high variations in the micro-batch processing times. This occurs because your JVM’s
memory maintains your state data by default. Having a large number of state objects puts pressure on your
JVM memory, which causes high GC pauses.
In such cases, you can choose to use a more optimized state management solution based on RocksDB. This
solution is available in Databricks Runtime. Rather than keeping the state in the JVM memory, this solution uses
RocksDB to efficiently manage the state in the native memory and the local SSD. Furthermore, any changes to
this state are automatically saved by Structured Streaming to the checkpoint location you have provided, thus
providing full fault-tolerance guarantees (the same as default state management). For instructions for
configuring RocksDB as state store, see Configure RocksDB state store on Azure Databricks.

Recommended configurations for stateful Structured Streaming on Databricks
Databricks recommends:
Use compute-optimized instances as workers. For example, Azure Standard_F16s instances.
Set the number of shuffle partitions to 1-2 times the number of cores in the cluster.
Set the spark.sql.streaming.noDataMicroBatches.enabled configuration to false in the SparkSession. This prevents the streaming micro-batch engine from processing micro-batches that do not contain data. Note that setting this configuration to false can delay output from stateful operations that rely on watermarks or processing-time timeouts until new data arrives, instead of emitting it immediately. A minimal sketch of these settings follows this list.
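The sketch below applies these session settings; the shuffle partition count assumes a hypothetical 64-core cluster and is illustrative only.

# Roughly 1-2x the number of cores in the cluster (assumed 64 cores here).
spark.conf.set("spark.sql.shuffle.partitions", "128")

# Skip micro-batches that contain no data.
spark.conf.set("spark.sql.streaming.noDataMicroBatches.enabled", "false")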
Regarding performance benefits, RocksDB-based state management can maintain 100 times more state keys
than the default one. For example, in a Spark cluster with Azure Standard_F16s instances as workers, the default
state management can maintain up to 1-2 million state keys per executor after which the JVM GC starts
affecting performance significantly. In contrast, the RocksDB-based state management can easily maintain 100
million state keys per executor without any GC issues.

NOTE
The state management scheme cannot be changed between query restarts. That is, if a query has been started with the
default management, then it cannot be changed without starting the query from scratch with a new checkpoint location.
Configure RocksDB state store on Azure Databricks

You can enable RocksDB-based state management by setting the following configuration in the SparkSession
before starting the streaming query.

spark.conf.set(
"spark.sql.streaming.stateStore.providerClass",
"com.databricks.sql.streaming.state.RocksDBStateStoreProvider")

RocksDB state store metrics


Each state operator collects metrics related to the state management operations performed on its RocksDB
instance to observe the state store and potentially help in debugging job slowness. These metrics are aggregated (summed) per state operator in a job across all tasks where the state operator is running. These metrics
are part of the customMetrics map inside the stateOperators fields in StreamingQueryProgress . The following is
an example of StreamingQueryProgress in JSON form (obtained using StreamingQueryProgress.json() ).
{
"id" : "6774075e-8869-454b-ad51-513be86cfd43",
"runId" : "3d08104d-d1d4-4d1a-b21e-0b2e1fb871c5",
"batchId" : 7,
"stateOperators" : [ {
"numRowsTotal" : 20000000,
"numRowsUpdated" : 20000000,
"memoryUsedBytes" : 31005397,
"numRowsDroppedByWatermark" : 0,
"customMetrics" : {
"rocksdbBytesCopied" : 141037747,
"rocksdbCommitCheckpointLatency" : 2,
"rocksdbCommitCompactLatency" : 22061,
"rocksdbCommitFileSyncLatencyMs" : 1710,
"rocksdbCommitFlushLatency" : 19032,
"rocksdbCommitPauseLatency" : 0,
"rocksdbCommitWriteBatchLatency" : 56155,
"rocksdbFilesCopied" : 2,
"rocksdbFilesReused" : 0,
"rocksdbGetCount" : 40000000,
"rocksdbGetLatency" : 21834,
"rocksdbPutCount" : 1,
"rocksdbPutLatency" : 56155599000,
"rocksdbReadBlockCacheHitCount" : 1988,
"rocksdbReadBlockCacheMissCount" : 40341617,
"rocksdbSstFileSize" : 141037747,
"rocksdbTotalBytesReadByCompaction" : 336853375,
"rocksdbTotalBytesReadByGet" : 680000000,
"rocksdbTotalBytesReadThroughIterator" : 0,
"rocksdbTotalBytesWrittenByCompaction" : 141037747,
"rocksdbTotalBytesWrittenByPut" : 740000012,
"rocksdbTotalCompactionLatencyMs" : 21949695000,
"rocksdbWriterStallLatencyMs" : 0,
"rocksdbZipFileBytesUncompressed" : 7038
}
} ],
"sources" : [ {
} ],
"sink" : {
}
}

Detailed descriptions of the metrics are as follows:

rocksdbCommitWriteBatchLatency : Time (in millis) taken to apply the staged writes in the in-memory structure (WriteBatch) to native RocksDB.
rocksdbCommitFlushLatency : Time (in millis) taken to flush the RocksDB in-memory changes to local disk.
rocksdbCommitCompactLatency : Time (in millis) taken for the (optional) compaction during the checkpoint commit.
rocksdbCommitPauseLatency : Time (in millis) taken to stop the background worker threads (for compaction and so on) as part of the checkpoint commit.
rocksdbCommitCheckpointLatency : Time (in millis) taken to take a snapshot of native RocksDB and write it to a local directory.
rocksdbCommitFileSyncLatencyMs : Time (in millis) taken to sync the native RocksDB snapshot-related files to external storage (the checkpoint location).
rocksdbGetLatency : Average time (in nanos) taken per underlying native RocksDB::Get call.
rocksdbPutLatency : Average time (in nanos) taken per underlying native RocksDB::Put call.
rocksdbGetCount : Number of native RocksDB::Get calls (does not include gets from WriteBatch, the in-memory batch used for staging writes).
rocksdbPutCount : Number of native RocksDB::Put calls (does not include puts to WriteBatch, the in-memory batch used for staging writes).
rocksdbTotalBytesReadByGet : Number of uncompressed bytes read through native RocksDB::Get calls.
rocksdbTotalBytesWrittenByPut : Number of uncompressed bytes written through native RocksDB::Put calls.
rocksdbReadBlockCacheHitCount : Number of times the native RocksDB block cache was used to avoid reading data from local disk.
rocksdbReadBlockCacheMissCount : Number of times the native RocksDB block cache missed and data had to be read from local disk.
rocksdbTotalBytesReadByCompaction : Number of bytes read from local disk by the native RocksDB compaction process.
rocksdbTotalBytesWrittenByCompaction : Number of bytes written to local disk by the native RocksDB compaction process.
rocksdbTotalCompactionLatencyMs : Time (in millis) taken for RocksDB compactions (both background compactions and the optional compaction initiated during the commit).
rocksdbWriterStallLatencyMs : Time (in millis) the writer has stalled due to a background compaction or flushing of the memtables to disk.
rocksdbTotalBytesReadThroughIterator : Total size of uncompressed data read using an iterator. Some stateful operations (such as timeout processing in flatMapGroupsWithState or watermarking in windowed aggregations) require reading the entire data in the DB through an iterator.
Enable asynchronous state checkpointing for
Structured Streaming

NOTE
Available in Databricks Runtime 10.3 and above.

For stateful streaming queries bottlenecked on state updates, enabling asynchronous state checkpointing can
reduce end-to-end latencies without sacrificing any fault-tolerance guarantees, but with a minor cost of higher
restart delays.
Structured Streaming uses synchronous checkpointing by default. Every micro-batch ensures that all the state
updates in that batch are backed up in cloud storage (called “checkpoint location”) before starting the next batch.
If a stateful streaming query fails, all micro-batches except the last micro-batch are checkpointed. On restart,
only the last batch needs to be re-run. Fast recovery with synchronous checkpointing comes at the cost of
higher latency for each micro-batch.

Asynchronous state checkpointing attempts to perform the checkpointing asynchronously so that the micro-
batch execution doesn’t have to wait for the checkpoint to complete. In other words, the next micro-batch can
start as soon as the computation of the previous micro-batch has been completed. Internally, however, the offset
metadata (also saved in the checkpoint location) tracks whether the state checkpointing has been completed for
a micro-batch. On query restart, more than one micro-batch may need to be re-executed - the last micro-batch
whose computation was incomplete, as well as the one micro-batch before it whose state checkpointing was
incomplete. And you get the same fault-tolerance guarantees (that is, exactly-once guarantees with an
idempotent sink) as that of synchronous checkpointing.

Identifying Structured Streaming workloads that benefit from asynchronous checkpointing
The following are streaming job characteristics that may benefit from asynchronous state checkpointing.
Job has one or more stateful operations (e.g., aggregation, flatMapGroupsWithState , mapGroupsWithState ,
stream-stream joins)
State checkpoint latency is one of the major contributors to overall batch execution latency. This information
can be found in the StreamingQueryProgress events. These events can also be found in the log4j logs on the Spark driver. Here is an example of streaming query progress and how to find the state checkpoint impact on the
overall batch execution latency.

{
"id" : "2e3495a2-de2c-4a6a-9a8e-f6d4c4796f19",
"runId" : "e36e9d7e-d2b1-4a43-b0b3-e875e767e1fe",
"...",
"batchId" : 0,
"durationMs" : {
"...",
"triggerExecution" : 547730,
"..."
},
"stateOperators" : [ {
"...",
"commitTimeMs" : 3186626,
"numShufflePartitions" : 64,
"..."
}]
}

State checkpoint latency analysis of the above query progress event:
Batch duration ( durationMs.triggerExecution ) is around 547 secs.
State store commit latency ( stateOperators[0].commitTimeMs ) is around 3,186 secs. Commit latency is aggregated across tasks containing a state store. In this case there are 64 such tasks ( stateOperators[0].numShufflePartitions ).
Each task containing a state operator took an average of 50 secs (3,186/64) to checkpoint. This is extra latency added to the batch duration. Assuming all 64 tasks run concurrently, the checkpoint step contributed around 9% (50 secs / 547 secs) of the batch duration. The percentage gets even higher when the maximum number of concurrent tasks is less than 64.

Enabling asynchronous state checkpointing


Set the following configuration in the streaming job. Asynchronous checkpointing requires a state store implementation that supports asynchronous commits; currently only the RocksDB-based state store implementation supports it.

spark.conf.set(
"spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled",
"true"
)

spark.conf.set(
"spark.sql.streaming.stateStore.providerClass",
"com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
)

Limitations and requirements for asynchronous checkpointing


Any failure in an asynchronous checkpoint at any one or more stores fails the query. In synchronous
checkpointing mode, the checkpoint is executed as part of the task and Spark retries the task multiple times
before failing the query. This mechanism is not present with asynchronous state checkpointing. However,
using the Databricks job retries, such failures can be automatically retried.
Asynchronous checkpointing works best when the state store locations are not changed between micro-
batch executions. Auto-scaling in combination with asynchronous state checkpointing may not work well, because with auto-scaling enabled the state store instances may get redistributed as nodes are added or removed.
Asynchronous state checkpointing is supported only in the RocksDB state store provider implementation.
The default in-memory state store implementation does not support it.
Control late data threshold for Structured Streaming
with multiple watermark policy

When working with multiple Structured Streaming inputs, you can set multiple watermarks to control tolerance
thresholds for late-arriving data. Configuring watermarks allows you to control state information and impacts
latency.
A streaming query can have multiple input streams that are unioned or joined together. Each of the input
streams can have a different threshold of late data that needs to be tolerated for stateful operations. Specify these thresholds using withWatermark("eventTime", delay) on each of the input streams. The following is an
example query with stream-stream joins.

val inputStream1 = ... // delays up to 1 hour
val inputStream2 = ... // delays up to 2 hours

inputStream1.withWatermark("eventTime1", "1 hour")
  .join(
    inputStream2.withWatermark("eventTime2", "2 hours"),
    joinCondition)

While running the query, Structured Streaming individually tracks the maximum event time seen in each input
stream, calculates watermarks based on the corresponding delay, and chooses a single global watermark with
them to be used for stateful operations. By default, the minimum is chosen as the global watermark because it
ensures that no data is accidentally dropped as too late if one of the streams falls behind the others (for example, one of the streams stops receiving data due to upstream failures). In other words, the global watermark
safely moves at the pace of the slowest stream and the query output is delayed accordingly.
If you want to get faster results, you can set the multiple watermark policy to choose the maximum value as the
global watermark by setting the SQL configuration spark.sql.streaming.multipleWatermarkPolicy to max
(default is min ). This lets the global watermark move at the pace of the fastest stream. However, this
configuration drops data from the slowest streams. Because of this, Databricks recommends that you use this configuration judiciously.
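A minimal sketch of this setting, applied before starting the query:

# Default is "min"; "max" follows the fastest stream and may drop late data
# from slower streams.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")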
Specify initial state for Structured Streaming
mapGroupsWithState

You can specify a user-defined initial state for Structured Streaming stateful processing using
flatMapGroupsWithState or mapGroupsWithState . This allows you to avoid reprocessing data when starting a
stateful stream without a valid checkpoint.

def mapGroupsWithState[S: Encoder, U: Encoder](
    timeoutConf: GroupStateTimeout,
    initialState: KeyValueGroupedDataset[K, S])(
    func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]

def flatMapGroupsWithState[S: Encoder, U: Encoder](
    outputMode: OutputMode,
    timeoutConf: GroupStateTimeout,
    initialState: KeyValueGroupedDataset[K, S])(
    func: (K, Iterator[V], GroupState[S]) => Iterator[U])

Example use case that specifies an initial state to the flatMapGroupsWithState operator:

val fruitCountFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {
  val count = state.getOption.map(_.count).getOrElse(0L) + values.size
  state.update(new RunningCount(count))
  Iterator((key, count.toString))
}

val fruitCountInitialDS: Dataset[(String, RunningCount)] = Seq(
  ("apple", new RunningCount(1)),
  ("orange", new RunningCount(2)),
  ("mango", new RunningCount(5))
).toDS()

val fruitCountInitial = fruitCountInitialDS.groupByKey(x => x._1).mapValues(_._2)

fruitStream
  .groupByKey(x => x)
  .flatMapGroupsWithState(Update, GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)

Example use case that specifies an initial state to the mapGroupsWithState operator:
val fruitCountFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {
  val count = state.getOption.map(_.count).getOrElse(0L) + values.size
  state.update(new RunningCount(count))
  (key, count.toString)
}

val fruitCountInitialDS: Dataset[(String, RunningCount)] = Seq(
  ("apple", new RunningCount(1)),
  ("orange", new RunningCount(2)),
  ("mango", new RunningCount(5))
).toDS()

val fruitCountInitial = fruitCountInitialDS.groupByKey(x => x._1).mapValues(_._2)

fruitStream
  .groupByKey(x => x)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)
Test state update function for Structured Streaming
mapGroupsWithState

The TestGroupState API enables you to test the state update function used for
Dataset.groupByKey(...).mapGroupsWithState(...) and Dataset.groupByKey(...).flatMapGroupsWithState(...) .

The state update function takes the previous state as input using an object of type GroupState . See the Apache
Spark GroupState reference documentation. For example:

import org.apache.spark.sql.streaming._
import org.apache.spark.api.java.Optional

test("flatMapGroupsWithState's state update function") {


var prevState = TestGroupState.create[UserStatus](
optionalState = Optional.empty[UserStatus],
timeoutConf = GroupStateTimeout.EventTimeTimeout,
batchProcessingTimeMs = 1L,
eventTimeWatermarkMs = Optional.of(1L),
hasTimedOut = false)

val userId: String = ...


val actions: Iterator[UserAction] = ...

assert(!prevState.hasUpdated)

updateState(userId, actions, prevState)

assert(prevState.hasUpdated)
}
Working with pub/sub and message queues on
Azure Databricks

Azure Databricks can integrate with stream messaging services for near-real time data ingestion into the
Databricks Lakehouse. It can also sync enriched and transformed data in the lakehouse with other streaming
systems.
Ingesting streaming messages into Delta Lake lets you retain messages indefinitely, so you can replay data streams without fear of losing data due to retention thresholds.
Azure Databricks has specific features for working with semi-structured data fields contained in Avro and JSON
data payloads. To learn more, see:
Read and write streaming Avro data
To learn more about specific configurations for streaming from or to message queues, see:
Apache Kafka
Azure Event Hubs
Apache Kafka

The Apache Kafka connectors for Structured Streaming are packaged in Databricks Runtime. You use the kafka
connector to connect to Kafka 0.10+ and the kafka08 connector to connect to Kafka 0.8+ (deprecated).

Connect Kafka on HDInsight to Azure Databricks


1. Create an HDInsight Kafka cluster.
See Connect to Kafka on HDInsight through an Azure Virtual Network for instructions.
2. Configure the Kafka brokers to advertise the correct address.
Follow the instructions in Configure Kafka for IP advertising. If you manage Kafka yourself on Azure
Virtual Machines, make sure that the advertised.listeners configuration of the brokers is set to the
internal IP of the hosts.
3. Create an Azure Databricks cluster.
Follow the instructions in Quickstart: Run a Spark job on Azure Databricks using the Azure portal.
4. Peer the Kafka cluster to the Azure Databricks cluster.
Follow the instructions in Peer virtual networks.
5. Validate the connection by testing the scenarios described in Quickstart and Production Structured
Streaming with Kafka notebook.

Schema
The schema of the records is:

COLUMN : TYPE
key : binary
value : binary
topic : string
partition : int
offset : long
timestamp : long
timestampType : int

The key and the value are always deserialized as byte arrays with the ByteArrayDeserializer . Use DataFrame
operations ( cast("string") , udfs) to explicitly deserialize the keys and values.
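For example, a minimal sketch (assuming df was loaded from the Kafka source as shown in this article):

# Cast the binary key and value columns to strings for downstream processing.
from pyspark.sql.functions import col

decoded = df.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    "topic", "partition", "offset", "timestamp")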
Quickstart
Let’s start with the canonical WordCount example. The following notebook demonstrates how to run
WordCount using Structured Streaming with Kafka.

NOTE
This notebook example uses Kafka 0.10. To use Kafka 0.8, change the format to kafka08 (that is, .format("kafka08") ).

Kafka WordCount with Structured Streaming notebook


Get notebook

Configuration
For the comprehensive list of configuration options, see the Spark Structured Streaming + Kafka Integration
Guide. To get you started, here is a subset of the most common configuration options.

NOTE
As Structured Streaming is still under development, this list may not be up to date.

There are multiple ways of specifying which topics to subscribe to. You should provide only one of these
parameters:

subscribe
  Value: A comma-separated list of topics. Supported Kafka versions: 0.8, 0.10.
  The topic list to subscribe to.

subscribePattern
  Value: Java regex string. Supported Kafka versions: 0.10.
  The pattern used to subscribe to topic(s).

assign
  Value: JSON string, for example {"topicA":[0,1],"topicB":[2,4]}. Supported Kafka versions: 0.8, 0.10.
  Specific topicPartitions to consume.

Other notable configurations:

kafka.bootstrap.servers
  Value: Comma-separated list of host:port. Default: empty. Supported Kafka versions: 0.8, 0.10.
  [Required] The Kafka bootstrap.servers configuration. If you find there is no data from Kafka, check the broker address list first. If the broker address list is incorrect, there might not be any errors. This is because the Kafka client assumes the brokers will become available eventually and, in the event of network errors, retries forever.

failOnDataLoss
  Value: true or false. Default: true. Supported Kafka versions: 0.10.
  [Optional] Whether to fail the query when it’s possible that data was lost. Queries can permanently fail to read data from Kafka due to many scenarios such as deleted topics, topic truncation before processing, and so on. We try to estimate conservatively whether data was possibly lost or not. Sometimes this can cause false alarms. Set this option to false if it does not work as expected, or if you want the query to continue processing despite data loss.

minPartitions
  Value: Integer >= 0, 0 = disabled. Default: 0 (disabled). Supported Kafka versions: 0.10.
  [Optional] Minimum number of partitions to read from Kafka. With Spark 2.1.0-db2 and above, you can configure Spark to use an arbitrary minimum of partitions to read from Kafka using the minPartitions option. Normally Spark has a 1-1 mapping of Kafka topicPartitions to Spark partitions consuming from Kafka. If you set the minPartitions option to a value greater than your Kafka topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. This option can be set at times of peak loads, data skew, and as your stream is falling behind to increase processing rate. It comes at a cost of initializing Kafka consumers at each trigger, which may impact performance if you use SSL when connecting to Kafka.

kafka.group.id
  Value: A Kafka consumer group ID. Default: not set. Supported Kafka versions: 0.10.
  [Optional] Group ID to use while reading from Kafka. Supported in Spark 2.2+. Use this with caution. By default, each query generates a unique group ID for reading data. This ensures that each query has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use specific authorized group IDs to read data. You can optionally set the group ID. However, do this with extreme caution as it can cause unexpected behavior.
  * Concurrently running queries (both batch and streaming) with the same group ID are likely to interfere with each other, causing each query to read only part of the data.
  * This may also occur when queries are started or restarted in quick succession. To minimize such issues, set the Kafka consumer configuration session.timeout.ms
to be very small.

See Structured Streaming Kafka Integration Guide for other optional configurations.
IMPORTANT
You should not set the following Kafka parameters for the Kafka 0.10 connector as it will throw an exception:
group.id : Setting this parameter is not allowed for Spark versions below 2.2.
auto.offset.reset : Instead, set the source option startingOffsets to specify where to start. To maintain
consistency, Structured Streaming (as opposed to the Kafka Consumer) manages the consumption of offsets internally.
This ensures that you don’t miss any data after dynamically subscribing to new topics/partitions. Note that startingOffsets applies only when you start a new streaming query; resuming from a checkpoint always picks up from where the query left off.
key.deserializer : Keys are always deserialized as byte arrays with ByteArrayDeserializer . Use DataFrame
operations to explicitly deserialize the keys.
value.deserializer : Values are always deserialized as byte arrays with ByteArrayDeserializer . Use DataFrame
operations to explicitly deserialize the values.
enable.auto.commit : Setting this parameter is not allowed. Spark keeps track of Kafka offsets internally and doesn’t
commit any offset.
interceptor.classes : The Kafka source always reads keys and values as byte arrays. It’s not safe to use ConsumerInterceptor as it may break the query.

Production Structured Streaming with Kafka notebook


Get notebook
Metrics

NOTE
Available in Databricks Runtime 8.1 and above.

You can get the average, min, and max of the number of offsets that the streaming query is behind the latest
available offset among all the subscribed topics with the avgOffsetsBehindLatest , maxOffsetsBehindLatest , and
minOffsetsBehindLatest metrics. See Reading Metrics Interactively.

NOTE
Available in Databricks Runtime 9.1 and above.

Get the estimated total number of bytes that the query process has not consumed from the subscribed topics by
examining the value of estimatedTotalBytesBehindLatest . This estimate is based on the batches that were
processed in the last 300 seconds. The timeframe that the estimate is based on can be changed by setting the
option bytesEstimateWindowLength to a different value. For example, to set it to 10 minutes:

df = spark.readStream \
  .format("kafka") \
  .option("bytesEstimateWindowLength", "10m")  # m for minutes; you can also use "600s" for 600 seconds

If you are running the stream in a notebook, you can see these metrics under the Raw Data tab in the
streaming query progress dashboard:
{
"sources" : [ {
"description" : "KafkaV2[Subscribe[topic]]",
"metrics" : {
"avgOffsetsBehindLatest" : "4.0",
"maxOffsetsBehindLatest" : "4",
"minOffsetsBehindLatest" : "4",
"estimatedTotalBytesBehindLatest" : "80.0"
},
} ]
}

Use SSL
To enable SSL connections to Kafka, follow the instructions in the Confluent documentation Encryption and
Authentication with SSL. You can provide the configurations described there, prefixed with kafka. , as options.
For example, you specify the trust store location in the property kafka.ssl.truststore.location .
We recommend that you:
Store your certificates in Azure Blob storage or Azure Data Lake Storage Gen2 and access them through a
DBFS mount point. Combined with cluster and job ACLs, you can restrict access to the certificates only to
clusters that can access Kafka.
Store your certificate passwords as secrets in a secret scope.
Once paths are mounted and secrets stored, you can do the following:

df = spark.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", ...) \
  .option("kafka.security.protocol", "SASL_SSL") \
  .option("kafka.ssl.truststore.location", <dbfs-truststore-location>) \
  .option("kafka.ssl.keystore.location", <dbfs-keystore-location>) \
  .option("kafka.ssl.keystore.password", dbutils.secrets.get(scope=<certificate-scope-name>, key=<keystore-password-key-name>)) \
  .option("kafka.ssl.truststore.password", dbutils.secrets.get(scope=<certificate-scope-name>, key=<truststore-password-key-name>))

Resources
Real-Time End-to-End Integration with Apache Kafka in Apache Spark Structured Streaming
Azure Event Hubs

Azure Event Hubs is a hyper-scale telemetry ingestion service that collects, transforms, and stores millions of
events. As a distributed streaming platform, it gives you low latency and configurable time retention, which
enables you to ingress massive amounts of telemetry into the cloud and read the data from multiple
applications using publish-subscribe semantics.
This article explains how to use Structured Streaming with Azure Event Hubs and Azure Databricks clusters.

Requirements
For current release support, see “Latest Releases” in the Azure Event Hubs Spark Connector project readme file.
1. Create a library in your Azure Databricks workspace using the Maven coordinate
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17 .

NOTE
This connector is updated regularly, and a more recent version may be available: we recommend that you pull the
latest connector from the Maven repository

2. Install the created library into your cluster.

Schema
The schema of the records is:

COLUMN : TYPE
body : binary
partition : string
offset : string
sequenceNumber : long
enqueuedTime : timestamp
publisher : string
partitionKey : string
properties : map[string,json]

The body is always provided as a byte array. Use cast("string") to explicitly deserialize the body column.

Quick Start
Let’s start with a quick example: WordCount. The following notebook is all that it takes to run WordCount using
Structured Streaming with Azure Event Hubs.
Azure Event Hubs WordCount with Structured Streaming notebook
Get notebook

Configuration
This section discusses the configuration settings you need to work with Event Hubs.
For detailed guidance on configuring Structured Streaming with Azure Event Hubs, see the Structured Streaming
and Azure Event Hubs Integration Guide developed by Microsoft.
For detailed guidance on using Structured Streaming, see What is Apache Spark Structured Streaming?.
Connection string
An Event Hubs connection string is required to connect to the Event Hubs service. You can get the connection
string for your Event Hubs instance from the Azure portal or by using the ConnectionStringBuilder in the
library.
Azure portal
When you get the connection string from the Azure portal, it may or may not have the EntityPath key.
Consider:

// Without an entity path
val without = "Endpoint=<endpoint>;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>"

// With an entity path ("with" is a reserved word in Scala, so use another name)
val withEntityPath = "Endpoint=sb://<sample>;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>;EntityPath=<eventhub-name>"

To connect to your EventHubs, an EntityPath must be present. If your connection string doesn’t have one, don’t
worry. This will take care of it:

import org.apache.spark.eventhubs.ConnectionStringBuilder

val connectionString = ConnectionStringBuilder(without) // defined in the previous code block
  .setEventHubName("<eventhub-name>")
  .build

ConnectionStringBuilder
Alternatively, you can use the ConnectionStringBuilder to make your connection string.

import org.apache.spark.eventhubs.ConnectionStringBuilder

val connectionString = ConnectionStringBuilder()
  .setNamespaceName("<namespace-name>")
  .setEventHubName("<eventhub-name>")
  .setSasKeyName("<key-name>")
  .setSasKey("<key>")
  .build

EventHubsConf
All configuration relating to Event Hubs happens in your EventHubsConf . To create an EventHubsConf , you must
pass a connection string:
val connectionString = "<event-hub-connection-string>"
val eventHubsConf = EventHubsConf(connectionString)

See Connection String for more information about obtaining a valid connection string.
For a complete list of configurations, see EventHubsConf. Here is a subset of configurations to get you started:

consumerGroup
  Value: String. Default: "$Default". Query type: Streaming and batch.
  A consumer group is a view of an entire event hub. Consumer groups enable multiple consuming applications to each have a separate view of the event stream, and to read the stream independently at their own pace and with their own offsets. More information is available in the Microsoft documentation.

startingPosition
  Value: EventPosition. Default: Start of stream. Query type: Streaming and batch.
  The starting position for your Structured Streaming job. See startingPositions for information about the order in which options are read.

maxEventsPerTrigger
  Value: long. Default: partitionCount * 1000. Query type: Streaming query.
  Rate limit on the maximum number of events processed per trigger interval. The specified total number of events will be proportionally split across partitions of different volume.

For each option, there exists a corresponding setting in EventHubsConf . For example:

import org.apache.spark.eventhubs._

val cs = "<your-connection-string>"
val eventHubsConf = EventHubsConf(cs)
  .setConsumerGroup("sample-cg")
  .setMaxEventsPerTrigger(10000)

EventPosition
EventHubsConf allows users to specify starting (and ending) positions with the EventPosition class.
EventPosition defines the position of an event in an Event Hub partition. The position can be an enqueued time,
offset, sequence number, the start of the stream, or the end of the stream.
import org.apache.spark.eventhubs._

EventPosition.fromOffset("246812")          // Specifies offset 246812
EventPosition.fromSequenceNumber(100L)      // Specifies sequence number 100
EventPosition.fromEnqueuedTime(Instant.now) // Any event after the current time
EventPosition.fromStartOfStream             // Specifies from start of stream
EventPosition.fromEndOfStream               // Specifies from end of stream

If you would like to start (or end) at a specific position, simply create the correct EventPosition and set it in your
EventHubsConf :

val connectionString = "<event-hub-connection-string>"


val eventHubsConf = EventHubsConf(connectionString)
.setStartingPosition(EventPosition.fromEndOfStream)

Production Structured Streaming with Azure Event Hubs


When you run streaming queries in production, you probably want more robustness and uptime guarantees
than you would have when you simply attach a notebook to a cluster and run your streaming queries
interactively. Import and run the following notebook for a demonstration of how to configure and run
Structured Streaming in production with Azure Event Hubs and Azure Databricks.
For more information, see Production considerations for Structured Streaming applications on Azure Databricks.
Production Structured Streaming with Azure Event Hubs notebook
Get notebook

End-to-end Event Hubs streaming tutorial


For an end-to-end example of streaming data into a cluster using Event Hubs, see Tutorial: Stream data into
Azure Databricks using Event Hubs.
Read and write streaming Avro data

Apache Avro is a commonly used data serialization system in the streaming world. A typical solution is to put
data in Avro format in Apache Kafka, metadata in Confluent Schema Registry, and then run queries with a
streaming framework that connects to both Kafka and Schema Registry.
Azure Databricks supports the from_avro and to_avro functions to build streaming pipelines with Avro data in
Kafka and metadata in Schema Registry. The function to_avro encodes a column as binary in Avro format and
from_avro decodes Avro binary data into a column. Both functions transform one column to another column,
and the input/output SQL data type can be a complex type or a primitive type.

NOTE
The from_avro and to_avro functions:
Are available in Python, Scala, and Java.
Can be passed to SQL functions in both batch and streaming queries.

Also see Avro file data source.

Basic example
Similar to from_json and to_json, you can use from_avro and to_avro with any binary column, but you must
specify the Avro schema manually.

import org.apache.spark.sql.avro.functions._
import org.apache.avro.SchemaBuilder

// When reading the key and value of a Kafka topic, decode the
// binary (Avro) data into structured data.
// The schema of the resulting DataFrame is: <key: string, value: int>
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t")
.load()
.select(
from_avro($"key", SchemaBuilder.builder().stringType()).as("key"),
from_avro($"value", SchemaBuilder.builder().intType()).as("value"))

// Convert structured data to binary from string (key column) and
// int (value column) and save to a Kafka topic.
dataDF
  .select(
    to_avro($"key").as("key"),
    to_avro($"value").as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("topic", "t")
  .start()
jsonFormatSchema example
You can also specify a schema as a JSON string. For example, if /tmp/user.avsc is:

{
"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_color", "type": ["string", "null"]}
]
}

You can read the schema into a JSON string:

from pyspark.sql.avro.functions import from_avro, to_avro

jsonFormatSchema = open("/tmp/user.avsc", "r").read()

Then use the schema in from_avro :

# 1. Decode the Avro data into a struct.
# 2. Filter by column "favorite_color".
# 3. Encode the column "name" in Avro format.

output = df\
  .select(from_avro("value", jsonFormatSchema).alias("user"))\
  .where('user.favorite_color == "red"')\
  .select(to_avro("user.name").alias("value"))

Example with Schema Registry


If your cluster has a Schema Registry service, from_avro can work with it so that you don’t need to specify the
Avro schema manually.

NOTE
Integration with Schema Registry is available only in Scala and Java.

import org.apache.spark.sql.avro.functions._

// Read a Kafka topic "t", assuming the key and value are already
// registered in Schema Registry as subjects "t-key" and "t-value" of type
// string and int. The binary key and value columns are turned into string
// and int type with Avro and Schema Registry. The schema of the resulting DataFrame
// is: <key: string, value: int>.
val schemaRegistryAddr = "https://myhost:8081"
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t")
.load()
.select(
from_avro($"key", "t-key", schemaRegistryAddr).as("key"),
from_avro($"value", "t-value", schemaRegistryAddr).as("value"))
For to_avro , the default output Avro schema might not match the schema of the target subject in the Schema
Registry service for the following reasons:
The mapping from Spark SQL type to Avro schema is not one-to-one. See Supported types for Spark SQL ->
Avro conversion.
If the converted output Avro schema is of record type, the record name is topLevelRecord and there is no
namespace by default.
If the default output schema of to_avro matches the schema of the target subject, you can do the following:

import org.apache.spark.sql.functions.lit

// The converted data is saved to Kafka as a Kafka topic "t".
dataDF
  .select(
    to_avro($"key", lit("t-key"), schemaRegistryAddr).as("key"),
    to_avro($"value", lit("t-value"), schemaRegistryAddr).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("topic", "t")
  .start()

Otherwise, you must provide the schema of the target subject in the to_avro function:

// The Avro schema of subject "t-value" in JSON string format.
val avroSchema = ...

// The converted data is saved to Kafka as a Kafka topic "t".
dataDF
  .select(
    to_avro($"key", lit("t-key"), schemaRegistryAddr).as("key"),
    to_avro($"value", lit("t-value"), schemaRegistryAddr, avroSchema).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("topic", "t")
  .start()
Examples for working with Structured Streaming on
Azure Databricks
7/21/2022 • 2 minutes to read

This article contains notebooks and code samples for common patterns for working with Structured Streaming on
Azure Databricks.

Getting started with Structured Streaming


These two notebooks show how to use the DataFrame API to build Structured Streaming applications in Python
and Scala.
Structured Streaming demo Python notebook
Get notebook
Structured Streaming demo Scala notebook
Get notebook

Write to Cassandra using foreachBatch() in Scala


streamingDF.writeStream.foreachBatch() allows you to reuse existing batch data writers to write the output of a
streaming query to Cassandra. The following notebook shows this by using the Spark Cassandra connector
from Scala to write the key-value output of an aggregation query to Cassandra. See the foreachBatch
documentation for details.
To run this example, you need to install the appropriate Cassandra Spark connector for your Spark version as a
Maven library.
In this example, we create a table, and then start a Structured Streaming query to write to that table. We then
use foreachBatch() to write the streaming output using a batch DataFrame connector.
import org.apache.spark.sql._
import org.apache.spark.sql.cassandra._

import com.datastax.spark.connector.cql.CassandraConnectorConf
import com.datastax.spark.connector.rdd.ReadConf
import com.datastax.spark.connector._

val host = "<ip address>"
val clusterName = "<cluster name>"
val keyspace = "<keyspace>"
val tableName = "<tableName>"

spark.setCassandraConf(clusterName, CassandraConnectorConf.ConnectionHostParam.option(host))

spark.readStream.format("rate").load()
  .selectExpr("value % 10 as key")
  .groupBy("key")
  .count()
  .toDF("key", "value")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Use the Cassandra batch data source to write the streaming output.
    batchDF.write
      .cassandraFormat(tableName, keyspace)
      .option("cluster", clusterName)
      .mode("append")
      .save()
  }
  .outputMode("update")
  .start()

Write to Azure Synapse Analytics using foreachBatch() in Python


streamingDF.writeStream.foreachBatch() allows you to reuse existing batch data writers to write the output of a
streaming query to Azure Synapse Analytics. See the foreachBatch documentation for details.
To run this example, you need the Azure Synapse Analytics connector. For details on the Azure Synapse Analytics
connector, see Azure Synapse Analytics.
from pyspark.sql.functions import *
from pyspark.sql import *

def writeToSQLWarehouse(df, epochId):
  df.write \
    .format("com.databricks.spark.sqldw") \
    .mode('overwrite') \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forward_spark_azure_storage_credentials", "true") \
    .option("dbtable", "my_table_in_dw_copy") \
    .option("tempdir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .save()

spark.conf.set("spark.sql.shuffle.partitions", "1")

query = (
  spark.readStream.format("rate").load()
    .selectExpr("value % 10 as key")
    .groupBy("key")
    .count()
    .toDF("key", "count")
    .writeStream
    .foreachBatch(writeToSQLWarehouse)
    .outputMode("update")
    .start()
)

Stream-Stream joins
These two notebooks show how to use stream-stream joins in Python and Scala.
Stream-Stream joins Python notebook
Get notebook
Stream-Stream joins Scala notebook
Get notebook
Perform streaming writes to arbitrary data sinks with
Structured Streaming and foreachBatch
7/21/2022 • 4 minutes to read

Structured Streaming APIs provide two ways to write the output of a streaming query to data sources that do
not have an existing streaming sink: foreachBatch() and foreach() .

Reuse existing batch data sources with foreachBatch()


streamingDF.writeStream.foreachBatch(...) allows you to specify a function that is executed on the output data
of every micro-batch of the streaming query. It takes two parameters: a DataFrame or Dataset that has the
output data of a micro-batch and the unique ID of the micro-batch. With foreachBatch , you can:
Reuse existing batch data sources
For many storage systems, there may not be a streaming sink available yet, but there may already exist a data
writer for batch queries. Using foreachBatch() , you can use the batch data writers on the output of each micro-
batch. Here are a few examples:
Cassandra Scala example
Azure Synapse Analytics Python example
Many other batch data sources can be used from foreachBatch() .
Write to multiple locations
If you want to write the output of a streaming query to multiple locations, then you can simply write the output
DataFrame/Dataset multiple times. However, each attempt to write can cause the output data to be recomputed
(including possible re-reading of the input data). To avoid recomputations, you should cache the output
DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>


batchDF.persist()
batchDF.write.format(...).save(...) // location 1
batchDF.write.format(...).save(...) // location 2
batchDF.unpersist()
}

NOTE
If you are running multiple Spark jobs on the batchDF , the input data rate of the streaming query (reported through
StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at
which data is generated at the source. This is because the input data may be read multiple times in the multiple Spark
jobs per batch.

Apply additional DataFrame operations


Many DataFrame and Dataset operations are not supported in streaming DataFrames because Spark does not
support generating incremental plans in those cases. Using foreachBatch() you can apply some of these
operations on each micro-batch output. For example, you can use foreachBatch() and the SQL MERGE INTO
operation to write the output of streaming aggregations into a Delta table in Update mode. See more details in
MERGE INTO.
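The following is a minimal sketch of that pattern (not taken from this article). It assumes a Delta table at the hypothetical path /mnt/delta/aggregates keyed by an id column, and a streaming aggregation DataFrame named aggDF.

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Upsert each micro-batch into the target Delta table with MERGE.
    target = DeltaTable.forPath(spark, "/mnt/delta/aggregates")  # hypothetical path
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "s.id = t.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

query = (aggDF.writeStream
    .foreachBatch(upsert_to_delta)
    .outputMode("update")
    .start())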
IMPORTANT
foreachBatch() provides only at-least-once write guarantees. However, you can use the batchId provided to the
function as way to deduplicate the output and get an exactly-once guarantee. In either case, you will have to reason
about the end-to-end semantics yourself.
foreachBatch() does not work with the continuous processing mode as it fundamentally relies on the micro-batch
execution of a streaming query. If you write data in continuous mode, use foreach() instead.
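For the first point, here is a hedged sketch of using batchId to skip micro-batches that were already written before a query restart. last_committed_batch_id and write_batch_with_id are hypothetical helpers for your sink, not Databricks APIs; how you record the committed batch ID depends entirely on the target system.

def write_idempotent(batch_df, batch_id):
    # Skip batches that were already committed to the sink before a restart.
    if batch_id <= last_committed_batch_id("my-query"):   # hypothetical helper
        return
    # Write the data and record batch_id in the same transaction so the
    # commit is atomic from the sink's point of view.
    write_batch_with_id(batch_df, batch_id)               # hypothetical helper

streamingDF.writeStream.foreachBatch(write_idempotent).start()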

Write to any location using foreach()


If foreachBatch() is not an option (for example, you are using a Databricks Runtime lower than 4.2, or a
corresponding batch data writer does not exist), then you can express your custom writer logic using foreach() .
Specifically, you can express the data writing logic by dividing it into three methods: open() , process() , and
close() .

Using Scala or Java


In Scala or Java, you extend the class ForeachWriter:

datasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
      true // Return true to process the rows in this partition
    }

    def process(record: String) = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()

Using Python
In Python, you can invoke foreach in two ways: in a function or in an object. The function offers a simple way to
express your processing logic but does not allow you to deduplicate generated data when failures cause
reprocessing of some input data. For that situation you must specify the processing logic in an object.
The function takes a row as input.

def processRow(row):
    # Write row to storage
    pass

query = streamingDF.writeStream.foreach(processRow).start()

The object has a process method and optional open and close methods:

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        # Return True so that process() is called for the rows in this partition.
        return True

    def process(self, row):
        # Write row to connection. This method is not optional in Python.
        pass

    def close(self, error):
        # Close the connection. This method is optional in Python.
        pass

query = streamingDF.writeStream.foreach(ForeachWriter()).start()

Execution semantics
When the streaming query is started, Spark calls the function or the object’s methods in the following way:
A single copy of this object is responsible for all the data generated by a single task in a query. In other
words, one instance is responsible for processing one partition of the data generated in a distributed
manner.
This object must be serializable, because each task will get a fresh serialized-deserialized copy of the
provided object. Hence, it is strongly recommended that any initialization for writing data (for example,
opening a connection or starting a transaction) is done after the open() method has been called, which signifies
that the task is ready to generate data.
The lifecycle of the methods are as follows:
For each partition with partition_id :
For each batch/epoch of streaming data with epoch_id :
Method open(partitionId, epochId) is called.
If open(...) returns true, for each row in the partition and batch/epoch, method process(row) is called.
Method close(error) is called with error (if any) seen while processing rows.
The close() method (if it exists) is called if an open() method exists and returns successfully
(irrespective of the return value), except if the JVM or Python process crashes in the middle.

NOTE
The partitionId and epochId in the open() method can be used to deduplicate generated data when failures cause
reprocessing of some input data. This depends on the execution mode of the query. If the streaming query is being
executed in the micro-batch mode, then every partition represented by a unique tuple (partition_id, epoch_id) is
guaranteed to have the same data. Hence, (partition_id, epoch_id) can be used to deduplicate and/or
transactionally commit data and achieve exactly-once guarantees. However, if the streaming query is being executed in
the continuous mode, then this guarantee does not hold and therefore should not be used for deduplication.
Delta Lake tables
7/21/2022 • 2 minutes to read

Delta tables can be both sources and sinks for streaming queries. For more information, see the Delta Lake
streaming guide.
(Deprecated) Azure Blob storage file source with
Azure Queue Storage
7/21/2022 • 3 minutes to read

IMPORTANT
The Databricks ABS-AQS connector is deprecated. Databricks recommends using Auto Loader instead.

The ABS-AQS connector provides an optimized file source that uses Azure Queue Storage (AQS) to find new
files written to an Azure Blob storage (ABS) container without repeatedly listing all of the files. This provides two
advantages:
Lower latency: no need to list nested directory structures on ABS, which is slow and resource intensive.
Lower costs: no more costly LIST API requests made to ABS.

NOTE
The ABS-AQS source deletes messages from the AQS queue as it consumes events. If you want other pipelines to
consume messages from this queue, set up a separate AQS queue for the optimized reader. You can set up multiple Event
Grid Subscriptions to publish to different queues.

Use the ABS-AQS file source


To use the ABS-AQS file source you must:
Set up ABS event notifications by leveraging Azure Event Grid Subscriptions and route them to AQS. See
Reacting to Blob storage events.
Specify the fileFormat, queueName, and connectionString options and a schema. For example:

spark.readStream \
.format("abs-aqs") \
.option("fileFormat", "json") \
.option("queueName", ...) \
.option("connectionString", ...) \
.schema(...) \
.load()

Authenticate with Azure Queue Storage and Blob storage


To authenticate with Azure Queue Storage and Blob storage, use Shared Access Signature (SAS) tokens or
storage account keys. You must provide a connection string for the storage account where your queue is
deployed that contains either your SAS token or access keys to your storage account. For more information, see
Configure Azure Storage connection strings.
You will also need to provide access to your Azure Blob storage containers. See Accessing Azure Data Lake
Storage Gen2 and Blob Storage with Azure Databricks for information on how to configure access to your Azure
Blob storage container.

NOTE
We strongly recommend that you use Secrets for providing your connection strings.
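For example, here is a minimal sketch (the secret scope, key, queue name, and schema below are placeholder assumptions) of reading the connection string from a secret with Databricks Utilities and passing it to the ABS-AQS source:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical secret scope/key and queue name; replace with your own values.
# dbutils is available in Databricks notebooks.
connection_string = dbutils.secrets.get(scope="storage-secrets", key="aqs-connection-string")

schema = StructType([
    StructField("id", LongType(), True),
    StructField("body", StringType(), True),
])

df = (spark.readStream
    .format("abs-aqs")
    .option("fileFormat", "json")
    .option("queueName", "my-aqs-queue")
    .option("connectionString", connection_string)
    .schema(schema)
    .load())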

Configuration
Option | Type | Default | Description
allowOverwrites | Boolean | true | Whether a blob that gets overwritten should be reprocessed.
connectionString | String | None (required param) | The connection string to access your queue.
fetchParallelism | Integer | 1 | Number of threads to use when fetching messages from the queueing service.
fileFormat | String | None (required param) | The format of the files, such as parquet , json , csv , text , and so on.
ignoreFileDeletion | Boolean | false | If you have lifecycle configurations or you delete the source files manually, you must set this option to true .
maxFileAge | Integer | 604800 | Determines how long (in seconds) file notifications are stored as state to prevent duplicate processing.
pathRewrites | A JSON string | "{}" | If you use mount points, you can rewrite the prefix of the container@storageAccount/key path with the mount point. Only prefixes can be rewritten. For example, for the configuration {"myContainer@myStorageAccount/path": "dbfs:/mnt/data-warehouse"}, the path wasbs://myContainer@myStorageAccount.blob.core.windows.net/path/2017/08/fileA.json is rewritten to dbfs:/mnt/data-warehouse/2017/08/fileA.json.
queueFetchInterval | A duration string, for example, 2m for 2 minutes | "5s" | How long to wait between fetches if the queue is empty. Azure charges per API request to AQS, so if data isn't arriving frequently, this value can be set to a long duration. As long as the queue is not empty, fetching is continuous. If new files are created every 5 minutes, you might want to set a high queueFetchInterval to reduce AQS costs.
queueName | String | None (required param) | The name of the AQS queue.

If you observe many messages in the driver logs that look like Fetched 0 new events and 3 old events. , where
you tend to observe many more old events than new, you should reduce the trigger interval of your stream.
If you are consuming files from a location on Blob storage where you expect that some files may be deleted
before they can be processed, you can set the following configuration to ignore the error and continue
processing:

spark.sql("SET spark.sql.files.ignoreMissingFiles=true")

Frequently asked questions (FAQ)


If ignoreFileDeletion is False (default) and the object has been deleted, will it fail the whole
pipeline?
Yes, if we receive an event stating that the file was deleted, it will fail the whole pipeline.
How should I set maxFileAge ?
Azure Queue Storage provides at-least-once message delivery semantics, therefore we need to keep state for
deduplication. The default setting for maxFileAge is 7 days, which is equal to the maximum TTL of a message in
the queue.
Clusters
7/21/2022 • 2 minutes to read

An Azure Databricks cluster is a set of computation resources and configurations on which you run data
engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics,
ad-hoc analytics, and machine learning.
You run these workloads as a set of commands in a notebook or as an automated job. Azure Databricks makes a
distinction between all-purpose clusters and job clusters. You use all-purpose clusters to analyze data
collaboratively using interactive notebooks. You use job clusters to run fast and robust automated jobs.
You can create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart
an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and
terminates the cluster when the job is complete. You cannot restart a job cluster.
This section describes how to work with clusters using the UI. For other methods, see Clusters CLI and Clusters
API 2.0.
This section also focuses more on all-purpose than job clusters, although many of the configurations and
management tools described apply equally to both cluster types. To learn more about creating job clusters, see
Jobs.

IMPORTANT
Azure Databricks retains cluster configuration information for up to 200 all-purpose clusters terminated in the last 30
days and up to 30 job clusters recently terminated by the job scheduler. To keep an all-purpose cluster configuration even
after it has been terminated for more than 30 days, an administrator can pin a cluster to the cluster list.

In this section:
Create a cluster
Use the Create button
Use the cluster UI
Terraform integration
Manage clusters
Display clusters
Pin a cluster
View a cluster configuration as a JSON file
Edit a cluster
Clone a cluster
Control access to clusters
Start a cluster
Terminate a cluster
Delete a cluster
Restart a cluster to update it with the latest images
View cluster information in the Apache Spark UI
View cluster logs
Monitor performance
Decommission spot instances
Configure clusters
Cluster policy
Cluster mode
Pools
Databricks Runtime
Cluster node type
Cluster size and autoscaling
Autoscaling local storage
Local disk encryption
Security mode
Spark configuration
Retrieve a Spark configuration property from a secret
Environment variables
Cluster tags
SSH access to clusters
Cluster log delivery
Init scripts
Best practices: Cluster configuration
Cluster features
Cluster sizing considerations
Common scenarios
Task preemption
Preemption options
Customize containers with Databricks Container Services
Requirements
Step 1: Build your base
Step 2: Push your base image
Step 3: Launch your cluster
Use an init script
Cluster node initialization scripts
Init script types
Init script execution order
Environment variables
Logging
Cluster-scoped init scripts
Global init scripts
GPU-enabled clusters
Overview
Create a GPU cluster
GPU scheduling
NVIDIA GPU driver, CUDA, and cuDNN
Databricks Container Services on GPU clusters
Single Node clusters
Create a Single Node cluster
Single Node cluster properties
Limitations
REST API
Single Node cluster policy
Single Node job cluster policy
Pools
Display pools
Create a pool
Configure pools
Edit a pool
Delete a pool
Attach a cluster to one or more pools
Best practices: pools
Web terminal
Requirements
Launch the web terminal
Limitations
Debugging with the Apache Spark UI
Spark UI
Driver logs
Executor logs
Create a cluster
7/21/2022 • 2 minutes to read

There are two types of clusters:


All-Purpose clusters can be shared by multiple users. These are typically used to run notebooks. All-Purpose
clusters remain active until you terminate them.
Job clusters run a job. You create a job cluster when you create a job. Such clusters are terminated
automatically after the job is completed.
This article describes how to create an all-purpose cluster. To learn how to create job clusters, see Create a job.
You can also create a cluster using the Clusters API 2.0.

NOTE
You must have permission to create a cluster. See Configure cluster creation entitlement.
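As mentioned above, you can also create clusters with the Clusters API 2.0. Here is a minimal sketch in Python; the workspace URL, token, and cluster settings are placeholder assumptions rather than values from this article.

import requests

# Placeholder workspace URL and personal access token.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "cluster_name": "my-cluster",
    "spark_version": "10.4.x-scala2.12",   # example runtime version
    "node_type_id": "Standard_DS3_v2",     # example node type
    "autotermination_minutes": 120,
    "num_workers": 2,
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # ID of the new cluster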

Use the Create button


The easiest way to create a new cluster is to use the Create button:

1. Click Create in the sidebar and select Cluster from the menu. The Create Cluster page appears.
2. Name and configure the cluster.
There are many cluster configuration options, which are described in detail in cluster configuration.
3. Click the Create Cluster button.
The cluster Configuration tab displays a spinning progress indicator while the cluster is in a pending
state. When the cluster has started and is ready to use, the progress spinner turns into a green circle with
a check mark. This indicates that the cluster is in the running state, and you can now attach notebooks and
start running commands and queries.

Use the cluster UI


1. Click Compute in the sidebar.
2. Click the Create Cluster button.

3. Follow steps 2 and 3 in Use the Create button.

Terraform integration
You can manage clusters in a fully automated setup using Databricks Terraform provider and databricks_cluster:
data "databricks_node_type" "smallest" {
local_disk = true
}

data "databricks_spark_version" "latest_lts" {


long_term_support = true
}

resource "databricks_cluster" "shared_autoscaling" {


cluster_name = "Shared Autoscaling"
spark_version = data.databricks_spark_version.latest_lts.id
node_type_id = data.databricks_node_type.smallest.id
autotermination_minutes = 20
autoscale {
min_workers = 1
max_workers = 50
}
}
Manage clusters
7/21/2022 • 15 minutes to read

This article describes how to manage Azure Databricks clusters, including displaying, editing, starting,
terminating, deleting, controlling access, and monitoring performance and logs.

Display clusters
To display the clusters in your workspace, click Compute in the sidebar.
The Compute page displays clusters in two tabs: All-purpose clusters and Job clusters .

At the left side are two columns indicating whether the cluster has been pinned and the status of the cluster:
Pinned
Starting, Terminating
Standard cluster: Running, Terminated
High concurrency cluster: Running, Terminated
Access denied: Running, Terminated
Table ACLs enabled: Running, Terminated
At the far right of the All-purpose clusters tab is an icon you can use to terminate the cluster.
You can use the three-button menu to restart, clone, delete, or edit permissions for the cluster. Menu options
that are not available are grayed out.

The All-purpose clusters tab shows the numbers of notebooks attached to the cluster.
Filter cluster list
You can filter the cluster lists using the buttons and search box at the top right:

Pin a cluster
30 days after a cluster is terminated, it is permanently deleted. To keep an all-purpose cluster configuration even
after a cluster has been terminated for more than 30 days, an administrator can pin the cluster. Up to 100
clusters can be pinned.
You can pin a cluster from the cluster list or the cluster detail page:
Pin cluster from cluster list
To pin or unpin a cluster, click the pin icon to the left of the cluster name.

Pin cluster from cluster detail page


To pin or unpin a cluster, click the pin icon to the right of the cluster name.

You can also invoke the Pin API endpoint to programmatically pin a cluster.
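For example, a minimal sketch of pinning a cluster through the API (the workspace URL, token, and cluster ID are placeholders):

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                             # placeholder

requests.post(
    f"{host}/api/2.0/clusters/pin",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "0123-456789-abcd123"},               # placeholder cluster ID
).raise_for_status()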

View a cluster configuration as a JSON file


Sometimes it can be helpful to view your cluster configuration as JSON. This is especially useful when you want
to create similar clusters using the Clusters API 2.0. When you view an existing cluster, simply go to the
Configuration tab, click JSON in the top right of the tab, copy the JSON, and paste it into your API call. JSON
view is read-only.
Edit a cluster
You edit a cluster configuration from the cluster detail page. To display the cluster detail page, click the cluster
name on the Compute page.

You can also invoke the Edit API endpoint to programmatically edit the cluster.

NOTE
Notebooks and jobs that were attached to the cluster remain attached after editing.
Libraries installed on the cluster remain installed after editing.
If you edit any attribute of a running cluster (except for the cluster size and permissions), you must restart it. This can
disrupt users who are currently using the cluster.
You can edit only running or terminated clusters. You can, however, update permissions for clusters that are not in
those states on the cluster details page.

For detailed information about cluster configuration properties you can edit, see Configure clusters.

Clone a cluster
You can create a new cluster by cloning an existing cluster.

From the cluster list, click the three-button menu and select Clone from the drop down.
From the cluster detail page, click and select Clone from the drop down.

The cluster creation form is opened prepopulated with the cluster configuration. The following attributes from
the existing cluster are not included in the clone:
Cluster permissions
Installed libraries
Attached notebooks

Control access to clusters


Cluster access control within the Admin Console allows admins and delegated users to give fine-grained cluster
access to other users. There are two types of cluster access control:
Cluster creation permission: Admins can choose which users are allowed to create clusters.

Cluster-level permissions: A user who has the Can manage permission for a cluster can configure
whether other users can attach to, restart, resize, and manage that cluster from the cluster list or the
cluster details page.

From the cluster list, click the three-button menu and select Edit Permissions .
From the cluster detail page, click and select Permissions .

To learn how to configure cluster access control and cluster-level permissions, see Cluster access control.

Start a cluster
Apart from creating a new cluster, you can also start a previously terminated cluster. This lets you re-create a
previously terminated cluster with its original configuration.
You can start a cluster from the cluster list, the cluster detail page, or a notebook.
To start a cluster from the cluster list, click the arrow:

To start a cluster from the cluster detail page, click Start:

Notebook cluster attach drop-down

You can also invoke the Start API endpoint to programmatically start a cluster.
Azure Databricks identifies a cluster with a unique cluster ID. When you start a terminated cluster, Databricks re-
creates the cluster with the same ID, automatically installs all the libraries, and re-attaches the notebooks.

NOTE
If you are using a Trial workspace and the trial has expired, you will not be able to start a cluster.

Cluster autostart for jobs


When a job assigned to an existing terminated cluster is scheduled to run or you connect to a terminated cluster
from a JDBC/ODBC interface, the cluster is automatically restarted. See Create a job and JDBC connect.
Cluster autostart allows you to configure clusters to autoterminate without requiring manual intervention to
restart the clusters for scheduled jobs. Furthermore, you can schedule cluster initialization by scheduling a job
to run on a terminated cluster.
Before a cluster is restarted automatically, cluster and job access control permissions are checked.

NOTE
If your cluster was created in Azure Databricks platform version 2.70 or earlier, there is no autostart: jobs scheduled to
run on terminated clusters will fail.

Terminate a cluster
To save cluster resources, you can terminate a cluster. A terminated cluster cannot run notebooks or jobs, but its
configuration is stored so that it can be reused (or—in the case of some types of jobs—autostarted) at a later
time. You can manually terminate a cluster or configure the cluster to automatically terminate after a specified
period of inactivity. Azure Databricks records information whenever a cluster is terminated. When the number of
terminated clusters exceeds 150, the oldest clusters are deleted.
Unless a cluster is pinned, 30 days after the cluster is terminated, it is automatically and permanently deleted.
Terminated clusters appear in the cluster list with a gray circle at the left of the cluster name.

NOTE
When you run a job on a New Job Cluster (which is usually recommended), the cluster terminates and is unavailable for
restarting when the job is complete. On the other hand, if you schedule a job to run on an Existing All-Purpose Cluster
that has been terminated, that cluster will autostart.

IMPORTANT
If you are using a Trial Premium workspace, all running clusters are terminated:
When you upgrade a workspace to full Premium.
If the workspace is not upgraded and the trial expires.

Manual termination
You can manually terminate a cluster from the cluster list or the cluster detail page.
To terminate a cluster from the cluster list, click the square:
To terminate a cluster from the cluster detail page, click Terminate :

Automatic termination
You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in
minutes after which you want the cluster to terminate. If the difference between the current time and the last
command run on the cluster is more than the inactivity period specified, Azure Databricks automatically
terminates that cluster.
A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming,
and JDBC calls, have finished executing.

WARNING
Clusters do not report activity resulting from the use of DStreams. This means that an autoterminating cluster may be
terminated while it is running DStreams. Turn off auto termination for clusters running DStreams or consider using
Structured Streaming.
The auto termination feature monitors only Spark jobs, not user-defined local processes. Therefore, if all Spark jobs
have completed, a cluster may be terminated even if local processes are running.
Idle clusters continue to accumulate DBU and cloud instance charges during the inactivity period before termination.

Configure automatic termination


You configure automatic termination in the Auto Termination field in the Autopilot Options box on the
cluster creation page:

IMPORTANT
The default value of the auto terminate setting depends on whether you choose to create a standard or high concurrency
cluster:
Standard clusters are configured to terminate automatically after 120 minutes.
High concurrency clusters are configured to not terminate automatically.

You can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity
period of 0 .

NOTE
Auto termination is best supported in the latest Spark versions. Older Spark versions have known limitations which can
result in inaccurate reporting of cluster activity. For example, clusters running JDBC, R, or streaming commands can report
a stale activity time that leads to premature cluster termination. Please upgrade to the most recent Spark version to
benefit from bug fixes and improvements to auto termination.

Unexpected termination
Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured
automatic termination.
For a list of termination reasons and remediation steps, see the Knowledge Base.

Delete a cluster
Deleting a cluster terminates the cluster and removes its configuration.

WARNING
You cannot undo this action.

You cannot delete a pinned cluster. In order to delete a pinned cluster, it must first be unpinned by an
administrator.

From the cluster list, click the three-button menu and select Delete from the drop down.

From the cluster detail page, click and select Delete from the drop down.

You can also invoke the Permanent delete API endpoint to programmatically delete a cluster.

Restart a cluster to update it with the latest images


When you restart a cluster, it gets the latest images for the compute resource containers and the VM hosts. It is
particularly important to schedule regular restarts for long-running clusters, which are often used for
applications such as processing streaming data.
It is your responsibility to restart all compute resources regularly to keep the image up-to-date with the latest
image version.
IMPORTANT
If you enable the compliance security profile for your account or your workspace, long-running clusters are automatically
restarted after 25 days. Databricks recommends that admins restart clusters before they run for 25 days and do so
during a scheduled maintenance window. This reduces the risk of an auto-restart disrupting a scheduled job.

You can restart a cluster in multiple ways:


Use the UI to restart a cluster from the cluster detail page. To display the cluster detail page, click the cluster
name on the Compute page. Click Restart.
Use the Clusters API to restart a cluster.
Use the script that Azure Databricks provides to determine how long your clusters have run, and optionally
restart them if they exceed a specified number of days since they were started.
Run a script that determines how many days your clusters have been running, and optionally restart them
If you are a workspace admin, you can run a script that determines how long each of your clusters has been
running, and optionally restart them if they are older than a specified number of days. Azure Databricks
provides this script as a notebook.
The first lines of the script define configuration parameters:
min_age_output : The maximum number of days that a cluster can run. Default is 1.
perform_restart : If True , the script restarts clusters with age greater than the number of days specified by
min_age_output . The default is False , which identifies the long running clusters but does not restart them.
secret_configuration : Replace REPLACE_WITH_SCOPE and REPLACE_WITH_KEY with a secret scope and key name.
For more details of setting up the secrets, see the notebook.

WARNING
If you set perform_restart to True , the script automatically restarts eligible clusters, which can cause active jobs to
fail and reset open notebooks. To reduce the risk of disrupting your workspace’s business critical jobs, plan a scheduled
maintenance window and be sure to notify workspace users.

Identify and optionally restart long-running clusters notebook


Get notebook

View cluster information in the Apache Spark UI


You can view detailed information about Spark jobs in the Spark UI, which you can access from the Spark UI
tab on the cluster details page.
You can get details about active and terminated clusters.
If you restart a terminated cluster, the Spark UI displays information for the restarted cluster, not the historical
information for the terminated cluster.

View cluster logs


Azure Databricks provides three kinds of logging of cluster-related activity:
Cluster event logs, which capture cluster lifecycle events, like creation, termination, configuration edits, and
so on.
Apache Spark driver and worker logs, which you can use for debugging.
Cluster init-script logs, valuable for debugging init scripts.
This section discusses cluster event logs and driver and worker logs. For details about init-script logs, see Init
script logs.
Cluster event logs
The cluster event log displays important cluster lifecycle events that are triggered manually by user actions or
automatically by Azure Databricks. Such events affect the operation of a cluster as a whole and the jobs running
in the cluster.
For supported event types, see the REST API ClusterEventType data structure.
Events are stored for 60 days, which is comparable to other data retention times in Azure Databricks.
View a cluster event log

1. Click Compute in the sidebar.


2. Click a cluster name.
3. Click the Event Log tab.

To filter the events, click in the Filter by Event Type… field and select one or more event type
checkboxes.
Use Select all to make it easier to filter by excluding particular event types.
View event details
For more information about an event, click its row in the log and then click the JSON tab for details.

Cluster driver and worker logs


The direct print and log statements from your notebooks, jobs, and libraries go to the Spark driver logs. These
logs have three outputs:
Standard output
Standard error
Log4j logs
You can access these files from the Driver logs tab on the cluster details page. Click the name of a log file to
download it.
To view Spark worker logs, you can use the Spark UI. You can also configure a log delivery location for the
cluster. Both worker and cluster logs are delivered to the location you specify.

Monitor performance
To help you monitor the performance of Azure Databricks clusters, Azure Databricks provides access to Ganglia
metrics from the cluster details page.
In addition, you can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure
Monitor, the monitoring platform for Azure.
You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account.
Ganglia metrics
To access the Ganglia UI, navigate to the Metrics tab on the cluster details page. CPU metrics are available in the
Ganglia UI for all Databricks runtimes. GPU metrics are available for GPU-enabled clusters.

To view live metrics, click the Ganglia UI link.


To view historical metrics, click a snapshot file. The snapshot contains aggregated metrics for the hour preceding
the selected time.
Configure metrics collection
By default, Azure Databricks collects Ganglia metrics every 15 minutes. To configure the collection period, set the
DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES environment variable using an init script or in the spark_env_vars
field in the Cluster Create API.
Azure Monitor
You can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure Monitor,
the monitoring platform for Azure. For complete instructions, see Monitoring Azure Databricks.
NOTE
If you have deployed the Azure Databricks workspace in your own virtual network and you have configured network
security groups (NSG) to deny all outbound traffic that is not required by Azure Databricks, then you must configure an
additional outbound rule for the “AzureMonitor” service tag.

Datadog metrics

You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account. The
following notebook demonstrates how to install a Datadog agent on a cluster using a cluster-scoped init script.
To install the Datadog agent on all clusters, use a global init script after testing the cluster-scoped init script.
Install Datadog agent init script notebook
Get notebook

Decommission spot instances


NOTE
This feature is available on Databricks Runtime 8.0 and above.

Because spot instances can reduce costs, creating clusters using spot instances rather than on-demand instances
is a common way to run jobs. However, spot instances can be preempted by cloud provider scheduling
mechanisms. Preemption of spot instances can cause issues with jobs that are running, including:
Shuffle fetch failures
Shuffle data loss
RDD data loss
Job failures
You can enable decommissioning to help address these issues. Decommissioning takes advantage of the
notification that the cloud provider usually sends before a spot instance is decommissioned. When a spot
instance containing an executor receives a preemption notification, the decommissioning process will attempt to
migrate shuffle and RDD data to healthy executors. The duration before the final preemption is typically 30
seconds to 2 minutes, depending on the cloud provider.
Databricks recommends enabling data migration when decommissioning is also enabled. Generally, the
possibility of errors decreases as more data is migrated, including shuffle fetching failures, shuffle data loss, and
RDD data loss. Data migration can also lead to less re-computation and save cost.
Decommissioning is best effort and does not guarantee that all data can be migrated before final preemption.
Decommissioning cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data
from the executor.
With decommissioning enabled, task failures caused by spot instance preemption are not added to the total
number of failed attempts. Task failures caused by preemption are not counted as failed attempts because the
cause of the failure is external to the task and will not result in job failure.
To enable decommissioning, you set Spark configuration settings and environment variables when you create a
cluster:
To enable decommissioning for applications:

spark.decommission.enabled true

To enable shuffle data migration during decommissioning:

spark.storage.decommission.enabled true
spark.storage.decommission.shuffleBlocks.enabled true

To enable RDD cache data migration during decommissioning:

NOTE
When RDD StorageLevel replication is set to more than 1, Databricks does not recommend enabling RDD data
migration since the replicas ensure RDDs will not lose data.

spark.storage.decommission.enabled true
spark.storage.decommission.rddBlocks.enabled true

To enable decommissioning for workers:

SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"

To set these custom Spark configuration properties:


1. On the New Cluster page, click the Advanced Options toggle.
2. Click the Spark tab.
To access a worker’s decommission status from the UI, navigate to the Spark Cluster UI - Master tab:

When the decommissioning finishes, the executor that decommissioned shows the loss reason in the Spark UI
> Executors tab on the cluster’s details page:
Configure clusters
7/21/2022 • 17 minutes to read

This article explains the configuration options available when you create and edit Azure Databricks clusters. It
focuses on creating and editing clusters using the UI. For other methods, see Clusters CLI, Clusters API 2.0, and
Databricks Terraform provider.
For help deciding what combination of configuration options suits your needs best, see cluster configuration
best practices.

Cluster policy
A cluster policy limits the ability to configure clusters based on a set of rules. The policy rules limit the attributes
or attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users
and groups and thus limit which policies you can select when you create a cluster.
To configure a cluster policy, select the cluster policy in the Policy drop-down.
NOTE
If no policies have been created in the workspace, the Policy drop-down does not display.

If you have:
Cluster create permission, you can select the Unrestricted policy and create fully-configurable clusters. The
Unrestricted policy does not limit any cluster attributes or attribute values.
Both cluster create permission and access to cluster policies, you can select the Unrestricted policy and the
policies you have access to.
Access to cluster policies only, you can select the policies you have access to.

Cluster mode
Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node. The default
cluster mode is Standard.

IMPORTANT
If your workspace is assigned to a Unity Catalog metastore, High Concurrency clusters are not available. Instead, you
use security mode to ensure the integrity of access controls and enforce strong isolation guarantees. See also Create a
Data Science & Engineering cluster.
You cannot change the cluster mode after a cluster is created. If you want a different cluster mode, you must create a
new cluster.

NOTE
The cluster configuration includes an auto terminate setting whose default value depends on cluster mode:
Standard and Single Node clusters terminate automatically after 120 minutes by default.
High Concurrency clusters do not terminate automatically by default.

Standard clusters
A Standard cluster is recommended for a single user. Standard clusters can run workloads developed in any
language: Python, SQL, R, and Scala.
High Concurrency clusters
A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that
they provide fine-grained sharing for maximum resource utilization and minimum query latencies.
High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of
High Concurrency clusters is provided by running user code in separate processes, which is not possible in
Scala.
In addition, only High Concurrency clusters support table access control.
To create a High Concurrency cluster, set Cluster Mode to High Concurrency .

For an example of how to create a High Concurrency cluster using the Clusters API, see High Concurrency
cluster example.
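As a rough sketch of the kind of payload involved (not taken from this article; confirm the exact settings against the linked High Concurrency cluster example), the API request typically sets the cluster profile and allowed languages in spark_conf:

# Hypothetical payload sketch for POST /api/2.0/clusters/create.
# All values are placeholders; verify against the Clusters API documentation.
payload = {
    "cluster_name": "high-concurrency-cluster",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "spark_conf": {
        "spark.databricks.cluster.profile": "serverless",
        "spark.databricks.repl.allowedLanguages": "sql,python,r",
    },
    "custom_tags": {"ResourceClass": "Serverless"},
}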
Single Node clusters
A Single Node cluster has no workers and runs Spark jobs on the driver node.
In contrast, a Standard cluster requires at least one Spark worker node in addition to the driver node to execute
Spark jobs.
To create a Single Node cluster, set Cluster Mode to Single Node .

To learn more about working with Single Node clusters, see Single Node clusters.

Pools
To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances, for the driver and
worker nodes. The cluster is created using instances in the pools. If a pool does not have sufficient idle resources
to create the requested driver or worker nodes, the pool expands by allocating new instances from the instance
provider. When an attached cluster is terminated, the instances it used are returned to the pools and can be
reused by a different cluster.
If you select a pool for worker nodes but not for the driver node, the driver node inherits the pool from the
worker node configuration.

IMPORTANT
If you attempt to select a pool for the driver node but not for worker nodes, an error occurs and your cluster isn’t created.
This requirement prevents a situation where the driver node has to wait for worker nodes to be created, or vice versa.

See Pools to learn more about working with pools in Azure Databricks.

Databricks Runtime
Databricks runtimes are the set of core components that run on your clusters. All Databricks runtimes include
Apache Spark and add components and updates that improve usability, performance, and security. For details,
see Databricks runtimes.
Azure Databricks offers several types of runtimes and several versions of those runtime types in the Databricks
Runtime Version drop-down when you create or edit a cluster.

Photon acceleration

IMPORTANT
This feature is in Public Preview.

NOTE
Available in Databricks Runtime 8.3 and above.

To enable Photon acceleration, select the Use Photon Acceleration checkbox.

If desired, you can specify the instance type in the Worker Type and Driver Type drop-down.
Databricks recommends the following instance types for optimal price and performance:
Standard_E4ds_v4
Standard_E8ds_v4
Standard_E16ds_v4
You can view Photon activity in the Spark UI. The following screenshot shows the query details DAG. There are
two indications of Photon in the DAG. First, Photon operators start with “Photon”, for example,
PhotonGroupingAgg . Second, in the DAG, Photon operators and stages are colored peach, while the non-Photon
ones are blue.
Docker images
For some Databricks Runtime versions, you can specify a Docker image when you create a cluster. Example use
cases include library customization, a golden container environment that doesn’t change, and Docker CI/CD
integration.
You can also use Docker images to create custom deep learning environments on clusters with GPU devices.
For instructions, see Customize containers with Databricks Container Services and Databricks Container
Services on GPU clusters.

Cluster node type


A cluster consists of one driver node and zero or more worker nodes.
You can pick separate cloud provider instance types for the driver and worker nodes, although by default the
driver node uses the same instance type as the worker node. Different families of instance types fit different use
cases, such as memory-intensive or compute-intensive workloads.
NOTE
If your security requirements include compute isolation, select a Standard_F72s_V2 instance as your worker type. These
instance types represent isolated virtual machines that consume the entire physical host and provide the necessary level
of isolation required to support, for example, US Department of Defense Impact Level 5 (IL5) workloads.

Driver node
Worker node
GPU instance types
Spot instances
Driver node
The driver node maintains state information of all notebooks attached to the cluster. The driver node also
maintains the SparkContext and interprets all the commands you run from a notebook or a library on the
cluster, and runs the Apache Spark master that coordinates with the Spark executors.
The default value of the driver node type is the same as the worker node type. You can choose a larger driver
node type with more memory if you are planning to collect() a lot of data from Spark workers and analyze
them in the notebook.

TIP
Since the driver node maintains all of the state information of the notebooks attached, make sure to detach unused
notebooks from the driver node.

Worker node
Azure Databricks worker nodes run the Spark executors and other services required for the proper functioning
of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on
worker nodes. Azure Databricks runs one executor per worker node; therefore the terms executor and worker
are used interchangeably in the context of the Azure Databricks architecture.

TIP
To run a Spark job, you need at least one worker node. If a cluster has zero workers, you can run non-Spark commands
on the driver node, but Spark commands will fail.

GPU instance types


For computationally challenging tasks that demand high performance, like those associated with deep learning,
Azure Databricks supports clusters accelerated with graphics processing units (GPUs). For more information, see
GPU-enabled clusters.
Spot instances
To save cost, you can choose to use spot instances, also known as Azure Spot VMs by checking the Spot
instances checkbox.

The first instance will always be on-demand (the driver node is always on-demand) and subsequent instances
will be spot instances. If spot instances are evicted due to unavailability, on-demand instances are deployed to
replace evicted instances.
Cluster size and autoscaling
When you create an Azure Databricks cluster, you can either provide a fixed number of workers for the cluster or
provide a minimum and maximum number of workers for the cluster.
When you provide a fixed size cluster, Azure Databricks ensures that your cluster has the specified number of
workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of
workers required to run your job. This is referred to as autoscaling.
With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your
job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks
automatically adds additional workers during these phases of your job (and removes them when they’re no
longer needed).
Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to
match a workload. This applies especially to workloads whose requirements change over time (like exploring a
dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning
requirements are unknown. Autoscaling thus offers two advantages:
Workloads can run faster compared to a constant-sized under-provisioned cluster.
Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.
Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these
benefits at the same time. The cluster size can go below the minimum number of workers selected when the
cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances
in order to maintain the minimum number of workers.

NOTE
Autoscaling is not available for spark-submit jobs.

How autoscaling behaves


Scales up from min to max in 2 steps.
Can scale down even if the cluster is not idle by looking at shuffle file state.
Scales down based on a percentage of current nodes.
On job clusters, scales down if the cluster is underutilized over the last 40 seconds.
On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds.
The spark.databricks.aggressiveWindowDownS Spark configuration property specifies in seconds how often a
cluster makes down-scaling decisions. Increasing the value causes a cluster to scale down more slowly. The
maximum value is 600.
Enable and configure autoscaling
To allow Azure Databricks to resize your cluster automatically, you enable autoscaling for the cluster and provide
the min and max range of workers.
1. Enable autoscaling.
All-Purpose cluster - On the Create Cluster page, select the Enable autoscaling checkbox in the
Autopilot Options box:

Job cluster - On the Configure Cluster page, select the Enable autoscaling checkbox in the
Autopilot Options box:

2. Configure the min and max workers.

When the cluster is running, the cluster detail page displays the number of allocated workers. You can
compare the number of allocated workers with the worker configuration and make adjustments as needed.

IMPORTANT
If you are using an instance pool:
Make sure the cluster size requested is less than or equal to the minimum number of idle instances in the pool. If it is
larger, cluster startup time will be equivalent to a cluster that doesn’t use a pool.
Make sure the maximum cluster size is less than or equal to the maximum capacity of the pool. If it is larger, the cluster
creation will fail.
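If you configure clusters through the Clusters API 2.0 rather than the UI, autoscaling is expressed with the autoscale field instead of num_workers. The following fragment is a minimal sketch of a create or edit request body; the worker counts are illustrative:

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}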

Autoscaling example
If you reconfigure a static cluster to be an autoscaling cluster, Azure Databricks immediately resizes the cluster
within the minimum and maximum bounds and then starts autoscaling. As an example, the following table
demonstrates what happens to clusters with a certain initial size if you reconfigure a cluster to autoscale
between 5 and 10 nodes.

INITIAL SIZE          SIZE AFTER RECONFIGURATION

6                     6
12                    10
3                     5

Autoscaling local storage


It can often be difficult to estimate how much disk space a particular job will take. To save you from having to
estimate how many gigabytes of managed disk to attach to your cluster at creation time, Azure Databricks
automatically enables autoscaling local storage on all Azure Databricks clusters.
With autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your
cluster’s Spark workers. If a worker begins to run too low on disk, Databricks automatically attaches a new
managed disk to the worker before it runs out of disk space. Disks are attached up to a limit of 5 TB of total disk
space per virtual machine (including the virtual machine’s initial local storage).
The managed disks attached to a virtual machine are detached only when the virtual machine is returned to
Azure. That is, managed disks are never detached from a virtual machine as long as it is part of a running cluster.
To scale down managed disk usage, Azure Databricks recommends using this feature in a cluster configured
with Spot instances or Automatic termination.

Local disk encryption


IMPORTANT
This feature is in Public Preview.

Some instance types you use to run clusters may have locally attached disks. Azure Databricks may store shuffle
data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage
types, including shuffle data that is stored temporarily on your cluster’s local disks, you can enable local disk
encryption.

IMPORTANT
Your workloads may run more slowly because of the performance impact of reading and writing encrypted data to and
from local volumes.

When local disk encryption is enabled, Azure Databricks generates an encryption key locally that is unique to
each cluster node and is used to encrypt all data stored on local disks. The scope of the key is local to each
cluster node and is destroyed along with the cluster node itself. During its lifetime, the key resides in memory
for encryption and decryption and is stored encrypted on the disk.
To enable local disk encryption, you must use the Clusters API 2.0. During cluster creation or edit, set:

{
  "enable_local_disk_encryption": true
}

See Create and Edit in the Clusters API reference for examples of how to invoke these APIs.
Here is an example of a cluster create call that enables local disk encryption:

{
  "cluster_name": "my-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "enable_local_disk_encryption": true,
  "spark_conf": {
    "spark.speculation": true
  },
  "num_workers": 25
}

Security mode
If your workspace is assigned to a Unity Catalog metastore, you use security mode instead of High Concurrency
cluster mode to ensure the integrity of access controls and enforce strong isolation guarantees. High
Concurrency cluster mode is not available with Unity Catalog.
Under Advanced options, select from the following cluster security modes:
None: No isolation. Does not enforce workspace-local table access control or credential passthrough. Cannot
access Unity Catalog data.
Single User: Can be used only by a single user (by default, the user who created the cluster). Other users
cannot attach to the cluster. When accessing a view from a cluster with Single User security mode, the view
is executed with the user’s permissions. Single-user clusters support workloads using Python, Scala, and R.
Init scripts, library installation, and DBFS FUSE mounts are supported on single-user clusters. Automated
jobs should use single-user clusters.
User Isolation: Can be shared by multiple users. Only SQL workloads are supported. Library installation,
init scripts, and DBFS FUSE mounts are disabled to enforce strict isolation among the cluster users.
Table ACL only (Legacy): Enforces workspace-local table access control, but cannot access Unity Catalog
data.
Passthrough only (Legacy): Enforces workspace-local credential passthrough, but cannot access Unity
Catalog data.
The only security modes supported for Unity Catalog workloads are Single User and User Isolation.
For more information, see Cluster security mode.

Spark configuration
To fine tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.
1. On the cluster configuration page, click the Advanced Options toggle.
2. Click the Spark tab.

In Spark config , enter the configuration properties as one key-value pair per line.
When you configure a cluster using the Clusters API 2.0, set Spark properties in the spark_conf field in the
Create cluster request or Edit cluster request.
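For example, a spark_conf fragment in a Clusters API request might look like the following sketch; the property names and values are only illustrations:

{
  "spark_conf": {
    "spark.sql.shuffle.partitions": "200",
    "spark.speculation": "true"
  }
}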
To set Spark properties for all clusters, create a global init script:

dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
|#!/bin/bash
|
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
|[driver] {
| "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
|}
|EOF
""".stripMargin, true)

Retrieve a Spark configuration property from a secret


Databricks recommends storing sensitive information, such as passwords, in a secret instead of plaintext. To
reference a secret in the Spark configuration, use the following syntax:

spark.<property-name> {{secrets/<scope-name>/<secret-name>}}

For example, to set a Spark configuration property called password to the value of the secret stored in
secrets/acme_app/password :

spark.password {{secrets/acme_app/password}}

For more information, see Syntax for referencing secrets in a Spark configuration property or environment
variable.

Environment variables
You can configure custom environment variables that you can access from init scripts running on a cluster.
Databricks also provides predefined environment variables that you can use in init scripts. You cannot override
these predefined environment variables.
1. On the cluster configuration page, click the Advanced Options toggle.
2. Click the Spark tab.
3. Set the environment variables in the Environment Variables field.

You can also set environment variables using the spark_env_vars field in the Create cluster request or Edit
cluster request Clusters API endpoints.
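For example, a spark_env_vars fragment in a Clusters API request might look like the following sketch; the variable name and value are hypothetical:

{
  "spark_env_vars": {
    "DEPLOY_ENVIRONMENT": "staging"
  }
}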

Cluster tags
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your
organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies
these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.
For clusters launched from pools, the custom cluster tags are only applied to DBU usage reports and do not
propagate to cloud resources.
For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster,
pool, and workspace tags.
For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and
ClusterId.

In addition, on job clusters, Azure Databricks applies two default tags: RunName and JobId.
On resources used by Databricks SQL, Azure Databricks also applies the default tag SqlWarehouseId.

WARNING
Do not assign a custom tag with the key Name to a cluster. Every cluster has a tag Name whose value is set by Azure
Databricks. If you change the value associated with the key Name , the cluster can no longer be tracked by Azure
Databricks. As a consequence, the cluster might not be terminated after becoming idle and will continue to incur usage
costs.
You can add custom tags when you create a cluster. To configure cluster tags:
1. On the cluster configuration page, click the Advanced Options toggle.
2. At the bottom of the page, click the Tags tab.

3. Add a key-value pair for each custom tag. You can add up to 43 custom tags.
For more details, see Monitor usage using cluster, pool, and workspace tags.
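If you create clusters through the Clusters API 2.0, you can supply the same tags in the custom_tags field. The following fragment is a sketch; the tag keys and values are hypothetical:

{
  "custom_tags": {
    "CostCenter": "data-platform",
    "Team": "analytics"
  }
}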

SSH access to clusters


For security reasons, in Azure Databricks the SSH port is closed by default. If you want to enable SSH access to
your Spark clusters, contact Azure Databricks support.

NOTE
SSH can be enabled only if your workspace is deployed in your own Azure virtual network.

Cluster log delivery


When you create a cluster, you can specify a location to deliver the logs for the Spark driver node, worker nodes,
and events. Logs are delivered every five minutes to your chosen destination. When a cluster is terminated,
Azure Databricks guarantees to deliver all logs generated up until the cluster was terminated.
The destination of the logs depends on the cluster ID. If the specified destination is dbfs:/cluster-log-delivery ,
cluster logs for 0630-191345-leap375 are delivered to dbfs:/cluster-log-delivery/0630-191345-leap375 .
To configure the log delivery location:
1. On the cluster configuration page, click the Advanced Options toggle.
2. Click the Logging tab.

3. Select a destination type.


4. Enter the cluster log path.
NOTE
This feature is also available in the REST API. See Clusters API 2.0 and Cluster log delivery examples.
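For example, a cluster_log_conf fragment in a Clusters API request that uses the DBFS destination mentioned above might look like the following sketch:

{
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-log-delivery"
    }
  }
}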

Init scripts
A cluster node initialization—or init—script is a shell script that runs during startup for each cluster node before
the Spark driver or worker JVM starts. You can use init scripts to install packages and libraries not included in
the Databricks runtime, modify the JVM system classpath, set system properties and environment variables
used by the JVM, or modify Spark configuration parameters, among other configuration tasks.
You can attach init scripts to a cluster by expanding the Advanced Options section and clicking the Init
Scripts tab.
For detailed instructions, see Cluster node initialization scripts.
Best practices: Cluster configuration

Azure Databricks provides a number of options when you create and configure clusters to help you get the best
performance at the lowest cost. This flexibility, however, can create challenges when you’re trying to determine
optimal configurations for your workloads. Carefully considering how users will utilize clusters will help guide
configuration options when you create new clusters or configure existing clusters. Some of the things to
consider when determining configuration options are:
What type of user will be using the cluster? A data scientist may be running different job types with different
requirements than a data engineer or data analyst.
What types of workloads will users run on the cluster? For example, batch extract, transform, and load (ETL)
jobs will likely have different requirements than analytical workloads.
What level of service level agreement (SLA) do you need to meet?
What budget constraints do you have?
This article provides cluster configuration recommendations for different scenarios based on these
considerations. This article also discusses specific features of Azure Databricks clusters and the considerations to
keep in mind for those features.
Your configuration decisions will require a tradeoff between cost and performance. The primary cost of a cluster
includes the Databricks Units (DBUs) consumed by the cluster and the cost of the underlying resources needed
to run the cluster. What may not be obvious are the secondary costs such as the cost to your business of not
meeting an SLA, decreased employee efficiency, or possible waste of resources because of poor controls.

Cluster features
Before discussing more detailed cluster configuration scenarios, it’s important to understand some features of
Azure Databricks clusters and how best to use those features.
All-purpose clusters and job clusters
When you create a cluster you select a cluster type: an all-purpose cluster or a job cluster. All-purpose clusters
can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development.
Once you’ve completed implementing your processing and are ready to operationalize your code, switch to
running it on a job cluster. Job clusters terminate when your job ends, reducing resource usage and cost.
Cluster mode
Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node. Most regular
users use Standard or Single Node clusters.
Standard clusters are ideal for processing large amounts of data with Apache Spark.
Single Node clusters are intended for jobs that use small amounts of data or non-distributed workloads such
as single-node machine learning libraries.
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs.
Administrators usually create High Concurrency clusters. Databricks recommends enabling autoscaling for
High Concurrency clusters.
On-demand and spot instances
To save cost, Azure Databricks supports creating clusters using a combination of on-demand and spot instances.
You can use spot instances to take advantage of unused capacity on Azure to reduce the cost of running your
applications, grow your application’s compute capacity, and increase throughput.
Autoscaling
Autoscaling allows clusters to resize automatically based on workloads. Autoscaling can benefit many use cases
and scenarios from both a cost and performance perspective, but it can be challenging to understand when and
how to use autoscaling. The following are some considerations for determining whether to use autoscaling and
how to get the most benefit:
Autoscaling typically reduces costs compared to a fixed-size cluster.
Autoscaling workloads can run faster compared to an under-provisioned fixed-size cluster.
Some workloads are not compatible with autoscaling clusters, including spark-submit jobs and some Python
packages.
With single-user all-purpose clusters, users may find autoscaling is slowing down their development or
analysis when the minimum number of workers is set too low. This is because the commands or queries
they’re running are often several minutes apart, time in which the cluster is idle and may scale down to save
on costs. When the next command is executed, the cluster manager will attempt to scale up, taking a few
minutes while retrieving instances from the cloud provider. During this time, jobs might run with insufficient
resources, slowing the time to retrieve results. While increasing the minimum number of workers helps, it
also increases cost. This is another example where cost and performance need to be balanced.
If Delta Caching is being used, it’s important to remember that any cached data on a node is lost if that node
is terminated. If retaining cached data is important for your workload, consider using a fixed-size cluster.
If you have a job cluster running an ETL workload, you can sometimes size your cluster appropriately when
tuning if you know your job is unlikely to change. However, autoscaling gives you flexibility if your data sizes
increase. It’s also worth noting that optimized autoscaling can reduce expense with long-running jobs if there
are long periods when the cluster is underutilized or waiting on results from another process. Once again,
though, your job may experience minor delays as the cluster attempts to scale up appropriately. If you have
tight SLAs for a job, a fixed-sized cluster may be a better choice or consider using an Azure Databricks pool
to reduce cluster start times.
Azure Databricks also supports autoscaling local storage. With autoscaling local storage, Azure Databricks
monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run low
on disk, Azure Databricks automatically attaches a new managed volume to the worker before it runs out of disk
space.
Pools
Pools reduce cluster start and scale-up times by maintaining a set of available, ready-to-use instances.
Databricks recommends taking advantage of pools to improve processing time while minimizing cost.
Databricks Runtime versions
Databricks recommends using the latest Databricks Runtime version for all-purpose clusters. Using the most
current version will ensure you have the latest optimizations and most up-to-date compatibility between your
code and preloaded packages.
For job clusters running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime
version. Using the LTS version will ensure you don’t run into compatibility issues and can thoroughly test your
workload before upgrading. If you have an advanced use case around machine learning or genomics, consider
the specialized Databricks Runtime versions.
Cluster policies
Azure Databricks cluster policies allow administrators to enforce controls over the creation and configuration of
clusters. Databricks recommends using cluster policies to help apply the recommendations discussed in this
guide. Learn more about cluster policies in the cluster policies best practices guide.
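As an illustration, a policy definition that fixes automatic termination at 60 minutes and restricts worker instance types might look like the following sketch; the specific attribute values are hypothetical:

{
  "autotermination_minutes": {
    "type": "fixed",
    "value": 60,
    "hidden": true
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
  }
}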
Automatic termination
Many users won’t think to terminate their clusters when they’re finished using them. Fortunately, clusters are
automatically terminated after a set period, with a default of 120 minutes.
Administrators can change this default setting when creating cluster policies. Decreasing this setting can lower
cost by reducing the time that clusters are idle. It’s important to remember that when a cluster is terminated all
state is lost, including all variables, temp tables, caches, functions, objects, and so forth. All of this state will need
to be restored when the cluster starts again. If a developer steps out for a 30-minute lunch break, it would be
wasteful to spend that same amount of time to get a notebook back to the same state as before.

IMPORTANT
Idle clusters continue to accumulate DBU and cloud instance charges during the inactivity period before termination.

Garbage collection
While it may be less obvious than other considerations discussed in this article, paying attention to garbage
collection can help optimize job performance on your clusters. Providing a large amount of RAM can help jobs
perform more efficiently but can also lead to delays during garbage collection.
To minimize the impact of long garbage collection sweeps, avoid deploying clusters with large amounts of RAM
configured for each instance. Having more RAM allocated to the executor will lead to longer garbage collection
times. Instead, configure instances with smaller RAM sizes, and deploy more instances if you need more
memory for your jobs. However, there are cases where fewer nodes with more RAM are recommended, for
example, workloads that require a lot of shuffles, as discussed in Cluster sizing considerations.
Cluster access control
You can configure two types of cluster permissions:
The Allow Cluster Creation permission controls the ability of users to create clusters.
Cluster-level permissions control the ability to use and modify a specific cluster.
To learn more about configuring cluster permissions, see cluster access control.
You can create a cluster if you have either cluster create permissions or access to a cluster policy, which allows
you to create any cluster within the policy’s specifications. The cluster creator is the owner and has Can Manage
permissions, which will enable them to share it with any other user within the constraints of the data access
permissions of the cluster.
Understanding cluster permissions and cluster policies is important when deciding on cluster configurations
for common scenarios.
Cluster tags
Cluster tags allow you to easily monitor the cost of cloud resources used by different groups in your
organization. You can specify tags as key-value strings when creating a cluster, and Azure Databricks applies
these tags to cloud resources, such as VMs and disk volumes. Learn more about tag enforcement in the
cluster policies best practices guide.

Cluster sizing considerations


Azure Databricks runs one executor per worker node. Therefore the terms executor and worker are used
interchangeably in the context of the Azure Databricks architecture. People often think of cluster size in terms of
the number of workers, but there are other important factors to consider:
Total executor cores (compute): The total number of cores across all executors. This determines the maximum
parallelism of a cluster.
Total executor memory: The total amount of RAM across all executors. This determines how much data can
be stored in memory before spilling it to disk.
Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of
spills during shuffles and caching.
Additional considerations include worker instance type and size, which also influence the factors above. When
sizing your cluster, consider:
How much data will your workload consume?
What’s the computational complexity of your workload?
Where are you reading data from?
How is the data partitioned in external storage?
How much parallelism do you need?
Answering these questions will help you determine optimal cluster configurations based on workloads. For
simple ETL style workloads that use narrow transformations only (transformations where each input partition
will contribute to only one output partition), focus on a compute-optimized configuration. If you expect a lot of
shuffles, then the amount of memory is important, as well as storage to account for data spills. Fewer large
instances can reduce network I/O when transferring data between machines during shuffle-heavy workloads.
There’s a balancing act between the number of workers and the size of worker instance types. A cluster with two
workers, each with 40 cores and 100 GB of RAM, has the same compute and memory as an eight-worker cluster
with 10 cores and 25 GB of RAM per worker.
If you expect many re-reads of the same data, then your workloads may benefit from caching. Consider a
storage optimized configuration with Delta Cache.
Cluster sizing examples
The following examples show cluster recommendations based on specific types of workloads. These examples
also include configurations to avoid and why those configurations are not suitable for the workload types.
Data analysis
Data analysts typically perform processing requiring data from multiple partitions, leading to many shuffle
operations. A cluster with a smaller number of nodes can reduce the network and disk I/O needed to perform
these shuffles. Cluster A in the following diagram is likely the best choice, particularly for clusters supporting a
single analyst.
Cluster D will likely provide the worst performance since a larger number of nodes with less memory and
storage will require more shuffling of data to complete the processing.

Analytical workloads will likely require reading the same data repeatedly, so recommended worker types are
storage optimized with Delta Cache enabled.
Additional features recommended for analytical workloads include:
Enable auto termination to ensure clusters are terminated after a period of inactivity.
Consider enabling autoscaling based on the analyst’s typical workload.
Consider using pools, which will allow restricting clusters to pre-approved instance types and ensure
consistent cluster configurations.
Features that are probably not useful:
Storage autoscaling, since this user will probably not produce a lot of data.
High Concurrency clusters, since this cluster is for a single user, and High Concurrency clusters are best
suited for shared use.
Basic batch ETL
Simple batch ETL jobs that don’t require wide transformations, such as joins or aggregations, typically benefit
from clusters that are compute-optimized. For these types of workloads, any of the clusters in the following
diagram are likely acceptable.

Compute-optimized worker types are recommended; these will be cheaper, and these workloads will likely not
require significant memory or storage.
Using a pool might provide a benefit for clusters supporting simple ETL jobs by decreasing cluster launch times
and reducing total runtime when running job pipelines. However, since these types of workloads typically run as
scheduled jobs where the cluster runs only long enough to complete the job, using a pool might not provide a
benefit.
The following features probably aren’t useful:
Delta Caching, since re-reading data is not expected.
Auto termination probably isn’t required since these are likely scheduled jobs.
Autoscaling is not recommended since compute and storage should be pre-configured for the use case.
High Concurrency clusters are intended for multi-users and won’t benefit a cluster running a single job.
Complex batch ETL
More complex ETL jobs, such as processing that requires unions and joins across multiple tables, will probably
work best when you can minimize the amount of data shuffled. Since reducing the number of workers in a
cluster will help minimize shuffles, you should consider a smaller cluster like cluster A in the following diagram
over a larger cluster like cluster D.

Complex transformations can be compute-intensive, so for some workloads reaching an optimal number of
cores may require adding additional nodes to the cluster.
Like simple ETL jobs, compute-optimized worker types are recommended; these will be cheaper, and these
workloads will likely not require significant memory or storage. Also, like simple ETL jobs, the main cluster
feature to consider is pools to decrease cluster launch times and reduce total runtime when running job
pipelines.
The following features probably aren’t useful:
Delta Caching, since re-reading data is not expected.
Auto termination probably isn’t required since these are likely scheduled jobs.
Autoscaling is not recommended since compute and storage should be pre-configured for the use case.
High Concurrency clusters are intended for multi-users and won’t benefit a cluster running a single job.
Training machine learning models
Since initial iterations of training a machine learning model are often experimental, a smaller cluster such as
cluster A is a good choice. A smaller cluster will also reduce the impact of shuffles.
If stability is a concern, or for more advanced stages, a larger cluster such as cluster B or C may be a good
choice.
A large cluster such as cluster D is not recommended due to the overhead of shuffling data between nodes.

Recommended worker types are storage optimized with Delta Caching enabled to account for repeated reads of
the same data and to enable caching of training data. If the compute and storage options provided by storage
optimized nodes are not sufficient, consider GPU optimized nodes. A possible downside is the lack of Delta
Caching support with these nodes.
Additional features recommended for these workloads include:
Enable auto termination to ensure clusters are terminated after a period of inactivity.
Consider enabling autoscaling based on the analyst’s typical workload.
Use pools, which will allow restricting clusters to pre-approved instance types and ensure consistent cluster
configurations.
Features that are probably not useful:
Autoscaling, since cached data can be lost when nodes are removed as a cluster scales down. Additionally,
typical machine learning jobs will often consume all available nodes, in which case autoscaling will provide
no benefit.
Storage autoscaling, since this user will probably not produce a lot of data.
High Concurrency clusters, since this cluster is for a single user, and High Concurrency clusters are best
suited for shared use.

Common scenarios
The following sections provide additional recommendations for configuring clusters for common cluster usage
patterns:
Multiple users running data analysis and ad-hoc processing.
Specialized use cases like machine learning.
Support scheduled batch jobs.
Multi-user clusters
Scenario
You need to provide multiple users access to data for running data analysis and ad-hoc queries. Cluster usage
might fluctuate over time, and most jobs are not very resource-intensive. The users mostly require read-only
access to the data and want to perform analyses or create dashboards through a simple user interface.
The recommended approach for cluster provisioning is a hybrid approach for node provisioning in the cluster
along with autoscaling. A hybrid approach involves defining the number of on-demand instances and spot
instances for the cluster and enabling autoscaling between the minimum and the maximum number of
instances.

This cluster is always available and shared by the users belonging to a group by default. Enabling autoscaling
allows the cluster to scale up and down depending upon the load.
Users do not have access to start/stop the cluster, but the initial on-demand instances are immediately available
to respond to user queries. If the user query requires more capacity, autoscaling automatically provisions more
nodes (mostly Spot instances) to accommodate the workload.
Azure Databricks has other features to further improve multi-tenancy use cases:
Handling large queries in interactive workflows describes a process to automatically manage queries that will
never finish.
Task preemption improves how long-running jobs and shorter jobs work together.
Autoscaling local storage helps prevent running out of storage space in a multi-tenant environment.
This approach keeps the overall cost down by:
Using a shared cluster model.
Using a mix of on-demand and spot instances.
Using autoscaling to avoid paying for underutilized clusters.
Specialized workloads
Scenario
You need to provide clusters for specialized use cases or teams within your organization, for example, data
scientists running complex data exploration and machine learning algorithms. A typical pattern is that a user
needs a cluster for a short period to run their analysis.
The best approach for this kind of workload is to create cluster policies with predefined defaults, fixed values, and
setting ranges. These settings might include the number of instances, instance types, spot versus on-demand
instances, roles, libraries to be installed, and so forth. Using cluster policies allows users with more advanced
requirements to quickly spin up clusters that they can configure as needed for their use case, while enforcing cost
and compliance through the policies.

This approach provides more control to users while maintaining the ability to keep cost under control by pre-
defining cluster configurations. This also allows you to configure clusters for different groups of users with
permissions to access different data sets.
One downside to this approach is that users have to work with administrators for any changes to clusters, such
as configuration, installed libraries, and so forth.
Batch workloads
Scenario
You need to provide clusters for scheduled batch jobs, such as production ETL jobs that perform data
preparation. The suggested best practice is to launch a new cluster for each job run. Running each job on a new
cluster helps avoid failures and missed SLAs caused by other workloads running on a shared cluster. Depending
on the level of criticality for the job, you could use all on-demand instances to meet SLAs or balance between
spot and on-demand instances for cost savings.
Task preemption

The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing. This
guarantees interactive response times on clusters with many concurrently running jobs.

TIP
When tasks are preempted by the scheduler, their kill reason will be set to preempted by scheduler . This reason is
visible in the Spark UI and can be used to debug preemption behavior.

Preemption options
By default, preemption is conservative: jobs can be starved of resources for up to 30 seconds before the
scheduler intervenes. You can tune preemption by setting the following Spark configuration properties at cluster
launch time:
Whether preemption should be enabled.

spark.databricks.preemption.enabled true

The fair share fraction to guarantee per job. Setting this to 1.0 means the scheduler will aggressively
attempt to guarantee perfect fair sharing. Setting this to 0.0 effectively disables preemption. The default
setting is 0.5, which means at worst a job will get half of its fair share.

spark.databricks.preemption.threshold 0.5

How long a job must remain starved before preemption kicks in. Setting this to lower values will provide
more interactive response times, at the cost of cluster efficiency. Recommended values are from 1-100
seconds.

spark.databricks.preemption.timeout 30s

How often the scheduler will check for task preemption. This should be set to less than the preemption
timeout.

spark.databricks.preemption.interval 5s
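Putting these together, a cluster that needs more responsive preemption might include lines like the following in its Spark config at launch time; the timeout value is only an illustration:

spark.databricks.preemption.enabled true
spark.databricks.preemption.threshold 0.5
spark.databricks.preemption.timeout 15s
spark.databricks.preemption.interval 5s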

For further information on job scheduling, see Scheduling Within an Application.


Customize containers with Databricks Container
Services

Databricks Container Services lets you specify a Docker image when you create a cluster. Some example use
cases include:
Library customization: you have full control over the system libraries you want installed.
Golden container environment: your Docker image is a locked down environment that will never change.
Docker CI/CD integration: you can integrate Azure Databricks with your Docker CI/CD pipelines.
You can also use Docker images to create custom deep learning environments on clusters with GPU devices. For
additional information about using GPU clusters with Databricks Container Services, see Databricks Container
Services on GPU clusters.
For tasks to be executed each time the container starts, use an init script.

Requirements
NOTE
Databricks Runtime for Machine Learning and Databricks Runtime for Genomics do not support Databricks Container
Services.

Databricks Runtime 6.1 or above. If you have previously used Databricks Container Services you must
upgrade your base images. See the latest images in https://github.com/databricks/containers tagged with
6.x .
Your Azure Databricks workspace must have Databricks Container Services enabled.
Your machine must be running a recent Docker daemon (one that is tested and works with Client/Server
Version 18.03.0-ce) and the docker command must be available on your PATH .

Step 1: Build your base


Databricks recommends that you build your Docker base from a base that Databricks has built and tested. It is
also possible to build your Docker base from scratch. This section describes the two options.
Option 1. Use a base built by Databricks
This example uses the 9.x tag for an image that will target a cluster with runtime version Databricks Runtime
9.0 and above:

FROM databricksruntime/standard:9.x
...

To specify additional Python libraries, such as the latest version of pandas and urllib, use the container-specific
version of pip. For the databricksruntime/standard:9.x container, include the following:
RUN /databricks/python3/bin/pip install pandas
RUN /databricks/python3/bin/pip install urllib3

For the databricksruntime/standard:8.x container or lower, include the following:

RUN /databricks/conda/envs/dcs-minimal/bin/pip install pandas
RUN /databricks/conda/envs/dcs-minimal/bin/pip install urllib3
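Putting the pieces together, a complete minimal Dockerfile based on the 9.x image might look like the following sketch; the pip packages are only illustrations:

FROM databricksruntime/standard:9.x

RUN /databricks/python3/bin/pip install pandas
RUN /databricks/python3/bin/pip install urllib3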

Example base images are hosted on Docker Hub at https://hub.docker.com/u/databricksruntime. The Dockerfiles
used to generate these bases are at https://github.com/databricks/containers.

NOTE
The base images databricksruntime/standard and databricksruntime/minimal are not to be confused with the
unrelated databricks-standard and databricks-minimal environments included in the no longer available Databricks
Runtime with Conda (Beta).

Option 2. Build your own Docker base


You can also build your Docker base from scratch. The Docker image must meet these requirements:
JDK 8u191 as Java on the system PATH
bash
iproute2 (ubuntu iproute)
coreutils (ubuntu coreutils)
procps (ubuntu procps)
sudo (ubuntu sudo)
Ubuntu Linux
To build your own image from scratch, you must create the virtual environment. You must also include packages
that are built into Databricks clusters, such as Python, R, and Ganglia. To get started, you can use the appropriate
base image (that is, databricksruntime/rbase for R or databricksruntime/python for Python), or refer to the
example Dockerfiles in GitHub. Another alternative is to start with the minimal image built by Databricks at
databricksruntime/minimal .

NOTE
Databricks recommends using Ubuntu Linux; however, it is possible to use Alpine Linux. To use Alpine Linux, you must
include these files:
alpine coreutils
alpine procps
alpine sudo
In addition, you must set up Python, as shown in this example Dockerfile.

WARNING
Test your custom container image thoroughly on an Azure Databricks cluster. Your container may work on a local or build
machine, but when your container is launched on an Azure Databricks cluster, the cluster launch may fail, certain features
may become disabled, or your container may stop working, even silently. In worst-case scenarios, it could corrupt your
data or accidentally expose your data to external parties.
Step 2: Push your base image
Push your custom base image to a Docker registry. This process is supported with the following registries:
Docker Hub with no auth or basic auth.
Azure Container Registry with basic auth.
Other Docker registries that support no auth or basic auth are also expected to work.

NOTE
If you use Docker Hub for your Docker registry, be sure to check that rate limits accommodate the number of clusters
that you expect to launch in a six-hour period. These rate limits are different for anonymous users, authenticated users
without a paid subscription, and paid subscriptions. See the Docker documentation for details. If this limit is exceeded, you
will get a “429 Too Many Requests” response.
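For example, with Azure Container Registry the build and push steps might look like the following sketch; the repository name databricks-custom and the tag are hypothetical:

docker build -t <your-registry-name>.azurecr.io/databricks-custom:latest .
docker login <your-registry-name>.azurecr.io
docker push <your-registry-name>.azurecr.io/databricks-custom:latest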

Step 3: Launch your cluster


You can launch your cluster using the UI or the API.
Launch your cluster using the UI
1. Specify a Databricks Runtime Version that supports Databricks Container Services.

2. Select Use your own Docker container .


3. In the Docker Image URL field, enter your custom Docker image.
Docker image URL examples:
REGISTRY                        TAG FORMAT

Docker Hub                      <organization>/<repository>:<tag> (for example: databricksruntime/standard:latest )

Azure Container Registry        <your-registry-name>.azurecr.io/<repository-name>:<tag>

4. Select the authentication type.


Launch your cluster using the API
1. Generate an API token.
2. Use the Clusters API 2.0 to launch a cluster with your custom Docker base.

curl -X POST -H "Authorization: Bearer <token>" https://<databricks-instance>/api/2.0/clusters/create -d '{
  "cluster_name": "<cluster-name>",
  "num_workers": 0,
  "node_type_id": "Standard_DS3_v2",
  "docker_image": {
    "url": "databricksruntime/standard:latest",
    "basic_auth": {
      "username": "<docker-registry-username>",
      "password": "<docker-registry-password>"
    }
  },
  "spark_version": "7.3.x-scala2.12"
}'

basic_auth requirements depend on your Docker image type:


For public Docker images, do not include the basic_auth field.
For private Docker images, you must include the basic_auth field, using a service principal ID and
password as the username and password.
For Azure Container Registry, you must set the basic_auth field to the ID and password for a service
principal. See Azure Container Registry service principal authentication documentation for
information about creating the service principal.

Use an init script


Databricks Container Services clusters enable customers to include init scripts in the Docker container. In most
cases, you should avoid init scripts and instead make customizations through Docker directly (using the
Dockerfile). However, certain tasks must be executed when the container starts, instead of when the container is
built. Use an init script for these tasks.
For example, suppose you want to run a security daemon inside a custom container. Install and build the
daemon in the Docker image through your image building pipeline. Then, add an init script that starts the
daemon. In this example, the init script would include a line like systemctl start my-daemon .
In the API, you can specify init scripts as part of the cluster spec as follows. For more information, see
InitScriptInfo.
"init_scripts": [
{
"file": {
"destination": "file:/my/local/file.sh"
}
}
]

For Databricks Container Services images, you can also store init scripts in DBFS or cloud storage.
The following steps take place when you launch a Databricks Container Services cluster:
1. VMs are acquired from the cloud provider.
2. The custom Docker image is downloaded from your repo.
3. Azure Databricks creates a Docker container from the image.
4. Databricks Runtime code is copied into the Docker container.
5. The init scripts are executed. See Init script execution order.
Azure Databricks ignores the Docker CMD and ENTRYPOINT primitives.
Cluster node initialization scripts

An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or
worker JVM starts.
Some examples of tasks performed by init scripts include:
Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Azure
Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the
Azure Databricks Python virtual environment rather than the system Python environment. For example,
/databricks/python/bin/pip install <package-name> .
Modify the JVM system classpath in special cases.
Set system properties and environment variables used by the JVM.
Modify Spark configuration parameters.

WARNING
Azure Databricks scans the reserved location /databricks/init for legacy global init scripts which are enabled in new
workspaces by default. Databricks recommends you avoid storing init scripts in this location to avoid unexpected behavior.

Init script types


Azure Databricks supports two kinds of init scripts: cluster-scoped and global.
Cluster-scoped : run on every cluster configured with the script. This is the recommended way to run an init
script.
Global : run on every cluster in the workspace. They can help you to enforce consistent cluster configurations
across your workspace. Use them carefully because they can cause unanticipated impacts, like library
conflicts. Only admin users can create global init scripts. Global init scripts are not run on model serving
clusters.

NOTE
There are two kinds of init scripts that are deprecated. You should migrate init scripts of these types to those listed above:
Cluster-named : run on a cluster with the same name as the script. Cluster-named init scripts are best-effort (silently
ignore failures), and attempt to continue the cluster launch process. Cluster-scoped init scripts should be used instead
and are a complete replacement.
Legacy global: run on every cluster. They are less secure than the new global init script framework, silently ignore
failures, and cannot reference environment variables. You should migrate existing legacy global init scripts to the new
global init script framework. See Migrate from legacy to new global init scripts.

Whenever you change any type of init script you must restart all clusters affected by the script.

Init script execution order


The order of execution of init scripts is:
1. Legacy global (deprecated)
2. Cluster-named (deprecated)
3. Global (new)
4. Cluster-scoped

Environment variables
Cluster-scoped and global init scripts support the following environment variables:
DB_CLUSTER_ID : the ID of the cluster on which the script is running. See Clusters API 2.0.
DB_CONTAINER_IP : the private IP address of the container in which Spark runs. The init script is run inside this
container. See SparkNode.
DB_IS_DRIVER : whether the script is running on a driver node.
DB_DRIVER_IP : the IP address of the driver node.
DB_INSTANCE_TYPE : the instance type of the host VM.
DB_CLUSTER_NAME : the name of the cluster the script is executing on.
DB_IS_JOB_CLUSTER : whether the cluster was created to run a job. See Create a job.

For example, if you want to run part of a script only on a driver node, you could write a script like:

echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
<run this part only on driver>
else
<run this part only on workers>
fi
<run this part on both driver and workers>

You can also configure custom environment variables for a cluster and reference those variables in init scripts.
Use secrets in environment variables
Environment variables that reference secrets exhibit special behavior on Azure Databricks: init scripts can use
these variables, but programs running in Spark cannot use these variables.
You can use any valid variable name when you Reference a secret in an environment variable.

Logging
Init script start and finish events are captured in cluster event logs. Details are captured in cluster logs. Global
init script create, edit, and delete events are also captured in account-level diagnostic logs.
Init script events
Cluster event logs capture two init script events: INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED , indicating
which scripts are scheduled for execution and which have completed successfully. INIT_SCRIPTS_FINISHED also
captures execution duration.
Global init scripts are indicated in the log event details by the key "global" and cluster-scoped init scripts are
indicated by the key "cluster" .

NOTE
Cluster event logs do not log init script events for each cluster node; only one node is selected to represent them all.

Init script logs


If cluster log delivery is configured for a cluster, the init script logs are written to
/<cluster-log-path>/<cluster-id>/init_scripts . Logs for each container in the cluster are written to a
subdirectory called init_scripts/<cluster_id>_<container_ip> . For example, if cluster-log-path is set to
cluster-logs , the path to the logs for a specific container would be:
dbfs:/cluster-logs/<cluster-id>/init_scripts/<cluster_id>_<container_ip> .

If the cluster is configured to write logs to DBFS, you can view the logs using the File system utility (dbutils.fs) or
the DBFS CLI. For example, if the cluster ID is 1001-234039-abcde739 :

dbfs ls dbfs:/cluster-logs/1001-234039-abcde739/init_scripts

1001-234039-abcde739_10_97_225_166
1001-234039-abcde739_10_97_231_88
1001-234039-abcde739_10_97_244_199

dbfs ls dbfs:/cluster-logs/1001-234039-abcde739/init_scripts/1001-234039-abcde739_10_97_225_166

<timestamp>_<log-id>_<init-script-name>.sh.stderr.log
<timestamp>_<log-id>_<init-script-name>.sh.stdout.log

When cluster log delivery is not configured, logs are written to /databricks/init_scripts . You can use standard
shell commands in a notebook to list and view the logs:

%sh
ls /databricks/init_scripts/
cat /databricks/init_scripts/<timestamp>_<log-id>_<init-script-name>.sh.stdout.log

Every time a cluster launches, it writes a log to the init script log folder.

IMPORTANT
Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global
init scripts. You should ensure that your global init scripts do not output any sensitive information.

Diagnostic logs
Azure Databricks diagnostic logging captures global init script create, edit, and delete events under the event
type globalInitScripts . See Diagnostic logging in Azure Databricks.

Cluster-scoped init scripts


Cluster-scoped init scripts are init scripts defined in a cluster configuration. Cluster-scoped init scripts apply to
both clusters you create and those created to run jobs. Since the scripts are part of the cluster configuration,
cluster access control lets you control who can change the scripts.
You can configure cluster-scoped init scripts using the UI, the CLI, and by invoking the Clusters API. This section
focuses on performing these tasks using the UI. For the other methods, see Databricks CLI and Clusters API 2.0.
You can add any number of scripts, and the scripts are executed sequentially in the order provided.
If a cluster-scoped init script returns a non-zero exit code, the cluster launch fails. You can troubleshoot cluster-
scoped init scripts by configuring cluster log delivery and examining the init script log.
Cluster-scoped init script locations
You can put init scripts in a DBFS or ADLS directory accessible by a cluster. Cluster-node init scripts in DBFS must
be stored in the DBFS root. Azure Databricks does not support storing init scripts in a DBFS directory created by
mounting object storage.
Example cluster-scoped init scripts
This section shows two examples of init scripts.
Example: Install PostgreSQL JDBC driver
The following snippets run in a Python notebook create an init script that installs a PostgreSQL JDBC driver.
1. Create a DBFS directory you want to store the init script in. This example uses dbfs:/databricks/scripts .

dbutils.fs.mkdirs("dbfs:/databricks/scripts/")

2. Create a script named postgresql-install.sh in that directory:

dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar
https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)

3. Check that the script exists.

display(dbutils.fs.ls("dbfs:/databricks/scripts/postgresql-install.sh"))

Alternatively, you can create the init script postgresql-install.sh locally:

#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar

and copy it to dbfs:/databricks/scripts using DBFS CLI:

dbfs cp postgresql-install.sh dbfs:/databricks/scripts/postgresql-install.sh

Example: Use conda to install Python libraries


With Databricks Runtime 9.0 and above, you cannot use conda to install Python libraries. For instructions on
how to install Python packages on a cluster, see Libraries.

IMPORTANT
Anaconda Inc. updated their terms of service for anaconda.org channels in September 2020. Based on the new terms of
service you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda
Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.
As a result of this change, Databricks has removed the default channel configuration for the Conda package manager. This
is a breaking change. You must update the usage of conda commands in init-scripts to specify a channel using -c . If you
do not specify a channel, conda commands will fail with PackagesNotFoundError .

In Databricks Runtime 8.4 ML and below, you use the Conda package manager to install Python packages. To
install a Python library at cluster initialization, you can use a script like the following:
#!/bin/bash
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -c conda-forge -y astropy

Configure a cluster-scoped init script


You can configure a cluster to run an init script using the UI or API.

IMPORTANT
The script must exist at the configured location. If the script doesn’t exist, the cluster will fail to start or be autoscaled
up.
The init script cannot be larger than 64KB. If a script exceeds that size, the cluster will fail to launch and a failure
message will appear in the cluster log.

Configure a cluster-scoped init script using the UI


To use the cluster configuration page to configure a cluster to run an init script:
1. On the cluster configuration page, click the Advanced Options toggle.
2. At the bottom of the page, click the Init Scripts tab.

3. In the Destination drop-down, select a destination type. In the example in the preceding section, the
destination is DBFS .
4. Specify a path to the init script. In the example in the preceding section, the path is
dbfs:/databricks/scripts/postgresql-install.sh . The path must begin with dbfs:/ .

5. Click Add .
To remove a script from the cluster configuration, click the delete icon at the right of the script. When you confirm the
delete you will be prompted to restart the cluster. Optionally you can delete the script file from the location you
uploaded it to.
Configure a cluster-scoped init script using the Clusters API
To use the Clusters API 2.0 to configure the cluster with ID 1202-211320-brick1 to run the init script in the
preceding section, run the following command:
curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/databricks/scripts/postgresql-install.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit

Global init scripts


A global init script runs on every cluster created in your workspace. Global init scripts are useful when you want
to enforce organization-wide library configurations or security screens. Only admins can create global init
scripts. You can create them using either the UI or REST API.

IMPORTANT
Use global init scripts carefully:
It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use
cluster-scoped init scripts instead.
Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global
init scripts. You should ensure that your global init scripts do not output any sensitive information.

You can troubleshoot global init scripts by configuring cluster log delivery and examining the init script log.
Add a global init script using the UI
To configure global init scripts using the Admin Console:
1. Go to the Admin Console and click the Global Init Scripts tab.

2. Click + Add .
3. Name the script and enter it by typing, pasting, or dragging a text file into the Script field.

NOTE
The init script cannot be larger than 64KB. If a script exceeds that size, an error message appears when you try to
save.

4. If you have more than one global init script configured for your workspace, set the order in which the
new script will run.
5. If you want the script to be enabled for all new and restarted clusters after you save, toggle Enabled .

IMPORTANT
When you add a global init script or make changes to the name, run order, or enablement of init scripts, those
changes do not take effect until you restart the cluster.

6. Click Add .
Edit a global init script using the UI
1. Go to the Admin Console and click the Global Init Scripts tab.
2. Click a script.
3. Edit the script.
4. Click Confirm .
Configure a global init script using the API
Admins can add, delete, re-order, and get information about the global init scripts in your workspace using the
Global Init Scripts API 2.0.
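For example, a sketch of a request that registers a new (initially disabled) global init script might look like the following; the name and position are illustrative, and the script contents must be base64-encoded:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "name": "my-global-init",
  "script": "<base64-encoded-script-contents>",
  "enabled": false,
  "position": 0
}' https://<databricks-instance>/api/2.0/global-init-scripts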
Migrate from legacy to new global init scripts
If your Azure Databricks workspace was launched before August 2020, you might still have legacy global init
scripts. You should migrate these to the new global init script framework to take advantage of the security,
consistency, and visibility features included in the new script framework.
1. Copy your existing legacy global init scripts and add them to the new global init script framework using
either the UI or the REST API.
Keep them disabled until you have completed the next step.
2. Disable all legacy global init scripts.
In the Admin Console, go to the Global Init Scripts tab and toggle off the Legacy Global Init Scripts
switch.

3. Enable your new global init scripts.


On the Global Init Scripts tab, toggle on the Enabled switch for each init script you want to enable.

4. Restart all clusters.


Legacy scripts will not run on new nodes added during automated scale-up of running clusters, and
neither will the new global init scripts. You must restart all clusters to ensure that the new scripts run
on them and that no existing cluster adds new nodes without any global init scripts applied.
Non-idempotent scripts may need to be modified when you migrate to the new global init script
framework and disable legacy scripts.
GPU-enabled clusters
7/21/2022 • 3 minutes to read

NOTE
Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver
and worker types during cluster creation.

Overview
Azure Databricks supports clusters accelerated with graphics processing units (GPUs). This article describes how
to create clusters with GPU-enabled instances and describes the GPU drivers and libraries installed on those
instances.
To learn more about deep learning on GPU-enabled clusters, see Deep learning.

Create a GPU cluster


Creating a GPU cluster is similar to creating any Spark cluster (see Clusters). Keep in mind the
following:
The Databricks Runtime Version must be a GPU-enabled version, such as Runtime 9.1 LTS ML (GPU,
Scala 2.12, Spark 3.1.2) .
The Worker Type and Driver Type must be GPU instance types.
For single-machine workflows without Spark, you can set the number of workers to zero.
Azure Databricks supports the following instance types:
NC instance type series: Standard_NC12, Standard_NC24
NC v2 instance type series: Standard_NC6s_v2, Standard_NC12s_v2, Standard_NC24s_v2, Standard_NC24rs_v2
NC T4 v3 instance type series: Standard_NC4as_T4_v3, Standard_NC8as_T4_v3, Standard_NC16as_T4_v3, Standard_NC64as_T4_v3
See Azure Databricks Pricing for an up-to-date list of supported GPU instance types and their availability
regions. Your Azure Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
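If you prefer to script cluster creation, the following is a minimal sketch of a Clusters API 2.0 request for a GPU cluster, assuming a hypothetical workspace URL and token; substitute a GPU-enabled runtime string and GPU instance types that are available in your region.

import requests

cluster_spec = {
    "cluster_name": "gpu-example",
    "spark_version": "9.1.x-gpu-ml-scala2.12",  # a GPU-enabled Databricks Runtime ML version
    "node_type_id": "Standard_NC6s_v2",         # GPU instance type for workers
    "driver_node_type_id": "Standard_NC6s_v2",  # GPU instance type for the driver
    "num_workers": 2,                           # can be 0 for single-machine workflows without Spark
}

resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])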

GPU scheduling
Databricks Runtime 7.0 ML and above support GPU-aware scheduling from Apache Spark 3.0. Azure Databricks
preconfigures it on GPU clusters.
GPU scheduling is not enabled on Single Node clusters.
spark.task.resource.gpu.amount is the only Spark config related to GPU-aware scheduling that you might need
to change. The default configuration uses one GPU per task, which is ideal for distributed inference workloads
and distributed training, if you use all GPU nodes. To do distributed training on a subset of nodes, which helps
reduce communication overhead during distributed training, Databricks recommends setting
spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration.

For PySpark tasks, Azure Databricks automatically remaps assigned GPU(s) to indices 0, 1, …. Under the default
configuration that uses one GPU per task, your code can simply use the default GPU without checking which
GPU is assigned to the task. If you set multiple GPUs per task, for example 4, your code can assume that the
indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs,
you can get them from the CUDA_VISIBLE_DEVICES environment variable.
If you use Scala, you can get the indices of the GPUs assigned to the task from
TaskContext.resources().get("gpu") .
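As an illustration, here is a small sketch, assuming a GPU cluster with GPU-aware scheduling enabled, that runs a couple of PySpark tasks and reports the GPU addresses assigned to each task along with the CUDA_VISIBLE_DEVICES value described above; sc is the SparkContext that Azure Databricks defines in notebooks.

import os
from pyspark import TaskContext

def describe_gpus(_):
    # GPU addresses assigned to this task by the scheduler.
    addresses = TaskContext.get().resources()["gpu"].addresses
    # Physical GPU indices, as exposed to the task through the environment.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    return [(addresses, visible)]

# Run two tasks and collect what each one sees.
print(sc.parallelize(range(2), 2).mapPartitions(describe_gpus).collect())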

For Databricks Runtime releases below 7.0, to avoid conflicts among multiple Spark tasks trying to use the same
GPU, Azure Databricks automatically configures GPU clusters so that there is at most one running task per node.
That way the task can use all GPUs on the node without running into conflicts with other tasks.

NVIDIA GPU driver, CUDA, and cuDNN


Azure Databricks installs the NVIDIA driver and libraries required to use GPUs on Spark driver and worker
instances:
CUDA Toolkit, installed under /usr/local/cuda .
cuDNN: NVIDIA CUDA Deep Neural Network Library.
NCCL: NVIDIA Collective Communications Library.
The version of the NVIDIA driver included is 470.57.02, which supports CUDA 11.0.
For the versions of the libraries included, see the release notes for the specific Databricks Runtime version you
are using.

NOTE
This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Azure Databricks
includes code from CUDA Samples.

NVIDIA End User License Agreement (EULA)


When you select a GPU-enabled “Databricks Runtime Version” in Azure Databricks, you implicitly agree to the
terms and conditions outlined in the NVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries, and the
NVIDIA End User License Agreement (with NCCL Supplement) for the NCCL library.

Databricks Container Services on GPU clusters


IMPORTANT
This feature is in Public Preview.

You can use Databricks Container Services on clusters with GPUs to create portable deep learning environments
with customized libraries. See Customize containers with Databricks Container Services for instructions.
To create custom images for GPU clusters, you must select a standard runtime version instead of Databricks
Runtime ML for GPU. When you select Use your own Docker container , you can choose GPU clusters with a
standard runtime version. The custom images for GPU clusters are based on the official CUDA containers, a
different base from Databricks Runtime ML for GPU.
When you create custom images for GPU clusters, you cannot change the NVIDIA driver version, because it
must match the driver version on the host machine.
The databricksruntime Docker Hub contains example base images with GPU capability. The Dockerfiles used to
generate these images are located in the example containers GitHub repository, which also has details on what
the example images provide and how to customize them.
Single Node clusters
7/21/2022 • 3 minutes to read

A Single Node cluster is a cluster consisting of an Apache Spark driver and no Spark workers. A Single Node
cluster supports Spark jobs and all Spark data sources, including Delta Lake. A Standard cluster requires a
minimum of one Spark worker to run Spark jobs.
Single Node clusters are helpful for:
Single-node machine learning workloads that use Spark to load and save data
Lightweight exploratory data analysis

Create a Single Node cluster


To create a Single Node cluster, set Cluster Mode to Single Node when you configure a cluster.

Single Node cluster properties


A Single Node cluster has the following properties:
Runs Spark locally.
The driver acts as both master and worker, with no worker nodes.
Spawns one executor thread per logical core in the cluster, minus 1 core for the driver.
All stderr , stdout , and log4j log output is saved in the driver log.
A Single Node cluster can’t be converted to a Standard cluster. To use a Standard cluster, create the cluster
and attach your notebook to it.

Limitations
Large-scale data processing will exhaust the resources on a Single Node cluster. For these workloads,
Databricks recommends using a Standard mode cluster.
Single Node clusters are not designed to be shared. To avoid resource conflicts, Databricks recommends
using a Standard mode cluster when the cluster must be shared.
A Standard mode cluster can’t be scaled to 0 workers. Use a Single Node cluster instead.
Single Node clusters are not compatible with process isolation.
GPU scheduling is not enabled on Single Node clusters.
On Single Node clusters, Spark cannot read Parquet files with a UDT column. The following error
message results:
The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically
reattached.

To work around this problem, disable the native Parquet reader:

spark.conf.set("spark.databricks.io.parquet.nativeReader.enabled", False)

REST API
You can use the Clusters API to create a Single Node cluster.
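A minimal sketch of such a request is shown below, assuming a hypothetical workspace URL, token, node type, and runtime version; the Spark configuration and tag values mirror the singleNode profile used in the cluster policy that follows.

import requests

payload = {
    "cluster_name": "single-node-example",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,  # no Spark workers; the driver does all the work
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])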

Single Node cluster policy


Cluster policies simplify cluster configuration for Single Node clusters.
Consider the example of a data science team whose members do not have permission to create clusters. An
admin can create a cluster policy that authorizes team members to create a maximum number of Single Node
clusters, using pools and cluster policies:
1. Create a pool:
a. Set Max capacity to 10 .
b. In Autopilot options , enable autoscaling for local storage.
c. Set Instance type to Single Node cluster .
d. Select an Azure Databricks version. Databricks recommends using the latest version if possible.
e. Click Create .
The pool’s properties page appears. Make a note of the pool ID and instance type ID for the newly-
created pool.
2. Create a cluster policy:
Set the pool ID and instance type ID from the pool’s properties page.
Specify constraints as needed.
3. Grant the cluster policy to the team members. You can use Manage users, groups, and service principals
to simplify user management.
{
"spark_conf.spark.databricks.cluster.profile": {
"type": "fixed",
"value": "singleNode",
"hidden": true
},
"instance_pool_id": {
"type": "fixed",
"value": "singleNodePoolId1",
"hidden": true
},
"spark_version": {
"type": "fixed",
"value": "7.3.x-cpu-ml-scala2.12",
"hidden": true
},
"autotermination_minutes": {
"type": "fixed",
"value": 120,
"hidden": true
},
"num_workers": {
"type": "fixed",
"value": 0,
"hidden": true
},
"docker_image.url": {
"type": "forbidden",
"hidden": true
}
}

Single Node job cluster policy


To set up a cluster policy for jobs, you can define a similar cluster policy. Set the cluster_type.type to fixed
and cluster_type.value to job . Remove all references to autotermination_minutes .
{
"cluster_type": {
"type": "fixed",
"value": "job"
},
"spark_conf.spark.databricks.cluster.profile": {
"type": "fixed",
"value": "singleNode",
"hidden": true
},
"instance_pool_id": {
"type": "fixed",
"value": "singleNodePoolId1",
"hidden": true
},
"num_workers": {
"type": "fixed",
"value": 0,
"hidden": true
},
"spark_version": {
"type": "fixed",
"value": "7.3.x-cpu-ml-scala2.12",
"hidden": true
},
"docker_image.url": {
"type": "forbidden",
"hidden": true
}
}
Web terminal
7/21/2022 • 2 minutes to read

Azure Databricks web terminal provides a convenient and highly interactive way for you to run shell commands
and use editors, such as Vim or Emacs, on the Spark driver node. The web terminal can be used by many users
on one cluster. Example uses of the web terminal include monitoring resource usage and installing Linux
packages.
Web terminal is disabled by default for all workspace users.
Enabling Docker Container Services disables web terminal.

WARNING
Azure Databricks proxies the web terminal service from port 7681 on the cluster’s Spark driver. This web proxy is intended
for use only with the web terminal. If the port is occupied when the cluster starts or if there is otherwise a conflict, the
web terminal may not work as expected. If other web services are launched on port 7681, cluster users may be exposed
to potential security exploits. Neither Databricks nor Microsoft is responsible for any issues that result from the
installation of unsupported software on a cluster.

Requirements
Databricks Runtime 7.0 or above.
Can Attach To permission on a cluster.
Your Azure Databricks workspace must have web terminal enabled.

Launch the web terminal


Do one of the following:
In a cluster detail page, click the Apps tab and then click Launch Web Terminal .

In a notebook, click an attached cluster drop-down and then click Terminal .


A new tab opens with the web terminal UI and the Bash prompt, where you can run commands as root inside
the container of the cluster driver node.

Each user can have up to 100 active web terminal sessions (tabs) open. Idle web terminal sessions may time out
and the web terminal web application will reconnect, resulting in a new shell process. If you want to keep your
Bash session, Databricks recommends using tmux.

Limitations
Azure Databricks does not support running Spark jobs from the web terminal. In addition, Azure Databricks
web terminal is not available in the following cluster types:
Job clusters
High concurrency clusters with either table access control or credential passthrough enabled.
Clusters launched with the DISABLE_WEB_TERMINAL=true environment variable set.
Enabling Docker Container Services disables web terminal.
Debugging with the Apache Spark UI
7/21/2022 • 6 minutes to read

This guide walks you through the different debugging options available to peek at the internals of your Apache
Spark application. The three important places to look are:
Spark UI
Driver logs
Executor logs

Spark UI
Once you start the job, the Spark UI shows information about what’s happening in your application. To get to the
Spark UI, click the attached cluster:

Streaming tab
Once you get to the Spark UI, you will see a Streaming tab if a streaming job is running in this cluster. If there is
no streaming job running in this cluster, this tab will not be visible. You can skip to Driver logs to learn how to
check for exceptions that might have happened while starting the streaming job.
The first thing to check on this page is whether your streaming application is receiving any input events from
your source. In this case, you can see the job receives 1000 events/second.
If your application receives multiple input streams, you can click the Input Rate link, which shows the
number of events received for each receiver.
Processing time
As you scroll down, find the graph for Processing Time . This is one of the key graphs for understanding the
performance of your streaming job. As a general rule of thumb, it is good if you can process each batch within
80% of your batch interval.
For this application, the batch interval was 2 seconds. The average processing time is 450ms, which is well under
the batch interval. If the average processing time is close to or greater than your batch interval, your streaming
application will start queuing up batches, quickly building a backlog that can eventually bring down the
streaming job.

Completed batches
Towards the end of the page, you will see a list of all the completed batches. The page displays details about the
last 1000 batches that completed. From the table, you can get the number of events processed for each batch and
its processing time. To learn more about what happened in one of the batches, click the batch link to go to the
Batch details page.
Batch details page
This page has all the details you want to know about a batch. Two key things are:
Input: Has details about the input to the batch. In this case, it has details about the Apache Kafka topic,
partition, and offsets read by Spark Structured Streaming for this batch. In the case of TextFileStream, you see a
list of file names that were read for this batch. This is the best way to start debugging a streaming application
reading from text files.
Processing: You can click the link to the Job ID which has all the details about the processing done during this
batch.

Job details page


The job details page shows a DAG visualization. This is very useful for understanding the order of operations and
dependencies for every batch. In this case, you can see that the batch read input from the Kafka direct stream,
followed by a flat map operation and then a map operation. The resulting stream was then used to update a
global state using updateStateByKey. (The grayed boxes represent skipped stages. Spark is smart enough to
skip some stages if they don’t need to be recomputed. If the data is checkpointed or cached, Spark skips
recomputing those stages. In this case, those stages correspond to the dependency on previous batches
because of updateStateByKey . Because Spark Structured Streaming internally checkpoints the stream and reads
from the checkpoint instead of depending on the previous batches, those stages are shown as grayed.)
At the bottom of the page, you will also find the list of jobs that were executed for this batch. You can click the
links in the description to drill further into the task level execution.
Task details page
This is the most granular level of debugging you can get into from the Spark UI for a Spark application. This
page has all the tasks that were executed for this batch. If you are investigating performance issues in your
streaming application, this page provides information such as the number of tasks that were executed, where
they were executed (on which executors), and shuffle information.

TIP
Ensure that the tasks are executed on multiple executors (nodes) in your cluster to have enough parallelism while
processing. If you have a single receiver, sometimes only one executor might be doing all the work though you have more
than one executor in your cluster.
Thread dump
A thread dump shows a snapshot of a JVM’s thread states.
Thread dumps are useful in debugging a specific hanging or slow-running task. To view a specific task’s thread
dump in the Spark UI:
1. Click the Jobs tab.
2. In the Jobs table, find the target job that corresponds to the thread dump you want to see, and click the link
in the Description column.
3. In the job’s Stages table, find the target stage that corresponds to the thread dump you want to see, and click
the link in the Description column.
4. In the stage’s Tasks list, find the target task that corresponds to the thread dump you want to see, and note
its Task ID and Executor ID values.
5. Click the Executors tab.
6. In the Executors table, find the row that contains the Executor ID value that corresponds to the Executor
ID value that you noted earlier. In that row, click the link in the Thread Dump column.
7. In the Thread dump for executor table, click the row where the Thread Name column contains TID
followed by the Task ID value that you noted earlier. (If the task has finished running, you will not find a
matching thread.) The task’s thread dump is shown.
Thread dumps are also useful for debugging issues where the driver appears to be hanging (for example, no
Spark progress bars are showing) or making no progress on queries (for example, Spark progress bars are stuck
at 100%). To view the driver’s thread dump in the Spark UI:
1. Click the Executors tab.
2. In the Executors table, in the driver row, click the link in the Thread Dump column. The driver’s thread
dump is shown.

Driver logs
Driver logs are helpful for two purposes:
Exceptions: Sometimes you may not see the Streaming tab in the Spark UI. This happens when the streaming job
failed to start because of an exception. You can drill into the driver logs to look at the stack trace of the
exception. In other cases, the streaming job may have started properly, but the batches never move to the
Completed batches section; they might all be in a processing or failed state. In such cases too,
driver logs can help you understand the nature of the underlying issues.
Prints: Any print statements that are part of the DAG show up in the logs too.

Executor logs
Executor logs are sometimes helpful if you see certain tasks are misbehaving and would like to see the logs for
specific tasks. From the task details page shown above, you can get the executor where the task was run. Once
you have that, you can go to the clusters UI page, click the # nodes, and then the master. The master page lists all
the workers. You can choose the worker where the suspicious task was run and then get to the log4j output.
Pools
7/21/2022 • 2 minutes to read

Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use
instances. When a cluster is attached to a pool, cluster nodes are created using the pool’s idle instances. If the
pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to
accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is free for
another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.
You can specify a different pool for the driver node and worker nodes, or use the same pool for both.
For an introduction to pools and configuration recommendations, view the Databricks pools video:

Azure Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply.
See pricing.
You can manage pools using the UI, the Instance Pools CLI, or by calling the Instance Pools API 2.0.
This section describes how to work with pools using the UI:
Display pools
Create a pool
Configure pools
Edit a pool
Delete a pool
Attach a cluster to one or more pools
Best practices: pools
Display pools
7/21/2022 • 2 minutes to read

To display the pools in your workspace, click Compute in the sidebar and click the Pools tab:

For each pool, the page displays:


Pool name
Configured instance type
Minimum idle instances
Maximum instance capacity
Current number of idle instances
Current number of used instances

The button at the far right of a row provides quick access to delete a pool.

To display cluster attachment info, click the pool name in the Pools list.
Create a pool
7/21/2022 • 2 minutes to read

IMPORTANT
You must have permission to create a pool; see Pool access control.

This article describes how to create a pool using the UI.


To learn how to use the Databricks CLI to create a pool, see Instance Pools CLI.
To learn how to use the REST API to create a pool, see Instance Pools API 2.0.
To create a pool:

1. Click Compute in the sidebar.


2. Click the Pools tab.

3. Click the Create Pool button at the top of the page.

4. Specify the pool configuration.


5. Click the Create button.

You will notice idle instances in the pending state. When they are no longer pending, clusters attached to the
pool will start faster.
To create a pool using the REST API, see the Instance Pools API 2.0 documentation.
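For reference, the following is a minimal sketch of such a request, assuming a hypothetical workspace URL, token, and node type; the field names follow the Instance Pools API 2.0.

import requests

pool = {
    "instance_pool_name": "demo-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,
    "max_capacity": 10,
    "idle_instance_autotermination_minutes": 20,
    "preloaded_spark_versions": ["10.4.x-scala2.12"],  # optional: preload a Databricks Runtime version
}

resp = requests.post(
    "https://<databricks-instance>/api/2.0/instance-pools/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=pool,
)
resp.raise_for_status()
print(resp.json()["instance_pool_id"])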
Configure pools
7/21/2022 • 4 minutes to read

This article explains the configuration options available when you create and edit a pool.

Pool size and auto termination


When you create a pool, in order to control its size, you can set three parameters: minimum idle instances,
maximum capacity, and idle instance auto termination.
Minimum Idle Instances
The minimum number of instances the pool keeps idle. These instances do not terminate, regardless of the
setting specified in Idle Instance Auto Termination. If a cluster consumes idle instances from the pool, Azure
Databricks provisions additional instances to maintain the minimum.

Maximum Capacity
The maximum number of instances that the pool will provision. If set, this value constrains all instances (idle +
used). If a cluster using the pool requests more instances than this number during autoscaling, the request will
fail with an INSTANCE_POOL_MAX_CAPACITY_FAILURE error.

This configuration is optional. Azure Databricks recommends setting a value only in the following circumstances:
You have an instance quota you must stay under.
You want to protect one set of work from impacting another set of work. For example, suppose your instance
quota is 100 and you have teams A and B that need to run jobs. You can create pool A with a max of 50 and
pool B with a max of 50 so that the two teams share the 100 quota fairly.
You need to cap cost.
Idle Instance Auto Termination
The time in minutes that instances above the value set in Minimum Idle Instances can be idle before being
terminated by the pool.

Instance types
A pool consists of both idle instances kept ready for new clusters and instances in use by running clusters. All of
these instances are of the same instance provider type, selected when creating a pool.
A pool’s instance type cannot be edited. Clusters attached to a pool use the same instance type for the driver and
worker nodes. Different families of instance types fit different use cases, such as memory-intensive or compute-
intensive workloads.

Azure Databricks always provides one year’s deprecation notice before ceasing support for an instance type.

NOTE
If your security requirements include compute isolation, select a Standard_F72s_V2 instance as your worker type. These
instance types represent isolated virtual machines that consume the entire physical host and provide the necessary level
of isolation required to support, for example, US Department of Defense Impact Level 5 (IL5) workloads.

Preload Databricks Runtime version


You can speed up cluster launches by selecting a Databricks Runtime version to be loaded on idle instances in
the pool. If a user selects that runtime when they create a cluster backed by the pool, that cluster will launch
even more quickly than a pool-backed cluster that doesn’t use a preloaded Databricks Runtime version.
Setting this option to None slows down cluster launches, as it causes the Databricks Runtime version to
download on demand to idle instances in the pool. When the cluster releases the instances in the pool, the
Databricks Runtime version remains cached on those instances. The next cluster creation operation that uses the
same Databricks Runtime version might benefit from this caching behavior, but it is not guaranteed.

Pool tags
Pool tags allow you to easily monitor the cost of cloud resources used by various groups in your organization.
You can specify tags as key-value pairs when you create a pool, and Azure Databricks applies these tags to cloud
resources like VMs and disk volumes, as well as DBU usage reports.
For convenience, Azure Databricks applies three default tags to each pool: Vendor , DatabricksInstancePoolId ,
and DatabricksInstancePoolCreatorId . You can also add custom tags when you create a pool. You can add up to
41 custom tags.
Custom tag inheritance
Pool-backed clusters inherit default and custom tags from the pool configuration. For detailed information
about how pool tags and cluster tags work together, see Monitor usage using cluster, pool, and workspace tags.
Configure custom pool tags
1. At the bottom of the pool configuration page, select the Tags tab.
2. Specify a key-value pair for the custom tag.

3. Click Add .

Autoscaling local storage


It can often be difficult to estimate how much disk space a particular job will take. To save you from having to
estimate how many gigabytes of managed disk to attach to your pool at creation time, Azure Databricks
automatically enables autoscaling local storage on all Azure Databricks pools.
With autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your pool’s
instances. If an instance runs too low on disk, a new managed disk is attached automatically before it runs out of
disk space. Disks are attached up to a limit of 5 TB of total disk space per virtual machine (including the virtual
machine’s initial local storage).
The managed disks attached to a virtual machine are detached only when the virtual machine is returned to
Azure. That is, managed disks are never detached from a virtual machine as long as it is part of a pool.

Spot instances
To save cost, you can choose to use spot instances by checking the All Spot radio button.

Clusters in the pool will launch with spot instances for all nodes, driver and worker (as opposed to the hybrid
on-demand driver and spot instance workers for non-pool clusters).
If spot instances are evicted due to unavailability, on-demand instances do not replace evicted instances.
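When creating a pool through the Instance Pools API 2.0, the equivalent of the All Spot option is expressed through the pool’s Azure attributes. The snippet below is a sketch of the relevant part of a create request; verify the exact azure_attributes field names against the API reference for your workspace.

# Fragment of an Instance Pools API 2.0 create payload for an all-spot pool (a sketch;
# field names under azure_attributes are assumptions to verify against the API reference).
pool_spec = {
    "instance_pool_name": "all-spot-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,
    "max_capacity": 10,
    "azure_attributes": {
        "availability": "SPOT_AZURE",  # use spot instances for all nodes in the pool
        "spot_bid_max_price": -1,      # -1 caps the spot price at the current on-demand price
    },
}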
Edit a pool
7/21/2022 • 2 minutes to read

To edit a pool configuration, click Edit on the pool detail page.

Some pool configuration settings are not editable. These settings are grayed out.
You can also invoke the Edit API to programmatically edit the pool.

NOTE
Clusters that were attached to the pool remain attached after editing.
Delete a pool
7/21/2022 • 2 minutes to read

Deleting a pool terminates the pool’s idle instances and removes its configuration.

WARNING
You cannot undo this action.

To delete a pool, click the icon in the actions on the Pools page.

NOTE
Running clusters attached to the pool continue to run, but cannot allocate instances during resize or up-scaling.
Terminated clusters attached to the pool will fail to start.

You can also invoke the Delete API endpoint to programmatically delete a pool.
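For example, a minimal sketch of that call, assuming a hypothetical workspace URL, token, and pool ID:

import requests

requests.post(
    "https://<databricks-instance>/api/2.0/instance-pools/delete",
    headers={"Authorization": "Bearer <personal-access-token>"},
    # The pool ID is shown on the pool's properties page.
    json={"instance_pool_id": "<pool-id>"},
).raise_for_status()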
Attach a cluster to one or more pools
7/21/2022 • 2 minutes to read

To reduce cluster start time, you can designate predefined pools of idle instances to create worker nodes and the
driver node. This is also called attaching the cluster to the pools. The cluster is created using instances in the
pools. If a pool does not have sufficient idle resources to create the requested driver node or worker nodes, the
pool expands by allocating new instances from the instance provider. When the cluster is terminated, the
instances it used are returned to the pool and can be reused by a different cluster.
You can attach a different pool for the driver node and worker nodes, or attach the same pool for both.

IMPORTANT
You must use a pool for both the driver node and worker nodes, or for neither. Otherwise, an error occurs and your
cluster isn’t created. This prevents a situation where the driver node has to wait for worker nodes to be created, or vice
versa.

Requirements
You must have permission to attach to each pool; see Pool access control.

Configure the cluster


To attach a cluster to a pool using the cluster creation UI, select the pool from the Driver Type or Worker Type
drop-down when you configure the cluster. Available pools are listed at the top of each drop-down list. You can
use the same pool or different pools for the driver node and worker nodes.

If you use the Clusters API, you must specify driver_instance_pool_id for the driver node and instance_pool_id
for the worker nodes.
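For example, a cluster specification that attaches the driver and workers to two different pools might look like the following sketch; the pool IDs are placeholders for the values shown on your pool pages, and the instance type comes from each pool rather than from the cluster specification.

# Sketch of a Clusters API 2.0 payload for a pool-backed cluster.
cluster_spec = {
    "cluster_name": "pool-backed-cluster",
    "spark_version": "10.4.x-scala2.12",
    "instance_pool_id": "<worker-pool-id>",         # pool that provides worker nodes
    "driver_instance_pool_id": "<driver-pool-id>",  # pool that provides the driver node
    "num_workers": 2,
}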

Inherited configuration
When you attach a cluster to a pool, the following configuration properties are inherited from the pool:
Docker images.
Custom cluster tags: You can add additional custom tags for the cluster, and both the cluster-level tags and
those inherited from pools are applied. You cannot add a cluster-specific custom tag with the same key name
as a custom tag inherited from a pool (that is, you cannot override a custom tag that is inherited from the
pool).
Best practices: pools
7/21/2022 • 4 minutes to read

Clusters provide the computation resources and configurations that run your notebooks and jobs. Clusters run
on instances provisioned by your cloud provider on demand. The Azure Databricks platform provides an
efficient and cost-effective way to manage your analytics infrastructure. This article shows how to address the
following challenges when creating new clusters or scaling up existing clusters:
The execution time of your Azure Databricks job might be shorter than the time to provision instances and
start a new cluster.
When autoscaling is enabled on a cluster, it takes time for the cloud provider to provision new instances. This
can negatively impact jobs with strict performance requirements or varying workloads.
Azure Databricks pools reduce cluster start and scale-up times by maintaining a set of available, ready-to-use
instances.
You can use a different pool for the driver node and worker nodes.
For an introduction to pools and configuration recommendations, view the following video:

As shown in the following diagram, when a cluster attached to a pool needs an instance, it first attempts to
allocate one of the pool’s available instances. If the pool has no available instances, it expands by allocating a
new instance from the cloud provider to accommodate the cluster’s request. When a cluster releases an
instance, the instance returns to the pool and is free for use by another cluster. Only clusters attached to a pool
can use that pool’s available instances.

This article discusses the following best practices to ensure the best performance at the lowest cost when you
use pools:
Create pools using instance types and Azure Databricks runtimes based on target workloads.
When possible, populate pools with spot instances to reduce costs.
Populate pools with on-demand instances for jobs with short execution times and strict execution time
requirements.
Use pool tags and cluster tags to manage billing.
Use pool configuration options to minimize cost.
Pre-populate pools to make sure instances are available when clusters need them.

Create pools based on workloads


If your driver node and worker nodes have different requirements, create a different pool for each.
You can minimize instance acquisition time by creating a pool for each instance type and Azure Databricks
runtime your organization commonly uses. For example, if most data engineering clusters use instance type A,
data science clusters use instance type B, and analytics clusters use instance type C, create a pool with each
instance type.
Configure pools to use on-demand instances for jobs with short execution times and strict execution time
requirements. Use on-demand instances to prevent acquired instances from being lost to a higher bidder on the
spot market.
Configure pools to use spot instances for clusters that support interactive development or jobs that prioritize
cost savings over reliability.

Tag pools to manage cost and billing


Tagging pools to the correct cost center allows you to manage cost and usage chargeback. You can use multiple
custom tags to associate multiple cost centers to a pool. However, it’s important to understand how tags are
propagated when a cluster is created from pools. As shown in the following diagram, tags from the pools
propagate to the underlying cloud provider instances, but the cluster’s tags do not. Apply all custom tags
required for managing chargeback of the cloud provider compute cost to the pool.
Pool tags and cluster tags both propagate to Azure Databricks billing. You can use the combination of cluster
and pool tags to manage chargeback of Azure Databricks Units.

To learn more, see Monitor usage using cluster, pool, and workspace tags

Configure pools to control cost


You can use the following configuration options to help control the cost of pools:
Set the Min Idle instances to 0 to avoid paying for running instances that aren’t doing work. The tradeoff is a
possible increase in time when a cluster needs to acquire a new instance.
Set the Idle Instance Auto Termination time to provide a buffer between when the instance is released from
the cluster and when it’s dropped from the pool. Set this to a period that allows you to minimize cost while
ensuring the availability of instances for scheduled jobs. For example, job A is scheduled to run at 8:00 AM
and takes 40 minutes to complete. Job B is scheduled to run at 9:00 AM and takes 30 minutes to complete.
Set the Idle Instance Auto Termination value to 20 minutes to ensure that instances returned to the pool
when job A completes are available when job B starts. Unless they are claimed by another cluster, those
instances are terminated 20 minutes after job B ends.
Set the Max Capacity based on anticipated usage. This sets the ceiling for the maximum number of used and
idle instances in the pool. If a job or cluster requests an instance from a pool at its maximum capacity, the
request fails, and the cluster doesn’t acquire more instances. Therefore, Databricks recommends that you set
the maximum capacity only if there is a strict instance quota or budget constraint.

Pre-populate pools
To benefit fully from pools, you can pre-populate newly created pools. Set the Min Idle instances greater than
zero in the pool configuration. Alternatively, if you’re following the recommendation to set this value to zero, use
a starter job to ensure that newly created pools have available instances for clusters to access.
With the starter job approach, schedule a job with flexible execution time requirements to run before jobs with
more strict performance requirements or before users start using interactive clusters. After the job finishes, the
instances used for the job are released back to the pool. Set the Min Idle instances setting to 0 and set the Idle
Instance Auto Termination time high enough to ensure that idle instances remain available for subsequent
jobs.
Using a starter job allows the pool instances to spin up, populate the pool, and remain available for downstream
job or interactive clusters.

Learn more
Learn more about Azure Databricks pools.
Notebooks
7/21/2022 • 2 minutes to read

A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative
text.
For a quick introduction to notebooks, view this video:

This section describes how to manage and use notebooks. It also contains articles on creating data
visualizations, sharing visualizations as dashboards, parameterizing notebooks and dashboards with widgets,
building complex pipelines with notebooks, and best practices for defining classes in Scala notebooks.
Manage notebooks
Create a notebook
Open a notebook
Delete a notebook
Copy notebook path
Rename a notebook
Control access to a notebook
Notebook external formats
Notebooks and clusters
Schedule a notebook
Distribute notebooks
Use notebooks
Develop notebooks
Run notebooks
Share code in notebooks
Manage notebook state and results
Revision history
Version control with Git
Test notebooks
Visualizations
Create a new visualization
Create a new data profile
Work with visualizations and data profiles
Dashboards
Dashboards notebook
Create a scheduled job to refresh a dashboard
View a specific dashboard version
ipywidgets
Requirements
Usage
Example notebook
Best practices for using ipywidgets and Databricks widgets
Limitations
Databricks widgets
Databricks widget types
Databricks widget API
Configure widget settings
Databricks widgets in dashboards
Use Databricks widgets with %run
Modularize or link notebook code
Ways to modularize or link notebooks
API
Example
Pass structured data
Handle errors
Run multiple notebooks concurrently
Package cells
Package Cells notebook
IPython kernel
Benefits of using the IPython kernel
Known issue
Best practices
Requirements
Walkthrough
bamboolib
Requirements
Quickstart
Walkthroughs
Key tasks
Additional resources
Legacy visualizations
Create a legacy visualization
Machine learning visualizations
Structured Streaming DataFrames
displayHTML function
Images
Visualizations in Python
Visualizations in R
Visualizations in Scala
Deep dive notebooks for Python and Scala
Manage notebooks
7/21/2022 • 10 minutes to read

You can manage notebooks using the UI, the CLI, and by invoking the Workspace API. This article focuses on
performing notebook tasks using the UI. For the other methods, see Databricks CLI and Workspace API 2.0.

Create a notebook
Use the Create button
The easiest way to create a new notebook in your default folder is to use the Create button:

1. Click Create in the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. Enter a name and select the notebook’s default language.
3. If there are running clusters, the Cluster drop-down displays. Select the cluster you want to attach the
notebook to.
4. Click Create .
Create a notebook in any folder
You can create a new notebook in any folder (for example, in the Shared folder) following these steps:

1. In the sidebar, click Workspace . Do one of the following:

Next to any folder, click the on the right side of the text and select Create > Notebook .

In the workspace or a user folder, click and select Create > Notebook .
2. Follow steps 2 through 4 in Use the Create button.

Open a notebook
In your workspace, click a notebook. The notebook path displays when you hover over the notebook title.

Delete a notebook
See Folders and Workspace object operations for information about how to access the workspace menu and
delete notebooks or other items in the workspace.

Copy notebook path


To copy a notebook file path without opening the notebook, right-click the notebook name or click the to
the right of the notebook name and select Copy File Path .

Rename a notebook
To change the title of an open notebook, click the title and edit inline or click File > Rename .

Control access to a notebook


If your Azure Databricks account has the Premium Plan, you can use Workspace access control to control who
has access to a notebook.

Notebook external formats


Azure Databricks supports several notebook external formats:
Source file: A file containing only source code statements with the extension .scala , .py , .sql , or .r .
HTML: An Azure Databricks notebook with the extension .html .
DBC archive: A Databricks archive.
IPython notebook: A Jupyter notebook with the extension .ipynb .
RMarkdown: An R Markdown document with the extension .Rmd .
In this section:
Import a notebook
Convert a file to a notebook
Export a notebook
Export all notebooks in a folder
Import a notebook
You can import an external notebook from a URL or a file. You can also import a ZIP archive of notebooks
exported in bulk from an Azure Databricks workspace.

1. Click Workspace in the sidebar. Do one of the following:

Next to any folder, click the on the right side of the text and select Import .
In the Workspace or a user folder, click and select Import .
2. Specify the URL or browse to a file containing a supported external format or a ZIP archive of notebooks
exported from an Azure Databricks workspace.
3. Click Import .
If you choose a single notebook, it is imported into the current folder.
If you choose a DBC or ZIP archive, its folder structure is recreated in the current folder and each
notebook is imported.
Convert a file to a notebook
You can convert existing Python, SQL, Scala, and R scripts to single-cell notebooks by adding a comment to the
first cell of the file:
Python

# Databricks notebook source

SQL

-- Databricks notebook source

Scala

// Databricks notebook source

R

# Databricks notebook source

Databricks notebooks use a special comment surrounded by whitespace to define cells:


Python

# COMMAND ----------

SQL

-- COMMAND ----------

Scala

// COMMAND ----------

R
# COMMAND ----------

Export a notebook
In the notebook toolbar, select File > Export and a format.

NOTE
When you export a notebook as HTML, IPython notebook, or archive (DBC), and you have not cleared the results, the
results of running the notebook are included.

Export all notebooks in a folder

NOTE
When you export a notebook as HTML, IPython notebook, or archive (DBC), and you have not cleared the results, the
results of running the notebook are included.

To export all folders in a workspace folder as a ZIP archive:

1. Click Workspace in the sidebar. Do one of the following:

Next to any folder, click the on the right side of the text and select Export .
In the Workspace or a user folder, click and select Export .
2. Select the export format:
DBC Archive : Export a Databricks archive, a binary format that includes metadata and notebook
command results.
Source File : Export a ZIP archive of notebook source files, which can be imported into an Azure
Databricks workspace, used in a CI/CD pipeline, or viewed as source files in each notebook’s default
language. Notebook command results are not included.
HTML Archive : Export a ZIP archive of HTML files. Each notebook’s HTML file can be imported into
an Azure Databricks workspace or viewed as HTML. Notebook command results are included.
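If you want to script exports instead of using the UI, the Workspace API 2.0 export endpoint returns notebook or folder content base64-encoded. The following is a minimal sketch, assuming a hypothetical workspace URL, token, and notebook path.

import base64
import requests

resp = requests.get(
    "https://<databricks-instance>/api/2.0/workspace/export",
    headers={"Authorization": "Bearer <personal-access-token>"},
    params={"path": "/Users/someone@example.com/my-notebook", "format": "DBC"},
)
resp.raise_for_status()

# Decode the base64 content and write the archive to a local file.
with open("my-notebook.dbc", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))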

Notebooks and clusters


Before you can do any work in a notebook, you must first attach the notebook to a cluster. This section describes
how to attach and detach notebooks to and from clusters and what happens behind the scenes when you
perform these actions.
In this section:
Execution contexts
Attach a notebook to a cluster
Detach a notebook from a cluster
View all notebooks attached to a cluster
Execution contexts
When you attach a notebook to a cluster, Azure Databricks creates an execution context. An execution context
contains the state for a REPL environment for each supported programming language: Python, R, Scala, and
SQL. When you run a cell in a notebook, the command is dispatched to the appropriate language REPL
environment and run.
You can also use the REST 1.2 API to create an execution context and send a command to run in the execution
context. Similarly, the command is dispatched to the language REPL environment and run.
A cluster has a maximum number of execution contexts (145). Once the number of execution contexts has
reached this threshold, you cannot attach a notebook to the cluster or create a new execution context.
Idle execution contexts
An execution context is considered idle when the last completed execution occurred past a set idle threshold.
Last completed execution is the last time the notebook completed execution of commands. The idle threshold is
the amount of time that must pass between the last completed execution and any attempt to automatically
detach the notebook. The default idle threshold is 24 hours.
When a cluster has reached the maximum context limit, Azure Databricks removes (evicts) idle execution
contexts (starting with the least recently used) as needed. Even when a context is removed, the notebook using
the context is still attached to the cluster and appears in the cluster’s notebook list. Streaming notebooks are
considered actively running, and their context is never evicted until their execution has been stopped. If an idle
context is evicted, the UI displays a message indicating that the notebook using the context was detached due to
being idle.

If you attempt to attach a notebook to a cluster that has the maximum number of execution contexts and there are no
idle contexts (or if auto-eviction is disabled), the UI displays a message saying that the current maximum
execution contexts threshold has been reached and the notebook will remain in the detached state.

If you fork a process, an idle execution context is still considered idle once execution of the request that forked
the process returns. Forking separate processes is not recommended with Spark.
Configure context auto-eviction
Auto-eviction is enabled by default. To disable auto-eviction for a cluster, set the Spark property
spark.databricks.chauffeur.enableIdleContextTracking false .

Attach a notebook to a cluster


To attach a notebook to a cluster, you need the Can Attach To cluster-level permission.

IMPORTANT
As long as a notebook is attached to a cluster, any user with the Can Run permission on the notebook has implicit
permission to access the cluster.

To attach a notebook to a cluster:


1. In the notebook toolbar, click Detached .
2. From the drop-down, select a cluster.
IMPORTANT
An attached notebook has the following Apache Spark variables defined.

CLASS                        VARIABLE NAME

SparkContext                 sc

SQLContext / HiveContext     sqlContext

SparkSession (Spark 2.x)     spark

Do not create a SparkSession , SparkContext , or SQLContext . Doing so will lead to inconsistent behavior.

Determine Spark and Databricks Runtime version


To determine the Spark version of the cluster your notebook is attached to, run:

spark.version

To determine the Databricks Runtime version of the cluster your notebook is attached to, run:

spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")

NOTE
Both this sparkVersion tag and the spark_version property required by the endpoints in the Clusters API 2.0 and
Jobs API 2.1 refer to the Databricks Runtime version, not the Spark version.

Detach a notebook from a cluster


1. In the notebook toolbar, click Attached .
2. Select Detach .

You can also detach notebooks from a cluster using the Notebooks tab on the cluster details page.
When you detach a notebook from a cluster, the execution context is removed and all computed variable values
are cleared from the notebook.

TIP
Azure Databricks recommends that you detach unused notebooks from a cluster. This frees up memory space on the
driver.

View all notebooks attached to a cluster


The Notebooks tab on the cluster details page displays all of the notebooks that are attached to a cluster. The
tab also displays the status of each attached notebook, along with the last time a command was run from the
notebook.
Schedule a notebook
To schedule a notebook job to run periodically:

1. In the notebook, click Schedule at the top right. If no jobs exist for this notebook, the Schedule
dialog appears.

If jobs already exist for the notebook, the Jobs List dialog appears. To display the Schedule dialog, click
Add a schedule .

2. In the Schedule dialog, optionally enter a name for the job. The default name is the name of the
notebook.
3. Select Manual to run your job only when manually triggered, or Scheduled to define a schedule for
running the job. If you select Scheduled , use the drop-downs to specify the frequency, time, and time
zone.
4. In the Cluster drop-down, select the cluster to run the task.
If you have Allow Cluster Creation permissions, by default the job runs on a new job cluster. To edit
the configuration of the default job cluster, click Edit at the right of the field to display the cluster
configuration dialog.
If you do not have Allow Cluster Creation permissions, by default the job runs on the cluster that the
notebook is attached to. If the notebook is not attached to a cluster, you must select a cluster from the
Cluster drop-down.
5. Optionally, enter any Parameters to pass to the job. Click Add and specify the key and value of each
parameter. Parameters set the value of the notebook widget specified by the key of the parameter. Use
Task parameter variables to pass a limited set of dynamic values as part of a parameter value.
6. Optionally, specify email addresses to receive Alerts on job events. See Notifications.
7. Click Submit .
Manage scheduled notebook jobs
To display jobs associated with this notebook, click the Schedule button. The jobs list dialog appears, showing
all jobs currently defined for this notebook. To manage jobs, click at the right of a job in the list.

From this menu, you can edit, clone, view, pause, resume, or delete a scheduled job.
When you clone a scheduled job, a new job is created with the same parameters as the original. The new job
appears in the list with the name “Clone of ”.
How you edit a job depends on the complexity of the job’s schedule. Either the Schedule dialog or the Job details
panel displays, allowing you to edit the schedule, cluster, parameters, and so on.

Distribute notebooks
To allow you to easily distribute Azure Databricks notebooks, Azure Databricks supports the Databricks archive,
which is a package that can contain a folder of notebooks or a single notebook. A Databricks archive is a JAR file
with extra metadata and has the extension .dbc . The notebooks contained in the archive are in an Azure
Databricks internal format.
Import an archive

1. Click or to the right of a folder or notebook and select Import .


2. Choose File or URL .
3. Go to or drop a Databricks archive in the dropzone.
4. Click Import . The archive is imported into Azure Databricks. If the archive contains a folder, Azure Databricks
recreates that folder.
Export an archive
Click or to the right of a folder or notebook and select Export > DBC Archive . Azure Databricks
downloads a file named <[folder|notebook]-name>.dbc .
Use notebooks
7/21/2022 • 21 minutes to read

A notebook is a collection of runnable cells (commands). When you use a notebook, you are primarily
developing and running cells.
All notebook tasks are supported by UI actions, but you can also perform many tasks using keyboard shortcuts.
Toggle the shortcut display by clicking the icon.

Develop notebooks
This section describes how to develop notebook cells and navigate around a notebook.
In this section:
About notebooks
Add a cell
Delete a cell
Cut, copy, and paste cells
Select multiple cells or all cells
Default language
Mix languages
Include documentation
Command comments
Change cell display
Show line and command numbers
Find and replace text
Autocomplete
Format SQL
View table of contents
View notebooks in dark mode
About notebooks
A notebook has a toolbar that lets you manage the notebook and perform actions within the notebook:

and one or more cells (or commands) that you can run:

At the far right of a cell, the cell actions menu contains three menus ( Run , Dashboard , and Edit ) and two
actions ( Hide and Delete ).
Add a cell
To add a cell, mouse over a cell at the top or bottom and click the icon, or access the notebook cell menu at
the far right, click , and select Add Cell Above or Add Cell Below .
Delete a cell
Go to the cell actions menu at the far right and click (Delete).
When you delete a cell, by default a delete confirmation dialog appears. To disable future confirmation dialogs,
select the Do not show this again checkbox and click Confirm . You can also toggle the confirmation dialog
setting with the Turn on command delete confirmation option in > User Settings > Notebook
Settings .
To restore deleted cells, either select Edit > Undo Delete Cells or use the ( Z ) keyboard shortcut.
Cut, copy, and paste cells
There are several options to cut and copy cells:

Use the cell actions menu at the right of the cell. Click and select Cut Cell or Copy Cell .
Use keyboard shortcuts: Command-X or Ctrl-X to cut and Command-C or Ctrl-C to copy.
Use the Edit menu at the top of the notebook. Select Cut current cell or Copy current cell .
After you cut or copy cells, you can paste those cells elsewhere in the notebook, into a different notebook, or
into a notebook in a different browser tab or window. To paste cells, use the keyboard shortcut Command-V or
Ctrl-V . The cells are pasted below the current cell.

You can use the keyboard shortcut Command-Z or Ctrl-Z to undo cut or paste actions.

NOTE
If you are using Safari, you must use the keyboard shortcuts.

Select multiple cells or all cells


You can select adjacent notebook cells using Shift + Up or Down for the previous and next cell respectively.
When multiple cells are selected, you can copy, cut, delete, and paste them.
To select all cells, select Edit > Select All Cells or use the command mode shortcut Cmd+A .
Default language
The notebook’s default language is indicated by a button next to the notebook name. In the following notebook,
the default language is SQL.
To change the default language:
1. Click the language button. The Change Default Language dialog appears.

2. Select the new language from the Default Language drop-down.


3. Click Change .
4. To ensure that existing commands continue to work, commands of the previous default language are
automatically prefixed with a language magic command.
Mix languages
By default, cells use the default language of the notebook. You can override the default language in a cell by
clicking the language button and selecting a language from the drop down.

Alternately, you can use the language magic command %<language> at the beginning of a cell. The supported
magic commands are: %python , %r , %scala , and %sql .

NOTE
When you invoke a language magic command, the command is dispatched to the REPL in the execution context for the
notebook. Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of
another language. REPLs can share state only through external resources such as files in DBFS or objects in object storage.

Notebooks also support a few auxiliary magic commands:


%sh : Allows you to run shell code in your notebook. To fail the cell if the shell command has a non-zero exit
status, add the -e option. This command runs only on the Apache Spark driver, and not the workers. To run
a shell command on all nodes, use an init script.
%fs : Allows you to use dbutils filesystem commands. For example, to run the dbutils.fs.ls command to
list files, you can specify %fs ls instead. For more information, see Use %fs magic commands.
%md : Allows you to include various types of documentation, including text, images, and mathematical
formulas and equations. See the next section.
Explore SQL cell results in Python notebooks natively using Python
You might want to load data using SQL and explore it using Python. In a Databricks Python notebook, table
results from a SQL language cell are automatically made available as a Python DataFrame. The name of the
Python DataFrame is _sqldf .

NOTE
In Python notebooks, the DataFrame _sqldf is not saved automatically and is replaced with the results of the
most recent SQL cell run. To save the DataFrame, run this code in a Python cell:

new_dataframe_name = _sqldf

If the query uses a widget for parameterization, the results are not available as a Python DataFrame.
If the query uses the keywords CACHE TABLE or UNCACHE TABLE , the results are not available as a Python
DataFrame.

The screenshot shows an example:


SQL syntax highlighting and autocomplete in Python commands
Syntax highlighting and SQL autocomplete are available when you use SQL inside a Python command, such as
in a spark.sql command.
Include documentation
To include documentation in a notebook you can create a markdown cell, either by selecting Markdown from
the cell’s language button or by using the %md magic command. The contents of the cell are rendered into
HTML. For example, this snippet contains markup for a level-one heading:

%md # Hello This is a Title

It is rendered as an HTML title:

Collapsible headings
Cells that appear after cells containing Markdown headings can be collapsed into the heading cell. The following
image shows a level-one heading called Heading 1 with the following two cells collapsed into it.

To expand and collapse headings, click the + and - .


Also see Hide and show cell content.
To expand or collapse cells after cells containing Markdown headings throughout the notebook, select Expand all headings or Collapse all headings from the View menu.

Link to other notebooks


You can link to other notebooks or folders in Markdown cells using relative paths. Specify the href attribute of
an anchor tag as the relative path, starting with a $ and then following the same pattern as in Unix file systems:
%md
<a href="$./myNotebook">Link to notebook in same folder as current notebook</a>
<a href="$../myFolder">Link to folder in parent folder of current notebook</a>
<a href="$./myFolder2/myNotebook2">Link to nested notebook</a>

Display images
To display images stored in the FileStore, use the syntax:

%md
![test](files/image.png)

For example, suppose you have the Databricks logo image file in FileStore:

dbfs ls dbfs:/FileStore/

databricks-logo-mobile.png

When you include the following code in a Markdown cell:

%md
![test](files/databricks-logo-mobile.png)

the image is rendered in the cell:

Display mathematical equations


Notebooks support KaTeX for displaying mathematical formulas and equations. For example,

%md
\\(c = \\pm\\sqrt{a^2 + b^2} \\)

\\(A{_i}{_j}=B{_i}{_j}\\)

$$c = \\pm\\sqrt{a^2 + b^2}$$

\\[A{_i}{_j}=B{_i}{_j}\\]

renders as:
and

%md
\\( f(\beta)= -Y_t^T X_t \beta + \sum log( 1+{e}^{X_t\bullet\beta}) + \frac{1}{2}\delta^t S_t^{-1}\delta\\)

where \\(\delta=(\beta - \mu_{t-1})\\)

renders as:

Include HTML
You can include HTML in a notebook by using the function displayHTML . See HTML, D3, and SVG in notebooks
for an example of how to do this.
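As a minimal illustration, you can pass any HTML string directly to the function:

# Renders the HTML in an iframe below the cell.
displayHTML("<h1>Hello from <b>displayHTML</b></h1>")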

NOTE
The displayHTML iframe is served from the domain databricksusercontent.com and the iframe sandbox includes the
allow-same-origin attribute. databricksusercontent.com must be accessible from your browser. If it is currently
blocked by your corporate network, it must be added to an allow list.

Command comments
You can have discussions with collaborators using command comments.
To toggle the Comments sidebar, click the Comments button at the top right of a notebook.

To add a comment to a command:


1. Highlight the command text and click the comment bubble:
2. Add your comment and click Comment .

To edit, delete, or reply to a comment, click the comment and choose an action.

Change cell display


There are three display options for notebooks:
Standard view: results are displayed immediately after code cells
Results only: only results are displayed
Side-by-side: code and results cells are displayed side by side, with results to the right
Go to the View menu to select your display option.

Show line and command numbers


To show line numbers or command numbers, go to the View menu and select Show line numbers or
Show command numbers . Once they’re displayed, you can hide them again from the same menu. You can
also enable line numbers with the keyboard shortcut Control+L .
If you enable line or command numbers, Databricks saves your preference and shows them in all of your other
notebooks for that browser.
Command numbers above cells link to that specific command. If you click the command number for a cell, it
updates your URL to be anchored to that command. If you want to link to a specific command in your notebook,
right-click the command number and choose copy link address .
Find and replace text
To find and replace text within a notebook, select Edit > Find and Replace . The current match is highlighted in
orange and all other matches are highlighted in yellow.

To replace the current match, click Replace . To replace all matches in the notebook, click Replace All .
To move between matches, click the Prev and Next buttons. You can also press shift+enter and enter to go to
the previous and next matches, respectively.
To close the find and replace tool, click or press esc .
Autocomplete
You can use Azure Databricks autocomplete to automatically complete code segments as you type them. Azure
Databricks supports two types of autocomplete: local and server.
Local autocomplete completes words that are defined in the notebook. Server autocomplete accesses the cluster
for defined types, classes, and objects, as well as SQL database and table names. To activate server
autocomplete, attach your notebook to a cluster and run all cells that define completable objects.

IMPORTANT
Server autocomplete in R notebooks is blocked during command execution.

To trigger autocomplete, press Tab after entering a completable object. For example, after you define and run
the cells containing the definitions of MyClass and instance , the methods of instance are completable, and a
list of valid completions displays when you press Tab .
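For example, a sketch of such definitions (the greet method is only illustrative):

class MyClass:
    def greet(self, name):
        return "Hello, " + name

instance = MyClass()

# In a later cell, type `instance.` and press Tab to see greet among the completions.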

Type completion, as well as SQL database and table name completion, work in SQL cells and in SQL embedded
in Python.

In Databricks Runtime 7.4 and above, you can display Python docstring hints by pressing Shift+Tab after
entering a completable Python object. The docstrings contain the same information as the help() function for
an object.

Format SQL
Azure Databricks provides tools that allow you to format SQL code in notebook cells quickly and easily. These
tools reduce the effort to keep your code formatted and help to enforce the same coding standards across your
notebooks.
You can trigger the formatter in the following ways:
Single cells
Keyboard shortcut: Press Cmd+Shift+F .
Command context menu: Select Format SQL in the command context drop-down menu of a SQL
cell. This item is visible only in SQL notebook cells and those with a %sql language magic.
Multiple cells
Select multiple SQL cells and then select Edit > Format SQL Cells . If you select cells of more than one
language, only SQL cells are formatted. This includes those that use %sql .

Here’s the first cell in the preceding example after formatting:


View table of contents
To display an automatically generated table of contents, click the arrow at the upper left of the notebook
(between the sidebar and the topmost cell). The table of contents is generated from the Markdown headings
used in the notebook.

To close the table of contents, click the left-facing arrow.

View notebooks in dark mode


You can choose to display notebooks in dark mode. To turn dark mode on or off, select View > Notebook
Theme and select Light Theme or Dark Theme .

Run notebooks
This section describes how to run one or more notebook cells.
In this section:
Requirements
Run a cell
Run all above or below
Run all cells
View multiple outputs per cell
Python and Scala error highlighting
Notifications
Databricks Advisor
Requirements
The notebook must be attached to a cluster. If the cluster is not running, the cluster is started when you run one
or more cells.
Run a cell
In the cell actions menu at the far right, click and select Run Cell , or press shift+enter .

IMPORTANT
The maximum size for a notebook cell, both contents and output, is 16MB.

For example, try running this Python code snippet that references the predefined spark variable.

spark

and then, run some real code:

1+1 # => 2
NOTE
Notebooks have a number of default settings:
When you run a cell, the notebook automatically attaches to a running cluster without prompting.
When you press shift+enter , the notebook auto-scrolls to the next cell if the cell is not visible.

To change these settings, select > User Settings > Notebook Settings and configure the respective checkboxes.

Run all above or below


To run all cells before or after a cell, go to the cell actions menu at the far right, click , and
select Run All Above or Run All Below .
Run All Below includes the cell you are in. Run All Above does not.
Run all cells
To run all the cells in a notebook, select Run All in the notebook toolbar.

IMPORTANT
Do not do a Run All if steps for mount and unmount are in the same notebook. It could lead to a race condition and
possibly corrupt the mount points.

View multiple outputs per cell


Python notebooks and %python cells in non-Python notebooks support multiple outputs per cell.
This feature requires Databricks Runtime 7.1 or above and can be enabled in Databricks Runtime 7.1 -
Databricks Runtime 7.3 by setting spark.databricks.workspace.multipleResults.enabled true . It is enabled by
default in Databricks Runtime 7.4 and above.
Python and Scala error highlighting
Python and Scala notebooks support error highlighting. That is, the line of code that is throwing the error will be
highlighted in the cell. Additionally, if the error output is a stacktrace, the cell in which the error is thrown is
displayed in the stacktrace as a link to the cell. You can click this link to jump to the offending code.
Notifications
Notifications alert you to certain events, such as which command is currently running during Run all cells and
which commands are in error state. When your notebook is showing multiple error notifications, the first one
will have a link that allows you to clear all notifications.

Notebook notifications are enabled by default. You can disable them under > User Settings > Notebook
Settings .
Databricks Advisor
Databricks Advisor automatically analyzes commands every time they are run and displays appropriate advice
in the notebooks. The advice notices provide information that can assist you in improving the performance of
workloads, reducing costs, and avoiding common mistakes.
View advice
A blue box with a lightbulb icon signals that advice is available for a command. The box displays the number of
distinct pieces of advice.

Click the lightbulb to expand the box and view the advice. One or more pieces of advice will become visible.

Click the Learn more link to view documentation providing more information related to the advice.
Click the Don’t show me this again link to hide the piece of advice. The advice of this type will no longer be
displayed. This action can be reversed in Notebook Settings.
Click the lightbulb again to collapse the advice box.
Advice settings

Access the Notebook Settings page by selecting > User Settings > Notebook Settings or by clicking
the gear icon in the expanded advice box.

Toggle the Turn on Databricks Advisor option to enable or disable advice.


The Reset hidden advice link is displayed if one or more types of advice is currently hidden. Click the link to
make that advice type visible again.

Share code in notebooks


Azure Databricks supports several methods for sharing code among notebooks. Each of these permits you to
modularize and share code in a notebook, just as you would with a library.
For more complex interactions between notebooks, see Modularize or link code in notebooks.
Use %run to import a notebook
The %run magic executes all of the commands from another notebook. A typical use is to define helper
functions in one notebook that are used by other notebooks.
In the example below, the first notebook defines a helper function, reverse , which is available in the second
notebook after you use the %run magic to execute shared-code-notebook .
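A minimal sketch of this pattern: the shared notebook, shared-code-notebook , contains only the helper definition,

def reverse(s):
    return s[::-1]

and the second notebook includes it and then calls the helper:

%run ./shared-code-notebook

print(reverse("hello"))   # prints "olleh"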

Because both of these notebooks are in the same directory in the workspace, use the prefix ./ in
./shared-code-notebook to indicate that the path should be resolved relative to the currently running notebook.
You can organize notebooks into directories, such as %run ./dir/notebook , or use an absolute path like
%run /Users/username@organization.com/directory/notebook .

NOTE
%run must be in a cell by itself, because it runs the entire notebook inline.
You cannot use %run to run a Python file and import the entities defined in that file into a notebook. To import
from a Python file, see Reference source code files using git. Or, package the file into a Python library, create an Azure
Databricks library from that Python library, and install the library into the cluster you use to run your notebook.
When you use %run to run a notebook that contains widgets, by default the specified notebook runs with the
widget’s default values. You can also pass in values to widgets; see Use Databricks widgets with %run.

Reference source code files using git


IMPORTANT
This feature is in Public Preview.

For notebooks stored in an Azure Databricks Repo, you can reference source code files in the repository. The
following example uses a Python file rather than a notebook.
Create a new example repo to show the file layout:

To configure an existing Git repository, see Clone a remote Git repository.


Create two files in the repo:
1. A Python file with the shared code.
2. A notebook that uses the shared Python code.
The Python file shared.py contains the helper.
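For example, shared.py might contain a single helper function (a minimal sketch):

# shared.py
def reverse(s):
    return s[::-1]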

Now, when you open the notebook, you can reference source code files in the repository using common
commands like import .
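A sketch of the notebook cell, assuming the notebook sits next to shared.py in the repo:

from shared import reverse

print(reverse("hello"))   # prints "olleh"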

For more information on working with files in Git repositories, see Work with non-notebook files in an Azure
Databricks repo.

Manage notebook state and results


After you attach a notebook to a cluster and run one or more cells, your notebook has state and displays results.
This section describes how to manage notebook state and results.
In this section:
Clear notebook state and results
Download results
Download a cell result
Hide and show cell content
Notebook isolation
Clear notebook state and results
To clear the notebook state and results, click Clear in the notebook toolbar and select the action:

Download results
By default downloading results is enabled. To toggle this setting, see Manage the ability to download results
from notebooks. If downloading results is disabled, the button is not visible.
Download a cell result
You can download a cell result that contains tabular output to your local machine. Click the button at the
bottom of a cell.

A CSV file named export.csv is downloaded to your default download directory.


Download full results
By default Azure Databricks returns 1000 rows of a DataFrame. When there are more than 1000 rows, an option
appears to re-run the query and display up to 10,000 rows.

When a query returns more than 1000 rows, a down arrow is added to the button. To download all the
results of a query:

1. Click the down arrow next to and select Download full results .

2. Select Re-execute and download .

After you download full results, a CSV file named export.csv is downloaded to your local machine and
the /databricks-results folder has a generated folder containing the full query results.
Hide and show cell content
Cell content consists of cell code and the result of running the cell. You can hide and show the cell code and
result using the cell actions menu at the top right of the cell.
To hide cell code:
Click and select Hide Code
To hide and show the cell result, do any of the following:
Click and select Hide Result
Select
Type Esc > Shift + o
To show hidden cell code or results, click the Show links:

See also Collapsible headings.


Notebook isolation
Notebook isolation refers to the visibility of variables and classes between notebooks. Azure Databricks
supports two types of isolation:
Variable and class isolation
Spark session isolation

NOTE
Since all notebooks attached to the same cluster execute on the same cluster VMs, even with Spark session isolation
enabled there is no guaranteed user isolation within a cluster.

Variable and class isolation


Variables and classes are available only in the current notebook. For example, two notebooks attached to the
same cluster can define variables and classes with the same name, but these objects are distinct.
To define a class that is visible to all notebooks attached to the same cluster, define the class in a package cell.
Then you can access the class by using its fully qualified name, which is the same as accessing a class in an
attached Scala or Java library.
Spark session isolation
Every notebook attached to a cluster running Apache Spark 2.0.0 and above has a pre-defined variable called
spark that represents a SparkSession . SparkSession is the entry point for using Spark APIs as well as setting
runtime configurations.
Spark session isolation is enabled by default. You can also use global temporary views to share temporary views
across notebooks. See Create View or CREATE VIEW. To disable Spark session isolation, set
spark.databricks.session.share to true in the Spark configuration.
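For example, a sketch of sharing data across notebooks through a global temporary view instead of disabling isolation:

# Notebook A: register a global temporary view; it is visible to every notebook attached to the cluster.
spark.range(10).createOrReplaceGlobalTempView("shared_numbers")

# Notebook B: global temporary views live in the reserved global_temp database.
display(spark.sql("SELECT * FROM global_temp.shared_numbers"))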

IMPORTANT
Setting spark.databricks.session.share true breaks the monitoring used by both streaming notebook cells and
streaming jobs. Specifically:
The graphs in streaming cells are not displayed.
Jobs do not block as long as a stream is running (they just finish “successfully”, stopping the stream).
Streams in jobs are not monitored for termination. Instead you must manually call awaitTermination() .
Creating a new visualization (see Create a new visualization) on streaming DataFrames doesn't work.

Cells that trigger commands in other languages (that is, cells using %scala , %python , %r , and %sql ) and cells
that include other notebooks (that is, cells using %run ) are part of the current notebook. Thus, these cells are in
the same session as other notebook cells. By contrast, a notebook workflow runs a notebook with an isolated
SparkSession , which means temporary views defined in such a notebook are not visible in other notebooks.

Revision history
Azure Databricks notebooks maintain a history of revisions, allowing you to view and restore previous
snapshots of the notebook. You can perform the following actions on revisions: add comments, restore and
delete revisions, and clear revision history.
To access notebook revisions, click Revision History at the top right of the notebook toolbar.

In this section:
Add a comment
Restore a revision
Delete a revision
Clear a revision history
Add a comment
To add a comment to the latest revision:
1. Click the revision.
2. Click the Save now link.

3. In the Save Notebook Revision dialog, enter a comment.


4. Click Save . The notebook revision is saved with the entered comment.
Restore a revision
To restore a revision:
1. Click the revision.
2. Click Restore this revision .

3. Click Confirm . The selected revision becomes the latest revision of the notebook.
Delete a revision
To delete a notebook’s revision entry:
1. Click the revision.
2. Click the trash icon .

3. Click Yes, erase . The selected revision is deleted from the notebook’s revision history.
Clear a revision history
To clear a notebook’s revision history:
1. Select File > Clear Revision History .
2. Click Yes, clear . The notebook revision history is cleared.

WARNING
Once cleared, the revision history is not recoverable.

Version control with Git


NOTE
To sync your work in Azure Databricks with a remote Git repository, Databricks recommends using Git integration with
Databricks Repos.

To link a single notebook to Git, Azure Databricks also supports these Git-based version control tools:
GitHub version control
Bitbucket Cloud and Bitbucket Server version control
Azure DevOps Services version control

Test notebooks
This section covers several ways to test code in Databricks notebooks. You can use these methods separately or
together.
Many unit testing libraries work directly within the notebook. For example, you can use the built-in Python
unittest package to test notebook code.

def reverse(s):
    return s[::-1]

import unittest

class TestHelpers(unittest.TestCase):
    def test_reverse(self):
        self.assertEqual(reverse('abc'), 'cba')

r = unittest.main(argv=[''], verbosity=2, exit=False)

assert r.result.wasSuccessful(), 'Test failed; see logs above'

Test failures appear in the output area of the cell.


You can use widgets to distinguish test invocations from normal invocations in a single notebook.

To hide test code and results, select the associated menu items from the cell dropdown. Any errors that occur
appear even when results are hidden.
To run tests periodically and automatically, you can use scheduled notebooks. You can configure the job to send
notification emails to an address you specify.

Separate test code from the notebook


To separate your test code from the code being tested, see Share code in notebooks.
An example using %run :
For code stored in a Databricks Repo, you can use the web terminal to run tests in source code files just as you
would on your local machine.
You can also run this test from a notebook.
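For example, assuming the repo contains a test file named test_shared.py and that pytest is installed on the cluster (both are illustrative assumptions), a %sh cell could run it:

%sh
# The repo path below is a placeholder; adjust it to your own repo location.
cd /Workspace/Repos/<user>/<repo> && python -m pytest test_shared.py -v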

For notebooks in a Databricks Repo, you can set up a CI/CD-style workflow by configuring notebook tests to run
for each commit. See Databricks GitHub Actions.
Visualizations

Azure Databricks notebooks have built-in support for charts and visualizations. The visualizations described in
this section are available when you use the display command to view a data table result as a pandas or Apache
Spark DataFrame in a notebook cell.
For information about legacy Databricks visualizations, see Legacy visualizations.

Create a new visualization


To create a visualization from a cell result, the notebook cell must use a display command to show the result.
Click + and select Visualization . The visualization editor appears.

1. In the Visualization Type drop-down, choose a type.

2. Select the data to appear in the visualization. The fields available depend on the selected type.
3. Click Save .
Visualization tools
If you hover over the top right of a chart in the visualization editor, a Plotly toolbar appears where you can
perform operations such as select, zoom, and pan.
If you hover over the top right of a chart in a notebook, a subset of tools appears:

Visualization types
Visualization types in Azure Databricks notebooks

Create a new data profile


NOTE
Available in Databricks Runtime 9.1 LTS and above.

Data profiles display summary statistics of an Apache Spark DataFrame, a pandas DataFrame, or a SQL table in
tabular and graphic format. To create a data profile from a results cell, click + and select Data Profile .

Azure Databricks calculates and displays the summary statistics.

Numeric and categorical features are shown in separate tables.


At the top of the tab, you can sort or search for features.
At the top of the chart column, you can choose to display a histogram (Standard ) or quantiles.
Check expand to enlarge the charts.
Check log to display the charts on a log scale.
You can hover your cursor over the charts for more detailed information, such as the boundaries of a
histogram column and the number of rows in it, or the quantile value.
You can also generate data profiles programmatically; see summarize command (dbutils.data.summarize).

Work with visualizations and data profiles


NOTE
Data profiles are available in Databricks Runtime 9.1 LTS and above.

In this topic:
Rename, duplicate, or remove a visualization or data profile
Edit a visualization
Download a visualization
Add a visualization or data profile to a dashboard
Rename, duplicate, or remove a visualization or data profile
To rename, duplicate, or remove a visualization or data profile, click the three vertical dots at the right of the tab
name.

You can also change the name by clicking directly on it and editing the name in place.
Edit a visualization

Click beneath the visualization to open the visualization editor. When you have finished
making changes, click Save .
Edit colors
You can customize a visualization’s colors when you create the visualization or by editing it.
1. Create or edit a visualization.
2. Click Colors .
3. To modify a color, click the square and select the new color by doing one of the following:
Click it in the color selector.
Enter a hex value.
4. Click anywhere outside the color selector to close it and save changes.
Temporarily hide or show a series
To hide a series in a visualization, click the series in the legend. To show the series again, click it again in the
legend.
To show only a single series, double-click the series in the legend. To show other series, click each one.
Download a visualization
To download a visualization in .png format, click the camera icon in the notebook cell or in the visualization
editor.
In a notebook cell, the camera icon appears at the upper right when you move the cursor over the cell.

In the visualization editor, the camera icon appears when you move the cursor over the chart. See
Visualization tools.

Add a visualization or data profile to a dashboard


1. Click the three vertical dots at the right of the tab name.

2. Select Add to dashboard . A list of available dashboard views appears, along with a menu option Add
to new dashboard .
3. Select a dashboard or select Add to new dashboard . The dashboard appears, including the newly
added visualization or data profile.
Visualization deep dive in Python

Charts and graphs Python notebook


Get notebook
Visualization deep dive in Scala

Charts and graphs Scala notebook


Get notebook
HTML, D3, and SVG in notebooks

This article contains Python and Scala notebooks that show how to view HTML, SVG, and D3 visualizations in
notebooks.
If you want to use a custom Javascript library to render D3, see Use a Javascript library.

HTML, D3, and SVG Python notebook


Get notebook

HTML, D3, and SVG Scala notebook


Get notebook
Bokeh

Bokeh is a Python interactive visualization library.


To use Bokeh, install the Bokeh PyPI package through the Libraries UI, and attach it to your cluster.
To display a Bokeh plot in Azure Databricks:
1. Generate a plot following the instructions in the Bokeh documentation.
2. Generate an HTML file containing the data for the plot, for example by using Bokeh’s file_html() or
output_file() functions.

3. Pass this HTML to the Azure Databricks displayHTML() function.

IMPORTANT
The maximum size for a notebook cell, both contents and output, is 16MB. Make sure that the size of the HTML
you pass to the displayHTML() function does not exceed this value.

See the following notebook for an example.
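A minimal sketch of these steps, assuming the bokeh package is already installed on the cluster:

from bokeh.plotting import figure
from bokeh.embed import file_html
from bokeh.resources import CDN

# 1. Generate a plot.
p = figure(title="Simple line plot")
p.line([1, 2, 3, 4], [2, 5, 3, 7])

# 2. Generate an HTML string containing the plot.
html = file_html(p, CDN, "Simple line plot")

# 3. Render the HTML in the notebook.
displayHTML(html)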

bokeh demo notebook


Get notebook
Matplotlib

The method for displaying Matplotlib figures depends on which version of Databricks Runtime your cluster is
running.
Databricks Runtime 6.5 and above display Matplotlib figures inline.
With Databricks Runtime 6.4 ES, you must call the %matplotlib inline magic command.

The following notebook shows how to display Matplotlib figures in Python notebooks.

Matplotlib Python notebook


Get notebook

Render images at higher resolution


You can render matplotlib images in Python notebooks at double the standard resolution, providing users of
high-resolution screens with a better visualization experience. Set one of the following in a notebook cell:
retina option:

%config InlineBackend.figure_format = 'retina'

from IPython.display import set_matplotlib_formats


set_matplotlib_formats('retina')

png2x option:

%config InlineBackend.figure_format = 'png2x'

from IPython.display import set_matplotlib_formats


set_matplotlib_formats('png2x')

To switch back to standard resolution, add the following to a notebook cell:

set_matplotlib_formats('png')

%config InlineBackend.figure_format = 'png'


Plotly

Plotly is an interactive graphing library. Azure Databricks supports Plotly 2.0.7. To use Plotly, install the Plotly
PyPI package and attach it to your cluster.

NOTE
Inside Azure Databricks notebooks we recommend using Plotly Offline. Plotly Offline may not perform well when handling
large datasets. If you notice performance issues, you should reduce the size of your dataset.

To display a Plotly plot:


1. Specify output_type='div' as an argument to the Plotly plot() function.
2. Pass the output of the plot() function to the Azure Databricks displayHTML() function.
See the notebook for an example.
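A minimal sketch of these two steps:

from plotly.offline import plot
import plotly.graph_objs as go

# 1. Build a figure and render it to an HTML <div> string.
fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[4, 1, 7])])
html = plot(fig, output_type='div')

# 2. Pass the HTML to displayHTML.
displayHTML(html)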

Plotly Python notebook


Get notebook
htmlwidgets

With htmlwidgets for R you can generate interactive plots using R’s flexible syntax and environment. Azure
Databricks notebooks support htmlwidgets.
The setup has two steps:
1. Install pandoc, a Linux package used by htmlwidgets to generate HTML.
2. Change one function in the htmlwidgets package to make it work in Azure Databricks.
You can automate the first step using an init script so that the cluster installs pandoc when it launches. You
should do the second step, changing an htmlwidgets function, in every notebook that uses the htmlwidgets
package.
The notebook shows how to use htmlwidgets with dygraphs, leaflet, and plotly.

IMPORTANT
With each library invocation, an HTML file containing the rendered plot is downloaded. The plot does not display inline.

htmlwidgets notebook
Get notebook
ggplot2

The following notebook shows how to display ggplot2 objects in R notebooks.

ggplot2 R notebook
Get notebook
Legacy visualizations

This article describes legacy Azure Databricks visualizations. See Visualizations for current visualization support.
Azure Databricks also natively supports visualization libraries in Python and R and lets you install and use third-
party libraries.

Create a legacy visualization


To create a legacy visualization from a results cell, click + and select Legacy Visualization .
Legacy visualizations support a rich set of plot types:

Choose and configure a legacy chart type


To choose a bar chart, click the bar chart icon :

To choose another plot type, click to the right of the bar chart and choose the plot type.
Legacy chart toolbar
Both line and bar charts have a built-in toolbar that support a rich set of client-side interactions.
To configure a chart, click Plot Options… .

The line chart has a few custom chart options: setting a Y-axis range, showing and hiding points, and displaying
the Y-axis with a log scale.
For information about legacy chart types, see:
Legacy line charts
Color consistency across charts
Azure Databricks supports two kinds of color consistency across legacy charts: series set and global.
Series set color consistency assigns the same color to the same value if you have series with the same values
but in different orders (for example, A = ["Apple", "Orange", "Banana"] and B = ["Orange", "Banana", "Apple"]
). The values are sorted before plotting, so both legends are sorted the same way (
["Apple", "Banana", "Orange"] ), and the same values are given the same colors. However, if you have a series C
= ["Orange", "Banana"] , it would not be color consistent with set A because the set isn’t the same. The sorting
algorithm would assign the first color to “Banana” in set C but the second color to “Banana” in set A. If you want
these series to be color consistent, you can specify that charts should have global color consistency.
In global color consistency, each value is always mapped to the same color no matter what values the series
have. To enable this for each chart, select the Global color consistency checkbox.

NOTE
To achieve this consistency, Azure Databricks hashes directly from values to colors. To avoid collisions (where two values
go to the exact same color), the hash is to a large set of colors, which has the side effect that nice-looking or easily
distinguishable colors cannot be guaranteed; with many colors there are bound to be some that are very similar looking.

Machine learning visualizations


In addition to the standard chart types, legacy visualizations support the following machine learning training
parameters and results:
Residuals
ROC curves
Decision trees
Residuals
For linear and logistic regressions, you can render a fitted versus residuals plot. To obtain this plot, supply the
model and DataFrame.
The following example runs a linear regression on city population to house sale price data and then displays the
residuals versus the fitted data.
# Load data
pop_df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")

# Drop rows with missing values and rename the feature and label columns, replacing spaces with _
from pyspark.sql.functions import col

pop_df = pop_df.dropna() # drop rows with missing values
exprs = [col(column).alias(column.replace(' ', '_')) for column in pop_df.columns]

# Register a UDF to convert the feature (2014_Population_estimate) column vector to a VectorUDT type and apply it to the column.
from pyspark.ml.linalg import Vectors, VectorUDT

spark.udf.register("oneElementVec", lambda d: Vectors.dense([d]), returnType=VectorUDT())

tdata = pop_df.select(*exprs).selectExpr("oneElementVec(2014_Population_estimate) as features", "2015_median_sales_price as label")

# Run a linear regression
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()
modelA = lr.fit(tdata, {lr.regParam:0.0})

# Plot residuals versus fitted data
display(modelA, tdata)

ROC curves
For logistic regressions, you can render an ROC curve. To obtain this plot, supply the model, the prepped data
that is input to the fit method, and the parameter "ROC" .
The following example develops a classifier that predicts if an individual earns <=50K or >50K a year from
various attributes of the individual. The Adult dataset derives from census data, and consists of information
about 48842 individuals and their annual income.
The example code in this section uses one-hot encoding. The function was renamed with Apache Spark 3.0, so
the code is slightly different depending on the version of Databricks Runtime you are using. If you are using
Databricks Runtime 6.x or below, you must adjust two lines in the code as described in the code comments.

# This code uses one-hot encoding to convert all categorical variables into binary vectors.

schema = """`age` DOUBLE,


`workclass` STRING,
`fnlwgt` DOUBLE,
`education` STRING,
`education_num` DOUBLE,
`marital_status` STRING,
`occupation` STRING,
`relationship` STRING,
`race` STRING,
`sex` STRING,
`capital_gain` DOUBLE,
`capital_loss` DOUBLE,
`hours_per_week` DOUBLE,
`native_country` STRING,
`income` STRING"""

dataset = spark.read.csv("/databricks-datasets/adult/adult.data", schema=schema)

from pyspark.ml import Pipeline


from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
# If you are using Databricks Runtime 6.x or below, comment out the preceding line and uncomment the following line.
# from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]

stages = [] # stages in the Pipeline

for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # If you are using Databricks Runtime 6.x or below, comment out the preceding line and uncomment the following line.
    # encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

# Convert label into label indices using the StringIndexer


label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler


numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Run the stages as a Pipeline. This puts the data through all of the feature transformations in a single call.

partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)

# Fit logistic regression model

from pyspark.ml.classification import LogisticRegression


lrModel = LogisticRegression().fit(preppedDataDF)

# ROC for data


display(lrModel, preppedDataDF, "ROC")

To display the residuals, omit the "ROC" parameter:

display(lrModel, preppedDataDF)

Decision trees
Legacy visualizations support rendering a decision tree.
To obtain this visualization, supply the decision tree model.
The following examples train a tree to recognize digits (0 - 9) from the MNIST dataset of images of handwritten digits and then display the tree.
Python
trainingDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt").cache()
testDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt").cache()

from pyspark.ml.classification import DecisionTreeClassifier


from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

indexer = StringIndexer().setInputCol("label").setOutputCol("indexedLabel")

dtc = DecisionTreeClassifier().setLabelCol("indexedLabel")

# Chain indexer + dtc together into a single ML Pipeline.


pipeline = Pipeline().setStages([indexer, dtc])

model = pipeline.fit(trainingDF)
display(model.stages[-1])

Scala

val trainingDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt").cache
val testDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt").cache

import org.apache.spark.ml.classification.{DecisionTreeClassifier, DecisionTreeClassificationModel}


import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.Pipeline

val indexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")


val dtc = new DecisionTreeClassifier().setLabelCol("indexedLabel")
val pipeline = new Pipeline().setStages(Array(indexer, dtc))

val model = pipeline.fit(trainingDF)


val tree = model.stages.last.asInstanceOf[DecisionTreeClassificationModel]

display(tree)

Structured Streaming DataFrames


To visualize the result of a streaming query in real time you can display a Structured Streaming DataFrame in
Scala and Python.
Python

streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count())

Scala

val streaming_df = spark.readStream.format("rate").load()


display(streaming_df.groupBy().count())

display supports the following optional parameters:


streamName : the streaming query name.
trigger (Scala) and processingTime (Python): defines how often the streaming query is run. If not specified,
the system checks for availability of new data as soon as the previous processing has completed. To reduce
the cost in production, Databricks recommends that you always set a trigger interval. With Databricks
Runtime 8.0 and above, the default trigger interval is 500 ms.
checkpointLocation : the location where the system writes all the checkpoint information. If it is not specified,
the system automatically generates a temporary checkpoint location on DBFS. In order for your stream to
continue processing data from where it left off, you must provide a checkpoint location. Databricks
recommends that in production you always specify the checkpointLocation option.
Python

streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count(), processingTime = "5 seconds", checkpointLocation = "dbfs:/<checkpoint-path>")

Scala

import org.apache.spark.sql.streaming.Trigger

val streaming_df = spark.readStream.format("rate").load()


display(streaming_df.groupBy().count(), trigger = Trigger.ProcessingTime("5 seconds"), checkpointLocation = "dbfs:/<checkpoint-path>")

For more information about these parameters, see Starting Streaming Queries.

displayHTML function
Azure Databricks programming language notebooks (Python, R, and Scala) support HTML graphics using the
displayHTML function; you can pass the function any HTML, CSS, or JavaScript code. This function supports
interactive graphics using JavaScript libraries such as D3.
For examples of using displayHTML , see:
HTML, D3, and SVG in notebooks
Embed static images in notebooks
NOTE
The displayHTML iframe is served from the domain databricksusercontent.com , and the iframe sandbox includes the
allow-same-origin attribute. databricksusercontent.com must be accessible from your browser. If it is currently
blocked by your corporate network, it must be added to an allow list.

Images
Columns containing image data types are rendered as rich HTML. Azure Databricks attempts to render image
thumbnails for DataFrame columns matching the Spark ImageSchema. Thumbnail rendering works for any
images successfully read in through the spark.read.format('image') function. For image values generated
through other means, Azure Databricks supports the rendering of 1, 3, or 4 channel images (where each channel
consists of a single byte), with the following constraints:
One-channel images : mode field must be equal to 0. height , width , and nChannels fields must
accurately describe the binary image data in the data field.
Three-channel images : mode field must be equal to 16. height , width , and nChannels fields must
accurately describe the binary image data in the data field. The data field must contain pixel data in three-
byte chunks, with the channel ordering (blue, green, red) for each pixel.
Four-channel images : mode field must be equal to 24. height , width , and nChannels fields must
accurately describe the binary image data in the data field. The data field must contain pixel data in four-
byte chunks, with the channel ordering (blue, green, red, alpha) for each pixel.
Example
Suppose you have a folder containing some images:

If you read the images into a DataFrame with ImageSchema.readImages and then display the DataFrame, Azure
Databricks renders thumbnails of the images:

from pyspark.ml.image import ImageSchema


image_df = ImageSchema.readImages(sample_img_dir)
display(image_df)
Visualizations in Python
In this section:
Seaborn
Other Python libraries
Seaborn
You can also use other Python libraries to generate plots. The Databricks Runtime includes the seaborn
visualization library. To create a seaborn plot, import the library, create a plot, and pass the plot to the display
function.

import seaborn as sns


sns.set(style="white")

df = sns.load_dataset("iris")
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3)

g.map_upper(sns.regplot)

display(g.fig)
Other Python libraries
Bokeh
Matplotlib
Plotly

Visualizations in R
To plot data in R, use the display function as follows:

library(SparkR)
diamonds_df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv",
header="true", inferSchema = "true")

display(arrange(agg(groupBy(diamonds_df, "color"), "price" = "avg"), "color"))

You can use the default R plot function.

fit <- lm(Petal.Length ~., data = iris)


layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
You can also use any R visualization package. The R notebook captures the resulting plot as a .png and displays
it inline.
In this section:
Lattice
DandEFA
Plotly
Other R libraries
Lattice
The Lattice package supports trellis graphs—graphs that display a variable or the relationship between
variables, conditioned on one or more other variables.

library(lattice)
xyplot(price ~ carat | cut, diamonds, scales = list(log = TRUE), type = c("p", "g", "smooth"), ylab = "Log price")
DandEFA
The DandEFA package supports dandelion plots.

install.packages("DandEFA", repos = "https://cran.us.r-project.org")


library(DandEFA)
data(timss2011)
timss2011 <- na.omit(timss2011)
dandpal <- rev(rainbow(100, start = 0, end = 0.2))
facl <- factload(timss2011,nfac=5,method="prax",cormeth="spearman")
dandelion(facl,bound=0,mcex=c(1,1.2),palet=dandpal)
facl <- factload(timss2011,nfac=8,method="mle",cormeth="pearson")
dandelion(facl,bound=0,mcex=c(1,1.2),palet=dandpal)
Plotly
The Plotly R package relies on htmlwidgets for R. For installation instructions and a notebook, see htmlwidgets.
Other R libraries
ggplot2
htmlwidgets

Visualizations in Scala
To plot data in Scala, use the display function as follows:

val diamonds_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

display(diamonds_df.groupBy("color").avg("price").orderBy("color"))

Deep dive notebooks for Python and Scala


For a deep dive into Python visualizations, see the notebook:
Visualization deep dive in Python
For a deep dive into Scala visualizations, see the notebook:
Visualization deep dive in Scala
Dashboards

Dashboards allow you to publish graphs and visualizations derived from notebook output and share them in a
presentation format with your organization. View the notebook to learn how to create and organize dashboards.
The remaining sections describe how to schedule a job to refresh the dashboard and how to view a specific
dashboard version.

Dashboards notebook
Get notebook

Create a scheduled job to refresh a dashboard


Dashboards do not live refresh when you present them from the dashboard view. To schedule a dashboard to
refresh at a specified interval, schedule the notebook that generates the dashboard graphs.

View a specific dashboard version


1. Click the button.
2. Click the Last successful run link of the notebook job that is scheduled to run at the interval you want.
ipywidgets

IMPORTANT
This feature is in Public Preview.

ipywidgets are visual elements that allow users to specify parameter values in notebook cells. You can use
ipywidgets to make your Databricks Python notebooks interactive.
The ipywidgets package includes over 30 different controls, including form controls such as sliders, text boxes,
and checkboxes, as well as layout controls such as tabs, accordions, and grids. Using these elements, you can
build graphical user interfaces to interface with your notebook code.

NOTE
For information about Databricks widgets, see Databricks widgets. For guidelines on when to use Databricks widgets or
ipywidgets, see Best practices for using ipywidgets and Databricks widgets.

Requirements
ipywidgets are available in Databricks Runtime 11.0 and above.

Usage
The following code creates a histogram with a slider that can take on values between 3 and 10. The value of the
widget determines the number of bins in the histogram. As you move the slider, the histogram updates
immediately. See the example notebook to try this out.

import ipywidgets as widgets
from ipywidgets import interact

# Load a dataset
sparkDF = spark.read.csv("/databricks-datasets/bikeSharing/data-001/day.csv", header="true", inferSchema="true")

# In this code, `bins=(3, 10)` defines an integer slider widget that allows values between 3 and 10.
@interact(bins=(3, 10))
def plot_histogram(bins):
    pdf = sparkDF.toPandas()
    pdf.hist(column='temp', bins=bins)

The following code creates an integer slider that can take on values between 0 and 10. The default value is 5. To
access the value of the slider in your code, use int_slider.value .

import ipywidgets as widgets

int_slider = widgets.IntSlider(max=10, value=5)


int_slider
Example notebook
ipywidgets example notebook
Get notebook

Best practices for using ipywidgets and Databricks widgets


To add interactive controls to Python notebooks, Databricks recommends using ipywidgets. For notebooks in
other languages, use Databricks widgets.
You can use Databricks widgets to pass parameters between notebooks and to pass parameters to jobs;
ipywidgets do not support these scenarios.

Limitations
A notebook using ipywidgets must be attached to a running cluster.
Widget states are not preserved across notebook sessions. You must re-run widget cells to render them each
time you attach the notebook to a cluster.
The following ipywidgets are not supported: Password, File Upload, Controller.
HTMLMath and Label widgets with LaTeX expressions do not render correctly. (For example,
widgets.Label(value=r'$$\frac{x+1}{x-1}$$') does not render correctly.)
Widgets might not render properly if the notebook is in dark mode, especially colored widgets.
Widget outputs cannot be used in notebook dashboard views.
The maximum message payload size for an ipywidget is 1 MB. Widgets that use images or large text data
may not be properly rendered.
Databricks widgets

Input widgets allow you to add parameters to your notebooks and dashboards. The widget API consists of calls
to create various types of input widgets, remove them, and get bound values.

NOTE
In Databricks Runtime 11.0 and above, you can also use ipywidgets in Databricks notebooks.

Databricks widgets are best for:


Building a notebook or dashboard that is re-executed with different parameters
Quickly exploring results of a single query with different parameters

TIP
View the documentation for the widget API in Scala, Python, and R with the following command:

dbutils.widgets.help()

Databricks widget types


There are 4 types of widgets:
text : Input a value in a text box.
dropdown : Select a value from a list of provided values.
combobox : Combination of text and dropdown. Select a value from a provided list or input one in the text box.
multiselect : Select one or more values from a list of provided values.

Widget dropdowns and text boxes appear immediately following the notebook toolbar.

Databricks widget API


The widget API is designed to be consistent in Scala, Python, and R. The widget API in SQL is slightly different,
but as powerful as the other languages. You manage widgets through the Databricks Utilities interface.
dbutils.widgets.dropdown("X123", "1", [str(x) for x in range(1, 10)])

dbutils.widgets.dropdown("1", "1", [str(x) for x in range(1, 10)], "hello this is a widget")

dbutils.widgets.dropdown("x123123", "1", [str(x) for x in range(1, 10)], "hello this is a widget")

dbutils.widgets.dropdown("x1232133123", "1", [str(x) for x in range(1, 10)], "hello this is a widget 2")

Databricks widget example


Create a simple dropdown widget.

dbutils.widgets.dropdown("X", "1", [str(x) for x in range(1, 10)])

Interact with the widget from the widget panel.

You can access the current value of the widget with the call:

dbutils.widgets.get("X")

Finally, you can remove a widget or all widgets in a notebook:

dbutils.widgets.remove("X")

dbutils.widgets.removeAll()

IMPORTANT
If you add a command to remove a widget, you cannot add a subsequent command to create a widget in the same cell.
You must create the widget in another cell.

Databricks widgets in Scala, Python, and R


To see detailed API documentation for each method, use dbutils.widgets.help("<method-name>") . The help API is
identical in all languages. For example:

dbutils.widgets.help("dropdown")

You can create a dropdown widget by passing a unique identifying name, default value, and list of default
choices, along with an optional label. Once you create it, a dropdown input widget appears at the top of the
notebook. These input widgets are notebook-level entities.
If you try to create a widget that already exists, the configuration of the existing widget is overwritten with the
new options.
Databricks widgets in SQL
The API to create widgets in SQL is slightly different but as powerful as the APIs for the other languages. The
following is an example of creating a text input widget.

CREATE WIDGET TEXT y DEFAULT "10"

To specify the selectable values in a dropdown widget in SQL, you can write a sub-query. The first column of the
resulting table of the sub-query determines the values.
The following cell creates a dropdown widget from a sub-query over a table.

CREATE WIDGET DROPDOWN cuts DEFAULT "Good" CHOICES SELECT DISTINCT cut FROM diamonds

The default value specified when you create a dropdown widget must be one of the selectable values and must
be specified as a string literal. To access the current selected value of an input widget in SQL, you can use a
special UDF function in your query. The function is getArgument() . For example:

SELECT COUNT(*) AS numChoices, getArgument("cuts") AS cuts FROM diamonds WHERE cut = getArgument("cuts")

NOTE
getArgument is implemented as a Scala UDF and is not supported on a table ACL-enabled high concurrency cluster. On
such clusters, use the $<parameter> syntax shown in the following example.

You can also use the $<parameter> syntax to access the current value of a SQL input widget:

SELECT * FROM diamonds WHERE cut LIKE '%$cuts%'

You can remove the widget with a SQL command:

REMOVE WIDGET cuts

IMPORTANT
In general, you cannot use widgets to pass arguments between different languages within a notebook. You can create a
widget arg1 in a Python cell and use it in a SQL or Scala cell if you run cell by cell. However, it will not work if you
execute all the commands using Run All or run the notebook as a job. To work around this limitation, we recommend
that you create a notebook for each language and pass the arguments when you run the notebook.

Legacy input widgets in SQL

NOTE
Databricks will end support for rendering legacy SQL widgets on January 15, 2022. To ensure that your widgets continue
to render in the UI, update your code to use the SQL widgets. You can still use $<parameter> in your code to get the
parameters passed to a notebook using %run .

The old way of creating widgets in SQL queries with the $<parameter> syntax still works as before. Here is an
example:
SELECT * FROM diamonds WHERE cut LIKE '%$cuts%'

NOTE
To escape the $ character in a SQL string literal, use \$ . For example, the string $1,000 can be expressed as
"\$1,000" . The $ character cannot be escaped for SQL identifiers.

Configure widget settings


You can configure the behavior of widgets when a new value is selected, whether the widget panel is always
pinned to the top of the notebook, and change the layout of widgets in the notebook.

1. Click the icon at the right end of the Widget panel.


2. In the pop-up Widget Panel Settings dialog box, choose the widget’s execution behavior.

Run Notebook : Every time a new value is selected, the entire notebook is rerun.
Run Accessed Commands : Every time a new value is selected, only cells that retrieve the values
for that particular widget are rerun. This is the default setting when you create a widget.

NOTE
SQL cells are not rerun in this configuration.

Do Nothing : Every time a new value is selected, nothing is rerun.

3. To pin the widgets to the top of the notebook or to place the widgets above the first cell, click . The
setting is saved on a per-user basis.

4. If you have Can Manage permission for notebooks, you can configure the widget layout by clicking
. Each widget's order and size can be customized. To save or dismiss your changes, click .
NOTE
The widget layout is saved with the notebook.
If the widget layout is configured, new widgets will be added out of alphabetical order.

5. To reset the widget layout to a default order and size, click to open the Widget Panel Settings
dialog and then click Reset Layout .

NOTE
The widget layout cannot be reset by the removeAll() command.

Notebook
You can see a demo of how the Run Accessed Commands setting works in the following notebook. The year
widget is created with setting 2014 and is used in DataFrame API and SQL commands.

When you change the setting of the year widget to 2007 , the DataFrame command reruns, but the SQL
command is not rerun.
Widget demo notebook
Get notebook

Databricks widgets in dashboards


When you create a dashboard from a notebook that has input widgets, all the widgets display at the top of the
dashboard. In presentation mode, every time you update the value of a widget, you can click the Update button to
re-run the notebook and update your dashboard with new values.

Use Databricks widgets with %run


If you use %run to run a notebook that contains widgets, by default the specified notebook runs with the widget's default values. You can also pass in values to the widgets. For example:

%run /path/to/notebook $X="10" $Y="1"


This example runs the specified notebook and passes 10 into widget X and 1 into widget Y.
Modularize or link code in notebooks

This article describes how to use Databricks notebooks to code complex workflows that use modular code,
linked or embedded notebooks, and if-then-else logic.

Ways to modularize or link notebooks


The %run command allows you to include another notebook within a notebook. You can use %run to
modularize your code, for example by putting supporting functions in a separate notebook. You can also use it
to concatenate notebooks that implement the steps in an analysis. When you use %run , the called notebook is
immediately executed and the functions and variables defined in it become available in the calling notebook.
Notebook workflows are a complement to %run because they let you pass parameters to and return values
from a notebook. This allows you to build complex workflows and pipelines with dependencies. For example,
you can get a list of files in a directory and pass the names to another notebook, which is not possible with
%run . You can also create if-then-else workflows based on return values or call other notebooks using relative
paths.
To implement notebook workflows, use the dbutils.notebook.* methods. Unlike %run , the
dbutils.notebook.run() method starts a new job to run the notebook.

These methods, like all of the dbutils APIs, are available only in Python and Scala. However, you can use
dbutils.notebook.run() to invoke an R notebook.

WARNING
Jobs based on notebook workflows must complete in 30 days or less. Longer-running jobs based on modularized or
linked notebook tasks aren’t supported.

API
The methods available in the dbutils.notebook API to build notebook workflows are: run and exit . Both
parameters and return values must be strings.
run(path: String, timeout_seconds: int, arguments: Map): String

Run a notebook and return its exit value. The method starts an ephemeral job that runs immediately.
The timeout_seconds parameter controls the timeout of the run (0 means no timeout): the call to run throws an
exception if it doesn’t finish within the specified time. If Azure Databricks is down for more than 10 minutes, the
notebook run fails regardless of timeout_seconds .
The arguments parameter sets widget values of the target notebook. Specifically, if the notebook you are
running has a widget named A , and you pass a key-value pair ("A": "B") as part of the arguments parameter
to the run() call, then retrieving the value of widget A will return "B" . You can find the instructions for
creating and working with widgets in the Databricks widgets article.
WARNING
The arguments parameter accepts only Latin characters (ASCII character set). Using non-ASCII characters will return an
error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.

run Usage
Python

dbutils.notebook.run("notebook-name", 60, {"argument": "data", "argument2": "data2", ...})

Scala

dbutils.notebook.run("notebook-name", 60, Map("argument" -> "data", "argument2" -> "data2", ...))

run Example
Suppose you have a notebook named workflows with a widget named foo that prints the widget’s value:

dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")


print dbutils.widgets.get("foo")

Running dbutils.notebook.run("workflows", 60, {"foo": "bar"}) produces the following result:

The widget had the value you passed in through the workflow, "bar" , rather than the default.
exit(value: String): void

Exit a notebook with a value. If you call a notebook using the run method, this is the value returned.

dbutils.notebook.exit("returnValue")

Calling dbutils.notebook.exit in a job causes the notebook to complete successfully. If you want to cause the
job to fail, throw an exception.

Example
In the following example, you pass arguments to DataImportNotebook and run different notebooks (
DataCleaningNotebook or ErrorHandlingNotebook ) based on the result from DataImportNotebook .
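
A minimal sketch of this kind of branching is shown below. The argument names and the use of the exit value as a status string are assumptions for illustration only.

# Hypothetical caller notebook: run DataImportNotebook, then branch on its exit value.
status = dbutils.notebook.run("DataImportNotebook", 3600, {"source": "/tmp/raw_data"})

if status == "OK":
    # Import succeeded, so continue with data cleaning.
    dbutils.notebook.run("DataCleaningNotebook", 3600)
else:
    # Import failed, so pass the status to an error-handling notebook.
    dbutils.notebook.run("ErrorHandlingNotebook", 3600, {"error": status})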
When the notebook workflow runs, you see a link to the running notebook:

Click the notebook link Notebook job #xxxx to view the details of the run:

Pass structured data


This section illustrates how to pass structured data between notebooks.
Python
# Example 1 - returning data through temporary views.
# You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in
# the same JVM, you can return a name referencing data stored in a temporary view.

## In callee notebook
spark.range(5).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")

## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(spark.table(global_temp_db + "." + returned_table))

# Example 2 - returning data through DBFS.


# For larger datasets, you can write the results to DBFS and then return the DBFS path of the stored data.

## In callee notebook
dbutils.fs.rm("/tmp/results/my_data", recurse=True)
spark.range(5).toDF("value").write.format("parquet").load("dbfs:/tmp/results/my_data")
dbutils.notebook.exit("dbfs:/tmp/results/my_data")

## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
display(spark.read.format("parquet").load(returned_table))

# Example 3 - returning JSON data.


# To return multiple values, you can use standard JSON libraries to serialize and deserialize results.

## In callee notebook
import json
dbutils.notebook.exit(json.dumps({
"status": "OK",
"table": "my_data"
}))

## In caller notebook
result = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
print(json.loads(result))

Scala
// Example 1 - returning data through temporary views.
// You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in
// the same JVM, you can return a name referencing data stored in a temporary view.

/** In callee notebook */


sc.parallelize(1 to 5).toDF().createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")

/** In caller notebook */


val returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
val global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(spark.table(global_temp_db + "." + returned_table))

// Example 2 - returning data through DBFS.


// For larger datasets, you can write the results to DBFS and then return the DBFS path of the stored data.

/** In callee notebook */


dbutils.fs.rm("/tmp/results/my_data", recurse=true)
sc.parallelize(1 to 5).toDF().write.format("parquet").save("dbfs:/tmp/results/my_data")
dbutils.notebook.exit("dbfs:/tmp/results/my_data")

/** In caller notebook */


val returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
display(sqlContext.read.format("parquet").load(returned_table))

// Example 3 - returning JSON data.


// To return multiple values, you can use standard JSON libraries to serialize and deserialize results.

/** In callee notebook */

// Import jackson json libraries


import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper

// Create a json serializer


val jsonMapper = new ObjectMapper with ScalaObjectMapper
jsonMapper.registerModule(DefaultScalaModule)

// Exit with json


dbutils.notebook.exit(jsonMapper.writeValueAsString(Map("status" -> "OK", "table" -> "my_data")))

/** In caller notebook */


val result = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
println(jsonMapper.readValue[Map[String, String]](result))

Handle errors
This section illustrates how to handle errors in notebook workflows.
Python
# Errors in workflows throw a WorkflowException.

def run_with_retry(notebook, timeout, args = {}, max_retries = 3):
    num_retries = 0
    while True:
        try:
            return dbutils.notebook.run(notebook, timeout, args)
        except Exception as e:
            if num_retries > max_retries:
                raise e
            else:
                print("Retrying error", e)
                num_retries += 1

run_with_retry("LOCATION_OF_CALLEE_NOTEBOOK", 60, max_retries = 5)

Scala

// Errors in workflows throw a WorkflowException.

import com.databricks.WorkflowException

// Since dbutils.notebook.run() is just a function call, you can retry failures using standard
// Scala try-catch control flow. Here we show an example of retrying a notebook a number of times.
def runRetry(notebook: String, timeout: Int, args: Map[String, String] = Map.empty, maxTries: Int = 3): String = {
  var numTries = 0
  while (true) {
    try {
      return dbutils.notebook.run(notebook, timeout, args)
    } catch {
      case e: WorkflowException if numTries < maxTries =>
        println("Error, retrying: " + e)
    }
    numTries += 1
  }
  "" // not reached
}

runRetry("LOCATION_OF_CALLEE_NOTEBOOK", timeout = 60, maxTries = 5)

Run multiple notebooks concurrently


You can run multiple notebooks at the same time by using standard Scala and Python constructs such as
Threads (Scala, Python) and Futures (Scala, Python). The advanced notebook workflow notebooks demonstrate
how to use these constructs. The notebooks are in Scala but you could easily write the equivalent in Python. To
run the example:
1. Download the notebook archive.
2. Import the archive into a workspace.
3. Run the Concurrent Notebooks notebook.
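
As noted above, the example notebooks are in Scala. A minimal Python sketch of the same idea, using a thread pool, might look like the following; the notebook paths are placeholders.

from concurrent.futures import ThreadPoolExecutor

# Placeholder notebook paths; replace with notebooks in your workspace.
notebook_paths = ["/path/to/notebook-a", "/path/to/notebook-b"]

with ThreadPoolExecutor(max_workers=2) as executor:
    # Each call starts a separate ephemeral notebook job and returns that notebook's exit value.
    results = list(executor.map(lambda path: dbutils.notebook.run(path, 600), notebook_paths))

print(results)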
Package cells

To use custom Scala classes and objects defined within notebooks reliably in Spark and across notebook
sessions, you should define classes in package cells. A package cell is a cell that is compiled when it is run. A
package cell has no visibility with respect to the rest of the notebook. You can think of it as a separate Scala file.
Only class and object definitions can go in a package cell. You cannot have any values, variables, or function
definitions.
The following notebook shows what can happen if you do not use package cells and provides some examples,
caveats, and best practices.

Package Cells notebook


Get notebook
IPython kernel

The IPython kernel is a Jupyter kernel for Python code execution. Jupyter, and other compatible notebooks, use
the IPython kernel for executing Python notebook code.
In Databricks Runtime 11.0 and above, Python notebooks use the IPython kernel to execute Python code.

Benefits of using the IPython kernel


The IPython kernel allows Azure Databricks to add better support for open source tools built for Jupyter
notebooks. Using the IPython kernel on Azure Databricks adds support for IPython’s display and output tooling.
See IPython.core.display for more information. Also, the IPython kernel captures the stdout and stderr outputs of
child processes created by a notebook, allowing that output to be included in the notebook’s command results.
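
For example, a minimal sketch of using IPython’s display tooling in a Python notebook cell:

from IPython.display import display, Markdown, HTML

# Rich output rendered by the IPython kernel in the notebook's command results.
display(Markdown("**Run complete** - see the summary below."))
display(HTML("<b>rows processed:</b> 42"))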

Known issue
The IPython command update_display only updates the outputs of the current cell.
bamboolib

IMPORTANT
This feature is in Public Preview.

NOTE
bamboolib is supported in Databricks Runtime 11.0 and above.

bamboolib is a user interface component that allows no-code data analysis and transformations from within an
Azure Databricks notebook. bamboolib helps users more easily work with their data and speeds up common
data wrangling, exploration, and visualization tasks. As users complete these kinds of tasks with their data,
bamboolib automatically generates Python code in the background. Users can share this code with others, who
can run this code in their own notebooks to quickly reproduce those original tasks. They can also use bamboolib
to extend those original tasks with additional data tasks, all without needing to know how to code. Those who
are experienced with coding can extend this code to create even more sophisticated results.
Behind the scenes, bamboolib uses ipywidgets, which is an interactive HTML widget framework for the IPython
kernel. ipywidgets runs inside of the IPython kernel.

Contents
Requirements
Quickstart
Walkthroughs
Key tasks
Additional resources

Requirements
An Azure Databricks notebook, which is attached to an Azure Databricks cluster with Databricks Runtime
11.0 or above.
The bamboolib library must be available to the notebook. You can install the library in the workspace from
PyPI, install the library only on a specific cluster from PyPI, or make the library available only to a specific
notebook with the %pip command.
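
For example, to make the library available only to the current notebook, run the %pip magic command in a cell near the top of the notebook:

%pip install bamboolib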

Quickstart
1. Create a Python notebook.
2. Attach the notebook to a cluster that meets the requirements.
3. In the notebook’s first cell, enter the following code, and then run the cell.

import bamboolib as bam


4. In the notebook’s second cell, enter the following code, and then run the cell.

bam

NOTE
Alternatively, you can print an existing pandas DataFrame to display bamboolib for use with that specific
DataFrame.

5. Continue with key tasks.

Walkthroughs
You can use bamboolib by itself or with an existing pandas DataFrame.
Use bamboolib by itself
In this walkthrough, you use bamboolib to display in your notebook the contents of an example sales data set.
You then experiment with some of the related notebook code that bamboolib automatically generates for you.
You finish by querying and sorting a copy of the sales data set’s contents.
1. Create a Python notebook.
2. Attach the notebook to a cluster that meets the requirements.
3. In the notebook’s first cell, enter the following code, and then run the cell.

import bamboolib as bam

4. In the notebook’s second cell, enter the following code, and then run the cell.

bam

5. Click Load dummy data .


6. In the Load dummy data pane, for Load a dummy data set for testing bamboolib , select Sales
dataset .
7. Click Execute .
8. Display all of the rows where item_type is Baby Food :
a. In the Search actions list, select Filter rows .
b. In the Filter rows pane, in the Choose list (above where ), select Select rows .
c. In the list below where , select item_type .
d. In the Choose list next to item_type , select has value(s) .
e. In the Choose value(s) box next to has value(s) , select Baby Food .
f. Click Execute .
9. Copy the automatically generated Python code for this query:
a. Click Get Code .
b. In the Export code pane, click Copy code .
10. Paste and modify the code:
a. In the notebook’s third cell, paste the code that you copied. It should look like this:
import pandas as pd
df = pd.read_csv(bam.sales_csv)
# Step: Keep rows where item_type is one of: Baby Food
df = df.loc[df['item_type'].isin(['Baby Food'])]

b. Add to this code so that it displays only those rows where order_prio is C , and then run the cell:

import pandas as pd
df = pd.read_csv(bam.sales_csv)
# Step: Keep rows where item_type is one of: Baby Food
df = df.loc[df['item_type'].isin(['Baby Food'])]

# Add the following code.


# Step: Keep rows where order_prio is one of: C
df = df.loc[df['order_prio'].isin(['C'])]
df

TIP
Instead of writing this code, you can also do the same thing by just using bamboolib in the second cell to display
only those rows where order_prio is C . This step is an example of extending the code that bamboolib
automatically generated earlier.

11. Sort the rows by region in ascending order:


a. In the widget within the third cell, in the Search actions list, select Sort rows .
b. In the Sort column(s) pane, in the Choose column list, select region .
c. In the list next to region , select ascending (A-Z) .
d. Click Execute .

NOTE
This is equivalent to writing the following code yourself:

df = df.sort_values(by=['region'], ascending=[True])
df

You could have also just used bamboolib in the second cell to sort the rows by region in ascending order. This
step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it
automatically generates the additional code for you in the background, so that you can further extend your
already-extended code!

12. Continue with key tasks.


Use bamboolib with an existing DataFrame
In this walkthrough, you use bamboolib to display in your notebook the contents of a pandas DataFrame. This
DataFrame contains a copy of an example sales data set. You then experiment with some of the related notebook
code that bamboolib automatically generates for you. You finish by querying and sorting some of the
DataFrame’s contents.
1. Create a Python notebook.
2. Attach the notebook to a cluster that meets the requirements.
3. In the notebook’s first cell, enter the following code, and then run the cell.
import bamboolib as bam

4. In the notebook’s second cell, enter the following code, and then run the cell.

import pandas as pd

df = pd.read_csv(bam.sales_csv)
df

Note that bamboolib only supports pandas DataFrames. To convert a PySpark DataFrame to a pandas
DataFrame, call toPandas on the PySpark DataFrame. To convert a Pandas API on Spark DataFrame to a
pandas DataFrame, call to_pandas on the Pandas API on Spark DataFrame.
5. Click Show bamboolib UI .
6. Display all of the rows where item_type is Baby Food :
a. In the Search actions list, select Filter rows .
b. In the Filter rows pane, in the Choose list (above where ), select Select rows .
c. In the list below where , select item_type .
d. In the Choose list next to item_type , select has value(s) .
e. In the Choose value(s) box next to has value(s) , select Baby Food .
f. Click Execute .
7. Copy the automatically generated Python code for this query:
a. Click Get Code .
b. In the Export code pane, click Copy code .
8. Paste and modify the code:
a. In the notebook’s third cell, paste the code that you copied. It should look like this:

# Step: Keep rows where item_type is one of: Baby Food


df = df.loc[df['item_type'].isin(['Baby Food'])]

b. Add to this code so that it displays only those rows where order_prio is C , and then run the cell:

# Step: Keep rows where item_type is one of: Baby Food


df = df.loc[df['item_type'].isin(['Baby Food'])]

# Add the following code.


# Step: Keep rows where order_prio is one of: C
df = df.loc[df['order_prio'].isin(['C'])]
df

TIP
Instead of writing this code, you can also do the same thing by just using bamboolib in the second cell to display
only those rows where order_prio is C . This step is an example of extending the code that bamboolib
automatically generated earlier.

9. Sort the rows by region in ascending order:


a. In the widget within the third cell, in the Search actions list, select Sort rows .
b. In the Sort column(s) pane, in the Choose column list, select region .
c. In the list next to region , select ascending (A-Z) .
d. Click Execute .

NOTE
This is equivalent to writing the following code yourself:

df = df.sort_values(by=['region'], ascending=[True])
df

You could have also just used bamboolib in the second cell to sort the rows by region in ascending order. This
step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it
automatically generates the additional code for you in the background, so that you can further extend your
already-extended code!

10. Continue with key tasks.

Key tasks
In this section:
Add the widget to a cell
Clear the widget
Data loading tasks
Data action tasks
Data action history tasks
Get code to programmatically recreate the widget’s current state as a DataFrame
Add the widget to a cell
Scenario : You want the bamboolib widget to display in a cell.
1. Make sure the notebook meets the requirements for bamboolib.
2. Run the following code in the notebook, preferably in the notebook’s first cell:

import bamboolib as bam

3. Option 1 : In the cell where you want the widget to appear, add the following code, and then run the cell:

bam

The widget appears in the cell below the code.


Or:
Option 2 : In a cell that contains a reference to a pandas DataFrame, print the DataFrame. For example,
given the following DataFrame definition, run the cell:
import pandas as pd
from datetime import datetime, date

df = pd.DataFrame({
'a': [ 1, 2, 3 ],
'b': [ 2., 3., 4. ],
'c': [ 'string1', 'string2', 'string3' ],
'd': [ date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1) ],
'e': [ datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0) ]
})

df

The widget appears in the cell below the code.


Note that bamboolib only supports pandas DataFrames. To convert a PySpark DataFrame to a pandas
DataFrame, call toPandas on the PySpark DataFrame. To convert a Pandas API on Spark DataFrame to a
pandas DataFrame, call to_pandas on the Pandas API on Spark DataFrame.
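
For example, minimal sketches of these conversions, where spark_df and ps_df are placeholders for your existing DataFrames:

# Convert a PySpark DataFrame to a pandas DataFrame.
pandas_df = spark_df.toPandas()

# Convert a pandas API on Spark DataFrame to a pandas DataFrame.
pandas_df = ps_df.to_pandas()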
Clear the widget
Scenario : You want to clear the contents of a widget and then read new data into the existing widget.
Option 1 : Run the following code within the cell that contains the target widget:

bam

The widget clears and then redisplays the Databricks: Read CSV file from DBFS , Databricks: Load
database table , and Load dummy data buttons.

NOTE
If the error name 'bam' is not defined appears, run the following code in the notebook (preferably in the notebook’s
first cell), and then try again:

import bamboolib as bam

Option 2 : In a cell that contains a reference to a pandas DataFrame, print the DataFrame again by running the
cell again. The widget clears and then displays the new data.
Data loading tasks
In this section:
Read an example dataset’s contents into the widget
Read a CSV file’s contents into the widget
Read a database table’s contents into the widget
Read an example dataset’s contents into the widget
Scenario : You want to read some example data into the widget, for example some pretend sales data, so that
you can test out the widget’s functionality.
1. Click Load dummy data .

NOTE
If Load dummy data is not visible, clear the widget with Option 1 and try again.
2. In the Load dummy data pane, for Load a dummy data set for testing bamboolib , select the name
of the dataset that you want to load.
3. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a
DataFrame, or leave df as the default programmatic identifier.
4. Click Execute .
The widget displays the contents of the dataset.

TIP
You can switch the current widget to display the contents of a different example dataset:
1. In the current widget, click the Load dummy data tab.
2. Follow the preceding steps to read the other example dataset’s contents into the widget.

Read a CSV file’s contents into the widget


Scenario : You want to read the contents of a CSV file within your Azure Databricks workspace into the widget.
1. Click Databricks: Read CSV file from DBFS .

NOTE
If Databricks: Read CSV file from DBFS is not visible, clear the widget with Option 1 and try again.

2. In the Read CSV from DBFS pane, browse to the location that contains the target CSV file.
3. Select the target CSV file.
4. For Dataframe name , enter a name for the programmatic identifier of the CSV file’s contents as a
DataFrame, or leave df as the default programmatic identifier.
5. For CSV value separator , enter the character that separates values in the CSV file, or leave the ,
(comma) character as the default value separator.
6. For Decimal separator , enter the character that separates decimals in the CSV file, or leave the . (dot)
character as the default decimal separator.
7. For Row limit: read the first N rows - leave empty for no limit , enter the maximum number of
rows to read into the widget, or leave 100000 as the default number of rows, or leave this box empty to
specify no row limit.
8. Click Open CSV file .
The widget displays the contents of the CSV file, based on the settings that you specified.

TIP
You can switch the current widget to display the contents of a different CSV file:
1. In the current widget, click the Read CSV from DBFS tab.
2. Follow the preceding steps to read the other CSV file’s contents into the widget.

Read a database table’s contents into the widget


Scenario : You want to read the contents of a database table within your Azure Databricks workspace into the
widget.
1. Click Databricks: Load database table .

NOTE
If Databricks: Load database table is not visible, clear the widget with Option 1 and try again.

2. In the Databricks: Load database table pane, for Database - leave empty for default database ,
enter the name of the database in which the target table is located, or leave this box empty to specify the
default database.
3. For Table , enter the name of the target table.
4. For Row limit: read the first N rows - leave empty for no limit , enter the maximum number of
rows to read into the widget, or leave 100000 as the default number of rows, or leave this box empty to
specify no row limit.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a
DataFrame, or leave df as the default programmatic identifier.
6. Click Execute .
The widget displays the contents of the table, based on the settings that you specified.

TIP
You can switch the current widget to display the contents of a different table:
1. In the current widget, click the Databricks: Load database table tab.
2. Follow the preceding steps to read the other table’s contents into the widget.

Data action tasks


bamboolib offers over 50 data actions. Following are some of the more common getting-started data action
tasks.
In this section:
Select columns
Drop columns
Filter rows
Sort rows
Grouping rows and columns tasks
Remove rows with missing values
Remove duplicated rows
Find and replace missing values
Create a column formula
Select columns
Scenario : You want to show only specific table columns by name, by data type, or that match some regular
expression. For example, in the dummy Sales dataset , you want to show only the item_type and
sales_channel columns, or you want to show only the columns that contain the string _date in their column
names.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type select , and then select Select or drop columns .
Select Select or drop columns .
2. In the Select or drop columns pane, in the Choose drop-down list, select Select .
3. Select the target column names or inclusion criterion.
4. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
5. Click Execute .
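
As a point of reference, a hypothetical pandas equivalent of the code bamboolib generates for this task, using column names from the dummy Sales dataset, might look like this:

# Keep only the named columns.
df = df[["item_type", "sales_channel"]]

# Or keep only the columns whose names contain "_date".
df = df.filter(regex="_date")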
Drop columns
Scenario : You want to hide specific table columns by name, by data type, or that match some regular
expression. For example, in the dummy Sales dataset , you want to hide the order_prio , order_date , and
ship_date columns, or you want to hide all columns that contain only date-time values.

1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type drop , and then select Select or drop columns .
Select Select or drop columns .
2. In the Select or drop columns pane, in the Choose drop-down list, select Drop .
3. Select the target column names or inclusion criterion.
4. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
5. Click Execute .
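
A hypothetical pandas equivalent of the generated code for this task:

# Drop the named columns.
df = df.drop(columns=["order_prio", "order_date", "ship_date"])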
Filter rows
Scenario : You want to show or hide specific table rows based on criteria such as specific column values that are
matching or missing. For example, in the dummy Sales dataset , you want to show only those rows where the
item_type column’s value is set to Baby Food .

1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type filter , and then select Filter rows .
Select Filter rows .
2. In the Filter rows pane, in the Choose drop-down list above where , select Select rows or Drop rows .
3. Specify the first filter criterion.
4. To add another filter criterion, click add condition , and specify the next filter criterion. Repeat as desired.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
6. Click Execute .
Sort rows
Scenario : You want to sort table rows based on the values within one or more columns. For example, in the
dummy Sales dataset , you want to show the rows by the region column’s values in alphabetical order from A
to Z.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type sort , and then select Sort rows .
Select Sort rows .
2. In the Sort column(s) pane, choose the first column to sort by and the sort order.
3. To add another sort criterion, click add column , and specify the next sort criterion. Repeat as desired.
4. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
5. Click Execute .
Grouping rows and columns tasks
In this section:

Group rows and columns by a single aggregate function


Group rows and columns by multiple aggregate functions
Group rows and columns by a single aggregate function

Scenario : You want to show row and column results by calculated groupings, and you want to assign custom
names to those groupings. For example, in the dummy Sales dataset , you want to group the rows by the
country column’s values, showing the numbers of rows containing the same country value, and giving the list
of calculated counts the name country_count .
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type group , and then select Group by and aggregate (with renaming) .
Select Group by and aggregate (with renaming) .
2. In the Group by with column rename pane, select the columns to group by, the first calculation, and
optionally specify a name for the calculated column.
3. To add another calculation, click add calculation , and specify the next calculation and column name. Repeat
as desired.
4. Specify where to store the result.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
6. Click Execute .
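
A hypothetical pandas equivalent of the generated code for this scenario:

# Count rows per country and name the calculated column country_count.
df = df.groupby("country").agg(country_count=("country", "size")).reset_index()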
Group rows and columns by multiple aggregate functions

Scenario : You want to show row and column results by calculated groupings. For example, in the dummy Sales
dataset , you want to group the rows by the region , country , and sales_channel columns’ values, showing
the numbers of rows containing the same region and country value by sales_channel , as well as the
total_revenue by unique combination of region , country , and sales_channel .

1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type group , and then select Group by and aggregate (default) .
Select Group by and aggregate (default) .
2. In the Group by with column rename pane, select the columns to group by and the first calculation.
3. To add another calculation, click add calculation , and specify the next calculation. Repeat as desired.
4. Specify where to store the result.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
6. Click Execute .
Remove rows with missing values
Scenario : You want to remove any row that has a missing value for the specified columns. For example, in the
dummy Sales dataset , you want to remove any rows that have a missing item_type value.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type drop or remove , and then select Drop missing values .
Select Drop missing values .
2. In the Drop missing values pane, select the columns to remove any row that has a missing value for that
column.
3. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
4. Click Execute .
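
A hypothetical pandas equivalent of the generated code for this scenario:

# Remove rows that have a missing item_type value.
df = df.dropna(subset=["item_type"])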
Remove duplicated rows
Scenario : You want to remove any row that has a duplicated value for the specified columns. For example, in
the dummy Sales dataset , you want to remove any rows that are exact duplicates of each other.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type drop or remove , and then select Drop/Remove duplicates .
Select Drop/Remove duplicates .
2. In the Remove Duplicates pane, select the columns to remove any row that has a duplicated value for
those columns, and then select whether to keep the first or last row that has the duplicated value.
3. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
4. Click Execute .
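
A hypothetical pandas equivalent of the generated code for this scenario:

# Remove exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first")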
Find and replace missing values
Scenario : You want to replace the missing value with a replacement value for any row with the specified
columns. For example, in the dummy Sales dataset , you want to replace any row with a missing value in the
item_type column with the value Unknown Item Type .

1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type find or replace , and then select Find and replace missing values .
Select Find and replace missing values .
2. In the Replace missing values pane, select the columns to replace missing values for, and then specify the
replacement value.
3. Click Execute .
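
A hypothetical pandas equivalent of the generated code for this scenario:

# Replace missing item_type values with a default label.
df["item_type"] = df["item_type"].fillna("Unknown Item Type")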
Create a column formula
Scenario : You want to create a column that uses a unique formula. For example, in the dummy Sales dataset ,
you want to create a column named profit_per_unit that displays the result of dividing the total_profit
column value by the units_sold column value for each row.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type formula , and then select New column formula .
Select New column formula .
2. In the New column formula pane, enter a name for the new column and the formula that computes its
value.
3. Click Execute .
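
A hypothetical pandas equivalent of the generated code for this scenario:

# Create a new column that divides total_profit by units_sold for each row.
df["profit_per_unit"] = df["total_profit"] / df["units_sold"]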
Data action history tasks
In this section:
View the list of actions taken in the widget
Undo the most recent action taken in the widget
Redo the most recent action taken in the widget
Change the most recent action taken in the widget
View the list of actions taken in the widget
Scenario : You want to see a list of all of the changes that were made in the widget, starting with the most recent
change.
Click History . The list of actions appears in the Transformations history pane.
Undo the most recent action taken in the widget
Scenario : You want to revert the most recent change that was made in the widget.
Do one of the following:
Click the counterclockwise arrow icon.
Click History , and in the Transformations history pane, click Undo last step .
Redo the most recent action taken in the widget
Scenario : You want to revert the most recent revert that was made in the widget.
Do one of the following:
Click the clockwise arrow icon.
Click History , and in the Transformations history pane, click Recover last step .
Change the most recent action taken in the widget
Scenario : You want to modify the most recent change that was made in the widget.
1. Do one of the following:
Click the pencil icon.
Click History , and in the Transformations history pane, click Edit last step .
2. Make the desired change, and then click Execute .
Get code to programmatically recreate the widget’s current state as a DataFrame
Scenario : You want to get Python code that programmatically recreates the current widget’s state, represented
as a pandas DataFrame. You want to run this code in a different cell in this notebook or in a different notebook
altogether.
1. Click Get Code .
2. In the Export code pane, click Copy code . The code is copied to your system’s clipboard.
3. Paste the code into a different cell in this notebook or into a different notebook.
4. Write additional code to work with this pandas DataFrame programmatically, and then run the cell. For
example, to display the DataFrame’s contents, assuming that your DataFrame is represented
programmatically by df :

# Your pasted code here, followed by...


df

Additional resources
bamboolib Documentation
Software engineering best practices for notebooks

This article provides a hands-on walkthrough that demonstrates how to apply software engineering best
practices to your Azure Databricks notebooks, including version control, code sharing, testing, and optionally
continuous integration and continuous delivery or deployment (CI/CD).
In this walkthrough, you will:
Add notebooks to Azure Databricks Repos for version control.
Extract portions of code from one of the notebooks into a shareable module.
Test the shared code.
Run the notebooks from an Azure Databricks job.
Optionally apply CI/CD to the shared code.

Requirements
To complete this walkthrough, you must provide the following resources:
A remote repository with a Git provider that Databricks supports. This article’s walkthrough uses GitHub.
This walkthrough assumes that you have a GitHub repository named best-notebooks available. (You can
give your repository a different name. If you do, replace best-notebooks with your repo’s name
throughout this walkthrough.) Create a GitHub repo if you do not already have one.

NOTE
If you create a new repo, be sure to initialize the repository with at least one file, for example a README file.

An Azure Databricks workspace. Create a workspace if you do not already have one.
An Azure Databricks all-purpose cluster in the workspace. To run notebooks during the design phase, you
attach the notebooks to a running all-purpose cluster. Later on, this walkthrough uses an Azure
Databricks job to automate running the notebooks on this cluster. (You can also run jobs on job clusters
that exist only for the jobs’ lifetimes.) Create an all-purpose cluster if you do not already have one.

NOTE
To work with files in Databricks Repos, participating clusters must have Databricks Runtime 8.4 or higher installed.
Databricks recommends that these clusters have the latest Long Term Support (LTS) version installed, which is
Databricks Runtime 10.4 LTS.

Walkthrough
In this walkthrough, you will:
1. Connect your existing GitHub repo to Azure Databricks Repos.
2. Add an existing notebook to the repo and then run the notebook for the first time.
3. Move some code from the notebook into a shared module. Run the notebook for the second time to make
sure that the notebook calls the shared code as expected.
4. Use a second notebook to test the shared code separately without running the first notebook again.
5. Create an Azure Databricks job to run the two notebooks automatically, either on-demand or on a regular
schedule.
6. Set up the repo to run the second notebook that tests the shared code whenever a pull request is created in
the repo.
7. Make a pull request that changes the shared code, which will trigger the tests to run automatically.
The following steps walk you through each of these activities.
Steps
Step 1: Set up Databricks Repos
Step 2: Import and run the notebook
Step 3: Move code into a shared module
Step 4: Test the shared code
Step 5: Create a job to run the notebooks
(Optional) Step 6: Set up the repo to test the code and run the notebook automatically whenever the code
changes
(Optional) Step 7: Update the shared code in GitHub to trigger tests
Step 1: Set up Databricks Repos
In this step, you connect your existing GitHub repo to Azure Databricks Repos in your existing Azure Databricks
workspace.
To enable your workspace to connect to your GitHub repo, you must first provide your workspace with your
GitHub credentials, if you have not done so already.
Step 1.1: Provide your GitHub credentials
1. In your workspace, on the sidebar in the Data Science & Engineering or Databricks Machine Learning
environment, click Settings > User Settings .
2. On the User Settings page, click Git integration .
3. On the Git integration tab, for Git provider , select GitHub .
4. For Git provider username or email , enter your GitHub username.
5. For Token , enter your GitHub personal access token. This token must have the repo permission.
6. Click Save .
Step 1.2: Connect to your GitHub repo
1. On the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click
Repos .
2. In the Repos pane, click Add Repo .
3. In the Add Repo dialog:
a. Click Clone remote Git repo .
b. For Git repository URL , enter the GitHub Clone with HTTPS URL for your GitHub repo. This article
assumes that your URL ends with best-notebooks.git , for example
https://github.com/<your-GitHub-username>/best-notebooks.git .
c. In the drop-down list next to Git repository URL , select GitHub .
d. Leave Repo name set to the name of your repo, for example best-notebooks .
e. Click Create .
Step 2: Import and run the notebook
In this step, you import an existing external notebook into your repo. You could create your own notebooks for
this walkthrough, but to speed things up we provide them for you here.
Step 2.1: Create a working branch in the repo
In this substep, you create a branch named eda in your repo. This branch enables you to work on files and code
independently from your repo’s main branch, which is a software engineering best practice. (You can give your
branch a different name.)

NOTE
In some repos, the main branch may be named master instead. If so, replace main with master throughout this
walkthrough.

TIP
If you’re not familiar with working in Git branches, see Git Branches - Branches in a Nutshell on the Git website.

1. If the Repos pane is not showing, then on the sidebar in the Data Science & Engineering or
Databricks Machine Learning environment, click Repos .
2. If the repo that you connected to in the previous step is not showing in the Repos pane, then select your
workspace username, and select the name of the repo that you connected to in the previous step.
3. Click the drop-down arrow next to your repo’s name, and then click Git .
4. In the best-notebooks dialog, click the + (Create branch ) button.

NOTE
If your repo has a name other than best-notebooks , this dialog’s title will be different, here and throughout this
walkthrough.

5. Enter eda and then press Enter.


6. Close this dialog.
Step 2.2: Import the notebook into the repo
In this substep, you import an existing notebook from another repo into your repo. This notebook does the
following:
1. Copies a CSV file from the owid/covid-19-data GitHub repository onto a cluster in your workspace. This CSV
file contains public data about COVID-19 hospitalizations and intensive care metrics from around the world.
2. Reads the CSV file’s contents into a pandas DataFrame.
3. Filters the data to contain metrics from only the United States.
4. Displays a plot of the data.
5. Saves the pandas DataFrame as a Pandas API on Spark DataFrame.
6. Performs data cleansing on the Pandas API on Spark DataFrame.
7. Writes the Pandas API on Spark DataFrame as a Delta table in your workspace.
8. Displays the Delta table’s contents.
While you could create your own notebook in your repo here, importing an existing notebook here instead
helps to speed up this walkthrough. To create a notebook in this branch or move an existing notebook into this
branch instead of importing a notebook, see Work with notebooks and project files in Azure Databricks Repos.
1. In the Repos pane for your repo, click the drop-down arrow next to your repo’s name, and then click Create
> Folder .
2. In the New Folder Name dialog, enter notebooks , and then click Create Folder .
3. In the Repos pane, click the name of your repo, click the drop-down arrow next to the notebooks folder, and
then click Import .
4. In the Import Notebooks dialog:
a. For Import from , select URL .
b. Enter the URL to the raw contents of the covid_eda_raw notebook in the
databricks/notebook-best-practices repo in GitHub. To get this URL: i. Go to
https://github.com/databricks/notebook-best-practices. ii. Click the notebooks folder. iii. Click the
covid_eda_raw.py file. iv. Click Raw . v. Copy the full URL from your web browser’s address bar over
into the Import Notebooks dialog.
c. Click Import .
Step 2.3: Run the notebook
1. If the notebook is not already showing, in the Repos pane for your repo, double-click the covid_eda_raw
notebook inside of the notebooks folder to open it.
2. In the notebook, in the drop-down list next to File , select the cluster to attach this notebook to.
3. Click Run All .
4. If prompted, click Attach & Run or Start, Attach & Run .
5. Wait while the notebook runs.
After the notebook finishes running, in the notebook you should see a plot of the data as well as over 600 rows
of raw data in the Delta table. If the cluster was not already running when you started running this notebook, it
could take several minutes for the cluster to start up before displaying the results.
Step 2.4: Check in and merge the notebook
In this substep, you save your work so far to your GitHub repo. You then merge the notebook from your
working branch into your repo’s main branch.
1. In the Repos pane for your repo, click the eda branch.
2. In the best-notebooks dialog, on the Changes tab, make sure the notebooks/covid_eda_raw.py file is
selected.
3. For Summary (required) , enter Added raw notebook .
4. For Description (optional) , enter This is the first version of the notebook.
5. Click Commit & Push .
6. Click History , or click Create a pull request on your git provider link in the popup.
7. In GitHub, click the Pull requests tab, create the pull request, and then merge the pull request into the main
branch.
8. Back in your Azure Databricks workspace, close the best-notebooks dialog if it is still showing.
Step 3: Move code into a shared module
In this step, you move some of the code in your notebook into a set of shared functions outside of your
notebook. This enables you to use these functions with other similar notebooks, which can speed up future
coding and help ensure more predictable and consistent notebook results. Sharing this code also enables you to
more easily test these functions, which as a software engineering best practice can raise the overall quality of
your code as you go.
Step 3.1: Create another working branch in the repo
1. In your workspace, in the Repos pane for your repo, click the eda branch.
2. In the best-notebooks dialog, click the drop-down arrow next to the eda branch, and select main .
3. Click the Pull button. If prompted to proceed with pulling, click Confirm .
4. Click the + (Create branch ) button.
5. Enter first_modules , and then press Enter. (You can give your branch a different name.)
6. Close this dialog.
Step 3.2: Import the notebook into the repo
To speed up this walkthrough, in this substep you import another existing notebook into your repo. This
notebook does the same things as the previous notebook, except this notebook will call shared code functions
that are stored outside of the notebook. Again, you could create your own notebook in your repo here and do
the actual code sharing yourself.
1. In the Repos pane for your repo, click the drop-down arrow next to the notebooks folder, and then click
Import .
2. In the Import Notebooks dialog:
a. For Import from , select URL .
b. Enter the URL to the raw contents of the covid_eda_modular notebook in the
databricks/notebook-best-practices repo in GitHub. To get this URL: i. Go to
https://github.com/databricks/notebook-best-practices. ii. Click the notebooks folder. iii. Click the
covid_eda_modular.py file. iv. Click Raw . v. Copy the full URL from your web browser’s address bar
over into the Import Notebooks dialog.
c. Click Import .

NOTE
You could delete the existing covid_eda_raw notebook at this point, because the new covid_eda_modular
notebook is a shared version of the first notebook. However, you might still want to keep the previous notebook
for comparison purposes, even though you will not use it anymore.

Step 3.3: Add the notebook’s supporting shared code functions


1. In the Repos pane for your repo, click the drop-down arrow next to your repo’s name, and then click
Create > Folder .

NOTE
Do not click the drop-down arrow next to the notebooks folder. Click the drop-down arrow next to your repo’s
name instead. You want this to go into the root of the repo, not into the notebooks folder.

2. In the New Folder Name dialog, enter covid_analysis , and then click Create Folder .
3. In the Repos pane for your repo, click the drop-down arrow next to the covid_analysis folder, and then
click Create > File .
4. In the New File Name dialog, enter transforms.py , and then click Create File .
5. In the Repos pane for your repo, click the covid_analysis folder, and then click transforms.py .
6. In the editor window, enter the following code:
import pandas as pd

# Filter by country code.


def filter_country(pdf, country="USA"):
pdf = pdf[pdf.iso_code == country]
return pdf

# Pivot by indicator, and fill missing values.


def pivot_and_clean(pdf, fillna):
pdf["value"] = pd.to_numeric(pdf["value"])
pdf = pdf.fillna(fillna).pivot_table(
values="value", columns="indicator", index="date"
)
return pdf

# Create column names that are compatible with Delta tables.


def clean_spark_cols(pdf):
pdf.columns = pdf.columns.str.replace(" ", "_")
return pdf

# Convert index to column (works with pandas API on Spark, too).


def index_to_col(df, colname):
df[colname] = df.index
return df

TIP
For other code sharing techniques, see Share code in notebooks.

Step 3.4: Add the shared code’s dependencies


The preceding code has several Python package dependencies to enable the code to run properly. In this
substep, you declare these package dependencies. Declaring dependencies improves reproducibility by using
precisely defined versions of libraries.
1. In the Repos pane for your repo, click the drop-down arrow next to your repo’s name, and then click
Create > File .

NOTE
Do not click the drop-down arrow next to the notebooks or covid_analysis folders. You want the list of
package dependencies to go into the repo’s root folder, not the notebooks or covid_analysis folders.

2. In the New File Name dialog, enter requirements.txt , and then click Create File .
3. In the Repos pane for your repo, click requirements.txt , and enter the following code:

NOTE
If the requirements.txt file is not visible, you may need to refresh your web browser.
-i https://pypi.org/simple
attrs==21.4.0
cycler==0.11.0
fonttools==4.33.3
iniconfig==1.1.1
kiwisolver==1.4.2
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.2
pillow==9.1.0
pluggy==1.0.0
py==1.11.0
py4j==0.10.9.3
pyarrow==7.0.0
pyparsing==3.0.8
pyspark==3.2.1
pytest==7.1.2
python-dateutil==2.8.2
pytz==2022.1
six==1.16.0
tomli==2.0.1
wget==3.2

NOTE
The preceding file lists specific package versions. For better compatibility, you can cross-reference these versions
with the ones that are installed on your all-purpose cluster. See the “System environment” section for your
cluster’s Databricks Runtime version in Databricks runtime releases.

Your repo structure should now look like this:

|-- covid_analysis
| `-- transforms.py
|-- notebooks
| |-- covid_eda_modular
| `-- covid_eda_raw (optional)
`-- requirements.txt

Step 3.5: Run the refactored notebook


In this substep, you run the covid_eda_modular notebook, which calls the shared code in
covid_analysis/transforms.py .
1. In the Repos pane for your repo, double-click the covid_eda_modular notebook inside the notebooks
folder.
2. In the drop-down list next to File , select the cluster to attach this notebook to.
3. Click Run All .
4. If prompted, click Attach & Run or Start, Attach & Run .
5. Wait while the notebook runs.
After the notebook finishes running, in the notebook you should see similar results as the covid_eda_raw
notebook: a plot of the data as well as over 600 rows of raw data in the Delta table. The main difference with this
notebook is that a different filter is used (an iso_code of DZA instead of USA ). If the cluster was not already
running when you started running this notebook, it could take several minutes for the cluster to start up before
displaying the results.
Step 3.6: Check in the notebook and its related code
1. In the Repos pane for your repo, click the first_modules branch.
2. In the best-notebooks dialog, on the Changes tab, make sure the following are selected:
requirements.txt
covid_analysis/transforms.py
notebooks/covid_eda_modular.py
3. For Summary (required) , enter Added refactored notebook .
4. For Description (optional) , enter This is the second version of the notebook.
5. Click Commit & Push .
6. Click History , or click Create a pull request on your git provider link in the popup.
7. In GitHub, click the Pull requests tab, create the pull request, and then merge the pull request into the main
branch.
8. Back in your Azure Databricks workspace, close the best-notebooks dialog if it is still showing.
Step 4: Test the shared code
In this step, you test the shared code from the last step. However, you want to test this code without running the
covid_eda_modular notebook itself. This is because if the shared code fails to run, the notebook itself would likely
fail to run as well. You want to catch failures in your shared code first before having your main notebook
eventually fail later. This testing technique is a software engineering best practice.
Step 4.1: Create another working branch in the repo
1. In your workspace, in the Repos pane for your repo, click the first_modules branch.
2. In the best-notebooks dialog, click the drop-down arrow next to the first_modules branch, and select
main .
3. Click the Pull button. If prompted to proceed with pulling, click Confirm .
4. Click the + (Create branch ) button.
5. Enter first_tests , and then press Enter. (You can give your branch a different name.)
6. Close this dialog.
Step 4.2: Add the tests
In this substep, you use the pytest framework to test your shared code. In these tests, you assert whether
particular test results are achieved. If any test produces an unexpected result, that particular test fails the
assertion and thus the test itself fails.
1. In the Repos pane for your repo, click the drop-down arrow next to your repo’s name, and then click
Create > Folder .
2. In the New Folder Name dialog, enter tests , and then click Create Folder .
3. In the Repos pane for your repo, click the drop-down arrow next to the tests folder, and then click
Create > File .
4. In the New File Name dialog, enter testdata.csv , and then click Create File .
5. In the Repos pane for your repo, click the tests folder, and then click testdata.csv .
6. In the editor window, enter the following test data:
entity,iso_code,date,indicator,value
United States,USA,2022-04-17,Daily ICU occupancy,
United States,USA,2022-04-17,Daily ICU occupancy per million,4.1
United States,USA,2022-04-17,Daily hospital occupancy,10000
United States,USA,2022-04-17,Daily hospital occupancy per million,30.3
United States,USA,2022-04-17,Weekly new hospital admissions,11000
United States,USA,2022-04-17,Weekly new hospital admissions per million,32.8
Algeria,DZA,2022-04-18,Daily ICU occupancy,1010
Algeria,DZA,2022-04-18,Daily ICU occupancy per million,4.5
Algeria,DZA,2022-04-18,Daily hospital occupancy,11000
Algeria,DZA,2022-04-18,Daily hospital occupancy per million,30.9
Algeria,DZA,2022-04-18,Weekly new hospital admissions,10000
Algeria,DZA,2022-04-18,Weekly new hospital admissions per million,32.1

NOTE
Using test data is a software engineering best practice. This enables you to run your tests faster, relying on a small
portion of the data that has the same format as your real data. Of course, you want to always make sure that this
test data accurately represents your real data before you run your tests.

7. In the Repos pane for your repo, click the drop-down arrow next to the tests folder, and then click
Create > File .
8. In the New File Name dialog, enter transforms_test.py , and then click Create File .
9. In the Repos pane for your repo, click the tests folder, and then click transforms_test.py .
10. In the editor window, enter the following test code. These tests use standard pytest fixtures as well as a
mocked in-memory pandas DataFrame:
# Test each of the transform functions.
import pytest
from textwrap import fill
import os
import pandas as pd
import numpy as np
from covid_analysis.transforms import *
from pyspark.sql import SparkSession

@pytest.fixture
def raw_input_df() -> pd.DataFrame:
    """
    Create a basic version of the input dataset for testing, including NaNs.
    """
    return pd.read_csv('tests/testdata.csv')

@pytest.fixture
def colnames_df() -> pd.DataFrame:
    df = pd.DataFrame(
        data=[[0,1,2,3,4,5]],
        columns=[
            "Daily ICU occupancy",
            "Daily ICU occupancy per million",
            "Daily hospital occupancy",
            "Daily hospital occupancy per million",
            "Weekly new hospital admissions",
            "Weekly new hospital admissions per million"
        ]
    )
    return df

# Make sure the filter works as expected.
def test_filter(raw_input_df):
    filtered = filter_country(raw_input_df)
    assert filtered.iso_code.drop_duplicates()[0] == "USA"

# The test data has NaNs for Daily ICU occupancy; this should get filled to 0.
def test_pivot(raw_input_df):
    pivoted = pivot_and_clean(raw_input_df, 0)
    assert pivoted["Daily ICU occupancy"][0] == 0

# Test column cleaning.
def test_clean_cols(colnames_df):
    cleaned = clean_spark_cols(colnames_df)
    cols_w_spaces = cleaned.filter(regex=(" "))
    assert cols_w_spaces.empty == True

# Test column creation from index.
def test_index_to_col(raw_input_df):
    raw_input_df["col_from_index"] = raw_input_df.index
    assert (raw_input_df.index == raw_input_df.col_from_index).all()

Your repo structure should now look like this:

|-- covid_analysis
| `-- transforms.py
|-- notebooks
| |-- covid_eda_modular
| `-- covid_eda_raw (optional)
|-- requirements.txt
`-- tests
|-- testdata.csv
`-- transforms_test.py

Step 4.3: Run the tests


To speed up this walkthrough, in this substep you use an imported notebook to run the preceding tests. This
notebook downloads and installs the tests’ dependent Python packages into your workspace, runs the tests, and
reports the tests’ results. While you could run pytest from your cluster’s web terminal, running pytest from a
notebook can be more convenient.

NOTE
Running pytest runs all files whose names follow the form test_*.py or *_test.py in the current directory and its
subdirectories.
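
If you want a sense of what such a test-runner notebook does before you import it, the following is a minimal sketch of a single notebook cell (it is not the imported notebook itself). It assumes that pytest and the packages listed in requirements.txt are already installed on the cluster, and that the notebook lives in the repo's notebooks folder, so the repo root is one directory up:

# Minimal sketch of a cell that runs pytest against the repo (assumptions noted above).
import os
import sys
import pytest

# Move to the repo root so that pytest discovers the tests folder.
os.chdir(os.path.join(os.getcwd(), ".."))

# Avoid writing .pyc files into the repo.
sys.dont_write_bytecode = True

# Run all discovered tests; fail the notebook (and any job that runs it) if any test fails.
retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])
assert retcode == 0, "At least one test failed."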

1. In the Repos pane for your repo, click the drop-down arrow next to the notebooks folder, and then click
Import .
2. In the Import Notebooks dialog:
a. For Import from , select URL .
b. Enter the URL to the raw contents of the run_unit_tests notebook in the
databricks/notebook-best-practices repo in GitHub. To get this URL:
i. Go to https://github.com/databricks/notebook-best-practices.
ii. Click the notebooks folder.
iii. Click the run_unit_tests.py file.
iv. Click Raw .
v. Copy the full URL from your web browser’s address bar into the Import Notebooks dialog.
c. Click Import .
3. If the notebook is not already showing, in the Repos pane for your repo, click the notebooks folder, and
then double-click the run_unit_tests notebook.
4. In the drop-down list next to File , select the cluster to attach this notebook to.
5. Click Run All .
6. If prompted, click Attach & Run or Start, Attach & Run .
7. Wait while the notebook runs.
After the notebook finishes running, you should see information in the notebook about the number of passed
and failed tests, along with other related details. If the cluster was not already running when you started this
notebook, it can take several minutes for the cluster to start up before the results are displayed.
Your repo structure should now look like this:

|-- covid_analysis
| `-- transforms.py
|-- notebooks
| |-- covid_eda_modular
| |-- covid_eda_raw (optional)
| `-- run_unit_tests
|-- requirements.txt
`-- tests
|-- testdata.csv
`-- transforms_test.py

Step 4.4: Check in the notebook and related tests


1. In the Repos pane for your repo, click the first_tests branch.
2. In the best-notebooks dialog, on the Changes tab, make sure the following are selected:
tests/transforms_test.py
notebooks/run_unit_tests.py
tests/testdata.csv
3. For Summary (required) , enter Added tests .
4. For Description (optional) , enter These are the unit tests for the shared code. .
5. Click Commit & Push .
6. Click History , or click Create a pull request on your git provider link in the popup.
7. In GitHub, click the Pull requests tab, create the pull request, and then merge the pull request into the main
branch.
8. Back in your Azure Databricks workspace, close the best-notebooks dialog if it is still showing.
Step 5: Create a job to run the notebooks
In previous steps, you tested your shared code manually and ran your notebooks manually. In this step, you use
an Azure Databricks job to test your shared code and run your notebooks automatically, either on-demand or on
a regular schedule.
Step 5.1: Create a job task to run the testing notebook
1. On the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click
Workflows .
2. On the Jobs tab, click Create Job .
3. For Add a name for your job (which is next to the Runs and Tasks tabs), enter covid_report .
4. For Task name , enter run_notebook_tests .
5. For Type , select Notebook .
6. For Source , select Git .
7. Click Add a git reference .
8. In the Git information dialog:
a. For Git repository URL , enter the GitHub Clone with HTTPS URL for your GitHub repo. This article
assumes that your URL ends with best-notebooks.git , for example
https://github.com/<your-GitHub-username>/best-notebooks.git .
b. For Git provider , select GitHub .
c. For Git reference (branch / tag / commit) , enter main .
d. Next to Git reference (branch / tag / commit) , select branch .
e. Click Confirm .
9. For Path , enter notebooks/run_unit_tests . Do not add the .py file extension.
10. For Cluster , select the cluster from the previous step.
11. Click Create .

NOTE
In this scenario, Databricks does not recommend using the schedule button in the notebook (as described in Schedule a
notebook) to schedule a job that runs this notebook periodically. The schedule button creates a job by using the latest
working copy of the notebook in the workspace repo. Instead, Databricks recommends that you follow the preceding
instructions to create a job that uses the latest committed version of the notebook in the repo.
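
If you prefer to create this job programmatically instead of through the UI, the following is a minimal sketch that calls the Jobs API 2.1 from Python with the requests library. The workspace URL, token, repository URL, and cluster ID are placeholders that you would substitute, and the payload mirrors the settings described in the preceding steps:

# Sketch: create the covid_report job with a Git-sourced notebook task (Jobs API 2.1).
# Placeholders: DATABRICKS_HOST, DATABRICKS_TOKEN, the repo URL, and the cluster ID.
import os
import requests

host = os.environ["DATABRICKS_HOST"]      # for example, https://adb-1234567890123456.7.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]    # an Azure AD token or personal access token

job_spec = {
    "name": "covid_report",
    "git_source": {
        "git_url": "https://github.com/<your-GitHub-username>/best-notebooks.git",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_notebook_tests",
            "notebook_task": {"notebook_path": "notebooks/run_unit_tests", "source": "GIT"},
            "existing_cluster_id": "<your-cluster-id>",
        }
    ],
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job:", response.json()["job_id"])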

Step 5.2: Create a job task to run the main notebook


1. Click the + (Add more tasks to your job here ) icon.
2. For Task name , enter run_main_notebook .
3. For Type , select Notebook .
4. For Path , enter notebooks/covid_eda_modular . Do not add the .py file extension.
5. For Cluster , select the cluster from the previous step.
6. Click Create task .
Step 5.3: Run the job
1. Click Run now .
2. In the pop-up, click View run .
NOTE
If the pop-up disappears too quickly, then do the following:
1. On the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click
Workflows .
2. On the Job runs tab, click the Start time value for the latest job with covid_report in the Jobs column.

3. To see the job results, click on the run_notebook_tests tile, the run_main_notebook tile, or both. The
results on each tile are the same as if you ran the notebooks yourself, one by one.

NOTE
This job ran on-demand. To set up this job to run on a regular basis, see Schedule a job.

(Optional) Step 6: Set up the repo to test the code and run the notebook automatically whenever the code
changes
In the previous step, you used a job to automatically test your shared code and run your notebooks at a point in
time or on a recurring basis. However, you may prefer to trigger tests automatically when changes are merged
into your GitHub repo. You can perform this automation by using a CI/CD platform such as GitHub Actions.
Step 6.1: Set up GitHub access to your workspace
In this substep, you set up a GitHub Actions workflow that runs jobs in your workspace whenever changes are
merged into your repository. You do this by giving GitHub a unique Azure Databricks token for access.
For security reasons, Databricks discourages you from giving your Azure Databricks workspace user’s personal
access token to GitHub. Instead, Databricks recommends that you give GitHub an Azure Active Directory (Azure
AD) token that is associated with an Azure service principal. For instructions, see the Azure section of the Run
Databricks Notebook GitHub Action page in the GitHub Actions Marketplace.

IMPORTANT
Notebooks are run with all of the workspace permissions of the identity that is associated with the token, so Databricks
recommends using a service principal. If you really want to give your Azure Databricks workspace user’s personal access
token to GitHub for personal exploration purposes only, and you understand that for security reasons Databricks
discourages this practice, see the instructions to create your workspace user’s personal access token.
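
As a rough illustration of what those instructions involve, the following sketch acquires an Azure AD access token for a service principal by using the client credentials flow with the Python requests library. The tenant ID, client ID, and client secret are placeholders; the scope uses the Azure Databricks resource application ID. Treat the linked documentation as authoritative:

# Sketch: get an Azure AD token for a service principal (client credentials flow).
# Placeholders: <tenant-id>, <client-id>, <client-secret>.
import requests

tenant_id = "<tenant-id>"
response = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "<client-id>",
        "client_secret": "<client-secret>",
        # The Azure Databricks resource application ID plus /.default.
        "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
    },
)
response.raise_for_status()
aad_token = response.json()["access_token"]
# Store this token as a GitHub secret (for example, DATABRICKS_TOKEN) rather than
# committing it to your repository.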

Step 6.2: Add the GitHub Actions workflow


In this substep, you add a GitHub Actions workflow to run the run_unit_tests notebook whenever there is a
pull request to the repo.
This substep stores the GitHub Actions workflow in a file nested several folder levels deep in your GitHub repo.
GitHub Actions requires this specific nested folder hierarchy to exist in your repo in order to work properly. To
complete this step, you must use the website for your GitHub repo, because the Azure Databricks Repos user
interface does not support creating nested folder hierarchies.
1. In the website for your GitHub repo, click the Code tab.
2. In the Switch branches or tags drop-down list, select main , if it is not already selected.
3. If the Switch branches or tags drop-down list does not show the Find or create a branch box, click
main again.
4. In the Find or create a branch box, enter adding_github_actions .
5. Click Create branch: adding_github_actions from ‘main’ .
6. Click Add file > Create new file .
7. For Name your file , enter .github/workflows/databricks_pull_request_tests.yml .
8. In the editor window, enter the following code. This code declares a pull_request trigger that uses the
Run Databricks Notebook GitHub Action to run the run_unit_tests notebook.
In the following code, replace:
<your-workspace-instance-name> with your Azure Databricks workspace instance name.
<your-access-token> with the token that you generated earlier.
<your-cluster-id> with your target cluster ID.

name: Run pre-merge Databricks tests

on:
  pull_request:

env:
  # Replace this value with your workspace instance name.
  DATABRICKS_HOST: https://<your-workspace-instance-name>

jobs:
  unit-test-notebook:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Run test notebook
        uses: databricks/run-notebook@main
        with:
          databricks-token: <your-access-token>
          local-notebook-path: notebooks/run_unit_tests.py
          existing-cluster-id: <your-cluster-id>
          git-commit: "${{ github.event.pull_request.head.sha }}"
          # Grant all users view permission on the notebook's results, so that they can
          # see the result of the notebook, if they have related access permissions.
          access-control-list-json: >
            [
              {
                "group_name": "users",
                "permission_level": "CAN_VIEW"
              }
            ]
          run-name: "EDA transforms helper module unit tests"

9. Select Create a new branch for this commit and start a pull request .
10. Click Propose new file .
11. Click the Pull requests tab, and then create the pull request.
12. On the pull request page, wait for the icon next to Run pre-merge Databricks tests / unit-test-
notebook (pull_request) to display a green check mark. (It may take a few moments for the icon to
appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or
Details are no longer showing, click Show all checks .
13. If the green check mark appears, merge the pull request into the main branch.
(Optional) Step 7: Update the shared code in GitHub to trigger tests
In this step, you make a change to the shared code and then push the change into your GitHub repo, which
immediately triggers the tests automatically, based on the GitHub Actions from the previous step.
Step 7.1: Create another working branch in the repo
1. In your workspace, in the Repos pane for your repo, click the first_tests branch.
2. In the best-notebooks dialog, click the drop-down arrow next to the first_tests branch, and select main .
3. Click the Pull button. If prompted to proceed with pulling, click Confirm .
4. Click the + (Create branch ) button.
5. Enter trigger_tests , and then press Enter. (You can give your branch a different name.)
6. Close this dialog.
Step 7.2: Change the shared code
1. In your workspace, in the Repos pane for your repo, double-click the covid_analysis/transforms.py
file.
2. In the third line of this file, change this line of code:

# Filter by country code.

To this:

# Filter by country code. If not specified, use "USA."

Step 7.3: Check in the change to trigger the tests


1. In the Repos pane for your repo, click the trigger_tests branch.
2. In the best-notebooks dialog, on the Changes tab, make sure covid_analysis/transforms.py is selected.
3. For Summary (required) , enter Updated comment .
4. For Description (optional) , enter This updates the comment for filter_country.
5. Click Commit & Push .
6. Click History , or click Create a pull request on your git provider link in the popup.
7. In GitHub, click the Pull requests tab, and then create the pull request.
8. On the pull request page, wait for the icon next to Run pre-merge Databricks tests / unit-test-
notebook (pull_request) to display a green check mark. (It may take a few moments for the icon to
appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or Details
are no longer showing, click Show all checks .
9. If the green check mark appears, merge the pull request into the main branch.
Git integration with Databricks Repos
7/21/2022 • 2 minutes to read

Learn how to integrate Git source control with Databricks Repos. To support best practices for data science and
engineering code development, Databricks Repos provides repository-level integration with Git providers. You
can develop code in an Azure Databricks notebook, sync it with a remote Git repository, and use Git commands
for updates and source control.

NOTE
Support for arbitrary files in Databricks Repos is now in Public Preview. For details, see Work with non-notebook files in an
Azure Databricks repo and Import Python and R modules.

What can you do with Databricks Repos?


Databricks Repos lets you use Git functionality such as cloning a remote repo, managing branches, pushing and
pulling changes, and visually comparing differences upon commit.
Databricks Repos also provides an API that you can integrate with your CI/CD pipeline. For example, you can
programmatically update a Databricks repo so that it always has the most recent code version.
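
For example, the following is a minimal sketch that uses the Repos API 2.0 from Python with the requests library to update a Databricks repo to the head of its main branch. The workspace URL, access token, and repo ID are placeholders:

# Sketch: check out the latest commit on main in a Databricks repo (Repos API 2.0).
import requests

host = "https://<your-workspace-instance-name>"
token = "<your-access-token>"
repo_id = "<repo-id>"    # the numeric ID returned when the repo was created or listed

response = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "main"},
)
response.raise_for_status()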
Databricks Repos provides security features such as allow lists to control access to Git repositories and detection
of clear text secrets in source code.
When audit logging is enabled, audit events are logged when you interact with a Databricks repo. For example,
an audit event is logged when you create, update, or delete a Databricks repo, when you list all Databricks Repos
associated with a workspace, and when you sync changes between your Databricks repo and the Git remote.
For more information about best practices for code development using Databricks Repos, see CI/CD workflows
with Databricks Repos and Git integration.

Supported Git providers


Azure Databricks supports these Git providers:
GitHub
Bitbucket Cloud
GitLab
Azure DevOps (not available in Azure China regions)
AWS CodeCommit
GitHub AE
Databricks Repos also supports integration with Bitbucket Server, GitHub Enterprise Server, and GitLab
self-managed subscription instances, if the server is internet-accessible.
To integrate with a private Git server instance that is not internet-accessible, get in touch with your Databricks
representative.
Support for arbitrary files in Databricks Repos is available in Databricks Runtime 8.4 and above.
Set up Git integration with Databricks Repos
7/21/2022 • 4 minutes to read

Set up your Azure Databricks workspace and a Git repo to use Databricks Repos capabilities. Once you set up
Databricks Repos, you can run notebooks or access project files and libraries stored in a remote Git repo.

NOTE
Databricks recommends that you set an expiration date for all personal access tokens.
If you are using GitHub AE and you have enabled GitHub allow lists, you must add Azure Databricks control plane NAT
IPs to the allow list. Use the IP for the region that the Azure Databricks workspace is in.

Configure user settings for Git integration


1. Click Settings in your Azure Databricks workspace and select User Settings from the menu.
2. On the User Settings page, go to the Git Integration tab.
3. Follow the instructions for integration with:
GitHub
Bitbucket Cloud and Bitbucket Server
GitLab
Azure DevOps
AWS CodeCommit
GitHub AE
For Azure DevOps, if you do not enter a token or app password, Git integration uses your Azure Active
Directory token by default. If you enter an Azure DevOps personal access token, Git integration uses it
instead.
4. If your organization has SAML SSO enabled in GitHub, ensure that you have authorized your personal
access token for SSO.

Use a service principal with Databricks Repos


To use a service principal with the Repos API and Jobs API, do the following:
Create an Azure AD service principal with tools such as the Azure portal.
Create an Azure AD token for the Azure AD service principal with tools such as curl and Postman.
Add the Azure AD service principal to your Azure Databricks workspace with the SCIM API 2.0
(ServicePrincipals).
Add your Git provider credentials to your workspace with your Azure AD token and the Git Credentials API
2.0.
To call these two Databricks APIs, you can also use tools such as curl and Postman. You cannot use the Azure
Databricks user interface.
To learn more about setting up service principals with Databricks Repos and a Git provider, see Service
principals for CI/CD.
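
To make the flow concrete, here is a hedged sketch of the two API calls using the Python requests library; the workspace URL, tokens, application ID, and Git credentials are placeholders. The first call adds the service principal to the workspace with the SCIM API 2.0 and is made with an admin's Azure AD token; the second registers Git credentials for the service principal, so it must be made with the service principal's own Azure AD token:

# Sketch: add a service principal to the workspace, then register its Git credentials.
import requests

host = "https://<your-workspace-instance-name>"

# 1. Add the Azure AD service principal to the workspace (SCIM API 2.0), as an admin.
resp = requests.post(
    f"{host}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={
        "Authorization": "Bearer <admin-azure-ad-token>",
        "Content-Type": "application/scim+json",
    },
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "applicationId": "<azure-ad-application-id>",
        "displayName": "repos-service-principal",
    },
)
resp.raise_for_status()

# 2. Register Git provider credentials for the service principal (Git Credentials API 2.0),
#    authenticating as the service principal itself.
resp = requests.post(
    f"{host}/api/2.0/git-credentials",
    headers={"Authorization": "Bearer <service-principal-azure-ad-token>"},
    json={
        "git_provider": "gitHub",
        "git_username": "<git-provider-username>",
        "personal_access_token": "<git-provider-personal-access-token>",
    },
)
resp.raise_for_status()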
Enable support for arbitrary files in Databricks Repos
To work with non-notebook files in Databricks Repos, you must be running Databricks Runtime 8.4 or above. If
you are running Databricks Runtime 11.0 or above, support for arbitrary files is enabled by default.

IMPORTANT
This feature is in Public Preview.

In addition to syncing notebooks with a remote Git repository, you can sync any type of file your project
requires, such as:
.py files
data files in .csv or .json format
.yaml configuration files

You can import and read these files within a Databricks repo. You can also view and edit plain text files in the UI.
If support for this feature is not enabled, you still see non-notebook files in your repo, but you cannot work with
them.
Enable Files in Repos
An admin can enable this feature as follows:
1. Go to the Admin Console.
2. Click the Workspace Settings tab.
3. In the Repos section, click the Files in Repos toggle.
After the feature has been enabled, you must restart your cluster and refresh your browser before you can use
non-notebook files in Repos.
Additionally, the first time you access a repo after Files in Repos is enabled, you must open the Git dialog. The
dialog indicates that you must perform a pull operation to sync non-notebook files in the repo. Select Agree
and Pull to sync files. If there are any merge conflicts, another dialog appears giving you the option of
discarding your conflicting changes or pushing your changes to a new branch.

Confirm Files in Repos is enabled


You can use the command %sh pwd in a notebook inside a Repo to check if Files in Repos is enabled.
If Files in Repos is not enabled, the response is /databricks/driver .
If Files in Repos is enabled, the response is /Workspace/Repos/<path to notebook directory> .
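
As a small illustrative alternative (an assumption-based check, not an official command), you can run the equivalent from a Python cell:

# Illustrative check from a Python cell in a notebook inside a repo.
import os

print(os.getcwd())
# With Files in Repos enabled, this prints a path under /Workspace/Repos/...,
# mirroring the %sh pwd check above; otherwise it prints /databricks/driver.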

Control access to Databricks Repos


Manage permissions
When you create a repo, you have Can Manage permission. This lets you perform Git operations or modify the
remote repository. You can clone public remote repositories without Git credentials (personal access token and
username). To modify a public remote repository, or to clone or modify a private remote repository, you must
have a Git provider username and personal access token with read and write permissions for the remote
repository.
Use allow lists
An admin can limit which remote repositories users can commit and push to.
1. Go to the Admin Console.
2. Click the Workspace Settings tab.
3. In the Advanced section, click the Enable Repos Git URL Allow List toggle.
4. Click Confirm .
5. In the field next to Repos Git URL Allow List: Empty list , enter a comma-separated list of URL prefixes.
6. Click Save .
Users can only commit and push to Git repositories that start with one of the URL prefixes you specify. The
default setting is “Empty list”, which disables access to all repositories. To allow access to all repositories, disable
Enable Repos Git URL Allow List .

NOTE
Users can load and pull remote repositories even if they are not on the allow list.
The list you save overwrites the existing set of saved URL prefixes.
It may take about 15 minutes for changes to take effect.

Secrets detection
Databricks Repos scans code for access key IDs that begin with the prefix AKIA and warns the user before
committing.

Terraform integration
You can manage Databricks Repos in a fully automated setup using Databricks Terraform provider and
databricks_repo:

resource "databricks_repo" "this" {
  url = "https://github.com/user/demo.git"
}
GitHub version control
7/21/2022 • 6 minutes to read

NOTE
Databricks recommends that you use Git integration with Databricks Repos to sync your work in Azure Databricks with a
remote Git repository.

This article describes how to set up version control for notebooks using GitHub through the UI. You can also use
the Databricks CLI or Workspace API 2.0 to import and export notebooks and manage notebook versions using
GitHub tools.

Enable and disable Git versioning


By default version control is enabled. To toggle this setting, see Manage the ability to version notebooks in Git. If
Git versioning is disabled, the Git Integration tab is not available in the User Settings screen.

Configure version control


To configure version control, you create access credentials in your version control provider, then add those
credentials to Azure Databricks.
Get an access token
In GitHub, follow these steps to create a personal access token that allows access to your repositories:
1. In the upper-right corner of any page, click your profile photo, then click Settings .
2. Click Developer settings .
3. Click the Personal access tokens tab.
4. Click the Generate new token button.
5. Enter a token description.
6. Select the repo permission, and click the Generate token button.
7. Copy the token to your clipboard. You enter this token in Azure Databricks in the next step.
See the GitHub documentation to learn more about how to create personal access tokens.
Save your access token to Azure Databricks
1. In Azure Databricks, click Settings at the lower left of your screen and click User Settings .
2. Click the Git Integration tab.
3. If you have previously entered credentials, click the Change settings button.
4. In the Git provider drop-down, select GitHub .
5. Paste your token into the Token field.
6. Enter your GitHub username or email into the Git provider username or email field and click Save .

Work with notebook revisions


You work with notebook revisions in the history panel. Open the history panel by clicking Revision history at
the top right of the notebook.

NOTE
You cannot modify a notebook while the history panel is open.

Link a notebook to GitHub


1. Click Revision history at the top right of the notebook. The Git status bar displays Git: Not linked .

2. Click Git: Not linked .


The Git Preferences dialog appears. The first time you open your notebook, the Status is Unlink , because
the notebook is not in GitHub.

3. In the Status field, click Link .


4. In the Link field, paste the URL of the GitHub repository.
5. Click the Branch drop-down and select a branch or type the name of a new branch.
6. In the Path in Git Repo field, specify where in the repository to store your file.
Python notebooks have the suggested default file extension .py . If you use .ipynb , your notebook is
saved in IPython notebook format. If the file already exists on GitHub, you can directly copy and paste the
URL of the file.
7. Click Save to finish linking your notebook. If this file did not previously exist, a prompt with the option
Save this file to your GitHub repo displays.
8. Type a message and click Save .
Save a notebook to GitHub
While the changes that you make to your notebook are saved automatically to the Azure Databricks revision
history, changes do not automatically persist to GitHub.
1. Click Revision history at the top right of the notebook to open the history panel.

2. Click Save Now to save your notebook to GitHub. The Save Notebook Revision dialog appears.
3. Optionally, enter a message to describe your change.
4. Make sure that Also commit to Git is selected.

5. Click Save .
Revert or update a notebook to a version from GitHub
Once you link a notebook, Azure Databricks syncs your history with Git every time you re-open the history
panel. Versions that sync to Git have commit hashes as part of the entry.
1. Click Revision history at the top right of the notebook to open the history panel.
2. Choose an entry in the history panel. Azure Databricks displays that version.
3. Click Restore this version .
4. Click Confirm to confirm that you want to restore that version.
Unlink a notebook
1. Click Revision history at the top right of the notebook to open the history panel.
2. The Git status bar displays Git: Synced .

3. Click Git: Synced .

4. In the Git Preferences dialog, click Unlink .


5. Click Save .
6. Click Confirm to confirm that you want to unlink the notebook from version control.
Branch support
You can work on any branch of your repository and create new branches inside Azure Databricks.
Create a branch
1. Click Revision history at the top right of the notebook to open the history panel.
2. Click the Git status bar to open the GitHub panel.
3. Click the Branch dropdown.
4. Enter a branch name.
5. Select the Create Branch option at the bottom of the dropdown. The parent branch is indicated. You
always branch from your current selected branch.
Create a pull request
1. Click Revision history at the top right of the notebook to open the history panel.
2. Click the Git status bar to open the GitHub panel.

3. Click Create PR . GitHub opens to a pull request page for the branch.
Rebase a branch
You can also rebase your branch inside Azure Databricks. The Rebase link displays if new commits are available
in the parent branch. Only rebasing on top of the default branch of the parent repository is supported.

For example, assume that you are working on databricks/reference-apps . You fork it into your own account (for
example, brkyvz ) and start working on a branch called my-branch . If a new update is pushed to
databricks:master , then the Rebase button displays, and you will be able to pull the changes into your branch
brkyvz:my-branch .

Rebasing works a little differently in Azure Databricks. Assume the following branch structure:
After a rebase, the branch structure will look like:

What’s different here is that Commits C5 and C6 will not apply on top of C4. They will appear as local changes in
your notebook. Any merge conflict will show up as follows:

You can then commit to GitHub once again using the Save Now button.
What happens if someone branched off from my branch that I just rebased?
If your branch (for example, branch-a ) was the base for another branch ( branch-b ), and you rebase, you need
not worry! Once a user also rebases branch-b , everything will work out. The best practice in this situation is to
use separate branches for separate notebooks.
Best practices for code reviews
Azure Databricks supports Git branching.
You can link a notebook to any branch in a repository. Azure Databricks recommends using a separate
branch for each notebook.
During development, you can link a notebook to a fork of a repository or to a non-default branch in the main
repository. To integrate your changes upstream, you can use the Create PR link in the Git Preferences dialog
in Azure Databricks to create a GitHub pull request. The Create PR link displays only if you’re not working on
the default branch of the parent repository.

GitHub Enterprise
IMPORTANT
This feature is in Private Preview. To try it, reach out to your Azure Databricks contact.

You can also use the Workspace API 2.0 to programmatically create notebooks and manage the code base in
GitHub Enterprise Server.
Troubleshooting
If you receive errors related to syncing GitHub history, verify the following:
You can only link a notebook to an initialized Git repository that isn’t empty. Test the URL in a web browser.
The GitHub personal access token must be active.
To use a private GitHub repository, you must have permission to read the repository.
If a notebook is linked to a GitHub branch that is renamed, the change is not automatically reflected in Azure
Databricks. You must re-link the notebook to the branch manually.
Azure DevOps Services version control
7/21/2022 • 2 minutes to read

Azure DevOps is a collection of services that provide an end-to-end solution for the five core practices of
DevOps: planning and tracking, development, build and test, delivery, and monitoring and operations. This
article describes how to set Azure DevOps as your Git provider.

NOTE
Databricks recommends that you use Git integration with Databricks Repos to sync your work in Azure Databricks with
a remote Git repository.
For information about the name change from Visual Studio Team Services to Azure DevOps, see Visual Studio Team
Services is now Azure DevOps Services.

Enable and disable Git versioning


By default version control is enabled. To toggle this setting, see Manage the ability to version notebooks in Git. If
Git versioning is disabled, the Git Integration tab is not available in the User Settings screen.

Get started
Authentication with Azure DevOps Services is done automatically when you authenticate using Azure Active
Directory (Azure AD). The Azure DevOps Services organization must be linked to the same Azure AD tenant as
Databricks.
In Azure Databricks, set your Git provider to Azure DevOps Services on the User Settings page:

1. Click Settings at the lower left of your screen and select User Settings .
2. Click the Git Integration tab.
3. Change your provider to Azure DevOps Services.

Notebook integration
Notebook integration with Azure DevOps Services is exactly like integration with GitHub. See Work with
notebook revisions to learn more about how to work with notebooks using Git.
TIP
In Git Preferences, use the URL scheme https://dev.azure.com/<org>/<project>/_git/<repo> to link Azure DevOps
and Azure Databricks to the same Azure AD tenant.

If your Azure DevOps organization is org.visualstudio.com , open dev.azure.com in your browser and navigate to
your repository. Copy the URL from the browser and paste that URL in the Link field.

Troubleshooting
The Save button in the Databricks UI is grayed out.
Visual Studio Team Services was renamed to Azure DevOps Services. Original URLs in the format
https://<org>.visualstudio.com/<project>/_git/<repo> do not work in Azure Databricks notebooks.

An organization administrator can automatically update the URLs in Azure DevOps Services from the
organization settings page.
Alternately, you can manually create the new URL format used in Azure Databricks notebooks to sync with Azure
DevOps Services. In the Azure Databricks notebook, enter the new URL in the Link field in the Git Preferences
dialog.
Old URL format:
https://<org>.visualstudio.com/<project>/_git/<repo>

New URL format:


https://dev.azure.com/<org>/<project>/_git/<repo>
Bitbucket Cloud and Bitbucket Server version
control
7/21/2022 • 4 minutes to read

This guide describes how to set up version control for notebooks using Bitbucket Cloud and Bitbucket Server
through the UI.

NOTE
Databricks recommends that you use Git integration with Databricks Repos to sync your work in Azure Databricks with a
remote Git repository.

Bitbucket Server support


Databricks Repos supports Bitbucket Server integration, if the server is internet accessible.
To integrate with a Bitbucket Server instance that is not internet-accessible, get in touch with your Databricks
representative.

Enable and disable Git versioning


By default version control is enabled. To toggle this setting, see Manage the ability to version notebooks in Git. If
Git versioning is disabled, the Git Integration tab is not visible in the User Settings screen.

Configure version control


Configuring version control involves creating access credentials in your version control provider and adding
those credentials to Azure Databricks.
Get an app password
1. Go to Bitbucket Cloud and create an app password that allows access to your repositories. See the Bitbucket
Cloud documentation.
2. Record the password. You enter this password in Azure Databricks in the next step.
Save your app password and username to Azure Databricks
1. Click Settings at the lower left of your screen and select User Settings .
2. Click the Git Integration tab.
3. If you have previously entered credentials, click the Change settings button.
4. In the Git provider drop-down, select Bitbucket Cloud .
5. Paste your app password into the App password field.
6. Enter your username into the Git provider username field and click Save .

Work with notebook revisions


You work with notebook revisions in the History panel. Open the History panel by clicking Revision history at
the top right of the notebook.
NOTE
You cannot modify a notebook while the History panel is open.

Link a notebook to Bitbucket Cloud


1. Open the History panel. The Git status bar displays Git: Not linked .

2. Click Git: Not linked .


The Git Preferences dialog appears. The first time you open your notebook, the Status is Unlink , because
the notebook is not in Bitbucket Cloud.

3. In the Status field, click Link .


4. In the Link field, paste the URL of the Bitbucket Cloud repository.
5. Click the Branch drop-down and select a branch.
6. In the Path in Git Repo field, specify where in the repository to store your file.
Python notebooks have the suggested default file extension .py . If you use .ipynb , your notebook is
saved in IPython notebook format. If the file already exists on Bitbucket Cloud, you can directly copy and
paste the URL of the file.
7. Click Save to finish linking your notebook. If this file did not previously exist, a prompt with the option
Save this file to your Bitbucket Cloud repo displays.
8. Type a message and click Save .
Save a notebook to Bitbucket Cloud
While the changes that you make to your notebook are saved automatically to the Azure Databricks revision
history, changes do not automatically persist to Bitbucket Cloud.
1. Open the History panel.
2. Click Save Now to save your notebook to Bitbucket Cloud. The Save Notebook Revision dialog appears.
3. Optionally, enter a message to describe your change.
4. Make sure that Also commit to Git is selected.

5. Click Save .
Revert or update a notebook to a version from Bitbucket Cloud
Once you link a notebook, Azure Databricks syncs your history with Git every time you re-open the History
panel. Versions that sync to Git have commit hashes as part of the entry.
1. Open the History panel.

2. Choose an entry in the History panel. Azure Databricks displays that version.
3. Click Restore this version .
4. Click Confirm to confirm that you want to restore that version.
Unlink a notebook
1. Open the History panel.
2. The Git status bar displays Git: Synced .
3. Click Git: Synced .

4. In the Git Preferences dialog, click Unlink .


5. Click Save .
6. Click Confirm to confirm that you want to unlink the notebook from version control.
Create a pull request
1. Open the History panel.
2. Click the Git status bar to open the Git Preferences dialog.

3. Click Create PR . Bitbucket Cloud opens to a pull request page for the branch.

Best practice for code reviews


Azure Databricks supports Git branching.
You can link a notebook to your own fork and choose a branch.
We recommend using separate branches for each notebook.
Once you are happy with your changes, you can use the Create PR link in the Git Preferences dialog to take
you to Bitbucket Cloud’s pull request page.
The Create PR link displays only if you’re not working on the default branch of the parent repository.
Troubleshooting
If you receive errors related to Bitbucket Cloud history sync, verify the following:
1. You have initialized the repository on Bitbucket Cloud, and it isn’t empty. Try the URL that you entered and
verify that it forwards to your Bitbucket Cloud repository.
2. Your app password is active and your username is correct.
3. If the repository is private, you should have read and write access (through Bitbucket Cloud) on the
repository.
GitLab version control
7/21/2022 • 2 minutes to read

This article describes how to set up Git integration with Databricks Repos for notebooks using GitLab through
the UI.

Enable and disable Git versioning


By default version control is enabled. To toggle this setting, see Manage the ability to version notebooks in Git. If
Git versioning is disabled, the Git Integration tab is not available in the User Settings screen.

Configure version control


Configuring version control involves creating access credentials in your version control provider and adding
those credentials to Azure Databricks.
Get an access token
Go to GitLab and create a personal access token that allows access to your repositories:
1. From GitLab, click your user icon in the upper right corner of the screen and select Preferences .
2. Click Access Tokens in the sidebar.

3. Enter a name for the token.


4. Check the read_repository and write_repository permissions, and click Create personal access
token .
5. Copy the token to your clipboard. Enter this token in Azure Databricks in the next step.
See the GitLab documentation to learn more about how to create and manage personal access tokens.
Save your access token to Azure Databricks
1. Click Settings at the lower left of your screen and select User Settings .
2. Click the Git Integration tab.
3. If you have previously entered credentials, click the Change settings button.
4. In the Git provider drop-down, select GitLab .
5. Paste your token into the Token field.
6. Enter your GitLab username or email into the Git provider username or email field and click Save .
AWS CodeCommit version control
7/21/2022 • 2 minutes to read

This guide describes how to configure version control with AWS CodeCommit. Configuring version control
involves creating access credentials in your version control provider and adding those credentials to Azure
Databricks.

Create HTTPS Git credentials


1. In AWS CodeCommit, create HTTPS Git credentials that allow access to your repositories. See the AWS
CodeCommit documentation. The associated IAM user must have “read” and “write” permissions for the
repository.
2. Record the password. You enter this password in Azure Databricks in the next step.

Save your password and username to Azure Databricks


1. Click Settings at the lower left of your screen and select User Settings .
2. Click the Git Integration tab.
3. If you have previously entered credentials, click the Change settings button.
4. In the Git provider drop-down, select AWS CodeCommit .
5. Paste your password into the HTTPS Git password field.
6. Enter your username into the Git provider username field and click Save .
GitHub AE version control
7/21/2022 • 2 minutes to read

This guide describes how to configure version control with GitHub AE. Configuring version control involves
creating access credentials in your version control provider and adding those credentials to Azure Databricks.

Get an access token


In GitHub AE, follow these steps to create a personal access token that allows access to your repositories:
1. In the upper-right corner of any page, click your profile photo, then click Settings .
2. Click Developer settings .
3. Click the Personal access tokens tab.
4. Click the Generate new token button.
5. Enter a token description.
6. Select the repo permission, and click the Generate token button.
7. Copy the token to your clipboard. You enter this token in Azure Databricks in the next step.
See the GitHub documentation to learn more about how to create personal access tokens.

Save your token and username to Azure Databricks


1. Click Settings at the lower left of your screen and select User Settings .
2. Click the Git Integration tab.
3. If you have previously entered credentials, click the Change settings button.
4. In the Git provider drop-down, select GitHub Enterprise .
5. Paste your token into the Token field.
6. Enter your username into the Git provider username or email field and click Save .
Work with notebooks and project files in Azure
Databricks Repos
7/21/2022 • 5 minutes to read

This article walks you through steps for working with notebooks and other files in Databricks Repos with a
remote Git integration.
In Azure Databricks you can:
Clone a remote Git repository.
Work in notebooks or files.
Create notebooks, and edit notebooks and other files.
Sync with a remote repository.
Create new branches for development work.
For other tasks, you work in your Git provider:
Creating a PR
Resolving conflicts
Merging or deleting branches
Rebasing a branch

Clone a remote Git repository


When you clone a remote Git repository, you can then work on the notebooks or other files in Azure Databricks.

1. Click Repos in the sidebar.


2. Click Add Repo .

3. In the Add Repo dialog, click Clone remote Git repo and enter the repository URL. Select your Git
provider from the drop-down menu, optionally change the name to use for the Databricks repo, and click
Create . The contents of the remote repository are cloned to the Databricks repo.
Create a notebook or folder
To create a new notebook or folder in a repo, click the down arrow next to the repo name, and select Create >
Notebook or Create > Folder from the menu.

To move a notebook or folder in your workspace into a repo, navigate to the notebook or folder and select
Move from the drop-down menu:

In the dialog, select the repo to which you want to move the object:
You can import a SQL or Python file as a single-cell Azure Databricks notebook.
Add the comment line -- Databricks notebook source at the top of a SQL file.
Add the comment line # Databricks notebook source at the top of a Python file.

Work with non-notebook files in an Azure Databricks repo


This section covers how to add files to a repo and view and edit files.

IMPORTANT
This feature is in Public Preview.

Requirements
Databricks Runtime 8.4 or above.
Create a new file
The most common way to create a file in a repo is to clone a Git repository. You can also create a new file
directly from the Databricks repo. Click the down arrow next to the repo name, and select Create > File from
the menu.

Import a file
To import a file, click the down arrow next to the repo name, and select Impor t .

The import dialog appears. You can drag files into the dialog or click browse to select files.

Only notebooks can be imported from a URL.


When you import a .zip file, Azure Databricks automatically unzips the file and imports each file and
notebook that is included in the .zip file.
Edit a file
To edit a file in a repo, click the filename in the Repos browser. The file opens and you can edit it. Changes are
saved automatically.
When you open a Markdown ( .md ) file, the rendered view is displayed by default. To edit the file, click in the file
editor. To return to preview mode, click anywhere outside of the file editor.
Refactor code
A best practice for code development is to modularize code so it can be easily reused. You can create custom
Python files in a repo and make the code in those files available to a notebook using the import statement. For
an example, see the example notebook.
To refactor notebook code into reusable files:
1. From the Repos UI, create a new branch.
2. Create a new source code file for your code.
3. Add Python import statements to the notebook to make the code in your new file available to the notebook.
4. Commit and push your changes to your Git provider.
Access files in a repo programmatically
You can programmatically read small data files in a repo, such as .csv or .json files, directly from a notebook.
You cannot programmatically create or edit files from a notebook.

import pandas as pd
df = pd.read_csv("./data/winequality-red.csv")
df

You can use Spark to access files in a repo. Spark requires absolute file paths for file data. The absolute file path
for a file in a repo is file:/Workspace/Repos/<user_folder>/<repo_name>/file .
You can copy the absolute or relative path to a file in a repo from the drop-down menu next to the file:

The example below shows the use of os.getcwd() to get the full path.

import os
spark.read.format("csv").load(f"file:{os.getcwd()}/my_data.csv")

Example notebook
This notebook shows examples of working with arbitrary files in Databricks Repos.
Arbitrary Files in Repos example notebook
Get notebook

Work with Python and R modules


IMPORTANT
This feature is in Public Preview.

Requirements
Databricks Runtime 8.4 or above.
Import Python and R modules
The current working directories of your repo and notebook are automatically added to the Python path. When
you work in the repo root, you can import modules from the root directory and all subdirectories.
To import modules from another repo, you must add that repo to sys.path . For example:
import sys
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")

# To use a relative path:
import sys
import os
sys.path.append(os.path.abspath('..'))

You import functions from a module in a repo just as you would from a module saved as a cluster library or
notebook-scoped library:
Python

from sample import power
power.powerOfTwo(3)

R

source("sample.R")
power.powerOfTwo(3)

Import Azure Databricks Python notebooks


To distinguish between a regular Python file and an Azure Databricks Python-language notebook exported in
source-code format, Databricks adds the line # Databricks notebook source at the top of the notebook source
code file.
When you import the notebook, Azure Databricks recognizes it and imports it as a notebook, not as a Python
module.
If you want to import the notebook as a Python module, you must edit the notebook in a code editor and
remove the line # Databricks notebook source . Removing that line converts the notebook to a regular Python
file.
Import precedence rules
When you use an import statement in a notebook in a repo, the library in the repo takes precedence over a
library or wheel with the same name that is installed on the cluster.
Autoreload for Python modules
While developing Python code, if you are editing multiple files, you can use the following commands in any cell
to force a reload of all modules.

%load_ext autoreload
%autoreload 2

Use Azure Databricks web terminal for testing


You can use Azure Databricks web terminal to test modifications to your Python or R code without having to
import the file to a notebook and execute the notebook.
1. Open web terminal.
2. Change to the Repo directory: cd /Workspace/Repos/<path_to_repo>/ .
3. Run the Python or R file: python file_name.py or Rscript file_name.r .

Run jobs using notebooks in a remote repository


You can run an Azure Databricks job using notebooks located in a remote Git repository. This is especially useful
for managing CI/CD for production runs. See Create a job.
Sync with a remote Git repository
7/21/2022 • 2 minutes to read

This article describes how you can use common Git capabilities to sync Databricks Repos with a remote Git
repository.
To update notebooks and other files in Databricks Repos, you can:
Pull changes from the remote Git repo.
Resolve merge conflicts.
Commit and push from Databricks to the remote repo.
You can also create a new branch in Databricks Repos.
To sync with Git, use the Git dialog. The Git dialog lets you pull changes from your remote Git repository and
push and commit changes. You can also change the branch you are working on or create a new branch.

IMPORTANT
Git operations that pull in upstream changes clear the notebook state. For more information, see Incoming changes clear
the notebook state.

Open the Git dialog


You can access the Git dialog from a notebook or from the Databricks Repos browser.
From a notebook, click the button at the top left of the notebook that identifies the current Git branch.

From the Databricks Repos browser, click the button to the right of the repo name:
You can also click the down arrow next to the repo name, and select Git… from the menu.

Pull changes from the remote Git repository


To pull changes from the remote Git repository, click the pull icon in the Git dialog. Notebooks and other files are
updated automatically to the latest version in your remote repository.

Resolve merge conflicts


To resolve a merge conflict, you must either discard conflicting changes or commit your changes to a new
branch and then merge them into the original feature branch using a pull request.
1. If there is a merge conflict, the Repos UI shows a notice allowing you to cancel the pull or resolve the
conflict. If you select Resolve conflict using PR , a dialog appears that lets you create a new branch and
commit your changes to it.

2. When you click Commit to new branch , a notice appears with a link: Create a pull request to
resolve merge conflicts . Click the link to open your Git provider.

3. In your Git provider, create the PR, resolve the conflicts, and merge the new branch into the original
branch.
4. Return to the Repos UI. Use the Git dialog to pull changes from the Git repository to the original branch.

Commit and push changes to the remote Git repository


When you have added new notebooks or files, or made changes to existing notebooks or files, the Git dialog
highlights the changes.

Add a required Summary of the changes, and click Commit & Push to push these changes to the remote Git
repository.
If you don’t have permission to commit to the default branch, such as main , create a new branch and use your
Git provider interface to create a pull request (PR) to merge it into the default branch.

NOTE
Results are not included with a notebook commit. All results are cleared before the commit is made.
For instructions on resolving merge conflicts, see Resolve merge conflicts.

Create a new branch


You can create a new branch based on an existing branch from the Git dialog:
CI/CD workflows with Databricks Repos and Git
integration
7/21/2022 • 4 minutes to read

Learn best practices for using Databricks Repos in a CI/CD workflow. Integrating Git repos with Databricks Repos
provides source control for project files.
The following figure shows an overview of the steps.

Admin workflow
Databricks Repos have user-level folders and non-user top level folders. User-level folders are automatically
created when users first clone a remote repository. You can think of Databricks Repos in user folders as “local
checkouts” that are individual for each user and where users make changes to their code.

Set up top-level folders


Admins can create non-user top level folders. The most common use case for these top level folders is to create
Dev, Staging, and Production folders that contain Databricks Repos for the appropriate versions or branches for
development, staging, and production. For example, if your company uses the Main branch for production, the
Production folder would contain Repos configured to be at the Main branch.
Typically permissions on these top-level folders are read-only for all non-admin users within the workspace.
Set up Git automation to update Databricks Repos on merge
To ensure that Databricks Repos are always at the latest version, you can set up Git automation to call the Repos
API 2.0. In your Git provider, set up automation that, after every successful merge of a PR into the main branch,
calls the Repos API endpoint on the appropriate repo in the Production folder to bring that repo to the latest
version.
For example, on GitHub this can be achieved with GitHub Actions. For more information, see the Repos API.
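
As an illustrative sketch (not a drop-in workflow), the automation that your Git provider calls after a merge could be a small Python script like the following. It uses the Repos API 2.0 to find repos under a Production folder and move each one to the latest commit on main; the folder path, workspace URL, and token are placeholder assumptions:

# Sketch: after a merge to main, update the repos in the Production folder (Repos API 2.0).
import requests

host = "https://<your-workspace-instance-name>"
headers = {"Authorization": "Bearer <your-access-token>"}

# Find repos under the Production top-level folder.
repos = requests.get(
    f"{host}/api/2.0/repos",
    headers=headers,
    params={"path_prefix": "/Repos/Production"},
).json().get("repos", [])

# Check out the latest main commit in each production repo.
for repo in repos:
    requests.patch(
        f"{host}/api/2.0/repos/{repo['id']}",
        headers=headers,
        json={"branch": "main"},
    ).raise_for_status()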

Developer workflow
In your user folder in Databricks Repos, clone your remote repository. A best practice is to create a new feature
branch or select a previously created branch for your work, instead of directly committing and pushing changes
to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to
merge your code, create a pull request and follow the review and merge processes in Git.
Here is an example workflow.
Requirements
This workflow requires that you have already set up your Git integration.

NOTE
Databricks recommends that each developer work on their own feature branch. Sharing feature branches among
developers can cause merge conflicts, which must be resolved using your Git provider. For information about how to
resolve merge conflicts, see Resolve merge conflicts.

Workflow
1. Clone your existing Git repository to your Databricks workspace.
2. Use the Repos UI to create a feature branch from the main branch. This example uses a single feature branch
feature-b for simplicity. You can create and use multiple feature branches to do your work.
3. Make your modifications to Databricks notebooks and files in the Repo.
4. Commit and push your changes to your Git provider.
5. Coworkers can now clone the Git repository into their own user folder.
a. Working on a new branch, a coworker makes changes to the notebooks and files in the Repo.
b. The coworker commits and pushes their changes to the Git provider.
6. To merge changes from other branches or rebase the feature branch, you must use the Git command line or
an IDE on your local system. Then, in the Repos UI, use the Git dialog to pull changes into the feature-b
branch in the Databricks Repo.
7. When you are ready to merge your work to the main branch, use your Git provider to create a PR to merge
the changes from feature-b.
8. In the Repos UI, pull changes to the main branch.

Production job workflow


You can point a job directly to a notebook in a Databricks Repo. When a job kicks off a run, it uses the current
version of the code in the repo.
If the automation is set up as described in Admin workflow, every successful merge calls the Repos API to update
the repo. As a result, jobs that are configured to run code from a repo always use the latest version available
when the job run was created.
Migration tips
IMPORTANT
This feature is in Public Preview.

If you are using %run commands to make Python or R functions defined in a notebook available to another
notebook, or are installing custom .whl files on a cluster, consider including those custom modules in a
Databricks repo. In this way, you can keep your notebooks and other code modules in sync, ensuring that your
notebook always uses the correct version.
Migrate from %run commands
%run commands let you include one notebook within another and are often used to make supporting Python
or R code available to a notebook. In this example, a notebook named power.py includes the code below.

# This code is in a notebook named "power.py".

def n_to_mth(n, m):
    print(n, "to the", m, "th power is", n**m)

You can then make functions defined in power.py available to a different notebook with a %run command:

# This notebook uses a %run command to access the code in "power.py".


%run ./power
n_to_mth(3, 4)

Using Files in Repos, you can directly import the module that contains the Python code and run the function.

from power import n_to_mth


n_to_mth(3, 4)

Migrate from installing custom Python .whl files


You can install custom .whl files onto a cluster and then import them into a notebook attached to that cluster.
For code that is frequently updated, this process is cumbersome and error-prone. Files in Repos lets you keep
these Python files in the same repo with the notebooks that use the code, ensuring that your notebook always
uses the correct version.
For more information about packaging Python projects, see this tutorial.
Limitations and FAQ for Git integration with
Databricks Repos
7/21/2022 • 6 minutes to read

Databricks Repos and Git integration have limits specified in the following sections. For general information, see
Databricks limits.

File and repo size limits


Databricks doesn’t enforce a limit on the size of a repo. However:
Working branches are limited to 200 MB.
Individual files are limited to 200 MB.
Files larger than 10 MB can’t be viewed in the Databricks UI.
Databricks recommends that in a repo:
The total number of all files not exceed 10,000.
The total number of notebooks not exceed 5,000.
You may receive an error message if your repo exceeds these limits. You may also receive a timeout error when
you clone the repo, but the operation might complete in the background.

Repo configuration
Where is Databricks repo content stored?
The contents of a repo are temporarily cloned onto disk in the control plane. Azure Databricks notebook files are
stored in the control plane database just like notebooks in the main workspace. Non-notebook files may be
stored on disk for up to 30 days.
Does Repos support on-premise or self-hosted Git servers?
Databricks Repos supports Bitbucket Server integration, if the server is internet accessible.
To integrate with a Bitbucket Server, GitHub Enterprise Server, or a GitLab self-managed subscription instance
that is not internet-accessible, get in touch with your Databricks representative.
Does Repos support .gitignore files?
Yes. If you add a file to your repo and do not want it to be tracked by Git, create a .gitignore file or use one
cloned from your remote repository and add the filename, including the extension.
.gitignore works only for files that are not already tracked by Git. If you add a file that is already tracked by Git
to a .gitignore file, the file is still tracked by Git.
Can I create top-level folders that are not user folders?
Yes, admins can create top-level folders to a single depth. Repos does not support additional folder levels.
Does Repos support Git submodules?
No. You can clone a repo that contains Git submodules, but the submodule is not cloned.
Does Azure Data Factory (ADF ) support Repos?
Yes.
How can I disable Repos in my workspace?
Follow these steps to disable Repos for Git in your workspace.
1. Go to the Admin Console.
2. Click the Workspace Settings tab.
3. In the Advanced section, click the Repos toggle.
4. Click Confirm .
5. Refresh your browser.

Source management
Can I pull in .ipynb files?
Yes. The file renders in .json format, not notebook format.
Does Repos support branch merging?
No. Databricks recommends that you create a pull request and merge through your Git provider.
Can I delete a branch from an Azure Databricks repo?
No. To delete a branch, you must work in your Git provider.
If a library is installed on a cluster, and a library with the same name is included in a folder within a repo, which
library is imported?
The library in the repo is imported.
Can I pull the latest version of a repository from Git before running a job without relying on an external
orchestration tool?
No. Typically you can integrate this as a pre-commit on the Git server so that every push to a branch
(main/prod) updates the Production repo.
Can I export a Repo?
You can export notebooks, folders, or an entire Repo. You cannot export non-notebook files, and if you export an
entire Repo, non-notebook files are not included. To export, use the Workspace CLI or the Workspace API 2.0.
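As a rough sketch of the API path, the example below exports a repo folder as a DBC archive with the Workspace API 2.0. The workspace URL, token, repo path, and output file name are placeholder assumptions; check the Workspace API reference for the exact parameters supported in your workspace.

import base64
import requests

host = "https://<databricks-instance>"   # hypothetical workspace URL
token = "<personal-access-token>"

# Export the repo folder as a DBC archive (notebooks only; non-notebook files are skipped).
resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Repos/<user>/<repo-name>", "format": "DBC"},
)
resp.raise_for_status()

with open("repo-export.dbc", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))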

Security, authentication, and tokens


Are the contents of Databricks Repos encrypted?
The contents of Databricks Repos are encrypted by Azure Databricks using a default key. Encryption using
Enable customer-managed keys for managed services is not supported.
How and where are the GitHub tokens stored in Azure Databricks? Who would have access from Azure
Databricks?
The authentication tokens are stored in the Azure Databricks control plane, and an Azure Databricks
employee can only gain access through a temporary credential that is audited.
Azure Databricks logs the creation and deletion of these tokens, but not their usage. Azure Databricks has
logging that tracks Git operations that could be used to audit the usage of the tokens by the Azure Databricks
application.
GitHub enterprise audits token usage. Other Git services may also have Git server auditing.
Does Repos support GPG signing of commits?
No.
Does Repos support SSH?
No, only HTTPS.
CI/CD and MLOps
Incoming changes clear the notebook state
Git operations that alter the notebook source code result in the loss of the notebook state, including cell results,
comments, revision history, and widgets. For example, Git pull can change the source code of a notebook. In this
case, Databricks Repos must overwrite the existing notebook to import the changes. Git commit and push or
creating a new branch do not affect the notebook source code, so the notebook state is preserved in these
operations.
Prevent data loss in MLflow experiments
MLflow experiment data in a notebook might be lost in this scenario: You rename the notebook and then, before
calling any MLflow commands, change to a branch that doesn’t contain the notebook.
To prevent this situation, Databricks recommends you avoid renaming notebooks in repos.
Can I create an MLflow experiment in a repo?
No. You can only create an MLflow experiment in the workspace. Experiments created in a Repo before the 3.72
platform release are no longer supported, though they may continue to work without guarantees. Databricks
recommends exporting existing experiments in repos to workspace experiments using the MLflow export tool.
What happens if a job starts running on a notebook while a Git operation is in progress?
At any point while a Git operation is in progress, some notebooks in the Repo may have been updated while
others have not. This can cause unpredictable behavior.
For example, suppose notebook A calls notebook Z using a %run command. If a job running during a Git
operation starts the most recent version of notebook A, but notebook Z has not yet been updated, the %run
command in notebook A might start the older version of notebook Z. During the Git operation, the notebook
states are not predictable and the job might fail or run notebook A and notebook Z from different commits.

Non-notebook files: Files in Repos


Files in Repos supports working with non-notebook files in Databricks Repos.

IMPORTANT
This feature is in Public Preview.

In Databricks Runtime 10.1 and below, Files in Repos is not compatible with Spark Streaming. To use Spark
Streaming on a cluster running Databricks Runtime 10.1 or below, you must disable Files in Repos on the
cluster. Set the Spark configuration spark.databricks.enableWsfs to false .
Native file reads are supported in Python and R notebooks. Native file reads are not supported in Scala
notebooks, but you can use Scala notebooks with DBFS as you do today.
Only text-encoded files are rendered in the UI. To view files in Azure Databricks, the files must not be larger
than 10 MB.
You cannot create or edit a file from your notebook.
You can only export notebooks. You cannot export non-notebook files from a repo.
How can I run non-Databricks notebook files in a repo? For example, a .py file?
You can use any of the following:
Bundle and deploy as a library on the cluster.
Pip install the Git repository directly. This requires a credential in secrets manager.
Use %run with inline code in a notebook.
Use a custom container image. See Customize containers with Databricks Container Services.
Errors and troubleshooting for Databricks Repos
7/21/2022 • 3 minutes to read

Follow the guidance below to respond to common error messages or troubleshoot issues with Databricks
Repos.

Invalid credentials
Try the following:
Confirm that the settings in the Git integration tab (User Settings > Git Integration ) are correct.
You must enter both your Git provider username and token.
Legacy Git integrations did not require a username, so you may need to add a username to work with
Databricks Repos.
Confirm that you have selected the correct Git provider in the Add Repo dialog.
Ensure your personal access token or app password has the correct repo access.
If SSO is enabled on your Git provider, authorize your tokens for SSO.
Test your token with the Git command line. Both of these options should work:

git clone https://<username>:<personal-access-token>@github.com/<org>/<repo-name>.git

git clone -c http.sslVerify=false -c http.extraHeader='Authorization: Bearer <personal-access-token>' https://agile.act.org/

Secure connection could not be established because of SSL problems

<link>: Secure connection to <link> could not be established because of SSL problems

This error occurs if your Git server is not accessible from Azure Databricks. Private Git servers are not
supported.

Error message: Azure Active Directory credentials


Encountered an error with your Azure Active Directory credentials. Please try logging out of Azure Active
Directory and logging back in.

This error can occur if your team has recently moved to using a multi-factor authentication (MFA) policy for
Azure Active Directory. To resolve this problem, you must log out of Azure Active Directory by going to
portal.azure.com and logging out. When you log back in, you should get the prompt to use MFA to log in.

If that does not work, try logging out completely from all Azure services before attempting to log in again.

Timeout errors
Expensive operations such as cloning a large repo or checking out a large branch may hit timeout errors, but the
operation might complete in the background. You can also try again later if the workspace was under heavy load
at the time.

404 errors
If you get a 404 error when you try to open a non-notebook file, wait a few minutes and then try
again. There is a delay of a few minutes between when the workspace is enabled and when the webapp picks up
the configuration flag.

Resource not found errors after you pull non-notebook files into a
Databricks repo
This error can occur if you are not using Databricks Runtime 8.4 or above. A cluster running Databricks Runtime
8.4 or above is required to work with non-notebook files in a repo.

Errors suggest recloning


There was a problem with deleting folders. The repo could be in an inconsistent state and re-cloning is
recommended.

This error indicates that a problem occurred while deleting folders from the repo. This could leave the repo in an
inconsistent state, where folders that should have been deleted still exist. If this error occurs, Databricks
recommends deleting and re-cloning the repo to reset its state.

Unable to set repo to most recent state. This may be due to force pushes overriding commit history on the
remote repo. Repo may be out of sync and re-cloning is recommended.

This error indicates that the local and remote Git state have diverged. This can happen when a force push on the
remote overrides recent commits that still exist on the local repo. Databricks does not support a hard reset
within Repos and recommends deleting and re-cloning the repo if this error occurs.

Files do not appear after cloning a remote repo or pulling files into
an existing one
If you know your admin enabled Databricks Repos and support for arbitrary files, try the following:
Confirm your cluster is running Databricks Runtime 8.4 or above.
Refresh your browser and restart your cluster to pick up the new configuration.

No experiment for node found or MLflow UI errors


You may see an Azure Databricks error message “No experiment for node found” or an error in MLflow when you
work on an MLflow notebook experiment last logged to before the 3.72 platform release. To resolve the error,
log a new run in the notebook associated with that experiment.

NOTE
This applies only to notebook experiments. Creation of new experiments in Repos is unsupported.
Libraries
7/21/2022 • 4 minutes to read

To make third-party or custom code available to notebooks and jobs running on your clusters, you can install a
library. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries
and point to external packages in PyPI, Maven, and CRAN repositories.
This article focuses on performing library tasks in the workspace UI. You can also manage libraries using the
Libraries CLI or the Libraries API 2.0.
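For reference, the sketch below shows one way to install a PyPI package on a cluster with the Libraries API 2.0 from Python. The workspace URL, token, cluster ID, and package pin are placeholder assumptions.

import requests

host = "https://<databricks-instance>"   # hypothetical workspace URL
token = "<personal-access-token>"

# Ask the Libraries API to install a pinned PyPI package on a specific cluster.
resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "<cluster-id>",
        "libraries": [{"pypi": {"package": "simplejson==3.8.0"}}],
    },
)
resp.raise_for_status()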

TIP
Azure Databricks includes many common libraries in Databricks Runtime. To see which libraries are included in Databricks
Runtime, look at the System Environment subsection of the Databricks Runtime release notes for your Databricks
Runtime version.

IMPORTANT
Azure Databricks does not invoke Python atexit functions when your notebook or job completes processing. If you use
a Python library that registers atexit handlers, you must ensure your code calls required functions before exiting.
Installing Python eggs is deprecated and will be removed in a future Databricks Runtime release. Use Python wheels or
install packages from PyPI instead.

NOTE
Microsoft Support helps isolate and resolve issues related to libraries installed and maintained by Azure Databricks. For
third-party components, including libraries, Microsoft provides commercially reasonable support to help you further
troubleshoot issues. Microsoft Support assists on a best-effort basis and might be able to resolve the issue. For open
source connectors and projects hosted on GitHub, we recommend that you file issues on GitHub and follow up on them.
Development efforts such as shading jars or building Python libraries are not supported through the standard support
case submission process: they require a consulting engagement for faster resolution. Support might ask you to engage
other channels for open-source technologies where you can find deep expertise for that technology. There are several
community sites; two examples are the Microsoft Q&A page for Azure Databricks and Stack Overflow.

You can install libraries in three modes: workspace, cluster-installed, and notebook-scoped.
Workspace libraries serve as a local repository from which you create cluster-installed libraries. A workspace
library might be custom code created by your organization, or might be a particular version of an open-
source library that your organization has standardized on.
Cluster libraries can be used by all notebooks running on a cluster. You can install a cluster library directly
from a public repository such as PyPI or Maven, or create one from a previously installed workspace library.
Notebook-scoped libraries, available for Python and R, allow you to install libraries and create an
environment scoped to a notebook session. These libraries do not affect other notebooks running on the
same cluster. Notebook-scoped libraries do not persist and must be re-installed for each session. Use
notebook-scoped libraries when you need a custom environment for a specific notebook.
Notebook-scoped Python libraries
Notebook-scoped R libraries
This section covers:
Workspace libraries
Cluster libraries
Notebook-scoped Python libraries
Notebook-scoped R libraries

Python environment management


The following overview describes, for each Python package source, the options you can use to install Python libraries in Azure Databricks.

NOTE
Custom containers that use a conda-based environment are not compatible with notebook-scoped libraries in
Databricks Runtime 9.0 and above and with cluster libraries in Databricks Runtime 10.1 and above. Instead, Azure
Databricks recommends installing libraries directly in the image or using init scripts. To continue using cluster libraries
in those scenarios, you can set the Spark configuration
spark.databricks.driverNfs.clusterWidePythonLibsEnabled to false . Support for the Spark configuration will
be removed on or after December 31, 2021.
Notebook-scoped libraries using magic commands are enabled by default in Databricks Runtime 7.1 and above,
Databricks Runtime 7.1 ML and above, and Databricks Runtime 7.1 for Genomics and above. They are also available
using a configuration setting in Databricks Runtime 6.4 ML to 7.0 ML and Databricks Runtime 6.4 for Genomics to
Databricks Runtime 7.0 for Genomics. See Requirements for details.
Notebook-scoped libraries with the library utility are deprecated and will be removed in an upcoming Databricks
Runtime version. They are not available on Databricks Runtime ML or Databricks Runtime for Genomics.

PyPI
Notebook-scoped libraries with %pip: Use %pip install . See example.
Notebook-scoped libraries with the library utility: Use dbutils.library.installPyPI .
Cluster libraries: Select PyPI as the source.
Job libraries with Jobs API: Add a new pypi object to the job libraries and specify the package field.

Private PyPI mirror, such as Nexus or Artifactory
Notebook-scoped libraries with %pip: Use %pip install with the --index-url option. Secret management is available. See example.
Notebook-scoped libraries with the library utility: Use dbutils.library.installPyPI and specify the repo argument.
Cluster libraries: Not supported.
Job libraries with Jobs API: Not supported.

VCS, such as GitHub, with raw source
Notebook-scoped libraries with %pip: Use %pip install and specify the repository URL as the package name. See example.
Notebook-scoped libraries with the library utility: Not supported.
Cluster libraries: Select PyPI as the source and specify the repository URL as the package name.
Job libraries with Jobs API: Add a new pypi object to the job libraries and specify the repository URL as the package field.

Private VCS with raw source
Notebook-scoped libraries with %pip: Use %pip install and specify the repository URL with basic authentication as the package name. Secret management is available. See example.
Notebook-scoped libraries with the library utility: Not supported.
Cluster libraries: Not supported.
Job libraries with Jobs API: Not supported.

DBFS
Notebook-scoped libraries with %pip: Use %pip install . See example.
Notebook-scoped libraries with the library utility: Use dbutils.library.install(dbfs_path) .
Cluster libraries: Select DBFS as the source.
Job libraries with Jobs API: Add a new egg or whl object to the job libraries and specify the DBFS path as the package field.
Workspace libraries
7/21/2022 • 4 minutes to read

Workspace libraries serve as a local repository from which you create cluster-installed libraries. A workspace
library might be custom code created by your organization, or might be a particular version of an open-source
library that your organization has standardized on.
You must install a workspace library on a cluster before it can be used in a notebook or job.
Workspace libraries in the Shared folder are available to all users in a workspace, while workspace libraries in a
user folder are available only to that user.

Create a workspace library


1. Right-click the workspace folder where you want to store the library.
2. Select Create > Library.

The Create Library dialog appears.

3. Select the Library Source and follow the appropriate procedure:


Upload a library
Reference an uploaded library
PyPI package
Maven package
CRAN package
Upload a Jar, Python egg, or Python wheel

NOTE
Installing Python eggs is deprecated and will be removed in a future Databricks Runtime release.

1. In the Library Source button list, select Upload .


2. Select Jar , Python Egg , or Python Whl .
3. Optionally enter a library name.
4. Drag your Jar, Egg, or Whl to the drop box or click the drop box and navigate to a file. The file is uploaded to
dbfs:/FileStore/jars .
5. Click Create . The library status screen displays.
6. Optionally install the library on a cluster.
Reference an uploaded jar, Python egg, or Python wheel
If you’ve already uploaded a jar, egg, or wheel to object storage you can reference it in a workspace library.
You can choose a library in DBFS or one stored in ADLS.

NOTE
Libraries stored in ADLS are only supported in Databricks Runtime 8.0 and above and Databricks Runtime 7.3 LTS. ADLS
is only supported through the encrypted abfss:// path.

1. Select DBFS/ADLS in the Library Source button list.


2. Select Jar , Python Egg , or Python Whl .
3. Optionally enter a library name.
4. Specify the DBFS or ADLS path to the library.
5. Click Create . The library status screen displays.
6. Optionally install the library on a cluster.
PyPI package
1. In the Library Source button list, select PyPI .
2. Enter a PyPI package name. To install a specific version of a library use this format for the library:
<library>==<version> . For example, scikit-learn==0.19.1 .
3. In the Repository field, optionally enter a PyPI repository URL.
4. Click Create . The library status screen displays.
5. Optionally install the library on a cluster.
Maven or Spark package
1. In the Library Source button list, select Maven .
2. Specify a Maven coordinate. Do one of the following:
In the Coordinate field, enter the Maven coordinate of the library to install. Maven coordinates are in
the form groupId:artifactId:version ; for example, com.databricks:spark-avro_2.10:1.0.0 .
If you don’t know the exact coordinate, enter the library name and click Search Packages . A list of
matching packages displays. To display details about a package, click its name. You can sort packages
by name, organization, and rating. You can also filter the results by writing a query in the search bar.
The results refresh automatically.
a. Select Maven Central or Spark Packages in the drop-down list at the top left.
b. Optionally select the package version in the Releases column.
c. Click + Select next to a package. The Coordinate field is filled in with the selected package and
version.
3. In the Repository field, optionally enter a Maven repository URL.

NOTE
Internal Maven repositories are not supported.

4. In the Exclusions field, optionally provide the groupId and the artifactId of the dependencies that you
want to exclude; for example, log4j:log4j .
5. Click Create . The library status screen displays.
6. Optionally install the library on a cluster.
CRAN package
1. In the Library Source button list, select CRAN .
2. In the Package field, enter the name of the package.
3. In the Repository field, optionally enter the CRAN repository URL.
4. Click Create . The library detail screen displays.
5. Optionally install the library on a cluster.

NOTE
CRAN mirrors serve the latest version of a library. As a result, you may end up with different versions of an R package if
you attach the library to different clusters at different times. To learn how to manage and fix R package versions on
Databricks, see the Knowledge Base.

View workspace library details


1. Go to the workspace folder containing the library.
2. Click the library name.
The library details page shows the running clusters and the install status of the library. If the library is installed,
the page contains a link to the package host. If the library was uploaded, the page displays a link to the uploaded
package file.

Move a workspace library


1. Go to the workspace folder containing the library.

2. Click the drop-down arrow to the right of the library name and select Move . A folder browser
displays.
3. Click the destination folder.
4. Click Select .
5. Click Confirm and Move .
Delete a workspace library
IMPORTANT
Before deleting a workspace library, you should uninstall it from all clusters.

To delete a workspace library:


1. Move the library to the Trash folder.
2. Either permanently delete the library in the Trash folder or empty the Trash folder.
Cluster libraries
7/21/2022 • 3 minutes to read

Cluster libraries can be used by all notebooks running on a cluster. You can install a cluster library directly from
a public repository such as PyPI or Maven, using a previously installed workspace library, or using an init script.

Install a library on a cluster


There are two primary ways to install a library on a cluster:
Install a workspace library that has already been uploaded to the workspace.
Install a library for use with a specific cluster only.
In addition, if your library requires custom configuration, you may not be able to install it using the methods
listed above. Instead, you can install the library using an init script that runs at cluster creation time.

NOTE
When you install a library on a cluster, a notebook already attached to that cluster will not immediately see the new
library. You must first detach and then reattach the notebook to the cluster.

In this section:
Workspace library
Cluster-installed library
Init script
Workspace library

NOTE
Starting with Databricks Runtime 7.2, Azure Databricks processes all workspace libraries in the order that they were
installed on the cluster. On Databricks Runtime 7.1 and below, Azure Databricks processes Maven and CRAN libraries in
the order they are installed on the cluster.
You might need to pay attention to the order of installation on the cluster if there are dependencies between libraries.

To install a library that already exists in the workspace, you can start from the cluster UI or the library UI:
Cluster

1. Click Compute in the sidebar.


2. Click a cluster name.
3. Click the Libraries tab.
4. Click Install New .
5. In the Library Source button list, select Workspace .
6. Select a workspace library.
7. Click Install .
8. To configure the library to be installed on all clusters:
a. Click the library.
b. Select the Install automatically on all clusters checkbox.
c. Click Confirm .
Library
1. Go to the folder containing the library.
2. Click the library name.
3. Do one of the following:
To configure the library to be installed on all clusters, select the Install automatically on all
clusters checkbox and click Confirm .

IMPORTANT
This option does not install the library on clusters running Databricks Runtime 7.0 and above.

Select the checkbox next to the cluster that you want to install the library on and click Install .
The library is installed on the cluster.
Cluster-installed library

IMPORTANT
If you have configured a library to install on all clusters automatically, or you select an existing terminated cluster that has
libraries installed, the job execution does not wait for library installation to complete. If a job requires a specific library, you
should attach the library to the job in the Dependent Libraries field.

You can install a library on a specific cluster without making it available as a workspace library.
To install a library on a cluster:

1. Click Compute in the sidebar.


2. Click a cluster name.
3. Click the Libraries tab.
4. Click Install New .
5. Follow one of the methods for creating a workspace library. After you click Create , the library is installed on
the cluster.
Init script
If your library requires custom configuration, you may not be able to install it using the workspace or cluster
library interface. Instead, you can install the library using an init script.
Here is an example of an init script that uses pip to install Python libraries on a Databricks Runtime cluster at
cluster initialization.

#!/bin/bash

/databricks/python/bin/pip install astropy
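One way to make a script like this available to a cluster is to store it in DBFS and reference its path in the cluster's init script settings. The sketch below writes the script with dbutils.fs.put; the DBFS path is a hypothetical example.

dbutils.fs.put(
    "dbfs:/databricks/scripts/install-astropy.sh",
    """#!/bin/bash
/databricks/python/bin/pip install astropy
""",
    True,  # overwrite the script if it already exists
)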

Uninstall a library from a cluster


NOTE
When you uninstall a library from a cluster, the library is removed only when you restart the cluster. Until you restart the
cluster, the status of the uninstalled library appears as Uninstall pending restart.

To uninstall a library you can start from a cluster or a library:


Cluster
1. Click Compute in the sidebar.
2. Click a cluster name.
3. Click the Libraries tab.
4. Select the checkbox next to the cluster you want to uninstall the library from, click Uninstall , then Confirm .
The Status changes to Uninstall pending restart.
Library
1. Go to the folder containing the library.
2. Click the library name.
3. Select the checkbox next to the cluster you want to uninstall the library from, click Uninstall , then Confirm .
The Status changes to Uninstall pending restart.
4. Click the cluster name to go to the cluster detail page.
Click Restart and Confirm to uninstall the library. The library is removed from the cluster’s Libraries tab.

View the libraries installed on a cluster


1. Click Compute in the sidebar.
2. Click the cluster name.
3. Click the Libraries tab. For each library, the tab displays the name and version, type, install status, and, if
uploaded, the source file.
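If you prefer to check programmatically, the sketch below queries the Libraries API 2.0 for the library statuses on a cluster. The workspace URL, token, and cluster ID are placeholder assumptions.

import requests

host = "https://<databricks-instance>"   # hypothetical workspace URL
token = "<personal-access-token>"

# Retrieve the install status of every library on the cluster.
resp = requests.get(
    f"{host}/api/2.0/libraries/cluster-status",
    headers={"Authorization": f"Bearer {token}"},
    params={"cluster_id": "<cluster-id>"},
)
resp.raise_for_status()

for lib in resp.json().get("library_statuses", []):
    print(lib["library"], lib["status"])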

Update a cluster-installed library


To update a cluster-installed library, uninstall the old version of the library and install a new version.
Notebook-scoped Python libraries
7/21/2022 • 11 minutes to read

Notebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are
specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs
associated with that notebook have access to that library. Other notebooks attached to the same cluster are not
affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the
beginning of each session, or whenever the notebook is detached from a cluster.
There are two methods for installing notebook-scoped libraries:
Run the %pip magic command in a notebook. The %pip command is supported on Databricks Runtime 7.1
and above, and on Databricks Runtime 6.4 ML and above. Databricks recommends using this approach for
new workloads. This article describes how to use these magic commands.
On Databricks Runtime 10.5 and below, you can use the Azure Databricks library utility. The library utility is
supported only on Databricks Runtime, not Databricks Runtime ML or Databricks Runtime for Genomics. See
Library utility (dbutils.library).
To install libraries for all notebooks attached to a cluster, use workspace or cluster-installed libraries.

IMPORTANT
dbutils.library.install and dbutils.library.installPyPI APIs are removed in Databricks Runtime 11.0.

Requirements
Notebook-scoped libraries using magic commands are enabled by default in Databricks Runtime 7.1 and above,
Databricks Runtime 7.1 ML and above, and Databricks Runtime 7.1 for Genomics and above.
They are also available using a configuration setting in Databricks Runtime 6.4 ML to 7.0 ML and Databricks
Runtime 6.4 for Genomics to Databricks Runtime 7.0 for Genomics. Set the Spark configuration
spark.databricks.conda.condaMagic.enabled to true .

On a High Concurrency cluster running Databricks Runtime 7.4 ML or Databricks Runtime 7.4 for Genomics or
below, notebook-scoped libraries are not compatible with table access control or credential passthrough. An
alternative is to use Library utility (dbutils.library) on a Databricks Runtime cluster, or to upgrade your cluster to
Databricks Runtime 7.5 ML or Databricks Runtime 7.5 for Genomics or above.
To use notebook-scoped libraries with Databricks Connect, you must use Library utility (dbutils.library).
Driver node
Using notebook-scoped libraries might result in more traffic to the driver node as it works to keep the
environment consistent across executor nodes.
When you use a cluster with 10 or more nodes, Databricks recommends these specs as a minimum requirement
for the driver node:
For a 100 node CPU cluster, use Standard_DS5_v2.
For a 10 node GPU cluster, use Standard_NC12.
For larger clusters, use a larger driver node.

Install notebook-scoped libraries with %pip

IMPORTANT
You should place all %pip commands at the beginning of the notebook. The notebook state is reset after any %pip
command that modifies the environment. If you create Python methods or variables in a notebook, and then use
%pip commands in a later cell, the methods or variables are lost.
Upgrading, modifying, or uninstalling core Python packages (such as IPython) with %pip may cause some features to
stop working as expected. For example, IPython 7.21 and above are incompatible with Databricks Runtime 8.1 and
below. If you experience such problems, reset the environment by detaching and re-attaching the notebook or by
restarting the cluster.

Manage libraries with %pip commands


The %pip command is equivalent to the pip command and supports the same API. The following sections show
examples of how you can use %pip commands to manage your environment. For more information on
installing Python packages with pip , see the pip install documentation and related pages.
In this section:
Install a library with %pip
Install a wheel package with %pip
Uninstall a library with %pip
Install a library from a version control system with %pip
Install a private package with credentials managed by Databricks secrets with %pip
Install a package from DBFS with %pip
Save libraries in a requirements file
Use a requirements file to install libraries
Install a library with %pip

%pip install matplotlib

Install a wheel package with %pip

%pip install /path/to/my_package.whl

Uninstall a library with %pip

NOTE
You cannot uninstall a library that is included in Databricks Runtime or a library that has been installed as a cluster library.
If you have installed a different library version than the one included in Databricks Runtime or the one installed on the
cluster, you can use %pip uninstall to revert the library to the default version in Databricks Runtime or the version
installed on the cluster, but you cannot use a %pip command to uninstall the version of a library included in Databricks
Runtime or installed on the cluster.

%pip uninstall -y matplotlib


The -y option is required.
Install a library from a version control system with %pip

%pip install git+https://github.com/databricks/databricks-cli

You can add parameters to the URL to specify things like the version or git subdirectory. See the VCS support for
more information and for examples using other version control systems.
Install a private package with credentials managed by Databricks secrets with %pip

Pip supports installing packages from private sources with basic authentication, including private version
control systems and private package repositories, such as Nexus and Artifactory. Secret management is
available via the Databricks Secrets API, which allows you to store authentication tokens and passwords. Use the
DBUtils API to access secrets from your notebook. Note that you can use $variables in magic commands.
To install a package from a private repository, specify the repository URL with the --index-url option to
%pip install or add it to the pip config file at ~/.pip/pip.conf .

token = dbutils.secrets.get(scope="scope", key="key")

%pip install --index-url https://<user>:$token@<your-package-repository>.com/<path/to/repo> <package>==<version> --extra-index-url https://pypi.org/simple/

Similarly, you can use secret management with magic commands to install private packages from version
control systems.

token = dbutils.secrets.get(scope="scope", key="key")

%pip install git+https://<user>:$token@<gitprovider>.com/<path/to/repo>

Install a package from DBFS with %pip

You can use %pip to install a private package that has been saved on DBFS.
When you upload a file to DBFS, it automatically renames the file, replacing spaces, periods, and hyphens with
underscores. pip requires that the name of the wheel file use periods in the version (for example, 0.1.0) and
hyphens instead of spaces or underscores. To install the package with a %pip command, you must rename the
file to meet these requirements.

%pip install /dbfs/mypackage-0.0.1-py3-none-any.whl
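If the uploaded file was renamed on upload, one way to restore a pip-compatible name is to move it with dbutils.fs.mv, as in the sketch below. Both file names are hypothetical examples.

dbutils.fs.mv(
    "dbfs:/FileStore/jars/mypackage_0_0_1_py3_none_any.whl",
    "dbfs:/FileStore/jars/mypackage-0.0.1-py3-none-any.whl",
)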

Save libraries in a requirements file

%pip freeze > /dbfs/requirements.txt

Any subdirectories in the file path must already exist. If you run
%pip freeze > /dbfs/<new-directory>/requirements.txt , the command fails if the directory
/dbfs/<new-directory> does not already exist.
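A minimal sketch of creating the directory first, assuming a hypothetical /dbfs/my-requirements target, is shown below; you can then run %pip freeze > /dbfs/my-requirements/requirements.txt in a later cell.

# Create the target directory on DBFS before freezing the environment.
dbutils.fs.mkdirs("dbfs:/my-requirements")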

Use a requirements file to install libraries


A requirements file contains a list of packages to be installed using pip . An example of using a requirements
file is:

%pip install -r /dbfs/requirements.txt

See Requirements File Format for more information on requirements.txt files.

Manage libraries with %conda commands


IMPORTANT
%conda commands have been deprecated, and will no longer be supported after Databricks Runtime ML 8.4. Databricks
recommends using %pip for managing notebook-scoped libraries. If you require Python libraries that can only be
installed using conda, you can use conda-based docker containers to pre-install the libraries you need.
Anaconda Inc. updated their terms of service for anaconda.org channels in September 2020. Based on the new terms of
service you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda
Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.
As a result of this change, Databricks has removed the default channel configuration for the Conda package manager. This
is a breaking change.
To install or update packages using the %conda command, you must specify a channel using -c . You must also update
all usage of %conda install and %sh conda install to specify a channel using -c . If you do not specify a channel,
conda commands will fail with PackagesNotFoundError .

The %conda command is equivalent to the conda command and supports the same API with some restrictions
noted below. The following sections contain examples of how to use %conda commands to manage your
environment. For more information on installing Python packages with conda , see the conda install
documentation.
Note that %conda magic commands are not available on Databricks Runtime. They are only available on
Databricks Runtime ML up to Databricks Runtime ML 8.4, and on Databricks Runtime for Genomics. Databricks
recommends using pip to install libraries. For more information, see Understanding conda and pip.
If you must use both %pip and %conda commands in a notebook, see Interactions between pip and conda
commands.

NOTE
The following conda commands are not supported when used with %conda :
activate
create
init
run
env create
env remove

In this section:
Install a library with %conda
Uninstall a library with %conda
Save and reuse or share an environment
List the Python environment of a notebook
Interactions between pip and conda commands
Install a library with %conda

%conda install matplotlib -c conda-forge

Uninstall a library with %conda

%conda uninstall matplotlib

Save and reuse or share an environment


When you detach a notebook from a cluster, the environment is not saved. To save an environment so you can
reuse it later or share it with someone else, follow these steps.
Databricks recommends that environments be shared only between clusters running the same version of
Databricks Runtime ML or the same version of Databricks Runtime for Genomics.
1. Save the environment as a conda YAML specification.

%conda env export -f /dbfs/myenv.yml

2. Import the file to another notebook using conda env update .

%conda env update -f /dbfs/myenv.yml

List the Python environment of a notebook


To show the Python environment associated with a notebook, use %conda list :

%conda list

Interactions between pip and conda commands


To avoid conflicts, follow these guidelines when using pip or conda to install Python packages and libraries.
Libraries installed using the API or using the cluster UI are installed using pip . If any libraries have been
installed from the API or the cluster UI, you should use only %pip commands when installing notebook-
scoped libraries.
If you use notebook-scoped libraries on a cluster, init scripts run on that cluster can use either conda or pip
commands to install libraries. However, if the init script includes pip commands, use only %pip commands
in notebooks (not %conda ).
It’s best to use either pip commands exclusively or conda commands exclusively. If you must install some
packages using conda and some using pip , run the conda commands first, and then run the pip
commands. For more information, see Using Pip in a Conda Environment.

Frequently asked questions (FAQ)


How do libraries installed from the cluster UI/API interact with notebook-scoped libraries?
How do libraries installed using an init script interact with notebook-scoped libraries?
Can I use %pip and %conda commands in job notebooks?
Can I use %pip and %conda commands in R or Scala notebooks?
Can I use %sh pip or !pip ?
Can I update R packages using %conda commands?
How do libraries installed from the cluster UI/API interact with notebook-scoped libraries?
Libraries installed from the cluster UI or API are available to all notebooks on the cluster. These libraries are
installed using pip ; therefore, if libraries are installed using the cluster UI, use only %pip commands in
notebooks.
How do libraries installed using an init script interact with notebook-scoped libraries?
Libraries installed using an init script are available to all notebooks on the cluster.
If you use notebook-scoped libraries on a cluster running Databricks Runtime ML or Databricks Runtime for
Genomics, init scripts run on the cluster can use either conda or pip commands to install libraries. However, if
the init script includes pip commands, then use only %pip commands in notebooks.
For example, this notebook code snippet generates a script that installs fast.ai packages on all the cluster nodes.

dbutils.fs.put("dbfs:/home/myScripts/fast.ai", "conda install -c pytorch -c fastai fastai -y", True)

Can I use %pip and %conda commands in job notebooks?


Yes.
Can I use %pip and %conda commands in R or Scala notebooks?
Yes, in a Python magic cell.
Can I use %sh pip or !pip ?
%sh and ! execute a shell command in a notebook; the former is an Azure Databricks auxiliary magic
command while the latter is a feature of IPython. Databricks does not recommend using %sh pip or !pip as
they are not compatible with %pip usage.

NOTE
On Databricks Runtime 11.0 and above, %pip , %sh pip , and !pip all install a library as a notebook-scoped Python
library.

Can I update R packages using %conda commands?


No.

Known issues
On Databricks Runtime 7.0 ML and below as well as Databricks Runtime 7.0 for Genomics and below, if a
registered UDF depends on Python packages installed using %pip or %conda , it won’t work in %sql cells.
Use spark.sql in a Python command shell instead.
On Databricks Runtime 7.2 ML and below as well as Databricks Runtime 7.2 for Genomics and below, when
you update the notebook environment using %conda , the new environment is not activated on worker
Python processes. This can cause issues if a PySpark UDF function calls a third-party function that uses
resources installed inside the Conda environment.
When you use %conda env update to update a notebook environment, the installation order of packages is
not guaranteed. This can cause problems for the horovod package, which requires that tensorflow and
torch be installed before horovod in order to use horovod.tensorflow or horovod.torch respectively. If this
happens, uninstall the horovod package and reinstall it after ensuring that the dependencies are installed.
On Databricks Runtime 10.3 and below, notebook-scoped libraries are incompatible with batch streaming
jobs. Databricks recommends using cluster libraries or the IPython kernel instead.
Notebook-scoped R libraries
7/21/2022 • 2 minutes to read

Notebook-scoped R libraries enable you to create and modify custom R environments that are specific to a
notebook session. When you install an R notebook-scoped library, only the current notebook and any jobs
associated with that notebook have access to that library. Other notebooks attached to the same cluster are not
affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the
beginning of each session, or whenever the notebook is detached from a cluster.
Notebook-scoped libraries are automatically available on workers for SparkR UDFs.
To install libraries for all notebooks attached to a cluster, use workspace or cluster-installed libraries.

Requirements
Notebook-scoped R libraries are enabled by default in Databricks Runtime 9.0 and above.

Install notebook-scoped libraries in R


You can use any familiar method of installing packages in R, such as install.packages(), the Devtools APIs, or
Bioconductor.
Starting with Databricks Runtime 9.0, R packages are accessible to worker nodes as well as the driver node.

Manage notebook-scoped libraries in R


In this section:
Install a package
Remove an R package from a notebook environment
Install a package

install.packages("caesar", repos = "https://cran.microsoft.com/snapshot/2021-07-16/")

Databricks recommends using a CRAN snapshot as the repository to guarantee reproducible results.

devtools::install_github("klutometis/roxygen")

Remove an R package from a notebook environment


To remove a notebook-scoped library from a notebook, use the remove.packages() command.

remove.packages("caesar")

Notebook-scoped R libraries with Spark UDFs


In this section:
Notebook-scoped R libraries and SparkR
Notebook-scoped R libraries and sparklyr
Library isolation and hosted RStudio
Notebook-scoped R libraries and SparkR
Notebook-scoped libraries are available on SparkR workers; just import a library to use it. For example, you can
run the following to generate a caesar-encrypted message with a SparkR UDF:

install.packages("caesar", repos = "https://cran.microsoft.com/snapshot/2021-07-16/")


library(SparkR)
sparkR.session()

hello <- function(x) {
  library(caesar)
  caesar("hello world")
}
spark.lapply(c(1, 2), hello)

Notebook-scoped R libraries and sparklyr


By default, in sparklyr::spark_apply() , the packages argument is set to TRUE . This copies libraries in the
current libPaths to the workers, allowing you to import and use them on workers. For example, you can run
the following to generate a caesar-encrypted message with sparklyr::spark_apply() :

install.packages("caesar", repos = "https://cran.microsoft.com/snapshot/2021-07-16/")


library(sparklyr)
sc <- spark_connect(method = 'databricks')

apply_caes <- function(x) {
  library(caesar)
  caesar("hello world")
}
sdf_len(sc, 5) %>%
  spark_apply(apply_caes)

If you do not want libraries to be available on workers, set packages to FALSE .


Library isolation and hosted RStudio
RStudio creates a separate library path for each user; therefore users are isolated from each other. However, the
library path is not available on workers. If you want to use a package inside SparkR workers in a job launched
from RStudio, you need to install it using cluster-scoped libraries.
Alternatively, if you use sparklyr UDFs, packages installed inside RStudio are available to workers when using
spark_apply(..., packages = TRUE) .

FAQs
How do I install a package on just the driver for all R notebooks?
Explicitly set the installation directory to /databricks/spark/R/lib . For example, with install.packages() , run
install.packages("pckg", lib="/databricks/spark/R/lib") . Packages installed in /databricks/spark/R/lib are
shared across all notebooks on the cluster, but they are not accessible to SparkR workers. If you wish to share
libraries across notebooks and also workers, use cluster-scoped libraries.
Are notebook-scoped libraries cached?
There is no caching implemented for notebook-scoped libraries on a cluster. If you install a package in a
notebook, and another user installs the same package in another notebook on the same cluster, the package is
downloaded, compiled, and installed again.
Databricks File System (DBFS)
7/21/2022 • 11 minutes to read

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the
following benefits:
Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
Allows you to interact with object storage using directory and file semantics instead of storage URLs.
Persists files to object storage, so you won’t lose data after you terminate a cluster.

Important information about DBFS permissions


All users have read and write access to the objects in object storage mounted to DBFS, with the exception of the
DBFS root.

DBFS root
The default storage location in DBFS is known as the DBFS root. Several types of data are stored in the following
DBFS root locations:
/FileStore : Imported data files, generated plots, and uploaded libraries. See Special DBFS root locations.
/databricks-datasets : Sample public datasets. See Special DBFS root locations.
/databricks-results : Files generated by downloading the full results of a query.
/databricks/init : Global and cluster-named (deprecated) init scripts.
/user/hive/warehouse : Data and metadata for non-external Hive tables.

In a new workspace, the DBFS root has the following default folders:

The DBFS root also contains data—including mount point metadata and credentials and certain types of logs—
that is not visible and cannot be directly accessed.
Configuration and usage recommendations
The DBFS root is created during workspace creation.
Data written to mount point paths ( /mnt ) is stored outside of the DBFS root. Even though the DBFS root is
writeable, Databricks recommends that you store data in mounted object storage rather than in the DBFS root.
The DBFS root is not intended for production customer data.
Optional encryption of DBFS root data with a customer-managed key
You can encrypt DBFS root data with a customer-managed key. See Configure customer-managed keys for DBFS
root
Special DBFS root locations
The following articles provide more detail on special DBFS root locations:
FileStore
Sample datasets (databricks-datasets)

Browse DBFS using the UI


You can browse and search for DBFS objects using the DBFS file browser.

NOTE
An admin user must enable the DBFS browser interface before you can use it. See Manage the DBFS file browser.

1. Click Data in the sidebar.


2. Click the DBFS button at the top of the page.
The browser displays DBFS objects in a hierarchy of vertical swimlanes. Select an object to expand the hierarchy.
Use Prefix search in any swimlane to find a DBFS object.

You can also list DBFS objects using the DBFS CLI, DBFS API 2.0, Databricks file system utility (dbutils.fs), Spark
APIs, and local file APIs. See Access DBFS.

Mount object storage to DBFS


Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file
system.
For more information, see Mounting cloud object storage on Azure Databricks.
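As a hedged sketch of one common pattern, the example below mounts an Azure Blob storage container using an account key stored in a Databricks secret scope. The container, storage account, mount name, secret scope, and key names are all hypothetical placeholders.

dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
    },
)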

IMPORTANT
Nested mounts are not supported. For example, the following structure is not supported:
storage1 mounted as /mnt/storage1
storage2 mounted as /mnt/storage1/storage2

Databricks recommends creating separate mount entries for each storage object:
storage1 mounted as /mnt/storage1
storage2 mounted as /mnt/storage2
Access DBFS
IMPORTANT
All users have read and write access to the objects in object storage mounted to DBFS, with the exception of the DBFS
root. For more information, see Important information about DBFS permissions.

You can upload data to DBFS using the file upload interface, and can upload and access DBFS objects using the
DBFS CLI, DBFS API 2.0, Databricks file system utility (dbutils.fs), Spark APIs, and local file APIs.
In an Azure Databricks cluster, you access DBFS objects using the Databricks file system utility, Spark APIs, or local
file APIs. On a local computer you access DBFS objects using the Databricks CLI or the DBFS API.
In this section:
DBFS and local driver node paths
File upload interface
Databricks CLI
dbutils
DBFS API
Spark APIs
Local file APIs
DBFS and local driver node paths
You can work with files on DBFS or on the local driver node of the cluster. You can access the file system using
magic commands such as %fs or %sh . You can also use the Databricks file system utility (dbutils.fs).
Azure Databricks uses a FUSE mount to provide local access to files stored in the cloud. A FUSE mount is a
secure, virtual filesystem.
Access files on DBFS
The path to the default blob storage (root) is dbfs:/ .
The default location for %fs and dbutils.fs is root. Thus, to read from or write to root or an external bucket:

%fs <command> /<path>

dbutils.fs.<command> ("/<path>/")

%sh reads from the local filesystem by default. To access root or mounted paths in root with %sh , preface the
path with /dbfs/ . A typical use case is if you are working with single node libraries like TensorFlow or scikit-
learn and want to read and write data to cloud storage.

%sh <command> /dbfs/<path>/

You can also use single-node filesystem APIs:

import os
os.<command>('/dbfs/tmp')

Examples
# Default location for %fs is root
%fs ls /tmp/
%fs mkdirs /tmp/my_cloud_dir
%fs cp /tmp/test_dbfs.txt /tmp/file_b.txt

# Default location for dbutils.fs is root


dbutils.fs.ls ("/tmp/")
dbutils.fs.put("/tmp/my_new_file", "This is a file in cloud storage.")

# Default location for %sh is the local filesystem


%sh ls /dbfs/tmp/

# Default location for os commands is the local filesystem


import os
os.listdir('/dbfs/tmp')

Access files on the local filesystem


%fs and dbutils.fs read by default from root ( dbfs:/ ). To read from the local filesystem, you must use
file:/ .

%fs <command> file:/<path>


dbutils.fs.<command> ("file:/<path>/")

%sh reads from the local filesystem by default, so do not use file:/ :

%sh <command> /<path>

Examples

# With %fs and dbutils.fs, you must use file:/ to read from local filesystem
%fs ls file:/tmp
%fs mkdirs file:/tmp/my_local_dir
dbutils.fs.ls ("file:/tmp/")
dbutils.fs.put("file:/tmp/my_new_file", "This is a file on the local driver node.")

# %sh reads from the local filesystem by default


%sh ls /tmp

Access files on mounted object storage


Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file
system.
Examples

dbutils.fs.ls("/mnt/mymount")
df = spark.read.format("text").load("dbfs:/mymount/my_file.txt")

Summary table and diagram


The following summary describes the commands in this section and when to use each syntax.

%fs : default location is root; add file:/ to the path to read from the local filesystem.
%sh : default location is the local driver node; add /dbfs to the path to read from root.
dbutils.fs : default location is root; add file:/ to the path to read from the local filesystem.
os.<command> : default location is the local driver node; add /dbfs to the path to read from root.
File upload interface


If you have small data files on your local machine that you want to analyze with Azure Databricks, you can easily
import them to Databricks File System (DBFS) using one of the two file upload interfaces: from the DBFS file
browser or from a notebook.
Files are uploaded to the FileStore directory.
Upload data to DBFS from the file browser

NOTE
This feature is disabled by default. An administrator must enable the DBFS browser interface before you can use it. See
Manage the DBFS file browser.

1. Click Data in the sidebar.


2. Click the DBFS button at the top of the page.
3. Click the Upload button at the top of the page.
4. On the Upload Data to DBFS dialog, optionally select a target directory or enter a new one.
5. In the Files box, drag and drop or use the file browser to select the local file to upload.
Uploaded files are accessible by everyone who has access to the workspace.
Upload data to DBFS from a notebook

NOTE
This feature is enabled by default. If an administrator has disabled this feature, you will not have the option to upload files.

To create a table using the UI, see Upload data and create table in Databricks SQL.
To upload data for use in a notebook, follow these steps.
1. Create a new notebook or open an existing one, then click File > Upload Data

2. Select a target directory in DBFS to store the uploaded file. The target directory defaults to
/shared_uploads/<your-email-address>/ .

Uploaded files are accessible by everyone who has access to the workspace.
3. Either drag files onto the drop target or click Browse to locate files in your local filesystem.
4. When you have finished uploading the files, click Next .
If you’ve uploaded CSV, TSV, or JSON files, Azure Databricks generates code showing how to load the
data into a DataFrame.

To save the text to your clipboard, click Copy .


5. Click Done to return to the notebook.
Databricks CLI
The DBFS command-line interface (CLI) uses the DBFS API 2.0 to expose an easy-to-use command-line interface
to DBFS. Using this client, you can interact with DBFS using commands similar to those you use on a Unix
command line. For example:
# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

For more information about the DBFS command-line interface, see Databricks CLI.
dbutils

dbutils.fs provides file-system-like commands to access files in DBFS. This section has several examples of how
to write files to and read files from DBFS using dbutils.fs commands.

TIP
To access the help menu for DBFS, use the dbutils.fs.help() command.

Write files to and read files from the DBFS root as if it were a local filesystem

dbutils.fs.mkdirs("/foobar/")

dbutils.fs.put("/foobar/baz.txt", "Hello, World!")

dbutils.fs.head("/foobar/baz.txt")

dbutils.fs.rm("/foobar/baz.txt")

Use dbfs:/ to access a DBFS path

display(dbutils.fs.ls("dbfs:/foobar"))

Use %fs magic commands


Notebooks support a shorthand— %fs magic commands—for accessing the dbutils filesystem module. Most
dbutils.fs commands are available using %fs magic commands.

# List the DBFS root

%fs ls

# Recursively remove the files under foobar

%fs rm -r foobar

# Overwrite the file "/mnt/my-file" with the string "Hello world!"

%fs put -f "/mnt/my-file" "Hello world!"

DBFS API
See DBFS API 2.0 and Upload a big file into DBFS.
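As a small sketch of using the DBFS API 2.0 from outside a cluster, the example below lists a DBFS directory over REST. The workspace URL and token are placeholder assumptions.

import requests

host = "https://<databricks-instance>"   # hypothetical workspace URL
token = "<personal-access-token>"

# List the contents of a DBFS directory.
resp = requests.get(
    f"{host}/api/2.0/dbfs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/FileStore"},
)
resp.raise_for_status()

for entry in resp.json().get("files", []):
    print(entry["path"], entry["is_dir"], entry["file_size"])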
Spark APIs
When you’re using Spark APIs, you reference files with "/mnt/training/file.csv" or
"dbfs:/mnt/training/file.csv" . The following example writes the file foo.text to the DBFS /tmp directory.

df.write.format("text").save("/tmp/foo.txt")

When you use the Spark APIs to access DBFS (for example, by calling spark.read ), you must specify the full,
absolute path to the target DBFS location. The path must start from the DBFS root, represented by / or dbfs:/
, which are equivalent. For example, to read a file named people.json in the DBFS location /FileStore , you can
specify either of the following:

df = spark.read.format("json").load('dbfs:/FileStore/people.json')
df.show()

Or:

df = spark.read.format("json").load('/FileStore/people.json')
df.show()

The following does not work:

# This will not work. The path must be absolute; it must start with '/' or 'dbfs:/'.
df = spark.read.format("json").load('FileStore/people.json')
df.show()

Local file APIs


You can use local file APIs to read and write to DBFS paths. Azure Databricks configures each cluster node with a
FUSE mount /dbfs that allows processes running on cluster nodes to read and write to the underlying
distributed storage layer with local file APIs. When using local file APIs, you must provide the path under /dbfs .
For example:
Python

# Write a file to DBFS using Python file system APIs.
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
  f.write("Apache Spark is awesome!\n")
  f.write("End of example!")

# Read the file.
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line)

Scala

import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"

for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}

Local file API limitations


The following lists the local file API limitations that apply to each FUSE version and the corresponding
Databricks Runtime versions.
All : Does not support credential passthrough.
FUSE V2 (default for Databricks Runtime 6.x and 7.x)
Does not support random writes. For workloads that require random writes, perform the
operations on local disk first and then copy the result to /dbfs . For example:

# python
import xlsxwriter
from shutil import copyfile

workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Key")
worksheet.write(0, 1, "Value")
workbook.close()

copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')

Does not support sparse files. To copy sparse files, use cp --sparse=never :

$ cp sparse.file /dbfs/sparse.file
error writing '/dbfs/sparse.file': Operation not supported
$ cp --sparse=never sparse.file /dbfs/sparse.file

FUSE V1 (default for Databricks Runtime 5.5 LTS)

IMPORTANT
If you experience issues with FUSE V1 on Databricks Runtime 5.5 LTS, Databricks recommends that you use FUSE V2 instead.
You can override the default FUSE version in Databricks Runtime 5.5 LTS by setting the environment variable
DBFS_FUSE_VERSION=2 .

Supports only files less than 2GB in size. If you use local file system APIs to read or write files
larger than 2GB you might see corrupted files. Instead, access files larger than 2GB using the DBFS
CLI, dbutils.fs, or Spark APIs or use the /dbfs/ml folder described in Local file APIs for deep
learning.
If you write a file using the local file system APIs and then immediately try to access it using the
DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException , a file of size 0,
or stale file contents. That is expected because the operating system caches writes by default. To
force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix
system call sync. For example:

// scala
import scala.sys.process._

// Write a file using the local file API (over the FUSE mount).
dbutils.fs.put("file:/dbfs/tmp/test", "test-contents")

// Flush to persistent storage.
"sync /dbfs/tmp/test" !

// Read the file using "dbfs:/" instead of the FUSE mount.
dbutils.fs.head("dbfs:/tmp/test")
Local file APIs for deep learning

For distributed deep learning applications, which require DBFS access for loading, checkpointing, and logging
data, Databricks Runtime 6.0 and above provide a high-performance /dbfs mount that’s optimized for deep
learning workloads.
In Databricks Runtime 5.5 LTS, only /dbfs/ml is optimized. In this version Databricks recommends saving data
under /dbfs/ml , which maps to dbfs:/ml .
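For example, a minimal sketch that writes checkpoint data under the optimized path using local file APIs; the directory and file names are illustrative.

import os

# Save checkpoint data under /dbfs/ml, which maps to dbfs:/ml.
checkpoint_dir = "/dbfs/ml/example_model_checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

with open(os.path.join(checkpoint_dir, "epoch_001.txt"), "w") as f:
  f.write("example checkpoint metadata")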
FileStore
7/21/2022 • 3 minutes to read

FileStore is a special folder within Databricks File System (DBFS) where you can save files and have them
accessible to your web browser. You can use FileStore to:
Save files, such as images and libraries, that are accessible within HTML and JavaScript when you call
displayHTML .
Save output files that you want to download to your local desktop.
Upload CSVs and other data files from your local desktop to process on Databricks.
When you use certain features, Azure Databricks puts files in the following folders under FileStore:
/FileStore/jars - contains libraries that you upload. If you delete files in this folder, libraries that reference
these files in your workspace may no longer work.
/FileStore/tables - contains the files that you import using the UI. If you delete files in this folder, tables that
you created from these files may no longer be accessible.
/FileStore/plots - contains images created in notebooks when you call display() on a Python or R plot
object, such as a ggplot or matplotlib plot. If you delete files in this folder, you may have to regenerate
those plots in the notebooks that reference them. See Matplotlib and ggplot2 for more information.
/FileStore/import-stage - contains temporary files created when you import notebooks or Databricks
archive files. These temporary files disappear after the notebook import completes.

Save a file to FileStore


To save a file to FileStore, put it in the /FileStore directory in DBFS:

dbutils.fs.put("/FileStore/my-stuff/my-file.txt", "Contents of my file")

In the following, replace <databricks-instance> with the workspace URL of your Azure Databricks deployment.
Files stored in /FileStore are accessible in your web browser at
https://<databricks-instance>/files/<path-to-file>?o=###### . For example, the file you stored in
/FileStore/my-stuff/my-file.txt is accessible at
https://<databricks-instance>/files/my-stuff/my-file.txt?o=###### , where the number after o= is the same as
in your URL.

NOTE
You can also use the DBFS file upload interfaces to put files in the /FileStore directory. See Databricks CLI.

Embed static images in notebooks


You can use the files/ location to embed static images into your notebooks:

displayHTML("<img src ='files/image.jpg'>")

or Markdown image import syntax:


%md
![my_test_image](files/image.jpg)

You can upload static images using the DBFS API (see the Databricks REST API reference) and the requests Python HTTP
library. In the following example:
Replace <databricks-instance> with the workspace URL of your Azure Databricks deployment.
Replace <token> with the value of your personal access token.
Replace <image-dir> with the location in FileStore where you want to upload the image files.

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
import requests
import json
import os
from base64 import b64encode

TOKEN = '<token>'
headers = {'Authorization': 'Bearer %s' % TOKEN}
url = "https://<databricks-instance>/api/2.0"
dbfs_dir = "dbfs:/FileStore/<image-dir>/"

def perform_query(path, headers, data={}):
  session = requests.Session()
  resp = session.request('POST', url + path, data=json.dumps(data), verify=True, headers=headers)
  return resp.json()

def mkdirs(path, headers):
  _data = {}
  _data['path'] = path
  return perform_query('/dbfs/mkdirs', headers=headers, data=_data)

def create(path, overwrite, headers):
  _data = {}
  _data['path'] = path
  _data['overwrite'] = overwrite
  return perform_query('/dbfs/create', headers=headers, data=_data)

def add_block(handle, data, headers):
  _data = {}
  _data['handle'] = handle
  _data['data'] = data
  return perform_query('/dbfs/add-block', headers=headers, data=_data)

def close(handle, headers):
  _data = {}
  _data['handle'] = handle
  return perform_query('/dbfs/close', headers=headers, data=_data)

def put_file(src_path, dbfs_path, overwrite, headers):
  handle = create(dbfs_path, overwrite, headers=headers)['handle']
  print("Putting file: " + dbfs_path)
  with open(src_path, 'rb') as local_file:
    while True:
      contents = local_file.read(2**20)
      if len(contents) == 0:
        break
      add_block(handle, b64encode(contents).decode(), headers=headers)
    close(handle, headers=headers)

mkdirs(path=dbfs_dir, headers=headers)
files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
  if ".png" in f:
    target_path = dbfs_dir + f
    resp = put_file(src_path=f, dbfs_path=target_path, overwrite=True, headers=headers)
    if resp == None:
      print("Success")
    else:
      print(resp)

Scale static images


To scale the size of an image that you have saved to DBFS, copy the image to /FileStore and then resize using
image parameters in displayHTML :
dbutils.fs.cp('dbfs:/user/experimental/MyImage-1.png','dbfs:/FileStore/images/')
displayHTML('''<img src="files/images/MyImage-1.png" style="width:600px;height:600px;">''')

Use a JavaScript library


This notebook shows how to use FileStore to contain a JavaScript library.
FileStore demo notebook
Get notebook
Mounting cloud object storage on Azure Databricks
7/21/2022 • 3 minutes to read

Azure Databricks enables users to mount cloud object storage to the Databricks File System (DBFS) to simplify
data access patterns for users that are unfamiliar with cloud concepts. Mounted data does not work with Unity
Catalog, and Databricks recommends migrating away from using mounts and managing data governance with
Unity Catalog.

How does Azure Databricks mount cloud object storage?


Azure Databricks mounts create a link between a workspace and cloud object storage, which enables you to
interact with cloud object storage using familiar file paths relative to the Databricks file system. Mounts work by
creating a local alias under the /mnt directory that stores the following information:
Location of the cloud object storage.
Driver specifications to connect to the storage account or container.
Security credentials required to access the data.

What is the syntax for mounting storage?


The source specifies the URI of the object storage (and can optionally encode security credentials). The
mountPoint specifies the local path in the /mnt directory. Some object storage sources support an optional
encryptionType argument, and for some access patterns you can pass additional configuration specifications as
a dictionary to extraConfigs .

mount(
source: str,
mountPoint: str,
encryptionType: Optional[str] = "",
extraConfigs: Optional[dict[str:str]] = None
)
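For example, you can list the mounts configured in a workspace and read data through an existing mount point with standard DBFS paths. This is a sketch; <mount-name> and the file path are placeholders.

# List the mount points configured in this workspace.
display(dbutils.fs.mounts())

# Read data through an existing mount point like any other DBFS path.
df = spark.read.format("csv").option("header", "true").load("/mnt/<mount-name>/<path-to-file>.csv")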

Check with your workspace and cloud administrators before configuring or altering data mounts, as improper
configuration can provide unsecured access to all users in your workspace.

Unmount a mount point


To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")

IMPORTANT
Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount
storage as part of processing.

Mount ADLS Gen2 or Blob Storage with ABFS


You can mount data in an Azure storage account using an Azure Active Directory (Azure AD) application service
principal for authentication. For more information, see Configure access to Azure storage with an Azure Active
Directory service principal.

IMPORTANT
All users in the Azure Databricks workspace have access to the mounted ADLS Gen2 account. The service principal you
use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be
granted access to other Azure resources.
When you create a mount point through a cluster, cluster users can immediately access the mount point. To use the
mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to
make the newly created mount point available for use.
Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount
storage as part of processing.
Mount points that use secrets are not automatically refreshed. If mounted storage relies on a secret that is rotated,
expires, or is deleted, errors can occur, such as 401 Unauthorized . To resolve such an error, you must unmount and
remount the storage.

Run the following in your notebook to authenticate and create a mount point.

configs = {"fs.azure.account.auth.type": "OAuth",


"fs.azure.account.oauth.provider.type":
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-
credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-
id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)

Scala

val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

Replace
<application-id> with the Application (client) ID for the Azure Active Directory application.
<scope-name> with the Databricks secret scope name.
<service-credential-key-name> with the name of the key containing the client secret.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
<container-name> with the name of a container in the ADLS Gen2 storage account.
<storage-account-name> with the ADLS Gen2 storage account name.
<mount-name> with the name of the intended mount point in DBFS.
External Apache Hive metastore
7/21/2022 • 8 minutes to read

This article describes how to set up Azure Databricks clusters to connect to existing external Apache Hive
metastores. It provides information about recommended metastore setup and cluster configuration
requirements, followed by instructions for configuring clusters to connect to an external metastore. The
following table summarizes which Hive metastore versions are supported in each version of Databricks
Runtime.

DATABRICKS RUNTIME VERSION    0.13 - 1.2.1    2.1    2.2    2.3    3.1.0

7.x                           Yes             Yes    Yes    Yes    Yes

6.x                           Yes             Yes    Yes    Yes    Yes

5.3 and above                 Yes             Yes    Yes    Yes    Yes

5.1 - 5.2 and 4.x             Yes             Yes    Yes    Yes    No

3.x                           Yes             Yes    No     No     No

IMPORTANT
While SQL Server works as the underlying metastore database for Hive 2.0 and above, the examples throughout this
article use Azure SQL Database.
You can use a Hive 1.2.0 or 1.2.1 metastore of an HDInsight cluster as an external metastore. See Use external
metadata stores in Azure HDInsight.
If you use Azure Database for MySQL as an external metastore, you must change the value of the
lower_case_table_names property from 1 (the default) to 2 in the server-side database configuration. For details,
see Identifier Case Sensitivity.

Hive metastore setup


The metastore client running inside a cluster connects to your underlying metastore database directly using
JDBC.
To test network connectivity from a cluster to the metastore, you can run the following command inside a
notebook:

%sh
nc -vz <DNS name> <port>

where
<DNS name> is the server name of Azure SQL Database.
<port> is the port of the database.
Cluster configurations
You must set two sets of configuration options to connect a cluster to an external metastore:
Spark options configure Spark with the Hive metastore version and the JARs for the metastore client.
Hive options configure the metastore client to connect to the external metastore.
Spark configuration options
Set spark.sql.hive.metastore.version to the version of your Hive metastore and spark.sql.hive.metastore.jars
as follows:
Hive 0.13: do not set spark.sql.hive.metastore.jars .
Hive 1.2.0 or 1.2.1 (Databricks Runtime 6.6 and below): set spark.sql.hive.metastore.jars to builtin .

NOTE
Hive 1.2.0 and 1.2.1 are not the built-in metastore on Databricks Runtime 7.0 and above. If you want to use Hive
1.2.0 or 1.2.1 with Databricks Runtime 7.0 and above, follow the procedure described in Download the metastore
jars and point to them.

Hive 2.3.7 (Databricks Runtime 7.0 - 9.x) or Hive 2.3.9 (Databricks Runtime 10.0 and above): set
spark.sql.hive.metastore.jars to builtin .

For all other Hive versions, Azure Databricks recommends that you download the metastore JARs and set
the configuration spark.sql.hive.metastore.jars to point to the downloaded JARs using the procedure
described in Download the metastore jars and point to them.
Download the metastore jars and point to them
1. Create a cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version
to match the version of your metastore.
2. When the cluster is running, search the driver log and find a line like the following:

17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>

The directory <path> is the location of downloaded JARs in the driver node of the cluster.
Alternatively you can run the following code in a Scala notebook to print the location of the JARs:

import com.typesafe.config.ConfigFactory
val path = ConfigFactory.load().getString("java.io.tmpdir")

println(s"\nHive JARs are downloaded to the path: $path \n")

3. Run %sh cp -r <path> /dbfs/hive_metastore_jar (replacing <path> with your cluster’s info) to copy this
directory to a directory in DBFS called hive_metastore_jar through the FUSE client in the driver node.
4. Create an init script that copies /dbfs/hive_metastore_jar to the local filesystem of the node, making sure
the init script sleeps a few seconds before it accesses the DBFS FUSE client. This ensures that the
client is ready. (A sketch of such an init script appears after these steps.)
5. Set spark.sql.hive.metastore.jars to use this directory. If your init script copies
/dbfs/hive_metastore_jar to /databricks/hive_metastore_jars/ , set spark.sql.hive.metastore.jars to
/databricks/hive_metastore_jars/* . The location must include the trailing /* .
6. Restart the cluster.
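The following is a minimal sketch of the init script described in step 4, created with dbutils.fs.put from a notebook. The script name, target directory, and sleep duration are illustrative; adjust them so the target directory matches the spark.sql.hive.metastore.jars value you set in step 5.

dbutils.fs.put(
  "/databricks/scripts/copy-metastore-jars.sh",
  """#!/bin/bash
# Wait briefly so the DBFS FUSE client is ready before reading from /dbfs.
sleep 10
mkdir -p /databricks/hive_metastore_jars
cp -r /dbfs/hive_metastore_jar/* /databricks/hive_metastore_jars/
""",
  overwrite = True)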
Hive configuration options
This section describes options specific to Hive.
To connect to an external metastore using local mode, set the following Hive configuration options:

# JDBC connect string for a JDBC metastore


javax.jdo.option.ConnectionURL <mssql-connection-string>

# Username to use against metastore database


javax.jdo.option.ConnectionUserName <mssql-username>

# Password to use against metastore database


javax.jdo.option.ConnectionPassword <mssql-password>

# Driver class name for a JDBC metastore


javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver

where
<mssql-connection-string> is the JDBC connection string (which you can get in the Azure portal). You do not
need to include username and password in the connection string, because these will be set by
javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword .
<mssql-username> and <mssql-password> specify the username and password of your Azure SQL Database
account that has read/write access to the database.

NOTE
For production environments, we recommend that you set hive.metastore.schema.verification to true . This
prevents Hive metastore client from implicitly modifying the metastore database schema when the metastore client
version does not match the metastore database version. When enabling this setting for metastore client versions lower
than Hive 1.2.0, make sure that the metastore client has the write permission to the metastore database (to prevent the
issue described in HIVE-9749).
For Hive metastore 1.2.0 and higher, set hive.metastore.schema.verification.record.version to true to
enable hive.metastore.schema.verification .
For Hive metastore 2.1.1 and higher, set hive.metastore.schema.verification.record.version to true as it is
set to false by default.

Set up an external metastore using the UI


To set up an external metastore using the Azure Databricks UI:
1. Click the Clusters button on the sidebar.
2. Click Create Cluster .
3. Enter the following Spark configuration options:
# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore
client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL <mssql-connection-string>

# Username to use against metastore database


spark.hadoop.javax.jdo.option.ConnectionUserName <mssql-username>

# Password to use against metastore database


spark.hadoop.javax.jdo.option.ConnectionPassword <mssql-password>

# Driver class name for a JDBC metastore


spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver

# Spark specific configuration options


spark.sql.hive.metastore.version <hive-version>
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars <hive-jar-source>

4. Continue your cluster configuration, following the instructions in Configure clusters.


5. Click Create Cluster to create the cluster.

Set up an external metastore using an init script


Init scripts let you connect to an existing Hive metastore without manually setting required configurations.
1. Create the base directory you want to store the init script in if it does not exist. The following example uses
dbfs:/databricks/scripts .
2. Run the following snippet in a notebook. The snippet creates the init script
/databricks/scripts/external-metastore.sh in Databricks File System (DBFS). Alternatively, you can use the
DBFS REST API’s put operation to create the init script. This init script writes required configuration options to
a configuration file named 00-custom-spark.conf in a JSON-like format under /databricks/driver/conf/
inside every node of the cluster, whenever a cluster with the name specified as <cluster-name> starts. Azure
Databricks provides default Spark configurations in the /databricks/driver/conf/spark-branch.conf file.
Configuration files in the /databricks/driver/conf directory apply in reverse alphabetical order. If you want
to change the name of the 00-custom-spark.conf file, make sure that it continues to apply before the
spark-branch.conf file.

Scala
dbutils.fs.put(
"/databricks/scripts/external-metastore.sh",
"""#!/bin/sh
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| # JDBC connect string for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "<mssql-connection-string>"
|
| # Username to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mssql-username>"
|
| # Password to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mssql-password>"
|
| # Driver class name for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" =
"com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "<hive-version>"
| # Skip this one if <hive-version> is 0.13.x.
| "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
|}
|EOF
|""".stripMargin,
overwrite = true
)

Python
contents = """#!/bin/sh
# Loads environment variables to determine the correct JDBC driver to use.
source /etc/environment
# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
# Hive specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore
client.
# JDBC connect string for a JDBC metastore
"spark.hadoop.javax.jdo.option.ConnectionURL" = "<mssql-connection-string>"

# Username to use against metastore database


"spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mssql-username>"

# Password to use against metastore database


"spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mssql-password>"

# Driver class name for a JDBC metastore


"spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

# Spark specific configuration options


"spark.sql.hive.metastore.version" = "<hive-version>"
# Skip this one if <hive-version> is 0.13.x.
"spark.sql.hive.metastore.jars" = "<hive-jar-source>"
}
EOF
"""

dbutils.fs.put(
file = "/databricks/scripts/external-metastore.sh",
contents = contents,
overwrite = True
)

3. Configure your cluster with the init script.

4. Restart the cluster.

Troubleshooting
Clusters do not start (due to incorrect init script settings)
If an init script for setting up the external metastore causes cluster creation failure, configure the init script to
log, and debug the init script using the logs.
Error in SQL statement: InvocationTargetException
Error message pattern in the full exception stack trace:

Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given


database. JDBC url = [...]

External metastore JDBC connection information is misconfigured. Verify the configured hostname, port,
username, password, and JDBC driver class name. Also, make sure that the username has the right
privilege to access the metastore database.
Error message pattern in the full exception stack trace:

Required table missing : "`DBS`" in Catalog "" Schema "". DataNucleus requires this table to perform
its persistence operations. [...]
External metastore database not properly initialized. Verify that you created the metastore database and
put the correct database name in the JDBC connection string. Then, start a new cluster with the following
two Spark configuration options:

datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false

In this way, the Hive client library will try to create and initialize tables in the metastore database
automatically when it tries to access them but finds them absent.
Error in SQL statement: AnalysisException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetastoreClient
Error message in the full exception stacktrace:

The specified datastore driver (driver name) was not found in the CLASSPATH

The cluster is configured to use an incorrect JDBC driver.


Setting datanucleus.autoCreateSchema to true doesn’t work as expected
By default, Databricks also sets datanucleus.fixedDatastore to true , which prevents any accidental structural
changes to the metastore databases. Therefore, the Hive client library cannot create metastore tables even if you
set datanucleus.autoCreateSchema to true . This strategy is, in general, safer for production environments since
it prevents the metastore database from being accidentally upgraded.
If you do want to use datanucleus.autoCreateSchema to help initialize the metastore database, make sure you set
datanucleus.fixedDatastore to false . Also, you may want to flip both flags after initializing the metastore
database to provide better protection to your production environment.
Migrate production workloads to Azure Databricks
7/21/2022 • 7 minutes to read

This guide explains how to move your production jobs from Apache Spark on other platforms to Apache Spark
on Azure Databricks.

Concepts
Databricks job
A single unit of code that you can bundle and submit to Azure Databricks. An Azure Databricks job is equivalent
to a Spark application with a single SparkContext . The entry point can be in a library (for example, JAR, egg,
wheel) or a notebook. You can run Azure Databricks jobs on a schedule with sophisticated retries and alerting
mechanisms. The primary interfaces for running jobs are the Jobs API and UI.
Pool
A set of instances in your account that are managed by Azure Databricks but incur no Azure Databricks charges
when they are idle. Submitting multiple jobs on a pool ensures your jobs start quickly. You can set guardrails
(instance types, instance limits, and so on) and autoscaling policies for the pool of instances. A pool is equivalent
to an autoscaling cluster on other Spark platforms.

Migration steps
This section provides the steps for moving your production jobs to Azure Databricks.
Step 1: Create a pool
Create an autoscaling pool. This is equivalent to creating an autoscaling cluster in other Spark platforms. On
other platforms, if instances in the autoscaling cluster are idle for a few minutes or hours, you pay for them.
Azure Databricks manages the instance pool for you for free. That is, you don’t pay Azure Databricks if these
machines are not in use; you pay only the cloud provider. Azure Databricks charges only when jobs are run on
the instances.
Key configurations:
Min Idle : Number of standby instances, not in use by jobs, that the pool maintains. You can set this to 0.
Max Capacity : This is an optional field. If you already have cloud provider instance limits set, you can leave
this field empty. If you want to set additional max limits, set a high value so that a large number of jobs can
share the pool.
Idle Instance Auto Termination : The instances over Min Idle are released back to the cloud provider if
they are idle for the specified period. The higher the value, the more the instances are kept ready and thereby
your jobs will start faster.
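As a sketch, the key configurations above correspond to fields in the Instance Pools API. The following hypothetical example creates a pool over REST with the requests library; the workspace URL, token, pool name, and node type are placeholders.

import requests

# Hypothetical sketch: create a pool with the Instance Pools API (/api/2.0/instance-pools/create).
resp = requests.post(
  "https://<databricks-instance>/api/2.0/instance-pools/create",
  headers={"Authorization": "Bearer <token>"},
  json={
    "instance_pool_name": "jobs-pool",
    "node_type_id": "<node-type-id>",
    "min_idle_instances": 0,                      # Min Idle
    "max_capacity": 100,                          # Max Capacity (optional)
    "idle_instance_autotermination_minutes": 15   # Idle Instance Auto Termination
  })
print(resp.json())  # The response includes the instance_pool_id to reference in job cluster specs.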
Step 2: Run a job on a pool
You can run a job on a pool using the Jobs API or the UI. You must run each job by providing a cluster spec.
When a job is about to start, Azure Databricks automatically creates a new cluster from the pool. The cluster is
automatically terminated when the job finishes. You are charged exactly for the amount of time your job was
run. This is the most cost-effective way to run jobs on Azure Databricks. Each new cluster has:
One associated SparkContext , which is equivalent to a Spark application on other Spark platforms.
A driver node and a specified number of workers. For a single job, you can specify a worker range. Azure
Databricks autoscales a single Spark job based on the resources needed for that job. Azure Databricks
benchmarks show that this can save you up to 30% on cloud costs, depending on the nature of your job.
There are three ways to run jobs on a pool: API/CLI, Airflow, UI.
API / CLI
1. Download and configure the Databricks CLI.
2. Run the following command to submit your code one time. The API returns a URL that you can use to
track the progress of the job run.
databricks runs submit --json

{
"run_name": "my spark job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",

"instance_pool_id": "0313-121005-test123-pool-ABCD1234",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
}

],
"timeout_seconds": 3600,
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}

3. To schedule a job, use the following example. Jobs created through this mechanism are displayed in the
jobs list page. The return value is a job_id that you can use to look at the status of all the runs.

databricks jobs create --json

{
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
...
"instance_pool_id": "0313-121005-test123-pool-ABCD1234",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
}
],
"email_notifications": {
"on_start": ["john@foo.com"],
"on_success": ["sally@foo.com"],
"on_failure": ["bob@foo.com"]
},
"timeout_seconds": 3600,
"max_retries": 2,
"schedule": {
"quartz_cron_expression": "0 15 22 ? \* \*",
"timezone_id": "America/Los_Angeles"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}

If you use spark-submit to submit Spark jobs, the following table shows how spark-submit parameters map to
different arguments in the Create a new job operation ( POST /jobs/create ) in the Jobs API.

SPARK-SUBMIT PARAMETER / HOW IT APPLIES ON AZURE DATABRICKS

–class : Use the spark_jar_task structure to provide the main class name and the parameters.

–jars : Use the libraries argument to provide the list of dependencies.

–py-files : For Python jobs, use the spark_python_task structure. You can use the libraries argument to provide egg or wheel dependencies.

–master : In the cloud, you don’t need to manage a long-running master node. All the instances and jobs are managed by Azure Databricks services. Ignore this parameter.

–deploy-mode : Ignore this parameter on Azure Databricks.

–conf : In the new_cluster structure, use the spark_conf argument.

–num-executors : In the new_cluster structure, use the num_workers argument. You can also use the autoscale option to provide a range (recommended).

–driver-memory, –driver-cores : Based on the driver memory and cores you need, choose an appropriate instance type. You provide the instance type for the driver during pool creation. Ignore this parameter during job submission.

–executor-memory, –executor-cores : Based on the executor memory you need, choose an appropriate instance type. You provide the instance type for the workers during pool creation. Ignore this parameter during job submission.

–driver-class-path : Set spark.driver.extraClassPath to the appropriate value in the spark_conf argument.

–driver-java-options : Set spark.driver.extraJavaOptions to the appropriate value in the spark_conf argument.

–files : Set spark.files to the appropriate value in the spark_conf argument.

–name : In the submit job run request ( POST /jobs/runs/submit ), use the run_name argument. In the create job request ( POST /jobs/create ), use the name argument.

Airflow
Azure Databricks offers an Airflow operator if you want to use Airflow to submit jobs in Azure Databricks. The
Databricks Airflow operator calls the Trigger a new job run operation ( POST /jobs/run-now ) of the Jobs API to
submit jobs to Azure Databricks. See Apache Airflow.
UI
Azure Databricks provides a simple, intuitive UI to submit and schedule jobs. To create and
submit jobs from the UI, follow the step-by-step guide.
Step 3: Troubleshoot jobs
Azure Databricks provides lots of tools to help you troubleshoot your jobs.
Access logs and Spark UI
Azure Databricks maintains a fully managed Spark history server to allow you to access all the Spark logs and
Spark UI for each job run. They can be accessed from the job runs page as well as the job run details page:

Forward logs
You can also forward cluster logs to your cloud storage location. To send logs to your location of choice, use the
cluster_log_conf parameter in the new_cluster structure.
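For example, a minimal sketch of a new_cluster specification that ships cluster logs to a DBFS location; the destination path and pool ID are placeholders.

# Sketch: include cluster_log_conf in the new_cluster structure of a Jobs API request.
new_cluster = {
  "spark_version": "7.3.x-scala2.12",
  "instance_pool_id": "<instance-pool-id>",
  "num_workers": 10,
  "cluster_log_conf": {
    "dbfs": {"destination": "dbfs:/cluster-logs"}
  }
}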

View metrics
While the job is running, you can go to the cluster page and look at the live Ganglia metrics in the Metrics tab.
Azure Databricks also snapshots these metrics every 15 minutes and stores them, so you can look at these
metrics even after your job is completed. To send metrics to your metrics server, you can install custom agents in
the cluster. See Monitor performance.

Set alerts
Use email_notifications in the Create a new job operation ( POST /jobs/create ) in the Jobs API to get alerts on
job failures.
You can also forward these email alerts to PagerDuty, Slack, and other monitoring systems.
How to set up PagerDuty alerts with emails
How to set up Slack notification with emails

Frequently asked questions (FAQs)


Can I run jobs without a pool?
Yes. Pools are optional. You can directly run jobs on a new cluster. In such cases, Azure Databricks creates the
cluster by asking the cloud provider for the required instances. With pools, cluster startup time will be around
30s if instances are available in the pool.
What is a notebook job?
Azure Databricks has different job types—JAR, Python, notebook. A notebook job type runs code in the specified
notebook. See Notebook job tips.
When should I use a notebook job compared to a JAR job?
A JAR job is equivalent to a spark-submit job. It executes the JAR and then you can look at the logs and Spark UI
for troubleshooting. A notebook job executes the specified notebook. You can import libraries in a notebook and
call your libraries from the notebook too. The advantage of using a notebook job as the main entry point is you
can easily debug your production jobs’ intermediate results in the notebook output area. See JAR jobs.
Can I connect to my own Hive metastore?
Yes, Azure Databricks supports external Hive metastores. See External Apache Hive metastore.
Migrate single node workloads to Azure Databricks
7/21/2022 • 2 minutes to read

This article answers typical questions that come up when you migrate single node workloads to Azure
Databricks.
I just created a 20 node Spark cluster and my pandas code doesn’t run any faster. What is going
wrong?
If you are working with any single-node libraries, they will not inherently become distributed when you switch
to using Azure Databricks. You will need to re-write your code using PySpark, the Apache Spark Python API.
Alternatively, you can use Pandas API on Spark, which allows you to use the pandas DataFrame API to access
data in Apache Spark DataFrames.
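For example, a minimal sketch of the pandas API on Spark, assuming a Databricks Runtime where pyspark.pandas is available; the dataset path reuses the sample Parquet file shown later in this guide.

import pyspark.pandas as ps

# pandas-style operations that execute on Spark under the hood.
psdf = ps.read_parquet("/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet")
print(psdf["paid_amnt"].mean())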
There is an algorithm in sklearn that I love, but Spark ML doesn’t support it (such as DBSCAN).
How can I use this algorithm and still take advantage of Spark?
Use joblib-spark, an Apache Spark backend for joblib to distribute tasks on a Spark cluster.
Use a pandas user-defined function.
For hyperparameter tuning, use Hyperopt.
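For example, a sketch of distributing a scikit-learn grid search with the joblib-spark backend, assuming the joblib-spark package is installed on the cluster.

from joblib import parallel_backend
from joblibspark import register_spark
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

register_spark()  # Register the "spark" backend for joblib.

X, y = datasets.load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

with parallel_backend("spark", n_jobs=3):
  search.fit(X, y)   # Cross-validation tasks run as Spark tasks.

print(search.best_params_)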
What are my deployment options for Spark ML?
The best deployment option depends on the latency requirement of the application.
For batch predictions, see Deploy models for inference and prediction.
For streaming applications, see What is Apache Spark Structured Streaming?.
For low-latency model inference, consider MLflow Model Serving or a cloud provider-based solution
such as Azure Machine Learning.
How can I install or update pandas or another library?
There are several ways to install or update a library.
To install or update a library for all users on a cluster, see Cluster libraries.
To make a Python library or a library version available only for a specific notebook, see Notebook-scoped
Python libraries.
How can I view data on DBFS with just the driver?
Add /dbfs/ to the beginning of the file path. See Local file APIs.
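For example, reusing the file written in the local file API example earlier in this article:

# Prefix a DBFS path with /dbfs/ to read it with driver-only (local) file APIs.
with open("/dbfs/tmp/test_dbfs.txt", "r") as f:
  print(f.read())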
How can I get data into Azure Databricks?
Mounting. See Mount object storage to DBFS.
Data tab. See Explore and create tables with the Data tab.
%sh wget

If you have a data file at a URL, you can use %sh wget <url>/<filename> to import data to a Spark
driver node.
NOTE
The cell output prints Saving to: '<filename>' , but the file is actually saved to
file:/databricks/driver/<filename> .

For example, if you download the file
https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD with the command:

%sh wget https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD

To load this data, run:

pandas_df = pd.read_csv("file:/databricks/driver/rows.csv?accessType=DOWNLOAD", header='infer')


Migration guide
7/21/2022 • 3 minutes to read

Migrate workloads to Delta Lake


When you migrate workloads to Delta Lake, you should be aware of the following simplifications and
differences compared with the data sources provided by Apache Spark and Apache Hive.
Delta Lake handles the following operations automatically, which you should never perform manually:
REFRESH TABLE: Delta tables always return the most up-to-date information, so there is no need to
manually call REFRESH TABLE after changes.
Add and remove partitions : Delta Lake automatically tracks the set of partitions present in a table and
updates the list as data is added or removed. As a result, there is no need to run
ALTER TABLE [ADD|DROP] PARTITION or MSCK .

Load a single partition : As an optimization, you may sometimes directly load the partition of data you
are interested in. For example, spark.read.format("parquet").load("/data/date=2017-01-01") . This is
unnecessary with Delta Lake, since it can quickly read the list of files from the transaction log to find the
relevant ones. If you are interested in a single partition, specify it using a WHERE clause. For example,
spark.read.format("delta").load("/data").where("date = '2017-01-01'") . For large tables with many files in the partition,
this can be much faster than loading a single partition (with direct partition path, or with WHERE ) from a
Parquet table because listing the files in the directory is often slower than reading the list of files from the
transaction log.
When you port an existing application to Delta Lake, you should avoid the following operations, which bypass
the transaction log:
Manually modify data : Delta Lake uses the transaction log to atomically commit changes to the table.
Because the log is the source of truth, files that are written out but not added to the transaction log are
not read by Spark. Similarly, even if you manually delete a file, a pointer to the file is still present in the
transaction log. Instead of manually modifying files stored in a Delta table, always use the commands that
are described in this guide.
External readers : The data stored in Delta Lake is encoded as Parquet files. However, accessing these
files using an external reader is not safe. You’ll see duplicates and uncommitted data and the read may
fail when someone runs Remove files no longer referenced by a Delta table.

NOTE
Because the files are encoded in an open format, you always have the option to move the files outside Delta Lake. At that
point, you can run VACUUM RETAIN 0 and delete the transaction log. This leaves the table’s files in a consistent state that
can be read by the external reader of your choice.

Example
Suppose you have Parquet data stored in a directory named /data-pipeline , and you want to create a Delta
table named events .
The first example shows how to:
Read the Parquet data from its original location, /data-pipeline , into a DataFrame.
Save the DataFrame’s contents in Delta format in a separate location, /tmp/delta/data-pipeline/ .
Create the events table based on that separate location, /tmp/delta/data-pipeline/ .
The second example shows how to use CONVERT TO DELTA to convert data from Parquet to Delta format without
changing its original location, /data-pipeline/ .
Each of these examples creates an unmanaged table, where you continue to manage the data in its specified
location. Azure Databricks records the table’s name and its specified location in the metastore.
Save as Delta table
1. Read the Parquet data into a DataFrame and then save the DataFrame’s contents to a new directory in
delta format:

data = spark.read.format("parquet").load("/data-pipeline")
data.write.format("delta").save("/tmp/delta/data-pipeline/")

2. Create a Delta table named events that refers to the files in the new directory:

spark.sql("CREATE TABLE events USING DELTA LOCATION '/tmp/delta/data-pipeline/'")

Convert to Delta table


You have three options for converting a Parquet table to a Delta table:
Convert files to Delta Lake format and then create a Delta table:

CONVERT TO DELTA parquet.`/data-pipeline/`


CREATE TABLE events USING DELTA LOCATION '/data-pipeline/'

Create a Parquet table and then convert it to a Delta table:

CREATE TABLE events USING PARQUET OPTIONS (path '/data-pipeline/')


CONVERT TO DELTA events

Convert a Parquet table to a Delta table:

CONVERT TO DELTA events

This assumes that the table named events is a Parquet table.

You can also convert Iceberg tables to Delta Lake using the file path in the cloud storage location:

CONVERT TO DELTA iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`;

For details, see Convert an Iceberg table to a Delta table.


Ingest data into the Azure Databricks Lakehouse
7/21/2022 • 2 minutes to read

Azure Databricks offers a variety of ways to help you ingest data into a lakehouse backed by Delta Lake.

Upload CSV files


You can securely upload local CSV files to create tables using Databricks SQL. See Upload data and create table
in Databricks SQL.

Partner integrations
Databricks partner integrations enable you to load data into Azure Databricks. These integrations enable low-
code, scalable data ingestion from a variety of sources into Azure Databricks. See Databricks integrations.

COPY INTO
Load data with COPY INTO allows SQL users to idempotently and incrementally load data from cloud object
storage into Delta Lake tables. It can be used in Databricks SQL, notebooks, and Databricks Jobs.

Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without
additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles . Given an input
directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive,
with the option of also processing existing files in that directory.
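For example, a minimal sketch of an Auto Loader stream that ingests JSON files into a Delta location; the input, schema, checkpoint, and output paths are placeholders.

# Incrementally read new JSON files from a cloud storage directory with the cloudFiles source.
stream = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<schema-location-path>")
  .load("<input-directory-path>"))

# Continuously write the ingested records to a Delta location.
(stream.writeStream
  .format("delta")
  .option("checkpointLocation", "<checkpoint-path>")
  .start("<output-delta-path>"))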

When to use COPY INTO and when to use Auto Loader


Here are a few things to consider when choosing between Auto Loader and COPY INTO:
If you’re going to ingest files in the order of thousands, you can use COPY INTO . If you are expecting files in
the order of millions or more over time, use Auto Loader. Auto Loader requires fewer total operations to
discover files compared to COPY INTO and can split the processing into multiple batches, meaning that Auto
Loader is less expensive and more efficient at scale.
If your data schema is going to evolve frequently, Auto Loader provides better primitives around schema
inference and evolution. See Configuring schema inference and evolution in Auto Loader for more details.
Loading a subset of re-uploaded files can be a bit easier to manage with COPY INTO. With Auto Loader, it’s
harder to reprocess a select subset of files. However, you can use COPY INTO to reload the subset of files
while an Auto Loader stream is running simultaneously.
For a brief overview and demonstration of Auto Loader, as well as COPY INTO, watch this YouTube video (2
minutes).

Use the Data tab to load data


The Data Science & Engineering workspace Data tab allows you to use the UI to load small files to create tables;
see Explore and create tables with the Data tab.
Use Apache Spark to load data from external sources
You can connect to a variety of data sources using Apache Spark. See Data sources for a list of options and
examples for connecting.

Review file metadata captured during data ingestion


Apache Spark automatically captures data about source files during data loading. Azure Databricks lets you
access this data with the File metadata column.
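For example, a sketch that selects the hidden _metadata column during a read, assuming a Databricks Runtime version that supports the file metadata column; the dataset path reuses the sample Parquet file shown later in this guide.

# Select the hidden _metadata column alongside the data columns.
df = (spark.read.format("parquet")
  .load("/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet")
  .select("*", "_metadata"))
display(df.select("_metadata.file_path", "_metadata.file_name"))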
Upload data and create table in Databricks SQL
7/21/2022 • 4 minutes to read

The Databricks SQL create table UI allows you to quickly upload a CSV file and create a Delta table.

NOTE
For loading files from cloud storage such as Azure Data Lake Storage Gen2, AWS S3, or Google Cloud Storage, check out
the tutorial on COPY INTO.

Types of target tables


Create table in Databricks SQL can create managed Delta tables in Unity Catalog or in the Hive Metastore.

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Requirements
To use Create table in Databricks SQL with Unity Catalog, you need a metastore, catalog, and schema.
For Unity Catalog, you must also have the USAGE permission on the parent catalog of the selected
schema.
If your workspace is assigned to a Unity Catalog metastore, you can still create tables under schemas
in the Hive metastore.
You need USAGE and CREATE permissions on the schema you want to create a table in.
You must have a running SQL Warehouse.

Create a table using CSV upload


You can use the UI to create a Delta table by importing small CSV files to Databricks SQL from your local
machine.
The upload UI supports uploading a single file at a time under 100 megabytes.
The file must be a CSV and have an extension “.csv”.
Compressed files such as zip and tar files are not supported.
Upload the file
1. Navigate to the SQL persona by using the persona switcher.
To change the persona, click the icon below the Databricks logo , and select a persona.
2. Click Create in the sidebar and select Table from the menu.
3. The Create table in Databricks SQL page appears.
4. To start an upload, click the file browser button or drag-and-drop files directly on the drop zone.

NOTE
Imported files are uploaded to a secure internal location within your account which is garbage collected daily.

Table name selection


Upon completion of upload, you can select the destination for your data.

1. For workspaces that are assigned to a Unity Catalog metastore, you can select a catalog. If your workspace is
not assigned to a Unity Catalog metastore, the destination catalog will be hidden, and schemas will be loaded
from the Hive metastore.
To use the Hive metastore in a workspace that has been assigned to a Unity Catalog metastore, select
hive_metastore in the catalog selector.
2. Select a schema.
3. By default, the UI converts the file name to a valid table name. You can edit the table name.
Data preview
After the file upload is complete, you can preview the data (limit of 50 rows).
After the upload, the UI tries to start the endpoint selected in the top right. You can switch endpoints at any
time, but the preview and table creation require an active endpoint. If your endpoint is not active yet, it starts
automatically. This may take some time. The preview starts when your endpoint is running.

There are two ways to preview the data, vertically or horizontally. To switch between preview options, click
the toggle button above the table.


Format options
Depending on the file format uploaded, different options are available. Common format options appear in the
header bar, while less commonly used options are available in the Advanced attributes modal.
For CSV, the following options are available.
First row contains the header (enabled by default): This option specifies whether the CSV file
contains a header.
Column delimiter : The separator character between columns. Only a single character is allowed, and
backslash is not supported. This defaults to comma for CSV files.
Automatically detect column types (enabled by default): Automatically detect column types from
file content. You can edit types in the preview table. If this is set to false, all column types are inferred
as STRING .
Rows span multiple lines (disabled by default): Whether a column’s value can span multiple lines in
the file.
The data preview updates automatically when you edit format options.
Column headers and types

You can edit column header names and types.


To edit types, click the icon with the type.
To edit the column name, click the input box at the top of the column.
Column names do not support commas, backslashes, or unicode characters (such as emojis).
For CSV files, the column data types are inferred by default. You can interpret all columns as STRING type by
disabling Advanced attributes > Automatically detect column types .

NOTE
Schema inference does a best effort detection of column types. Changing column types may lead to certain values
being cast to NULL if the value cannot be cast correctly to the target data type. Casting BIGINT to DATE or
TIMESTAMP columns is not supported. Databricks recommends that you create a table first and then transform these
columns using SQL functions afterwards.
To support table column names with special characters, the create table via upload UI in Databricks SQL uses Column
Mapping.
To add comments to columns, create the table and navigate to Data Explorer where you can add comments.

Supported data types


Create table using CSV upload supports the following data types. For more information about individual data
types see SQL data types.

DATA TYPE      DESCRIPTION

BIGINT         8-byte signed integer numbers.

BOOLEAN        Boolean ( true , false ) values.

DATE           Values comprising values of fields year, month, and day, without a time-zone.

DOUBLE         8-byte double-precision floating point numbers.

STRING         Character string values.

TIMESTAMP      Values comprising values of fields year, month, day, hour, minute, and second, with the session local timezone.

Creating the table

To create the table, click Create at the bottom of the page.


After you create the table using Create table in Databricks SQL, the Data Explorer page for the Delta table under
the designated catalog and schema appears.

Known issues
Casting BIGINT to non-castable types like DATE , such as dates in the format of ‘yyyy’, may trigger errors.
Load data with COPY INTO
7/21/2022 • 3 minutes to read

The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and
idempotent operation; files in the source location that have already been loaded are skipped.
COPY INTO supports secure access in several ways, including the ability to Use temporary credentials to load
data with COPY INTO.

Empty Delta Lake tables


NOTE
This feature is available in Databricks Runtime 11.0 and above.

You can create empty placeholder Delta tables so that the schema is later inferred during a COPY INTO
command:

CREATE TABLE IF NOT EXISTS my_table


[COMMENT <table_description>]
[TBLPROPERTIES (<table_properties>)];

COPY INTO my_table


FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

The SQL statement above is idempotent and can be scheduled to run to ingest data exactly-once into a Delta
table.

NOTE
The empty Delta table is not usable outside of COPY INTO . INSERT INTO and MERGE INTO are not supported to write
data into schemaless Delta tables. After data is inserted into the table with COPY INTO , the table becomes queryable.

See Create target tables for COPY INTO

Example
For common use patterns, see Common data loading patterns with COPY INTO
The following example shows how to create a Delta table and then use the COPY INTO SQL command to load
sample data from Sample datasets (databricks-datasets) into the table. You can run the example Python, R, Scala,
or SQL code from a notebook attached to an Azure Databricks cluster. You can also run the SQL code from a
query associated with a SQL warehouse in Databricks SQL.
Python
table_name = 'default.loan_risks_upload'
source_data = '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet'
source_format = 'PARQUET'

spark.sql("DROP TABLE IF EXISTS " + table_name)

spark.sql("CREATE TABLE " + table_name + " (" \


"loan_id BIGINT, " + \
"funded_amnt INT, " + \
"paid_amnt DOUBLE, " + \
"addr_state STRING)"
)

spark.sql("COPY INTO " + table_name + \


" FROM '" + source_data + "'" + \
" FILEFORMAT = " + source_format
)

loan_risks_upload_data = spark.sql("SELECT * FROM " + table_name)

display(loan_risks_upload_data)

'''
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
'''

R
library(SparkR)
sparkR.session()

table_name = "default.loan_risks_upload"
source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
source_format = "PARQUET"

sql(paste("DROP TABLE IF EXISTS ", table_name, sep = ""))

sql(paste("CREATE TABLE ", table_name, " (",


"loan_id BIGINT, ",
"funded_amnt INT, ",
"paid_amnt DOUBLE, ",
"addr_state STRING)",
sep = ""
))

sql(paste("COPY INTO ", table_name,


" FROM '", source_data, "'",
" FILEFORMAT = ", source_format,
sep = ""
))

loan_risks_upload_data = tableToDF(table_name)

display(loan_risks_upload_data)

# Result:
# +---------+-------------+-----------+------------+
# | loan_id | funded_amnt | paid_amnt | addr_state |
# +=========+=============+===========+============+
# | 0 | 1000 | 182.22 | CA |
# +---------+-------------+-----------+------------+
# | 1 | 1000 | 361.19 | WA |
# +---------+-------------+-----------+------------+
# | 2 | 1000 | 176.26 | TX |
# +---------+-------------+-----------+------------+
# ...

Scala
val table_name = "default.loan_risks_upload"
val source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
val source_format = "PARQUET"

spark.sql("DROP TABLE IF EXISTS " + table_name)

spark.sql("CREATE TABLE " + table_name + " (" +


"loan_id BIGINT, " +
"funded_amnt INT, " +
"paid_amnt DOUBLE, " +
"addr_state STRING)"
)

spark.sql("COPY INTO " + table_name +


" FROM '" + source_data + "'" +
" FILEFORMAT = " + source_format
)

val loan_risks_upload_data = spark.table(table_name)

display(loan_risks_upload_data)

/*
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
*/

SQL

DROP TABLE IF EXISTS default.loan_risks_upload;

CREATE TABLE default.loan_risks_upload (
  loan_id BIGINT,
  funded_amnt INT,
  paid_amnt DOUBLE,
  addr_state STRING
);

COPY INTO default.loan_risks_upload
  FROM '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet'
  FILEFORMAT = PARQUET;

SELECT * FROM default.loan_risks_upload;

-- Result:
-- +---------+-------------+-----------+------------+
-- | loan_id | funded_amnt | paid_amnt | addr_state |
-- +=========+=============+===========+============+
-- | 0 | 1000 | 182.22 | CA |
-- +---------+-------------+-----------+------------+
-- | 1 | 1000 | 361.19 | WA |
-- +---------+-------------+-----------+------------+
-- | 2 | 1000 | 176.26 | TX |
-- +---------+-------------+-----------+------------+
-- ...
To clean up, run the following code, which deletes the table:
Python

spark.sql("DROP TABLE " + table_name)

R

sql(paste("DROP TABLE ", table_name, sep = ""))

Scala

spark.sql("DROP TABLE " + table_name)

SQL

DROP TABLE default.loan_risks_upload

Tutorial
Bulk load data into a table with COPY INTO in Databricks SQL
Bulk load data into a table with COPY INTO with Spark SQL

Reference
Databricks Runtime 7.x and above: COPY INTO
Databricks Runtime 5.5 LTS and 6.x: Copy Into (Delta Lake on Azure Databricks)
Use temporary credentials to load data with COPY
INTO
7/21/2022 • 2 minutes to read

If your Azure Databricks cluster or SQL warehouse doesn't have permissions to read your source files, you
can use temporary credentials to access data from external cloud object storage and load files into a Delta Lake
table.
Depending on how your organization manages your cloud security, you may need to ask a cloud administrator
or power user to provide you with credentials.

Specifying temporary credentials or encryption options to access data


NOTE
Credential and encryption options are available in Databricks Runtime 10.2 and above.

COPY INTO supports:


Azure SAS tokens to read data from ADLS Gen2 and Azure Blob Storage. Azure Blob Storage temporary
tokens are at the container level, whereas ADLS Gen2 tokens can be at the directory level in addition to the
container level. Databricks recommends using directory level SAS tokens when possible. The SAS token must
have “Read”, “List”, and “Permissions” permissions.
AWS STS tokens to read data from AWS S3. Your tokens should have the “s3:GetObject*”, “s3:ListBucket”, and
“s3:GetBucketLocation” permissions.

WARNING
To avoid misuse or exposure of temporary credentials, Databricks recommends that you set expiration horizons that are
just long enough to complete the task.

COPY INTO supports loading encrypted data from AWS S3. To load encrypted data, provide the type of
encryption and the key to decrypt the data.

Load data using temporary credentials


The following example loads data from S3 and ADLS Gen2 using temporary credentials to provide access to the
source data.
COPY INTO my_json_data
FROM 's3://my-bucket/jsonData' WITH (
CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = JSON

COPY INTO my_json_data
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/jsonData' WITH (
    CREDENTIAL (AZURE_SAS_TOKEN = '...')
  )
  FILEFORMAT = JSON

Load encrypted data


Using customer-provided encryption keys, the following example loads data from S3.

COPY INTO my_json_data
  FROM 's3://my-bucket/jsonData' WITH (
    ENCRYPTION (TYPE = 'AWS_SSE_C', MASTER_KEY = '...')
  )
  FILEFORMAT = JSON

Load JSON data using credentials for source and target


The following example loads JSON data from a file on Azure into the external Delta table called my_json_data .
This table must be created before COPY INTO can be executed. The command uses one existing credential to
write to the external Delta table and another to read from the ABFSS location.

COPY INTO my_json_data WITH (CREDENTIAL target_credential)
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path' WITH (CREDENTIAL source_credential)
  FILEFORMAT = JSON
  FILES = ('f.json')
Common data loading patterns with COPY INTO
7/21/2022 • 2 minutes to read

Learn common patterns for using COPY INTO to load data from file sources into Delta Lake.
There are many options for using COPY INTO. You can also Use temporary credentials to load data with COPY
INTO in combination with these patterns.
See COPY INTO for a full reference of all options.

Create target tables for COPY INTO


COPY INTO must target an existing Delta table. In Databricks Runtime 11.0 and above, setting the schema for
these tables is optional for formats that support schema evolution:

CREATE TABLE IF NOT EXISTS my_table
  [(col_1 col_1_type, col_2 col_2_type, ...)]
  [COMMENT <table_description>]
  [TBLPROPERTIES (<table_properties>)];

Note that to infer the schema with COPY INTO , you must pass additional options:

COPY INTO my_table
  FROM '/path/to/files'
  FILEFORMAT = <format>
  FORMAT_OPTIONS ('mergeSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');

The following example creates a schemaless Delta table called my_pipe_data and loads a pipe-delimited CSV
with a header:

CREATE TABLE IF NOT EXISTS my_pipe_data;

COPY INTO my_pipe_data
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('mergeSchema' = 'true',
                  'delimiter' = '|',
                  'header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');

Load JSON data with COPY INTO


The following example loads JSON data from 5 files on Azure into the Delta table called my_json_data . This table
must be created before COPY INTO can be executed. If any data had already been loaded from one of the files,
the data will not be reloaded for that file.
COPY INTO my_json_data
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')

-- The second execution will not copy any data since the first command already loaded the data
COPY INTO my_json_data
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')

Load Avro data with COPY INTO


The following example loads Avro data on Google Cloud Storage using additional SQL expressions as part of the
SELECT statement.

COPY INTO my_delta_table
  FROM (SELECT to_date(dt) dt, event as measurement, quantity::double
          FROM 'gs://my-bucket/avroData')
  FILEFORMAT = AVRO

Load CSV files with COPY INTO


The following example loads CSV files from Azure Data Lake Storage Gen2 under
abfss://container@storageAccount.dfs.core.windows.net/base/path/folder1 into a Delta table at
abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target .

COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
  FROM (SELECT key, index, textData, 'constant_value'
          FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
  FILEFORMAT = CSV
  PATTERN = 'folder1/file_[a-g].csv'
  FORMAT_OPTIONS('header' = 'true')

-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (SELECT _c0::bigint key, _c1::int index, _c2 textData
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'

Ignore corrupt files while loading data


If the data you’re loading can’t be read due to some corruption issue, those files can be skipped by setting
ignoreCorruptFiles to true in the FORMAT_OPTIONS .

The COPY INTO command returns how many files were skipped due to corruption in the
num_skipped_corrupt_files column. This metric also shows up in the operationMetrics column under
numSkippedCorruptFiles after running DESCRIBE HISTORY on the Delta table.
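For example, the following Python sketch shows one way to read that metric from the most recent table version; the table name my_table is a placeholder and the use of PySpark here is an illustration rather than part of the documented example.

history = spark.sql("DESCRIBE HISTORY my_table")
# Most recent operation first; numSkippedCorruptFiles may be absent if no files were skipped.
latest = history.orderBy("version", ascending=False).first()
print(latest["operationMetrics"].get("numSkippedCorruptFiles"))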

Corrupt files aren’t tracked by COPY INTO , so they can be reloaded in a subsequent run if the corruption is fixed.
You can see which files are corrupt by running COPY INTO in VALIDATE mode.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
[VALIDATE ALL]
FORMAT_OPTIONS ('ignoreCorruptFiles' = 'true')

NOTE
ignoreCorruptFiles is available in Databricks Runtime 11.0 and above.
Bulk load data into a table with COPY INTO with
Spark SQL
7/21/2022 • 4 minutes to read

Databricks recommends that you use the COPY INTO command for incremental and bulk data loading for data
sources that contain thousands of files. Databricks recommends that you use Auto Loader for advanced use
cases.
In this tutorial, you use the COPY INTO command to load data from cloud object storage into a table in your
Azure Databricks workspace.

Requirements
1. An Azure subscription, an Azure Databricks workspace in that subscription, and a cluster in that workspace.
To create these, see Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal. If you
follow this quickstart, you do not need to follow the instructions in the Run a Spark SQL job section.
2. An all-purpose cluster in your workspace running Databricks Runtime 11.0 or above. To create an all-purpose
cluster, see Create a cluster.
3. Familiarity with the Azure Databricks workspace user interface. See Navigate the workspace.
4. Familiarity working with Notebooks.
5. A location you can write data to; this demo uses the DBFS root as an example, but Databricks recommends
an external storage location configured with Unity Catalog.

Step 1. Configure your environment and create a data generator


This tutorial assumes basic familiarity with Azure Databricks and a default workspace configuration. If you are
unable to run the code provided, contact your workspace administrator to make sure you have access to
compute resources and a location to which you can write data.
Note that the provided code uses a source parameter to specify the location you’ll configure as your
COPY INTO data source. As written, this code points to a location on DBFS root. If you have write permissions on
an external object storage location, replace the dbfs:/ portion of the source string with the path to your object
storage. Because this code block also does a recursive delete to reset this demo, make sure that you don’t point
this at production data and that you keep the /user/{username}/copy-into-demo nested directory to avoid
overwriting or deleting existing data.
1. Create a new SQL notebook and attach it to a cluster running Databricks Runtime 11.0 or above.
2. Copy and run the following code to reset the storage location and database used in this tutorial:
%python
# Set parameters for isolation in workspace and reset demo

username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
database = f"copyinto_{username}_db"
source = f"dbfs:/user/{username}/copy-into-demo"

spark.sql(f"SET c.username='{username}'")
spark.sql(f"SET c.database={database}")
spark.sql(f"SET c.source='{source}'")

spark.sql("DROP DATABASE IF EXISTS ${c.database} CASCADE")
spark.sql("CREATE DATABASE ${c.database}")
spark.sql("USE ${c.database}")

dbutils.fs.rm(source, True)

3. Copy and run the following code to configure some tables and functions that will be used to randomly
generate data:

-- Configure random data generator

CREATE TABLE user_ping_raw
  (user_id STRING, ping INTEGER, time TIMESTAMP)
  USING json
  LOCATION ${c.source};

CREATE TABLE user_ids (user_id STRING);

INSERT INTO user_ids VALUES
  ("potato_luver"),
  ("beanbag_lyfe"),
  ("default_username"),
  ("the_king"),
  ("n00b"),
  ("frodo"),
  ("data_the_kid"),
  ("el_matador"),
  ("the_wiz");

CREATE FUNCTION get_ping()
  RETURNS INT
  RETURN int(rand() * 250);

CREATE FUNCTION is_active()
  RETURNS BOOLEAN
  RETURN CASE
    WHEN rand() > .25 THEN true
    ELSE false
  END;

Step 2: Write the sample data to cloud storage


Writing to data formats other than Delta Lake is rare on Azure Databricks. The code provided here writes to
JSON, simulating an external system that might dump results from another system into object storage.
1. Copy and run the following code to write a batch of raw JSON data:
-- Write a new batch of data to the data source

INSERT INTO user_ping_raw
  SELECT *,
    get_ping() ping,
    current_timestamp() time
  FROM user_ids
  WHERE is_active()=true;

Step 3: Use COPY INTO to load JSON data idempotently


You must create a target Delta Lake table before you can use COPY INTO . In Databricks Runtime 11.0 and above,
you do not need to provide anything other than a table name in your CREATE TABLE statement. For previous
versions of Databricks Runtime, you must provide a schema when creating an empty table.
1. Copy and run the following code to create your target Delta table and load data from your source:

-- Create target table and load data

CREATE TABLE IF NOT EXISTS user_ping_target;

COPY INTO user_ping_target
  FROM ${c.source}
  FILEFORMAT = JSON
  FORMAT_OPTIONS ("mergeSchema" = "true")
  COPY_OPTIONS ("mergeSchema" = "true")

Because this action is idempotent, you can run it multiple times but data will only be loaded once.

Step 4: Preview the contents of your table


You can run a simple SQL query to manually review the contents of this table.
1. Copy and execute the following code to preview your table:

-- Review updated table

SELECT * FROM user_ping_target

Step 5: Load more data and preview results


You can re-run steps 2-4 many times to land new batches of random raw JSON data in your source,
idempotently load them to Delta Lake with COPY INTO , and preview the results. Try running these steps out of
order or multiple times to simulate multiple batches of raw data being written or executing COPY INTO multiple
times without new data having arrived.

Step 6: Clean up tutorial


When you are done with this tutorial, you can clean up the associated resources if you no longer want to keep
them.
1. Copy and run the following code to drop the database, tables, and remove all data:
%python
# Drop database and tables and remove data

spark.sql("DROP DATABASE IF EXISTS ${c.database} CASCADE")


dbutils.fs.rm(source, True)

2. To stop your compute resource, go to the Clusters tab and Terminate your cluster.

Additional resources
The COPY INTO reference article
Bulk load data into a table with COPY INTO in
Databricks SQL
7/21/2022 • 6 minutes to read

Databricks recommends using the COPY INTO command for incremental and bulk data loading with Databricks
SQL.

NOTE
COPY INTO works well for data sources that contain thousands of files. Databricks recommends that you use Auto
Loader for loading millions of files, which is not supported in Databricks SQL.

In this tutorial, you use the COPY INTO command to load data from an Azure Data Lake Storage Gen2 (ADLS
Gen2) container in your Azure account into a table in Databricks SQL.

Requirements
1. A Databricks SQL warehouse. To create a SQL warehouse, see Create a SQL warehouse.
2. Familiarity with the Databricks SQL user interface. See the Databricks SQL user guide.
3. An ADLS Gen2 storage account in your Azure account. To create an ADLS Gen2 storage account, see Create a
storage account to use with Azure Data Lake Storage Gen2. Make sure that your storage account has
Storage account key access set to Enabled and Soft delete set to Disabled .

Step 1. Prepare the sample data


The COPY INTO command loads data from a supported source into your Azure Databricks workspace.
Supported sources include CSV, JSON, Avro, ORC, Parquet, text, and binary files. This source can be anywhere
that your Azure Databricks workspace has access to.
This tutorial follows a scenario in which the data is in an ADLS Gen2 container.
To set things up, in this step you get a copy of some sample data from Databricks datasets. You then prepare
that sample data to be stored in an existing ADLS Gen2 container in your Azure account.
In the next step, you upload this sample data to the container.
In the third step, you set up access permissions for the COPY INTO command.
Finally, you run the COPY INTO command to load the data from the container back into your Azure Databricks
workspace.
Normally, you would not export sample data from your Azure Databricks workspace and re-import it. However,
this scenario needs to follow this particular workflow to prepare sample data for the COPY INTO command.
To prepare the sample data, you can use the Databricks SQL editor.
1. In the SQL persona, on the sidebar, click Create > Query .
2. In the SQL editor’s menu bar, select the SQL warehouse that you created in the Requirements section, or
select another available SQL warehouse that you want to use.
3. In the SQL editor, paste the following code:
SELECT * FROM samples.nyctaxi.trips

4. Click Run .
5. At the bottom of the editor, click the ellipses icon, and then click Download as CSV file .

NOTE
This dataset contains almost 22,000 rows of data. This tutorial downloads only the first 1,000 rows of data. To
download all of the rows, clear the LIMIT 1000 box and then repeat steps 4-5.

Step 2: Upload the sample data to cloud storage


In this step, you upload the sample data from your local development machine into an ADLS Gen2 container in
your Azure account.
1. Sign in to the Azure portal for your Azure account, typically at https://portal.azure.com.
2. Browse to and open your existing Azure storage account. This tutorial uses a fictitious storage account
named nyctaxisampledata .
3. Click Containers > + Container .
4. Enter a name for the container, and then click Create . This tutorial uses a fictitious container named nyctaxi .
5. Click the nyctaxi container.
6. Click Upload .
7. Follow the on-screen instructions to upload the CSV file from the previous step into this container.

Step 3: Create resources in your cloud account to access cloud storage

In this step, in your Azure storage account you get credentials that have just enough access to read the CSV file
that you uploaded to the container.
1. With the nyctaxi container from the previous step still open, click nyctaxisampledata in the navigation
breadcrumb.
2. Right-click the nyctaxi container, and then click Generate SAS .
3. Follow the on-screen instructions to generate a shared access signature (SAS) token and URL. Make sure to
specify both Read and List permissions.
4. After you click Generate SAS token and URL , copy the Blob SAS token value that appears to a secure
location, as you will need it in Step 5. (You will not need the Blob SAS URL value for this tutorial.)

Step 4: Create the table


In this step, you create a table in your Azure Databricks workspace to hold the incoming data.
1. In the sidebar, click Create > Query .
2. In the SQL editor’s menu bar, select the SQL warehouse that you created in the Requirements section, or
select another available SQL warehouse that you want to use.
3. In the SQL editor, paste the following code:
CREATE TABLE default.nyctaxi_trips (
tpep_pickup_datetime TIMESTAMP,
tpep_dropoff_datetime TIMESTAMP,
trip_distance DOUBLE,
fare_amount DOUBLE,
pickup_zip INT,
dropoff_zip INT
);

4. Click Run .

Step 5: Load the sample data from cloud storage into the table
In this step, you load the CSV file from the ADLS Gen2 container into the table in your Azure Databricks
workspace.
1. In the sidebar, click Create > Query .
2. In the SQL editor’s menu bar, select the SQL warehouse that you created in the Requirements section, or
select another available SQL warehouse that you want to use.
3. In the SQL editor, paste the following code. In this code, replace:
nyctaxisampledata with the name of your ADLS Gen2 storage account.
nyctaxi with the name of the container within your storage account.
<yourBlobSASToken> with the value of Blob SAS token from Step 3.

COPY INTO default.nyctaxi_trips
  FROM 'abfss://nyctaxi@nyctaxisampledata.dfs.core.windows.net/'
  WITH (
    CREDENTIAL (
      AZURE_SAS_TOKEN = "<yourBlobSASToken>"
    )
  )
  FILEFORMAT = CSV
  FORMAT_OPTIONS (
    'header' = 'true',
    'inferSchema' = 'true'
  );

SELECT * FROM default.nyctaxi_trips;

NOTE
FORMAT_OPTIONS differs by FILEFORMAT . In this case, the header option instructs Azure Databricks to treat the
first row of the CSV file as a header, and the inferSchema option instructs Azure Databricks to automatically
determine the data type of each field in the CSV file.

4. Click Run .

NOTE
If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only
processes what it considers to be new data.

Step 6: Clean up
When you are done with this tutorial, you can clean up the associated resources in your cloud account and
Azure Databricks if you no longer want to keep them.
Delete the ADLS Gen2 storage account
1. Open the Azure portal for your Azure account, typically at https://portal.azure.com.
2. Browse to and open the nyctaxisampledata storage account.
3. Click Delete .
4. Enter nyctaxisampledata , and then click Delete .
Delete the tables
1. In the sidebar, click Create > Query .
2. Select the SQL warehouse that you created in the Requirements section, or select another available SQL
warehouse that you want to use.
3. Paste the following code:

DROP TABLE default.nyctaxi_trips;

4. Click Run .
5. Hover over the tab for this query, and then click the X icon.
Delete the queries in the SQL editor
1. In your Azure Databricks workspace, in the SQL persona, click SQL Editor in the sidebar.
2. In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then
click the X icon.
Stop the SQL warehouse
If you are not using the SQL warehouse for any other tasks, you should stop the SQL warehouse to avoid
additional costs.
1. In the SQL persona, on the sidebar, click SQL Warehouses .
2. Next to the name of the SQL warehouse, click Stop .
3. When prompted, click Stop again.

Additional resources
The COPY INTO reference article
Auto Loader
7/21/2022 • 2 minutes to read

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any
additional setup.

About Auto Loader


Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader
can load data files from AWS S3 ( s3:// ), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss:// ), Google Cloud
Storage (GCS, gs:// ), Azure Blob Storage ( wasbs:// ), ADLS Gen1 ( adl:// ), and Databricks File System (DBFS,
dbfs:/ ). Auto Loader can ingest JSON , CSV , PARQUET , AVRO , ORC , TEXT , and BINARYFILE file formats.

Auto Loader provides a Structured Streaming source called cloudFiles . Given an input directory path on the
cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of
also processing existing files in that directory. Auto Loader has support for both Python and SQL in Delta Live
Tables.
You can use Auto Loader to process billions of files to migrate or backfill a table. Auto Loader scales to support
near real-time ingestion of millions of files per hour.

Getting Started
Databricks recommends using Auto Loader in Delta Live Tables for incremental data ingestion. Delta Live Tables
extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of
declarative Python or SQL to deploy a production-quality data pipeline.
Databricks recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from
cloud object storage. APIs are available in Python and Scala.
To get started using Auto Loader, see:
Using Auto Loader in Delta Live Tables
Using Auto Loader in Structured Streaming applications

Concepts
You can tune Auto Loader based on data volume, variety, and velocity.
Configuring schema inference and evolution in Auto Loader
Choosing between file notification and directory listing modes
Configure Auto Loader for production workloads

Reference
For a full list of Auto Loader options, see:
Auto Loader options

Tutorials
For details on how to use Auto Loader, see:
Common data loading patterns

Resources
For an overview and demonstration of Auto Loader, watch this YouTube video (59 minutes).

FAQ
Auto Loader FAQ
Using Auto Loader in Delta Live Tables
7/21/2022 • 2 minutes to read

You can use Auto Loader in your Delta Live Tables pipelines. Delta Live Tables extends functionality in Apache
Spark Structured Streaming and allows you to write just a few lines of declarative Python or SQL to deploy a
production-quality data pipeline with:
Autoscaling compute infrastructure for cost savings
Data quality checks with expectations
Automatic schema evolution handling
Monitoring via metrics in the event log
You do not need to provide a schema or checkpoint location because Delta Live Tables automatically manages
these settings for your pipelines. See Delta Live Tables data sources.

Auto Loader syntax for DLT


Delta Live Tables provides slightly modified Python syntax for Auto Loader, and adds SQL support for Auto
Loader.
The following examples use Auto Loader to create datasets from CSV and JSON files:
Python

@dlt.table
def customers():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/databricks-datasets/retail-org/customers/")
)

@dlt.table
def sales_orders_raw():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load("/databricks-datasets/retail-org/sales_orders/")
)

SQL

CREATE OR REFRESH STREAMING LIVE TABLE customers
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv")

CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json")

You can use supported format options with Auto Loader. Using the map() function, you can pass any number of
options to the cloud_files() method. Options are key-value pairs, where the keys and values are strings. The
following describes the syntax for working with Auto Loader in SQL:
CREATE OR REFRESH STREAMING LIVE TABLE <table_name>
AS SELECT *
  FROM cloud_files(
    "<file_path>",
    "<file_format>",
    map(
      "<option_key>", "<option_value>",
      "<option_key>", "<option_value>",
      ...
    )
  )

The following example reads data from tab-delimited CSV files with a header:

CREATE OR REFRESH STREAMING LIVE TABLE customers
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv", map("delimiter", "\t", "header", "true"))

You can specify the schema manually; you must specify the schema for formats that do not support schema
inference:
Python

@dlt.table
def wiki_raw():
return (
spark.readStream.format("cloudFiles")
.schema("title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING,
revisionUsernameId INT, text STRING")
.option("cloudFiles.format", "parquet")
.load("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")
)

SQL

CREATE OR REFRESH STREAMING LIVE TABLE wiki_raw
AS SELECT *
  FROM cloud_files(
    "/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet",
    "parquet",
    map("schema", "title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING, revisionUsernameId INT, text STRING")
  )

NOTE
Delta Live Tables automatically configures and manages the schema and checkpoint directories when using Auto Loader
to read files. However, if you manually configure either of these directories, performing a full refresh does not affect the
contents of the configured directories. Databricks recommends using the automatically configured directories to avoid
unexpected side effects during processing.
Using Auto Loader in Structured Streaming
applications
7/21/2022 • 4 minutes to read

Databricks recommends using Auto Loader in all Structured Streaming applications that ingest data from cloud
object storage.

Benefits over Apache Spark FileStreamSource


In Apache Spark, you can read files incrementally using spark.readStream.format(fileFormat).load(directory) .
Auto Loader provides the following benefits over the file source:
Scalability: Auto Loader can discover billions of files efficiently. Backfills can be performed asynchronously to
avoid wasting any compute resources.
Performance: The cost of discovering files with Auto Loader scales with the number of files that are being
ingested instead of the number of directories that the files may land in. See Optimized directory listing.
Schema inference and evolution support: Auto Loader can detect schema drifts, notify you when schema
changes happen, and rescue data that would have been otherwise ignored or lost. See Schema inference.
Cost: Auto Loader uses native cloud APIs to get lists of files that exist in storage. In addition, Auto Loader’s file
notification mode can help reduce your cloud costs further by avoiding directory listing altogether. Auto
Loader can automatically set up file notification services on storage to make file discovery much cheaper.
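As an illustration, the following Python sketch contrasts the two sources. The paths, schema, and file format are hypothetical placeholders, not part of the documented example.

# Plain Structured Streaming file source: discovery cost grows with the number of directories listed.
df_file_source = (
  spark.readStream.format("json")
    .schema("id INT, payload STRING")
    .load("/mnt/raw/events")
)

# Auto Loader (cloudFiles source): discovery cost scales with the number of new files instead.
df_auto_loader = (
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/_schemas")
    .load("/mnt/raw/events")
)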

Quickstart
The following code example demonstrates how Auto Loader detects new data files as they arrive in cloud
storage. You can run the example code from within a notebook attached to an Azure Databricks cluster.
1. Create the file upload directory, for example:
Python

user_dir = '<my-name>@<my-organization.com>'
upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"

dbutils.fs.mkdirs(upload_path)

Scala

val user_dir = "<my-name>@<my-organization.com>"
val upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"

dbutils.fs.mkdirs(upload_path)

2. Create the following sample CSV files, and then upload them to the file upload directory by using the
DBFS file browser:
WA.csv :
city,year,population
Seattle metro,2019,3406000
Seattle metro,2020,3433000

OR.csv :

city,year,population
Portland metro,2019,2127000
Portland metro,2020,2151000

3. Run the following code to start Auto Loader.


Python

checkpoint_path = '/tmp/delta/population_data/_checkpoints'
write_path = '/tmp/delta/population_data'

# Set up the stream to begin reading incoming files from the
# upload_path location.
df = spark.readStream.format('cloudFiles') \
  .option('cloudFiles.format', 'csv') \
  .option('header', 'true') \
  .schema('city string, year int, population long') \
  .load(upload_path)

# Start the stream.
# Use the checkpoint_path location to keep a record of all files that
# have already been uploaded to the upload_path location.
# For those that have been uploaded since the last check,
# write the newly-uploaded files' data to the write_path location.
df.writeStream.format('delta') \
  .option('checkpointLocation', checkpoint_path) \
  .start(write_path)

Scala

val checkpoint_path = "/tmp/delta/population_data/_checkpoints"


val write_path = "/tmp/delta/population_data"

// Set up the stream to begin reading incoming files from the


// upload_path location.
val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("header", "true")
.schema("city string, year int, population long")
.load(upload_path)

// Start the stream.


// Use the checkpoint_path location to keep a record of all files that
// have already been uploaded to the upload_path location.
// For those that have been uploaded since the last check,
// write the newly-uploaded files' data to the write_path location.
df.writeStream.format("delta")
.option("checkpointLocation", checkpoint_path)
.start(write_path)

4. With the code from step 3 still running, run the following code to query the data in the write directory:
Python
df_population = spark.read.format('delta').load(write_path)

display(df_population)

'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
'''

Scala

val df_population = spark.read.format("delta").load(write_path)

display(df_population)

/* Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
*/

5. With the code from step 3 still running, create the following additional CSV files, and then upload them to
the upload directory by using the DBFS file browser:
ID.csv :

city,year,population
Boise,2019,438000
Boise,2020,447000

MT.csv :

city,year,population
Helena,2019,81653
Helena,2020,82590

Misc.csv :
city,year,population
Seattle metro,2021,3461000
Portland metro,2021,2174000
Boise,2021,455000
Helena,2021,81653

6. With the code from step 3 still running, run the following code to query the existing data in the write
directory, in addition to the new data from the files that Auto Loader has detected in the upload directory
and then written to the write directory:
Python

df_population = spark.read.format('delta').load(write_path)

display(df_population)

'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
'''

Scala
val df_population = spark.read.format("delta").load(write_path)

display(df_population)

/* Result
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
*/

7. To clean up, cancel the running code in step 3, and then run the following code, which deletes the upload,
checkpoint, and write directories:
Python

dbutils.fs.rm(write_path, True)
dbutils.fs.rm(upload_path, True)

Scala

dbutils.fs.rm(write_path, true)
dbutils.fs.rm(upload_path, true)

See also Tutorial: Continuously ingest data into Delta Lake with Auto Loader.
Configuring schema inference and evolution in Auto
Loader
7/21/2022 • 8 minutes to read

Auto Loader can automatically detect the introduction of new columns to your data and restart so you don’t
have to manage the tracking and handling of schema changes yourself. Auto Loader can also “rescue” data that
was unexpected (for example, of differing data types) in a JSON blob column, that you can choose to access later
using the semi-structured data access APIs.
The following formats are supported for schema inference and evolution:

FILE FORMAT      SUPPORTED VERSIONS

JSON             Databricks Runtime 8.2 and above
CSV              Databricks Runtime 8.3 and above
Avro             Databricks Runtime 10.2 and above
Parquet          Databricks Runtime 11.1 and above
ORC              Unsupported
Text             Not applicable (fixed-schema)
Binaryfile       Not applicable (fixed-schema)

Schema inference
To infer the schema, Auto Loader samples the first 50 GB or 1000 files that it discovers, whichever limit is
crossed first. To avoid incurring this inference cost at every stream start up, and to be able to provide a stable
schema across stream restarts, you must set the option cloudFiles.schemaLocation . Auto Loader creates a
hidden directory _schemas at this location to track schema changes to the input data over time. If your stream
contains a single cloudFiles source to ingest data, you can provide the checkpoint location as
cloudFiles.schemaLocation . Otherwise, provide a unique directory for this option. If your input data returns an
unexpected schema for your stream, check that your schema location is being used by only a single Auto Loader
source.
NOTE
To change the size of the sample that is used, you can set the SQL configurations
spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes (a byte string, for example 10gb ) and
spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles (an integer).

By default, Auto Loader infers columns in text-based file formats like CSV and JSON as string columns. In
JSON datasets, nested columns are also inferred as string columns. Since JSON and CSV data is self-
describing and can support many data types, inferring the data as string can help avoid schema evolution issues
such as numeric type mismatches (integers, longs, floats). If you want to retain the original Spark schema
inference behavior, set the option cloudFiles.inferColumnTypes to true .
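For example, the following Python sketch sets a schema location and opts back in to typed inference; the paths are hypothetical placeholders.

df = (
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader tracks the inferred schema under this location in a hidden _schemas directory.
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders")
    # Infer typed columns instead of defaulting everything to strings.
    .option("cloudFiles.inferColumnTypes", "true")
    .load("/mnt/raw/orders")
)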

NOTE
Unless case sensitivity is enabled, the columns abc , Abc , and ABC are considered the same column for the purposes
of schema inference. The selection of which case will be chosen is arbitrary and depends on the sampled data. You can use
schema hints to enforce which case should be used. Once a selection has been made and the schema is inferred, Auto
Loader will not consider the casing variants that were not selected consistent with the schema. These columns may need
to be found in the rescued data column.

Auto Loader also attempts to infer partition columns from the underlying directory structure of the data if the
data is laid out in Hive style partitioning. For example, a file path such as
base_path/event=click/date=2021-04-01/f0.json would result in the inference of date and event as partition
columns. The data types for these columns will be strings unless you set cloudFiles.inferColumnTypes to true. If
the underlying directory structure contains conflicting Hive partitions or doesn’t contain Hive style partitioning,
the partition columns will be ignored. You can provide the option cloudFiles.partitionColumns as a comma-
separated list of column names to always try and parse the given columns from the file path if these columns
exist as key=value pairs in your directory structure.
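For example, given the hypothetical layout base_path/event=click/date=2021-04-01/f0.json, the following sketch (paths are placeholders) asks Auto Loader to parse both columns from the file path:

df = (
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/clicks")
    # Parse these keys from the directory structure if they exist as key=value pairs.
    .option("cloudFiles.partitionColumns", "event,date")
    .load("/mnt/raw/base_path")
)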
When Auto Loader infers the schema, a rescued data column is automatically added to your schema as
_rescued_data . See the section on rescued data column and schema evolution for details.

NOTE
Binary file ( binaryFile ) and text file formats have fixed data schemas, but also support partition column inference.
The partition columns are inferred at each stream restart unless you specify cloudFiles.schemaLocation . To avoid any
potential errors or information loss, Databricks recommends setting cloudFiles.schemaLocation or
cloudFiles.partitionColumns as options for these file formats as cloudFiles.schemaLocation is not a required
option for these formats.

Schema hints
The data types that are inferred may not always be exactly what you’re looking for. By using schema hints, you
can superimpose the information that you know and expect on an inferred schema.
By default, Apache Spark has a standard approach for inferring the type of data columns. For example, it infers
nested JSON as structs and integers as longs. In contrast, Auto Loader considers all columns as strings. When
you know that a column is of a specific data type, or if you want to choose an even more general data type (for
example, a double instead of an integer), you can provide an arbitrary number of hints for column data types
as follows:

.option("cloudFiles.schemaHints", "tags map<string,string>, version int")

See the documentation on data types for the list of supported data types.
If a column is not present at the start of the stream, you can also use schema hints to add that column to the
inferred schema.
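Placed in context, a schema hint is just another option on the stream. The following Python sketch uses hypothetical paths and the hint values shown above:

df = (
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/purchases")
    # Override or add specific columns while the rest of the schema is inferred.
    .option("cloudFiles.schemaHints", "tags map<string,string>, version int")
    .load("/mnt/raw/purchases")
)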
Here is an example of an inferred schema to see the behavior with schema hints. Inferred schema:

|-- date: string
|-- quantity: int
|-- user_info: struct
| |-- id: string
| |-- name: string
| |-- dob: string
|-- purchase_options: struct
| |-- delivery_address: string

By specifying the following schema hints:

.option("cloudFiles.schemaHints", "date DATE, user_info.dob DATE, purchase_options MAP<STRING,STRING>, time


TIMESTAMP")

you will get:

|-- date: string -> date
|-- quantity: int
|-- user_info: struct
| |-- id: string
| |-- name: string
| |-- dob: string -> date
|-- purchase_options: struct -> map<string,string>
|-- time: timestamp

NOTE
Array and Map schema hints support is available in Databricks Runtime 9.1 LTS and above.

Here is an example of an inferred schema with complex datatypes to see the behavior with schema hints.
Inferred schema:
|-- products: array<string>
|-- locations: array<string>
|-- users: array<struct>
| |-- users.element: struct
| | |-- id: string
| | |-- name: string
| | |-- dob: string
|-- ids: map<string,string>
|-- names: map<string,string>
|-- prices: map<string,string>
|-- discounts: map<struct,string>
| |-- discounts.key: struct
| | |-- id: string
| |-- discounts.value: string
|-- descriptions: map<string,struct>
| |-- descriptions.key: string
| |-- descriptions.value: struct
| | |-- content: int

By specifying the following schema hints:

.option("cloudFiles.schemaHints", "products ARRAY<INT>, locations.element STRING, users.element.id INT, ids


MAP<STRING,INT>, names.key INT, prices.value INT, discounts.key.id INT, descriptions.value.content STRING")

you will get:

|-- products: array<string> -> array<int>
|-- locations: array<int> -> array<string>
|-- users: array<struct>
| |-- users.element: struct
| | |-- id: string -> int
| | |-- name: string
| | |-- dob: string
|-- ids: map<string,string> -> map<string,int>
|-- names: map<string,string> -> map<int,string>
|-- prices: map<string,string> -> map<string,int>
|-- discounts: map<struct,string>
| |-- discounts.key: struct
| | |-- id: string -> int
| |-- discounts.value: string
|-- descriptions: map<string,struct>
| |-- descriptions.key: string
| |-- descriptions.value: struct
| | |-- content: int -> string

NOTE
Schema hints are used only if you do not provide a schema to Auto Loader. You can use schema hints whether
cloudFiles.inferColumnTypes is enabled or disabled.

Schema evolution
Auto Loader detects the addition of new columns as it processes your data. By default, addition of a new column
will cause your streams to stop with an UnknownFieldException . Before your stream throws this error, Auto
Loader performs schema inference on the latest micro-batch of data, and updates the schema location with the
latest schema. New columns are merged to the end of the schema. The data types of existing columns remain
unchanged. By setting your Auto Loader stream within an Azure Databricks job, you can get your stream to
restart automatically after such schema changes.
Auto Loader supports the following modes for schema evolution, which you set in the option
cloudFiles.schemaEvolutionMode :

addNewColumns : The default mode when a schema is not provided to Auto Loader. The streaming job will fail
with an UnknownFieldException . New columns are added to the schema. Existing columns do not evolve data
types. addNewColumns is not allowed when the schema of the stream is provided. You can instead provide
your schema as a schema hint if you want to use this mode.
failOnNewColumns : If Auto Loader detects a new column, the stream will fail. It will not restart unless the
provided schema is updated, or the offending data file is removed.
rescue : The stream runs with the very first inferred or provided schema. Any data type changes or new
columns that are added are rescued in the rescued data column that is automatically added to your stream’s
schema as _rescued_data . In this mode, your stream will not fail due to schema changes.
none : The default mode when a schema is provided. Does not evolve the schema, new columns are ignored,
and data is not rescued unless the rescued data column is provided separately as an option.
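For example, the following Python sketch (with placeholder paths) runs a stream in rescue mode so that schema changes never fail the stream:

df = (
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events")
    # New columns and type changes are routed to _rescued_data instead of failing the stream.
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/mnt/raw/events")
)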
Partition columns are not considered for schema evolution. If you had an initial directory structure like
base_path/event=click/date=2021-04-01/f0.json , and then start receiving new files as
base_path/event=click/date=2021-04-01/hour=01/f1.json , the hour column is ignored. To capture information for
new partition columns, set cloudFiles.partitionColumns to event,date,hour .

Rescued data column


The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column
contains any data that wasn’t parsed, either because it was missing from the given schema, or because there
was a type mismatch, or because the casing of the column in the record or file didn’t match with that in the
schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the
source file path of the record (source file path is available in Databricks Runtime 8.3 and above). To remove the
source file path from the rescued data column, you can set the SQL configuration
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false") . The rescued data column
is part of the schema returned by Auto Loader as _rescued_data by default when the schema is being inferred.
You can rename the column or include it in cases where you provide a schema by setting the option
rescuedDataColumn , for example:

spark.readStream.format("cloudFiles")
  .option("cloudFiles.rescuedDataColumn", "_rescued_data")
  .option("cloudFiles.format", <format>)
  .schema(<schema>)
  .load(<path>)
Since the default value of cloudFiles.inferColumnTypes is false , and cloudFiles.schemaEvolutionMode is
addNewColumns when the schema is being inferred, rescuedDataColumn captures only columns that have a
different case than that in the schema.
The JSON and CSV parsers support three modes when parsing records: PERMISSIVE , DROPMALFORMED , and
FAILFAST . When used together with rescuedDataColumn , data type mismatches do not cause records to be
dropped in DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is,
incomplete or malformed JSON or CSV—are dropped or throw errors. If you use badRecordsPath when parsing
JSON or CSV, data type mismatches are not considered as bad records when using the rescuedDataColumn . Only
incomplete and malformed JSON or CSV records are stored in badRecordsPath .

Limitations
Schema evolution is not supported in Python applications running on Databricks Runtime 8.2 and 8.3 that
use foreachBatch . You can use foreachBatch in Scala instead.
Choosing between file notification and directory
listing modes
7/21/2022 • 15 minutes to read

Auto Loader supports two modes for detecting new files: directory listing and file notification.
Director y listing : Auto Loader identifies new files by listing the input directory. Directory listing mode
allows you to quickly start Auto Loader streams without any permission configurations other than access to
your data on cloud storage. In Databricks Runtime 9.1 and above, Auto Loader can automatically detect
whether files are arriving with lexical ordering to your cloud storage and significantly reduce the amount of
API calls it needs to make to detect new files. See Incremental Listing for more details.
File notification : Auto Loader can automatically set up a notification service and queue service that
subscribe to file events from the input directory. File notification mode is more performant and scalable for
large input directories or a high volume of files but requires additional cloud permissions for set up. See
Leveraging file notifications for more details.
The availability for these modes are listed below.

CLOUD STORAGE        DIRECTORY LISTING                   FILE NOTIFICATIONS

AWS S3               All versions                        All versions
ADLS Gen2            All versions                        All versions
GCS                  All versions                        Databricks Runtime 9.1 and above
Azure Blob Storage   All versions                        All versions
ADLS Gen1            Databricks Runtime 7.3 and above    Unsupported
DBFS                 All versions                        For mount points only.

As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint
location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once. You can
switch file discovery modes across stream restarts and still obtain exactly-once data processing guarantees. In
fact, this is how Auto Loader can both perform a backfill on a directory containing existing files and concurrently
process new files that are being discovered through file notifications.
In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint
location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to
maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.
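A minimal end-to-end sketch, assuming placeholder paths, shows where that state lives; the checkpointLocation on the write side is the checkpoint location referred to above.

(
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/ingest")
    .load("/mnt/raw/ingest")
    .writeStream.format("delta")
    # File-discovery metadata (RocksDB) is kept under this checkpoint location.
    .option("checkpointLocation", "/mnt/checkpoints/ingest")
    .start("/mnt/tables/ingest")
)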

Optimized directory listing


NOTE
Available in Databricks Runtime 9.0 and above.

Auto Loader can discover files on cloud storage systems using directory listing more efficiently than other
alternatives. For example, if you had files being uploaded every 5 minutes as /some/path/YYYY/MM/DD/HH/fileName
, to find all the files in these directories, the Apache Spark file source would list all subdirectories in parallel,
causing 1 (base directory) + 365 (per day) * 24 (per hour) = 8761 LIST API directory calls to storage. By
receiving a flattened response from storage, Auto Loader reduces the number of API calls to the number of files
in storage divided by the number of results returned by each API call (1000 with S3, 5000 with ADLS Gen2, and
1024 with GCS), greatly reducing your cloud costs.
Incremental Listing

NOTE
Available in Databricks Runtime 9.1 LTS and above.

For lexicographically generated files, Auto Loader now can leverage the lexical file ordering and optimized listing
APIs to improve the efficiency of directory listing by listing from recently ingested files rather than listing the
contents of the entire directory.
By default, Auto Loader will automatically detect whether a given directory is applicable for incremental listing
by checking and comparing file paths of previously completed directory listings. To ensure eventual
completeness of data in auto mode, Auto Loader will automatically trigger a full directory list after completing
7 consecutive incremental lists. You can control the frequency of full directory lists by setting
cloudFiles.backfillInterval to trigger asynchronous backfills at a given interval.

You can explicitly enable or disable incremental listing by setting cloudFiles.useIncrementalListing to "true"
or "false" (default "auto" ). When explicitly enabled, Auto Loader will not trigger full directory lists unless a
backfill interval is set. Services such as AWS Kinesis Firehose, AWS DMS, and Azure Data Factory can be
configured to upload files to a storage system in lexical order. See the Appendix for more examples of lexical
directory structures.
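For example, the following sketch (placeholder paths) explicitly enables incremental listing and schedules a regular full list as a safety net:

df = (
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/lexical")
    # Skip auto-detection and always list incrementally.
    .option("cloudFiles.useIncrementalListing", "true")
    # Trigger an asynchronous full directory list once a week.
    .option("cloudFiles.backfillInterval", "1 week")
    .load("/mnt/raw/lexical")
)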

Leveraging file notifications


When files do not arrive with lexical ordering to a bucket, you can use file notifications to scale Auto Loader to
ingest millions of files an hour. Auto Loader can set up file notifications for you automatically when you set the
option cloudFiles.useNotifications to true and provide the necessary permissions to create cloud resources.
In addition, you may need to provide the following additional options to provide Auto Loader authorization to
create these resources. The following table summarizes which resources are created by Auto Loader.

CLOUD STORAGE        SERVICE             QUEUE SERVICE          PREFIX (1)               SUBSCRIPTION LIMIT (2)

AWS S3               AWS SNS             AWS SQS                databricks-auto-ingest   100 per S3 bucket
ADLS Gen2            Azure Event Grid    Azure Queue Storage    databricks               500 per storage account
GCS                  Google Pub/Sub      Google Pub/Sub         databricks-auto-ingest   100 per GCS bucket
Azure Blob Storage   Azure Event Grid    Azure Queue Storage    databricks               500 per storage account

1. Auto Loader will name the resources with this prefix
2. How many concurrent file notification pipelines can be launched
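For example, the following Python sketch (placeholder paths; it assumes the stream has the permissions described below) turns on file notification mode:

df = (
  spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/notified")
    # Auto Loader sets up the notification and queue services listed above.
    .option("cloudFiles.useNotifications", "true")
    .load("/mnt/raw/notified")
)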
If you cannot provide Auto Loader with the necessary permissions to create file notification services, you can
ask your cloud administrators to use the setUpNotificationServices method in the next section in a Databricks
Scala notebook to create file notification services for you. Alternatively, your cloud administrators can set up the
file notification services manually, and can provide you with the queue identifier to leverage file notifications.
See File notification options for more details.
You can switch between file notifications and directory listing at any time and still maintain exactly once data
processing guarantees.

NOTE
Cloud providers do not guarantee 100% delivery of all file events under very rare conditions and do not provide any strict
SLAs on the latency of the file events. Databricks recommends that you trigger regular backfills with Auto Loader by using
the cloudFiles.backfillInterval option to guarantee that all files are discovered within a given SLA if data
completeness is a requirement. Triggering regular backfills will not cause duplicates.

If you require running more than the limited number of file notification pipelines for a given storage account,
you can:
Consider rearchitecting how files are uploaded to leverage incremental listing instead of file notifications
Leverage a service such as AWS Lambda, Azure Functions, or Google Cloud Functions to fan out notifications
from a single queue that listens to an entire container or bucket into directory specific queues
File notification events
AWS S3 provides an ObjectCreated event when a file is uploaded to an S3 bucket regardless of whether it was
uploaded by a put or multi-part upload.
ADLS Gen2 provides different event notifications for files appearing in your Gen2 container.
Auto Loader listens for the FlushWithClose event for processing a file.
Auto Loader streams created with Databricks Runtime 8.3 and after support the RenameFile action for
discovering files. RenameFile actions will require an API request to the storage system to get the size of the
renamed file.
Auto Loader streams created with Databricks Runtime 9.0 and after support the RenameDirectory action for
discovering files. RenameDirectory actions will require API requests to the storage system to list the contents
of the renamed directory.
Google Cloud Storage provides an OBJECT_FINALIZE event when a file is uploaded, which includes overwrites
and file copies. Failed uploads do not generate this event.
Managing file notification resources
You can use Scala APIs to manage the notification and queuing services created by Auto Loader. You must
configure the resource setup permissions described in Permissions before using this API.
/////////////////////////////////////
// Creating a ResourceManager in AWS
/////////////////////////////////////

import com.databricks.sql.CloudFilesAWSResourceManager
val manager = CloudFilesAWSResourceManager
.newManager
.option("cloudFiles.region", <region>) // optional, will use the region of the EC2 instances by default
.option("path", <path-to-specific-bucket-and-folder>) // required only for setUpNotificationServices
.create()

///////////////////////////////////////
// Creating a ResourceManager in Azure
///////////////////////////////////////

import com.databricks.sql.CloudFilesAzureResourceManager
val manager = CloudFilesAzureResourceManager
.newManager
.option("cloudFiles.connectionString", <connection-string>)
.option("cloudFiles.resourceGroup", <resource-group>)
.option("cloudFiles.subscriptionId", <subscription-id>)
.option("cloudFiles.tenantId", <tenant-id>)
.option("cloudFiles.clientId", <service-principal-client-id>)
.option("cloudFiles.clientSecret", <service-principal-client-secret>)
.option("path", <path-to-specific-container-and-folder>) // required only for setUpNotificationServices
.create()

///////////////////////////////////////
// Creating a ResourceManager in GCP
///////////////////////////////////////

import com.databricks.sql.CloudFilesGCPResourceManager
val manager = CloudFilesGCPResourceManager
.newManager
.option("path", <path-to-specific-bucket-and-folder>) // Required only for setUpNotificationServices.
.create()

// Set up a queue and a topic subscribed to the path provided in the manager.
manager.setUpNotificationServices(<resource-suffix>)

// List notification services created by Auto Loader


val df = manager.listNotificationServices()

// Tear down the notification services created for a specific stream ID.
// Stream ID is a GUID string that you can find in the list result above.
manager.tearDownNotificationServices(<stream-id>)

Use setUpNotificationServices(<resource-suffix>) to create a queue and a subscription with the name
<prefix>-<resource-suffix> (the prefix depends on the storage system, as summarized in Leveraging file
notifications). If there is an existing resource with the same name, Azure Databricks reuses the existing resource
instead of creating a new one. This function returns a queue identifier that you can pass to the cloudFiles
source using the identifier in File notification options. This enables the cloudFiles source user to have fewer
permissions than the user who creates the resources. See Permissions.
Provide the "path" option to newManager only when calling setUpNotificationServices ; it is not needed for
listNotificationServices or tearDownNotificationServices . This is the same path that you use when running a
streaming query.

CLOUD STORAGE        SETUP API                          LIST API                           TEAR DOWN API

AWS S3               All versions                       All versions                       All versions

ADLS Gen2            All versions                       All versions                       All versions

GCS                  Databricks Runtime 9.1 and above   Databricks Runtime 9.1 and above   Databricks Runtime 9.1 and above

Azure Blob Storage   All versions                       All versions                       All versions

ADLS Gen1            Unsupported                        Unsupported                        Unsupported

Lexical ordering of files


For files to be lexically ordered, new files that are uploaded need to have a prefix that is lexicographically greater
than existing files. Some examples of lexically ordered directories are shown below.
Versioned files
Delta Lake tables commit to their transaction logs in lexical order.

<path_to_table>/_delta_log/00000000000000000000.json
<path_to_table>/_delta_log/00000000000000000001.json <- guaranteed to be written after version 0
<path_to_table>/_delta_log/00000000000000000002.json <- guaranteed to be written after version 1
...

AWS DMS uploads CDC files to AWS S3 in a versioned manner.

database_schema_name/table_name/LOAD00000001.csv
database_schema_name/table_name/LOAD00000002.csv
...

Date partitioned files


Files can be uploaded in a date-partitioned format, which allows you to leverage incremental listing. Some examples of this are:

// <base_path>/yyyy/MM/dd/HH:mm:ss-randomString
<base_path>/2021/12/01/10:11:23-b1662ecd-e05e-4bb7-a125-ad81f6e859b4.json
<base_path>/2021/12/01/10:11:23-b9794cf3-3f60-4b8d-ae11-8ea320fad9d1.json
...

// <base_path>/year=yyyy/month=MM/day=dd/hour=HH/minute=mm/randomString
<base_path>/year=2021/month=12/day=04/hour=08/minute=22/442463e5-f6fe-458a-8f69-a06aa970fc69.csv
<base_path>/year=2021/month=12/day=04/hour=08/minute=22/8f00988b-46be-4112-808d-6a35aead0d44.csv <- this may
be uploaded before the file above as long as processing happens less frequently than a minute

When files are uploaded with date partitioning, keep the following in mind:
Months, days, hours, and minutes need to be left-padded with zeros to ensure lexical ordering (for example,
upload as hour=03 instead of hour=3 , and 2021/05/03 instead of 2021/5/3 ), as sketched below.
Files don’t necessarily have to be uploaded in lexical order within the deepest directory, as long as processing
happens less frequently than the parent directory’s time granularity.
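As a sketch of the padding rule above (assuming a hypothetical upstream writer built on java.time; the base path and file suffix are placeholders), a zero-padded, date-partitioned object key can be constructed like this:

import java.time.{ZonedDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.util.UUID

// Literal text in the pattern is quoted; yyyy/MM/dd/HH/mm stay fixed-width,
// so later uploads always sort lexicographically after earlier ones.
val pattern = DateTimeFormatter.ofPattern("'year='yyyy/'month='MM/'day='dd/'hour='HH/'minute='mm")
val partition = ZonedDateTime.now(ZoneOffset.UTC).format(pattern)
val objectKey = s"<base_path>/$partition/${UUID.randomUUID()}.csv"
// e.g. <base_path>/year=2021/month=12/day=04/hour=08/minute=22/<uuid>.csv
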
Some services that can upload files in a date-partitioned lexical order are:
Azure Data Factory, which can be configured to upload files in lexical order.
AWS Kinesis Firehose
Required permissions for setting up file notification resources
ADLS Gen2 and Azure Blob Storage
You must have read permissions for the input directory. See Azure Blob Storage.
To use file notification mode, you must provide authentication credentials for setting up and accessing the event
notification services. In Databricks Runtime 8.1 and above, you only need a service principal for authentication.
For Databricks Runtime 8.0 and below, you must provide both a service principal and a connection string.
Service principal - using Azure built-in roles
Create an Azure Active Directory app and service principal in the form of client ID and client secret.
Assign this app the following roles to the storage account in which the input path resides:
Contributor : This role is for setting up resources in your storage account, such as queues and event
subscriptions.
Storage Queue Data Contributor : This role is for performing queue operations such as retrieving
and deleting messages from the queues. This role is required in Databricks Runtime 8.1 and above
only when you provide a service principal without a connection string.
Assign this app the following role to the related resource group:
EventGrid EventSubscription Contributor : This role is for performing event grid subscription
operations such as creating or listing event subscriptions.
For more information, see Assign Azure roles using the Azure portal.
Service principal - using custom role
If you are concerned about the excessive permissions required for the preceding roles, you can create a
Custom Role with at least the following permissions, listed below in Azure role JSON format:

"permissions": [
{
"actions": [
"Microsoft.EventGrid/eventSubscriptions/write",
"Microsoft.EventGrid/eventSubscriptions/read",
"Microsoft.EventGrid/eventSubscriptions/delete",
"Microsoft.EventGrid/locations/eventSubscriptions/read",
"Microsoft.Storage/storageAccounts/read",
"Microsoft.Storage/storageAccounts/write",
"Microsoft.Storage/storageAccounts/queueServices/read",
"Microsoft.Storage/storageAccounts/queueServices/write",
"Microsoft.Storage/storageAccounts/queueServices/queues/write",
"Microsoft.Storage/storageAccounts/queueServices/queues/read",
"Microsoft.Storage/storageAccounts/queueServices/queues/delete"
],
"notActions": [],
"dataActions": [
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/delete",
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/read",
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/write",
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/process/action"
],
"notDataActions": []
}
]

Then, you may assign this custom role to your app.


For more information, see Assign Azure roles using the Azure portal.
Connection string
Auto Loader requires a connection string to authenticate for Azure Queue Storage operations, such as
creating a queue and reading and deleting messages from the queue. The queue is created in the same
storage account where the input directory path is located. The connection string is based on either your
account access key or a shared access signature (SAS).
If you are using Databricks Runtime 8.1 or above, you do not need a connection string.
If you are using Databricks Runtime 8.0 or below, you must provide a connection string to authenticate
for Azure Queue Storage operations, such as creating a queue and retrieving and deleting messages from
the queue. The queue is created in the same storage account in which the input path resides. When
configuring an SAS token, you must provide the following permissions:

AWS S3
You must have read permissions for the input directory. See S3 connection details for more details.
To use file notification mode, attach the following JSON policy document to your IAM user or role.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DatabricksAutoLoaderSetup",
"Effect": "Allow",
"Action": [
"s3:GetBucketNotification",
"s3:PutBucketNotification",
"sns:ListSubscriptionsByTopic",
"sns:GetTopicAttributes",
"sns:SetTopicAttributes",
"sns:CreateTopic",
"sns:TagResource",
"sns:Publish",
"sns:Subscribe",
"sqs:CreateQueue",
"sqs:DeleteMessage",
"sqs:DeleteMessageBatch",
"sqs:ReceiveMessage",
"sqs:SendMessage",
"sqs:GetQueueUrl",
"sqs:GetQueueAttributes",
"sqs:SetQueueAttributes",
"sqs:TagQueue",
"sqs:ChangeMessageVisibility",
"sqs:ChangeMessageVisibilityBatch"
],
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:sqs:<region>:<account-number>:databricks-auto-ingest-*",
"arn:aws:sns:<region>:<account-number>:databricks-auto-ingest-*"
]
},
{
"Sid": "DatabricksAutoLoaderList",
"Effect": "Allow",
"Action": [
"sqs:ListQueues",
"sqs:ListQueueTags",
"sns:ListTopics"
],
"Resource": "*"
},
{
"Sid": "DatabricksAutoLoaderTeardown",
"Effect": "Allow",
"Action": [
"sns:Unsubscribe",
"sns:DeleteTopic",
"sqs:DeleteQueue"
],
"Resource": [
"arn:aws:sqs:<region>:<account-number>:databricks-auto-ingest-*",
"arn:aws:sns:<region>:<account-number>:databricks-auto-ingest-*"
]
}
]
}

where:
<bucket-name> : The S3 bucket name where your stream will read files, for example, auto-logs . You can use
* as a wildcard, for example, databricks-*-logs . To find out the underlying S3 bucket for your DBFS path,
you can list all the DBFS mount points in a notebook by running %fs mounts .
<region> : The AWS region where the S3 bucket resides, for example, us-west-2 . If you don’t want to specify
the region, use * .
<account-number> : The AWS account number that owns the S3 bucket, for example, 123456789012 . If you don’t
want to specify the account number, use * .

The string databricks-auto-ingest-* in the SQS and SNS ARN specification is the name prefix that the
cloudFiles source uses when creating SQS and SNS services. Since Azure Databricks sets up the notification
services in the initial run of the stream, you can use a policy with reduced permissions after the initial run (for
example, stop the stream and then restart it).

NOTE
The preceding policy is concerned only with the permissions needed for setting up file notification services, namely S3
bucket notification, SNS, and SQS services and assumes you already have read access to the S3 bucket. If you need to add
S3 read-only permissions, add the following to the Action list in the DatabricksAutoLoaderSetup statement in the
JSON document:

s3:ListBucket
s3:GetObject

Reduced permissions after initial setup


The resource setup permissions described above are required only during the initial run of the stream. After the
first run, you can switch to the following IAM policy with reduced permissions.

IMPORTANT
With the reduced permissions, you won’t be able to start new streaming queries or recreate resources in case of failures (for
example, if the SQS queue has been accidentally deleted); you also won’t be able to use the cloud resource management
API to list or tear down resources.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DatabricksAutoLoaderUse",
"Effect": "Allow",
"Action": [
"s3:GetBucketNotification",
"sns:ListSubscriptionsByTopic",
"sns:GetTopicAttributes",
"sns:TagResource",
"sns:Publish",
"sqs:DeleteMessage",
"sqs:DeleteMessageBatch",
"sqs:ReceiveMessage",
"sqs:SendMessage",
"sqs:GetQueueUrl",
"sqs:GetQueueAttributes",
"sqs:TagQueue",
"sqs:ChangeMessageVisibility",
"sqs:ChangeMessageVisibilityBatch"
],
"Resource": [
"arn:aws:sqs:<region>:<account-number>:<queue-name>",
"arn:aws:sns:<region>:<account-number>:<topic-name>",
"arn:aws:s3:::<bucket-name>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<bucket-name>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:PutObjectAcl",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::<bucket-name>/*"
]
},
{
"Sid": "DatabricksAutoLoaderListTopics",
"Effect": "Allow",
"Action": [
"sqs:ListQueues",
"sqs:ListQueueTags",
"sns:ListTopics"
],
"Resource": "arn:aws:sns:<region>:<account-number>:*"
}
]
}

Securely ingest data in a different AWS account


Auto Loader can load data across AWS accounts by assuming an IAM role. After setting the temporary security
credentials created by AssumeRole , you can have Auto Loader load cloud files across accounts. To set up Auto
Loader for cross-account AWS access, follow the cross-account access documentation and make sure you:
Verify that you have the AssumeRole meta role assigned to the cluster.
Configure the cluster’s Spark configuration to include the following properties:

fs.s3a.credentialsType AssumeRole
fs.s3a.stsAssumeRole.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
fs.s3a.acl.default BucketOwnerFullControl

GCS
You must have list and get permissions on your GCS bucket and on all the objects. For details, see the
Google documentation on IAM permissions.
To use file notification mode, you need to add permissions for the GCS service account and the account used to
access the Google Cloud Pub/Sub resources.
Add the Pub/Sub Publisher role to the GCS service account. This will allow the account to publish event
notification messages from your GCS buckets to Google Cloud Pub/Sub.
As for the service account used for the Google Cloud Pub/Sub resources, you will need to add the following
permissions:

pubsub.subscriptions.consume
pubsub.subscriptions.create
pubsub.subscriptions.delete
pubsub.subscriptions.get
pubsub.subscriptions.list
pubsub.subscriptions.update
pubsub.topics.attachSubscription
pubsub.topics.create
pubsub.topics.delete
pubsub.topics.get
pubsub.topics.list
pubsub.topics.update

To do this, you can either create an IAM custom role with these permissions or assign pre-existing GCP roles to
cover these permissions.
Finding the GCS Service Account
In the Google Cloud Console for the corresponding project, navigate to Cloud Storage > Settings . On that page,
you should see a section titled “Cloud Storage Service Account” containing the email of the GCS service account.
Creating a Custom Google Cloud IAM Role for File Notification Mode
In the Google Cloud console for the corresponding project, navigate to IAM & Admin > Roles . Then, either create
a role at the top or update an existing role. In the screen for role creation or edit, click Add Permissions . A menu
should then pop up in which you can add the desired permissions to the role.
Troubleshooting
Error :

java.lang.RuntimeException: Failed to create event grid subscription.

If you see this error message when you run Auto Loader for the first time, the Event Grid is not registered as a
Resource Provider in your Azure subscription. To register this on Azure portal:
1. Go to your subscription.
2. Click Resource Providers under the Settings section.
3. Register the provider Microsoft.EventGrid .

Error:

403 Forbidden ... does not have authorization to perform action
'Microsoft.EventGrid/eventSubscriptions/[read|write]' over scope ...

If you see this error message when you run Auto Loader for the first time, ensure that you have given the
Contributor role to your service principal for Event Grid as well as your storage account.
Configure Auto Loader for production workloads
7/21/2022 • 4 minutes to read

Databricks recommends that you follow the streaming best practices for running Auto Loader in production.
Databricks recommends using Auto Loader in Delta Live Tables for incremental data ingestion. Delta Live Tables
extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of
declarative Python or SQL to deploy a production-quality data pipeline with:
Autoscaling compute infrastructure for cost savings
Data quality checks with expectations
Automatic schema evolution handling
Monitoring via metrics in the event log

Monitoring Auto Loader


Querying files discovered by Auto Loader

NOTE
The cloud_files_state function is available in Databricks Runtime 10.5 and above.

Auto Loader provides a SQL API for inspecting the state of a stream. Using the cloud_files_state function, you
can find metadata about files that have been discovered by an Auto Loader stream. Simply query from
cloud_files_state , providing the checkpoint location associated with an Auto Loader stream.

SELECT * FROM cloud_files_state('path/to/checkpoint');

Listen to stream updates


To further monitor Auto Loader streams, Databricks recommends using Apache Spark’s Streaming Query
Listener interface.
Auto Loader reports metrics to the Streaming Query Listener at every batch. You can view how many files exist
in the backlog and how large the backlog is in the numFilesOutstanding and numBytesOutstanding metrics under
the Raw Data tab in the streaming query progress dashboard:

{
"sources" : [
{
"description" : "CloudFilesSource[/path/to/source]",
"metrics" : {
"numFilesOutstanding" : "238",
"numBytesOutstanding" : "163939124006"
}
}
]
}

In Databricks Runtime 10.1 and later, when using file notification mode, the metrics will also include the
approximate number of file events that are in the cloud queue as approximateQueueSize for AWS and Azure.
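To consume these metrics programmatically, you can attach a listener that prints the progress of every cloudFiles source at each micro-batch. The following is a minimal Scala sketch; filtering on the CloudFilesSource description shown above is an assumption about how the source identifies itself, and the numFilesOutstanding and numBytesOutstanding values appear in the printed per-source JSON.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Each Auto Loader source reports its backlog metrics in the per-source progress JSON.
    event.progress.sources
      .filter(_.description.startsWith("CloudFilesSource"))
      .foreach(source => println(source.json))
  }
})
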
Cost considerations
When running Auto Loader, your main sources of cost are compute resources and file discovery.
To reduce compute costs, Databricks recommends using Databricks Jobs to schedule Auto Loader as batch jobs
using Trigger.AvailableNow (in Databricks Runtime 10.1 and later) or Trigger.Once instead of running it
continuously, as long as you don’t have low latency requirements.
File discovery costs come in the form of LIST operations on your storage accounts in directory listing mode, and
API requests to the subscription service and queue service in file notification mode. To reduce file discovery
costs, Databricks recommends:
Providing a ProcessingTime trigger when running Auto Loader continuously in directory listing mode (see the
sketch after this list)
Architecting file uploads to your storage account in lexical ordering to leverage Incremental Listing when
possible
Using Databricks Runtime 9.0 or later in directory listing mode, especially for deeply nested directories
Leveraging file notifications when incremental listing is not possible
Using resource tags to tag resources created by Auto Loader to track your costs
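For example, assuming df is an Auto Loader stream and the paths are placeholders, a ProcessingTime trigger caps how often listing and processing run:

import org.apache.spark.sql.streaming.Trigger

// Run a listing and micro-batch at most once per minute instead of back-to-back,
// reducing LIST operations against the storage account.
df.writeStream
  .option("checkpointLocation", "<path-to-checkpoint>")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("<path-to-target>")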

Using Trigger.AvailableNow and rate limiting


NOTE
Available in Databricks Runtime 10.1 for Scala only.
Available in Databricks Runtime 10.2 and above for Python and Scala.

Auto Loader can be scheduled to run in Databricks Jobs as a batch job by using Trigger.AvailableNow . The
AvailableNow trigger instructs Auto Loader to process all files that arrived before the query start time. New
files that are uploaded after the stream has started are ignored until the next trigger.
With Trigger.AvailableNow , file discovery happens asynchronously with data processing, and data can be
processed across multiple micro-batches with rate limiting. By default, Auto Loader processes a maximum of
1000 files in every micro-batch. You can configure cloudFiles.maxFilesPerTrigger and
cloudFiles.maxBytesPerTrigger to control how many files or how many bytes are processed in a micro-batch.
The file limit is a hard limit, but the byte limit is a soft limit, meaning that more bytes can be processed than the
provided maxBytesPerTrigger . When both options are provided together, Auto Loader processes as many files
as are needed to hit one of the limits.
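The following is a sketch (placeholder paths) of scheduling Auto Loader as a batch job with Trigger.AvailableNow and both rate limits set explicitly:

import org.apache.spark.sql.streaming.Trigger

val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .option("cloudFiles.maxFilesPerTrigger", "1000")  // hard limit on files per micro-batch
  .option("cloudFiles.maxBytesPerTrigger", "10g")   // soft limit on bytes per micro-batch
  .load("<path-to-source-data>")

df.writeStream
  .option("checkpointLocation", "<path-to-checkpoint>")
  .trigger(Trigger.AvailableNow())                  // process the current backlog, then stop
  .start("<path-to-target>")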

Event retention
NOTE
Available in Databricks Runtime 8.4 and above.

Auto Loader keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once
ingestion guarantees. For high volume datasets, you can use the cloudFiles.maxFileAge option to expire events
from the checkpoint location to reduce your storage costs and Auto Loader start up time. The minimum value
that you can set for cloudFiles.maxFileAge is "14 days" . Deletes in RocksDB appear as tombstone entries,
therefore you should expect the storage usage to increase temporarily as events expire before it starts to level
off.
WARNING
cloudFiles.maxFileAge is provided as a cost control mechanism for high volume datasets, ingesting in the order of
millions of files every hour. Tuning cloudFiles.maxFileAge incorrectly can lead to data quality issues. Therefore,
Databricks doesn’t recommend tuning this parameter unless absolutely required.

Trying to tune the cloudFiles.maxFileAge option can lead to unprocessed files being ignored by Auto Loader, or to
already processed files expiring and then being re-processed, causing duplicate data. Here are some things to
consider when choosing a cloudFiles.maxFileAge value:
If your stream restarts after a long downtime, file notification events that are pulled from the queue and are older
than cloudFiles.maxFileAge are ignored. Similarly, if you use directory listing, files that appeared during the
downtime and are older than cloudFiles.maxFileAge are ignored.
If you use directory listing mode with cloudFiles.maxFileAge set to, for example, "1 month" , then stop
your stream and restart it with cloudFiles.maxFileAge set to "2 months" , all files that are older
than 1 month but more recent than 2 months are reprocessed.
The best approach to tuning cloudFiles.maxFileAge is to start from a generous expiration, for example
"1 year" , and work downwards to something like "9 months" . If you set this option the first time you start
the stream, you will not ingest data older than cloudFiles.maxFileAge ; therefore, if you want to ingest old data,
you should not set this option when you start your stream.
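If you do decide to set it (subject to the warning above), cloudFiles.maxFileAge is just another option on the reader; a minimal sketch with placeholder paths:

val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .option("cloudFiles.maxFileAge", "1 year")  // expire tracked file events older than one year
  .load("<path-to-source-data>")
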
Auto Loader options
7/21/2022 • 21 minutes to read

Configuration options specific to the cloudFiles source are prefixed with cloudFiles so that they are in a
separate namespace from other Structured Streaming source options.
Common Auto Loader options
Directory listing options
File notification options
File format options
Generic options
JSON options
CSV options
PARQUET options
AVRO options
BINARYFILE options
TEXT options
ORC options
Cloud specific options
AWS specific options
Azure specific options
Google specific options

Common Auto Loader options


You can configure the following options for directory listing or file notification mode.

OPTION

cloudFiles.allowOverwrites

Type: Boolean

Whether to allow input directory file changes to overwrite existing data. Available in Databricks Runtime 7.6 and above.

Default value: false

cloudFiles.backfillInterval

Type: Interval String

Auto Loader can trigger asynchronous backfills at a given interval, for example 1 day to backfill once a day, or 1 week to
backfill once a week. File event notification systems do not guarantee 100% delivery of all files that have been uploaded, so
you can use backfills to guarantee that all files eventually get processed. Available in Databricks Runtime 8.4 (Unsupported)
and above. If you use incremental listing, you can also use regular backfills to guarantee eventual completeness, available in
Databricks Runtime 9.1 LTS and above.

Default value: None



cloudFiles.format

Type: String

The data file format in the source path. Allowed values include:

* avro : Avro file
* binaryFile : Binary file
* csv : CSV file
* json : JSON file
* orc : ORC file
* parquet : Parquet file
* text : Text file

Default value: None (required option)

cloudFiles.includeExistingFiles

Type: Boolean

Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup.
This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has
no effect.

Default value: true

cloudFiles.inferColumnTypes

Type: Boolean

Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when
inferring JSON and CSV datasets. See schema inference for more details.

Default value: false

cloudFiles.maxBytesPerTrigger

Type: Byte String

The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as 10g to limit each
microbatch to 10 GB of data. This is a soft maximum. If you have files that are 3 GB each, Azure Databricks processes 12 GB in
a microbatch. When used together with cloudFiles.maxFilesPerTrigger , Azure Databricks consumes up to the lower limit
of cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger , whichever is reached first. This option has no
effect when used with Trigger.Once() .

Default value: None

cloudFiles.maxFileAge

Type: Interval String

How long a file event is tracked for deduplication purposes. Databricks does not recommend tuning this parameter unless you
are ingesting data at the order of millions of files an hour. See the section on Event retention for more details.

Default value: None



cloudFiles.maxFilesPerTrigger

Type: Integer

The maximum number of new files to be processed in every trigger. When used together with
cloudFiles.maxBytesPerTrigger , Azure Databricks consumes up to the lower limit of cloudFiles.maxFilesPerTrigger or
cloudFiles.maxBytesPerTrigger , whichever is reached first. This option has no effect when used with Trigger.Once() .

Default value: 1000

cloudFiles.partitionColumns

Type: String

A comma separated list of Hive style partition columns that you would like inferred from the directory structure of the files.
Hive style partition columns are key value pairs combined by an equality sign such as
<base_path>/a=x/b=1/c=y/file.format . In this example, the partition columns are a , b , and c . By default these
columns will be automatically added to your schema if you are using schema inference and provide the <base_path> to load
data from. If you provide a schema, Auto Loader expects these columns to be included in the schema. If you do not want
these columns as part of your schema, you can specify "" to ignore these columns. In addition, you can use this option
when you want columns to be inferred from the file path in complex directory structures, like the example below:

<base_path>/year=2022/week=1/file1.csv
<base_path>/year=2022/month=2/day=3/file2.csv
<base_path>/year=2022/month=2/day=4/file3.csv

Specifying cloudFiles.partitionColumns as year,month,day will return year=2022 for file1.csv , but the month and
day columns will be null . month and day will be parsed correctly for file2.csv and file3.csv .

Default value: None

cloudFiles.schemaEvolutionMode

Type: String

The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings
when inferring JSON datasets. See schema evolution for more details.

Default value: "addNewColumns" when a schema is not provided.


"none" otherwise.

cloudFiles.schemaHints

Type: String

Schema information that you provide to Auto Loader during schema inference. See schema hints for more details.

Default value: None

cloudFiles.schemaLocation

Type: String

The location to store the inferred schema and subsequent changes. See schema inference for more details.

Default value: None (required when inferring the schema)



cloudFiles.validateOptions

Type: Boolean

Whether to validate Auto Loader options and return an error for unknown or inconsistent options.

Default value: true

Directory listing options


The following options are relevant to directory listing mode.

OPTION

cloudFiles.useIncrementalListing

Type: String

Whether to use the incremental listing rather than the full listing in directory listing mode. By default, Auto Loader will make
the best effort to automatically detect if a given directory is applicable for the incremental listing. You can explicitly use the
incremental listing or use the full directory listing by setting it as true or false respectively.

Available in Databricks Runtime 9.1 LTS and above.

Default value: auto

Available values: auto , true , false

File notification options


The following options are relevant to file notification mode.

OPTION

cloudFiles.fetchParallelism

Type: Integer

Number of threads to use when fetching messages from the queueing service.

Default value: 1

cloudFiles.pathRewrites

Type: A JSON string

Required only if you specify a queueUrl that receives file notifications from multiple S3 buckets and you want to leverage
mount points configured for accessing data in these containers. Use this option to rewrite the prefix of the bucket/key path
with the mount point. Only prefixes can be rewritten. For example, for the configuration
{"<databricks-mounted-bucket>/path": "dbfs:/mnt/data-warehouse"} , the path
s3://<databricks-mounted-bucket>/path/2017/08/fileA.json is rewritten to
dbfs:/mnt/data-warehouse/2017/08/fileA.json .

Default value: None



cloudFiles.resourceTags

Type: Map(String, String)

A series of key-value tag pairs to help associate and identify related resources, for example:

cloudFiles.option("cloudFiles.resourceTag.myFirstKey", "myFirstValue")
.option("cloudFiles.resourceTag.mySecondKey", "mySecondValue")

For more information on AWS, see Amazon SQS cost allocation tags and Configuring tags for an Amazon SNS topic. (1)

For more information on Azure, see Naming Queues and Metadata and the coverage of properties.labels in Event
Subscriptions. Auto Loader stores these key-value tag pairs in JSON as labels. (1)

For more information on GCP, see Reporting usage with labels. (1)

Default value: None

cloudFiles.useNotifications

Type: Boolean

Whether to use file notification mode to determine when there are new files. If false , use directory listing mode. See How
Auto Loader works.

Default value: false

(1) Auto Loader adds the following key-value tag pairs by default on a best-effort basis:
vendor : Databricks
path : The location from where the data is loaded. Unavailable in GCP due to labeling limitations.
checkpointLocation : The location of the stream’s checkpoint. Unavailable in GCP due to labeling limitations.
streamId : A globally unique identifier for the stream.

These key names are reserved and you cannot overwrite their values.

File format options


With Auto Loader you can ingest JSON , CSV , PARQUET , AVRO , TEXT , BINARYFILE , and ORC files.
Generic options
JSON options
CSV options
PARQUET options
AVRO options
BINARYFILE options
TEXT options
ORC options

Generic options
The following options apply to all file formats.
OPTION

ignoreCorruptFiles

Type: Boolean

Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents
that have been read will still be returned. Observable as numSkippedCorruptFiles in the
operationMetrics column of the Delta Lake history. Available in Databricks Runtime 11.0 and above.

Default value: false

ignoreMissingFiles

Type: Boolean

Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents
that have been read will still be returned. Available in Databricks Runtime 11.0 and above.

Default value: false ( true for COPY INTO )

modifiedAfter

Type: Timestamp String , for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.

Default value: None

modifiedBefore

Type: Timestamp String , for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp before the provided timestamp.

Default value: None

pathGlobFilter

Type: String

A potential glob pattern to provide for choosing files. Equivalent to PATTERN in COPY INTO .

Default value: None

recursiveFileLookup

Type: Boolean

Whether to load data recursively within the base directory and skip partition inference.

Default value: false

JSON options
OPTION

allowBackslashEscapingAnyCharacter

Type: Boolean

Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed
by the JSON specification can be escaped.

Default value: false

allowComments

Type: Boolean

Whether to allow the use of Java, C, and C++ style comments ( '/' , '*' , and '//' varieties) within parsed content or
not.

Default value: false

allowNonNumericNumbers

Type: Boolean

Whether to allow the set of not-a-number ( NaN ) tokens as legal floating number values.

Default value: true

allowNumericLeadingZeros

Type: Boolean

Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001).

Default value: false

allowSingleQuotes

Type: Boolean

Whether to allow use of single quotes (apostrophe, character ' ) for quoting strings (names and String values).

Default value: true

allowUnquotedControlChars

Type: Boolean

Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab
and line feed characters) or not.

Default value: false



allowUnquotedFieldNames

Type: Boolean

Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification).

Default value: false

badRecordsPath

Type: String

The path to store files for recording the information about bad JSON records.

Default value: None

columnNameOfCorruptRecord

Type: String

The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.

Default value: _corrupt_record

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd

dropFieldIfAllNull

Type: Boolean

Whether to ignore columns of all null values or empty arrays and structs during schema inference.

Default value: false

encoding or charset

Type: String

The name of the encoding of the JSON files. See java.nio.charset.Charset for list of options. You cannot use UTF-16 and
UTF-32 when multiline is true .

Default value: UTF-8



inferTimestamp

Type: Boolean

Whether to try and infer timestamp strings as a TimestampType . When set to true , schema inference may take
noticeably longer.

Default value: false

lineSep

Type: String

A string between two consecutive JSON records.

Default value: None, which covers \r , \r\n , and \n

locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the JSON.

Default value: US

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE' , 'DROPMALFORMED' , or 'FAILFAST' .

Default value: PERMISSIVE

multiLine

Type: Boolean

Whether the JSON records span multiple lines.

Default value: false

prefersDecimal

Type: Boolean

Whether to infer floats and doubles as DecimalType during schema inference.

Default value: false



primitivesAsString

Type: Boolean

Whether to infer primitive types like numbers and booleans as StringType .

Default value: false

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to
a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]

timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None

CSV options

OPTION

badRecordsPath

Type: String

The path to store files for recording the information about bad CSV records.

Default value: None

charToEscapeQuoteEscaping

Type: Char

The character used to escape the character used for escaping quotes. For example, for the following record: [ " a\\", b ] :

* If the character to escape the '\' is undefined, the record won’t be parsed. The parser will read characters:
[a],[\],["],[,],[ ],[b] and throw an error because it cannot find a closing quote.
* If the character to escape the '\' is defined as '\' , the record will be read with 2 values: [a\] and [b] .

Default value: '\0'



columnNameOfCorruptRecord

Type: String

A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.

Default value: _corrupt_record

comment

Type: Char

Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable
comment skipping.

Default value: '\u0000'

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd

emptyValue

Type: String

String representation of an empty value.

Default value: ""

encoding or charset

Type: String

The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. UTF-16 and UTF-32
cannot be used when multiline is true .

Default value: UTF-8

enforceSchema

Type: Boolean

Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are
ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution.

Default value: true



escape

Type: Char

The escape character to use when parsing the data.

Default value: '\'

header

Type: Boolean

Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema.

Default value: false

ignoreLeadingWhiteSpace

Type: Boolean

Whether to ignore leading whitespaces for each parsed value.

Default value: false

ignoreTrailingWhiteSpace

Type: Boolean

Whether to ignore trailing whitespaces for each parsed value.

Default value: false

inferSchema

Type: Boolean

Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType . Requires an
additional pass over the data if set to true .

Default value: false

lineSep

Type: String

A string between two consecutive CSV records.

Default value: None, which covers \r , \r\n , and \n

locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the CSV.

Default value: US

maxCharsPerColumn

Type: Int

Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1 , which
means unlimited.

Default value: -1

maxColumns

Type: Int

The hard limit of how many columns a record can have.

Default value: 20480

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader
when inferring the schema.

Default value: false

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE' , 'DROPMALFORMED' , or 'FAILFAST' .

Default value: PERMISSIVE

multiLine

Type: Boolean

Whether the CSV records span multiple lines.

Default value: false

nanValue

Type: String

The string representation of a not-a-number value when parsing FloatType and DoubleType columns.

Default value: "NaN"



negativeInf

Type: String

The string representation of negative infinity when parsing FloatType or DoubleType columns.

Default value: "-Inf"

nullValue

Type: String

String representation of a null value.

Default value: ""

parserCaseSensitive (deprecated)

Type: Boolean

While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default
for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. This option has been
deprecated in favor of readerCaseSensitive .

Default value: false

positiveInf

Type: String

The string representation of positive infinity when parsing FloatType or DoubleType columns.

Default value: "Inf"

quote

Type: Char

The character used for escaping values where the field delimiter is part of the value.

Default value: '"'

readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true



rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data
column.

Default value: None

sep or delimiter

Type: String

The separator string between columns.

Default value: ","

skipRows

Type: Int

The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If
header is true, the header will be the first unskipped and uncommented row.

Default value: 0

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]

timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None



unescapedQuoteHandling

Type: String

The strategy for handling unescaped quotes. Allowed options:

* STOP_AT_CLOSING_QUOTE : If unescaped quotes are found in the input, accumulate the quote character and proceed parsing
the value as a quoted value, until a closing quote is found.
* BACK_TO_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters of the current parsed value until the delimiter defined by sep is found. If no delimiter is
found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.
* STOP_AT_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters until the delimiter defined by sep , or a line ending is found in the input.
* SKIP_VALUE : If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the
next delimiter is found) and the value set in nullValue will be produced instead.
* RAISE_ERROR : If unescaped quotes are found in the input, a
TextParsingException will be thrown.

Default value: STOP_AT_DELIMITER

PARQUET options

OPTION

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .

Default value: LEGACY

int96RebaseMode

Type: String

Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false



readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data
column.

Default value: None

AVRO options

OPTION

avroSchema

Type: String

Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is
compatible with, but different from, the actual Avro schema. The deserialization schema will be consistent with the evolved
schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will
contain the new column too.

Default value: None

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.
mergeSchema for Avro does not relax data types.

Default value: false



readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data
column.

Default value: None

BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options

OPTION

encoding

Type: String

The name of the encoding of the TEXT files. See java.nio.charset.Charset for list of options.

Default value: UTF-8

lineSep

Type: String

A string between two consecutive TEXT records.

Default value: None, which covers \r , \r\n and \n

wholeText

Type: Boolean

Whether to read a file as a single record.

Default value: false

ORC options
OPTION

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false

Cloud specific options


Auto Loader provides a number of options for configuring cloud infrastructure.
AWS specific options
Azure specific options
Google specific options
AWS specific options
Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto
Loader to set up the notification services for you:

OPTION

cloudFiles.region

Type: String

The region where the source S3 bucket resides and where the AWS SNS and SQS services will be created.

Default value: In Databricks Runtime 9.0 and above the region of the EC2 instance. In Databricks Runtime 8.4 and below you
must specify the region.

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto
Loader to use a queue that you have already set up:

OPTION

cloudFiles.queueUrl

Type: String

The URL of the SQS queue. If provided, Auto Loader directly consumes events from this queue instead of setting up its own
AWS SNS and SQS services.

Default value: None

You can use the following options to provide credentials to access AWS SNS and SQS when IAM roles are not
available or when you’re ingesting data from different clouds.
OPTION

cloudFiles.awsAccessKey

Type: String

The AWS access key ID for the user. Must be provided with
cloudFiles.awsSecretKey .

Default value: None

cloudFiles.awsSecretKey

Type: String

The AWS secret access key for the user. Must be provided with
cloudFiles.awsAccessKey .

Default value: None

cloudFiles.roleArn

Type: String

The ARN of an IAM role to assume. The role can be assumed from your cluster’s instance profile or by providing credentials
with
cloudFiles.awsAccessKey and cloudFiles.awsSecretKey .

Default value: None

cloudFiles.roleExternalId

Type: String

An identifier to provide while assuming a role using cloudFiles.roleArn .

Default value: None

cloudFiles.roleSessionName

Type: String

An optional session name to use while assuming a role using cloudFiles.roleArn .

Default value: None

cloudFiles.stsEndpoint

Type: String

An optional endpoint to provide for accessing AWS STS when assuming a role using cloudFiles.roleArn .

Default value: None

Azure specific options


You must provide values for all of the following options if you specify cloudFiles.useNotifications = true and
you want Auto Loader to set up the notification services for you:
OPTION

cloudFiles.clientId

Type: String

The client ID or application ID of the service principal.

Default value: None

cloudFiles.clientSecret

Type: String

The client secret of the service principal.

Default value: None

cloudFiles.connectionString

Type: String

The connection string for the storage account, based on either account access key or shared access signature (SAS).

Default value: None

cloudFiles.resourceGroup

Type: String

The Azure Resource Group under which the storage account is created.

Default value: None

cloudFiles.subscriptionId

Type: String

The Azure Subscription ID under which the resource group is created.

Default value: None

cloudFiles.tenantId

Type: String

The Azure Tenant ID under which the service principal is created.

Default value: None

IMPORTANT
Automated notification setup is available in Azure China and Government regions with Databricks Runtime 9.1 and later.
You must provide a queueName to use Auto Loader with file notifications in these regions for older DBR versions.

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto
Loader to use a queue that you have already set up:
OPTION

cloudFiles.queueName

Type: String

The name of the Azure queue. If provided, the cloud files source directly consumes events from this queue instead of setting
up its own Azure Event Grid and Queue Storage services. In that case, your cloudFiles.connectionString requires only
read permissions on the queue.

Default value: None
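A sketch (placeholder values) of pointing Auto Loader at a pre-created queue, combining cloudFiles.queueName with a read-only cloudFiles.connectionString as described above:

val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.queueName", "<queue-name>")
  .option("cloudFiles.connectionString", "<connection-string>")  // needs only read permissions on the queue
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("<path-to-source-data>")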

Google specific options


Auto Loader can automatically set up notification services for you by leveraging Google Service Accounts. You
can configure your cluster to assume a service account by following Google service setup. The permissions that
your service account needs are specified in Required permissions for setting up file notification resources.
Otherwise, you can provide the following options for authentication if you want Auto Loader to set up the
notification services for you.

OPTION

cloudFiles.client

Type: String

The client ID of the Google Service Account.

Default value: None

cloudFiles.clientEmail

Type: String

The email of the Google Service Account.

Default value: None

cloudFiles.privateKey

Type: String

The private key that’s generated for the Google Service Account.

Default value: None

cloudFiles.privateKeyId

Type: String

The id of the private key that’s generated for the Google Service Account.

Default value: None



cloudFiles.projectId

Type: String

The id of the project that the GCS bucket is in. The Google Cloud Pub/Sub subscription will also be created within this project.

Default value: None

Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto
Loader to use a queue that you have already set up:

OPTION

cloudFiles.subscription

Type: String

The name of the Google Cloud Pub/Sub subscription. If provided, the cloud files source consumes events from this queue
instead of setting up its own GCS Notification and Google Cloud Pub/Sub services.

Default value: None


Common data loading patterns
7/21/2022 • 3 minutes to read

Auto Loader simplifies a number of common data ingestion tasks. This quick reference provides examples for
several popular patterns.

Filtering directories or files using glob patterns


Glob patterns can be used for filtering directories and files when provided in the path.

PATTERN          DESCRIPTION

?                Matches any single character

*                Matches zero or more characters

[abc]            Matches a single character from character set {a,b,c}.

[a-z]            Matches a single character from the character range {a…z}.

[^a]             Matches a single character that is not from character set or range {a}. Note that the ^ character
                 must occur immediately to the right of the opening bracket.

{ab,cd}          Matches a string from the string set {ab, cd}.

{ab,c{de, fh}}   Matches a string from the string set {ab, cde, cfh}.

Use the path for providing prefix patterns, for example:


Python

df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", <format>) \
.schema(schema) \
.load("<base_path>/*/files")

Scala

val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", <format>)
.schema(schema)
.load("<base_path>/*/files")

IMPORTANT
You need to use the option pathGlobFilter for explicitly providing suffix patterns. The path only provides a prefix
filter.

For example, if you would like to parse only png files in a directory that contains files with different suffixes,
you can do:
Python

df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "binaryFile") \
.option("pathGlobfilter", "*.png") \
.load(<base_path>)

Scala

val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.option("pathGlobfilter", "*.png")
.load(<base_path>)

Enable easy ETL


An easy way to get your data into Delta Lake without losing any data is to use the following pattern and enable
schema inference with Auto Loader. Databricks recommends running the following code in an Azure
Databricks job so that it automatically restarts your stream when the schema of your source data changes. By
default, the schema is inferred as string types, any parsing errors (there should be none if everything remains a
string) go to _rescued_data , and any new columns fail the stream and evolve the schema.
Python

spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "<path_to_schema_location>") \
.load("<path_to_source_data>") \
.writeStream \
.option("mergeSchema", "true") \
.option("checkpointLocation", "<path_to_checkpoint>") \
.start("<path_to_target")

Scala

spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "<path_to_schema_location>")
.load("<path_to_source_data>")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target")

Prevent data loss in well-structured data


When you know your schema, but want to know whenever you receive unexpected data, Databricks
recommends using the rescuedDataColumn .
Python
spark.readStream.format("cloudFiles") \
.schema(expected_schema) \
.option("cloudFiles.format", "json") \
# will collect all new fields as well as data type mismatches in _rescued_data
.option("cloudFiles.schemaEvolutionMode", "rescue") \
.load("<path_to_source_data>") \
.writeStream \
.option("checkpointLocation", "<path_to_checkpoint>") \
.start("<path_to_target")

Scala

spark.readStream.format("cloudFiles")
.schema(expected_schema)
.option("cloudFiles.format", "json")
// will collect all new fields as well as data type mismatches in _rescued_data
.option("cloudFiles.schemaEvolutionMode", "rescue")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target")

If you want your stream to stop processing if a new field is introduced that doesn’t match your schema, you can
add:

.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")

Enable flexible semi-structured data pipelines


When you’re receiving data from a vendor that introduces new columns to the information they provide, you
may not be aware of exactly when they do it, or you may not have the bandwidth to update your data pipeline.
You can now leverage schema evolution to restart the stream and let Auto Loader update the inferred schema
automatically. You can also leverage schemaHints for some of the “schemaless” fields that the vendor may be
providing.
Python

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # will ensure that the headers column gets processed as a map
  .option("cloudFiles.schemaHints",
          "headers map<string,string>, statusCode SHORT")
  .load("/api/requests")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))

Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// will ensure that the headers column gets processed as a map
.option("cloudFiles.schemaHints",
"headers map<string,string>, statusCode SHORT")
.load("/api/requests")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")

Examples
For detailed examples with common data formats, see:
Ingest CSV data with Auto Loader
Ingest JSON data with Auto Loader
Ingest Parquet data with Auto Loader
Ingest Avro data with Auto Loader
Ingest image data with Auto Loader
Tutorial: Continuously ingest data into Delta Lake with Auto Loader
Access file metadata with Auto Loader
Ingest CSV data with Auto Loader

NOTE
Schema inference for CSV files is available in Databricks Runtime 8.3 and above.

Using Auto Loader to ingest CSV data into Delta Lake takes only a few lines of code. By leveraging Auto Loader,
you get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift on the fly.
It can also evolve the schema to add new columns and restart the stream with the new schema automatically.
Data rescue: You can configure Auto Loader to rescue data that couldn’t be parsed from your CSV files in a
rescued data column.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on CSV
files. You specify cloudFiles as the format to leverage Auto Loader. You then specify csv with the option
cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto Loader can use to
persist the schema changes in your source data over time:

Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path-to-checkpoint>")
  .start("<path-to-target>"))

Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
.load("<path-to-source-data>")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path-to-checkpoint>")
.start("<path-to-target>")

Auto Loader provides additional functionality to help make ingesting CSV data easier. In this section, we describe
some of the behavior differences between the Apache Spark built-in CSV parser and Auto Loader.
Schema inference
To infer the schema, Auto Loader uses a sample of data. When inferring schema for CSV data, Auto Loader
assumes that the files contain headers. If your CSV files do not contain headers, provide the option
.option("header", "false") . In addition, Auto Loader merges the schemas of all the files in the sample to come
up with a global schema. Auto Loader can then read each file according to its header and parse the CSV
correctly. This is different behavior than the Apache Spark built-in CSV parser. The following example
demonstrates the differences in behavior:

f0.csv:
-------
name,age,lucky_number
john,20,4

f1.csv:
-------
age,lucky_number,name
25,7,nadia

f2.csv:
-------
height,lucky_number
1.81,five

Apache Spark behavior:

+-------+------+--------------+
| name | age | lucky_number | <-- uses just the first file to infer schema
+-------+------+--------------+
| john | 20 | 4 |
| 25 | 7 | nadia | <-- all files are assumed to have the same schema
| 1.81  | five | null         |
+-------+------+--------------+

Auto Loader behavior:

+-------+------+--------------+--------+
| name | age | lucky_number | height | <-- schema is merged across files
+-------+------+--------------+--------+
| john | 20 | 4 | null |
| nadia | 25 | 7 | null | <-- columns are parsed according to order specified in header
| null | null | five | 1.81 | <-- lucky_number's data type will be relaxed to a string
+-------+------+--------------+--------+

NOTE
To get the same schema inference and parsing semantics with the CSV reader in Databricks Runtime, you can use
spark.read.option("mergeSchema", "true").format("csv").load(<path>)

By default, Auto Loader infers columns in your CSV data as string columns. Since CSV data can support many
data types, inferring the data as string can help avoid schema evolution issues such as numeric type mismatches
(integers, longs, floats). If you want to infer specific column types, set the option cloudFiles.inferColumnTypes to
true . You don’t need to set inferSchema to true if you set cloudFiles.inferColumnTypes as true .
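For example, a minimal sketch (the paths are illustrative placeholders) that turns on typed column inference for a CSV stream might look like:

Python

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # infer ints, doubles, timestamps, and so on instead of treating every column as a string
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("<path-to-source-data>"))
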
NOTE
Unless case sensitivity is enabled, the columns abc , Abc , and ABC are considered the same column for the purposes of schema inference. Which case is chosen is arbitrary and depends on the sampled data. You can use schema hints to enforce which case should be used. Once a selection has been made and the schema is inferred, Auto Loader does not consider the casing variants that were not selected to be consistent with the schema. These columns might end up in the rescued data column.

Learn more about schema inference and evolution with Auto Loader in Configuring schema inference and
evolution in Auto Loader.

Rescued data column


The rescued data column contains any data that wasn’t parsed, because it was missing from the given schema,
because there was a type mismatch, or because the casing of the column didn’t match. The rescued data column
is part of the schema returned by Auto Loader as “_rescued_data” by default when the schema is being inferred.
You can rename the column or include it in cases where you provide a schema by setting the option
“rescuedDataColumn”.
With CSV data, the following additional cases can also be rescued:
Columns with an empty string or null value as the column name will be rescued as _c${index} , where index represents the zero-based ordinal of the column in the file.
If the same column name appears with exactly the same casing multiple times in the header, the duplicate instances (each instance except the first) of the column will be rescued as ${columnName}_${index} , where index represents the zero-based ordinal of the column in the file.

NOTE
You can provide a rescued data column to all CSV parsers in Databricks Runtime by using the option
rescuedDataColumn . For example, as an option to spark.read.csv by using the DataFrameReader or the from_csv
function within a SELECT query.

/path/to/table/f0.csv:
---------------------
name,age,lucky_number
john,20,4

spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("rescuedDataColumn", "_rescue") \
.option("header", "true") \
.schema("name string, age int") \
.load("/path/to/table/")

+-------+------+------------------------------------------+
| name | age | _rescue |
+-------+------+------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.csv" |
| | | } |
+-------+------+------------------------------------------+
To remove the source file path from the rescued data column, you can set the SQL configuration
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false") .

Malformed record behavior


The CSV parser supports three modes when parsing malformed records: PERMISSIVE (default), DROPMALFORMED , and FAILFAST . When used together with rescuedDataColumn , data type mismatches do not cause records to be dropped in DROPMALFORMED mode or to throw an error in FAILFAST mode. Only corrupt records, that is, incomplete or malformed CSV, are dropped or throw an error. If you use badRecordsPath when parsing CSV, data type mismatches are not considered bad records when you use the rescuedDataColumn . Only incomplete and malformed CSV records are stored in badRecordsPath .
Auto Loader treats the following as malformed records:
Any row that doesn't have the same number of tokens as the header, if headers are enabled.
Any row that doesn't have the same number of tokens as the data schema, if headers are disabled.
For files that contain records spanning multiple lines, you can set the option multiline to true . You may also need to configure quote ( '"' by default) and escape ( '\' by default) to get your desired parsing behavior.
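As a sketch (the path and schema are placeholders, and the quote and escape values shown are the documented defaults), reading multi-line CSV records might look like:

Python

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("multiLine", "true")   # records may span multiple lines
  .option("quote", '"')          # quote character; '"' is the default
  .option("escape", "\\")        # escape character; '\' is the default
  .schema("<schema>")            # replace with your expected schema
  .load("<path-to-source-data>"))
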

Changing the case-sensitive behavior


When rescued data column is enabled, fields named in a case other than that of the schema will be loaded to the
_rescued_data column. You can change this behavior by setting the option readerCaseSensitive to false, in
which case Auto Loader will read data in a case-insensitive way.
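For example, a minimal sketch (expected_schema and the path are placeholders) that makes Auto Loader match column names case-insensitively instead of rescuing them:

Python

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("rescuedDataColumn", "_rescued_data")
  .option("readerCaseSensitive", "false")  # treat NAME, Name, and name as the same column
  .schema(expected_schema)                 # expected_schema is assumed to be defined elsewhere
  .load("<path-to-source-data>"))
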

Common data loading patterns


Loading CSV files without headers
Python

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("rescuedDataColumn", "_rescued_data")  # makes sure that you don't lose data
  .schema(<schema>)  # provide a schema here for the files
  .load(<path>))

Scala

val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("rescuedDataColumn", "_rescued_data") // makes sure that you don't lose data
.schema(<schema>) // provide a schema here for the files
.load(<path>)

Enforcing a schema on CSV files with headers


Python

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .option("rescuedDataColumn", "_rescued_data")  # makes sure that you don't lose data
  .schema(<schema>)  # provide a schema here for the files
  .load(<path>))
Scala

val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("header", "true")
.option("rescuedDataColumn", "_rescued_data") // makes sure that you don't lose data
.schema(<schema>) // provide a schema here for the files
.load(<path>)
Ingest JSON data with Auto Loader

NOTE
Schema inference for JSON files is available in Databricks Runtime 8.2 and above.

Using Auto Loader to ingest JSON data into Delta Lake takes only a few lines of code. By leveraging Auto Loader,
you get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files without a hiccup.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift on the fly.
It can also evolve the schema to add new columns and restart the stream with the new schema automatically.
Data rescue: You can configure Auto Loader to rescue data that couldn’t be parsed from your JSON in a
rescued data column that preserves the structure of your JSON record.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on JSON
files. You specify cloudFiles as the format to leverage Auto Loader. To ingest JSON files, specify json with the
option cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto Loader can use
to persist the schema changes in your source data over time:

Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<path_to_source_data>")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))

Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<path_to_source_data>")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")

Schema inference and evolution


To infer the schema, Auto Loader uses a sample of data. By default, Auto Loader infers all top-level columns in
your JSON data as string columns. Since JSON data is self describing and can support many data types,
inferring the data as string can help avoid schema evolution issues such as numeric type mismatches (integers,
longs, floats). If you want to retain the original Spark schema inference behavior, set the option
cloudFiles.inferColumnTypes to true .

NOTE
Unless case sensitivity is enabled, the columns abc , Abc , and ABC are considered the same column for the purposes of schema inference. Which case is chosen is arbitrary and depends on the sampled data. You can use schema hints to enforce which case should be used. Once a selection has been made and the schema is inferred, Auto Loader does not consider the casing variants that were not selected to be consistent with the schema. These columns might end up in the rescued data column.

Learn more about schema inference and evolution with Auto Loader in Configuring schema inference and
evolution in Auto Loader.

Rescued data column


JSON data is a very popular format because it allows you to store arbitrary data without having to adhere to a
specific schema. However, it can cause difficulties for data engineers. In particular, JSON data can have the
following problems:
A column can appear with different data types in different records
A column can appear with different cases in different records (for example, as “foo” and “Foo”)
A new column can appear in a subset of records
There may not be a globally consistent schema across all records
To address the first three problems, use the rescued data column. The rescued data column contains any data
that wasn’t parsed, because it was missing from the given schema, because there was a type mismatch, or
because the casing of the column didn’t match. The rescued data column is part of the schema returned by Auto
Loader as _rescued_data by default when the schema is being inferred. You can rename the column or include it
in cases where you provide a schema by setting the option rescuedDataColumn .
Since the default value of cloudFiles.inferColumnTypes is false , and cloudFiles.schemaEvolutionMode is
addNewColumns when the schema is being inferred, rescuedDataColumn captures only columns that have a
different case than that in the schema.
The JSON parser supports three mode s when parsing malformed records: PERMISSIVE (default), DROPMALFORMED ,
and FAILFAST . When used together with rescuedDataColumn , data type mismatches do not cause records to be
dropped in DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records, that is, incomplete
or malformed JSON, will be dropped or will throw an error respectively. If you use badRecordsPath when
parsing JSON, data type mismatches will not be considered as bad records when using the rescuedDataColumn .
Only incomplete and malformed JSON records are stored in badRecordsPath .

NOTE
You can provide a rescued data column to all JSON parsers in Databricks Runtime by using the option
rescuedDataColumn . For example, as an option to spark.read.json by using the DataFrameReader or the from_json
function within a SELECT query.
/path/to/table/f0.json:
---------------------
{"name":"john","age":20,"lucky_number":4}

spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("rescuedDataColumn", "_rescue") \
.schema("name string, age int") \
.load("/path/to/table/")

+-------+------+--------------------------------------------+
| name | age | _rescue |
+-------+------+--------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.json" |
| | | } |
+-------+------+--------------------------------------------+

Common data loading patterns


Transform nested JSON data
Since Auto Loader infers the top level JSON columns as strings, you may be left with nested JSON objects that
require further transformations. You can leverage the semi-structured data access APIs to further transform
complex JSON content.
Python

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<source_data_with_nested_json>")
  .selectExpr(
    "*",
    "tags:page.name",    # extracts {"tags":{"page":{"name":...}}}
    "tags:page.id::int", # extracts {"tags":{"page":{"id":...}}} and casts to int
    "tags:eventType"     # extracts {"tags":{"eventType":...}}
  ))

Scala

spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<source_data_with_nested_json>")
.selectExpr(
"*",
"tags:page.name", // extracts {"tags":{"page":{"name":...}}}
"tags:page.id::int", // extracts {"tags":{"page":{"id":...}}} and casts to int
"tags:eventType" // extracts {"tags":{"eventType":...}}
)

Inferring nested JSON data


When you have nested data, you can use the cloudFiles.inferColumnTypes option to infer the nested structure of
your data and other column types.
Python

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .option("cloudFiles.inferColumnTypes", "true")
  .load("<source_data_with_nested_json>"))

Scala

spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.option("cloudFiles.inferColumnTypes", "true")
.load("<source_data_with_nested_json>")
Ingest Parquet data with Auto Loader

NOTE
Schema inference for Parquet files is available in Databricks Runtime 11.1 and above.

Using Auto Loader to ingest Parquet data into Delta Lake takes only a few lines of code. By leveraging Auto
Loader, you get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files without a hiccup.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift on the fly.
It can also evolve the schema to add new columns and restart the stream with the new schema automatically.
Data rescue: You can configure Auto Loader to rescue data that couldn’t be read properly in a rescued data
column that preserves the structure of your nested record.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on Parquet
files. You specify cloudFiles as the format to leverage Auto Loader. To ingest Parquet files, specify parquet
with the option cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto
Loader can use to persist the schema changes in your source data over time:

Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<path_to_source_data>")
  .writeStream
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))

Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "parquet")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")

Schema inference and evolution


Each Parquet file is self describing and associated with a typed schema. To infer the schema of the Parquet data,
Auto Loader samples a subset of Parquet files and merges the schemas of these individual files.
If a column has different data types in two Parquet files, Auto Loader will determine if one data type can be
safely upcast to the other. If upcasting is possible, Auto Loader can merge the two schemas and choose the
more encompassing data type for this column; otherwise the inference will fail. For example, a: int and
a: double can be merged as a: double ; a: double and a: string can be merged as a: string ; but a: int
and a: struct cannot be merged. Note that, after merging a: int and a: double as a: double , Auto Loader
can read Parquet files with column a: double as normal, but for the Parquet files with a: int , Auto Loader
needs to read a as part of the rescued data column, because the data type is different from the inferred
schema. Users still have a chance to safely upcast the rescued a:int and backfill a:double later.
Unless case sensitivity is enabled, the columns abc , Abc , and ABC are considered the same column for the
purposes of schema inference. The selection of which case will be chosen is arbitrary and depends on the
sampled data. You can use schema hints to enforce which case should be used. Once a selection has been made
and the schema is inferred, Auto Loader does not consider the casing variants that were not selected to be consistent with the schema. These columns may end up in the rescued data column.
Learn more about schema inference and evolution with Auto Loader in Configuring schema inference and
evolution in Auto Loader.

Rescued data column


Since Parquet files are self describing, Parquet data can have the following problems:
A column can appear with different data types in different files
A column can appear with different cases in different files (for example, as “foo” and “Foo”)
A new column can appear in a subset of files
There may not be a globally consistent schema across all files
The rescued data column addresses these issues and will capture data:
1. Missing from the given schema
2. Containing a type mismatch
3. From fields with inconsistent name casing
Auto Loader includes the rescued data column as part of the inferred schema with the default name
_rescued_data . You can rename the column or include it when you provide a schema by setting the option
rescuedDataColumn .

/path/to/table/f0.parquet (data shown as JSON):


----------------------------------------------
{"name":"john","age":20,"lucky_number":4}

spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "parquet") \
.option("rescuedDataColumn", "_rescue") \
.schema("name string, age int") \
.load("/path/to/table/")
+-------+------+-----------------------------------------------+
| name | age | _rescue |
+-------+------+-----------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.parquet" |
| | | } |
+-------+------+-----------------------------------------------+

Changing the case-sensitive behavior


When rescued data column is enabled, fields named in a case other than that of the schema will be loaded to the
_rescued_data column. You can change this behavior by setting the option readerCaseSensitive to false, in
which case Auto Loader will read data in a case-insensitive way.
Ingest Avro data with Auto Loader

NOTE
Schema inference for Avro files is available in Databricks Runtime 10.2 and above.

You can use Auto Loader to ingest Avro data into Delta Lake with only a few lines of code. Auto Loader provides
the following benefits:
Automatic discovery of new files to process: You do not need special logic to handle late arriving data or to keep track of which files you have already processed.
Scalable file discovery: Auto Loader can ingest billions of files with ease.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift in real
time. It can also evolve the schema to add new columns and continue the ingestion with the new schema
automatically when the stream is restarted.
Data rescue: You can configure Auto Loader to rescue data that cannot be read from your Avro file by placing
that data in a rescued data column, which preserves the structure of your Avro record.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on Avro
files. You specify cloudFiles as the format to leverage Auto Loader. To ingest Avro files, specify avro with the
option cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto Loader can use
to persist the schema changes in your source data over time:

Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "avro")
  # The schema location directory keeps track of the data schema over time.
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<path_to_source_data>")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))

Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "avro")
// The schema location directory keeps track of the data schema over time.
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")

Schema inference and evolution


To infer the schema, Auto Loader uses a sample of data. By default, Auto Loader merges schemas across the
sample of Avro files or throws an exception if there are mismatched column types.
Learn more about schema inference and evolution with Auto Loader in Configuring schema inference and
evolution in Auto Loader.

Rescued data column


The rescued data column contains any data that was not read because it was missing from the given schema,
because there was a type mismatch, or because the casing of the column did not match. The rescued data
column is part of the schema returned by Auto Loader as _rescued_data by default when the schema is inferred.
You can rename the column or include it in cases where you provide a schema by setting the option
rescuedDataColumn .

NOTE
You can provide a rescued data column to all Avro readers in Databricks Runtime by using the option
rescuedDataColumn , for example as an option to spark.read.format("avro") by using the DataFrameReader.

/path/to/table/f0.avro:
---------------------
{"name":"john","age":20,"lucky_number":4}

spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "avro") \
.option("rescuedDataColumn", "_rescue") \
.schema("name string, age int") \
.load("/path/to/table/")

+-------+------+--------------------------------------------+
| name | age | _rescue |
+-------+------+--------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.avro" |
| | | } |
+-------+------+--------------------------------------------+

Changing the case-sensitive behavior


When rescued data column is enabled, fields named in a case other than that of the schema will be loaded to the
_rescued_data column. You can change this behavior by setting the option readerCaseSensitive to false, in
which case Auto Loader will read data in a case-insensitive way.
Ingest image data with Auto Loader

NOTE
Available in Databricks Runtime 9.0 and above.

Using Auto Loader to ingest image data into Delta Lake takes only a few lines of code. By using Auto Loader, you
get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files.
Optimized storage: Auto Loader can provide Delta Lake with additional information about the data to optimize file storage.

Python
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "binaryFile") \
.load("<path_to_source_data>") \
.writeStream \
.option("checkpointLocation", "<path_to_checkpoint>") \
.start("<path_to_target>")

Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")

The preceding code will write your image data into a Delta table in an optimized format.

Use a Delta table for machine learning


Once the data is stored in Delta Lake, you can run distributed inference on the data. See the reference article for
more details.
Tutorial: Continuously ingest data into Delta Lake with Auto Loader

Continuous, incremental data ingestion is a common need. For example, applications from mobile games to e-
commerce websites to IoT sensors generate continuous streams of data. Analysts desire access to the freshest
data, yet it can be challenging to implement for several reasons:
You may need to transform and ingest data as it arrives, while processing files exactly once.
You may want to enforce schemas before writing to tables. This logic can be complex to write and maintain.
It is challenging to handle data whose schemas change over time. For example, you must decide how to deal
with incoming rows that have data quality problems and how to reprocess those rows after you have solved
issues with the raw data.
A scalable solution, one that processes thousands or millions of files per minute, requires integrating cloud services like event notifications, message queues, and triggers, which adds to development complexity and long-term maintenance.
Building a continuous, cost effective, maintainable, and scalable data transformation and ingestion system is not
trivial. Azure Databricks provides Auto Loader as a built-in, optimized solution that addresses the preceding
issues, and provides a way for data teams to load raw data from cloud object stores at lower costs and latencies.
Auto Loader automatically configures and listens to a notification service for new files and can scale up to
millions of files per second. It also takes care of common issues such as schema inference and schema evolution.
To learn more, see Auto Loader.
In this tutorial, you use Auto Loader to incrementally ingest (load) data into a Delta table.

Requirements
1. An Azure subscription, an Azure Databricks workspace within that subscription, and a cluster within that
workspace. To create these, see Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure
portal. (If you follow this quickstart, you do not need to follow the instructions in the Run a Spark SQL job
section.)
2. Familiarity with the Azure Databricks workspace user interface. See Navigate the workspace.

Step 1: Create sample data


In this step, you create a notebook in your workspace. In this notebook, you run code that generates a random
comma-separated file in your workspace every 30 seconds. Each of these files contains a random set of data.

NOTE
Auto Loader also works with data in the following formats: Avro, binary, CSV, JSON, ORC, Parquet, and text.

1. In your workspace, in the sidebar, click Create > Notebook .


2. In the Create Notebook dialog, enter a name for the notebook, for example Fake Data Generator .
3. For Default Language , select Python .
4. For Cluster , select the cluster that you created in the Requirements section, or select another available
cluster that you want to use.
5. Click Create .
6. In the notebook’s menu bar, if the circle next to the name of the cluster does not contain a green check
mark, click the drop-down arrow next to the cluster's name, and then click Start Cluster . Click Confirm ,
and then wait until the circle contains a green check mark.
7. In the notebook’s first cell, paste the following code:

import csv
import uuid
import random
import time
from pathlib import Path

count = 0
path = "/tmp/generated_raw_csv_data"
Path(path).mkdir(parents=True, exist_ok=True)

while True:
    row_list = [ ["id", "x_axis", "y_axis"],
                 [uuid.uuid4(), random.randint(-100, 100), random.randint(-100, 100)],
                 [uuid.uuid4(), random.randint(-100, 100), random.randint(-100, 100)],
                 [uuid.uuid4(), random.randint(-100, 100), random.randint(-100, 100)]
               ]
    file_location = f'{path}/file_{count}.csv'

    with open(file_location, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(row_list)
        file.close()

    count += 1
    dbutils.fs.mv(f'file:{file_location}', f'dbfs:{file_location}')
    time.sleep(30)
    print(f'New CSV file created at dbfs:{file_location}. Contents:')

    with open(f'/dbfs{file_location}', 'r') as file:
        reader = csv.reader(file, delimiter=' ')
        for row in reader:
            print(', '.join(row))
        file.close()

The preceding code does the following:


a. Creates a directory at /tmp/generated_raw_csv_data in your workspace, if a directory does not
already exist.

TIP
If this path already exists in your workspace because someone else ran this tutorial, you may want to clear
out any existing files in this path first.

b. Creates a random set of data, for example:

id,x_axis,y_axis
d033faf3-b6bd-4bbc-83a4-43a37ce7e994,88,-13
fde2bdb6-b0a1-41c2-9650-35af717549ca,-96,19
297a2dfe-99de-4c52-8310-b24bc2f83874,-23,43
c. After 30 seconds, creates a file named file_<number>.csv , writes the random set of data to the file,
stores the file in dbfs:/tmp/generated_raw_csv_data , and reports the path to the file and its
contents. <number> starts at 0 and increases by 1 every time a file is created (for example,
file_0.csv , file_1.csv , and so on).

8. In the notebook’s menu bar, click Run All . Leave this notebook running.

NOTE
To view the list of generated files, in the sidebar, click Data . Click DBFS, select a cluster if prompted, and then click
tmp > generated_raw_csv_data .

Step 2: Run Auto Loader


In this step you use Auto Loader to continuously read raw data from one location in your workspace and then
stream that data into a Delta table in another location in the same workspace.
1. In the sidebar, click Create > Notebook .
2. In the Create Notebook dialog, enter a name for the notebook, for example Auto Loader Demo .
3. For Default Language , select Python .
4. For Cluster , select the cluster that you created in the Requirements section, or select another available
cluster that you want to use.
5. Click Create .
6. In the notebook’s menu bar, if the circle next to the name of the cluster does not contain a green check
mark, click the drop-down arrow next to the cluster's name, and then click Start Cluster . Click Confirm ,
and then wait until the circle contains a green check mark.
7. In the notebook’s first cell, paste the following code:

raw_data_location = "dbfs:/tmp/generated_raw_csv_data"
target_delta_table_location = "dbfs:/tmp/table/coordinates"
schema_location = "dbfs:/tmp/auto_loader/schema"
checkpoint_location = "dbfs:/tmp/auto_loader/checkpoint"

This code defines in your workspace the paths to the raw data and the target Delta table, the path to the
table’s schema, and the path to the location where Auto Loader writes checkpoint file information in the
Delta Lake transaction log. Checkpoints enable Auto Loader to process only new incoming data and to
skip over any existing data that has already been processed.

TIP
If any of these paths already exist in your workspace because someone else ran this tutorial, you may want to
clear out any existing files in these paths first.

8. With your cursor still in the first cell, run the cell. (To run the cell, press Shift+Enter.) Azure Databricks
reads the specified paths into memory.
9. Add a cell below the first cell, if it is not already there. (To add a cell, rest your mouse pointer along the
bottom edge of the cell, and then click the + icon.) In this second cell, paste the following code (note that
cloudFiles represents Auto Loader):
stream = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header", "true") \
.option("cloudFiles.schemaLocation", schema_location) \
.load(raw_data_location)

10. Run this cell.


11. In the notebook’s third cell, paste the following code:

display(stream)

12. Run this cell. Auto Loader begins processing the existing CSV files in raw_data_location as well as any
incoming CSV files as they arrive in that location. Auto Loader processes each CSV file by using the first
line in the file for the field names and the remaining lines as field data. Azure Databricks displays the data
as Auto Loader processes it.
13. In the notebook’s fourth cell, paste the following code:

stream.writeStream \
.option("checkpointLocation", checkpoint_location) \
.start(target_delta_table_location)

14. Run this cell. Auto Loader writes the data to the Delta table in target_delta_table_location . Auto Loader
also writes checkpoint file information in checkpoint_location .

Step 3: Evolve and enforce data schema


What happens if your data's schema changes over time? For example, what if you want to evolve field data types so that you can better enforce data quality and make it easier to do calculations on your data? In this step, you evolve the allowed data types of your data, and then you enforce this schema on incoming data.
Remember to keep your notebook from step 1 running, to maintain the data stream with new generated sample
files.
1. Stop the notebook from step 2. (To stop the notebook, click Stop Execution in the notebook’s menu bar.)
2. In the notebook from step 2, replace the contents of the fourth cell (the one that starts with
stream.writeStream ) with the following code:

stream.printSchema()

3. Run all of the notebook’s cells. (To run all of the cells, click Run All in the notebook’s menu bar.) Azure
Databricks prints the data’s schema, which shows all fields as strings. Let’s evolve the x_axis and
y_axis fields to integers.

4. Stop the notebook.


5. Replace the contents of the second cell (the one that starts with stream = spark.readStream ) with the
following code:
stream = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header", "true") \
.option("cloudFiles.schemaLocation", schema_location) \
.option("cloudFiles.schemaHints", """x_axis integer, y_axis integer""") \
.load(raw_data_location)

6. Run all of the notebook’s cells. Azure Databricks prints the data’s new schema, which shows the x_axis
and y_axis columns as integers. Let’s now enforce data quality by using this new schema.
7. Stop the notebook.
8. Replace the contents of the second cell with the following code:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
StructField('id', StringType(), True),
StructField('x_axis', IntegerType(), True),
StructField('y_axis', IntegerType(), True)
])

stream = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header", "true") \
.option("cloudFiles.schemaLocation", schema_location) \
.schema(schema) \
.load(raw_data_location)

9. Run all of the notebook’s cells. Auto Loader now uses its schema inference and evolution logic to
determine how to process incoming data that does not match the new schema.

Step 4: Clean up
When you are done with this tutorial, you can clean up the associated Azure Databricks resources in your
workspace, if you no longer want to keep them.
Delete the data
1. Stop both notebooks. (To open a notebook, in the sidebar, click Workspace > Users > your user
name , and then click the notebook.)
2. In the notebook from step 1, add a cell after the first one, and paste the following code into this second
cell.

dbutils.fs.rm("dbfs:/tmp/generated_raw_csv_data", True)
dbutils.fs.rm("dbfs:/tmp/table", True)
dbutils.fs.rm("dbfs:/tmp/auto_loader", True)

WARNING
If you have any other information in these locations, this information will also be deleted!

3. Run the cell. Azure Databricks deletes the directories that contain the raw data, the Delta table, the table’s
schema, and the Auto Loader checkpoint information.
Delete the notebooks
1. In the sidebar, click Workspace > Users > your user name .
2. Click the drop-down arrow next to the first notebook, and click Move to Trash .
3. Click Confirm and move to Trash .
4. Repeat steps 1 - 3 for the second notebook.
Stop the cluster
If you are not using the cluster for any other tasks, you should stop it to avoid additional costs.
1. In the sidebar, click Compute .
2. Click the cluster’s name.
3. Click Terminate .
4. Click Confirm .

Additional resources
Auto Loader technical documentation
Leveraging file notifications for larger volumes of data
10 Powerful Features to Simplify Semi-structured Data Management in the Databricks Lakehouse blog
Hassle-Free Data Ingestion on-demand webinar series
Auto Loader FAQ

Commonly asked questions about Databricks Auto Loader.

Does Auto Loader process the file again when the file gets appended
or overwritten?
Files are processed exactly once unless cloudFiles.allowOverwrites is enabled. If a file is appended to or
overwritten, Azure Databricks does not guarantee which version of the file is processed. Databricks
recommends you use Auto Loader to ingest only immutable files. If this does not meet your requirements,
contact your Databricks representative.
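If you do need to pick up changes to existing files, a minimal sketch (placeholder paths) of enabling the option looks like this; note that an overwritten file may then be processed more than once:

Python

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.allowOverwrites", "true")  # re-ingest files when they are overwritten
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("<path-to-source-data>"))
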

If my data files do not arrive continuously, but in regular intervals, for example, once a day, should I still use this source and are there any benefits?
In this case, you can set up a Trigger.Once or Trigger.AvailableNow (available in Databricks Runtime 10.2 and later) Structured Streaming job and schedule it to run after the anticipated file arrival time. Auto Loader works well with both infrequent and frequent updates. Even if the eventual updates are very large, Auto Loader scales well to the input size. Auto Loader's efficient file discovery techniques and schema evolution capabilities make Auto Loader the recommended method for incremental data ingestion.
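For example, a minimal sketch (placeholder paths) of a batch-style Auto Loader job that processes whatever has arrived and then stops; availableNow requires a recent runtime, and trigger(once=True) is the older equivalent:

Python

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("<path-to-source-data>")
  .writeStream
  .trigger(availableNow=True)  # or .trigger(once=True) on older runtimes
  .option("checkpointLocation", "<path-to-checkpoint>")
  .start("<path-to-target>"))
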

What happens if I change the checkpoint location when restarting the stream?
A checkpoint location maintains important identifying information of a stream. Changing the checkpoint
location effectively means that you have abandoned the previous stream and started a new stream.

Do I need to create event notification services beforehand?


No. If you choose file notification mode and provide the required permissions, Auto Loader can create file
notification services for you. See Leveraging file notifications.

How do I clean up the event notification resources created by Auto Loader?
You can use the cloud resource manager to list and tear down resources. You can also delete these resources
manually using the cloud provider’s UI or APIs.

Can I run multiple streaming queries from different input directories on the same bucket/container?
Yes, as long as they are not parent-child directories; for example, prod-logs/ and prod-logs/usage/ would not
work because /usage is a child directory of /prod-logs .

Can I use this feature when there are existing file notifications on my
bucket or container?
Yes, as long as your input directory does not conflict with the existing notification prefix (for example, the above
parent-child directories).

How does Auto Loader infer schema?


When the DataFrame is first defined, Auto Loader lists your source directory and chooses the most recent (by
file modification time) 50 GB of data or 1000 files, and uses those to infer your data schema.
Auto Loader also infers partition columns by examining the source directory structure and looks for file paths
that contain the /key=value/ structure. If the source directory has an inconsistent structure, for example:

base/path/partition=1/date=2020-12-31/file1.json
// inconsistent because date and partition directories are in different orders
base/path/date=2020-12-31/partition=2/file2.json
// inconsistent because the date directory is missing
base/path/partition=3/file3.json

Auto Loader infers the partition columns as empty. Use cloudFiles.partitionColumns to explicitly parse columns
from the directory structure.
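For example, a minimal sketch (the base path and column names follow the hypothetical layout above) that names the partition columns explicitly:

Python

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .option("cloudFiles.partitionColumns", "partition,date")  # comma-separated Hive-style partition keys
  .load("base/path"))
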

How does Auto Loader behave when the source folder is empty?
If the source directory is empty, Auto Loader requires you to provide a schema as there is no data to perform
inference.

When does Auto Loader infer schema? Does it evolve automatically after every micro-batch?
The schema is inferred when the DataFrame is first defined in your code. During each micro-batch, schema
changes are evaluated on the fly; therefore, you don’t need to worry about performance hits. When the stream
restarts, it picks up the evolved schema from the schema location and starts executing without any overhead
from inference.

What's the performance impact on ingesting the data when using Auto Loader schema inference?
You should expect schema inference to take a couple of minutes for very large source directories during initial
schema inference. You shouldn’t observe significant performance hits otherwise during stream execution. If you
run your code in an Azure Databricks notebook, you can see status updates that specify when Auto Loader will
be listing your directory for sampling and inferring your data schema.

Due to a bug, a bad file has changed my schema drastically. What should I do to roll back a schema change?
Contact Databricks support for help.
SQL databases using JDBC

Databricks Runtime contains JDBC drivers for Microsoft SQL Server and Azure SQL Database. See the
Databricks runtime release notes for the complete list of JDBC libraries included in Databricks Runtime.
This article covers how to use the DataFrame API to connect to SQL databases using JDBC and how to control
the parallelism of reads through the JDBC interface. This article provides detailed examples using the Scala API,
with abbreviated Python and Spark SQL examples at the end. For all of the supported arguments for connecting
to SQL databases using JDBC, see JDBC To Other Databases.

NOTE
Another option for connecting to SQL Server and Azure SQL Database is the Apache Spark connector. It can provide
faster bulk inserts and lets you connect using your Azure Active Directory identity.

IMPORTANT
The examples in this article do not include usernames and passwords in JDBC URLs. Instead, they expect you to follow the Secret management user guide to store your database credentials as secrets, and then leverage them in a notebook to populate your credentials in a java.util.Properties object. For example:

val jdbcUsername = dbutils.secrets.get(scope = "jdbc", key = "username")
val jdbcPassword = dbutils.secrets.get(scope = "jdbc", key = "password")

For a full example of secret management, see Secret workflow example.

Establish connectivity to SQL Server


This example queries SQL Server using its JDBC driver.
Step 1: Check that the JDBC driver is available

Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

Step 2: Create the JDBC URL

val jdbcHostname = "<hostname>"
val jdbcPort = 1433
val jdbcDatabase = "<database>"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
Step 3: Check connectivity to the SQLServer database

val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)

Connect to a PostgreSQL database over SSL


To connect to a PostgreSQL database over SSL when using JDBC:
You must provide certificates and keys in PK8 and DER format, not PEM.
The certificates and keys must be in a folder in DBFS within the /dbfs folder so that all nodes can read them.
The following Python notebook example demonstrates how to generate the PK8 and DER files from a set of PEM
files and then use those PK8 and DER files to create a DataFrame. This example assumes that the following PEM
files exist:
For the client public key certificate, client_cert.pem .
For the client private key, client_key.pem .
For the server certificate, server_ca.pem .

%sh
# Copy the PEM files to a folder within /dbfs so that all nodes can read them.
mkdir -p <target-folder>
cp <source-files> <target-folder>

%sh
# Convert the PEM files to PK8 and DER format.
cd <target-folder>
openssl pkcs8 -topk8 -inform PEM -in client_key.pem -outform DER -out client_key.pk8 -nocrypt
openssl x509 -in server_ca.pem -out server_ca.der -outform DER
openssl x509 -in client_cert.pem -out client_cert.der -outform DER

# Create the DataFrame.
df = (spark
.read
.format("jdbc")
.option("url", <connection-string>)
.option("dbtable", <table-name>)
.option("user", <username>)
.option("password", <password>)
.option("ssl", True)
.option("sslmode", "require")
.option("sslcert", <target-folder>/client_cert.der)
.option("sslkey", <target-folder>/client_key.pk8)
.option("sslrootcert", <target-folder>/server_ca.der)
.load()
)

Replace:
<source-files> with the list of files in the source directory, for example /dbfs/FileStore/Users/someone@example.com/*.pem .
<target-folder> with the name of the target directory containing the generated PK8 and DER files, for
example /dbfs/databricks/driver/ssl .
<connection-string> with the JDBC URL connection string to the database.
<table-name> with the name of the table to use in the database.
<username> and <password> with the username and password to access the database.

Read data from JDBC


This section loads data from a database table. This uses a single JDBC connection to pull the table into the Spark
environment. For parallel reads, see Manage parallelism.

val employees_table = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)

Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.

employees_table.printSchema

You can run queries against this JDBC table:

display(employees_table.select("age", "salary").groupBy("age").avg("salary"))

Write data to JDBC


This section shows how to write data to a database from an existing Spark SQL table named diamonds .

select * from diamonds limit 5

The following code saves the data into a database table named diamonds . Using column names that are
reserved keywords can trigger an exception. The example table has a column named table , so you can rename it
with withColumnRenamed() prior to pushing it to the JDBC API.

spark.table("diamonds").withColumnRenamed("table", "table_number")
.write
.jdbc(jdbcUrl, "diamonds", connectionProperties)

Spark automatically creates a database table with the appropriate schema determined from the DataFrame
schema.
The default behavior is to create a new table and to throw an error message if a table with the same name
already exists. You can use the Spark SQL SaveMode feature to change this behavior. For example, here’s how to
append more rows to the table:

import org.apache.spark.sql.SaveMode

spark.sql("select * from diamonds limit 10").withColumnRenamed("table", "table_number")
  .write
  .mode(SaveMode.Append) // <--- Append to the existing table
  .jdbc(jdbcUrl, "diamonds", connectionProperties)

You can also overwrite an existing table:


spark.table("diamonds").withColumnRenamed("table", "table_number")
.write
.mode(SaveMode.Overwrite) // <--- Overwrite the existing table
.jdbc(jdbcUrl, "diamonds", connectionProperties)

Push down a query to the database engine


You can push down an entire query to the database and return just the result. The table parameter identifies
the JDBC table to read. You can use anything that is valid in a SQL query FROM clause.

// Note: The parentheses are required.
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)

Push down optimization


In addition to ingesting an entire table, you can push down a query to the database to leverage it for processing,
and return only the results.

// Explain plan with no column selection returns all columns
spark.read.jdbc(jdbcUrl, "diamonds", connectionProperties).explain(true)

You can prune columns and pushdown query predicates to the database with DataFrame methods.

// Explain plan with column selection will prune columns and just return the ones specified
// Notice that only the 3 specified columns are in the explain plan
spark.read.jdbc(jdbcUrl, "diamonds", connectionProperties).select("carat", "cut", "price").explain(true)

// You can push query predicates down too
// Notice the filter at the top of the physical plan
spark.read.jdbc(jdbcUrl, "diamonds", connectionProperties).select("carat", "cut", "price").where("cut = 'Good'").explain(true)

Manage parallelism
In the Spark UI, you can see that the number of partitions dictates the number of tasks that are launched. Each
task is spread across the executors, which can increase the parallelism of the reads and writes through the JDBC
interface. See the Spark SQL programming guide for other parameters, such as fetchsize , that can help with
performance.
You can use two DataFrameReader APIs to specify partitioning:

jdbc(url:String,table:String,partitionColumn:String,lowerBound:Long,upperBound:Long,numPartitions:Int,...)
takes the name of a numeric, date, or timestamp column ( partitionColumn ), two range endpoints (
lowerBound , upperBound ) and a target numPartitions and generates Spark tasks by evenly splitting the
specified range into numPartitions tasks. This works well if your database table has an indexed numeric
column with fairly evenly-distributed values, such as an auto-incrementing primary key; it works somewhat
less well if the numeric column is extremely skewed, leading to imbalanced tasks.
jdbc(url:String,table:String,predicates:Array[String],...) accepts an array of WHERE conditions that can
be used to define custom partitions: this is useful for partitioning on non-numeric columns or for dealing
with skew (see the sketch after this list). When defining custom partitions, remember to consider NULL when
the partition columns are Nullable. Don't manually define partitions using more than two columns, since
writing the boundary predicates requires much more complex logic.
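For example, a hedged Python sketch of the predicates variant; the emp_no boundaries are illustrative, and jdbcUrl and connectionProperties are assumed to be defined as in the Python example later in this article:

Python

# Each predicate defines one partition; together they should cover every row exactly once.
predicates = [
  "emp_no < 10003",
  "emp_no >= 10003 AND emp_no < 10006",
  "emp_no >= 10006 OR emp_no IS NULL"  # remember NULL when the column is nullable
]

df = spark.read.jdbc(url=jdbcUrl, table="employees",
                     predicates=predicates,
                     properties=connectionProperties)
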
JDBC reads
You can provide split boundaries based on the dataset’s column values.
These options specify the parallelism on read. These options must all be specified if any of them is specified.
lowerBound and upperBound decide the partition stride, but do not filter the rows in table. Therefore, Spark
partitions and returns all rows in the table.
The following example splits the table read across executors on the emp_no column using the columnName ,
lowerBound , upperBound , and numPartitions parameters.

val df = (spark.read.jdbc(url=jdbcUrl,
table="employees",
columnName="emp_no",
lowerBound=1L,
upperBound=100000L,
numPartitions=100,
connectionProperties=connectionProperties))
display(df)

JDBC writes
Spark’s partitions dictate the number of connections used to push data through the JDBC API. You can control
the parallelism by calling coalesce(<N>) or repartition(<N>) depending on the existing number of partitions.
Call coalesce when reducing the number of partitions, and repartition when increasing the number of
partitions.

import org.apache.spark.sql.SaveMode

val df = spark.table("diamonds")
println(df.rdd.partitions.length)

// Given the number of partitions above, you can reduce the partition value by calling coalesce() or
// increase it by calling repartition() to manage the number of connections.
df.repartition(10).write.mode(SaveMode.Append).jdbc(jdbcUrl, "diamonds", connectionProperties)
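
If the DataFrame already has more partitions than the database can comfortably serve, a coalesce-based variant of the same write looks like the following sketch; the partition count of 8 is only illustrative.

// Reduce to 8 partitions (illustrative value) to cap the number of concurrent JDBC connections during the write.
df.coalesce(8).write.mode("append").jdbc(jdbcUrl, "diamonds", connectionProperties)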

Python example
The following Python examples cover some of the same tasks as those provided for Scala.
Create the JDBC URL

jdbcHostname = "<hostname>"
jdbcDatabase = "employees"
jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname, jdbcPort,
jdbcDatabase, username, password)

You can pass in a dictionary that contains the credentials and driver class similar to the preceding Scala example.
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

Push down a query to the database engine

pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"


df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)

Read from JDBC connections across multiple workers

df = spark.read.jdbc(url=jdbcUrl, table="employees", column="emp_no", lowerBound=1, upperBound=100000,


numPartitions=100)
display(df)

Spark SQL example


To define a Spark SQL table or view that uses a JDBC connection you must first register the JDBC table as a
Spark data source table or a temporary view.
For details, see
Databricks Runtime 7.x and above: CREATE TABLE [USING] and CREATE VIEW
Databricks Runtime 5.5 LTS and 6.x: Create Table and Create View

CREATE TABLE <jdbcTable>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
  dbtable "<jdbcDatabase>.atable",
  user "<jdbcUsername>",
  password "<jdbcPassword>"
)

Append data to the table


Append data into the table using Spark SQL:

INSERT INTO diamonds
SELECT * FROM diamonds LIMIT 10 -- append 10 records to the table

SELECT count(*) record_count FROM diamonds --count increased by 10

Overwrite data in the table


Overwrite data in the table using Spark SQL. This causes the database to drop and create the diamonds table:

INSERT OVERWRITE TABLE diamonds
SELECT carat, cut, color, clarity, depth, `table` AS table_number, price, x, y, z FROM diamonds

SELECT count(*) record_count FROM diamonds --count returned to original value (10 less)
Create a view on the table
Create a view on the table using Spark SQL.

CREATE OR REPLACE VIEW pricey_diamonds
AS SELECT * FROM diamonds WHERE price > 5000;

Optimize performance when reading data


If you’re attempting to read data from an external JDBC database and it’s slow, this section contains some
suggestions to improve performance.
Determine whether the JDBC read is occurring in parallel
In order to read data in parallel, the Spark JDBC data source must be configured with appropriate partitioning
information so that it can issue multiple concurrent queries to the external database. If you neglect to configure
partitioning, all data will be fetched on the driver using a single JDBC query which runs the risk of causing the
driver to throw an OOM exception.
Here’s an example of a JDBC read without partitioning configured:
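
A minimal sketch, reusing the jdbcUrl and connectionProperties from the earlier examples:

// Single JDBC query, no partitioning: the whole table is read by one task.
val dfNoPartitions = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)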

Here’s an example of a JDBC read with partitioning configured: it specifies the partition column (passed as
columnName ), two range endpoints ( lowerBound , upperBound ), and the numPartitions parameter, which sets
the maximum number of partitions.
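
A minimal sketch, again assuming the employees table and connection properties from the earlier examples; the bounds and partition count are illustrative.

// Partitioned read: up to 100 tasks (illustrative value) over the numeric emp_no column.
val dfPartitioned = spark.read.jdbc(
  url = jdbcUrl,
  table = "employees",
  columnName = "emp_no",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 100,
  connectionProperties = connectionProperties)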
Tune the JDBC fetchSize parameter
JDBC drivers have a fetchSize parameter that controls the number of rows fetched at a time from the remote
JDBC database. If this value is set too low then your workload may become latency-bound due to a high number
of roundtrip requests between Spark and the external database in order to fetch the full result set. If this value is
too high you risk OOM exceptions. The optimal value will be workload dependent (since it depends on the result
schema, sizes of strings in results, and so on), but increasing it even slightly from the default can result in huge
performance gains.
Oracle’s default fetchSize is 10. Increasing it even slightly, to 100, gives massive performance gains, and going
up to a higher value, like 2000, gives an additional improvement. For example:

PreparedStatement stmt = null;
ResultSet rs = null;

try {
  stmt = conn.prepareStatement("select a, b, c from table");
  stmt.setFetchSize(100);

  rs = stmt.executeQuery();
  while (rs.next()) {
    ...
  }
}

See Make your java run faster for a more general discussion of this tuning parameter for Oracle JDBC drivers.
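
At the Spark level, the same knob is exposed through the fetchsize entry in the JDBC connection properties. A minimal sketch, assuming the connectionProperties object from the earlier Scala examples and an illustrative value of 100:

// Set the JDBC fetch size (illustrative value) to reduce round trips to the remote database.
connectionProperties.put("fetchsize", "100")
val dfWithFetchSize = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)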
Consider the impact of indexes
If you are reading in parallel (using one of the partitioning techniques) Spark issues concurrent queries to the
JDBC database. If these queries end up requiring full table scans this could end up bottlenecking in the remote
database and become extremely slow. Thus you should consider the impact of indexes when choosing a
partitioning column and pick a column such that the individual partitions’ queries can be executed reasonably
efficiently in parallel.

IMPORTANT
Make sure that the database has an index on the partitioning column.

When a single-column index is not defined on the source table, you can still choose the leading (leftmost)
column in a composite index as the partitioning column. When only composite indexes are available, most
databases can use a concatenated index when searching with the leading (leftmost) columns. Thus, the leading
column in a multi-column index can also be used as a partitioning column.
Consider whether the number of partitions is appropriate
Using too many partitions when reading from the external database risks overloading that database with too
many queries. Most database systems limit the number of concurrent connections. As a starting point, aim to have
the number of partitions be close to the number of cores or task slots in your Spark cluster in order to maximize
parallelism but keep the total number of queries capped at a reasonable limit. If you need lots of parallelism
after fetching the JDBC rows (because you’re doing something CPU bound in Spark) but don’t want to issue too
many concurrent queries to your database then consider using a lower numPartitions for the JDBC read and
then doing an explicit repartition() in Spark.
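
A minimal sketch of this pattern, with illustrative partition counts and reusing the earlier jdbcUrl and connectionProperties:

// Read with a modest number of JDBC partitions to limit concurrent queries against the database,
// then repartition in Spark for CPU-bound downstream work.
val dfJdbc = spark.read.jdbc(
  url = jdbcUrl,
  table = "employees",
  columnName = "emp_no",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 8,
  connectionProperties = connectionProperties)
val dfWide = dfJdbc.repartition(64)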
Consider database-specific tuning techniques
The database vendor may have a guide on tuning performance for ETL and bulk access workloads.
SQL Databases using the Apache Spark connector
7/21/2022 • 2 minutes to read

The Apache Spark connector for Azure SQL Database and SQL Server enables these databases to act as input
data sources and output data sinks for Apache Spark jobs. It allows you to use real-time transactional data in big
data analytics and persist results for ad-hoc queries or reporting.
Compared to the built-in JDBC connector, this connector provides the ability to bulk insert data into SQL
databases, which can be 10x to 20x faster than row-by-row insertion. The Spark connector for
SQL Server and Azure SQL Database also supports Azure Active Directory (Azure AD) authentication, enabling
you to connect securely to your Azure SQL databases from Azure Databricks using your Azure AD account. It
provides interfaces that are similar to the built-in JDBC connector. It is easy to migrate your existing Spark jobs
to use this connector.

Requirements
There are two versions of the Spark connector for SQL Server: one for Spark 2.4 and one for Spark 3.x. The
Spark 3.x connector requires Databricks Runtime 7.x or above. The connector is community-supported and does
not include Microsoft SLA support. File any issues on GitHub to engage the community for help.

COMPONENT                              VERSIONS SUPPORTED

Apache Spark                           3.0.x and 2.4.x

Databricks Runtime                     Apache Spark 3.0 connector: Databricks Runtime 7.x and above
                                       Apache Spark 2.4 connector: Databricks Runtime 5.5 LTS and above

Scala                                  Apache Spark 3.0 connector: 2.12
                                       Apache Spark 2.4 connector: 2.11

Microsoft JDBC Driver for SQL Server   8.2

Microsoft SQL Server                   SQL Server 2008 and above

Azure SQL Database                     Supported

Use the Spark connector


For instructions on using the Spark connector, see Apache Spark connector: SQL Server & Azure SQL.
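
As a rough illustration only (the linked article is the authoritative reference), a bulk write with the Spark 3.x connector might look like the following sketch. The format name and option keys come from the connector's public documentation; the server, database, table, and credential values are placeholders.

// Bulk-insert a DataFrame into Azure SQL Database or SQL Server using the community Spark connector.
df.write
  .format("com.microsoft.sqlserver.jdbc.spark")
  .mode("append")
  .option("url", "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<database-name>")
  .option("dbtable", "<schema>.<table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .save()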
Accessing Azure Data Lake Storage Gen2 and Blob
Storage with Azure Databricks
7/21/2022 • 3 minutes to read

Use the Azure Blob Filesystem driver (ABFS) to connect to Azure Blob Storage and Azure Data Lake Storage
Gen2 from Azure Databricks. Databricks recommends securing access to Azure storage containers by using
Azure service principals set in cluster configurations.
This article details how to access Azure storage containers using:
Azure service principals
SAS tokens
Account keys
You will set Spark properties to configure these credentials for a compute environment, either:
Scoped to an Azure Databricks cluster
Scoped to an Azure Databricks notebook
Azure service principals can also be used to access Azure storage from Databricks SQL; see Configure access to
cloud storage.
Databricks recommends using secret scopes for storing all credentials.

Deprecated patterns for storing and accessing data from Azure


Databricks
Databricks no longer recommends mounting external data locations to the Databricks Filesystem; see Mounting
cloud object storage on Azure Databricks.
Databricks no longer recommends Access Azure Data Lake Storage using Azure Active Directory credential
passthrough.
The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over
WASB; see Azure documentation on ABFS. For documentation for working with the legacy WASB driver, see
Connect to Azure Blob Storage with WASB (legacy).
Azure has announced the pending retirement of Azure Data Lake Storage Gen1. Azure Databricks recommends
migrating all Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. If you have not yet migrated, see
Accessing Azure Data Lake Storage Gen1 from Azure Databricks.

Direct access using ABFS URI for Blob Storage or Azure Data Lake
Storage Gen2
If you have properly configured credentials to access your Azure storage container, you can interact with
resources in the storage account using URIs. Databricks recommends using the abfss driver for greater
security.
spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

CREATE TABLE <database-name>.<table-name>;

COPY INTO <database-name>.<table-name>
FROM 'abfss://container@storageAccount.dfs.core.windows.net/path/to/folder'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true');

Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth
2.0 with an Azure service principal
You can securely access data in an Azure storage account using OAuth 2.0 with an Azure Active Directory (Azure
AD) application service principal for authentication; see Configure access to Azure storage with an Azure Active
Directory service principal.

service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Replace
<scope> with the Databricks secret scope name.
<service-credential-key> with the name of the key containing the client secret.
<storage-account> with the name of the Azure storage account.
<application-id> with the Application (client) ID for the Azure Active Directory application.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.

Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS
token
You can use storage shared access signatures (SAS) to access an Azure Data Lake Storage Gen2 storage account
directly. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access
control.
You can configure SAS tokens for multiple storage accounts in the same Spark session.

NOTE
SAS support is available in Databricks Runtime 7.5 and above.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")

Access Azure Data Lake Storage Gen2 or Blob Storage using the
account key
You can use storage account access keys to manage access to Azure Storage.

spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))

Replace
<storage-account> with the Azure Storage account name.
<scope> with the Azure Databricks secret scope name.
<storage-account-access-key> with the name of the key containing the Azure storage account access key.

Example notebook
This notebook demonstrates using a service principal to:
1. Authenticate to an ADLS Gen2 storage account.
2. Mount a filesystem in the storage account.
3. Write a JSON file containing Internet of things (IoT) data to the new container.
4. List files using direct access and through the mount point.
5. Read and display the IoT file using direct access and through the mount point.
ADLS Gen2 OAuth 2.0 with Azure service principals notebook
Get notebook

Azure Data Lake Storage Gen2 FAQs and known issues


See Azure Data Lake Storage Gen2 frequently asked questions and known issues.
Azure Data Lake Storage Gen2 frequently asked
questions and known issues
7/21/2022 • 2 minutes to read

Frequently asked questions (FAQ)


Can I use the abfs scheme to access Azure Data Lake Storage Gen2?
Yes. However, Databricks recommends that you use the abfss scheme, which uses SSL encrypted access. You
must use abfss with OAuth or Azure Active Directory-based authentication because of the requirement for
secure transport of Azure AD tokens.
When I accessed an Azure Data Lake Storage Gen2 account with the hierarchical namespace enabled, I
experienced a java.io.FileNotFoundException error, and the error message includes FilesystemNotFound .
If the error message includes the following information, it is because your command is trying to access a Blob
storage container created through the Azure portal:

StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.

When a hierarchical namespace is enabled, you do not need to create containers through Azure portal. If you
see this issue, delete the Blob container through Azure portal. After a few minutes, you will be able to access the
container. Alternatively, you can change your abfss URI to use a different container, as long as this container is
not created through Azure portal.

Known issues
See Known issues with Azure Data Lake Storage Gen2 in the Microsoft documentation.
Accessing Azure Data Lake Storage Gen1 from
Azure Databricks
7/21/2022 • 3 minutes to read

Microsoft has announced the planned retirement of Azure Data Lake Storage Gen1 (formerly Azure Data Lake
Store, also known as ADLS) and recommends all users migrate to Azure Data Lake Storage Gen2. Databricks
recommends upgrading to Azure Data Lake Storage Gen2 for best performance and new features.
There are two ways of accessing Azure Data Lake Storage Gen1:
1. Pass your Azure Active Directory credentials, also known as credential passthrough.
2. Use a service principal directly.

Access automatically with your Azure Active Directory credentials


You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the
same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable
your cluster for Azure AD credential passthrough, commands that you run on that cluster will be able to read
and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal
credentials for access to storage.
For complete setup and usage instructions, see Access Azure Data Lake Storage using Azure Active Directory
credential passthrough.

Create and grant permissions to service principal


If your selected access method requires a service principal with adequate permissions, and you do not have one,
follow these steps:
1. Create an Azure AD application and service principal that can access resources. Note the following
properties:
application-id : An ID that uniquely identifies the client application.
directory-id : An ID that uniquely identifies the Azure AD instance.
service-credential : A string that the application uses to prove its identity.
2. Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data
Lake Storage Gen1 account.

Access directly with Spark APIs using a service principal and OAuth
2.0
To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials
with the following snippet in your notebook:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-
service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
where
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your storage account access key
that has been stored as a secret in a secret scope.
After you’ve set up your credentials, you can use standard Spark and Databricks APIs to access the resources.
For example:

val df = spark.read.format("parquet").load("adl://<storage-resource>.azuredatalakestore.net/<directory-
name>")

dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

Azure Data Lake Storage Gen1 provides directory level access control, so the service principal must have access
to the directories that you want to read from as well as the Azure Data Lake Storage Gen1 resource.
Access through metastore
To access adl:// locations specified in the metastore, you must specify Hadoop credential configuration
options as Spark options when you create the cluster by adding the spark.hadoop. prefix to the corresponding
Hadoop configuration keys to propagate them to the Hadoop configurations used by the metastore:

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token

WARNING
These credentials are available to all users who access the cluster.

Mount Azure Data Lake Storage Gen1 resource or folder


To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following command:
Python

configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
           "fs.adl.oauth2.client.id": "<application-id>",
           "fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
           "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> "<application-id>",
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

where
<mount-name>is a DBFS path that represents where the account or a folder inside it (specified in source ) will
be mounted in DBFS.

Access files in your container as if they were local files, for example:
Python

df = spark.read.format("text").load("/mnt/<mount-name>/....")
df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

Scala

val df = spark.read.format("text").load("/mnt/<mount-name>/....")
val df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

## <a id="mount-adls"> </a><a id="set-up-service-credentials-for-multiple-accounts"> </a>Set up service


credentials for multiple accounts

You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts for use within in a
single Spark session by adding ``account.<account-name>`` to the configuration keys. For example, if you
want to set up credentials for both the accounts to access ``adl://example1.azuredatalakestore.net`` and
``adl://example2.azuredatalakestore.net``, you can do this as follows:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")

spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<application-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key
= "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-
id-example1>/oauth2/token")

spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<application-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key
= "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-
id-example2>/oauth2/token")

This also works for the cluster Spark configuration:


spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential

spark.hadoop.fs.adl.account.example1.oauth2.client.id <application-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token

spark.hadoop.fs.adl.account.example2.oauth2.client.id <application-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token

The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.
ADLS Gen1 service principal notebook
Get notebook
Connect to Azure Blob Storage with WASB (legacy)
7/21/2022 • 4 minutes to read

Microsoft has deprecated the Windows Azure Storage Blob driver (WASB) for Azure Blob Storage in favor of the
Azure Blob Filesystem driver (ABFS); see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks. ABFS has numerous benefits over WASB; see Azure documentation on ABFS.
This article provides documentation for maintaining code that uses the WASB driver. Databricks recommends
using ABFS for all connections to Azure Blob Storage.

Configure WASB credentials in Databricks


The WASB driver allows you to use either a storage account access key or a Shared Access Signature (SAS). (If
you are reading data from a public storage account, you do not need to configure credentials).
Databricks recommends using secrets whenever you need to pass credentials in Azure Databricks. Secrets are
available to all users with access to the containing secret scope.
You can pass credentials:
Scoped to the cluster in the Spark configuration
Scoped to the notebook
Attached to a mounted directory
Databricks recommends upgrading all your connections to use ABFS to access Azure Blob Storage, which
provides similar access patterns as WASB. Use ABFS for the best security and performance when interacting
with Azure Blob Storage.
To configure cluster credentials, set Spark configuration properties when you create the cluster. Credentials set
at the cluster level are available to all users with access to that cluster.
To configure notebook-scoped credentials, use spark.conf.set() . Credentials passed at the notebook level are
available to all users with access to that notebook.

Setting Azure Blob Storage credentials with a storage account access key
A storage account access key grants full access to all containers within a storage account. While this pattern is
useful for prototyping, avoid using it in production to reduce risks associated with granting unrestricted access
to production data.

spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-access-key>"
)

You can upgrade account key URIs to use ABFS. For more information, see Access Azure Data Lake Storage Gen2
or Blob Storage using the account key.

Setting Azure Blob Storage credentials with a Shared Access Signature (SAS)
You can use SAS tokens to configure limited access to a single container in a storage account that expires at a
specific time.

spark.conf.set(
  "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
  "<sas-token-for-container>"
)

Access Azure Blob Storage using the DataFrame API


The Apache Spark DataFrame API can use credentials configured at either the notebook or cluster level. All
WASB driver URIs specify the container and storage account names. The directory name is optional, and can
specify multiple nested directories relative to the container.

wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>

The following code examples show how you can use the DataFrames API and Databricks Utilities to interact with
a named directory within a container.

df = spark.read.format("parquet").load("wasbs://<container-name>@<storage-account-
name>.blob.core.windows.net/<directory-name>")

dbutils.fs.ls("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")

To use ABFS instead of WASB, update your URIs. For more information, see Direct access using ABFS URI for
Blob Storage or Azure Data Lake Storage Gen2.

Access Azure Blob Storage with SQL


Credentials set in a notebook’s session configuration are not accessible to notebooks running Spark SQL.
After an account access key or a SAS is set up in your cluster configuration, you can use standard Spark SQL
queries with Azure Blob Storage:

-- SQL
CREATE DATABASE <db-name>
LOCATION "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/";

To use ABFS instead of WASB, update your URIs; see Direct access using ABFS URI for Blob Storage or Azure
Data Lake Storage Gen2.

Mount Azure Blob Storage containers to DBFS


You can mount an Azure Blob Storage container or a folder inside a container to DBFS. For Databricks
recommendations, see Mounting cloud object storage on Azure Databricks.
IMPORTANT
Azure Blob storage supports three blob types: block, append, and page. You can only mount block blobs to DBFS.
All users have read and write access to the objects in Blob storage containers mounted to DBFS.
After a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use
the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to
make the newly created mount point available.

DBFS uses the credential that you provide when you create the mount point to access the mounted Blob storage
container. If a Blob storage container is mounted using a storage account access key, DBFS uses temporary SAS
tokens derived from the storage account key when it accesses this mount point.
Mount an Azure Blob storage container
Databricks recommends using ABFS instead of WASB. For more information about mounting with ABFS, see:
Mount ADLS Gen2 or Blob Storage with ABFS.
1. To mount a Blob storage container or a folder inside a container, use the following command:
Python

dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})

Scala

dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))

where
<storage-account-name> is the name of your Azure Blob storage account.
<container-name> is the name of a container in your Azure Blob storage account.
<mount-name> is a DBFS path representing where the Blob storage container or a folder inside the
container (specified in source ) will be mounted in DBFS.
<conf-key> can be either fs.azure.account.key.<storage-account-name>.blob.core.windows.net or
fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") gets the key that has been stored as
a secret in a secret scope.
2. Access files in your container as if they were local files, for example:
Python

# python
df = spark.read.format("text").load("/mnt/<mount-name>/...")
df = spark.read.format("text").load("dbfs:/<mount-name>/...")

Scala
// scala
val df = spark.read.format("text").load("/mnt/<mount-name>/...")
val df = spark.read.format("text").load("dbfs:/<mount-name>/...")

SQL

-- SQL
CREATE DATABASE <db-name>
LOCATION "/mnt/<mount-name>"
Azure Cosmos DB
7/21/2022 • 2 minutes to read

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. Azure Cosmos DB enables you to
elastically and independently scale throughput and storage across any number of Azure’s geographic regions. It
offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements
(SLAs). Azure Cosmos DB provides APIs for the following data models, with SDKs available in multiple
languages:
SQL API
MongoDB API
Cassandra API
Graph (Gremlin) API
Table API
This article explains how to read data from and write data to Azure Cosmos DB using Azure Databricks. For
the most up-to-date details about Azure Cosmos DB, see Accelerate big data analytics by using the Apache
Spark to Azure Cosmos DB connector.

IMPORTANT
This connector supports the core (SQL) API of Azure Cosmos DB. For the Cosmos DB for MongoDB API, use the
MongoDB Spark connector. For the Cosmos DB Cassandra API, use the Cassandra Spark connector.

Create and attach required libraries


1. Download the latest azure-cosmosdb-spark library for the version of Apache Spark you are running.
2. Upload the downloaded JAR files to Databricks following the instructions in Upload a Jar, Python egg, or
Python wheel.
3. Install the uploaded libraries into your Databricks cluster.

Use the Azure Cosmos DB Spark connector


The following Scala notebook provides a simple example of how to write data to Cosmos DB and read data from
Cosmos DB. See the Azure Cosmos DB Spark Connector project for detailed
documentation.
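
As a hedged sketch only (the notebook and the connector project are the authoritative references), reading and writing with the azure-cosmosdb-spark library typically uses the com.microsoft.azure.cosmosdb.spark format with a configuration map; the endpoint, key, database, and collection values below are placeholders.

// Connector configuration; all values are placeholders.
val cosmosConfig = Map(
  "Endpoint"   -> "https://<cosmosdb-account>.documents.azure.com:443/",
  "Masterkey"  -> "<cosmosdb-account-key>",
  "Database"   -> "<database-name>",
  "Collection" -> "<collection-name>"
)

// Read a Cosmos DB collection into a DataFrame.
val cosmosDf = spark.read
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(cosmosConfig)
  .load()

// Write (upsert) a DataFrame back to the same collection.
cosmosDf.write
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(cosmosConfig ++ Map("Upsert" -> "true"))
  .mode("append")
  .save()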
Azure Cosmos DB notebook
Get notebook
Azure Synapse Analytics
7/21/2022 • 27 minutes to read

Azure Synapse Analytics (formerly SQL Data Warehouse) is a cloud-based enterprise data warehouse that
leverages massively parallel processing (MPP) to quickly run complex queries across petabytes of data. Use
Azure Synapse as a key component of a big data solution. Import big data into Azure Synapse with simple PolyBase T-SQL
queries, or the COPY statement, and then use the power of MPP to run high-performance analytics. As you integrate
and analyze, the data warehouse will become the single version of truth your business can count on for insights.
You can access Azure Synapse from Azure Databricks using the Azure Synapse connector, a data source
implementation for Apache Spark that uses Azure Blob storage, and PolyBase or the COPY statement in Azure
Synapse to transfer large volumes of data efficiently between an Azure Databricks cluster and an Azure Synapse
instance.
Both the Azure Databricks cluster and the Azure Synapse instance access a common Blob storage container to
exchange data between these two systems. In Azure Databricks, Apache Spark jobs are triggered by the Azure
Synapse connector to read data from and write data to the Blob storage container. On the Azure Synapse side,
data loading and unloading operations performed by PolyBase are triggered by the Azure Synapse connector
through JDBC. In Databricks Runtime 7.0 and above, COPY is used by default to load data into Azure Synapse by
the Azure Synapse connector through JDBC.

NOTE
COPY is available only on Azure Synapse Gen2 instances, which provide better performance. If your database still uses
Gen1 instances, we recommend that you migrate the database to Gen2.

The Azure Synapse connector is more suited to ETL than to interactive queries, because each query execution
can extract large amounts of data to Blob storage. If you plan to perform several queries against the same Azure
Synapse table, we recommend that you save the extracted data in a format such as Parquet.

Requirements
A database master key for the Azure Synapse instance.

Authentication
The Azure Synapse connector uses three types of network connections:
Spark driver to Azure Synapse
Spark driver and executors to Azure storage account
Azure Synapse to Azure storage account
(Diagram: the Spark driver connects to Azure Synapse over JDBC with a username and password or OAuth 2.0;
the Spark driver and executors connect to the Azure storage account with a storage account access key or
OAuth 2.0; Azure Synapse connects to the same storage account with a storage account access key or a
Managed Service Identity.)

The following sections describe each connection’s authentication configuration options.


Spark driver to Azure Synapse
The Spark driver can connect to Azure Synapse using JDBC with a username and password or OAuth 2.0 with a
service principal for authentication.
Username and password
We recommend that you use the connection strings provided by Azure portal for both authentication types,
which enable Secure Sockets Layer (SSL) encryption for all data sent between the Spark driver and the Azure
Synapse instance through the JDBC connection. To verify that the SSL encryption is enabled, you can search for
encrypt=true in the connection string.

To allow the Spark driver to reach Azure Synapse, we recommend that you set Allow access to Azure
ser vices to ON on the firewall pane of the Azure Synapse server through Azure portal. This setting allows
communications from all Azure IP addresses and all Azure subnets, which allows Spark drivers to reach the
Azure Synapse instance.
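
For illustration, a connection string of the kind the Azure portal provides (all values are placeholders) looks roughly like the following sketch and is passed to the connector through the url option; note encrypt=true, which enables SSL encryption.

// Placeholder JDBC connection string for Azure Synapse; copy the real one from the Azure portal.
val synapseJdbcUrl = "jdbc:sqlserver://<server-name>.database.windows.net:1433;" +
  "database=<database-name>;user=<user>@<server-name>;password=<password>;" +
  "encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"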
OAuth 2.0 with a service principal
You can authenticate to Azure Synapse Analytics using a service principal with access to the underlying storage
account. For more information on using service principal credentials to access an Azure storage account, see
Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure Databricks. You must set the
enableServicePrincipalAuth option to true in the connection configuration Parameters to enable the connector
to authenticate with a service principal.
You can optionally use a different service principal for the Azure Synapse Analytics connection. An example that
configures service principal credentials for the storage account and optional service principal credentials for
Synapse:
ini

; Defining the Service Principal credentials for the Azure storage account
fs.azure.account.auth.type OAuth
fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id <application-id>
fs.azure.account.oauth2.client.secret <service-credential>
fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-id>/oauth2/token

; Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the
connector will use the Azure storage account credentials)
spark.databricks.sqldw.jdbc.service.principal.client.id <application-id>
spark.databricks.sqldw.jdbc.service.principal.client.secret <service-credential>

Scala
// Defining the Service Principal credentials for the Azure storage account
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<service-credential>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<directory-
id>/oauth2/token")

// Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the
connector will use the Azure storage account credentials)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", "<application-id>")
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", "<service-credential>")

Python

# Defining the service principal credentials for the Azure storage account
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<service-credential>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<directory-
id>/oauth2/token")

# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the
connector will use the Azure storage account credentials)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", "<application-id>")
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", "<service-credential>")

R

# Load SparkR
library(SparkR)
conf <- sparkR.callJMethod(sparkR.session(), "conf")

# Defining the service principal credentials for the Azure storage account
sparkR.callJMethod(conf, "set", "fs.azure.account.auth.type", "OAuth")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth2.client.id", "<application-id>")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth2.client.secret", "<service-credential>")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")

# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the
connector will use the Azure storage account credentials)
sparkR.callJMethod(conf, "set", "spark.databricks.sqldw.jdbc.service.principal.client.id", "<application-
id>")
sparkR.callJMethod(conf, "set", "spark.databricks.sqldw.jdbc.service.principal.client.secret", "<service-
credential>")

Spark driver and executors to Azure storage account


The Azure storage container acts as an intermediary to store bulk data when reading from or writing to Azure
Synapse. Spark connects to ADLS Gen2 or Blob Storage using the abfss driver.
The following authentication options are available:
Storage account access key and secret
OAuth 2.0 authentication. For more information about OAuth 2.0 and Service Principal, see Configure access
to Azure storage with an Azure Active Directory service principal.
The examples below illustrate the storage account access key approach; the same pattern applies
to OAuth 2.0 configuration.
Notebook session configuration (preferred)
Using this approach, the account access key is set in the session configuration associated with the notebook that
runs the command. This configuration does not affect other notebooks attached to the same cluster. spark is
the SparkSession object provided in the notebook.

spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")

Global Hadoop configuration


This approach updates the global Hadoop configuration associated with the SparkContext object shared by all
notebooks.
Scala

sc.hadoopConfiguration.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")

Python

hadoopConfiguration is not exposed in all versions of PySpark. Although the following command relies on some
Spark internals, it should work with all PySpark versions and is unlikely to break or change in the future:

sc._jsc.hadoopConfiguration().set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")

Azure Synapse to Azure storage account


Azure Synapse also connects to a storage account during loading and unloading of temporary data.
In case you have set up an account key and secret for the storage account, you can set
forwardSparkAzureStorageCredentials to true , in which case Azure Synapse connector automatically discovers
the account access key set in the notebook session configuration or the global Hadoop configuration and
forwards the storage account access key to the connected Azure Synapse instance by creating a temporary
Azure database scoped credential.
Alternatively, if you use ADLS Gen2 + OAuth 2.0 authentication or your Azure Synapse instance is configured to
have a Managed Service Identity (typically in conjunction with a VNet + Service Endpoints setup), you must set
useAzureMSI to true . In this case the connector will specify IDENTITY = 'Managed Service Identity' for the
database scoped credential and no SECRET .
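
For example, a write that relies on Managed Service Identity rather than forwarded storage credentials sets the option as in the following sketch (the other options mirror the batch write examples later in this article, and df is an existing DataFrame):

// Use Managed Service Identity for the Azure Synapse to storage connection.
df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("useAzureMSI", "true")
  .option("dbTable", "<your-table-name>")
  .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
  .save()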

Streaming support
The Azure Synapse connector offers efficient and scalable Structured Streaming write support for Azure Synapse
that provides a consistent user experience with batch writes, and uses PolyBase or COPY for large data transfers
between an Azure Databricks cluster and an Azure Synapse instance. As with batch writes, streaming is
designed largely for ETL, and thus has higher latency that may not be suitable for real-time data processing in
some cases.
Fault tolerance semantics
By default, Azure Synapse Streaming offers end-to-end exactly-once guarantee for writing data into an Azure
Synapse table by reliably tracking progress of the query using a combination of checkpoint location in DBFS,
checkpoint table in Azure Synapse, and locking mechanism to ensure that streaming can handle any types of
failures, retries, and query restarts. Optionally, you can select less restrictive at-least-once semantics for Azure
Synapse Streaming by setting spark.databricks.sqldw.streaming.exactlyOnce.enabled option to false , in which
case data duplication could occur in the event of intermittent connection failures to Azure Synapse or
unexpected query termination.
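
For example, to opt into at-least-once semantics in the notebook session configuration:

// Relax streaming write semantics from exactly-once to at-least-once.
spark.conf.set("spark.databricks.sqldw.streaming.exactlyOnce.enabled", "false")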
Usage (Batch)
You can use this connector via the data source API in Scala, Python, SQL, and R notebooks.
Scala

// Otherwise, set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")

// Get some data from an Azure Synapse table.


val df: DataFrame = spark.read
.format("com.databricks.spark.sqldw")
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", "<your-table-name>")
.load()

// Load data from an Azure Synapse query.


val df: DataFrame = spark.read
.format("com.databricks.spark.sqldw")
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")
.option("forwardSparkAzureStorageCredentials", "true")
.option("query", "select x, count(*) as cnt from table group by x")
.load()

// Apply some transformations to the data, then use the


// Data Source API to write the data back to another table in Azure Synapse.

df.write
.format("com.databricks.spark.sqldw")
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", "<your-table-name>")
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")
.save()

Python
# Otherwise, set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")

# Get some data from an Azure Synapse table.


df = spark.read \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.load()

# Load data from an Azure Synapse query.


df = spark.read \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("query", "select x, count(*) as cnt from table group by x") \
.load()

# Apply some transformations to the data, then use the


# Data Source API to write the data back to another table in Azure Synapse.

df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>") \
.save()

SQL

-- Otherwise, set up the Blob storage account access key in the notebook session conf.
SET fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net=<your-storage-account-access-key>;

-- Read data using SQL.


CREATE TABLE example_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
forwardSparkAzureStorageCredentials 'true',
dbTable '<your-table-name>',
tempDir 'abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-
name>'
);

-- Write data using SQL.


-- Create a new table, throwing an error if a table with the same name already exists:

CREATE TABLE example_table_in_spark_write


USING com.databricks.spark.sqldw
OPTIONS (
url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
forwardSparkAzureStorageCredentials 'true',
dbTable '<your-table-name>',
tempDir 'abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-
name>'
)
AS SELECT * FROM table_to_save_in_spark;
R

# Load SparkR
library(SparkR)

# Otherwise, set up the Blob storage account access key in the notebook session conf.
conf <- sparkR.callJMethod(sparkR.session(), "conf")
sparkR.callJMethod(conf, "set", "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net", "
<your-storage-account-access-key>")

# Get some data from an Azure Synapse table.


df <- read.df(
source = "com.databricks.spark.sqldw",
url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
forward_spark_azure_storage_credentials = "true",
dbTable = "<your-table-name>",
tempDir = "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")

# Load data from an Azure Synapse query.


df <- read.df(
source = "com.databricks.spark.sqldw",
url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
forward_spark_azure_storage_credentials = "true",
query = "select x, count(*) as cnt from table group by x",
tempDir = "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")

# Apply some transformations to the data, then use the


# Data Source API to write the data back to another table in Azure Synapse.

write.df(
df,
source = "com.databricks.spark.sqldw",
url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
forward_spark_azure_storage_credentials = "true",
dbTable = "<your-table-name>",
tempDir = "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-
name>")

Usage (Streaming)
You can write data using Structured Streaming in Scala and Python notebooks.
Scala
// Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")

// Prepare streaming source; this could be Kafka or a simple rate stream.


val df: DataFrame = spark.readStream
.format("rate")
.option("rowsPerSecond", "100000")
.option("numPartitions", "16")
.load()

// Apply some transformations to the data then use


// Structured Streaming API to continuously write the data to a table in Azure Synapse.

df.writeStream
.format("com.databricks.spark.sqldw")
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", "<your-table-name>")
.option("checkpointLocation", "/tmp_checkpoint_location")
.start()

Python

# Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")

# Prepare streaming source; this could be Kafka or a simple rate stream.


df = spark.readStream \
.format("rate") \
.option("rowsPerSecond", "100000") \
.option("numPartitions", "16") \
.load()

# Apply some transformations to the data then use


# Structured Streaming API to continuously write the data to a table in Azure Synapse.

df.writeStream \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("checkpointLocation", "/tmp_checkpoint_location") \
.start()

Configuration
This section describes how to configure write semantics for the connector, required permissions, and
miscellaneous configuration parameters.
In this section:
Supported save modes for batch writes
Supported output modes for streaming writes
Write semantics
Required Azure Synapse permissions for PolyBase
Required Azure Synapse permissions for the COPY statement
Parameters
Query pushdown into Azure Synapse
Temporary data management
Temporary object management
Streaming checkpoint table management
Supported save modes for batch writes
The Azure Synapse connector supports ErrorIfExists , Ignore , Append , and Overwrite save modes with the
default mode being ErrorIfExists . For more information on supported save modes in Apache Spark, see Spark
SQL documentation on Save Modes.
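
For example, overwriting the target table instead of failing when it already exists is a matter of setting the save mode; the following is a sketch based on the batch write examples above, with df an existing DataFrame and the placeholders unchanged.

// Overwrite the Azure Synapse table instead of failing if it already exists.
df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "<your-table-name>")
  .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
  .mode("overwrite")
  .save()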
Supported output modes for streaming writes
The Azure Synapse connector supports Append and Complete output modes for record appends and
aggregations. For more details on output modes and compatibility matrix, see the Structured Streaming guide.
Write semantics

NOTE
COPY is available in Databricks Runtime 7.0 and above.

In addition to PolyBase, the Azure Synapse connector supports the COPY statement. The COPY statement offers
a more convenient way of loading data into Azure Synapse without the need to create an external table, requires
fewer permissions to load data, and improves the performance of data ingestion into Azure Synapse.
By default, the connector automatically discovers the best write semantics ( COPY when targeting an Azure
Synapse Gen2 instance, PolyBase otherwise). You can also specify the write semantics with the following
configuration:
Scala

// Configure the write semantics for Azure Synapse connector in the notebook session conf.
spark.conf.set("spark.databricks.sqldw.writeSemantics", "<write-semantics>")

Python

# Configure the write semantics for Azure Synapse connector in the notebook session conf.
spark.conf.set("spark.databricks.sqldw.writeSemantics", "<write-semantics>")

SQL

-- Configure the write semantics for Azure Synapse connector in the notebook session conf.
SET spark.databricks.sqldw.writeSemantics=<write-semantics>;

R

# Load SparkR
library(SparkR)

# Configure the write semantics for Azure Synapse connector in the notebook session conf.
conf <- sparkR.callJMethod(sparkR.session(), "conf")
sparkR.callJMethod(conf, "set", "spark.databricks.sqldw.writeSemantics", "<write-semantics>")

where <write-semantics> is either polybase to use PolyBase, or copy to use the COPY statement.
Required Azure Synapse permissions for PolyBase
When you use PolyBase, the Azure Synapse connector requires the JDBC connection user to have permission to
run the following commands in the connected Azure Synapse instance:
CREATE DATABASE SCOPED CREDENTIAL
CREATE EXTERNAL DATA SOURCE
CREATE EXTERNAL FILE FORMAT
CREATE EXTERNAL TABLE
As a prerequisite for the first command, the connector expects that a database master key already exists for the
specified Azure Synapse instance. If not, you can create a key using the CREATE MASTER KEY command.
Additionally, to read the Azure Synapse table set through dbTable or tables referred in query , the JDBC user
must have permission to access needed Azure Synapse tables. To write data back to an Azure Synapse table set
through dbTable , the JDBC user must have permission to write to this Azure Synapse table.
The following table summarizes the required permissions for all operations with PolyBase:

OPERATION          PERMISSIONS    PERMISSIONS WHEN USING EXTERNAL DATA SOURCE

Batch write        CONTROL        See Batch write

Streaming write    CONTROL        See Streaming write

Read               CONTROL        See Read

Required Azure Synapse permissions for PolyBase with the external data source option

NOTE
Available in Databricks Runtime 8.4 and above.

You can use PolyBase with a pre-provisioned external data source. See the externalDataSource parameter in
Parameters for more information.
To use PolyBase with a pre-provisioned external data source, the Azure Synapse connector requires the JDBC
connection user to have permission to run the following commands in the connected Azure Synapse instance:
CREATE EXTERNAL FILE FORMAT
CREATE EXTERNAL TABLE
To create an external data source, you should first create a database scoped credential. The following links
describe how to create a scoped credential for service principals and an external data source for an ABFS
location:
CREATE DATABASE SCOPED CREDENTIAL
CREATE EXTERNAL DATA SOURCE

NOTE
The external data source location must point to a container. The connector will not work if the location is a directory in a
container.

The following table summarizes the permissions for PolyBase write operations with the external data source
option:

OPERATION          PERMISSIONS (INSERT INTO AN EXISTING TABLE OR INTO A NEW TABLE)

Batch write        ADMINISTER DATABASE BULK OPERATIONS, INSERT, CREATE TABLE,
                   ALTER ANY SCHEMA, ALTER ANY EXTERNAL DATA SOURCE,
                   ALTER ANY EXTERNAL FILE FORMAT

Streaming write    ADMINISTER DATABASE BULK OPERATIONS, INSERT, CREATE TABLE,
                   ALTER ANY SCHEMA, ALTER ANY EXTERNAL DATA SOURCE,
                   ALTER ANY EXTERNAL FILE FORMAT

The following table summarizes the permissions for PolyBase read operations with the external data source option:

OPERATION          PERMISSIONS

Read               CREATE TABLE, ALTER ANY SCHEMA, ALTER ANY EXTERNAL DATA SOURCE,
                   ALTER ANY EXTERNAL FILE FORMAT

You can use this connector to read via the data source API in Scala, Python, SQL, and R notebooks.
Scala

// Get some data from an Azure Synapse table.
val df: DataFrame = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
  .option("externalDataSource", "<your-pre-provisioned-data-source>")
  .option("dbTable", "<your-table-name>")
  .load()

Python

# Get some data from an Azure Synapse table.
df = spark.read \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>") \
  .option("externalDataSource", "<your-pre-provisioned-data-source>") \
  .option("dbTable", "<your-table-name>") \
  .load()

SQL

-- Read data using SQL.
CREATE TABLE example_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable '<your-table-name>',
  tempDir 'abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>',
  externalDataSource '<your-pre-provisioned-data-source>'
);

R

# Get some data from an Azure Synapse table.
df <- read.df(
  source = "com.databricks.spark.sqldw",
  url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
  forward_spark_azure_storage_credentials = "true",
  dbTable = "<your-table-name>",
  tempDir = "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>",
  externalDataSource = "<your-pre-provisioned-data-source>")

Required Azure Synapse permissions for the COPY statement

NOTE
Available in Databricks Runtime 7.0 and above.

When you use the COPY statement, the Azure Synapse connector requires the JDBC connection user to have
permission to run the following commands in the connected Azure Synapse instance:
COPY INTO
If the destination table does not exist in Azure Synapse, permission to run the following command is required in
addition to the command above:
CREATE TABLE
The following table summarizes the permissions for batch and streaming writes with COPY :

OPERATION          PERMISSIONS (INSERT INTO AN EXISTING TABLE)    PERMISSIONS (INSERT INTO A NEW TABLE)

Batch write        ADMINISTER DATABASE BULK OPERATIONS, INSERT    ADMINISTER DATABASE BULK OPERATIONS, INSERT,
                                                                  CREATE TABLE, ALTER ON SCHEMA :: dbo

Streaming write    ADMINISTER DATABASE BULK OPERATIONS, INSERT    ADMINISTER DATABASE BULK OPERATIONS, INSERT,
                                                                  CREATE TABLE, ALTER ON SCHEMA :: dbo

Parameters
The parameter map or OPTIONS provided in Spark SQL supports the following settings:

dbTable
    Required: Yes, unless query is specified
    Default: No default
    The table to create or read from in Azure Synapse. This parameter is required when saving data back to
    Azure Synapse. You can also use {SCHEMA NAME}.{TABLE NAME} to access a table in a given schema. If the
    schema name is not provided, the default schema associated with the JDBC user is used.
    The previously supported dbtable variant is deprecated and will be ignored in future releases. Use the
    "camel case" name instead.

query
    Required: Yes, unless dbTable is specified
    Default: No default
    The query to read from in Azure Synapse. For tables referred to in the query, you can also use
    {SCHEMA NAME}.{TABLE NAME} to access a table in a given schema. If the schema name is not provided, the
    default schema associated with the JDBC user is used.

user
    Required: No
    Default: No default
    The Azure Synapse username. Must be used in tandem with the password option. Can only be used if the user
    and password are not passed in the URL. Passing both will result in an error.

password
    Required: No
    Default: No default
    The Azure Synapse password. Must be used in tandem with the user option. Can only be used if the user and
    password are not passed in the URL. Passing both will result in an error.

url
    Required: Yes
    Default: No default
    A JDBC URL with sqlserver set as the subprotocol. It is recommended to use the connection string provided
    by the Azure portal. Setting encrypt=true is strongly recommended, because it enables SSL encryption of the
    JDBC connection. If user and password are set separately, you do not need to include them in the URL.

jdbcDriver
    Required: No
    Default: Determined by the JDBC URL's subprotocol
    The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not
    be necessary to specify this option, as the appropriate driver classname should automatically be determined
    by the JDBC URL's subprotocol.
    The previously supported jdbc_driver variant is deprecated and will be ignored in future releases. Use the
    "camel case" name instead.

tempDir
    Required: Yes
    Default: No default
    An abfss URI. We recommend you use a dedicated Blob storage container for the Azure Synapse connector.
    The previously supported tempdir variant is deprecated and will be ignored in future releases. Use the
    "camel case" name instead.

tempFormat
    Required: No
    Default: PARQUET
    The format in which to save temporary files to the blob store when writing to Azure Synapse. Defaults to
    PARQUET ; no other values are allowed right now.

tempCompression
    Required: No
    Default: SNAPPY
    The compression algorithm to be used to encode/decode temporary data by both Spark and Azure Synapse.
    Currently supported values are: UNCOMPRESSED , SNAPPY , and GZIP .

forwardSparkAzureStorageCredentials
    Required: No
    Default: false
    If true , the library automatically discovers the credentials that Spark is using to connect to the Blob
    storage container and forwards those credentials to Azure Synapse over JDBC. These credentials are sent as
    part of the JDBC query. Therefore it is strongly recommended that you enable SSL encryption of the JDBC
    connection when you use this option.
    The current version of the Azure Synapse connector requires (exactly) one of
    forwardSparkAzureStorageCredentials , enableServicePrincipalAuth , or useAzureMSI to be explicitly set to
    true .
    The previously supported forward_spark_azure_storage_credentials variant is deprecated and will be ignored
    in future releases. Use the "camel case" name instead.

useAzureMSI
    Required: No
    Default: false
    If true , the library will specify IDENTITY = 'Managed Service Identity' and no SECRET for the database
    scoped credentials it creates.
    The current version of the Azure Synapse connector requires (exactly) one of
    forwardSparkAzureStorageCredentials , enableServicePrincipalAuth , or useAzureMSI to be explicitly set to
    true .

enableServicePrincipalAuth
    Required: No
    Default: false
    If true , the library will use the provided service principal credentials to connect to the Azure storage
    account and Azure Synapse Analytics over JDBC.
    The current version of the Azure Synapse connector requires (exactly) one of
    forwardSparkAzureStorageCredentials , enableServicePrincipalAuth , or useAzureMSI to be explicitly set to
    true .

tableOptions
    Required: No
    Default: CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN
    A string used to specify table options when creating the Azure Synapse table set through dbTable . This
    string is passed literally to the WITH clause of the CREATE TABLE SQL statement that is issued against
    Azure Synapse.
    The previously supported table_options variant is deprecated and will be ignored in future releases. Use
    the "camel case" name instead.

preActions
    Required: No
    Default: No default (empty string)
    A ; separated list of SQL commands to be executed in Azure Synapse before writing data to the Azure
    Synapse instance. These SQL commands are required to be valid commands accepted by Azure Synapse.
    If any of these commands fail, it is treated as an error and the write operation is not executed.

postActions
    Required: No
    Default: No default (empty string)
    A ; separated list of SQL commands to be executed in Azure Synapse after the connector successfully writes
    data to the Azure Synapse instance. These SQL commands are required to be valid commands accepted by Azure
    Synapse.
    If any of these commands fail, it is treated as an error and you'll get an exception after the data is
    successfully written to the Azure Synapse instance.

maxStrLength
    Required: No
    Default: 256
    StringType in Spark is mapped to the NVARCHAR(maxStrLength) type in Azure Synapse. You can use
    maxStrLength to set the string length for all NVARCHAR(maxStrLength) type columns that are in the table
    with name dbTable in Azure Synapse.
    The previously supported maxstrlength variant is deprecated and will be ignored in future releases. Use
    the "camel case" name instead.

checkpointLocation
    Required: Yes
    Default: No default
    Location on DBFS that will be used by Structured Streaming to write metadata and checkpoint information.
    See Recovering from Failures with Checkpointing in the Structured Streaming programming guide.

numStreamingTempDirsToKeep
    Required: No
    Default: 0
    Indicates how many (latest) temporary directories to keep for periodic cleanup of micro batches in
    streaming. When set to 0 , directory deletion is triggered immediately after a micro batch is committed;
    otherwise the provided number of latest micro batches is kept and the rest of the directories are removed.
    Use -1 to disable periodic cleanup.

applicationName
    Required: No
    Default: Databricks-User-Query
    The tag of the connection for each query. If not specified or the value is an empty string, the default
    value of the tag is added to the JDBC URL. The default value prevents the Azure DB Monitoring tool from
    raising spurious SQL injection alerts against queries.

maxbinlength
    Required: No
    Default: No default
    Controls the column length of BinaryType columns. This parameter is translated as VARBINARY(maxbinlength) .

identityInsert
    Required: No
    Default: false
    Setting to true enables IDENTITY_INSERT mode, which inserts a DataFrame provided value in the identity
    column of the Azure Synapse table.
    See Explicitly inserting values into an IDENTITY column.

externalDataSource
    Required: No
    Default: No default
    A pre-provisioned external data source to read data from Azure Synapse. An external data source can only
    be used with PolyBase and removes the CONTROL permission requirement, since the connector does not need to
    create a scoped credential and an external data source to load data.
    For example usage and the list of permissions required when using an external data source, see Required
    Azure Synapse permissions for PolyBase with the external data source option.

maxErrors
    Required: No
    Default: 0
    The maximum number of rows that can be rejected during reads and writes before the loading operation
    (either PolyBase or COPY) is cancelled. The rejected rows will be ignored. For example, if two out of ten
    records have errors, only eight records will be processed.
    See the REJECT_VALUE documentation in CREATE EXTERNAL TABLE and the MAXERRORS documentation in COPY.

NOTE
tableOptions , preActions , postActions , and maxStrLength are relevant only when writing data from Azure
Databricks to a new table in Azure Synapse.
externalDataSource is relevant only when reading data from Azure Synapse and writing data from Azure Databricks
to a new table in Azure Synapse with PolyBase semantics. While using externalDataSource , you should not specify
other storage authentication types such as forwardSparkAzureStorageCredentials or useAzureMSI .
checkpointLocation and numStreamingTempDirsToKeep are relevant only for streaming writes from Azure
Databricks to a new table in Azure Synapse.
Even though all data source option names are case-insensitive, we recommend that you specify them in “camel case”
for clarity.
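The following sketch shows how a few of these options can be combined in a batch write to a new Azure Synapse table. The URL, table name, tempDir, and option values are placeholders, and df is assumed to be an existing DataFrame:

Python

# Write to a new Azure Synapse table, overriding table options and the string column length.
(df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "<your-table-name>")
  .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
  .option("tableOptions", "CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN")
  .option("maxStrLength", "1024")
  .save())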

Query pushdown into Azure Synapse


The Azure Synapse connector implements a set of optimization rules to push the following operators down into
Azure Synapse:
Filter
Project
Limit

The Project and Filter operators support the following expressions:


Most boolean logic operators
Comparisons
Basic arithmetic operations
Numeric and string casts
For the Limit operator, pushdown is supported only when there is no ordering specified. For example:
SELECT TOP(10) * FROM table , but not SELECT TOP(10) * FROM table ORDER BY col .

NOTE
The Azure Synapse connector does not push down expressions operating on strings, dates, or timestamps.

Query pushdown built with the Azure Synapse connector is enabled by default. You can disable it by setting
spark.databricks.sqldw.pushdown to false .
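For example, to disable pushdown for the current notebook session:

Python

# Disable query pushdown into Azure Synapse for this session.
spark.conf.set("spark.databricks.sqldw.pushdown", "false")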

Temporary data management


The Azure Synapse connector does not delete the temporary files that it creates in the Blob storage container.
Therefore we recommend that you periodically delete temporary files under the user-supplied tempDir
location.
To facilitate data cleanup, the Azure Synapse connector does not store data files directly under tempDir , but
instead creates a subdirectory of the form: <tempDir>/<yyyy-MM-dd>/<HH-mm-ss-SSS>/<randomUUID>/ . You can set up
periodic jobs (using the Azure Databricks jobs feature or otherwise) to recursively delete any subdirectories that
are older than a given threshold (for example, 2 days), with the assumption that there cannot be Spark jobs
running longer than that threshold.
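A minimal sketch of such a cleanup job is shown below. The tempDir path and the two-day threshold are illustrative, and the code assumes the <yyyy-MM-dd> subdirectory layout described above:

Python

import datetime

# Delete connector temp subdirectories older than a cutoff (placeholder path and threshold).
temp_dir = "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>"
cutoff = datetime.date.today() - datetime.timedelta(days=2)

for day_dir in dbutils.fs.ls(temp_dir):
    try:
        # Top-level subdirectory names follow the <yyyy-MM-dd>/ pattern.
        day = datetime.date.fromisoformat(day_dir.name.strip("/"))
    except ValueError:
        continue  # skip anything that does not match the date layout
    if day < cutoff:
        dbutils.fs.rm(day_dir.path, True)  # recursive delete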
A simpler alternative is to periodically drop the whole container and create a new one with the same name. This
requires that you use a dedicated container for the temporary data produced by the Azure Synapse connector
and that you can find a time window in which you can guarantee that no queries involving the connector are
running.
Temporary object management
The Azure Synapse connector automates data transfer between an Azure Databricks cluster and an Azure
Synapse instance. For reading data from an Azure Synapse table or query or writing data to an Azure Synapse
table, the Azure Synapse connector creates temporary objects, including DATABASE SCOPED CREDENTIAL ,
EXTERNAL DATA SOURCE , EXTERNAL FILE FORMAT , and EXTERNAL TABLE behind the scenes. These objects live only
throughout the duration of the corresponding Spark job and should automatically be dropped thereafter.
When a cluster is running a query using the Azure Synapse connector, if the Spark driver process crashes or is
forcefully restarted, or if the cluster is forcefully terminated or restarted, temporary objects might not be
dropped. To facilitate identification and manual deletion of these objects, the Azure Synapse connector prefixes
the names of all intermediate temporary objects created in the Azure Synapse instance with a tag of the form:
tmp_databricks_<yyyy_MM_dd_HH_mm_ss_SSS>_<randomUUID>_<internalObject> .

We recommend that you periodically look for leaked objects using queries such as the following:
SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%'
SELECT * FROM sys.external_data_sources WHERE name LIKE 'tmp_databricks_%'
SELECT * FROM sys.external_file_formats WHERE name LIKE 'tmp_databricks_%'
SELECT * FROM sys.external_tables WHERE name LIKE 'tmp_databricks_%'

Streaming checkpoint table management


The Azure Synapse connector does not delete the streaming checkpoint table that is created when a new
streaming query is started. This behavior is consistent with the checkpointLocation on DBFS. Therefore we
recommend that you periodically delete checkpoint tables at the same time as you remove checkpoint locations on
DBFS for queries that will not be run in the future or that have already had their checkpoint location removed.
By default, all checkpoint tables have the name <prefix>_<query_id> , where <prefix> is a configurable prefix
with default value databricks_streaming_checkpoint and query_id is a streaming query ID with _ characters
removed. To find all checkpoint tables for stale or deleted streaming queries, run the query:

SELECT * FROM sys.tables WHERE name LIKE 'databricks_streaming_checkpoint%'

You can configure the prefix with the Spark SQL configuration option
spark.databricks.sqldw.streaming.exactlyOnce.checkpointTableNamePrefix .
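For example (the prefix value below is illustrative):

Python

# Use a custom prefix so checkpoint tables created by this application are easy to identify.
spark.conf.set(
    "spark.databricks.sqldw.streaming.exactlyOnce.checkpointTableNamePrefix",
    "my_app_checkpoint")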

Frequently asked questions (FAQ)


I received an error while using the Azure Synapse connector. How can I tell if this error is from
Azure Synapse or Azure Databricks?
To help you debug errors, any exception thrown by code that is specific to the Azure Synapse connector is
wrapped in an exception extending the SqlDWException trait. Exceptions also make the following distinction:
SqlDWConnectorException represents an error thrown by the Azure Synapse connector
SqlDWSideException represents an error thrown by the connected Azure Synapse instance

What should I do if my query failed with the error “No access key found in the session conf or the
global Hadoop conf”?
This error means that Azure Synapse connector could not find the storage account access key in the notebook
session configuration or global Hadoop configuration for the storage account specified in tempDir . See Usage
(Batch) for examples of how to configure Storage Account access properly. If a Spark table is created using Azure
Synapse connector, you must still provide the storage account access credentials in order to read or write to the
Spark table.
Can I use a Shared Access Signature (SAS) to access the Blob storage container specified by
tempDir ?

Azure Synapse does not support using SAS to access Blob storage. Therefore the Azure Synapse connector does
not support SAS to access the Blob storage container specified by tempDir .
I created a Spark table using Azure Synapse connector with the dbTable option, wrote some data
to this Spark table, and then dropped this Spark table. Will the table created at the Azure Synapse
side be dropped?
No. Azure Synapse is considered an external data source. The Azure Synapse table with the name set through
dbTable is not dropped when the Spark table is dropped.

When writing a DataFrame to Azure Synapse, why do I need to say
.option("dbTable", tableName).save() instead of just .saveAsTable(tableName) ?

That is because we want to make the following distinction clear: .option("dbTable", tableName) refers to the
database (that is, Azure Synapse) table, whereas .saveAsTable(tableName) refers to the Spark table. In fact, you
could even combine the two: df.write. ... .option("dbTable", tableNameDW).saveAsTable(tableNameSpark) which
creates a table in Azure Synapse called tableNameDW and an external table in Spark called tableNameSpark that is
backed by the Azure Synapse table.
WARNING
Beware of the following difference between .save() and .saveAsTable() :
For df.write. ... .option("dbTable", tableNameDW).mode(writeMode).save() , writeMode acts on the Azure
Synapse table, as expected.
For df.write. ... .option("dbTable", tableNameDW).mode(writeMode).saveAsTable(tableNameSpark) ,
writeMode acts on the Spark table, whereas tableNameDW is silently overwritten if it already exists in Azure Synapse.

This behavior is no different from writing to any other data source. It is just a caveat of the Spark DataFrameWriter API.
Binary file
7/21/2022 • 2 minutes to read

Databricks Runtime supports the binary file data source, which reads binary files and converts each file into a
single record that contains the raw content and metadata of the file. The binary file data source produces a
DataFrame with the following columns and possibly partition columns:
path (StringType) : The path of the file.
modificationTime (TimestampType) : The modification time of the file. In some Hadoop FileSystem
implementations, this parameter might be unavailable and the value would be set to a default value.
length (LongType) : The length of the file in bytes.
content (BinaryType) : The contents of the file.

To read binary files, specify the data source format as binaryFile .
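For example, in Python (the path is a placeholder):

Python

# Each file under the path becomes one row with path, modificationTime, length, and content columns.
df = spark.read.format("binaryFile").load("<path-to-dir>")
df.printSchema()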

Images
Databricks recommends that you use the binary file data source to load image data.
In Databricks Runtime 8.4 and above, the Databricks display function supports displaying image data loaded
using the binary data source.
If all the loaded files have a file name with an image extension, image preview is automatically enabled:

df = spark.read.format("binaryFile").load("<path-to-image-dir>")
display(df) # image thumbnails are rendered in the "content" column

Alternatively, you can force the image preview functionality by using the mimeType option with a string value
"image/*" to annotate the binary column. Images are decoded based on their format information in the binary
content. Supported image types are bmp , gif , jpeg , and png . Unsupported files appear as a broken image
icon.
df = spark.read.format("binaryFile").option("mimeType", "image/*").load("<path-to-dir>")
display(df) # unsupported files are displayed as a broken image icon

See Reference solution for image applications for the recommended workflow to handle image data.

Options
To load files with paths matching a given glob pattern while keeping the behavior of partition discovery, you can
use the pathGlobFilter option. The following code reads all JPG files from the input directory with partition
discovery:

df = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").load("<path-to-dir>")

If you want to ignore partition discovery and recursively search files under the input directory, use the
recursiveFileLookup option. This option searches through nested directories even if their names do not follow a
partition naming scheme like date=2019-07-01 . The following code reads all JPG files recursively from the input
directory and ignores partition discovery:

df = spark.read.format("binaryFile") \
.option("pathGlobFilter", "*.jpg") \
.option("recursiveFileLookup", "true") \
.load("<path-to-dir>")

Similar APIs exist for Scala, Java, and R.

NOTE
To improve read performance when you load data back, Azure Databricks recommends turning off compression when you
save data loaded from binary files:

spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")
df.write.format("delta").save("<path-to-table>")
Cassandra
7/21/2022 • 2 minutes to read

The following notebook shows how to connect Cassandra with Azure Databricks.

Connect to Cassandra notebook


Get notebook
Couchbase
7/21/2022 • 2 minutes to read

Couchbase provides an enterprise-class, multi-cloud to edge database that offers the robust capabilities
required for business-critical applications on a highly scalable and available platform.
The following notebook shows how to set up Couchbase with Azure Databricks.

Couchbase notebook
Get notebook
ElasticSearch
7/21/2022 • 2 minutes to read

ElasticSearch is a distributed, RESTful search and analytics engine.


The following notebook shows how to read and write data to ElasticSearch.

ElasticSearch notebook
Get notebook
Image
7/21/2022 • 2 minutes to read

IMPORTANT
Databricks recommends that you use the binary file data source to load image data into the Spark DataFrame as raw
bytes. See Reference solution for image applications for the recommended workflow to handle image data.

The image data source abstracts from the details of image representations and provides a standard API to load
image data. To read image files, specify the data source format as image .

df = spark.read.format("image").load("<path-to-image-data>")

Similar APIs exist for Scala, Java, and R.


You can import a nested directory structure (for example, use a path like /path/to/dir/ ) and you can use
partition discovery by specifying a path with a partition directory (that is, a path like
/path/to/dir/date=2018-01-02/category=automobile ).

Image structure
Image files are loaded as a DataFrame containing a single struct-type column called image with the following
fields:

image: struct containing all the image data


|-- origin: string representing the source URI
|-- height: integer, image height in pixels
|-- width: integer, image width in pixels
|-- nChannels
|-- mode
|-- data

where the fields are:


nChannels: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images
(for example, RGB), and 4 for colored images with alpha channel.
mode : Integer flag that indicates how to interpret the data field. It specifies the data type and channel
order the data is stored in. The value of the field is expected (but not enforced) to map to one of the
OpenCV types displayed in the following table. OpenCV types are defined for 1, 2, 3, or 4 channels and
several data types for the pixel values. Channel order specifies the order in which the colors are stored.
For example, if you have a typical three channel image with red, blue, and green components, there are
six possible orderings. Most libraries use either RGB or BGR. Three (four) channel OpenCV types are
expected to be in BGR(A) order.
Map of Type to Numbers in OpenCV (data types x number of channels)

TYPE      C1    C2    C3    C4

CV_8U     0     8     16    24

CV_8S     1     9     17    25

CV_16U    2     10    18    26

CV_16S    3     11    19    27

CV_32S    4     12    20    28

CV_32F    5     13    21    29

CV_64F    6     14    22    30

data : Image data stored in a binary format. Image data is represented as a 3-dimensional array with the
dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The
array is stored in row-major order.

Display image data


The Databricks display function supports displaying image data. See Images.

Notebook
The following notebook shows how to read and write data to image files.
Image data source notebook
Get notebook

Limitations of image data source


The image data source decodes the image files during the creation of the Spark DataFrame, increases the data
size, and introduces limitations in the following scenarios:
1. Persisting the DataFrame: If you want to persist the DataFrame into a Delta table for easier access, you should
persist the raw bytes instead of the decoded data to save disk space.
2. Shuffling the partitions: Shuffling the decoded image data takes more disk space and network bandwidth,
which results in slower shuffling. You should delay decoding the image as much as possible.
3. Choosing other decoding method: The image data source uses the Image IO library of javax to decode the
image, which prevents you from choosing other image decoding libraries for better performance or
implementing customized decoding logic.
Those limitations can be avoided by using the binary file data source to load image data and decoding only as
needed.
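A minimal sketch of this pattern is shown below. It assumes the Pillow library is installed on the cluster and uses a placeholder path:

Python

import io
from PIL import Image  # assumes Pillow is available on the cluster

# Keep raw bytes in the DataFrame and decode a single image only when it is needed on the driver.
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").load("<path-to-dir>")
row = df.select("content").head()
image = Image.open(io.BytesIO(row["content"]))
print(image.size)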
Hive table
7/21/2022 • 2 minutes to read

This article shows how to import a Hive table from cloud storage into Azure Databricks using an external table.

Step 1: Show the CREATE TABLE statement


Issue a SHOW CREATE TABLE <tablename> command on your Hive command line to see the statement that created
the table.

hive> SHOW CREATE TABLE wikicc;


OK
CREATE TABLE `wikicc`(
`country` string,
`count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'<path-to-table>'
TBLPROPERTIES (
'totalSize'='2335',
'numRows'='240',
'rawDataSize'='2095',
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'transient_lastDdlTime'='1418173653')

Step 2: Issue a CREATE EXTERNAL TABLE statement


If the statement that is returned uses a CREATE TABLE command, copy the statement and replace CREATE TABLE
with CREATE EXTERNAL TABLE .
EXTERNALensures that Spark SQL does not delete your data if you drop the table.
You can omit the TBLPROPERTIES field.

DROP TABLE wikicc

CREATE EXTERNAL TABLE `wikicc`(


`country` string,
`count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'<path-to-table>'
Step 3: Issue SQL commands on your data
SELECT * FROM wikicc
MLflow experiment
7/21/2022 • 2 minutes to read

The MLflow experiment data source provides a standard API to load MLflow experiment run data. You can load
data from the notebook experiment, or you can use the MLflow experiment name or experiment ID.

Requirements
Databricks Runtime 6.0 ML or above.

Load data from the notebook experiment


To load data from the notebook experiment, use load() .
Python

df = spark.read.format("mlflow-experiment").load()
display(df)

Scala

val df = spark.read.format("mlflow-experiment").load()
display(df)

Load data using experiment IDs


To load data from one or more workspace experiments, specify the experiment IDs as shown.
Python

df = spark.read.format("mlflow-experiment").load("3270527066281272")
display(df)

Scala

val df = spark.read.format("mlflow-experiment").load("3270527066281272,953590262154175")
display(df)

Load data using experiment name


You can also pass the experiment name to the load() method.
Python

expId = mlflow.get_experiment_by_name("/Shared/diabetes_experiment/").experiment_id
df = spark.read.format("mlflow-experiment").load(expId)
display(df)

Scala
val expId = mlflow.getExperimentByName("/Shared/diabetes_experiment/").get.getExperimentId
val df = spark.read.format("mlflow-experiment").load(expId)
display(df)

Filter data based on metrics and parameters


The examples in this section show how you can filter data after loading it from an experiment.
Python

df = spark.read.format("mlflow-experiment").load("3270527066281272")
filtered_df = df.filter("metrics.loss < 0.01 AND params.learning_rate > '0.001'")
display(filtered_df)

Scala

val df = spark.read.format("mlflow-experiment").load("3270527066281272")
val filtered_df = df.filter("metrics.loss < 1.85 AND params.num_epochs > '30'")
display(filtered_df)

Schema
The schema of the DataFrame returned by the data source is:

root
|-- run_id: string
|-- experiment_id: string
|-- metrics: map
| |-- key: string
| |-- value: double
|-- params: map
| |-- key: string
| |-- value: string
|-- tags: map
| |-- key: string
| |-- value: string
|-- start_time: timestamp
|-- end_time: timestamp
|-- status: string
|-- artifact_uri: string
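Because metrics and params are map columns, individual keys can be selected with bracket notation. The key names below ("loss", "learning_rate") are illustrative:

Python

from pyspark.sql.functions import col

# Pull individual metric and parameter keys out of the map columns.
df = spark.read.format("mlflow-experiment").load()
summary = df.select(
    "run_id",
    col("metrics")["loss"].alias("loss"),
    col("params")["learning_rate"].alias("learning_rate"))
display(summary)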
MongoDB
7/21/2022 • 2 minutes to read

MongoDB is a document database that stores data in flexible, JSON-like documents.


The following notebook shows you how to read and write data to MongoDB Atlas, the hosted version of
MongoDB, using Apache Spark. The MongoDB Connector for Spark was developed by MongoDB.
You can also access Microsoft Azure CosmosDB using the MongoDB API. See Introduction to Azure Cosmos DB:
MongoDB API.

MongoDB notebook
Get notebook
Neo4j
7/21/2022 • 2 minutes to read

Neo4j is a native graph database that leverages data relationships as first-class entities. You can connect an
Azure Databricks cluster to a Neo4j cluster using the neo4j-spark-connector, which offers Apache Spark APIs for
RDD, DataFrame, and GraphFrames. The neo4j-spark-connector uses the binary Bolt protocol to transfer data to
and from the Neo4j server.
This article describes how to deploy and configure Neo4j, configure Azure Databricks to access Neo4j, and
includes a notebook demonstrating usage.

Neo4j deployment and configuration


You can deploy Neo4j on various cloud providers.
To deploy Neo4j, see the official Neo4j cloud deployment guide. This guide assumes Neo4j 3.2.2 .
Change the Neo4j password from the default (you should be prompted when you first access Neo4j) and
modify conf/neo4j.conf to accept remote connections.

# conf/neo4j.conf

# Bolt connector
dbms.connector.bolt.enabled=true
#dbms.connector.bolt.tls_level=OPTIONAL
dbms.connector.bolt.listen_address=0.0.0.0:7687

# HTTP Connector. There must be exactly one HTTP connector.


dbms.connector.http.enabled=true
#dbms.connector.http.listen_address=0.0.0.0:7474

# HTTPS Connector. There can be zero or one HTTPS connectors.


dbms.connector.https.enabled=true
#dbms.connector.https.listen_address=0.0.0.0:7473

For more information, see Configuring Neo4j Connectors.

Azure Databricks configuration


1. Install two libraries: neo4j-spark-connector and graphframes as Spark Packages. See the libraries guide
for instructions.
2. Create a cluster with these Spark configurations.

spark.neo4j.bolt.url bolt://<ip-of-neo4j-instance>:7687
spark.neo4j.bolt.user <username>
spark.neo4j.bolt.password <password>

3. Import libraries and test the connection.


import org.neo4j.spark._
import org.graphframes._

val neo = Neo4j(sc)

// Dummy Cypher query to check connection


val testConnection = neo.cypher("MATCH (n) RETURN n;").loadRdd[Long]

Neo4j notebook
Get notebook
Avro file
7/21/2022 • 4 minutes to read

Apache Avro is a data serialization system. Avro provides:


Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Simple integration with dynamic languages. Code generation is not required to read or write data files nor to
use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for
statically typed languages.
The Avro data source supports:
Schema conversion: Automatic conversion between Apache Spark SQL and Avro records.
Partitioning: Easily reading and writing partitioned data without any extra configuration.
Compression: Compression to use when writing Avro out to disk. The supported types are uncompressed ,
snappy , and deflate . You can also specify the deflate level.
Record names: Record name and namespace by passing a map of parameters with recordName and
recordNamespace .

Also see Read and write streaming Avro data.

Configuration
You can change the behavior of an Avro data source using various configuration parameters.
To ignore files without the .avro extension when reading, you can set the parameter
avro.mapred.ignore.inputs.without.extension in the Hadoop configuration. The default is false .

spark
.sparkContext
.hadoopConfiguration
.set("avro.mapred.ignore.inputs.without.extension", "true")

To configure compression when writing, set the following Spark properties:


Compression codec: spark.sql.avro.compression.codec . Supported codecs are snappy and deflate . The
default codec is snappy .
If the compression codec is deflate , you can set the compression level with: spark.sql.avro.deflate.level .
The default level is -1 .
You can set these properties in the cluster Spark configuration or at runtime using spark.conf.set() . For
example:

spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")

For Databricks Runtime 9.1 LTS and above, you can change the default schema inference behavior in Avro by
providing the mergeSchema option when reading files. Setting mergeSchema to true will infer a schema from a
set of Avro files in the target directory and merge them rather than infer the read schema from a single file.
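For example, in Python (the path is a placeholder):

Python

# Infer a merged schema from all Avro files in the directory instead of from a single file.
df = spark.read.format("avro").option("mergeSchema", "true").load("<path-to-avro-dir>")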

Supported types for Avro -> Spark SQL conversion


This library supports reading all Avro types. It uses the following mapping from Avro types to Spark SQL types:

AVRO TYPE    SPARK SQL TYPE

boolean BooleanType

int IntegerType

long LongType

float FloatType

double DoubleType

bytes BinaryType

string StringType

record StructType

enum StringType

array ArrayType

map MapType

fixed BinaryType

union See Union types.

Union types
The Avro data source supports reading union types. Avro considers the following three types to be union
types:
union(int, long) maps to LongType .
union(float, double) maps to DoubleType .
union(something, null) , where something is any supported Avro type. This maps to the same Spark SQL
type as that of something , with nullable set to true .
All other union types are complex types. They map to StructType where field names are member0 , member1 ,
and so on, in accordance with members of the union . This is consistent with the behavior when converting
between Avro and Parquet.
Logical types
The Avro data source supports reading the following Avro logical types:
AVRO LOGICAL TYPE    AVRO TYPE    SPARK SQL TYPE

date int DateType

timestamp-millis long TimestampType

timestamp-micros long TimestampType

decimal fixed DecimalType

decimal bytes DecimalType

NOTE
The Avro data source ignores docs, aliases, and other properties present in the Avro file.

Supported types for Spark SQL -> Avro conversion


This library supports writing of all Spark SQL types into Avro. For most types, the mapping from Spark types to
Avro types is straightforward (for example IntegerType gets converted to int ); the following is a list of the few
special cases:

SPARK SQL TYPE    AVRO TYPE    AVRO LOGICAL TYPE

ByteType int

ShortType int

BinaryType bytes

DecimalType fixed decimal

TimestampType long timestamp-micros

DateType int date

You can also specify the whole output Avro schema with the option avroSchema , so that Spark SQL types can be
converted into other Avro types. The following conversions are not applied by default and require user specified
Avro schema:

SPARK SQL TYPE    AVRO TYPE    AVRO LOGICAL TYPE

ByteType fixed

StringType enum

DecimalType bytes decimal

TimestampType long timestamp-millis

Examples
These examples use the episodes.avro file.
Scala

// The Avro records are converted to Spark types, filtered, and


// then written back out as Avro records

val df = spark.read.format("avro").load("/tmp/episodes.avro")
df.filter("doctor > 5").write.format("avro").save("/tmp/output")

This example demonstrates a custom Avro schema:

import org.apache.avro.Schema

val schema = new Schema.Parser().parse(new File("episode.avsc"))

spark
.read
.format("avro")
.option("avroSchema", schema.toString)
.load("/tmp/episodes.avro")
.show()

This example demonstrates Avro compression options:

// configuration to use deflate compression


spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")

val df = spark.read.format("avro").load("/tmp/episodes.avro")

// writes out compressed Avro records


df.write.format("avro").save("/tmp/output")

This example demonstrates partitioned Avro records:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

val df = spark.createDataFrame(
Seq(
(2012, 8, "Batman", 9.8),
(2012, 8, "Hero", 8.7),
(2012, 7, "Robot", 5.5),
(2011, 7, "Git", 2.0))
).toDF("year", "month", "title", "rating")

df.toDF.write.format("avro").partitionBy("year", "month").save("/tmp/output")

This example demonstrates the record name and namespace:

val df = spark.read.format("avro").load("/tmp/episodes.avro")

val name = "AvroTest"


val namespace = "org.foo"
val parameters = Map("recordName" -> name, "recordNamespace" -> namespace)

df.write.options(parameters).format("avro").save("/tmp/output")
Python

# Create a DataFrame from a specified directory


df = spark.read.format("avro").load("/tmp/episodes.avro")

# Saves the subset of the Avro records read in


subset = df.where("doctor > 5")
subset.write.format("avro").save("/tmp/output")

SQL
To query Avro data in SQL, register the data file as a table or temporary view:

CREATE TEMPORARY VIEW episodes


USING avro
OPTIONS (path "/tmp/episodes.avro")

SELECT * from episodes

Notebook
The following notebook demonstrates how to read and write Avro files.
Read and write Avro files notebook
Get notebook
CSV file
7/21/2022 • 3 minutes to read

This article provides examples for reading and writing to CSV files with Azure Databricks using Python, Scala, R,
and SQL.

NOTE
You can use SQL to read CSV data directly or by using a temporary view. Databricks recommends using a temporary view.
Reading the CSV file directly has the following drawbacks:
You can’t specify data source options.
You can’t specify the schema for the data.
See Examples.

Options
You can configure several options for CSV file data sources. See the following Apache Spark reference articles
for supported read and write options.
Read
Python
Scala
Write
Python
Scala

Rescued data column


NOTE
This feature is supported in Databricks Runtime 8.3 (Unsupported) and above.

The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column
contains any data that wasn’t parsed, either because it was missing from the given schema, or because there
was a type mismatch, or because the casing of the column in the record or file didn’t match with that in the
schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the
source file path of the record (the source file path is available in Databricks Runtime 8.3 and above). To remove
the source file path from the rescued data column, you can set the SQL configuration
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false") . You can enable the
rescued data column by setting the option rescuedDataColumn to a column name, such as _rescued_data with
spark.read.option("rescuedDataColumn", "_rescued_data").format("csv").load(<path>) .

The CSV parser supports three modes when parsing records: PERMISSIVE , DROPMALFORMED , and FAILFAST . When
used together with rescuedDataColumn , data type mismatches do not cause records to be dropped in
DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is, incomplete or
malformed CSV—are dropped or throw errors. If you use the option badRecordsPath when parsing CSV, data
type mismatches are not considered as bad records when using the rescuedDataColumn . Only incomplete and
malformed CSV records are stored in badRecordsPath .
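The following sketch combines the rescued data column with badRecordsPath; the paths are placeholders:

Python

# Unparsed values are captured in _rescued_data; incomplete or malformed CSV records go to badRecordsPath.
df = (spark.read
  .option("rescuedDataColumn", "_rescued_data")
  .option("badRecordsPath", "/tmp/bad_records")
  .format("csv")
  .load("<path>"))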

Examples
These examples use the diamonds dataset. Specify the path to the dataset as well as any options that you would
like.
In this section:
Read file in any language
Specify schema
Verify correctness of the data
Pitfalls of reading a subset of columns
Read file in any language
This notebook shows how to read a file, display sample data, and print the data schema using Scala, R, Python,
and SQL.
Read CSV files notebook
Get notebook
Specify schema
When the schema of the CSV file is known, you can specify the desired schema to the CSV reader with the
schema option.

Read CSV files with schema notebook


Get notebook
Verify correctness of the data
When reading CSV files with a specified schema, it is possible that the data in the files does not match the
schema. For example, a field containing the name of a city will not parse as an integer. The consequences depend
on the mode that the parser runs in:
PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly
DROPMALFORMED : drops lines that contain fields that could not be parsed
FAILFAST : aborts the reading if any malformed data is found

To set the mode, use the mode option.

val diamonds_with_wrong_schema_drop_malformed = spark.read.format("csv")
  .option("mode", "PERMISSIVE")
  .schema(schema)  // schema is assumed to be defined as in the previous notebook
  .load("<path-to-diamonds-csv>")

In the PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly. To do that, you can
add _corrupt_record column to the schema.
Find malformed rows notebook
Get notebook
Pitfalls of reading a subset of columns
The behavior of the CSV parser depends on the set of columns that are read. If the specified schema is incorrect,
the results might differ considerably depending on the subset of columns that is accessed. The following
notebook presents the most common pitfalls.
Caveats of reading a subset of columns of a CSV file notebook
Get notebook
JSON file
7/21/2022 • 2 minutes to read

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts
and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.
For further information, see JSON Files.

Options
See the following Apache Spark reference articles for supported read and write options.
Read
Python
Scala
Write
Python
Scala

Rescued data column


NOTE
This feature is supported in Databricks Runtime 8.2 (Unsupported) and above.

The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column
contains any data that wasn’t parsed, either because it was missing from the given schema, or because there
was a type mismatch, or because the casing of the column in the record or file didn’t match with that in the
schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the
source file path of the record (the source file path is available in Databricks Runtime 8.3 and above). To remove
the source file path from the rescued data column, you can set the SQL configuration
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false") . You can enable the
rescued data column by setting the option rescuedDataColumn to a column name, such as _rescued_data with
spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>) .

The JSON parser supports three modes when parsing records: PERMISSIVE , DROPMALFORMED , and FAILFAST .
When used together with rescuedDataColumn , data type mismatches do not cause records to be dropped in
DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is, incomplete or
malformed JSON—are dropped or throw errors. If you use the option badRecordsPath when parsing JSON, data
type mismatches are not considered as bad records when using the rescuedDataColumn . Only incomplete and
malformed JSON records are stored in badRecordsPath .

Examples
Single -line mode
In this example, there is one JSON object per line:
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}

To read the JSON data, use:

val df = spark.read.format("json").load("example.json")

Spark infers the schema automatically.

df.printSchema

root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)

Multi-line mode
This JSON object occupies multiple lines:

[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]

To read this object, enable multi-line mode:


SQL

CREATE TEMPORARY VIEW multiLineJsonTable


USING json
OPTIONS (path="/tmp/multi-line.json",multiline=true)

Scala

val mdf = spark.read.option("multiline", "true").format("json").load("/tmp/multi-line.json")


mdf.show(false)

Charset auto -detection


By default, the charset of input files is detected automatically. You can specify the charset explicitly using the
charset option:

spark.read.option("charset", "UTF-16BE").format("json").load("fileInUTF16.json")

Some supported charsets include: UTF-8 , UTF-16BE , UTF-16LE , UTF-16 , UTF-32BE , UTF-32LE , UTF-32 . For the
full list of charsets supported by Oracle Java SE, see Supported Encodings.

Notebook
The following notebook demonstrates single line and multi-line mode.
Read JSON files notebook
Get notebook
LZO compressed file
7/21/2022 • 2 minutes to read

Due to licensing restrictions, the LZO compression codec is not available by default on Azure Databricks clusters.
To read an LZO compressed file, you must use an init script to install the codec on your cluster at launch time.
This article includes two notebooks:
Init LZO compressed files
Builds the LZO codec.
Creates an init script that:
Installs the LZO compression libraries and the lzop command, and copies the LZO codec to proper
class path.
Configures Spark to use the LZO compression codec.
Read LZO compressed files - Uses the codec installed by the init script.

Init LZO compressed files notebook


Get notebook

Read LZO compressed files notebook


Get notebook
Parquet file
7/21/2022 • 2 minutes to read

Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a far more
efficient file format than CSV or JSON.
For further information, see Parquet Files.

Options
See the following Apache Spark reference articles for supported read and write options.
Read
Python
Scala
Write
Python
Scala
The following notebook shows how to read and write data to Parquet files.
Reading Parquet files notebook
Get notebook
Redis
7/21/2022 • 2 minutes to read

Redis is a popular key-value store that is fast and easy to use. The following notebook shows how to use Redis
with Apache Spark.

Example Notebooks
Redis overview notebook
Get notebook
Snowflake
7/21/2022 • 2 minutes to read

Snowflake is a cloud-based SQL data warehouse. This article explains how to read data from and write data to
Snowflake using the Databricks Snowflake connector.

Snowflake Connector for Spark notebooks


The following notebooks provide simple examples of how to write data to and read data from Snowflake. See
Using the Spark Connector for more details. In particular, see Setting Configuration Options for the Connector
for all configuration options.

TIP
Avoid exposing your Snowflake username and password in notebooks by using Secrets, which are demonstrated in the
notebooks.

In this section:
Snowflake Scala notebook
Snowflake Python notebook
Snowflake R notebook
Snowflake Scala notebook
Get notebook
Snowflake Python notebook
Get notebook
Snowflake R notebook
Get notebook

Train a machine learning model and save results to Snowflake


The following notebook walks through best practices for using the Snowflake Connector for Spark. It writes data
to Snowflake, uses Snowflake for some basic data manipulation, trains a machine learning model in Azure
Databricks, and writes the results back to Snowflake.
Store ML training results in Snowflake notebook
Get notebook

Frequently asked questions (FAQ)


Why don’t my Spark DataFrame columns appear in the same order in Snowflake?
The Snowflake Connector for Spark doesn’t respect the order of the columns in the table being written to; you
must explicitly specify the mapping between DataFrame and Snowflake columns. To specify this mapping, use
the columnmap parameter.
Why is INTEGER data written to Snowflake always read back as DECIMAL ?
Snowflake represents all INTEGER types as NUMBER , which can cause a change in data type when you write data
to and read data from Snowflake. For example, INTEGER data can be converted to DECIMAL when writing to
Snowflake, because INTEGER and DECIMAL are semantically equivalent in Snowflake (see Snowflake Numeric
Data Types).
Why are the fields in my Snowflake table schema always uppercase?
Snowflake uses uppercase fields by default, which means that the table schema is converted to uppercase.
Zip files
7/21/2022 • 2 minutes to read

Hadoop does not have support for zip files as a compression codec. While a text file in GZip, BZip2, and other
supported compression formats can be configured to be automatically decompressed in Apache Spark as long
as it has the right file extension, you must perform additional steps to read zip files.
The following notebooks show how to read zip files. After you download a zip file to a temp directory, you can
invoke the Azure Databricks %sh zip magic command to unzip the file. For the sample file used in the
notebooks, the tail step removes a comment line from the unzipped file.
When you use %sh to operate on files, the results are stored in the directory /databricks/driver . Before you
load the file using the Spark API, you move the file to DBFS using Databricks Utilities.
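For example, a minimal sketch of the move step (the file names are illustrative):

Python

# Move the unzipped file from the driver's local file system to DBFS, then read it with Spark.
dbutils.fs.mv("file:/databricks/driver/<extracted-file>.csv", "dbfs:/tmp/<extracted-file>.csv")
df = spark.read.format("csv").option("header", "true").load("dbfs:/tmp/<extracted-file>.csv")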

Zip files Python notebook


Get notebook

Zip files Scala notebook


Get notebook
Databricks integrations
7/21/2022 • 2 minutes to read

Databricks integrates with a wide range of data sources, developer tools, and partner solutions.
Data sources: Databricks can read data from and write data to a variety of data formats such as CSV, Delta
Lake, JSON, Parquet, XML, and other formats, as well as data storage providers such as Azure Data Lake
Storage, Google BigQuery and Cloud Storage, Snowflake, and other providers.
Developer tools: Databricks supports various developer tools such as DataGrip, IntelliJ, PyCharm, Visual
Studio Code, and others, that allow you to work with data through Azure Databricks clusters and Databricks
SQL warehouses by writing code.
Partner solutions: Databricks has validated integrations with various third-party products such as Fivetran,
Power BI, Tableau, and others, that allow you to work with data through Azure Databricks clusters and SQL
warehouses, in many cases with low-code and no-code experiences. These solutions enable common
scenarios such as data ingestion, data preparation and transformation, business intelligence (BI), and
machine learning. Databricks also provides Partner Connect, a user interface that allows some of these
validated solutions to integrate faster and easier with your Azure Databricks clusters and SQL warehouses.
See also Git integration with Databricks Repos.
This guide covers Databricks partner solutions:
Databricks Partner Connect
Databricks partners
Explore and create tables with the Data tab
7/21/2022 • 3 minutes to read

You can use the Data tab in the Data Science & Engineering workspace to create, view, and delete tables.
Databricks recommends using features in Databricks SQL to complete these tasks; the Data explorer provides an
improved experience for viewing data objects and managing ACLs and the create table UI allows users to easily
ingest small files into Delta Lake.

Requirements
To view and create databases and tables, you must be connected to a running cluster.

View databases and tables


Click Data in the sidebar. Azure Databricks selects a running cluster to which you have access. The
Databases folder displays the list of databases with the default database selected. The Tables folder displays
the list of tables in the default database.

You can change the cluster from the Databases menu, create table UI, or view table UI. For example, from the
Databases menu:
1. Click the drop-down at the top of the Databases folder.
2. Select a cluster.

Import data
If you have small data files on your local machine that you want to analyze with Azure Databricks, you can
import them to DBFS using the UI.
NOTE
This feature may be disabled by admin users. To enable or disable this setting, see Manage data upload.

Upload data to a table with the Create table UI.

Files imported to DBFS using these methods are stored in FileStore.

Create a table
The Create option in the sidebar and the Create Table button in the Data tab both launch the Create Table UI.
You can populate a table from files in DBFS or data stored in any of the supported data sources.

NOTE
When you create a table using the UI, you cannot update the table.

Create a table using the UI


With the UI, you can only create external tables.

1. Click Data in the sidebar. The Databases and Tables folders appear.
2. In the Databases folder, select a database.
3. Above the Tables folder, click Create Table .
4. Choose a data source and follow the steps in the corresponding section to configure the table.
If an Azure Databricks administrator has disabled the Upload File option, you do not have the option to
upload files; you can create tables using one of the other data sources.

Instructions for uploading files
a. Drag files to the Files dropzone or click the dropzone to browse and choose files. After upload, a
path displays for each file. The path will be something like
/FileStore/tables/<filename>-<integer>.<file-type> . You can use this path in a notebook to read
data.

b. Click Create Table with UI .


c. In the Cluster drop-down, choose a cluster.

Instructions for the DBFS data source
a. Select a file.
b. Click Create Table with UI .
c. In the Cluster drop-down, choose a cluster.
5. Click Preview Table to view the table.
6. In the Table Name field, optionally override the default table name. A table name can contain only
lowercase alphanumeric characters and underscores and must start with a lowercase letter or
underscore.
7. In the Create in Database field, optionally override the selected default database.
8. In the File Type field, optionally override the inferred file type.
9. If the file type is CSV:
a. In the Column Delimiter field, select whether to override the inferred delimiter.
b. Indicate whether to use the first row as the column titles.
c. Indicate whether to infer the schema.
10. If the file type is JSON, indicate whether the file is multi-line.
11. Click Create Table .
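
For example, the following is a minimal sketch of reading an uploaded file back into a DataFrame using the path displayed after upload; the file name and reader options here are hypothetical, so substitute the path shown in the UI for your own file.

Python

# Read a file uploaded through the Create Table UI from its FileStore path.
# "/FileStore/tables/my_data-1.csv" is a hypothetical example path.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/FileStore/tables/my_data-1.csv"))

display(df)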
Create a table in a notebook
In the Create New Table UI you can use quickstart notebooks provided by Azure Databricks to connect to any
data source.
DBFS : Click Create Table in Notebook .
Other Data Sources : In the Connector drop-down, select a data source type. Then click Create Table in
Notebook .
View table details
The table details view shows the table schema and sample data.

1. Click Data in the sidebar.


2. In the Databases folder, click a database.
3. In the Tables folder, click the table name.
4. In the Cluster drop-down, optionally select another cluster to render the table preview.

NOTE
To display the table preview, a Spark SQL query runs on the cluster selected in the Cluster drop-down. If the
cluster already has a workload running on it, the table preview may take longer to load.

Delete a table using the UI


1. Click Data in the sidebar.

2. Click the next to the table name and select Delete .

File metadata column
7/21/2022 • 2 minutes to read

NOTE
Available in Databricks Runtime 10.5 and above.

You can get metadata information for input files with the _metadata column. The _metadata column is a hidden
column, and is available for all input file formats. To include the _metadata column in the returned DataFrame,
you must explicitly reference it in your query.
If the data source contains a column named _metadata , queries will return the column from the data source,
and not the file metadata.

WARNING
New fields may be added to the _metadata column in future releases. To prevent schema evolution errors if the
_metadata column is updated, Databricks recommends selecting specific fields from the column in your queries. See
examples.

Supported metadata
The _metadata column is a STRUCT containing the following fields:

NAME                     TYPE         DESCRIPTION                                         EXAMPLE

file_path                STRING       File path of the input file.                        file:/tmp/f0.csv

file_name                STRING       Name of the input file along with its extension.    f0.csv

file_size                LONG         Length of the input file, in bytes.                 628

file_modification_time   TIMESTAMP    Last modification timestamp of the input file.      2021-12-20 20:05:21

Examples
Use in a basic file -based data source reader
Python
df = spark.read \
  .format("csv") \
  .schema(schema) \
  .load("dbfs:/tmp/*") \
  .select("*", "_metadata")

display(df)

'''
Result:
+---------+-----+----------------------------------------------------+
| name | age | _metadata |
+=========+=====+====================================================+
| | | { |
| | | "file_path": "dbfs:/tmp/f0.csv", |
| Debbie | 18 | "file_name": "f0.csv", |
| | | "file_size": 12, |
| | | "file_modification_time": "2021-07-02 01:05:21" |
| | | } |
+---------+-----+----------------------------------------------------+
| | | { |
| | | "file_path": "dbfs:/tmp/f1.csv", |
| Frank | 24 | "file_name": "f1.csv", |
| | | "file_size": 12, |
| | | "file_modification_time": "2021-12-20 02:06:21" |
| | | } |
+---------+-----+----------------------------------------------------+
'''

Scala

val df = spark.read
  .format("csv")
  .schema(schema)
  .load("dbfs:/tmp/*")
  .select("*", "_metadata")

display(df)

/* Result:
+---------+-----+----------------------------------------------------+
| name | age | _metadata |
+=========+=====+====================================================+
| | | { |
| | | "file_path": "dbfs:/tmp/f0.csv", |
| Debbie | 18 | "file_name": "f0.csv", |
| | | "file_size": 12, |
| | | "file_modification_time": "2021-07-02 01:05:21" |
| | | } |
+---------+-----+----------------------------------------------------+
| | | { |
| | | "file_path": "dbfs:/tmp/f1.csv", |
| Frank | 24 | "file_name": "f1.csv", |
| | | "file_size": 10, |
| | | "file_modification_time": "2021-12-20 02:06:21" |
| | | } |
+---------+-----+----------------------------------------------------+
*/

Select specific fields


Python
spark.read \
  .format("csv") \
  .schema(schema) \
  .load("dbfs:/tmp/*") \
  .select("_metadata.file_name", "_metadata.file_size")

Scala

spark.read
  .format("csv")
  .schema(schema)
  .load("dbfs:/tmp/*")
  .select("_metadata.file_name", "_metadata.file_size")

Use in filters
Python

from pyspark.sql.functions import col, lit

spark.read \
  .format("csv") \
  .schema(schema) \
  .load("dbfs:/tmp/*") \
  .select("*") \
  .filter(col("_metadata.file_name") == lit("test.csv"))

Scala

import org.apache.spark.sql.functions.{col, lit}

spark.read
  .format("csv")
  .schema(schema)
  .load("dbfs:/tmp/*")
  .select("*")
  .filter(col("_metadata.file_name") === lit("test.csv"))

Use in COPY INTO

COPY INTO my_delta_table
FROM (
  SELECT *, _metadata FROM 'abfss://my-bucket/csvData'
)
FILEFORMAT = CSV

Use in Auto Loader


Python

spark.readStream \
  .format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .schema(schema) \
  .load("abfss://my-bucket/csvData") \
  .select("*", "_metadata") \
  .writeStream \
  .format("delta") \
  .option("checkpointLocation", checkpointLocation) \
  .start(targetTable)

Scala
spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .schema(schema)
  .load("abfss://my-bucket/csvData")
  .select("*", "_metadata")
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointLocation)
  .start(targetTable)

Related articles
COPY INTO
Auto Loader
Structured Streaming
Workflows
7/21/2022 • 2 minutes to read

This guide shows how to process and analyze data using Azure Databricks jobs, Delta Live Tables (the Azure
Databricks data processing pipeline framework), and common workflow tools including Apache Airflow and
Azure Data Factory.
Workflows with jobs
Delta Live Tables
Managing dependencies in data pipelines
Delta Live Tables
7/21/2022 • 2 minutes to read

Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You
define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster
management, monitoring, data quality, and error handling.
Instead of defining your data pipelines using a series of separate Apache Spark tasks, Delta Live Tables manages
how your data is transformed based on a target schema you define for each processing step. You can also
enforce data quality with Delta Live Tables expectations. Expectations allow you to define expected data quality
and specify how to handle records that fail those expectations.
To get started with Delta Live Tables:
Develop your first Delta Live Tables pipeline with the quickstart.
Learn about fundamental Delta Live Tables concepts.
Learn how to create, run, and manage pipelines with the Delta Live Tables user interface.
Learn how to develop Delta Live Tables pipelines with Python or SQL.
Learn how to manage data quality in your Delta Live Tables pipelines with expectations.
Learn more about Delta Live Tables:
Use external data sources in your Delta Live Tables pipelines: Data sources
Use the data produced by your Delta Live Tables pipelines: Publish data
Efficiently process continually arriving data in your Delta Live Tables pipelines: Streaming data processing
Use change data capture (CDC) processing in your Delta Live Tables pipelines: Change data capture with
Delta Live Tables
Use the Delta Live Tables API: API guide
Use the Delta Live Tables command line interface: CLI
Configure your Delta Live Tables pipelines: Pipeline settings
Analyze and report on your Delta Live Tables pipelines: Querying the event log
Run your Delta Live Tables pipelines with popular workflow orchestration tools: Workflow tool integration
Learn how to use access control lists (ACLs) to configure permissions on your Delta Live Tables pipelines:
Access control
Find answers and solutions for Delta Live Tables:
Implement common tasks in your Delta Live Tables pipelines: Cookbook
Review frequently asked questions and issues: FAQ
Learn how the Delta Live Tables upgrade process works and how to test your pipelines with the next system
version: Upgrades
Delta Live Tables quickstart
7/21/2022 • 5 minutes to read

You can easily create and run a Delta Live Tables pipeline using an Azure Databricks notebook. This article
demonstrates using a Delta Live Tables pipeline on a dataset containing Wikipedia clickstream data to:
Read the raw JSON clickstream data into a table.
Read the records from the raw data table and use Delta Live Tables expectations to create a new table that
contains cleansed data.
Use the records from the cleansed data table to make Delta Live Tables queries that create derived datasets.
In this quickstart, you:
1. Create a new notebook and add the code to implement the pipeline.
2. Create a new pipeline job using the notebook.
3. Start an update of the pipeline job.
4. View results of the pipeline job.

Requirements
You must have cluster creation permission to start a pipeline. The Delta Live Tables runtime creates a cluster
before it runs your pipeline and fails if you don’t have the correct permission.

Create a notebook
You can use an example notebook or create a new notebook to run the Delta Live Tables pipeline:
1. Go to your Azure Databricks landing page and select Create Blank Notebook .
2. In the Create Notebook dialog, give your notebook a name and select Python or SQL from the
Default Language dropdown menu. You can leave Cluster set to the default value. The Delta Live
Tables runtime creates a cluster before it runs your pipeline.
3. Click Create .
4. Copy the Python or SQL code example and paste it into your new notebook. You can add the example
code to a single cell of the notebook or multiple cells.

NOTE
You must start your pipeline from the Delta Live Tables tab of the Jobs user interface. Clicking to run your
pipeline will return an error.

Code example
Python
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"

@dlt.table(
  comment="The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
)
def clickstream_raw():
  return (spark.read.format("json").load(json_path))

@dlt.table(
  comment="Wikipedia clickstream data cleaned and prepared for analysis."
)
@dlt.expect("valid_current_page_title", "current_page_title IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count > 0")
def clickstream_prepared():
  return (
    dlt.read("clickstream_raw")
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumnRenamed("curr_title", "current_page_title")
      .withColumnRenamed("prev_title", "previous_page_title")
      .select("current_page_title", "click_count", "previous_page_title")
  )

@dlt.table(
  comment="A table containing the top pages linking to the Apache Spark page."
)
def top_spark_referrers():
  return (
    dlt.read("clickstream_prepared")
      .filter(expr("current_page_title == 'Apache_Spark'"))
      .withColumnRenamed("previous_page_title", "referrer")
      .sort(desc("click_count"))
      .select("referrer", "click_count")
      .limit(10)
  )

SQL
CREATE OR REFRESH LIVE TABLE clickstream_raw
COMMENT "The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
AS SELECT * FROM json.`/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json`;

CREATE OR REFRESH LIVE TABLE clickstream_prepared(
  CONSTRAINT valid_current_page EXPECT (current_page_title IS NOT NULL),
  CONSTRAINT valid_count EXPECT (click_count > 0) ON VIOLATION FAIL UPDATE
)
COMMENT "Wikipedia clickstream data cleaned and prepared for analysis."
AS SELECT
  curr_title AS current_page_title,
  CAST(n AS INT) AS click_count,
  prev_title AS previous_page_title
FROM live.clickstream_raw;

CREATE OR REFRESH LIVE TABLE top_spark_referrers
COMMENT "A table containing the top pages linking to the Apache Spark page."
AS SELECT
  previous_page_title as referrer,
  click_count
FROM live.clickstream_prepared
WHERE current_page_title = 'Apache_Spark'
ORDER BY click_count DESC
LIMIT 10;

Create a pipeline
To create a new pipeline using the Delta Live Tables notebook:

1. Click Workflows in the sidebar, click the Delta Live Tables tab, and click Create Pipeline .

2. Give the pipeline a name and click to select a notebook.


3. Optionally enter a storage location for output data from the pipeline. The system uses a default location if
you leave Storage Location empty.
4. Select Triggered for Pipeline Mode .
5. Click Create .
The system displays the Pipeline Details page after you click Create . You can also access your pipeline by
clicking the pipeline name in the Delta Live Tables tab.

Start the pipeline


To start an update for the new pipeline, click the Start button in the top panel. The system returns a
message confirming that your pipeline is starting.

After successfully starting the update, the Delta Live Tables system:
1. Starts a cluster using a cluster configuration created by the Delta Live Tables system. You can also specify a
custom cluster configuration.
2. Creates any tables that don’t exist and ensures that the schema is correct for any existing tables.
3. Updates tables with the latest data available.
4. Shuts down the cluster when the update is complete.
You can track the progress of the update by viewing the event log at the bottom of the Pipeline Details page.

View results
You can use the Delta Live Tables user interface to view pipeline processing details. This includes a visual view of
the pipeline graph and schemas, and record processing details such as the number of records processed and
records that fail validation.
View the pipeline graph
To view the processing graph for your pipeline, click the Graph tab. You can use your mouse to adjust the view

or the buttons in the upper right corner of the graph panel.

View dataset information


Click a dataset to view schema information for the dataset.
View processing details
You can view processing details for each dataset, such as the number of records processed and data quality
metrics. In the event log at the bottom of the Pipeline Details page, select the Completed entry for a dataset
and click the JSON tab.

View pipeline settings


Click the Settings tab to view the generated configuration for your pipeline. Click the Settings button to
modify the pipeline configuration. See Delta Live Tables settings for details on configuration settings.

Publish datasets
You can make pipeline output data available for querying by publishing tables to the Azure Databricks
metastore:
1. Click the Settings button.
2. Add the target setting to configure a database name for your tables.
3. Click Save .

4. Click the button to start a new update for your pipeline.


After the update completes, you can view the database and tables, query the data, or use the data in
downstream applications.
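
For example, you might query one of the published tables from a notebook. The following sketch assumes the target setting was set to a hypothetical database name, dlt_quickstart; substitute the database you configured.

Python

# Hypothetical: assumes the pipeline's target setting is "dlt_quickstart".
top_referrers = spark.sql("SELECT * FROM dlt_quickstart.top_spark_referrers")
display(top_referrers)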

Example notebooks
These notebooks provide Python and SQL examples that implement a Delta Live Tables pipeline to:
Read raw JSON clickstream data into a table.
Read the records from the raw data table and use Delta Live Tables expectations to create a new table that
contains cleansed data.
Use the records from the cleansed data table to make Delta Live Tables queries that create derived datasets.
Get started with Delta Live Tables Python notebook
Get notebook
Get started with Delta Live Tables SQL notebook
Get notebook
Find more example notebooks at _.
Delta Live Tables concepts
7/21/2022 • 10 minutes to read

This article introduces the fundamental concepts you should understand to use Delta Live Tables effectively.

In this section:
Pipelines
Datasets
Continuous and triggered pipelines
Tables and views in continuous pipelines
Development and production modes
Databricks Enhanced Autoscaling
Product editions

Pipelines
The main unit of execution in Delta Live Tables is a pipeline. A pipeline is a directed acyclic graph (DAG) linking
data sources to target datasets. You define the contents of Delta Live Tables datasets using SQL queries or
Python functions that return Spark SQL or Koalas DataFrames. A pipeline also has an associated configuration
defining the settings required to run the pipeline. You can optionally specify data quality constraints when
defining datasets.
You implement Delta Live Tables pipelines in Azure Databricks notebooks. You can implement pipelines in a
single notebook or in multiple notebooks. All queries in a single notebook must be implemented in either
Python or SQL, but you can configure multiple-notebook pipelines with a mix of Python and SQL notebooks.
Each notebook shares a storage location for output data and is able to reference datasets from other notebooks
in the pipeline.
You can use Databricks Repos to store and manage your Delta Live Tables notebooks. To make a notebook
managed with Databricks Repos available when you create a pipeline:
Add the comment line -- Databricks notebook source at the top of a SQL notebook.
Add the comment line # Databricks notebook source at the top of a Python notebook.
See Create, run, and manage Delta Live Tables pipelines to learn more about creating and running a pipeline.
See Configure multiple notebooks in a pipeline for an example of configuring a multi-notebook pipeline.
Queries
Expectations
Pipeline settings
Pipeline updates
Queries
Queries implement data transformations by defining a data source and a target dataset. Delta Live Tables
queries can be implemented in Python or SQL.
Expectations
You use expectations to specify data quality controls on the contents of a dataset. Unlike a CHECK constraint in a
traditional database which prevents adding any records that fail the constraint, expectations provide flexibility
when processing data that fails data quality requirements. This flexibility allows you to process and store data
that you expect to be messy and data that must meet strict quality requirements.
You can define expectations to retain records that fail validation, drop records that fail validation, or halt the
pipeline when a record fails validation.
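The following is a minimal Python sketch of these three actions; the table and column names are hypothetical, and each decorator is described in the Delta Live Tables Python language reference later in this guide.

import dlt

@dlt.table
@dlt.expect("valid_email", "email IS NOT NULL")                 # retain rows that fail validation
@dlt.expect_or_drop("positive_amount", "amount > 0")            # drop rows that fail validation
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")   # halt the update when a row fails validation
def orders_cleaned():
  return dlt.read("orders_raw")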
Pipeline settings
Pipeline settings are defined in JSON and include the parameters required to run the pipeline, including:
Libraries (in the form of notebooks) that contain the queries that describe the tables and views to create the
target datasets in Delta Lake.
A cloud storage location where the tables and metadata required for processing will be stored. This location
is either DBFS or another location you provide.
Optional configuration for a Spark cluster where data processing will take place.
See Delta Live Tables settings for more details.
Pipeline updates
After you create the pipeline and are ready to run it, you start an update. An update:
Starts a cluster with the correct configuration.
Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names,
missing dependencies, syntax errors, and so on.
Creates or updates all of the tables and views with the most recent data available.
If the pipeline is triggered, the system stops processing after updating all tables in the pipeline once.
When a triggered update completes successfully, each table is guaranteed to be updated based on the data
available when the update started.
For use cases that require low latency, you can configure a pipeline to update continuously.
See Continuous and triggered pipelines for more information about choosing an execution mode for your
pipeline.

Datasets
There are two types of datasets in a Delta Live Tables pipeline: views and tables.
Views are similar to a temporary view in SQL and are an alias for some computation. A view allows you to
break a complicated query into smaller or easier-to-understand queries. Views also allow you to reuse a
given transformation as a source for more than one table. Views are available within a pipeline only and
cannot be queried interactively.
Tables are similar to traditional materialized views. The Delta Live Tables runtime automatically creates tables
in the Delta format and ensures those tables are updated with the latest result of the query that creates the
table.
You can define a live or streaming live view or table:
A live table or view always reflects the results of the query that defines it, including when the query defining the
table or view is updated, or an input data source is updated. Like a traditional materialized view, a live table or
view may be entirely computed when possible to optimize computation resources and time.
A streaming live table or view processes data that has been added only since the last pipeline update. Streaming
tables and views are stateful; if the defining query changes, new data will be processed based on the new query
and existing data is not recomputed.
Streaming live tables are valuable for a number of use cases, including:
Data retention: a streaming live table can preserve data indefinitely, even when an input data source has low
retention, for example, a streaming data source such as Apache Kafka or Amazon Kinesis.
Data source evolution: data can be retained even if the data source changes, for example, moving from Kafka
to Kinesis.
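The following is a minimal Python sketch contrasting the two dataset types; the table names are hypothetical. The live table is recomputed from its source on each update, while the streaming live table incrementally processes only newly arrived data.

import dlt

# Streaming live table: reads its source incrementally with dlt.read_stream().
@dlt.table
def orders_cleaned():
  return dlt.read_stream("orders_raw").where("order_id IS NOT NULL")

# Live table: always reflects the full result of its defining query.
@dlt.table
def orders_summary():
  return dlt.read("orders_cleaned").groupBy("region").count()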
You can publish your tables to make them available for discovery and querying by downstream consumers.

Continuous and triggered pipelines


Delta Live Tables supports two different modes of execution:
Triggered pipelines update each table with whatever data is currently available and then stop the cluster
running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and
starts by computing those that read from external sources. Tables within the pipeline are updated after their
dependent data sources have been updated.
Continuous pipelines update tables continuously as input data changes. Once an update is started, it
continues to run until manually stopped. Continuous pipelines require an always-running cluster but ensure
that downstream consumers have the most up-to-date data.
Triggered pipelines can reduce resource consumption and expense since the cluster runs only long enough to
execute the pipeline. However, new data won’t be processed until the pipeline is triggered. Continuous pipelines
require an always-running cluster, which is more expensive but reduces processing latency.
The continuous flag in the pipeline settings controls the execution mode. Pipelines run in triggered execution
mode by default. Set continuous to true if you require low latency updates of the tables in your pipeline.

{
...
"continuous": true,
...
}

The execution mode is independent of the type of table being computed. Both live and streaming live tables can
be updated in either execution mode.
If some tables in your pipeline have weaker latency requirements, you can configure their update frequency
independently by setting the pipelines.trigger.interval setting:

spark_conf={"pipelines.trigger.interval": "1 hour"}

This option does not turn off the cluster in between pipeline updates, but can free up resources for updating
other tables in your pipeline.
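As a minimal sketch, the setting shown above can be attached to an individual table definition through the spark_conf argument of the @dlt.table decorator (see the Python language reference later in this guide); the table names here are hypothetical.

import dlt

@dlt.table(
  spark_conf={"pipelines.trigger.interval": "1 hour"}  # update this table at most hourly
)
def slowly_changing_reference():
  return dlt.read("reference_raw")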

Tables and views in continuous pipelines


You can use both live tables or views and streaming live tables or views in a pipeline that runs continuously. To
avoid unnecessary processing, pipelines automatically monitor dependent Delta tables and perform an update
only when the contents of those dependent tables have changed.
The Delta Live Tables runtime is not able to detect changes in non-Delta data sources. The table is still updated
regularly, but with a higher default trigger interval to prevent excessive recomputation from slowing down any
incremental processing happening on the cluster.

Development and production modes


You can optimize pipeline execution by switching between development and production modes. When you run
your pipeline in development mode, the Delta Live Tables system:
Reuses a cluster to avoid the overhead of restarts.
Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system:
Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
Retries execution in the event of specific errors, for example, a failure to start a cluster.

Use the buttons in the Pipelines UI to switch between development and


production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution behavior.
Storage locations must be configured as part of pipeline settings and are not affected when switching between
modes.

Databricks Enhanced Autoscaling


IMPORTANT
This feature is in Public Preview.

Databricks Enhanced Autoscaling optimizes cluster utilization by automatically allocating cluster resources
based on workload volume, with minimal impact to the data processing latency of your pipelines.
Enhanced Autoscaling adds to the existing cluster autoscaling functionality with the following features:
Enhanced Autoscaling implements optimization of streaming workloads, and adds enhancements to improve
the performance of batch workloads. These optimizations result in more efficient cluster utilization, reduced
resource usage, and lower cost.
Enhanced Autoscaling proactively shuts down under-utilized nodes while guaranteeing there are no failed
tasks during shutdown. The existing cluster autoscaling feature scales down nodes only if the node is idle.
Requirements
To use Enhanced Autoscaling:
1. Set the pipelines.advancedAutoscaling.enabled field to "true" in the pipeline settings configuration object.
2. Add the autoscale configuration to the pipeline default cluster. The following example configures an
Enhanced Autoscaling cluster with a minimum of 5 workers and a maximum of 10 workers. max_workers
must be greater than or equal to min_workers .

NOTE
Enhanced Autoscaling is available for the default cluster only. If you include the autoscale configuration in the
maintenance cluster configuration, the existing cluster autoscaling feature is used.
If you add the autoscale configuration without the pipelines.advancedAutoscaling.enabled configuration, Delta
Live Tables will use the existing cluster autoscaling feature.
{
  "configuration": {
    "pipelines.advancedAutoscaling.enabled": "true"
  },
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 5,
        "max_workers": 10
      }
    }
  ]
}

If the pipeline is continuous, it is automatically restarted after the autoscaling configuration changes. After the
restart, expect a short period of increased latency. Following this brief period, the cluster size should be updated
based on your autoscale configuration, and the pipeline latency returns to its previous characteristics.
Monitoring Enhanced Autoscaling enabled pipelines
You can use the Delta Live Tables event log to monitor Enhanced Autoscaling metrics. You can view the metrics
in the user interface. Enhanced Autoscaling events have the autoscale event type. The following are example
events:

EVENT                                          MESSAGE

Cluster resize request submitted               Autoscale cluster to <X> executors while keeping alive <Y> executors and retiring <Z> executors

Cluster manager accepted the resize request    Submitted request to resize cluster <cluster-id> to size <X>.

Resizing successfully completed                Achieved desired cluster size <X> for cluster <cluster-id>.

You can also view Enhanced Autoscaling events by directly querying the event log:
To query the event log for cluster performance metrics, for example, Spark task slot utilization, see Cluster
performance metrics.
To monitor cluster resizing requests and responses during Enhanced Autoscaling operations, see Databricks
Enhanced Autoscaling events.

Product editions
You can use the Delta Live Tables product edition option to run your pipeline with the features best suited for the
pipeline requirements. The following product editions are available:
core to run streaming ingest workloads. Select the core edition if your pipeline doesn’t require advanced
features such as change data capture (CDC) or Delta Live Tables expectations.
pro to run streaming ingest and CDC workloads. The pro product edition supports all of the core
features, plus support for workloads that require updating tables based on changes in source data.
advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The
advanced product edition supports the features of the core and pro editions, and also supports
enforcement of data quality constraints with Delta Live Tables expectations.
You can select the product edition when you create or edit a pipeline. You can select a different edition for each
pipeline.
If your pipeline includes features not supported by the selected product edition, for example, expectations, you
will receive an error message with the reason for the error. You can then edit the pipeline to select the
appropriate edition.
Create, run, and manage Delta Live Tables pipelines
7/21/2022 • 9 minutes to read

You can create, run, manage, and monitor a Delta Live Tables pipeline using the UI or the Delta Live Tables API.
You can also run your pipeline with an orchestration tool such as Azure Databricks jobs. This article focuses on
performing Delta Live Tables tasks using the UI. To use the API, see the API guide.
To create and run your first pipeline, see the Delta Live Tables quickstart.

Create a pipeline
1. Do one of the following:

Click Workflows in the sidebar, click the Delta Live Tables tab, and click . The
Create Pipeline dialog appears.
In the sidebar, click Create and select Pipeline from the menu.
2. Select the Delta Live Tables product edition for the pipeline from the Product Edition drop-down.
The product edition option allows you to choose the best product edition based on the requirements of
your pipeline. See Product editions.
3. Enter a name for the pipeline in the Pipeline Name field.
4. Enter a path to a notebook containing your pipeline queries in the Notebook Libraries field, or click
to browse to your notebook.
5. To optionally add additional notebooks to the pipeline, click the Add notebook library button.
You can add notebooks in any order. Delta Live Tables automatically analyzes dataset dependencies to
construct the processing graph for your pipeline.
6. To optionally add Spark configuration settings to the cluster that will run the pipeline, click the Add
configuration button.
7. To optionally make your tables available for discovery and querying, enter a database name in the Target
field. See Publish datasets
8. To optionally enter a storage location for output data from the pipeline, enter a DBFS or cloud storage
path in the Storage Location field. The system uses a default location if you leave Storage Location
empty.
9. Select Triggered or Continuous for Pipeline Mode . See Continuous and triggered pipelines.
10. You can optionally modify the configuration for pipeline clusters, including enabling and disabling
autoscaling and setting the number of worker nodes. See Manage cluster size.
11. To optionally run this pipeline using Photon runtime, click the Use Photon Acceleration check box.
12. To optionally change the Delta Live Tables runtime version for this pipeline, click the Channel drop-down.
See the channel field in the Delta Live Tables settings.
13. Click Create .
To optionally view and edit the JSON configuration for your pipeline, click the JSON button on the Create
Pipeline dialog.

Start a pipeline update


1. Click Workflows in the sidebar and click the Delta Live Tables tab. The Pipelines list displays.
2. Do one of the following:
To start a pipeline update immediately, click in the Actions column. The system returns a message
confirming that your pipeline is starting.
To view more options before starting the pipeline, click the pipeline name. The Pipeline Details page
displays.
The Pipeline Details page provides the following options:

To switch between development and production modes, use the buttons. By


default, pipelines run in development mode. See Development and production modes.
To optionally configure permissions on a pipeline, click the Permissions button. See Delta Live Tables access
control.
To view and edit pipeline settings, click the Settings button. See Delta Live Tables settings for details on
pipeline settings.

To start an update of your pipeline from the Pipeline Details page, click the button.
You might want to reprocess data that has already been ingested, for example, because you modified your
queries based on new requirements or to fix a bug calculating a new column. You can reprocess data that’s
already been ingested by instructing the Delta Live Tables system to perform a full refresh from the UI. To
perform a full refresh, click next to the Start button and select Full Refresh .
After starting an update or a full refresh, the system returns a message confirming your pipeline is starting.
After successfully starting the update, the Delta Live Tables system:
1. Starts a cluster using a cluster configuration created by the Delta Live Tables system. You can also specify a
custom cluster configuration.
2. Creates any tables that don’t exist and ensures that the schema is correct for any existing tables.
3. Updates tables with the latest data available.
4. Shuts down the cluster when the update is complete.
You can track the progress of the update by viewing the event log at the bottom of the Pipeline Details page.

To view details for a log entry, click the entry. The Pipeline event log details pop-up appears. To view a JSON
document containing the log details, click the JSON tab.
To learn how to query the event log, for example, to analyze performance or data quality metrics, see Delta Live
Tables event log.
View pipeline details
Pipeline graph
After the pipeline starts successfully, the pipeline graph displays. You can use your mouse to adjust the view or

the buttons in the corner of the graph panel.

To view tooltips for data quality metrics, hover over the data quality values for a dataset in the pipeline graph.
Pipeline details
The Pipeline Details panel displays information about the pipeline and the current or most recent update of
the pipeline, including pipeline and update identifiers, update status, and update runtime.
The Pipeline Details panel also displays information about the pipeline compute cluster, including the compute
cost, product edition, Databricks Runtime version, and the channel configured for the pipeline. To open the Spark
UI for the cluster in a new tab, click the Spark UI button. To open the cluster logs in a new tab, click the Logs
button. To open the cluster metrics in a new tab, click the Metrics button.
The Run as value displays the user that pipeline updates run as. The Run as user is the pipeline owner, and
pipeline updates run with this user’s permissions. To change the run as user, click Permissions and change the
pipeline owner.
Dataset details
To view details for a dataset, including the dataset schema and data quality metrics, click the dataset in the
Graph view. The dataset details view displays.
To open the pipeline notebook in a new window, click the Path value.
To close the dataset details view and return to the Pipeline Details , click .

Stop a pipeline update


To stop a pipeline update, click .

Schedule a pipeline
You can start a triggered pipeline manually or run the pipeline on a schedule with an Azure Databricks job. You
can create and schedule a job with a single pipeline task directly in the Delta Live Tables UI or add a pipeline task
to a multi-task workflow in the jobs UI.
To create a single-task job and a schedule for the job in the Delta Live Tables UI:
1. Click Schedule > Add a schedule . The Schedule button is updated to show the number of existing
schedules if the pipeline is included in one or more scheduled jobs, for example, Schedule (5) .
2. Enter a name for the job in the Job name field.
3. Set the Schedule to Scheduled .
4. Specify the period, starting time, and time zone.
5. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.
6. Click Create .
To create a multi-task workflow with an Azure Databricks job and add a pipeline task:
1. Create a job in the jobs UI and add your pipeline to the job workflow using a Pipeline task.
2. Create a schedule for the job in the jobs UI.
After creating the pipeline schedule, you can:
View a summary of the schedule in the Delta Live Tables UI, including the schedule name, whether it is
paused, the last run time, and the status of the last run. To view the schedule summary, click the Schedule
button.
Edit the job or the pipeline task.
Edit the schedule or pause and resume the schedule. The schedule will also be paused if you selected
Manual when creating the schedule.
Run the job manually and view details on job runs.

View pipelines
Click Workflows in the sidebar and click the Delta Live Tables tab. The Pipelines page appears with a
list of all defined pipelines, the status of the most recent pipeline updates, the pipeline identifier, and the pipeline
creator.
You can filter pipelines in the list by:
Pipeline name.
A partial text match on one or more pipeline names.
Selecting only the pipelines you own.
Selecting all pipelines you have permissions to access.
Click the Name column header to sort pipelines by name in ascending order (A -> Z) or descending order (Z ->
A).
Pipeline names render as a link when you view the pipelines list, allowing you to right-click on a pipeline name
and access context menu options such as opening the pipeline details in a new tab or window.

Edit settings
On the Pipeline Details page, click the Settings button to view and modify the pipeline settings. You can add,
edit, or remove settings. For example, to make pipeline output available for querying after you’ve created a
pipeline:
1. Click the Settings button. The Edit Pipeline Settings dialog appears.
2. Enter a database name in the Target field.
3. Click Save .
To view and edit the JSON specification, click the JSON button.
See Delta Live Tables settings for more information on configuration settings.

View update history


To view the history and status of pipeline updates, click the Update history drop-down.

To view the graph, details, and events for an update, select the update in the drop-down. To return to the latest
update, click Show the latest update .

Publish datasets
When creating or editing a pipeline, you can configure the target setting to publish your table definitions to
the Azure Databricks metastore and persist the records to Delta tables.
After your update completes, you can view the database and tables, query the data, or use the data in
downstream applications.
See Delta Live Tables data publishing.

Manage cluster size


You can manage the cluster resources used by your pipeline. By default, Delta Live Tables automatically scales
your pipeline clusters to optimize performance and cost. Databricks recommends cluster autoscaling, but you
can optionally disable autoscaling and configure a fixed number of worker nodes for your pipeline clusters
when you create or edit a pipeline:
When creating a pipeline, disable the Enable autoscaling check box and specify the number of nodes in
the Workers field.
Modify the settings of an existing pipeline to remove autoscaling. This snippet from the settings for a
pipeline shows cluster autoscaling enabled:

"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
]

This snippet from the settings for a pipeline illustrates cluster autoscaling disabled and the number of
worker nodes fixed at 5:

"clusters": [
{
"label": "default",
"num_workers": 5
}
]

Delete a pipeline
You can delete a pipeline from the Pipelines list or the Pipeline Details page:

In the Pipelines list, click in the Actions column.


On the Pipeline Details page for your pipeline, click the Delete button.
Deleting a pipeline removes the pipeline definition from the Delta Live Tables system and cannot be undone.
Delta Live Tables Python language reference
7/21/2022 • 7 minutes to read

This article provides details and examples for the Delta Live Tables Python programming interface. For the
complete API specification, see the Python API specification.
For information on the SQL API, see the Delta Live Tables SQL language reference.

Python datasets
The Python API is defined in the dlt module. You must import the dlt module in your Delta Live Tables
pipelines implemented with the Python API. Apply the @dlt.view or @dlt.table decorator to a function to
define a view or table in Python. You can use the function name or the name parameter to assign the table or
view name. The following example defines two different datasets: a view called taxi_raw that takes a JSON file
as the input source and a table called filtered_data that takes the taxi_raw view as input:

@dlt.view
def taxi_raw():
  return spark.read.format("json").load("/databricks-datasets/nyctaxi/sample/json/")

# Use the function name as the table name
@dlt.table
def filtered_data():
  return dlt.read("taxi_raw").where(...)

# Use the name parameter as the table name
@dlt.table(
  name="filtered_data")
def create_filtered_data():
  return dlt.read("taxi_raw").where(...)

View and table functions must return a Spark DataFrame or a Koalas DataFrame. A Koalas DataFrame returned
by a function is converted to a Spark Dataset by the Delta Live Tables runtime.
In addition to reading from external data sources, you can access datasets defined in the same pipeline with the
Delta Live Tables read() function. The following example demonstrates creating a customers_filteredA dataset
using the read() function:

@dlt.table
def customers_raw():
  return spark.read.format("csv").load("/data/customers.csv")

@dlt.table
def customers_filteredA():
  return dlt.read("customers_raw").where(...)

You can also use the spark.table() function to access a dataset defined in the same pipeline or a table
registered in the metastore. When using the spark.table() function to access a dataset defined in the pipeline,
in the function argument prepend the LIVE keyword to the dataset name:
@dlt.table
def customers_raw():
  return spark.read.format("csv").load("/data/customers.csv")

@dlt.table
def customers_filteredB():
  return spark.table("LIVE.customers_raw").where(...)

To read data from a table registered in the metastore, in the function argument omit the LIVE keyword and
optionally qualify the table name with the database name:

@dlt.table
def customers():
  return spark.table("sales.customers").where(...)

Delta Live Tables ensures that the pipeline automatically captures the dependency between datasets. This
dependency information is used to determine the execution order when performing an update and recording
lineage information in the event log for a pipeline.
You can also return a dataset using a spark.sql expression in a query function. To read from an internal dataset,
prepend LIVE. to the dataset name:

@dlt.table
def chicago_customers():
  return spark.sql("SELECT * FROM LIVE.customers_cleaned WHERE city = 'Chicago'")

Both views and tables have the following optional properties:


comment : A human-readable description of this dataset.
spark_conf : A Python dictionary containing Spark configurations for the execution of this query only.
Data quality constraints enforced with expectations.
Tables also offer additional control of their materialization:
Specify how tables are partitioned using partition_cols . You can use partitioning to speed up queries.
You can set table properties when you define a view or table. See Table properties for more details.
Set a storage location for table data using the path setting. By default, table data is stored in the pipeline
storage location if path isn’t set.
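The following is a minimal sketch (with hypothetical table, column, and path names) showing how these optional controls might be combined in a single table definition.

import dlt

@dlt.table(
  comment="Daily sales, partitioned by region",
  spark_conf={"spark.sql.shuffle.partitions": "8"},
  partition_cols=["region"],
  path="/mnt/delta/daily_sales"  # hypothetical storage location
)
def daily_sales():
  return dlt.read("sales_raw").where("amount > 0")
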
You can optionally specify a table schema using a Python StructType or a SQL DDL string. The following
examples create a table called sales with an explicitly specified schema:
sales_schema = StructType([
  StructField("customer_id", StringType(), True),
  StructField("number_of_line_items", StringType(), True),
  StructField("order_datetime", StringType(), True),
  StructField("order_number", LongType(), True)]
)

@dlt.table(
  comment="Raw data on sales",
  schema=sales_schema)
def sales():
  return ("...")

@dlt.table(
  comment="Raw data on sales",
  schema="customer_id STRING, customer_name STRING, number_of_line_items STRING, order_datetime STRING, order_number LONG")
def sales():
  return ("...")

By default, Delta Live Tables infers the schema from the table definition if you don’t specify a schema.

Python libraries
To specify external Python libraries, use the %pip install magic command. When an update starts, Delta Live
Tables runs all cells containing a %pip install command before running any table definitions. Every Python
notebook included in the pipeline has access to all installed libraries. The following example installs a package
called logger and makes it globally available to any Python notebook in the pipeline:

%pip install logger

from logger import log_info

@dlt.table
def dataset():
  log_info(...)
  return dlt.read(...)

To install a Python wheel package, add the wheel path to the %pip install command. Installed Python wheel
packages are available to all tables in the pipeline. The following example installs a wheel named
dltfns-1.0-py3-none-any.whl from the DBFS directory /dbfs/dlt/ :

%pip install /dbfs/dlt/dltfns-1.0-py3-none-any.whl

See Install a wheel package with %pip.

Python API specification


Python module

NOTE
The Delta Live Tables Python interface has the following limitations:
The pivot() function is not supported. Using the pivot() function in a dataset definition results in non-
deterministic pipeline latencies.
Delta Live Tables Python functions are defined in the dlt module. Your pipelines implemented with the Python
API must import this module:

import dlt

Create table
To define a table in Python, apply the @table decorator. The @table decorator is an alias for the @create_table
decorator.

import dlt

@dlt.table(
  name="<name>",
  comment="<comment>",
  spark_conf={"<key>" : "<value>", "<key>" : "<value>"},
  table_properties={"<key>" : "<value>", "<key>" : "<value>"},
  path="<storage-location-path>",
  partition_cols=["<partition-column>", "<partition-column>"],
  schema="schema-definition",
  temporary=False)
@dlt.expect
@dlt.expect_or_fail
@dlt.expect_or_drop
@dlt.expect_all
@dlt.expect_all_or_drop
@dlt.expect_all_or_fail
def <function-name>():
  return (<query>)

Create view
To define a view in Python, apply the @view decorator. The @view decorator is an alias for the @create_view
decorator.

import dlt

@dlt.view(
  name="<name>",
  comment="<comment>")
@dlt.expect
@dlt.expect_or_fail
@dlt.expect_or_drop
@dlt.expect_all
@dlt.expect_all_or_drop
@dlt.expect_all_or_fail
def <function-name>():
  return (<query>)

Python properties
@TABLE OR @VIEW

name

Type: str

An optional name for the table or view. If not defined, the function name is used as the table or view name.

comment

Type: str

An optional description for the table.

spark_conf

Type: dict

An optional list of Spark configurations for the execution of this query.

table_properties

Type: dict

An optional list of table properties for the table.

path

Type: str

An optional storage location for table data. If not set, the system will default to the pipeline storage location.

partition_cols

Type: array

An optional list of one or more columns to use for partitioning the table.

schema

Type: str or StructType

An optional schema definition for the table. Schemas can be defined as a SQL DDL string, or with a Python
StructType .

temporary

Type: bool

Create a temporary table. No metadata is persisted for this table.

The default is ‘False’.

TABLE OR VIEW DEFINITION

def <function-name>()

A Python function that defines the dataset. If the name parameter is not set, then <function-name> is used as the target
dataset name.

query

A Spark SQL statement that returns a Spark Dataset or Koalas DataFrame.

Use dlt.read() or spark.table() to perform a complete read from a dataset defined in the same pipeline. When using
the spark.table() function to read from a dataset defined in the same pipeline, prepend the LIVE keyword to the dataset
name in the function argument. For example, to read from a dataset named customers :

spark.table("LIVE.customers")

You can also use the spark.table() function to read from a table registered in the metastore by omitting the LIVE
keyword and optionally qualifying the table name with the database name:

spark.table("sales.customers")

Use dlt.read_stream() to perform a streaming read from a dataset defined in the same pipeline.

Use the spark.sql function to define a SQL query to create the return dataset.

Use PySpark syntax to define Delta Live Tables queries with Python.

EXPECTATIONS

@expect("description", "constraint")

Declare a data quality constraint identified by description . If a row violates the expectation, include the row in the
target dataset.

@expect_or_drop("description", "constraint")

Declare a data quality constraint identified by description . If a row violates the expectation, drop the row from the
target dataset.

@expect_or_fail("description", "constraint")

Declare a data quality constraint identified by description . If a row violates the expectation, immediately stop
execution.

@expect_all(expectations)

Declare one or more data quality constraints. expectations is a Python dictionary, where the key is the expectation
description and the value is the expectation constraint. If a row violates any of the expectations, include the row in the
target dataset.

@expect_all_or_drop(expectations)

Declare one or more data quality constraints. expectations is a Python dictionary, where the key is the expectation
description and the value is the expectation constraint. If a row violates any of the expectations, drop the row from the
target dataset.

@expect_all_or_fail(expectations)

Declare one or more data quality constraints. expectations is a Python dictionary, where the key is the expectation
description and the value is the expectation constraint. If a row violates any of the expectations, immediately stop
execution.
Table properties
In addition to the table properties supported by Delta Lake, you can set the following table properties.

TABLE PROPERTIES

pipelines.autoOptimize.managed

Default: true

Enables or disables automatic scheduled optimization of this table.

pipelines.autoOptimize.zOrderCols

Default: None

An optional comma-separated list of column names to z-order this table by.

pipelines.reset.allowed

Default: true

Controls whether a full-refresh is allowed for this table.
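
The following is a minimal sketch (with hypothetical table and column names) of setting one of these properties from the Python interface through the table_properties argument.

import dlt

@dlt.table(
  table_properties={"pipelines.autoOptimize.zOrderCols": "event_date,customer_id"}
)
def events():
  return dlt.read("events_raw")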


Delta Live Tables SQL language reference
7/21/2022 • 4 minutes to read

This article provides details and examples for the Delta Live Tables SQL programming interface. For the
complete API specification, see SQL API specification.
For information on the Python API, see the Delta Live Tables Python language reference.

SQL datasets
Use the CREATE LIVE VIEW or CREATE OR REFRESH LIVE TABLE syntax to create a view or table with SQL. You can
create a dataset by reading from an external data source or from datasets defined in a pipeline. To read from an
internal dataset, prepend the LIVE keyword to the dataset name. The following example defines two different
datasets: a table called taxi_raw that takes a JSON file as the input source and a table called filtered_data that
takes the taxi_raw table as input:

CREATE OR REFRESH LIVE TABLE taxi_raw
AS SELECT * FROM json.`/databricks-datasets/nyctaxi/sample/json/`

CREATE OR REFRESH LIVE TABLE filtered_data
AS SELECT
  ...
FROM LIVE.taxi_raw

Delta Live Tables automatically captures the dependencies between datasets defined in your pipeline and uses
this dependency information to determine the execution order when performing an update and to record
lineage information in the event log for a pipeline.
Both views and tables have the following optional properties:
COMMENT: A human-readable description of this dataset.
Data quality constraints enforced with expectations.
Tables also offer additional control of their materialization:
Specify how tables are partitioned using PARTITIONED BY . You can use partitioning to speed up queries.
You can set table properties using TBLPROPERTIES . See Table properties for more detail.
Set a storage location using the LOCATION setting. By default, table data is stored in the pipeline storage
location if LOCATION isn’t set.
See SQL API specification for more information about table and view properties.
Use SET to specify a configuration value for a table or view, including Spark configurations. Any table or view
you define in a notebook after the SET statement has access to the defined value. Any Spark configurations
specified using the SET statement are used when executing the Spark query for any table or view following the
SET statement. To read a configuration value in a query, use the string interpolation syntax ${} . The following
example sets a Spark configuration value named startDate and uses that value in a query:
SET startDate='2020-01-01';

CREATE OR REFRESH LIVE TABLE filtered
AS SELECT * FROM src
WHERE date > ${startDate}

To specify multiple configuration values, use a separate SET statement for each value.
To read data from a streaming source, for example, Auto Loader or an internal data set, define a STREAMING LIVE
table:

CREATE OR REFRESH STREAMING LIVE TABLE customers_bronze
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv")

CREATE OR REFRESH STREAMING LIVE TABLE customers_silver
AS SELECT * FROM STREAM(LIVE.customers_bronze)

For more information on streaming data, see Streaming data processing.

SQL API specification


NOTE
The Delta Live Tables SQL interface has the following limitations:
Identity and generated columns are not supported.
The PIVOT clause is not supported. Using a PIVOT clause in a dataset definition results in non-deterministic pipeline
latencies.

Create table

CREATE OR REFRESH [TEMPORARY] { STREAMING LIVE TABLE | LIVE TABLE } table_name
[(
[
col_name1 col_type1 [ COMMENT col_comment1 ],
col_name2 col_type2 [ COMMENT col_comment2 ],
...
]
[
CONSTRAINT expectation_name_1 EXPECT (expectation_expr1) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
CONSTRAINT expectation_name_2 EXPECT (expectation_expr2) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
...
]
)]
[USING DELTA]
[PARTITIONED BY (col_name1, col_name2, ... )]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ... )]
AS select_statement

Create view
CREATE TEMPORARY [STREAMING] LIVE VIEW view_name
[(
[
col_name1 [ COMMENT col_comment1 ],
col_name2 [ COMMENT col_comment2 ],
...
]
[
CONSTRAINT expectation_name_1 EXPECT (expectation_expr1) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
CONSTRAINT expectation_name_2 EXPECT (expectation_expr2) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
...
]
)]
[COMMENT view_comment]
AS select_statement

SQL properties
CREATE TABLE OR VIEW

TEMPORARY

Create a temporary table. No metadata is persisted for this table.

STREAMING

Create a table that reads an input dataset as a stream. The input dataset must be a streaming data source, for example, Auto
Loader or a STREAMING LIVE table.

PARTITIONED BY

An optional list of one or more columns to use for partitioning the table.

LOCATION

An optional storage location for table data. If not set, the system will default to the pipeline storage location.

COMMENT

An optional description for the table.

TBLPROPERTIES

An optional list of table properties for the table.

select_statement

A Delta Live Tables query that defines the dataset for the table.

CONSTRAINT CLAUSE

EXPECT expectation_name

Define data quality constraint expectation_name . If ON VIOLATION constraint is not defined, add rows that violate the
constraint to the target dataset.

ON VIOLATION

Optional action to take for failed rows:

* FAIL UPDATE : Immediately stop pipeline execution.
* DROP ROW : Drop the record and continue processing.

Table properties
In addition to the table properties supported by Delta Lake, you can set the following table properties.

TABLE PROPERTIES

pipelines.autoOptimize.managed

Default: true

Enables or disables automatic scheduled optimization of this table.

pipelines.autoOptimize.zOrderCols

Default: None

An optional comma-separated list of column names to z-order this table by.

pipelines.reset.allowed

Default: true

Controls whether a full-refresh is allowed for this table.
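In SQL, these properties are set with the TBLPROPERTIES clause shown in the Create table syntax above. The following is a minimal sketch of setting the same keys from the Python interface with the table_properties argument; the table, column, and source names are hypothetical:

import dlt

@dlt.table(
  table_properties={
    "pipelines.autoOptimize.zOrderCols": "order_date,customer_id",
    "pipelines.reset.allowed": "false"
  }
)
def sales():
  # "raw_sales" is a hypothetical upstream dataset in the same pipeline.
  return dlt.read("raw_sales")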


Delta Live Tables data quality constraints
7/21/2022 • 2 minutes to read

You use expectations to define data quality constraints on the contents of a dataset. An expectation consists of a
description, an invariant, and an action to take when a record fails the invariant. You apply expectations to
queries using Python decorators or SQL constraint clauses.
Use the expect , expect or drop , and expect or fail expectations with Python or SQL queries to define a
single data quality constraint.
You can define expectations with one or more data quality constraints in Python pipelines using the
@expect_all , @expect_all_or_drop , and @expect_all_or_fail decorators. These decorators accept a Python
dictionary as an argument, where the key is the expectation name and the value is the expectation constraint.
You can view data quality metrics such as the number of records that violate an expectation by querying the
Delta Live Tables event log.
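For example, the following is a minimal sketch of reading expectation metrics from the event log in Python; it assumes the pipeline storage location is /pipelines/my-pipeline and that expectation results are recorded in the details field of flow_progress events:

# Sketch only: the event log is stored as a Delta table under the pipeline storage location.
event_log = spark.read.format("delta").load("/pipelines/my-pipeline/system/events")

expectations = (
  event_log
    .where("event_type = 'flow_progress'")
    .selectExpr(
      "timestamp",
      "details:flow_progress.data_quality.expectations AS expectations")
    .where("expectations IS NOT NULL")
)
display(expectations)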

Retain invalid records


Use the expect operator when you want to keep records that violate the expectation. Records that violate the
expectation are added to the target dataset along with valid records:
Python

@dlt.expect("valid timestamp", "col(“timestamp”) > '2012-01-01'")

SQL

CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')

Drop invalid records


Use the expect or drop operator to prevent the processing of invalid records. Records that violate the
expectation are dropped from the target dataset:
Python

@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")

SQL

CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL and current_page_title IS NOT NULL) ON
VIOLATION DROP ROW

Fail on invalid records


When invalid records are unacceptable, use the expect or fail operator to halt execution immediately when a
record fails validation. If the operation is a table update, the system atomically rolls back the transaction:
Python

@dlt.expect_or_fail("valid_count", "count > 0")


SQL

CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE

When a pipeline fails because of an expectation violation, you must fix the pipeline code to handle the invalid
data correctly before re-running the pipeline.
Fail expectations modify the Spark query plan of your transformations to track information required to detect
and report on violations. For many queries, you can use this information to identify which input record resulted
in the violation. The following is an example exception:

Expectation Violated:
{
"flowName": "a-b",
"verboseInfo": {
"expectationsViolated": [
"x1 is negative"
],
"inputData": {
"a": {"x1": 1,"y1": "a },
"b": {
"x2": 1,
"y2": "aa"
}
},
"outputRecord": {
"x1": 1,
"y1": "a",
"x2": 1,
"y2": "aa"
},
"missingInputData": false
}
}

Multiple expectations
Use expect_all to specify multiple data quality constraints when records that fail validation should be included
in the target dataset:

@dlt.expect_all({"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL AND


current_page_title IS NOT NULL"})

Use expect_all_or_drop to specify multiple data quality constraints when records that fail validation should be
dropped from the target dataset:

@dlt.expect_all_or_drop({"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL AND


current_page_title IS NOT NULL"})

Use expect_all_or_fail to specify multiple data quality constraints when records that fail validation should halt
pipeline execution:

@dlt.expect_all_or_fail({"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL AND


current_page_title IS NOT NULL"})

You can also define a collection of expectations as a variable and pass it to one or more queries in your pipeline:
valid_pages = {"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL AND
current_page_title IS NOT NULL"}

@dlt.table
@dlt.expect_all(valid_pages)
def raw_data():
# Create raw dataset

@dlt.table
@dlt.expect_all_or_drop(valid_pages)
def prepared_data():
# Create cleaned and prepared dataset
Delta Live Tables data sources
7/21/2022 • 2 minutes to read

You can use the following external data sources to create datasets:
Any data source that Databricks Runtime directly supports.
Any file in cloud storage such as Azure Data Lake Storage Gen2 (ADLS Gen2), AWS S3, or Google Cloud
Storage (GCS).
Any file stored in DBFS.
Databricks recommends using Auto Loader for pipelines that read data from supported file formats, particularly
for streaming live tables that operate on continually arriving data. Auto Loader is scalable, efficient, and
supports schema inference.
Python datasets can use the Apache Spark built-in file data sources to read data in a batch operation from file
formats not supported by Auto Loader.
SQL datasets can use Delta Live Tables file sources to read data in a batch operation from file formats not
supported by Auto Loader.

Auto Loader
The following examples use Auto Loader to create datasets from CSV and JSON files:
Python

@dlt.table
def customers():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/databricks-datasets/retail-org/customers/")
)

@dlt.table
def sales_orders_raw():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load("/databricks-datasets/retail-org/sales_orders/")
)

SQL

CREATE OR REFRESH STREAMING LIVE TABLE customers


AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv")

CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw


AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json")

You can use supported format options with Auto Loader. Using the map() function, you can pass any number of
options to the cloud_files() method. Options are key-value pairs, where the keys and values are strings. The
following describes the syntax for working with Auto Loader in SQL:
CREATE OR REFRESH STREAMING LIVE TABLE <table_name>
AS SELECT *
FROM cloud_files(
"<file_path>",
"<file_format>",
map(
"<option_key>", "<option_value",
"<option_key>", "<option_value",
...
)
)

The following example reads data from tab-delimited CSV files with a header:

CREATE OR REFRESH STREAMING LIVE TABLE customers


AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv", map("delimiter", "\t",
"header", "true"))

You can use the schema to specify the format manually; you must specify the schema for formats that do not
support schema inference:
Python

@dlt.table
def wiki_raw():
return (
spark.readStream.format("cloudFiles")
.schema("title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING,
revisionUsernameId INT, text STRING")
.option("cloudFiles.format", "parquet")
.load("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")
)

SQL

CREATE OR REFRESH STREAMING LIVE TABLE wiki_raw


AS SELECT *
FROM cloud_files(
"/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet",
"parquet",
map("schema", "title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername
STRING, revisionUsernameId INT, text STRING")
)

NOTE
Delta Live Tables automatically configures and manages the schema and checkpoint directories when using Auto Loader
to read files. However, if you manually configure either of these directories, performing a full refresh does not affect the
contents of the configured directories. Databricks recommends using the automatically configured directories to avoid
unexpected side effects during processing.

Apache Spark file sources


To read files in a batch operation when defining datasets in Python, you can use standard PySpark functions. The
following example reads Parquet data from files using the PySpark spark.read.format("parquet").load()
function:
@dlt.table
def lendingclub_raw_data():
return (
spark.read.format("parquet").load("/databricks-datasets/samples/lending_club/parquet/")
)

Spark SQL file sources


To read files in a batch operation when defining datasets in SQL, you can use Spark SQL syntax. The following
example reads Parquet data from files:

CREATE OR REFRESH LIVE TABLE customers


AS SELECT * FROM parquet.`/databricks-datasets/samples/lending_club/parquet/`
Delta Live Tables data publishing
7/21/2022 • 2 minutes to read

You can make the output data of your pipeline discoverable and available to query by publishing datasets to the
Azure Databricks metastore. To publish datasets to the metastore, enter a database name in the Target field
when you create a pipeline. You can also add a target database to an existing pipeline:
1. Click the Settings button.
2. Add the target setting to configure a database name for your tables.

3. Click Save .

4. Click the button to start a new update for your pipeline.


After the update completes, you can view the database and tables, query the data, or use the data in
downstream applications.
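For example, after an update completes you can read a published table from any notebook; the database and table names below are hypothetical:

# "retail_prod" and "customers" are hypothetical target database and table names.
df = spark.table("retail_prod.customers")
df.show(5)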
You can use this feature with multiple environment configurations to publish to different databases based on the
environment. For example, you can publish to a dev database for development and a prod database for
production data.
When you create a target configuration, only tables and associated metadata are published. Views are not
published to the metastore.

Exclude tables
To prevent publishing of intermediate tables that are not intended for external consumption, mark them as
TEMPORARY :

CREATE TEMPORARY LIVE TABLE temp_table


AS SELECT ... ;

@dlt.table(
temporary=True)
def temp_table():
return ("...")
Streaming data processing
7/21/2022 • 3 minutes to read

Many applications require that tables be updated based on continually arriving data. However, as data sizes
grow, the resources required to reprocess data with each update can become prohibitive. You can define a
streaming table or view to incrementally compute continually arriving data. Streaming tables and views reduce
the cost of ingesting new data and the latency at which new data is made available.
When an update is triggered for a pipeline, a streaming table or view processes only new data that has arrived
since the last update. Data already processed is automatically tracked by the Delta Live Tables runtime.

Streaming ingestion from external data sources


To ingest streaming data, you must define a streaming live table from a streaming source; for example, you can
read external data as a stream with the following code:
Python

inputPath = "/databricks-datasets/structured-streaming/events/"

@dlt.table
def streaming_bronze_table():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(inputPath)
)

SQL

CREATE OR REFRESH STREAMING LIVE TABLE streaming_bronze_table


AS SELECT * FROM cloud_files("/databricks-datasets/structured-streaming/events/", "json")

Streaming from other datasets within a pipeline


You can also stream data from other tables in the same pipeline.
Python

@dlt.table
def streaming_silver_table():
return dlt.read_stream("streaming_bronze_table").where(...)

SQL

CREATE OR REFRESH STREAMING LIVE TABLE streaming_silver_table


AS SELECT
*
FROM
STREAM(LIVE.streaming_bronze_table)
WHERE ...
Process streaming and batch data in a single pipeline
Because streaming live tables use What is Apache Spark Structured Streaming?, a streaming live table can only
process append queries; that is, queries where new rows are inserted into the source table. Processing updates
from source tables, for example, merges and deletes, is not supported. To process updates, see the APPLY
CHANGES INTO command.
A common streaming pattern includes the ingestion of source data to create the initial datasets in a pipeline.
These initial datasets are commonly called bronze tables and often perform only simple transformations. Because
repeatedly reprocessing inefficient formats like JSON can be prohibitively expensive, these simple
transformations are a good fit for streaming live tables.
By contrast, the final tables in a pipeline, commonly referred to as gold tables, often require complicated
aggregations or read from sources that are the targets of an APPLY CHANGES INTO operation. Because these
operations inherently create updates rather than appends, they are not supported as inputs to streaming live
tables. These transformations are better suited for materialization as a live table.
By mixing streaming live tables and live tables in a single pipeline, you can simplify your pipeline, avoid
costly re-ingestion or reprocessing of raw data, and use the full power of SQL to compute complex
aggregations over an efficiently encoded and filtered dataset. The following example illustrates this type of
mixed processing:
Python

@dlt.table
def streaming_bronze():
return (
# Since this is a streaming source, this table is incremental.
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load("abfss://path/to/raw/data")
)

@dlt.table
def streaming_silver():
# Since we read the bronze table as a stream, this silver table is also
# updated incrementally.
return dlt.read_stream("streaming_bronze").where(...)

@dlt.table
def live_gold():
# This table will be recomputed completely by reading the whole silver table
# when it is updated.
return dlt.read("streaming_silver").groupBy("user_id").count()

SQL

CREATE OR REFRESH STREAMING LIVE TABLE streaming_bronze


AS SELECT * FROM cloud_files(
"abfss://path/to/raw/data", "json"
)

CREATE OR REFRESH STREAMING LIVE TABLE streaming_silver


AS SELECT * FROM STREAM(LIVE.streaming_bronze) WHERE...

CREATE OR REFRESH LIVE TABLE live_gold


AS SELECT user_id, count(*) FROM LIVE.streaming_silver GROUP BY user_id

Learn more about using Auto Loader to efficiently read JSON files from Azure storage for incremental
processing.
Streaming joins
Delta Live Tables supports various join strategies for updating tables.
Stream-batch joins
Stream-batch joins are a good choice when denormalizing a continuous stream of append-only data with a
primarily static dimension table. Each time the derived dataset is updated, new records from the stream are
joined with a static snapshot of the batch table when the update started. Records added or updated in the static
table are not reflected in the table until a full refresh is performed.
The following are examples of stream-batch joins:
Python

@dlt.table
def customer_sales():
return dlt.read_stream("sales").join(read("customers"), ["customer_id"], "left")

SQL

CREATE OR REFRESH STREAMING LIVE TABLE customer_sales


AS SELECT * FROM STREAM(LIVE.sales)
LEFT JOIN LIVE.customers USING (customer_id)

In continuous pipelines, the batch side of the join is regularly polled for updates with each micro-batch.

Streaming aggregation
Simple distributive aggregates like count, min, max, or sum, and algebraic aggregates like average or standard
deviation can also be calculated incrementally with streaming live tables. Databricks recommends incremental
aggregation for queries with a limited number of groups, for example, a query with a GROUP BY country clause.
Only new input data is read with each update.
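The following is a minimal sketch of such an incremental aggregation; the dataset and column names are hypothetical:

import dlt

@dlt.table
def order_counts_by_country():
  # Incrementally maintained count per country; assumes an upstream
  # streaming dataset named "sales_orders" with a "country" column.
  return dlt.read_stream("sales_orders").groupBy("country").count()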
Change data capture with Delta Live Tables
7/21/2022 • 13 minutes to read

NOTE
This article describes how to update tables in your Delta Live Tables pipeline based on changes in source data. To learn
how to record and query row-level change information for Delta tables, see Change data feed.

IMPORTANT
Delta Live Tables support for SCD type 2 is in Public Preview.

You can use change data capture (CDC) in Delta Live Tables to update tables based on changes in source data.
CDC is supported in the Delta Live Tables SQL and Python interfaces. Delta Live Tables supports updating tables
with slowly changing dimensions (SCD) type 1 and type 2:
Use SCD type 1 to update records directly. History is not retained for records that are updated.
Use SCD type 2 to retain the history of all updates to records.
To represent the effective period of a change, SCD Type 2 stores every change with the generated __START_AT
and __END_AT columns. Delta Live Tables uses the column specified by SEQUENCE BY in SQL or sequence_by in
Python to generate the __START_AT and __END_AT columns.

NOTE
The data type of the __START_AT and __END_AT columns is the same as the data type of the specified SEQUENCE BY
field.

SQL
Use the APPLY CHANGES INTO statement to use Delta Live Tables CDC functionality:

APPLY CHANGES INTO LIVE.table_name


FROM source
KEYS (keys)
[WHERE condition]
[IGNORE NULL UPDATES]
[APPLY AS DELETE WHEN condition]
SEQUENCE BY orderByColumn
[COLUMNS {columnList | * EXCEPT (exceptColumnList)}]
[STORED AS {SCD TYPE 1 | SCD TYPE 2}]
CLAUSES

KEYS

The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC
events apply to specific records in the target table.

This clause is required.

WHERE

A condition applied to both source and target to trigger optimizations such as partition pruning. This condition cannot be
used to drop source rows; all CDC rows in the source must satisfy this condition or an error is thrown. Using the WHERE
clause is optional and should be used when your processing requires specific optimizations.

This clause is optional.

IGNORE NULL UPDATES

Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and IGNORE
NULL UPDATES is specified, columns with a null will retain their existing values in the target. This also applies to nested
columns with a value of null .

This clause is optional.

The default is to overwrite existing columns with null values.

APPLY AS DELETE WHEN

Specifies when a CDC event should be treated as a DELETE rather than an upsert. To handle out-of-order data, the deleted
row is temporarily retained as a tombstone in the underlying Delta table, and a view is created in the metastore that filters out
these tombstones. The retention interval can be configured with the
pipelines.cdc.tombstoneGCThresholdInSeconds table property.

This clause is optional.

SEQUENCE BY

The column name specifying the logical order of CDC events in the source data. Delta Live Tables uses this sequencing to
handle change events that arrive out of order.

This clause is required.

COLUMNS

Specifies a subset of columns to include in the target table. You can either:

* Specify the complete list of columns to include: COLUMNS (userId, name, city) .
* Specify a list of columns to exclude: COLUMNS * EXCEPT (operation, sequenceNum)

This clause is optional.

The default is to include all columns in the target table when the COLUMNS clause is not specified.

STORED AS

Whether to store records as SCD type 1 or SCD type 2.

This clause is optional.

The default is SCD type 1.

The default behavior for INSERT and UPDATE events is to upsert CDC events from the source: update any rows
in the target table that match the specified key(s) or insert a new row when a matching record does not exist in
the target table. Handling for DELETE events can be specified with the APPLY AS DELETE WHEN condition.

Python
Use the apply_changes() function in the Python API to use Delta Live Tables CDC functionality. The Delta Live
Tables Python CDC interface also provides the create_streaming_live_table() function. You can use this function
to create the target table required by the apply_changes() function. See the example queries.
Apply changes function

apply_changes(
target = "<target-table>",
source = "<data-source>",
keys = ["key1", "key2", "keyN"],
sequence_by = "<sequence-column>",
ignore_null_updates = False,
apply_as_deletes = None,
column_list = None,
except_column_list = None,
stored_as_scd_type = <type>
)

ARGUMENTS

target

Type: str

The name of the table to be updated. You can use the create_streaming_live_table() function to create the target table before
executing the apply_changes() function.

This parameter is required.

source

Type: str

The data source containing CDC records.

This parameter is required.



keys

Type: list

The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC
events apply to specific records in the target table.

You can specify either:

* A list of strings: ["userId", "orderId"]


* A list of Spark SQL col() functions: [col("userId"), col("orderId")]

Arguments to col() functions cannot include qualifiers. For example, you can use col(userId) , but you cannot use
col(source.userId) .

This parameter is required.

sequence_by

Type: str or col()

The column name specifying the logical order of CDC events in the source data. Delta Live Tables uses this sequencing to
handle change events that arrive out of order.

You can specify either:

* A string: "sequenceNum"
* A Spark SQL col() function: col("sequenceNum")

Arguments to col() functions cannot include qualifiers. For example, you can use col(userId) , but you cannot use
col(source.userId) .

This parameter is required.

ignore_null_updates

Type: bool

Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and
ignore_null_updates is True , columns with a null will retain their existing values in the target. This also applies to
nested columns with a value of null . When ignore_null_updates is False , existing values will be overwritten with
null values.

This parameter is optional.

The default is False .



apply_as_deletes

Type: str or expr()

Specifies when a CDC event should be treated as a DELETE rather than an upsert. To handle out-of-order data, the deleted
row is temporarily retained as a tombstone in the underlying Delta table, and a view is created in the metastore that filters out
these tombstones. The retention interval can be configured with the
pipelines.cdc.tombstoneGCThresholdInSeconds table property.

You can specify either:

* A string: "Operation = 'DELETE'"


* A Spark SQL expr() function: expr("Operation = 'DELETE'")

This parameter is optional.

column_list
except_column_list

Type: list

A subset of columns to include in the target table. Use column_list to specify the complete list of columns to include. Use
except_column_list to specify the columns to exclude. You can declare either value as a list of strings or as Spark SQL
col() functions:

* column_list = ["userId", "name", "city"] .


* column_list = [col("userId"), col("name"), col("city")]
* except_column_list = ["operation", "sequenceNum"]
* except_column_list = [col("operation"), col("sequenceNum")

Arguments to col() functions cannot include qualifiers. For example, you can use col(userId) , but you cannot use
col(source.userId) .

This parameter is optional.

The default is to include all columns in the target table when no column_list or except_column_list argument is passed
to the function.

stored_as_scd_type

Type: str or int

Whether to store records as SCD type 1 or SCD type 2.

Set to 1 for SCD type 1 or 2 for SCD type 2.

This clause is optional.

The default is SCD type 1.

The default behavior for INSERT and UPDATE events is to upsert CDC events from the source: update any rows
in the target table that match the specified key(s) or insert a new row when a matching record does not exist in
the target table. Handling for DELETE events can be specified with the apply_as_deletes argument.
Create a target table for output records
Use the create_streaming_live_table() function to create a target table for the apply_changes() output records.
NOTE
The create_target_table() function is deprecated. Databricks recommends updating existing code to use the
create_streaming_live_table() function.

create_streaming_live_table(
name = "<table-name>",
comment = "<comment>"
spark_conf={"<key>" : "<value", "<key" : "<value>"},
table_properties={"<key>" : "<value>", "<key>" : "<value>"},
partition_cols=["<partition-column>", "<partition-column>"],
path="<storage-location-path>",
schema="schema-definition"
)

ARGUMENTS

name

Type: str

The table name.

This parameter is required.

comment

Type: str

An optional description for the table.

spark_conf

Type: dict

An optional list of Spark configurations for the execution of this query.

table_properties

Type: dict

An optional list of table properties for the table.

partition_cols

Type: array

An optional list of one or more columns to use for partitioning the table.

path

Type: str

An optional storage location for table data. If not set, the system will default to the pipeline storage location.

schema

Type: str or StructType

An optional schema definition for the table. Schemas can be defined as a SQL DDL string, or with a Python
StructType .

When specifying the schema of the apply_changes target table, you must also include the __START_AT and
__END_AT columns with the same data type as the sequence_by field. For example, if your target table has the
columns key (STRING), value (STRING), and sequencing (LONG):

create_streaming_live_table(
name = "target",
comment = "Target for CDC ingestion.",
partition_cols=["value"],
path="$tablePath",
schema=
StructType(
[
StructField('key', StringType()),
StructField('value', StringType()),
StructField('sequencing', LongType()),
StructField('__START_AT', LongType()),
StructField('__END_AT', LongType())
]
)
)

NOTE
You must ensure that a target table is created before you execute the APPLY CHANGES INTO query or
apply_changes function. See the example queries.
Metrics for the target table, such as number of output rows, are not available.
SCD type 2 updates will add a history row for every input row, even if no columns have changed.
The target of the APPLY CHANGES INTO query or apply_changes function cannot be used as a source for a
streaming live table. A table that reads from the target of an APPLY CHANGES INTO query or apply_changes
function must be a live table.
Expectations are not supported in an APPLY CHANGES INTO query or apply_changes() function. To use expectations
for the source or target dataset:
Add expectations on source data by defining an intermediate table with the required expectations and use this
dataset as the source for the target table.
Add expectations on target data with a downstream table that reads input data from the target table.
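For example, the following is a minimal sketch of the first approach, defining an intermediate view with expectations and using it as the apply_changes() source; the dataset and column names follow the examples later in this article:

import dlt
from pyspark.sql.functions import col

@dlt.view
@dlt.expect_or_drop("valid_user_id", "userId IS NOT NULL")
def users_clean():
  # Raw CDC feed; expectations run here, not in apply_changes().
  return spark.readStream.format("delta").table("cdc_data.users")

dlt.create_streaming_live_table("target")

dlt.apply_changes(
  target = "target",
  source = "users_clean",
  keys = ["userId"],
  sequence_by = col("sequenceNum")
)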

Table properties
The following table properties are added to control the behavior of tombstone management for DELETE events:

TABLE PROPERTIES

pipelines.cdc.tombstoneGCThresholdInSeconds

Set this value to match the highest expected interval between out-of-order data.

pipelines.cdc.tombstoneGCFrequencyInSeconds

Controls how frequently tombstones are checked for cleanup.

Default value: 5 minutes
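For example, the following minimal sketch sets both properties on a Python target table; the values shown are illustrative, not defaults:

import dlt

dlt.create_streaming_live_table(
  name = "target",
  table_properties = {
    # Illustrative values: tolerate up to one hour of out-of-order data and
    # check for expired tombstones every ten minutes.
    "pipelines.cdc.tombstoneGCThresholdInSeconds": "3600",
    "pipelines.cdc.tombstoneGCFrequencyInSeconds": "600"
  }
)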

Examples
These examples demonstrate Delta Live Tables SCD type 1 and type 2 queries that update target tables based on
source events that:
1. Create new user records.
2. Delete a user record.
3. Update user records. In the SCD type 1 example, the last UPDATE operations arrive late and are dropped from
the target table, demonstrating the handling of out of order events.
The following are the input records for these examples.

USERID  NAME  CITY  OPERATION  SEQUENCENUM

124 Raul Oaxaca INSERT 1

123 Isabel Monterrey INSERT 1

125 Mercedes Tijuana INSERT 2

123 null null DELETE 5

125 Mercedes Guadalajara UPDATE 5

125 Mercedes Mexicali UPDATE 4

123 Isabel Chihuahua UPDATE 4

After running the SCD type 1 example, the target table contains the following records:

USERID  NAME  CITY

124 Raul Oaxaca

125 Mercedes Guadalajara

After running the SCD type 2 example, the target table contains the following records:

USERID  NAME  CITY  __START_AT  __END_AT

123 Isabel Monterrey 1 4

123 Isabel Chihuahua 4 5

124 Raul Oaxaca 1 null



125 Mercedes Tijuana 2 4

125 Mercedes Mexicali 4 5

125 Mercedes Guadalajara 5 null

Generate test data


To create the test records for this example:

1. Go to your Azure Databricks landing page and select Create a notebook or click Create in the
sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, Generate test CDC records .
Select SQL from the Default Language drop-down menu.
3. If there are running clusters, the Cluster drop-down displays. Select the cluster you want to attach the
notebook to. You can also create a new cluster to attach to after you create the notebook.
4. Click Create .
5. Copy the following query and paste it into the first cell of the new notebook:

CREATE SCHEMA IF NOT EXISTS cdc_data;

CREATE TABLE
cdc_data.users
AS SELECT
col1 AS userId,
col2 AS name,
col3 AS city,
col4 AS operation,
col5 AS sequenceNum
FROM (
VALUES
-- Initial load.
(123, "Isabel", "Monterrey", "INSERT", 1),
(124, "Raul", "Oaxaca", "INSERT", 1),
-- One new user.
(125, "Mercedes", "Tijuana", "INSERT", 2),
-- Isabel is removed from the system and Mercedes moved to Guadalajara.
(123, null, null, "DELETE", 5),
(125, "Mercedes", "Guadalajara", "UPDATE", 5),
-- This batch of updates arrived out of order. The above batch at sequenceNum 5 will be the final
state.
(123, "Isabel", "Chihuahua", "UPDATE", 4),
(125, "Mercedes", "Mexicali", "UPDATE", 4)
);

6. To run the notebook and populate the test records, in the cell actions menu at the far
right, click and select Run Cell , or press shift+enter .
Create and run the SCD type 1 example pipeline
1. Go to your Azure Databricks landing page and select Create a notebook or click Create in the sidebar
and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, DLT CDC example . Select
Python or SQL from the Default Language drop-down menu based on your preferred language. You can
leave Cluster set to the default value. The Delta Live Tables runtime creates a cluster before it runs your
pipeline.
3. Click Create .
4. Copy the Python or SQL query and paste it into the first cell of the notebook.
5. Create a new pipeline and add the notebook in the Notebook Libraries field. To publish the output of the
pipeline processing, you can optionally enter a database name in the Target field.
6. Start the pipeline. If you configured the Target value, you can view and validate the results of the query.
Example queries
Python

import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def users():
return spark.readStream.format("delta").table("cdc_data.users")

dlt.create_streaming_live_table("target")

dlt.apply_changes(
target = "target",
source = "users",
keys = ["userId"],
sequence_by = col("sequenceNum"),
apply_as_deletes = expr("operation = 'DELETE'"),
except_column_list = ["operation", "sequenceNum"],
stored_as_scd_type = 1
)

SQL

-- Create and populate the target table.


CREATE OR REFRESH STREAMING LIVE TABLE target;

APPLY CHANGES INTO


live.target
FROM
stream(cdc_data.users)
KEYS
(userId)
APPLY AS DELETE WHEN
operation = "DELETE"
SEQUENCE BY
sequenceNum
COLUMNS * EXCEPT
(operation, sequenceNum)
STORED AS
SCD TYPE 1;

Create and run the SCD type 2 example pipeline


1. Go to your Azure Databricks landing page and select Create a notebook or click Create in the sidebar
and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, DLT CDC example . Select
Python or SQL from the Default Language drop-down menu based on your preferred language. You can
leave Cluster set to the default value. The Delta Live Tables runtime creates a cluster before it runs your
pipeline.
3. Click Create .
4. Copy the Python or SQL query and paste it into the first cell of the notebook.
5. Create a new pipeline and add the notebook in the Notebook Libraries field. To publish the output of the
pipeline processing, you can optionally enter a database name in the Target field.
6. Start the pipeline. If you configured the Target value, you can view and validate the results of the query.
Example queries
Python

import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def users():
return spark.readStream.format("delta").table("cdc_data.users")

dlt.create_streaming_live_table("target")

dlt.apply_changes(
target = "target",
source = "users",
keys = ["userId"],
sequence_by = col("sequenceNum"),
apply_as_deletes = expr("operation = 'DELETE'"),
except_column_list = ["operation", "sequenceNum"],
stored_as_scd_type = "2"
)

SQL

-- Create and populate the target table.


CREATE OR REFRESH STREAMING LIVE TABLE target;

APPLY CHANGES INTO


live.target
FROM
stream(cdc_data.users)
KEYS
(userId)
APPLY AS DELETE WHEN
operation = "DELETE"
SEQUENCE BY
sequenceNum
COLUMNS * EXCEPT
(operation, sequenceNum)
STORED AS
SCD TYPE 2;
Delta Live Tables API guide
7/21/2022 • 13 minutes to read

The Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create a pipeline
ENDPOINT  HTTP METHOD

2.0/pipelines POST

Creates a new Delta Live Tables pipeline.


Example
This example creates a new triggered pipeline.
Request

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/pipelines \
--data @pipeline-settings.json

pipeline-settings.json :

{
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"continuous": false
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file.
Response

{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
}

Request structure
See PipelineSettings.
Response structure
FIELD NAME  TYPE  DESCRIPTION

pipeline_id  STRING  The unique identifier for the newly created pipeline.
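The same request can be made from Python; the following is a minimal sketch using the requests library and a personal access token, with placeholder values for the workspace URL, token, and pipeline settings:

import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"  # placeholder

pipeline_settings = {
    "name": "Wikipedia pipeline (SQL)",
    "storage": "/Users/username/data",
    "libraries": [{"notebook": {"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"}}],
    "continuous": False
}

response = requests.post(
    f"{workspace_url}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {token}"},
    json=pipeline_settings,
)
print(response.json()["pipeline_id"])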

Edit a pipeline
ENDPOINT  HTTP METHOD

2.0/pipelines/{pipeline_id} PUT

Updates the settings for an existing pipeline.


Example
This example adds a target parameter to the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request PUT \


https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 \
--data @pipeline-settings.json

pipeline-settings.json
{
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Request structure
See PipelineSettings.

Delete a pipeline
ENDPOINT  HTTP METHOD

2.0/pipelines/{pipeline_id} DELETE

Deletes a pipeline from the Delta Live Tables system.


Example
This example deletes the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request DELETE \


https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.

Start a pipeline update


ENDPOINT  HTTP METHOD

2.0/pipelines/{pipeline_id}/updates POST

Starts an update for a pipeline.


Example
This example starts an update with full refresh for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates \
--data '{ "full_refresh": "true" }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response

{
"update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}

Request structure
FIELD NAME  TYPE  DESCRIPTION

full_refresh BOOLEAN Whether to reprocess all data. If true


, the Delta Live Tables system will reset
all tables before running the pipeline.

This field is optional.

The default value is false

Response structure
FIELD NAME  TYPE  DESCRIPTION

update_id STRING The unique identifier of the newly


created update.

Stop any active pipeline update


ENDPOINT  HTTP METHOD

2.0/pipelines/{pipeline_id}/stop POST

Stops any active pipeline update. If no update is running, this request is a no-op.
Example
This example stops an update for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/stop

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.

List pipeline events


ENDPOINT  HTTP METHOD

2.0/pipelines/{pipeline_id}/events GET

Retrieves events for a pipeline.


Example
This example retrieves a maximum of 5 events for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 .
Request

curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
--data '{"max_results": 5}'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Request structure
FIELD NAME  TYPE  DESCRIPTION

page_token STRING Page token returned by previous call.


This field is mutually exclusive with all
fields in this request except
max_results. An error is returned if any
fields other than max_results are set
when this field is set.

This field is optional.



max_results INT32 The maximum number of entries to


return in a single page. The system
may return fewer than max_results
events in a response, even if there are
more events available.

This field is optional.

The default value is 25.

The maximum value is 100. An error is


returned if the value of
max_results is greater than 100.

order_by STRING A string indicating a sort order by


timestamp for the results, for example,
["timestamp asc"] .

The sort order can be ascending or


descending. By default, events are
returned in descending order by
timestamp.

This field is optional.

filter STRING Criteria to select a subset of results,


expressed using a SQL-like syntax. The
supported filters are:

* level='INFO' (or WARN or ERROR


)
* level in ('INFO', 'WARN')
* id='[event-id]'
* timestamp > 'TIMESTAMP' (or >= ,
< , <= ,=)

Composite expressions are supported,


for example:
level in ('ERROR', 'WARN') AND
timestamp> '2021-07-
22T06:37:33.083Z'

This field is optional.

Response structure
FIELD NAME  TYPE  DESCRIPTION

events An array of pipeline events. The list of events matching the request
criteria.

next_page_token STRING If present, a token to fetch the next


page of events.

prev_page_token STRING If present, a token to fetch the


previous page of events.
Get pipeline details
ENDPOINT  HTTP METHOD

2.0/pipelines/{pipeline_id} GET

Gets details about a pipeline, including the pipeline settings and recent updates.
Example
This example gets details for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"spec": {
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
},
"state": "IDLE",
"cluster_id": "1234-567891-abcde123",
"name": "Wikipedia pipeline (SQL)",
"creator_user_name": "username",
"latest_updates": [
{
"update_id": "8a0b6d02-fbd0-11eb-9a03-0242ac130003",
"state": "COMPLETED",
"creation_time": "2021-08-13T00:37:30.279Z"
},
{
"update_id": "a72c08ba-fbd0-11eb-9a03-0242ac130003",
"state": "CANCELED",
"creation_time": "2021-08-13T00:35:51.902Z"
},
{
"update_id": "ac37d924-fbd0-11eb-9a03-0242ac130003",
"state": "FAILED",
"creation_time": "2021-08-13T00:33:38.565Z"
}
],
"run_as_user_name": "username"
}

Response structure
FIELD NAME  TYPE  DESCRIPTION

pipeline_id STRING The unique identifier of the pipeline.

spec PipelineSettings The pipeline settings.

state STRING The state of the pipeline. One of


IDLE or RUNNING .

If state = RUNNING , then there is at


least one active update.

cluster_id STRING The identifier of the cluster running the


pipeline.

name STRING The user-friendly name for this


pipeline.

creator_user_name STRING The username of the pipeline creator.

latest_updates An array of UpdateStateInfo Status of the most recent updates for


the pipeline, ordered with the newest
update first.

run_as_user_name STRING The username that the pipeline runs


as.

Get update details


ENDPOINT  HTTP METHOD

2.0/pipelines/{pipeline_id}/updates/{update_id} GET

Gets details for a pipeline update.


Example
This example gets details for update 9a84f906-fc51-11eb-9a03-0242ac130003 for the pipeline with ID
a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates/9a84f906-fc51-
11eb-9a03-0242ac130003

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"update": {
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"update_id": "9a84f906-fc51-11eb-9a03-0242ac130003",
"config": {
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"configuration": {
"pipelines.numStreamRetryAttempts": "5"
},
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"filters": {},
"email_notifications": {},
"continuous": false,
"development": false
},
"cause": "API_CALL",
"state": "COMPLETED",
"creation_time": 1628815050279,
"full_refresh": true
}
}

Response structure
FIELD NAME  TYPE  DESCRIPTION

pipeline_id STRING The unique identifier of the pipeline.

update_id STRING The unique identifier of this update.

config PipelineSettings The pipeline settings.

cause STRING The trigger for the update. One of


API_CALL ,
RETRY_ON_FAILURE ,
SERVICE_UPGRADE .

state STRING The state of the update. One of


QUEUED , CREATED
WAITING_FOR_RESOURCES ,
INITIALIZING , RESETTING ,
SETTING_UP_TABLES , RUNNING ,
STOPPING , COMPLETED ,
FAILED , or CANCELED .

cluster_id STRING The identifier of the cluster running the


pipeline.

creation_time INT64 The timestamp when the update was


created.

full_refresh BOOLEAN Whether the update was triggered to


perform a full refresh. If true, all
pipeline tables were reset before
running the update.
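A common pattern is to start an update and then poll this endpoint until the update reaches a terminal state. The following is a minimal sketch using the Python requests library; the workspace URL, token, and pipeline ID are placeholders:

import time
import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"  # placeholder
pipeline_id = "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"  # placeholder
headers = {"Authorization": f"Bearer {token}"}

# Start an update, then poll its state until it finishes.
update_id = requests.post(
    f"{workspace_url}/api/2.0/pipelines/{pipeline_id}/updates",
    headers=headers,
).json()["update_id"]

while True:
    state = requests.get(
        f"{workspace_url}/api/2.0/pipelines/{pipeline_id}/updates/{update_id}",
        headers=headers,
    ).json()["update"]["state"]
    if state in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(30)

print(f"Update {update_id} finished with state {state}")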

List pipelines
ENDPOINT  HTTP METHOD

2.0/pipelines/ GET

Lists pipelines defined in the Delta Live Tables system.


Example
This example retrieves details for up to two pipelines, starting from a specified page_token :
Request

curl -n -X GET https://<databricks-instance>/api/2.0/pipelines \


--data '{ "page_token": "eyJ...==", "max_results": 2 }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"statuses": [
{
"pipeline_id": "e0f01758-fc61-11eb-9a03-0242ac130003",
"state": "IDLE",
"name": "dlt-pipeline-python",
"latest_updates": [
{
"update_id": "ee9ae73e-fc61-11eb-9a03-0242ac130003",
"state": "COMPLETED",
"creation_time": "2021-08-13T00:34:21.871Z"
}
],
"creator_user_name": "username"
},
{
"pipeline_id": "f4c82f5e-fc61-11eb-9a03-0242ac130003",
"state": "IDLE",
"name": "dlt-pipeline-python",
"creator_user_name": "username"
}
],
"next_page_token": "eyJ...==",
"prev_page_token": "eyJ..x9"
}

Request structure
FIELD NAME  TYPE  DESCRIPTION

page_token STRING Page token returned by previous call.

This field is optional.

max_results INT32 The maximum number of entries to


return in a single page. The system
may return fewer than max_results
events in a response, even if there are
more events available.

This field is optional.

The default value is 25.

The maximum value is 100. An error is


returned if the value of
max_results is greater than 100.

order_by An array of STRING A list of strings specifying the order of


results, for example,
["name asc"] . Supported
order_by fields are id and
name . The default is id asc .

This field is optional.



filter STRING Select a subset of results based on the


specified criteria.

The supported filters are:

"notebook='<path>'" to select
pipelines that reference the provided
notebook path.

name LIKE '[pattern]' to select


pipelines with a name that matches
pattern . Wildcards are supported,
for example:
name LIKE '%shopping%'

Composite filters are not supported.

This field is optional.

Response structure
FIELD NAME  TYPE  DESCRIPTION

statuses  An array of PipelineStateInfo  The list of pipelines matching the request criteria.

next_page_token STRING If present, a token to fetch the next


page of events.

prev_page_token STRING If present, a token to fetch the


previous page of events.
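To page through all pipelines in a workspace, pass the returned next_page_token on subsequent requests. The following is a minimal sketch using the Python requests library, with placeholder values for the workspace URL and token:

import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
headers = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

pipelines = []
params = {"max_results": 100}
while True:
    page = requests.get(
        f"{workspace_url}/api/2.0/pipelines",
        headers=headers,
        params=params,
    ).json()
    pipelines.extend(page.get("statuses", []))
    if "next_page_token" not in page:
        break
    params = {"page_token": page["next_page_token"], "max_results": 100}

print(f"Found {len(pipelines)} pipelines")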

Data structures
In this section:
KeyValue
NotebookLibrary
PipelineLibrary
PipelineSettings
PipelineStateInfo
PipelinesNewCluster
UpdateStateInfo
KeyValue
A key-value pair that specifies configuration parameters.

FIELD NAME  TYPE  DESCRIPTION

key STRING The configuration property name.

value STRING The configuration property value.

NotebookLibrary
A specification for a notebook containing pipeline code.

FIELD NAME  TYPE  DESCRIPTION

path STRING The absolute path to the notebook.

This field is required.

PipelineLibrary
A specification for pipeline dependencies.

FIELD NAME  TYPE  DESCRIPTION

notebook NotebookLibrary The path to a notebook defining Delta


Live Tables datasets. The path must be
in the Databricks workspace, for
example:
{ "notebook" : { "path" : "/my-
pipeline-notebook-path" } }
.

PipelineSettings
The settings for a pipeline deployment.

FIELD NAME  TYPE  DESCRIPTION

id STRING The unique identifier for this pipeline.

The identifier is created by the Delta


Live Tables system, and must not be
provided when creating a pipeline.

name STRING A user-friendly name for this pipeline.

This field is optional.

By default, the pipeline name must be


unique. To use a duplicate name, set
allow_duplicate_names to true in
the pipeline configuration.

storage STRING A path to a DBFS directory for storing


checkpoints and tables created by the
pipeline.

This field is optional.

The system uses a default location if


this field is empty.

configuration A map of STRING:STRING A list of key-value pairs to add to the


Spark configuration of the cluster that
will run the pipeline.

This field is optional.

Elements must be formatted as


key:value pairs.

clusters An array of PipelinesNewCluster An array of specifications for the


clusters to run the pipeline.

This field is optional.

If this is not specified, the system will


select a default cluster configuration
for the pipeline.

libraries An array of PipelineLibrary The notebooks containing the pipeline


code and any dependencies required
to run the pipeline.

target STRING A database name for persisting


pipeline output data.

See Delta Live Tables data publishing


for more information.

continuous BOOLEAN Whether this is a continuous pipeline.

This field is optional.

The default value is false .

development BOOLEAN Whether to run the pipeline in


development mode.

This field is optional.

The default value is false .

PipelineStateInfo
The state of a pipeline, the status of the most recent updates, and information about associated resources.

FIELD NAME  TYPE  DESCRIPTION

state STRING The state of the pipeline. One of


IDLE or RUNNING .

pipeline_id STRING The unique identifier of the pipeline.

cluster_id STRING The unique identifier of the cluster


running the pipeline.

name STRING The user-friendly name of the pipeline.

latest_updates An array of UpdateStateInfo Status of the most recent updates for


the pipeline, ordered with the newest
update first.

creator_user_name STRING The username of the pipeline creator.



run_as_user_name STRING The username that the pipeline runs


as. This is a read only value derived
from the pipeline owner.

PipelinesNewCluster
A pipeline cluster specification.
The Delta Live Tables system sets the following attributes. These attributes cannot be configured by users:
spark_version
init_scripts

FIELD NAME  TYPE  DESCRIPTION

label STRING A label for the cluster specification,


either
default to configure the default
cluster, or
maintenance to configure the
maintenance cluster.

This field is optional. The default value


is default .

spark_conf KeyValue An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

ssh_public_keys An array of STRING SSH public key contents that will be


added to each Spark node in this
cluster. The corresponding private keys
can be used to login with the user
name ubuntu on port 2200 . Up to
10 keys can be specified.

custom_tags KeyValue An object containing a set of tags for


cluster resources. Databricks tags all
cluster resources with these tags in
addition to default_tags.

Note :

* Tags are not supported on legacy


node types such as compute-
optimized and memory-optimized
* Azure Databricks allows at most 45
custom tags.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If this
configuration is provided, the logs will
be delivered to the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

spark_env_vars KeyValue An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , Databricks
recommends appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Azure Databricks
managed environmental variables are
included as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. See Pools.

driver_instance_pool_id STRING The optional ID of the instance pool to


use for the driver node. You must also
specify
instance_pool_id . See Instance
Pools API 2.0.

policy_id STRING A cluster policy ID.

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

When reading the properties of a


cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field is updated to
reflect the target size of 10 workers,
whereas the workers listed in executors
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed to


automatically scale clusters up and
down based on load.

This field is optional.

apply_policy_default_values BOOLEAN Whether to use policy default values


for missing cluster attributes.

UpdateStateInfo
The current state of a pipeline update.

FIELD NAME  TYPE  DESCRIPTION

update_id STRING The unique identifier for this update.



state STRING The state of the update. One of


QUEUED , CREATED ,
WAITING_FOR_RESOURCES ,
INITIALIZING , RESETTING ,
SETTING_UP_TABLES , RUNNING ,
STOPPING , COMPLETED ,
FAILED , or CANCELED .

creation_time STRING Timestamp when this update was


created.
Delta Live Tables settings
7/21/2022 • 7 minutes to read

Delta Live Tables settings specify one or more notebooks that implement a pipeline and the parameters
specifying how to run the pipeline in an environment, for example, development, staging, or production. Delta
Live Tables settings are expressed as JSON and can be modified in the Delta Live Tables UI.

Settings
FIELDS

id

Type: string

A globally unique identifier for this pipeline. The identifier is assigned by the system and cannot be changed.

name

Type: string

A user-friendly name for this pipeline. The name can be used to identify pipeline jobs in the UI.

storage

Type: string

A location on DBFS or cloud storage where output data and metadata required for pipeline execution are stored. Tables and
metadata are stored in subdirectories of this location.

When the storage setting is not specified, the system will default to a location in dbfs:/pipelines/ .

The storage setting cannot be changed after a pipeline is created.

configuration

Type: object

An optional list of settings to add to the Spark configuration of the cluster that will run the pipeline. These settings are read by
the Delta Live Tables runtime and available to pipeline queries through the Spark configuration.

Elements must be formatted as key:value pairs.

See Parameterize pipelines for an example of using the configuration object.

See Databricks Enhanced Autoscaling for an example of using the configuration object to enable Enhanced Autoscaling for
a Delta Live Tables pipeline.

libraries

Type: array of objects

An array of notebooks containing the pipeline code and required artifacts. See Configure multiple notebooks in a pipeline for
an example.

clusters

Type: array of objects

An array of specifications for the clusters to run the pipeline. See Cluster configuration for more detail.

If this is not specified, the system automatically selects a default cluster configuration for the pipeline.

continuous

Type: boolean

A flag indicating whether to run the pipeline continuously.

The default value is false .

target

Type: string

The name of a database for persisting pipeline output data. Configuring the target setting allows you to view and query the
pipeline output data from the Azure Databricks UI.

channel

Type: string

The version of the Delta Live Tables runtime to use. The supported values are:

* preview to test your pipeline with the latest runtime version.
* current to use the stable runtime version.

The channel field is optional. The default value is current. Databricks recommends using the current runtime version for production workloads.

edition

Type: string

The Delta Live Tables product edition to run the pipeline. This setting allows you to choose the best product edition based on the requirements of your pipeline:

* core to run streaming ingest workloads.
* pro to run streaming ingest and change data capture (CDC) workloads.
* advanced to run streaming ingest workloads, CDC workloads, and workloads that require Delta Live Tables expectations to enforce data quality constraints.

The edition field is optional. The default value is advanced.

photon

Type: boolean

A flag indicating whether to use the Photon runtime to run the pipeline. Photon is the Azure Databricks high-performance Spark engine. Photon-enabled pipelines are billed at a different rate than non-Photon pipelines.

The photon field is optional. The default value is false.
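Bringing several of these fields together, a minimal settings sketch might look like the following; the pipeline name, storage location, and notebook path are placeholder values:

{
  "name": "Example pipeline",
  "storage": "dbfs:/pipelines/example",
  "channel": "current",
  "edition": "advanced",
  "photon": false,
  "continuous": false,
  "libraries": [
    { "notebook": { "path": "/Users/user@databricks.com/example_notebook" } }
  ]
}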

Pipelines trigger interval


You can use pipelines.trigger.interval to control the trigger interval for a flow updating a table or an entire
pipeline. Because a triggered pipeline processes each table only once, the pipelines.trigger.interval is used
only with continuous pipelines.
Databricks recommends setting pipelines.trigger.interval on individual tables because of different defaults
for streaming versus batch queries. Set the value on a pipeline only when your processing requires controlling
updates for the entire pipeline graph.
You set pipelines.trigger.interval on a table using spark_conf in Python, or SET in SQL:

@dlt.table(
spark_conf={"pipelines.trigger.interval" : "10 seconds"}
)
def <function-name>():
return (<query>)

SET pipelines.trigger.interval='10 seconds';

CREATE OR REFRESH LIVE TABLE TABLE_NAME
AS SELECT ...

To set pipelines.trigger.interval on a pipeline, add it to the configuration object in the pipeline settings:

{
"configuration": {
"pipelines.trigger.interval": "10 seconds"
}
}
pipelines.trigger.interval

The default is based on flow type:

* Five seconds for streaming queries.
* One minute for complete queries when all input data is from Delta sources.
* Ten minutes for complete queries when some data sources may be non-Delta. See Tables and views in continuous pipelines.

The value is a number plus the time unit. The following are the valid time units:

* second , seconds
* minute , minutes
* hour , hours
* day , days

You can use the singular or plural unit when defining the value, for example:

* {"pipelines.trigger.interval" : "1 hour"}


* {"pipelines.trigger.interval" : "10 seconds"}
* {"pipelines.trigger.interval" : "30 second"}
* {"pipelines.trigger.interval" : "1 minute"}
* {"pipelines.trigger.interval" : "10 minutes"}
* {"pipelines.trigger.interval" : "10 minute"}

Cluster configuration
You can configure clusters used by managed pipelines with the same JSON format as the create cluster API. You
can specify configuration for two different cluster types: a default cluster where all processing is performed and
a maintenance cluster where daily maintenance tasks are run. Each cluster is identified using the label field.
Specifying cluster properties is optional, and the system uses defaults for any missing values.
NOTE
You cannot set the Spark version in cluster configurations. Delta Live Tables clusters run on a custom version of
Databricks Runtime that is continually updated to include the latest features.
Because a Delta Live Tables cluster automatically shuts down when not in use, referencing a cluster policy that sets
autotermination_minutes in your cluster configuration results in an error. To control cluster shutdown behavior,
you can use development or production mode or use the pipelines.clusterShutdown.delay setting in the
pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:

{
"configuration": {
"pipelines.clusterShutdown.delay": "60s"
}
}

If you set num_workers to 0 in cluster settings, the cluster is created as a Single Node cluster. Configuring an
autoscaling cluster and setting min_workers to 0 and max_workers to 0 also creates a Single Node cluster.
If you configure an autoscaling cluster and set only min_workers to 0, then the cluster is not created as a Single
Node cluster. The cluster has at least 1 active worker at all times until terminated.
An example cluster configuration to create a Single Node cluster in Delta Live Tables:

{
"clusters": [
{
"label": "default",
"num_workers": 0
}
]
}

NOTE
If you need Azure Data Lake Storage credential passthrough or other configuration to access your storage location,
specify it for both the default cluster and the maintenance cluster.

An example configuration for a default cluster and a maintenance cluster:

{
"clusters": [
{
"label": "default",
"node_type_id": "Standard_D3_v2",
"driver_node_type_id": "Standard_D3_v2",
"num_workers": 20,
"spark_conf": {
"spark.databricks.io.parquet.nativeReader.enabled": "false"
}
},
{
"label": "maintenance"
}
]
}

Cluster policies
NOTE
When using cluster policies to configure Delta Live Tables clusters, Databricks recommends applying a single policy to
both the default and maintenance clusters.

To configure a cluster policy for a pipeline cluster, create a policy with the cluster_type field set to dlt :

{
"cluster_type": {
"type": "fixed",
"value": "dlt"
}
}

In the pipeline settings, set the cluster policy_id field to the value of the policy identifier. The following example
configures the default and maintenance clusters using the cluster policy with the identifier C65B864F02000008 .

{
"clusters": [
{
"label": "default",
"policy_id": "C65B864F02000008",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
},
{
"label": "maintenance",
"policy_id": "C65B864F02000008"
}
]
}

For an example of creating and using a cluster policy, see Define limits on pipeline clusters.

Examples
Configure a pipeline and cluster
The following example configures a triggered pipeline implemented in example-notebook_1 , using DBFS for
storage, and running on a small one-node cluster:
{
"name": "Example pipeline 1",
"storage": "dbfs:/pipeline-examples/storage-location/example1",
"clusters": [
{
"num_workers": 1,
"spark_conf": {}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/user@databricks.com/example_notebook_1"
}
}
],
"continuous": false
}

Configure multiple notebooks in a pipeline


Use the libraries field to configure a pipeline with multiple notebooks. You can add notebooks in any order,
because Delta Live Tables automatically analyzes dataset dependencies to construct the processing graph for
your pipeline. The following example creates a pipeline that includes the datasets defined in example-
notebook_1 and example-notebook_2 :

{
"name": "Example pipeline 3",
"storage": "dbfs:/pipeline-examples/storage-location/example3",
"libraries": [
{ "notebook": { "path": "/example-notebook_1" } },
{ "notebook": { "path": "/example-notebook_2" } }
]
}

Create a development workflow with Delta Live Tables


You can create separate Delta Live Tables pipelines for development, staging, and production, allowing you to
test and debug your transformation logic without affecting the consumers of the data you produce. Simply
create separate pipelines that target different databases but use the same underlying code.
You can combine this functionality with Databricks Repos to create a fully isolated development environment
and a simple workflow to push from development to production.

{
  "name": "Data Ingest - DEV user@databricks",
  "target": "customers_dev_user",
  "libraries": ["/Repos/user@databricks.com/ingestion/etl.py"]
}

{
  "name": "Data Ingest - PROD",
  "target": "customers",
  "libraries": ["/Repos/production/ingestion/etl.py"]
}

Parameterize pipelines
The Python and SQL code that defines your datasets can be parameterized by the pipeline’s settings.
Parameterization enables the following use cases:
Separating long paths and other variables from your code.
Reducing the amount of data that is processed in development or staging environments to speed up testing.
Reusing the same transformation logic to process data from multiple data sources.
The following example uses the startDate configuration value to limit the development pipeline to a subset of
the input data:

CREATE OR REFRESH LIVE TABLE customer_events
AS SELECT * FROM sourceTable WHERE date > ${mypipeline.startDate};

@dlt.table
def customer_events():
start_date = spark.conf.get("mypipeline.startDate")
return read("sourceTable").where(col("date") > start_date)

{
"name": "Data Ingest - DEV",
"configuration": {
"mypipeline.startDate": "2021-01-02"
}
}

{
"name": "Data Ingest - PROD",
"configuration": {
"mypipeline.startDate": "2010-01-02"
}
}
Delta Live Tables event log

An event log is created and maintained for every Delta Live Tables pipeline. The event log contains all
information related to the pipeline, including audit logs, data quality checks, pipeline progress, and data lineage.
You can use the event log to track, understand, and monitor the state of your data pipelines.
The event log for each pipeline is stored in a Delta table in DBFS. You can view event log entries in the Delta Live
Tables user interface, the Delta Live Tables API, or by directly querying the Delta table. This article focuses on
querying the Delta table.
The example notebook includes queries discussed in this article and can be used to explore the Delta Live Tables
event log.

Requirements
The examples in this article use JSON SQL functions available in Databricks Runtime 8.1 or higher.

Event log location


The event log is stored in /system/events under the storage location. For example, if you have configured your
pipeline storage setting as /Users/username/data , the event log is stored in the
/Users/username/data/system/events path in DBFS.

If you have not configured the storage setting, the default event log location is /pipelines/<pipeline-id>/system/events in DBFS. For example, if the ID of your pipeline is 91de5e48-35ed-11ec-8d3d-0242ac130003, the event log is stored in /pipelines/91de5e48-35ed-11ec-8d3d-0242ac130003/system/events.
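For example, a short sketch that derives the path from a storage setting and loads the log; the storage value is a placeholder:

# storage_location is a placeholder; use your pipeline's storage setting, or the
# dbfs:/pipelines/<pipeline-id> default if storage is not configured.
storage_location = "/Users/username/data"
event_log_path = storage_location + "/system/events"

event_log = spark.read.format("delta").load(event_log_path)
display(event_log.limit(10))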

Event log schema


The following table describes the event log schema. Some of these fields contain JSON documents that require
parsing to perform some queries. For example, analyzing data quality metrics requires parsing fields in the
details JSON document. The examples in this article demonstrate using Python functions to perform the
required parsing.

id: A unique identifier for the pipeline.

sequence: A JSON document containing metadata to identify and order events.

origin: A JSON document containing metadata for the origin of the event, for example, cloud provider, region, user_id, or pipeline_id.

timestamp: The time the event was recorded.

message: A human-readable message describing the event.

level: The event type, for example, INFO, WARN, ERROR, or METRICS.

error: If an error occurred, details describing the error.

details: A JSON document containing structured details of the event. This is the primary field used for analyzing events.

event_type: The event type.

Event log queries


Audit logging
Lineage
Data quality
Cluster performance metrics
Databricks Enhanced Autoscaling events
Runtime information
You can create a temporary view to simplify querying the event log. The following example creates a view called event_log_raw. This view is used in the following examples that query event log records:

event_log = spark.read.format('delta').load(event_log_path)
event_log.createOrReplaceTempView("event_log_raw")

Replace event_log_path with the event log location.


Each instance of a pipeline run is called an update. Some of the following queries extract information for the
most recent update. Run the following commands to find the identifier for the most recent update and save it in
the latest_update_id variable:

latest_update_id = spark.sql("SELECT origin.update_id FROM event_log_raw WHERE event_type = 'create_update' ORDER BY timestamp DESC LIMIT 1").collect()[0].update_id
spark.conf.set('latest_update.id', latest_update_id)

Audit logging
You can use the event log to audit events, for example, user actions. Events containing information about user
actions have the event type user_action . Information about the action is stored in the user_action object in the
details field. Use the following query to construct an audit log of user events:

SELECT timestamp, details:user_action:action, details:user_action:user_name FROM event_log_raw WHERE event_type = 'user_action'

TIMESTAMP                      ACTION   USER_NAME
2021-05-20T19:36:03.517+0000   START    user@company.com
2021-05-20T19:35:59.913+0000   CREATE   user@company.com
2021-05-27T00:35:51.971+0000   START    user@company.com

Lineage
You can see a visual representation of your pipeline graph in the Delta Live Tables user interface. You can also
programmatically extract this information to perform tasks such as generating reports for compliance or tracking
data dependencies across an organization. Events containing information about lineage have the event type
flow_definition . The lineage information is stored in the flow_definition object in the details field. The
fields in the flow_definition object contain the necessary information to infer the relationships between
datasets:

SELECT details:flow_definition.output_dataset, details:flow_definition.input_datasets FROM event_log_raw
WHERE event_type = 'flow_definition' AND origin.update_id = '${latest_update.id}'

OUTPUT_DATASET         INPUT_DATASETS
customers              null
sales_orders_raw       null
sales_orders_cleaned   ["customers", "sales_orders_raw"]
sales_order_in_la      ["sales_orders_cleaned"]

Data quality
The event log captures data quality metrics based on the expectations defined in your pipelines. Events
containing information about data quality have the event type flow_progress . When an expectation is defined
on a dataset, the data quality metrics are stored in the details field in the
flow_progress.data_quality.expectations object. The following example queries the data quality metrics for the
last pipeline update:
SELECT
row_expectations.dataset as dataset,
row_expectations.name as expectation,
SUM(row_expectations.passed_records) as passing_records,
SUM(row_expectations.failed_records) as failing_records
FROM
(
SELECT
explode(
from_json(
details :flow_progress :data_quality :expectations,
"array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
)
) row_expectations
FROM
event_log_raw
WHERE
event_type = 'flow_progress'
AND origin.update_id = '${latest_update.id}'
)
GROUP BY
row_expectations.dataset,
row_expectations.name

DATASET                EXPECTATION          PASSING_RECORDS   FAILING_RECORDS
sales_orders_cleaned   valid_order_number   4083              0

Cluster performance metrics


You can use the event log to query cluster performance metrics, for example, task slot utilization. Events
containing information about cluster performance metrics have the event types cluster_utilization or
flow_progress . Information about cluster performance metrics is stored in the cluster_utilization and
flow_progress.metrics.backlog_bytes objects in the details field. The following example queries cluster
performance metrics for the last pipeline update:

SELECT
timestamp,
Double(details :cluster_utilization.num_executors) as current_num_executors,
Double(details :cluster_utilization.avg_num_task_slots) as avg_num_task_slots,
Double(
details :cluster_utilization.avg_task_slot_utilization
) as avg_task_slot_utilization,
Double(
details :cluster_utilization.avg_num_queued_tasks
) as queue_size,
Double(details :flow_progress.metrics.backlog_bytes) as backlog
FROM
event_log_raw
WHERE
event_type IN ('cluster_utilization', 'flow_progress')
AND origin.update_id = '${latest_update.id}'

NOTE
The backlog metrics may not be available depending on the pipeline’s data source type and Databricks Runtime version.

Databricks Enhanced Autoscaling events


The event log captures cluster resizes when Enhanced Autoscaling is enabled in your pipelines. Events
containing information about Enhanced Autoscaling have the event type autoscale . The cluster resizing request
information is stored in the autoscale object. The following example queries the Enhanced Autoscaling cluster
resize requests for the last pipeline update:

SELECT
timestamp,
Double(
case
when details :autoscale.status = 'REQUESTED' then details :autoscale.desired_num_workers
else null
end
) as requested_workers,
Double(
case
when details :autoscale.status = 'ACCEPTED' then details :autoscale.desired_num_workers
else null
end
) as accepted_workers,
Double(
case
when details :autoscale.status = 'SUCCEEDED' then details :autoscale.desired_num_workers
else null
end
) as succeeded_workers,
Double(
case
when details :autoscale.status = 'REJECTED' then details :autoscale.desired_num_workers
else null
end
) as rejected_workers
FROM
event_log_raw
WHERE
event_type = 'autoscale'
AND origin.update_id = '${latest_update.id}'

Runtime information
You can view runtime information for a pipeline update, for example, the Databricks Runtime version for the
update:

SELECT details:create_update:runtime_version:dbr_version FROM event_log_raw WHERE event_type = 'create_update'

DBR_VERSION
11.0

Example notebook
Querying the Delta Live Tables event log
Get notebook
Run a Delta Live Tables pipeline in a workflow

You can run a Delta Live Tables pipeline as part of a data processing workflow with Databricks jobs, Apache
Airflow, or Azure Data Factory.

Jobs
You can orchestrate multiple tasks in a Databricks job to implement a data processing workflow. To include a
Delta Live Tables pipeline in a job, use the Pipeline task when you create a job.

Apache Airflow
Apache Airflow is an open source solution for managing and scheduling data workflows. Airflow represents
workflows as directed acyclic graphs (DAGs) of operations. You define a workflow in a Python file and Airflow
manages the scheduling and execution. For information on installing and using Airflow with Azure Databricks,
see Apache Airflow.
To run a Delta Live Tables pipeline as part of an Airflow workflow, use the DatabricksSubmitRunOperator.
Requirements
The following are required to use the Airflow support for Delta Live Tables:
Airflow version 2.1.0 or later.
The Databricks provider package version 2.1.0 or later.
Example
The following example creates an Airflow DAG that triggers an update for the Delta Live Tables pipeline with the
identifier 8279d543-063c-4d63-9926-dae38e35ce8b :

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago

default_args = {
'owner': 'airflow'
}

with DAG('dlt',
start_date=days_ago(2),
schedule_interval="@once",
default_args=default_args
) as dag:

opr_run_now=DatabricksSubmitRunOperator(
task_id='run_now',
databricks_conn_id='CONNECTION_ID',
pipeline_task={"pipeline_id": "8279d543-063c-4d63-9926-dae38e35ce8b"}
)

Replace CONNECTION_ID with the identifier for an Airflow connection to your workspace.
Save this example in the airflow/dags directory and use the Airflow UI to view and trigger the DAG. Use the
Delta Live Tables UI to view the details of the pipeline update.
Azure Data Factory
Azure Data Factory is a cloud-based ETL service that lets you orchestrate data integration and transformation
workflows. Azure Data Factory directly supports running Azure Databricks tasks in a workflow, including
notebooks, JAR tasks, and Python scripts. You can also include a pipeline in a workflow by calling the Delta Live
Tables API from an Azure Data Factory Web activity. For example, to trigger a pipeline update from Azure Data
Factory:
1. Create a data factory or open an existing data factory.
2. When creation completes, open the page for your data factory and click the Open Azure Data Factory Studio tile. The Azure Data Factory user interface appears.
3. Create an Azure Databricks linked service.
4. Create a new Azure Data Factory pipeline by selecting Pipeline from the New dropdown menu in the
Azure Data Factory Studio user interface.
5. In the Activities toolbox, expand General and drag the Web activity to the pipeline canvas. Click the
Settings tab and enter the following values:

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access
tokens, or both for authentication. As a security best practice, when authenticating with automated tools, systems,
scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of
workspace users. For more information, see Service principals for Azure Databricks automation.

URL : https://<databricks-instance>/api/2.0/pipelines/<pipeline-id>/updates .
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

Replace <pipeline-id> with the pipeline identifier.


Method : Select POST from the dropdown.
Headers : Click + New . In the Name text box, enter Authorization . In the Value text box, enter
Bearer <personal-access-token> .

Replace <personal-access-token> with an Azure Databricks personal access token.


Body : To pass additional request parameters, enter a JSON document containing the parameters.
For example, to start an update and reprocess all data for the pipeline: {"full_refresh": "true"} .
If there are no additional request parameters, enter empty braces ( {} ).
To test the Web activity, click Debug on the pipeline toolbar in the Data Factory UI. The output and status of the
run, including errors, are displayed in the Output tab of the Azure Data Factory pipeline. Use the Delta Live
Tables UI to view the details of the pipeline update.
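For reference, the same update request can be issued directly against the Delta Live Tables API outside of Data Factory. The following Python sketch uses the requests library; the workspace instance, pipeline ID, and token values are placeholders:

import requests

# Placeholders: substitute your workspace instance, pipeline ID, and token.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
pipeline_id = "<pipeline-id>"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={},  # add request parameters here if needed, for example a full refresh
)
response.raise_for_status()
print(response.json())  # the response identifies the update that was started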
TIP
A common workflow requirement is to start a task after completion of a previous task. Because the Delta Live Tables
updates request is asynchronous—the request returns after starting the update but before the update completes—
tasks in your Azure Data Factory pipeline with a dependency on the Delta Live Tables update must wait for the update to
complete. An option to wait for update completion is adding an Until activity following the Web activity that triggers the
Delta Live Tables update. In the Until activity:
1. Add a Wait activity to wait a configured number of seconds for update completion.
2. Add a Web activity following the Wait activity that uses the Delta Live Tables Get update details request to get the
status of the update. The state field in the response returns the current state of the update, including if it has
completed.
3. Use the value of the state field to set the terminating condition for the Until activity. You can also use a Set Variable activity to add a pipeline variable based on the state value and use this variable for the terminating condition. A Python sketch of the same polling logic appears after this tip.
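The wait-and-check loop that the Until activity implements can also be sketched in Python. The Get update details endpoint is the one referenced in the tip; the response parsing and placeholder values below are assumptions, so adjust them to match your workspace:

import time
import requests

# Placeholder values; update_id comes from the response to the update request.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
pipeline_id = "<pipeline-id>"
token = "<personal-access-token>"
update_id = "<update-id>"

terminal_states = {"COMPLETED", "FAILED", "CANCELED"}
while True:
    resp = requests.get(
        f"{host}/api/2.0/pipelines/{pipeline_id}/updates/{update_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    # The update's state field reports its progress; adjust the parsing if your
    # workspace returns a different response shape.
    state = resp.json().get("update", {}).get("state")
    print(f"Update state: {state}")
    if state in terminal_states:
        break
    time.sleep(60)  # wait before polling again, mirroring the Wait activity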
Delta Live Tables cookbook

This article contains a collection of recommendations and solutions to implement common tasks in your Delta
Live Tables pipelines.
Make expectations portable and reusable
Use Python UDFs in SQL
Use MLflow models in a Delta Live Tables pipeline
Create sample datasets for development and testing
Programmatically manage and create multiple live tables
Quarantine invalid data
Validate row counts across tables
Retain manual deletes or updates
Exclude tables from publishing
Use secrets in a pipeline
Define limits on pipeline clusters

Make expectations portable and reusable


Scenario
You want to apply a common set of data quality rules to multiple tables, or the team members that develop and
maintain data quality rules are separate from the pipeline developers.
Solution
Maintain data quality rules separately from your pipeline implementations. Store the rules in a format that is
reliable and easy to access and update, for example, a text file stored in DBFS or cloud storage or a Delta table.
The following example uses a CSV file named rules.csv stored in DBFS to maintain rules. Each rule in
rules.csv is categorized by a tag. You use this tag in dataset definitions to determine which rules to apply:

name, constraint, tag
website_not_null,"Website IS NOT NULL",validity
location_not_null,"Location IS NOT NULL",validity
state_not_null,"State IS NOT NULL",validity
fresh_data,"to_date(updateTime,'M/d/yyyy h:m:s a') > '2010-01-01'",maintained
social_media_access,"NOT(Facebook IS NULL AND Twitter IS NULL AND Youtube IS NULL)",maintained

The following Python example defines data quality expectations based on the rules stored in the rules.csv file.
The get_rules() function reads the rules from rules.csv and returns a Python dictionary containing rules
matching the tag argument passed to the function. The dictionary is applied in the @dlt.expect_all_*()
decorators to enforce data quality constraints. For example, any records failing the rules tagged with validity
will be dropped from the raw_farmers_market table:
import dlt
from pyspark.sql.functions import expr, col

def get_rules(tag):
"""
loads data quality rules from csv file
:param tag: tag to match
:return: dictionary of rules that matched the tag
"""
rules = {}
df = spark.read.format("csv").option("header", "true").load("/path/to/rules.csv")
for row in df.filter(col("tag") == tag).collect():
rules[row['name']] = row['constraint']
return rules

@dlt.table(
name="raw_farmers_market"
)
@dlt.expect_all_or_drop(get_rules('validity'))
def get_farmers_market_data():
return (
spark.read.format('csv').option("header", "true")
.load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')
)

@dlt.table(
name="organic_farmers_market"
)
@dlt.expect_all_or_drop(get_rules('maintained'))
def get_organic_farmers_market():
return (
dlt.read("raw_farmers_market")
.filter(expr("Organic = 'Y'"))
.select("MarketName", "Website", "State",
"Facebook", "Twitter", "Youtube", "Organic",
"updateTime"
)
)
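Because rules can also be stored in a Delta table, a variant of get_rules() might read them from a hypothetical table named data_quality_rules that has the same name, constraint, and tag columns:

from pyspark.sql.functions import col

def get_rules_from_table(tag):
    """
    Loads data quality rules from a Delta table instead of a CSV file.
    :param tag: tag to match
    :return: dictionary of rules that matched the tag
    """
    # data_quality_rules is a hypothetical table with name, constraint, and tag columns.
    df = spark.read.table("data_quality_rules")
    return {
        row["name"]: row["constraint"]
        for row in df.filter(col("tag") == tag).collect()
    }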

Use Python UDFs in SQL


Scenario
You want the simplicity of SQL to define Delta Live Tables datasets but need transformations not directly
supported in SQL.
Solution
Use a Python user-defined function (UDF) in your SQL queries. The following example defines the square() UDF to return the square of the input argument, registers it for Spark SQL as makeItSquared, and then calls makeItSquared in a SQL expression.

1. Define and register the UDF.


Create a notebook with Default Language set to Python and add the following in a cell:
def square(i: int) -> int:
"""
Simple udf for squaring the parameter passed.
:param i: column from Pyspark or SQL
:return: squared value of the passed param.
"""
return i * i

spark.udf.register("makeItSquared", square) # register the square udf for Spark SQL

2. Call the UDF.


Create a SQL notebook and add the following query in a cell:

CREATE OR REFRESH LIVE TABLE raw_squared
AS SELECT makeItSquared(2) AS numSquared;

3. Create a pipeline.
Create a new Delta Live Tables pipeline, adding the notebooks you created to Notebook Libraries. Use the Add notebook library button to add additional notebooks in the Create Pipeline dialog, or use the libraries field in the Delta Live Tables settings to configure the notebooks.

Use MLflow models in a Delta Live Tables pipeline


Scenario
You want to use an MLflow-trained model in a pipeline.
Solution
To use an MLflow model in a Delta Live Tables pipeline:
1. Obtain the run ID and model name of the MLflow model. The run ID and model name are used to construct the URI of the MLflow model.
2. Use the URI to define a Spark UDF to load the MLflow model.
3. Call the UDF in your table definitions to use the MLflow model.
The following example defines a Spark UDF named loaded_model that loads an MLflow model trained on loan risk data. The loaded_model UDF is then used to define the gtb_scoring_train_data and gtb_scoring_valid_data tables:
%pip install mlflow

import dlt
import mlflow
from pyspark.sql.functions import struct

run_id= "mlflow_run_id"
model_name = "the_model_name_in_run"
model_uri = "runs:/{run_id}/{model_name}".format(run_id=run_id, model_name=model_name)
loaded_model = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

categoricals = ["term", "home_ownership", "purpose",


"addr_state","verification_status","application_type"]

numerics = ["loan_amnt", "emp_length", "annual_inc", "dti", "delinq_2yrs",


"revol_util", "total_acc", "credit_length_in_years"]

features = categoricals + numerics

@dlt.table(
comment="GBT ML scored training dataset based on Loan Risk",
table_properties={
"quality": "gold"
}
)
def gtb_scoring_train_data():
    return (
        dlt.read("train_data")
        .withColumn('predictions', loaded_model(struct(features)))
    )

@dlt.table(
comment="GBT ML scored valid dataset based on Loan Risk",
table_properties={
"quality": "gold"
}
)
def gtb_scoring_valid_data():
    return (
        dlt.read("valid_data")
        .withColumn('predictions', loaded_model(struct(features)))
    )

Create sample datasets for development and testing


Scenario
You want to create a sample dataset for development or testing, for example, a dataset containing a subset of
data or specific record types.
Solution
Implement your transformation logic in a single or shared set of notebooks. Then create separate notebooks to
define multiple datasets based on environment. For example, in production, create a notebook that defines the
complete set of data for your pipeline:

CREATE OR REFRESH STREAMING LIVE TABLE input_data AS SELECT * FROM cloud_files("/production/data", "json")

Then create notebooks that define a sample of data based on requirements. For example, to generate a small
dataset with specific records for testing:

CREATE OR REFRESH LIVE TABLE input_data AS
SELECT "2021/09/04" AS date, 22.4 as sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 as sensor_reading
You can also filter data to create a subset of the production data for development or testing:

CREATE OR REFRESH LIVE TABLE input_data AS SELECT * FROM prod.input_data WHERE date > current_date() - INTERVAL 1 DAY

To use these different datasets, create multiple pipelines with the notebooks implementing the transformation
logic. Each pipeline can read data from the LIVE.input_data dataset but is configured to include the notebook
that creates the dataset specific to the environment.

Programmatically manage and create multiple live tables


Scenario
You have pipelines containing multiple flows or dataset definitions that differ only by a small number of
parameters. This redundancy results in pipelines that are error-prone and difficult to maintain. For example, the
following diagram shows the graph of a pipeline that uses a fire department dataset to find neighborhoods with
the fastest response times for different categories of emergency calls. In this example, the parallel flows differ by
only a few parameters.

You can use a metaprogramming pattern to reduce the overhead of generating and maintaining redundant flow
definitions. Metaprogramming in Delta Live Tables is done using Python inner functions. Because these
functions are lazily evaluated, you can use them to create flows that are identical except for input parameters.
Each invocation can include a different set of parameters that controls how each table should be generated, as
shown in the following example:

import dlt
import functools
from pyspark.sql.functions import *

@dlt.table(
name="raw_fire_department",
comment="raw table for fire department response"
)
@dlt.expect_or_drop("valid_received", "received IS NOT NULL")
@dlt.expect_or_drop("valid_response", "responded IS NOT NULL")
@dlt.expect_or_drop("valid_neighborhood", "neighborhood != 'None'")
def get_raw_fire_department():
return (
spark.read.format('csv')
.option('header', 'true')
.option('multiline', 'true')
.load('/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv')
.withColumnRenamed('Call Type', 'call_type')
.withColumnRenamed('Received DtTm', 'received')
.withColumnRenamed('Response DtTm', 'responded')
.withColumnRenamed('Neighborhooods - Analysis Boundaries', 'neighborhood')
.select('call_type', 'received', 'responded', 'neighborhood')
)

all_tables = []

def generate_tables(call_table, response_table, filter):


@dlt.table(
name=call_table,
comment="top level tables by call type"
)
def create_call_table():
return (
spark.sql("""
SELECT
unix_timestamp(received,'M/d/yyyy h:m:s a') as ts_received,
unix_timestamp(responded,'M/d/yyyy h:m:s a') as ts_responded,
neighborhood
FROM LIVE.raw_fire_department
WHERE call_type = '{filter}'
""".format(filter=filter))
)

@dlt.table(
name=response_table,
comment="top 10 neighborhoods with fastest response time "
)
def create_response_table():
return (
spark.sql("""
SELECT
neighborhood,
AVG((ts_responded - ts_received)) as response_time
FROM LIVE.{call_table}
GROUP BY 1
ORDER BY response_time
LIMIT 10
""".format(call_table=call_table))
)

all_tables.append(response_table)

generate_tables("alarms_table", "alarms_response", "Alarms")


generate_tables("fire_table", "fire_response", "Structure Fire")
generate_tables("medical_table", "medical_response", "Medical Incident")

@dlt.table(
name="best_neighborhoods",
comment="which neighbor appears in the best response time list the most"
)
def summary():
target_tables = [dlt.read(t) for t in all_tables]
unioned = functools.reduce(lambda x,y: x.union(y), target_tables)
return (
unioned.groupBy(col("neighborhood"))
.agg(count("*").alias("score"))
.orderBy(desc("score"))
)
Quarantine invalid data
Scenario
You’ve defined expectations to filter out records that violate data quality constraints, but you also want to save
the invalid records for analysis.
Solution
Create rules that are the inverse of the expectations you’ve defined and use those rules to save the invalid
records to a separate table. You can programmatically create these inverse rules. The following example creates
the valid_farmers_market table containing input records that pass the valid_website and valid_location data
quality constraints and also creates the invalid_farmers_market table containing the records that fail those data
quality constraints:

import dlt

rules = {}
quarantine_rules = {}

rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"

# concatenate inverse rules
quarantine_rules["invalid_record"] = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
name="raw_farmers_market"
)
def get_farmers_market_data():
return (
spark.read.format('csv').option("header", "true")
.load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')
)

@dlt.table(
name="valid_farmers_market"
)
@dlt.expect_all_or_drop(rules)
def get_valid_farmers_market():
return (
dlt.read("raw_farmers_market")
.select("MarketName", "Website", "Location", "State",
"Facebook", "Twitter", "Youtube", "Organic", "updateTime")
)

@dlt.table(
name="invalid_farmers_market"
)
@dlt.expect_all_or_drop(quarantine_rules)
def get_invalid_farmers_market():
return (
dlt.read("raw_farmers_market")
.select("MarketName", "Website", "Location", "State",
"Facebook", "Twitter", "Youtube", "Organic", "updateTime")
)

A disadvantage of the above approach is that it generates the quarantine table by processing the data twice. If
you don’t want this performance overhead, you can use the constraints directly within a query to generate a
column indicating the validation status of a record. You can then partition the table by this column for further
optimization.
This approach does not use expectations, so data quality metrics do not appear in the event logs or the pipelines
UI.
import dlt
from pyspark.sql.functions import expr

rules = {}
quarantine_rules = {}

rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"

quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
name="raw_farmers_market"
)
def get_farmers_market_data():
return (
spark.read.format('csv').option("header", "true")
.load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')
)

@dlt.table(
name="partitioned_farmers_market",
partition_cols = [ 'Quarantine' ]
)
def get_partitioned_farmers_market():
return (
dlt.read("raw_farmers_market")
.withColumn("Quarantine", expr(quarantine_rules))
.select("MarketName", "Website", "Location", "State",
"Facebook", "Twitter", "Youtube", "Organic", "updateTime",
"Quarantine")
)

Validate row counts across tables


Scenario
You need to compare row counts between two live tables, perhaps to verify that data was processed successfully
without dropping rows.
Solution
Add an additional table to your pipeline that defines an expectation to perform the comparison. The results of
this expectation appear in the event log and the Delta Live Tables UI. This example validates equal row counts
between the tbla and tblb tables:

CREATE OR REFRESH LIVE TABLE count_verification(
  CONSTRAINT no_rows_dropped EXPECT (a_count == b_count)
) AS SELECT * FROM
  (SELECT COUNT(*) AS a_count FROM LIVE.tbla),
  (SELECT COUNT(*) AS b_count FROM LIVE.tblb)
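If you prefer Python, a rough sketch of the same check could cross join two single-row aggregates so an expectation can compare the counts; tbla and tblb are the tables from the SQL example:

import dlt

@dlt.table
@dlt.expect("no_rows_dropped", "a_count = b_count")
def count_verification():
    # Cross join two single-row aggregates so the expectation can compare them.
    return (
        dlt.read("tbla").selectExpr("count(*) AS a_count")
        .crossJoin(dlt.read("tblb").selectExpr("count(*) AS b_count"))
    )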

Retain manual deletes or updates


Scenario
Your application requires arbitrary deletion or updates of records in a table and recomputation of all
downstream tables. The following diagram illustrates two streaming live tables:
raw_user_table ingests a set of raw user data from a source.
bmi_table incrementally computes BMI scores using weight and height from raw_user_table .
To comply with privacy requirements, you need to delete or update a user record from the raw_user_table and
recompute the bmi_table .

You can manually delete or update the record from raw_user_table and do a refresh operation to recompute
the downstream tables. However, you need to make sure the deleted record isn’t reloaded from the source data.
Solution
Use the pipelines.reset.allowed table property to disable full refresh for raw_user_table so that intended
changes are retained over time:

CREATE OR REFRESH STREAMING LIVE TABLE raw_user_table
TBLPROPERTIES(pipelines.reset.allowed = false)
AS SELECT * FROM cloud_files("/databricks-datasets/iot-stream/data-user", "csv");

CREATE OR REFRESH STREAMING LIVE TABLE bmi_table
AS SELECT userid, (weight/2.2) / pow(height*0.0254,2) AS bmi FROM STREAM(LIVE.raw_user_table);

Setting pipelines.reset.allowed to false prevents refreshes to raw_user_table , but does not prevent
incremental writes to the tables or prevent new data from flowing into the table.

Exclude tables from publishing


Scenario
You’ve configured the target setting to publish your tables, but there are some tables you don’t want
published.
Solution
Define a temporary table to instruct Delta Live Tables not to persist metadata for the table:
CREATE TEMPORARY LIVE TABLE customers_raw
AS SELECT * FROM json.`/data/customers/json/`

@dlt.table(
comment="Raw customer data",
temporary=True)
def customers_raw():
return ("...")

Use secrets in a pipeline


Scenario
You need to authenticate to a data source from your pipeline, for example, cloud data storage or a database, and
don’t want to include credentials in your notebook or configuration.
Solution
Use Azure Databricks secrets to store credentials such as access keys or passwords. To configure the secret in
your pipeline, use a Spark property in the pipeline settings cluster configuration.
The following example uses a secret to store an access key required to read input data from an Azure Data Lake
Storage Gen2 (ADLS Gen2) storage account using Auto Loader. You can use this same method to configure any
secret required by your pipeline, for example, AWS keys to access S3, or the password to an Apache Hive
metastore.
To learn more about working with Azure Data Lake Storage Gen2, see Accessing Azure Data Lake Storage Gen2
and Blob Storage with Azure Databricks.

NOTE
You must add the spark.hadoop. prefix to the spark_conf configuration key that sets the secret value.
{
"id": "43246596-a63f-11ec-b909-0242ac120002",
"clusters": [
{
"label": "default",
"spark_conf": {
"spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net": "
{{secrets/<scope-name>/<secret-name>}}"
},
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
},
{
"label": "maintenance",
"spark_conf": {
"spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net": "
{{secrets/<scope-name>/<secret-name>}}"
}
}
],
"development": true,
"continuous": false,
"libraries": [
{
"notebook": {
"path": "/Users/user@databricks.com/DLT Notebooks/Delta Live Tables quickstart"
}
}
],
"name": "DLT quickstart using ADLS2"
}

Replace
<storage-account-name> with the ADLS Gen2 storage account name.
<scope-name> with the Azure Databricks secret scope name.
<secret-name> with the name of the key containing the Azure storage account access key.

import dlt

json_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-input-dataset>"
@dlt.create_table(
comment="Data ingested from an ADLS2 storage account."
)
def read_from_ADLS2():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(json_path)
)

Replace
<container-name> with the name of the Azure storage account container that stores the input data.
<storage-account-name> with the ADLS Gen2 storage account name.
<path-to-input-dataset> with the path to the input dataset.

Define limits on pipeline clusters


Scenario
You want to restrict configuration options for clusters that run Delta Live Tables pipelines; for example, you want
to control costs by limiting cluster size or simplify cluster configuration by providing pre-defined cluster
templates.
Solution
Cluster policies allow you to define templates that limit user access to cluster configuration. You can define one
or more cluster policies to use when configuring pipelines.
To create a cluster policy for Delta Live Tables pipelines, define a cluster policy with the cluster_type field set to
dlt . The following example creates a minimal policy for a Delta Live Tables cluster:

{
"cluster_type": {
"type": "fixed",
"value": "dlt"
},
"num_workers": {
"type": "unlimited",
"defaultValue": 3,
"isOptional": true
},
"node_type_id": {
"type": "unlimited",
"isOptional": true
},
"spark_version": {
"type": "unlimited",
"hidden": true
}
}

For more information on creating cluster policies, including example policies, see Create a cluster policy.
To use a cluster policy in a pipeline configuration, you need the policy ID. To find the policy ID:

1. Click Compute in the sidebar.


2. Click the Cluster Policies tab.
3. Click the policy name.
4. Copy the policy ID in the ID field.
The policy ID is also available at the end of the URL when you click on the policy name.
To use the cluster policy with a pipeline:

NOTE
When using cluster policies to configure Delta Live Tables clusters, Databricks recommends applying a single policy to
both the default and maintenance clusters.

1. Click Workflows in the sidebar and click the Delta Live Tables tab. The Pipelines list displays.
2. Click the pipeline name. The Pipeline details page appears.
3. Click the Settings button. The Edit Pipeline Settings dialog appears.
4. Click the JSON button.
5. In the clusters setting, set the policy_id field to the value of the policy ID. The following example
configures the default and maintenance clusters using the cluster policy with the ID C65B864F02000008 :

{
"clusters": [
{
"label": "default",
"policy_id": "C65B864F02000008",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
},
{
"label": "maintenance",
"policy_id": "C65B864F02000008"
}
]
}

6. Click Save .
Delta Live Tables frequently asked questions

Frequently asked questions (FAQ)


Can I query views in a pipeline from another cluster or SQL warehouse?
No, currently views are only available to other tables in the pipeline.
Does Delta Live Tables only support updating of Delta tables?
Yes, Delta Live Tables can only be used to update Delta tables.
Does Delta Live Tables perform maintenance tasks on my tables?
Yes, Delta Live Tables performs maintenance tasks on tables every 24 hours. Maintenance can improve query
performance and reduce cost by removing old versions of tables. By default, the system performs a full
OPTIMIZE operation followed by VACUUM. You can disable OPTIMIZE for a table by setting
pipelines.autoOptimize.managed = false in the table properties for the table.
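For example, a minimal Python sketch that sets this property on a single table; the table and upstream dataset names are placeholders:

import dlt

@dlt.table(
    table_properties={"pipelines.autoOptimize.managed": "false"}
)
def my_table():
    # my_table and some_upstream_table are placeholder names.
    return dlt.read("some_upstream_table")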

Can I query an older snapshot of a table?


By default, Delta tables created by a Delta Live Tables pipeline retain seven days of history, allowing you to query
a snapshot of a table. You can modify the retention period using table properties if you require more than seven
days of history.
Can I have multiple queries in a pipeline writing to the same target table?
No, each table must be defined once. You can use UNION if you need to combine multiple inputs to create a table. Adding or removing UNION from a streaming live table is a breaking operation that requires a full refresh.
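For example, a sketch that combines two upstream streaming tables into one target with a union; the table names are placeholders:

import dlt

@dlt.table
def combined_events():
    # events_a and events_b are placeholder upstream streaming tables with matching schemas.
    return dlt.read_stream("events_a").unionByName(dlt.read_stream("events_b"))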
Understand and manage Delta Live Tables upgrades

Delta Live Tables clusters use a runtime based on Databricks Runtime. Databricks automatically upgrades the
Delta Live Tables runtime to support enhancements and upgrades to the platform. As with any software
upgrade, a Delta Live Tables runtime upgrade may result in errors or issues running your pipelines. This article
describes best practices to test your pipeline with upcoming releases of the Delta Live Tables runtime, and Delta
Live Tables features that enhance the stability of your pipelines.

Delta Live Tables runtime channels


The channel field in the Delta Live Tables pipeline settings controls the Delta Live Tables runtime version that
runs your pipeline. The supported values are:
preview to test your pipeline with the latest runtime version.
current to use the stable runtime version.
By default, your pipelines run using the current runtime version. Databricks recommends using the current
runtime for production workloads. To learn how to use the preview setting to test your pipelines with the latest
runtime version, see Automate testing of your pipelines with the next runtime version.
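The channel is part of the pipeline settings JSON described in Delta Live Tables settings. For example, a staging pipeline created for testing might include the following, while production pipelines keep the default current channel; the name is a placeholder:

{
  "name": "DLT staging pipeline",
  "channel": "preview"
}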

Delta Live Tables upgrade process


Delta Live Tables automatically upgrades the runtime in your Azure Databricks workspaces and monitors the
health of your pipelines after the upgrade. If Delta Live Tables detects that a pipeline cannot start because of an
upgrade, the runtime version for the pipeline reverts to the previous known-good version, and the following
steps are triggered automatically:

NOTE
Delta Live Tables reverts only pipelines running in production mode and with the channel set to current .

The pipeline’s Delta Live Tables runtime is pinned to the previous known-good version.
The Delta Live Tables UI shows a visual indicator that the pipeline is pinned to a previous version because of
an upgrade failure.
Databricks support is notified of the issue. If the issue is related to a regression in the runtime, Databricks will
resolve the issue. If the issue is caused by a custom library or package used by the pipeline, Databricks will
contact you to resolve the issue.
When the issue is resolved, Databricks will initiate the upgrade again.

Best practices
Automate testing of your pipelines with the next runtime version
To ensure changes in the next Delta Live Tables runtime version do not impact your pipelines, use the Delta Live
Tables channels feature:
1. Create a staging pipeline and set the channel to preview .
2. In the Delta Live Tables UI, create a schedule to run the pipeline weekly and enable alerts to receive an email
notification for pipeline failures.
3. If you receive a notification of a failure and are unable to resolve it, open a support ticket with Databricks.
Pipeline dependencies
Delta Live Tables supports external dependencies in your pipelines; for example, you can install any Python
package using the %pip install command. Delta Live Tables also supports using global and cluster-scoped init
scripts. However, these external dependencies, particularly init scripts, increase the risk of issues with runtime
upgrades. To mitigate these risks, minimize using init scripts in your pipelines. If your processing requires init
scripts, automate testing of your pipeline to detect problems early; see Automate testing of your pipelines with
the next runtime version. If you use init scripts, Databricks recommends increasing your testing frequency.
Workflows with jobs

You can use a job to run a data processing or data analysis task in an Azure Databricks cluster with scalable
resources. Your job can consist of a single task or can be a large, multi-task workflow with complex
dependencies. Azure Databricks manages the task orchestration, cluster management, monitoring, and error
reporting for all of your jobs. You can run your jobs immediately or periodically through an easy-to-use
scheduling system. You can implement job tasks using notebooks, JARs, Delta Live Tables pipelines, or Python, Scala, Spark submit, and Java applications.
You create jobs through the Jobs UI, the Jobs API, or the Databricks CLI. The Jobs UI allows you to monitor, test,
and troubleshoot your running and completed jobs.
To get started:
Create your first Azure Databricks jobs workflow with the quickstart.
Learn how to create, view, and run workflows with the Azure Databricks jobs user interface.
Learn about Jobs API updates to support creating and managing workflows with Azure Databricks jobs.
Jobs quickstart

This article demonstrates an Azure Databricks job that orchestrates tasks to read and process a sample dataset.
In this quickstart, you:
1. Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.
2. Save the sample dataset to DBFS.
3. Create a new notebook and add code to read the dataset from DBFS, filter it by year, and display the results.
4. Create a new job and configure two tasks using the notebooks.
5. Run the job and view the results.

Requirements
You must have cluster creation permission to create a job cluster or permissions to an all-purpose cluster.

Create the notebooks


Retrieve and save data
To create a notebook to retrieve the sample dataset and save it to DBFS:

1. Go to your Azure Databricks landing page and select Create Blank Notebook or click Create in
the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, Retrieve baby names . Select
Python from the Default Language dropdown menu. You can leave Cluster set to the default value.
You configure the cluster when you create a task using this notebook.
3. Click Create .
4. Copy the following Python code and paste it into the first cell of the notebook.

import requests

response = requests.get('http://health.data.ny.gov/api/views/myeu-hzra/rows.csv')
csvfile = response.content.decode('utf-8')
dbutils.fs.put("dbfs:/FileStore/babynames.csv", csvfile, True)

Read and display filtered data


To create a notebook to read and present the data for filtering:

1. Go to your Azure Databricks landing page and select Create Blank Notebook or click Create in
the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, Filter baby names . Select
Python from the Default Language dropdown menu. You can leave Cluster set to the default value.
You configure the cluster when you create a task using this notebook.
3. Click Create .
4. Copy the following Python code and paste it into the first cell of the notebook.
babynames = spark.read.format("csv").option("header", "true").option("inferSchema",
"true").load("dbfs:/FileStore/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row :
row[0]).collect()
years.sort()
dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))

Create a job
1. Click Workflows in the sidebar.

2. Click .
The Tasks tab displays with the create task dialog.

3. Replace Add a name for your job… with your job name.
4. In the Task name field, enter a name for the task; for example, retrieve-baby-names .
5. In the Type drop-down, select Notebook .
6. Use the file browser to find the first notebook you created, click the notebook name, and click Confirm .
7. Click Create task .

8. Click below the task you just created to add another task.
9. In the Task name field, enter a name for the task; for example, filter-baby-names .
10. In the Type drop-down, select Notebook .
11. Use the file browser to find the second notebook you created, click the notebook name, and click
Confirm .
12. Click Add under Parameters . In the Key field, enter year . In the Value field, enter 2014 .
13. Click Create task .

Run the job


To run the job immediately, click in the upper right corner. You can also run the job by clicking the
Runs tab and clicking Run Now in the Active Runs table.

View run details


1. Click the Runs tab and click the link for the run in the Active Runs table or in the Completed Runs
(past 60 days) table.
2. Click on either task to see the output and details. For example, click the filter-baby-names task to view
the status and output for the filter task:

Run with different parameters


To re-run the job and filter baby names for a different year:

1. Click next to Run Now and select Run Now with Different Parameters or click Run Now with
Different Parameters in the Active Runs table.
2. In the Value field, enter 2015 .
3. Click Run .
Jobs

A job is a way to run non-interactive code in an Azure Databricks cluster. For example, you can run an extract,
transform, and load (ETL) workload interactively or on a schedule. You can also run jobs interactively in the
notebook UI.
You can create and run a job using the UI, the CLI, or by invoking the Jobs API. You can repair and re-run a failed
or canceled job using the UI or API. You can monitor job run results using the UI, CLI, API, and email notifications.
This article focuses on performing job tasks using the UI. For the other methods, see Jobs CLI and Jobs API 2.1.
Your job can consist of a single task or can be a large, multi-task workflow with complex dependencies. Azure
Databricks manages the task orchestration, cluster management, monitoring, and error reporting for all of your
jobs. You can run your jobs immediately or periodically through an easy-to-use scheduling system.
You can implement a task in a JAR, an Azure Databricks notebook, a Delta Live Tables pipeline, or an application
written in Scala, Java, or Python. Legacy Spark Submit applications are also supported. You control the execution
order of tasks by specifying dependencies between the tasks. You can configure tasks to run in sequence or
parallel. The following diagram illustrates a workflow that:
1. Ingests raw clickstream data and performs processing to sessionize the records.
2. Ingests order data and joins it with the sessionized clickstream data to create a prepared data set for
analysis.
3. Extracts features from the prepared data.
4. Performs tasks in parallel to persist the features and train a machine learning model.
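As a rough sketch of what such a workflow could look like when created through the Jobs API 2.1, the depends_on field expresses the execution order between tasks. The workspace URL, token, notebook paths, and cluster values below are placeholders; see Jobs API 2.1 for the full request schema:

import requests

# Placeholder values; substitute your workspace URL, token, and notebook paths.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

job_spec = {
    "name": "clickstream-example",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "Standard_D3_v2",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "sessionize_clickstream",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Repos/jobs/sessionize"},
        },
        {
            "task_key": "join_orders",
            # depends_on makes this task wait for sessionize_clickstream to finish.
            "depends_on": [{"task_key": "sessionize_clickstream"}],
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Repos/jobs/join_orders"},
        },
    ],
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print(response.json())  # returns the job_id of the new job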

To create your first workflow with an Azure Databricks job, see the quickstart.
IMPORTANT
You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you
request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This
limit also affects jobs created by the REST API and notebook workflows.

Create a job
1. Do one of the following:

Click Workflows in the sidebar and click .


In the sidebar, click Create and select Job from the menu.
The Tasks tab appears with the create task dialog.

2. Replace Add a name for your job… with your job name.
3. Enter a name for the task in the Task name field.
4. Specify the type of task to run. In the Type drop-down, select Notebook , JAR , Spark Submit , Python ,
Pipeline , or Python Wheel .
Notebook : In the Source drop-down, select a location for the notebook; either Workspace for a
notebook located in an Azure Databricks workspace folder or Git provider for a notebook located
in a remote Git repository.
Workspace : Use the file browser to find the notebook, click the notebook name, and click
Confirm .
Git provider : Click Edit and enter the Git repository information. See Run jobs using notebooks
in a remote Git repository.
JAR : Specify the Main class . Use the fully qualified name of the class containing the main
method, for example, org.apache.spark.examples.SparkPi . Then click Add under Dependent
Libraries to add libraries required to run the task. One of these libraries must contain the main
class.
To learn more about JAR tasks, see JAR jobs.
Spark Submit : In the Parameters text box, specify the main class, the path to the library JAR, and
all arguments, formatted as a JSON array of strings. The following example configures a spark-
submit task to run the DFSReadWriteTest from the Apache Spark examples:

["--
class","org.apache.spark.examples.DFSReadWriteTest","dbfs:/FileStore/libraries/spark_examples_
2_12_3_1_1.jar","/dbfs/databricks-datasets/README.md","/FileStore/examples/output/"]

IMPORTANT
There are several limitations for spark-submit tasks:
You can run spark-submit tasks only on new clusters.
Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster
autoscaling.
Spark-submit does not support Databricks Utilities. To use Databricks Utilities, use JAR tasks instead.

Python : In the Path textbox, enter the URI of a Python script on DBFS or cloud storage; for
example, dbfs:/FileStore/myscript.py .
Pipeline : In the Pipeline drop-down, select an existing Delta Live Tables pipeline.
Python Wheel : In the Package name text box, enter the package to import, for example,
myWheel-1.0-py2.py3-none-any.whl . In the Entry Point text box, enter the function to call when
starting the wheel. Click Add under Dependent Libraries to add libraries required to run the
task.
5. Configure the cluster where the task runs. In the Cluster drop-down, select either New Job Cluster or
Existing All-Purpose Clusters .
New Job Cluster : Click Edit in the Cluster drop-down and complete the cluster configuration.
Existing All-Purpose Cluster : Select an existing cluster in the Cluster drop-down. To open the
cluster in a new page, click the icon to the right of the cluster name and description.
To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
6. You can pass parameters for your task. Each task type has different requirements for formatting and
passing the parameters.
Notebook : Click Add and specify the key and value of each parameter to pass to the task. You can
override or add additional parameters when you manually run a task using the Run a job with
different parameters option. Parameters set the value of the notebook widget specified by the key of
the parameter. Use task parameter variables to pass a limited set of dynamic values as part of a
parameter value.
JAR : Use a JSON-formatted array of strings to specify parameters. These strings are passed as
arguments to the main method of the main class. See Configure JAR job parameters.
Spark Submit task: Parameters are specified as a JSON-formatted array of strings. Conforming to
the Apache Spark spark-submit convention, parameters after the JAR path are passed to the main
method of the main class.
Python : Use a JSON-formatted array of strings to specify parameters. These strings are passed as
arguments which can be parsed using the argparse module in Python.
Python Wheel : In the Parameters drop-down, select Positional arguments to enter parameters as
a JSON-formatted array of strings, or select Keyword arguments > Add to enter the key and value
of each parameter. Both positional and keyword arguments are passed to the Python wheel task as
command-line arguments.
7. To access additional options, including Dependent Libraries, Retry Policy, and Timeouts , click
Advanced Options . See Edit a task.
8. Click Create .
9. To optionally set the job’s schedule, click Edit schedule in the Job details panel. See Schedule a job.
10. To optionally allow multiple concurrent runs of the same job, click Edit concurrent runs in the Job
details panel. See Maximum concurrent runs.
11. To optionally specify email addresses to receive notifications on job events, click Edit notifications in
the Job details panel. See Notifications.
12. To optionally control permission levels on the job, click Edit permissions in the Job details panel. See
Control access to jobs.

To add another task, click below the task you just created. A shared cluster option is provided if you have
configured a New Job Cluster for a previous task. You can also configure a cluster for each task when you
create or edit a task. To learn more about selecting and configuring clusters to run tasks, see Cluster
configuration tips.

Run a job
1. Click Workflows in the sidebar.
2. Select a job and click the Runs tab. You can run a job immediately or schedule the job to run later.
If one or more tasks in a job with multiple tasks are not successful, you can re-run the subset of unsuccessful
tasks. See Repair an unsuccessful job run.
Run a job immediately
To run the job immediately, click .

TIP
You can perform a test run of a job with a notebook task by clicking Run Now . If you need to make changes to the
notebook, clicking Run Now again after editing the notebook will automatically run the new version of the notebook.

Run a job with different parameters


You can use Run Now with Different Parameters to re-run a job with different parameters or different
values for existing parameters.

1. Click next to Run Now and select Run Now with Different Parameters or, in the Active Runs table,
click Run Now with Different Parameters . Enter the new parameters depending on the type of task.
Notebook : You can enter parameters as key-value pairs or a JSON object. The provided parameters
are merged with the default parameters for the triggered run. You can use this dialog to set the values
of widgets.
JAR and spark-submit : You can enter a list of parameters or a JSON document. If you delete keys,
the default parameters are used. You can also add task parameter variables for the run.
2. Click Run .
Repair an unsuccessful job run
You can repair failed or canceled multi-task jobs by running only the subset of unsuccessful tasks and any
dependent tasks. Because successful tasks and any tasks that depend on them are not re-run, this feature
reduces the time and resources required to recover from unsuccessful job runs.
You can change job or task settings before repairing the job run. Unsuccessful tasks are re-run with the current
job and task settings. For example, if you change the path to a notebook or a cluster setting, the task is re-run
with the updated notebook or cluster settings.
You can view the history of all task runs on the Task run details page.

NOTE
If one or more tasks share a job cluster, a repair run creates a new job cluster; for example, if the original run used the
job cluster my_job_cluster , the first repair run uses the new job cluster my_job_cluster_v1 , allowing you to easily
see the cluster and cluster settings used by the initial run and any repair runs. The settings for my_job_cluster_v1
are the same as the current settings for my_job_cluster .
Repair is supported only with jobs that orchestrate two or more tasks.
The Duration value displayed in the Runs tab spans the time from when the first run started until the latest
repair run finished. For example, if a run failed twice and succeeded on the third run, the duration includes the
time for all three runs.

To repair an unsuccessful job run:

1. Click Jobs in the sidebar.


2. In the Name column, click a job name. The Runs tab shows active runs and completed runs, including any
unsuccessful runs.
3. Click the link for the unsuccessful run in the Start time column of the Completed Runs (past 60 days)
table. The Job run details page appears.
4. Click Repair run . The Repair job run dialog appears, listing all unsuccessful tasks and any dependent tasks
that will be re-run.
5. To add or edit parameters for the tasks to repair, enter the parameters in the Repair job run dialog.
Parameters you enter in the Repair job run dialog override existing values. On subsequent repair runs, you
can return a parameter to its original value by clearing the key and value in the Repair job run dialog.
6. Click Repair run in the Repair job run dialog.
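The same repair can be requested through the Jobs API; the following is a minimal sketch, assuming a hypothetical workspace URL, a token in the DATABRICKS_TOKEN environment variable, a hypothetical run ID, and a hypothetical task key, with the requests library installed:

import os
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
RUN_ID = 759600                                                       # hypothetical unsuccessful run

# Re-run only the named unsuccessful task (and any tasks that depend on it).
response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"run_id": RUN_ID, "rerun_tasks": ["filter-baby-names"]},
)
response.raise_for_status()
print(response.json())  # Includes an identifier for this repair run

Parameter overrides entered in the Repair job run dialog have API equivalents as well; see the Jobs API reference for the supported fields.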
View task run history
To view the run history of a task, including successful and unsuccessful runs:
1. Click on a task on the Job run details page. The Task run details page appears.
2. Select the task run in the run history drop-down.
Schedule a job
To define a schedule for the job:
1. Click Edit schedule in the Job details panel and set the Schedule Type to Scheduled .
2. Specify the period, starting time, and time zone. Optionally select the Show Cron Syntax checkbox to
display and edit the schedule in Quartz Cron Syntax.
NOTE
Azure Databricks enforces a minimum interval of 10 seconds between subsequent runs triggered by the
schedule of a job regardless of the seconds configuration in the cron expression.
You can choose a time zone that observes daylight saving time or UTC. If you select a zone that observes
daylight saving time, an hourly job will be skipped or may appear to not fire for an hour or two when daylight
saving time begins or ends. To run at every hour (absolute time), choose UTC.
The job scheduler is not intended for low latency jobs. Due to network or cloud issues, job runs may
occasionally be delayed up to several minutes. In these situations, scheduled jobs will run immediately upon
service availability.

3. Click Save .
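A schedule can also be set programmatically. A minimal sketch using the Jobs API update request follows; the workspace URL, token environment variable, and job ID are hypothetical, and the requests library is assumed:

import os
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
JOB_ID = 123                                                          # hypothetical job ID

# Schedule the job to run every day at midnight UTC, using Quartz cron syntax.
response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "job_id": JOB_ID,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 0 * * ?",
                "timezone_id": "UTC",
                "pause_status": "UNPAUSED",
            }
        },
    },
)
response.raise_for_status()
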
Pause and resume a job schedule
To pause a job, you can either:
Click Pause in the Job details panel.
Click Edit schedule in the Job details panel and set the Schedule Type to Manual (Paused)
To resume a paused job schedule, set the Schedule Type to Scheduled .

View jobs
Click Workflows in the sidebar. The Jobs list appears. The Jobs page lists all defined jobs, the cluster
definition, the schedule, if any, and the result of the last run.

NOTE
If you have the increased jobs limit enabled for this workspace, only 25 jobs are displayed in the Jobs list to improve the
page loading time. Use the left and right arrows to page through the full list of jobs.

You can filter jobs in the Jobs list:


Using keywords. If you have the increased jobs limit feature enabled for this workspace, searching by
keywords is supported only for the name, job ID, and job tag fields.
Selecting only the jobs you own.
Selecting all jobs you have permissions to access. Access to this filter requires that Jobs access control is
enabled.
Using tags. To search for a tag created with only a key, type the key into the search box. To search for a tag
created with a key and value, you can search by the key, the value, or both the key and value. For example, for
a tag with the key department and the value finance , you can search for department or finance to find
matching jobs. To search by both the key and value, enter the key and value separated by a colon; for
example, department:finance .

You can also click any column header to sort the list of jobs (either descending or ascending) by that column.
When the increased jobs limit feature is enabled, you can sort only by Name , Job ID , or Created by . The
default sorting is by Name in ascending order.

View runs for a job


1. Click Workflows in the sidebar.
2. In the Name column, click a job name. The Runs tab appears with a table of active runs and completed runs.
To switch to a matrix view, click Matrix . The matrix view shows a history of runs for the job, including each job
task.
The Job Runs row of the matrix displays the total duration of the run and the state of the run. To view details of
the run, including the start time, duration, and status, hover over the bar in the Job Runs row.
Each cell in the Tasks row represents a task and the corresponding status of the task. To view details of each
task, including the start time, duration, cluster, and status, hover over the cell for that task.
The job run and task run bars are color-coded to indicate the status of the run. Successful runs are green,
unsuccessful runs are red, and skipped runs are pink. The height of the individual job run and task run bars
provides a visual indication of the run duration.
Azure Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs,
Databricks recommends that you export results before they expire. For more information, see Export job run
results.

View job run details


The job run details page contains job output and links to logs, including information about the success or failure
of each task in the job run. You can access job run details from the Runs tab for the job. To view job run details
from the Runs tab, click the link for the run in the Run column of the Completed Runs (past 60 days) table.
To return to the Runs tab for the job, click on the Job ID value.
Click on a task to view task run details, including:
the cluster that ran the task
the Spark UI for the task
logs for the task
metrics for the task
Click the Job ID value to return to the Runs tab for the job. Click the Job run ID value to return to the job run
details.

View recent job runs


You can view a list of currently running and recently completed runs for all jobs in a workspace you have access
to, including runs started by external orchestration tools such as Apache Airflow or Azure Data Factory. To view
the list of recent job runs:

1. Click Workflows in the sidebar. The Jobs list appears.


2. Click the Job runs tab. The Job runs list appears.
The Job runs list displays:
The start time for the run.
The name of the job associated with the run.
The user name that the job runs as.
Whether the run was triggered by a job schedule or an API request, or was manually started.
The time elapsed for a currently running job, or the total running time for a completed run.
The status of the run, either Pending , Running , Skipped , Succeeded , Failed , Terminating , Terminated ,
Internal Error , Timed Out , Canceled , Canceling , or Waiting for Retry .

To view job run details, click the link in the Start time column for the run. To view job details, click the job name
in the Job column.

Export job run results


You can export notebook run results and job run logs for all job types.
Export notebook run results
You can persist job runs by exporting their results. For notebook job runs, you can export a rendered notebook
that can later be imported into your Azure Databricks workspace.
To export notebook run results for a job with a single task:
1. On the job detail page, click the View Details link for the run in the Run column of the Completed Runs
(past 60 days) table.
2. Click Export to HTML .
To export notebook run results for a job with multiple tasks:
1. On the job detail page, click the View Details link for the run in the Run column of the Completed Runs
(past 60 days) table.
2. Click the notebook task to export.
3. Click Export to HTML .
Export job run logs
You can also export the logs for your job run. You can set up your job to automatically deliver logs to DBFS
through the Job API. See the new_cluster.cluster_log_conf object in the request body passed to the Create a
new job operation ( POST /jobs/create ) in the Jobs API.
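As a sketch, the new_cluster portion of a create request could include a cluster_log_conf that delivers logs to DBFS. The Spark version, node type, and destination path below are illustrative values only:

# Fragment of a Jobs API create request body; the DBFS destination is hypothetical.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "cluster_log_conf": {
        "dbfs": {
            # Driver and executor logs are delivered to this path while the cluster runs.
            "destination": "dbfs:/cluster-logs/my-job"
        }
    },
}
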

Edit a job
Some configuration options are available on the job, and other options are available on individual tasks. For
example, the maximum concurrent runs can be set on the job only, while parameters must be defined for each
task.
To change the configuration for a job:

1. Click Workflows in the sidebar.


2. In the Name column, click the job name.
The side panel displays the Job details . You can change the schedule, cluster configuration, notifications,
maximum number of concurrent runs, and add or change tags. If job access control is enabled, you can also edit
job permissions.
Tags
To add labels or key:value attributes to your job, you can add tags when you edit the job. You can use tags to
filter jobs in the Jobs list; for example, you can use a department tag to filter all jobs that belong to a specific
department.
NOTE
Because job tags are not designed to store sensitive information such as personally identifiable information or passwords,
Databricks recommends using tags for non-sensitive values only.

Tags also propagate to job clusters created when a job is run, allowing you to use tags with your existing cluster
monitoring.
To add or edit tags, click + Tag in the Job details side panel. You can add the tag as a key and value, or a label.
To add a label, enter the label in the Key field and leave the Value field empty.
Clusters
To see tasks associated with a cluster, hover over the cluster in the side panel. To change the cluster configuration
for all associated tasks, click Configure under the cluster. To configure a new cluster for all associated tasks,
click Swap under the cluster.
Maximum concurrent runs
The maximum number of parallel runs for this job. Azure Databricks skips the run if the job has already reached
its maximum number of active runs when attempting to start a new run. Set this value higher than the default of
1 to perform multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a
frequent schedule and want to allow consecutive runs to overlap with each other, or you want to trigger multiple
runs that differ by their input parameters.
Notifications
You can add one or more email addresses to notify when runs of this job begin, complete, or fail:
1. Click Edit notifications .
2. Click Add .
3. Enter an email address and click the check box for each notification type to send to that address.
4. To enter another email address for notification, click Add .
5. If you do not want to receive notifications for skipped job runs, click the check box.
6. Click Confirm .
Integrate these email notifications with your favorite notification tools, including:
PagerDuty
Slack
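These settings correspond to the email_notifications field shown in the Jobs API examples later in this article. A minimal sketch of that fragment follows; the addresses are hypothetical:

# Fragment of a JobSettings object; attach it to the job or to an individual task.
email_notifications = {
    "on_start": ["ops@example.com"],
    "on_success": ["ops@example.com"],
    "on_failure": ["ops@example.com", "oncall@example.com"],
    # Suppress notifications for skipped runs, for example when the maximum
    # number of concurrent runs has already been reached.
    "no_alert_for_skipped_runs": True,
}
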
Control access to jobs
Job access control enables job owners and administrators to grant fine-grained permissions on their jobs. Job
owners can choose which other users or groups can view the results of the job. Owners can also choose who
can manage their job runs (Run now and Cancel run permissions).
See Jobs access control for details.

Edit a task
To set task configuration options:

1. Click Workflows in the sidebar.


2. In the Name column, click the job name.
3. Click the Tasks tab.
Task dependencies
You can define the order of execution of tasks in a job using the Depends on drop-down. You can set this field
to one or more tasks in the job.

NOTE
Depends on is not visible if the job consists of only a single task.

Configuring task dependencies creates a Directed Acyclic Graph (DAG) of task execution, a common way of
representing execution order in job schedulers. For example, consider the following job consisting of four tasks:

Task 1 is the root task and does not depend on any other task.
Task 2 and Task 3 depend on Task 1 completing first.
Finally, Task 4 depends on Task 2 and Task 3 completing successfully.
Azure Databricks runs upstream tasks before running downstream tasks, running as many of them in parallel as
possible. The following diagram illustrates the order of processing for these tasks:
Individual task configuration options
Individual tasks have the following configuration options:
In this section:
Cluster
Dependent libraries
Task parameter variables
Timeout
Retries
Cluster
To configure the cluster where a task runs, click the Cluster drop-down. You can edit a shared job cluster, but
you cannot delete a shared cluster if it is still used by other tasks.
To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
Dependent libraries
Dependent libraries will be installed on the cluster before the task runs. You must set all task dependencies to
ensure they are installed before the run starts.
To add a dependent library, click Advanced options and select Add Dependent Libraries to open the Add
Dependent Library chooser. Follow the recommendations in Library dependencies for specifying
dependencies.

IMPORTANT
If you have configured a library to install on all clusters automatically, or you select an existing terminated cluster that has
libraries installed, the job execution does not wait for library installation to complete. If a job requires a specific library, you
should attach the library to the job in the Dependent Libraries field.

Task parameter variables


You can pass templated variables into a job task as part of the task’s parameters. These variables are replaced
with the appropriate values when the job task runs. You can use task parameter values to pass the context about
a job run, such as the run ID or the job’s start time.
When a job runs, the task parameter variable surrounded by double curly braces is replaced and appended to an
optional string value included as part of the value. For example, to pass a parameter named MyJobId with a
value of my-job-6 for any run of job ID 6, add the following task parameter:

{
"MyJobID": "my-job-{{job_id}}"
}

The contents of the double curly braces are not evaluated as expressions, so you cannot do operations or
functions within double-curly braces. Whitespace is not stripped inside the curly braces, so {{ job_id }} will
not be evaluated.
The following task parameter variables are supported:

VARIABLE               DESCRIPTION

{{job_id}}             The unique identifier assigned to a job.

{{run_id}}             The unique identifier assigned to a job run.

{{start_date}}         The date a task run started. The format is yyyy-MM-dd in the UTC timezone.

{{start_time}}         The timestamp of the run’s start of execution after the cluster is created
                       and ready. The format is milliseconds since UNIX epoch in the UTC timezone,
                       as returned by System.currentTimeMillis() .

{{task_retry_count}}   The number of retries that have been attempted to run a task if the first
                       attempt fails. The value is 0 for the first attempt and increments with
                       each retry.

{{parent_run_id}}      The unique identifier assigned to the run of a job with multiple tasks.

{{task_key}}           The unique name assigned to a task that’s part of a job with multiple tasks.

You can set these variables with any task when you Create a job, Edit a job, or Run a job with different
parameters.
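For example, if a notebook task is given a parameter with key MyJobId and value my-job-{{job_id}}, the notebook can read the substituted value through a widget, as in this minimal sketch (the widget default is only used when the notebook is run interactively):

# Read the task parameter inside the notebook. For a run of job ID 6,
# the widget value resolves to "my-job-6".
dbutils.widgets.text("MyJobId", "")            # default used for interactive runs
my_job_id = dbutils.widgets.get("MyJobId")
print(my_job_id)
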
Timeout
The maximum completion time for a job. If the job does not complete in this time, Azure Databricks sets its
status to “Timed Out”.
Retries
A policy that determines when and how many times failed runs are retried. To set the retries for the task, click
Advanced options and select Edit Retry Policy . The retry interval is calculated in milliseconds between the
start of the failed run and the subsequent retry run.

NOTE
If you configure both Timeout and Retries , the timeout applies to each retry.
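In the Jobs API, these options map to fields on each task. The following fragment is a minimal sketch that uses the same field names shown in the API examples later in this article; the path and cluster ID are illustrative:

# Fragment of a task definition inside a JobSettings "tasks" array.
task = {
    "task_key": "clean_data",
    "notebook_task": {"notebook_path": "/Users/user@databricks.com/clean-data"},
    "existing_cluster_id": "1201-my-cluster",
    "timeout_seconds": 3600,             # each attempt may run for at most one hour
    "max_retries": 3,                    # retry a failed run up to three times
    "min_retry_interval_millis": 60000,  # wait at least one minute between attempts
    "retry_on_timeout": True,            # also retry attempts that time out
}
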
Clone a job
You can quickly create a new job by cloning an existing job. Cloning a job creates an identical copy of the job,
except for the job ID. On the job’s page, click More … next to the job’s name and select Clone from the drop-
down menu.

Clone a task
You can quickly create a new task by cloning an existing task:
1. On the job’s page, click the Tasks tab.
2. Select the task to clone.
3. Click and select Clone task .

Delete a job
To delete a job, on the job’s page, click More … next to the job’s name and select Delete from the drop-down
menu.

Delete a task
To delete a task:
1. Click the Tasks tab.
2. Select the task to be deleted.
3. Click and select Remove task .

Copy a task path


To copy the path to a task, for example, a notebook path:
1. Click the Tasks tab.
2. Select the task containing the path to copy.
3. Click next to the task path to copy the path to the clipboard.

Run jobs using notebooks in a remote Git repository


IMPORTANT
This feature is in Public Preview.

You can run jobs with notebooks located in a remote Git repository. This feature simplifies creation and
management of production jobs and automates continuous deployment:
You don’t need to create a separate production repo in Azure Databricks, manage permissions for it, or
keep it updated.
You can prevent unintentional changes to a production job, such as local edits in the production repo or
changes from switching a branch.
The job definition process has a single source of truth in the remote repository.
To use notebooks in a remote Git repository, you must Set up Git integration with Databricks Repos.
To create a task with a notebook located in a remote Git repository:
1. In the Type drop-down, select Notebook .
2. In the Source drop-down, select Git provider . The Git information dialog appears.
3. In the Git Information dialog, enter details for the repository.
For Path , enter a relative path to the notebook location, such as etl/notebooks/ .
When you enter the relative path, don’t begin it with / or ./ and don’t include the notebook file
extension, such as .py .
Additional notebook tasks in a multitask job reference the same commit in the remote repository in one of the
following ways:
sha of $branch/head when git_branch is set
sha of $tag when git_tag is set
the value of git_commit

In a multitask job, there cannot be a task that uses a local notebook and another task that uses a remote
repository. This restriction doesn’t apply to non-notebook tasks.
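In the Jobs API 2.1, the same configuration is expressed with a git_source object on the job and relative notebook paths in the tasks. The following is only a sketch: the repository URL, branch, paths, and cluster ID are hypothetical, and the exact field names and provider values should be confirmed against the Jobs API reference:

# Fragment of a JobSettings object for a job whose notebooks live in a remote Git repo.
job_settings = {
    "name": "Nightly ETL from Git",
    "git_source": {
        "git_url": "https://github.com/example-org/etl-pipelines",  # hypothetical repository
        "git_provider": "gitHub",
        "git_branch": "main",   # alternatively git_tag or git_commit
    },
    "tasks": [
        {
            "task_key": "run_etl",
            "notebook_task": {
                # Relative path: no leading "/" and no file extension.
                "notebook_path": "etl/notebooks/run_etl"
            },
            "existing_cluster_id": "1201-my-cluster",
        }
    ],
}
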

Best practices
In this section:
Cluster configuration tips
Notebook job tips
Streaming tasks
JAR jobs
Library dependencies
Cluster configuration tips
Cluster configuration is important when you operationalize a job. The following provides general guidance on
choosing and configuring job clusters, followed by recommendations for specific job types.
Use shared job clusters
To optimize resource usage with jobs that orchestrate multiple tasks, use shared job clusters. A shared job
cluster allows multiple tasks in the same job run to reuse the cluster. You can use a single job cluster to run all
tasks that are part of the job, or multiple job clusters optimized for specific workloads. To use a shared job
cluster:
1. Select New Job Clusters when you create a task and complete the cluster configuration.
2. Select the new cluster when adding a task to the job, or create a new job cluster. Any cluster you configure
when you select New Job Clusters is available to any task in the job.
A shared job cluster is scoped to a single job run, and cannot be used by other jobs or runs of the same job.
Libraries cannot be declared in a shared job cluster configuration. You must add dependent libraries in task
settings.
Choose the correct cluster type for your job
New Job Clusters are dedicated clusters for a job or task run. A shared job cluster is created and started
when the first task using the cluster starts and terminates after the last task using the cluster completes. The
cluster is not terminated when idle but terminates only after all tasks using it have completed. If a shared job
cluster fails or is terminated before all tasks have finished, a new cluster is created. A cluster scoped to a
single task is created and started when the task starts and terminates when the task completes. In
production, Databricks recommends using new shared or task scoped clusters so that each job or task runs
in a fully isolated environment.
When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to the
task workload pricing. When you run a task on an existing all-purpose cluster, the task is treated as a data
analytics (all-purpose) workload, subject to all-purpose workload pricing.
If you select a terminated existing cluster and the job owner has Can Restart permission, Azure Databricks
starts the cluster when the job is scheduled to run.
Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.
Use a pool to reduce cluster start times
To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.
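For example, a job cluster definition can reference the pool through its instance_pool_id; a minimal sketch of the relevant fragment follows, with a hypothetical pool ID and an illustrative Spark version:

# Fragment of a new_cluster definition that draws instances from a pre-warmed pool.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "instance_pool_id": "0923-164208-pool-9x3lmtnv",  # hypothetical pool ID
    "num_workers": 4,
}
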
Notebook job tips
Total notebook cell output (the combined output of all notebook cells) is subject to a 20MB size limit.
Additionally, individual cell output is subject to an 8MB size limit. If total cell output exceeds 20MB in size, or if
the output of an individual cell is larger than 8MB, the run is canceled and marked as failed.
If you need help finding cells near or beyond the limit, run the notebook against an all-purpose cluster and use
this notebook autosave technique.
Streaming tasks
Spark Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should
be set to run using the cron expression "* * * * * ?" (every minute).
Since a streaming task runs continuously, it should always be the final task in a job.
JAR jobs
When running a JAR job, keep in mind the following:
Output size limits

NOTE
Available in Databricks Runtime 6.3 and above.

Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger
size, the run is canceled and marked as failed.
To avoid encountering this limit, you can prevent stdout from being returned from the driver to Azure
Databricks by setting the spark.databricks.driver.disableScalaOutput Spark configuration to true . By default,
the flag value is false . The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is
enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written
in the cluster’s log files. Setting this flag is recommended only for job clusters for JAR jobs because it will disable
notebook results.
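For example, a new job cluster for a JAR task could set the flag in its Spark configuration; the following is a minimal sketch of the relevant fragment, with an illustrative Spark version and node type:

# Fragment of a new_cluster definition for a JAR job. The flag suppresses returning
# stdout from the driver so the 20MB output limit is not hit.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 8,
    "spark_conf": {
        "spark.databricks.driver.disableScalaOutput": "true"
    },
}
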
Use the shared SparkContext

Because Azure Databricks is a managed service, some code changes may be necessary to ensure that your
Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the
SparkContext . Because Azure Databricks initializes the SparkContext , programs that invoke new SparkContext()
will fail. To get the SparkContext , use only the shared SparkContext created by Azure Databricks:

val goodSparkContext = SparkContext.getOrCreate()


val goodSparkSession = SparkSession.builder().getOrCreate()

There are also several methods you should avoid when using the shared SparkContext .
Do not call SparkContext.stop() .
Do not call System.exit(0) or sc.stop() at the end of your Main program. This can cause undefined
behavior.
Use try-finally blocks for job clean up
Consider a JAR that consists of two parts:
jobBody() which contains the main part of the job.
jobCleanup() which has to be executed after jobBody() whether that function succeeded or returned an
exception.
As an example, jobBody() may create tables, and you can use jobCleanup() to drop these tables.
The safe way to ensure that the clean up method is called is to put a try-finally block in the code:

try {
jobBody()
} finally {
jobCleanup()
}

You should not try to clean up using sys.addShutdownHook(jobCleanup) or the following code:

val cleanupThread = new Thread { override def run = jobCleanup() }


Runtime.getRuntime.addShutdownHook(cleanupThread)

Due to the way the lifetime of Spark containers is managed in Azure Databricks, the shutdown hooks are not run
reliably.
Configure JAR job parameters
You pass parameters to JAR jobs with a JSON string array. See the spark_jar_task object in the request body
passed to the Create a new job operation ( POST /jobs/create ) in the Jobs API. To access these parameters,
inspect the String array passed into your main function.
Library dependencies
The Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over
any of your libraries that conflict with them.
To get the full list of the driver library dependencies, run the following command inside a notebook attached to a
cluster of the same Spark version (or the cluster with the driver you want to examine).

%sh
ls /databricks/jars

Manage library dependencies


A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and
Hadoop as provided dependencies. On Maven, add Spark and Hadoop as provided dependencies, as shown in
the following example:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>

In sbt , add Spark and Hadoop as provided dependencies, as shown in the following example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"


libraryDependencies += "org.apache.hadoop" %% "hadoop-core" % "1.2.1" % "provided"

TIP
Specify the correct Scala version for your dependencies based on the version you are running.
Jobs API updates
7/21/2022 • 10 minutes to read

You can now orchestrate multiple tasks with Azure Databricks jobs. This article details changes to the Jobs API
2.1 that support jobs with multiple tasks and provides guidance to help you update your existing API clients to
work with this new feature.
Databricks recommends Jobs API 2.1 for your API scripts and clients, particularly when using jobs with multiple
tasks.
This article refers to jobs defined with a single task as single-task format and jobs defined with multiple tasks as
multi-task format.
Jobs API 2.0 and 2.1 now support the update request. Use the update request to change an existing job instead
of the reset request to minimize changes between single-task format jobs and multi-task format jobs.

API changes
The Jobs API now defines a TaskSettings object to capture settings for each task in a job. For multi-task format
jobs, the tasks field, an array of TaskSettings data structures, is included in the JobSettings object. Some
fields previously part of JobSettings are now part of the task settings for multi-task format jobs. JobSettings
is also updated to include the format field. The format field indicates the format of the job and is a STRING
value set to SINGLE_TASK or MULTI_TASK .
You need to update your existing API clients for these changes to JobSettings for multi-task format jobs. See the
API client guide for more information on required changes.
Jobs API 2.1 supports the multi-task format. All API 2.1 requests must conform to the multi-task format and
responses are structured in the multi-task format. New features are released for API 2.1 first.
Jobs API 2.0 is updated with an additional field to support multi-task format jobs. Except where noted, the
examples in this document use API 2.0. However, Databricks recommends API 2.1 for new and existing API
scripts and clients.
An example JSON document representing a multi-task format job for API 2.0 and 2.1:
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}

Jobs API 2.1 supports configuration of task level clusters or one or more shared job clusters:
A task level cluster is created and started when a task starts and terminates when the task completes.
A shared job cluster allows multiple tasks in the same job to use the cluster. The cluster is created and started
when the first task using the cluster starts and terminates after the last task using the cluster completes. A
shared job cluster is not terminated when idle but terminates only after all tasks using it are complete.
Multiple non-dependent tasks sharing a cluster can start at the same time. If a shared job cluster fails or is
terminated before all tasks have finished, a new cluster is created.
To configure shared job clusters, include a JobCluster array in the JobSettings object. You can specify a
maximum of 100 clusters per job. The following is an example of an API 2.1 response for a job configured with
two shared clusters:
NOTE
If a task has library dependencies, you must configure the libraries in the task field settings; libraries cannot be
configured in a shared job cluster configuration. In the following example, the libraries field in the configuration of the
ingest_orders task demonstrates specification of a library dependency.

{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"job_clusters": [
{
"job_cluster_key": "default_cluster",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "i3.xlarge",
"spark_conf": {
"spark.speculation": true
},
"aws_attributes": {
"availability": "SPOT",
"zone_id": "us-west-2a"
},
"autoscale": {
"min_workers": 2,
"max_workers": 8
}
}
},
{
"job_cluster_key": "data_processing_cluster",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "r4.2xlarge",
"spark_conf": {
"spark.speculation": true
},
"aws_attributes": {
"availability": "SPOT",
"zone_id": "us-west-2a"
},
"autoscale": {
"min_workers": 8,
"max_workers": 16
}
}
}
],
"tasks": [
{
"task_key": "ingest_orders",
"description": "Ingest order data",
"depends_on": [ ],
"job_cluster_key": "auto_scaling_cluster",
"spark_jar_task": {
"main_class_name": "com.databricks.OrdersIngest",
"parameters": [
"--data",
"dbfs:/path/to/order-data.json"
]
},
"libraries": [
{
"jar": "dbfs:/mnt/databricks/OrderIngest.jar"
}
],
"timeout_seconds": 86400,
"max_retries": 3,
"min_retry_interval_millis": 2000,
"retry_on_timeout": false
},
{
"task_key": "clean_orders",
"description": "Clean and prepare the order data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"job_cluster_key": "default_cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_orders",
"description": "Perform an analysis of the order data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"job_cluster_key": "data_processing_cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}

For single-task format jobs, the JobSettings data structure remains unchanged except for the addition of the
format field. No TaskSettings array is included, and the task settings remain defined at the top level of the
JobSettings data structure. You will not need to make changes to your existing API clients to process single-task
format jobs.
An example JSON document representing a single-task format job for API 2.0:
{
"job_id": 27,
"settings": {
"name": "Example notebook",
"existing_cluster_id": "1201-my-cluster",
"libraries": [
{
"jar": "dbfs:/FileStore/jars/spark_examples.jar"
}
],
"email_notifications": {},
"timeout_seconds": 0,
"schedule": {
"quartz_cron_expression": "0 0 0 * * ?",
"timezone_id": "US/Pacific",
"pause_status": "UNPAUSED"
},
"notebook_task": {
"notebook_path": "/notebooks/example-notebook",
"revision_timestamp": 0
},
"max_concurrent_runs": 1,
"format": "SINGLE_TASK"
},
"created_time": 1504128821443,
"creator_user_name": "user@databricks.com"
}

API client guide


This section provides guidelines, examples, and required changes for API calls affected by the new multi-task
format feature.
In this section:
Create
Runs submit
Update
Reset
List
Get
Runs get
Runs get output
Runs list
Create
To create a single-task format job through the Create a new job operation ( POST /jobs/create ) in the Jobs API,
you do not need to change existing clients.
To create a multi-task format job, use the tasks field in JobSettings to specify settings for each task. The
following example creates a job with two notebook tasks. This example is for API 2.0 and 2.1:

NOTE
A maximum of 100 tasks can be specified per job.
{
"name": "Multi-task-job",
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"timeout_seconds": 3600,
"max_retries": 3,
"retry_on_timeout": true
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"timeout_seconds": 3600,
"max_retries": 3,
"retry_on_timeout": true
}
]
}

Runs submit
To submit a one-time run of a single-task format job with the Create and trigger a one-time run operation (
POST /runs/submit ) in the Jobs API, you do not need to change existing clients.

To submit a one-time run of a multi-task format job, use the tasks field in JobSettings to specify settings for
each task, including clusters. Clusters must be set at the task level when submitting a multi-task format job
because the runs submit request does not support shared job clusters. See Create for an example JobSettings
specifying multiple tasks.
Update
To update a single-task format job with the Partially update a job operation ( POST /jobs/update ) in the Jobs API,
you do not need to change existing clients.
To update the settings of a multi-task format job, you must use the unique task_key field to identify new task
settings. See Create for an example JobSettings specifying multiple tasks.
Reset
To overwrite the settings of a single-task format job with the Overwrite all settings for a job operation (
POST /jobs/reset ) in the Jobs API, you do not need to change existing clients.

To overwrite the settings of a multi-task format job, specify a JobSettings data structure with an array of
TaskSettings data structures. See Create for an example JobSettings specifying multiple tasks.

Use Update to change individual fields without switching from single-task to multi-task format.
List
For single-task format jobs, no client changes are required to process the response from the List all jobs
operation ( GET /jobs/list ) in the Jobs API.
For multi-task format jobs, most settings are defined at the task level and not the job level. Cluster configuration
may be set at the task or job level. To modify clients to access cluster or task settings for a multi-task format job
returned in the Job structure:
Parse the job_id field for the multi-task format job.
Pass the job_id to the Get a job operation ( GET /jobs/get ) in the Jobs API to retrieve job details. See Get for
an example response from the Get API call for a multi-task format job.

The following example shows a response containing single-task and multi-task format jobs. This example is for
API 2.0:

{
"jobs": [
{
"job_id": 36,
"settings": {
"name": "A job with a single task",
"existing_cluster_id": "1201-my-cluster",
"email_notifications": {},
"timeout_seconds": 0,
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/example-notebook",
"revision_timestamp": 0
},
"max_concurrent_runs": 1,
"format": "SINGLE_TASK"
},
"created_time": 1505427148390,
"creator_user_name": "user@databricks.com"
},
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com"
}
]
}

Get
For single-task format jobs, no client changes are required to process the response from the Get a job operation
( GET /jobs/get ) in the Jobs API.
Multi-task format jobs return an array of task data structures containing task settings. If you require access to
task level details, you need to modify your clients to iterate through the tasks array and extract required fields.
The following shows an example response from the Get API call for a multi-task format job. This example is for
API 2.0 and 2.1:
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}

Runs get
For single-task format jobs, no client changes are required to process the response from the Get a job run
operation ( GET /jobs/runs/get ) in the Jobs API.
The response for a multi-task format job run contains an array of TaskSettings . To retrieve run results for each
task:
Iterate through each of the tasks.
Parse the run_id for each task.
Call the Get the output for a run operation ( GET /jobs/runs/get-output ) with the run_id to get details on the
run for each task. The following is an example response from this request:
{
"job_id": 53,
"run_id": 759600,
"number_in_job": 7,
"original_attempt_run_id": 759600,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"cluster_spec": {},
"start_time": 1595943854860,
"setup_duration": 0,
"execution_duration": 0,
"cleanup_duration": 0,
"trigger": "ONE_TIME",
"creator_user_name": "user@databricks.com",
"run_name": "Query logs",
"run_type": "JOB_RUN",
"tasks": [
{
"run_id": 759601,
"task_key": "query-logs",
"description": "Query session logs",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/log-query"
},
"existing_cluster_id": "1201-my-cluster",
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
}
},
{
"run_id": 759602,
"task_key": "validate_output",
"description": "Validate query output",
"depends_on": [
{
"task_key": "query-logs"
}
],
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/validate-query-results"
},
"existing_cluster_id": "1201-my-cluster",
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
}
}
],
"format": "MULTI_TASK"
}

Runs get output


For single-task format jobs, no client changes are required to process the response from the Get the output for a
run operation ( GET /jobs/runs/get-output ) in the Jobs API.
For multi-task format jobs, calling Runs get output on a parent run results in an error since run output is
available only for individual tasks. To get the output and metadata for a multi-task format job:
Call the Get the output for a run request.
Iterate over the child run_id fields in the response.
Use the child run_id values to call Runs get output .
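One way to do this is sketched below: list the task runs through the Get a job run operation, then request the output of each child run. The workspace URL, token environment variable, and parent run ID are hypothetical, and the requests library is assumed:

import os
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
PARENT_RUN_ID = 759600  # hypothetical multi-task job run

# Retrieve the parent run, then fetch the output of each child task run.
run = requests.get(
    f"{WORKSPACE_URL}/api/2.1/jobs/runs/get",
    headers=HEADERS,
    params={"run_id": PARENT_RUN_ID},
)
run.raise_for_status()

for task in run.json().get("tasks", []):
    output = requests.get(
        f"{WORKSPACE_URL}/api/2.1/jobs/runs/get-output",
        headers=HEADERS,
        params={"run_id": task["run_id"]},
    )
    output.raise_for_status()
    print(task["task_key"], output.json().get("notebook_output"))
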
Runs list
For single-task format jobs, no client changes are required to process the response from the List runs for a job
operation ( GET /jobs/runs/list ).
For multi-task format jobs, an empty tasks array is returned. Pass the run_id to the Get a job run operation (
GET /jobs/runs/get ) to retrieve the tasks. The following shows an example response from the Runs list API
call for a multi-task format job:

{
"runs": [
{
"job_id": 53,
"run_id": 759600,
"number_in_job": 7,
"original_attempt_run_id": 759600,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"cluster_spec": {},
"start_time": 1595943854860,
"setup_duration": 0,
"execution_duration": 0,
"cleanup_duration": 0,
"trigger": "ONE_TIME",
"creator_user_name": "user@databricks.com",
"run_name": "Query logs",
"run_type": "JOB_RUN",
"tasks": [],
"format": "MULTI_TASK"
}
],
"has_more": false
}
Managing dependencies in data pipelines
7/21/2022 • 8 minutes to read

Developing and deploying a data processing pipeline often requires managing complex dependencies between
tasks. For example, a pipeline might read data from a source, clean the data, transform the cleaned data, and
write the transformed data to a target. You need to test, schedule, and troubleshoot data pipelines when you
operationalize them.
Workflow systems address these challenges by allowing you to define dependencies between tasks, schedule
when pipelines run, and monitor workflows. Databricks recommends jobs with multiple tasks to manage your
workflows without relying on an external system. Azure Databricks jobs provide task orchestration with
standard authentication and access control methods. You can manage jobs using a familiar, user-friendly
interface to create and manage complex workflows. You can define a job containing multiple tasks, where each
task runs code such as a notebook or JAR, and control the execution order of tasks in a job by specifying
dependencies between them. You can configure a job’s tasks to run in sequence or parallel.
Azure Databricks also supports workflow management with Azure Data Factory or Apache Airflow.

Azure Data Factory


Azure Data Factory is a cloud data integration service that lets you compose data storage, movement, and
processing services into automated data pipelines. You can operationalize Databricks notebooks in Azure Data
Factory data pipelines. See Run a Databricks notebook with the Databricks notebook activity in Azure Data
Factory for instructions on how to create an Azure Data Factory pipeline that runs a Databricks notebook in an
Azure Databricks cluster, followed by Transform data by running a Databricks notebook.

Apache Airflow
Apache Airflow is an open source solution for managing and scheduling data pipelines. Airflow represents data
pipelines as directed acyclic graphs (DAGs) of operations. You define a workflow in a Python file and Airflow
manages the scheduling and execution.
Apache Airflow provides tight integration with Azure Databricks. The Airflow Azure Databricks
integration lets you take advantage of the optimized Spark engine offered by Azure Databricks with the
scheduling features of Airflow.

Requirements
The integration between Airflow and Azure Databricks is available in Airflow version 1.9.0 and later. The
examples in this article are tested with Airflow version 2.1.0.
Airflow requires Python 3.6, 3.7, or 3.8. The examples in this article are tested with Python 3.8.

Install the Airflow Azure Databricks integration


To install the Airflow Azure Databricks integration, open a terminal and run the following commands:
mkdir airflow
cd airflow
pipenv --python 3.8
pipenv shell
export AIRFLOW_HOME=$(pwd)
pipenv install apache-airflow==2.1.0
pipenv install apache-airflow-providers-databricks
mkdir dags
airflow db init
airflow users create --username admin --firstname <firstname> --lastname <lastname> --role Admin --email
your@email.com

These commands:
1. Create a directory named airflow and change into that directory.
2. Use pipenv to create and spawn a Python virtual environment. Databricks recommends using a Python
virtual environment to isolate package versions and code dependencies to that environment. This isolation
helps reduce unexpected package version mismatches and code dependency collisions.
3. Initialize an environment variable named AIRFLOW_HOME set to the path of the airflow directory.
4. Install Airflow and the Airflow Databricks provider packages.
5. Create an airflow/dags directory. Airflow uses the dags directory to store DAG definitions.
6. Initialize a SQLite database that Airflow uses to track metadata. In a production Airflow deployment, you
would configure Airflow with a standard database. The SQLite database and default configuration for your
Airflow deployment are initialized in the airflow directory.
7. Create an admin user for Airflow.
To install extras, for example celery and password , run:

pip install "apache-airflow[databricks, celery, password]"

Start the Airflow web server and scheduler


The Airflow web server is required to view the Airflow UI. To start the web server, open a terminal and run the
following command:

airflow webserver

The scheduler is the Airflow component that schedules DAGs. To run it, open a new terminal and run the
following command:

pipenv shell
export AIRFLOW_HOME=$(pwd)
airflow scheduler

Test the Airflow installation


To verify the Airflow installation, you can run one of the example DAGs included with Airflow:
1. In a browser window, open http://localhost:8080/home. The Airflow DAGs screen appears.
2. Click the Pause/Unpause DAG toggle to unpause one of the example DAGs, for example, the
example_python_operator .
3. Trigger the example DAG by clicking the Start button.
4. Click the DAG name to view details, including the run status of the DAG.

Run an Azure Databricks job from Airflow


The Airflow Azure Databricks integration provides two different operators for triggering jobs:
The DatabricksRunNowOperator requires an existing Azure Databricks job and uses the Trigger a new job
run ( POST /jobs/run-now ) API request to trigger a run. Databricks recommends using
DatabricksRunNowOperator because it reduces duplication of job definitions and job runs triggered with this
operator are easy to find in the jobs UI.
The DatabricksSubmitRunOperator does not require a job to exist in Azure Databricks and uses the Create
and trigger a one-time run ( POST /jobs/runs/submit ) API request to submit the job specification and trigger a
run.
The Databricks Airflow operator writes the job run page URL to the Airflow logs every polling_period_seconds
(the default is 30 seconds). For more information, see the apache-airflow-providers-databricks package page on
the Airflow website.
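For comparison with the DatabricksRunNowOperator example later in this article, the following is a minimal sketch of a DAG that uses DatabricksSubmitRunOperator to submit a one-time notebook run on a new cluster; the cluster specification and notebook path are hypothetical:

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago

with DAG(
    "databricks_submit_run_dag",
    start_date=days_ago(2),
    schedule_interval=None,
) as dag:
    submit_run = DatabricksSubmitRunOperator(
        task_id="submit_run",
        databricks_conn_id="databricks_default",
        # One-time run specification, equivalent to the POST /jobs/runs/submit body.
        json={
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/Users/user@databricks.com/hello-airflow"},
        },
    )
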
Example
The following example demonstrates how to create a simple Airflow deployment that runs on your local
machine and deploys an example DAG to trigger runs in Azure Databricks. For this example, you:
1. Create a new notebook and add code to print a greeting based on a configured parameter.
2. Create an Azure Databricks job with a single task that runs the notebook.
3. Configure an Airflow connection to your Azure Databricks workspace.
4. Create an Airflow DAG to trigger the notebook job. You define the DAG in a Python script using
DatabricksRunNowOperator .
5. Use the Airflow UI to trigger the DAG and view the run status.
Create a notebook
This example uses a notebook containing two cells:
The first cell contains a Databricks Utilities text widget defining a variable named greeting set to the default
value world .
The second cell prints the value of the greeting variable prefixed by hello .
To create the notebook:

1. Go to your Azure Databricks landing page and select Create Blank Notebook or click Create in
the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name, such as Hello Airflow . Set Default
Language to Python . Leave Cluster set to the default value. You will configure the cluster when you
create a task that uses this notebook.
3. Click Create .
4. Copy the following Python code and paste it into the first cell of the notebook.

dbutils.widgets.text("greeting", "world", "Greeting")


greeting = dbutils.widgets.get("greeting")

5. Add a new cell below the first cell and copy and paste the following Python code into the new cell:
print("hello {}".format(greeting))

Create a job

1. Click Workflows in the sidebar.

2. Click Create Job .
The Tasks tab displays with the create task dialog.

3. Replace Add a name for your job… with your job name.
4. In the Task name field, enter a name for the task, for example, greeting-task .
5. In the Type drop-down, select Notebook .
6. Use the file browser to find the notebook you created, click the notebook name, and click Confirm .
7. Click Add under Parameters . In the Key field, enter greeting . In the Value field, enter Airflow user .
8. Click Create task .
Run the job

To run the job immediately, click Run Now in the upper right corner. You can also run the job by clicking the
Runs tab and clicking Run Now in the Active Runs table.
View run details
1. Click the Runs tab and click View Details in the Active Runs table or the Completed Runs (past 60
days) table.
2. Copy the Job ID value. This value is required to trigger the job from Airflow.
Create an Azure Databricks personal access token

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

Airflow connects to Databricks using an Azure Databricks personal access token (PAT). See personal access token
for instructions on creating a PAT.
Configure an Azure Databricks connection
Your Airflow installation contains a default connection for Azure Databricks. To update the connection to connect
to your workspace using the personal access token you created above:
1. In a browser window, open http://localhost:8080/connection/list/.
2. Under Conn ID , locate databricks_default and click the Edit record button.
3. Replace the value in the Host field with the workspace instance name of your Azure Databricks
deployment.
4. In the Extra field, enter the following value:

{"token": "PERSONAL_ACCESS_TOKEN"}

Replace PERSONAL_ACCESS_TOKEN with your Azure Databricks personal access token.


Create a new DAG
You define an Airflow DAG in a Python file. To create a DAG to trigger the example notebook job:
1. In a text editor or IDE, create a new file named databricks_dag.py with the following contents:
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow'
}

with DAG('databricks_dag',
    start_date = days_ago(2),
    schedule_interval = None,
    default_args = default_args
    ) as dag:

    opr_run_now = DatabricksRunNowOperator(
        task_id = 'run_now',
        databricks_conn_id = 'databricks_default',
        job_id = JOB_ID
    )

Replace JOB_ID with the value of the job ID saved earlier.


2. Save the file in the airflow/dags directory. Airflow automatically reads and installs DAG files stored in
airflow/dags/ .

Install and verify the DAG in Airflow


To trigger and verify the DAG in the Airflow UI:
1. In a browser window, open http://localhost:8080/home. The Airflow DAGs screen appears.
2. Locate databricks_dag and click the Pause/Unpause DAG toggle to unpause the DAG.
3. Trigger the DAG by clicking the Start button.
4. Click a run in the Runs column to view the status and details of the run.
Delta Lake guide
7/21/2022 • 2 minutes to read

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID
transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on
top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Azure Databricks
allows you to configure Delta Lake based on your workload patterns.
Azure Databricks adds optimized layouts and indexes to Delta Lake for fast interactive queries.
This guide provides an introductory overview, quickstarts, and guidance for using Delta Lake on Azure
Databricks.
Introduction
Delta Lake quickstart
Introductory notebooks
Ingest data into Delta Lake
Table batch reads and writes
Table streaming reads and writes
Table deletes, updates, and merges
Change data feed
Table utility commands
Constraints
Table protocol versioning
Delta column mapping
Unity Catalog
Delta Lake APIs
Concurrency control
Migration guide
Best practices: Delta Lake
Frequently asked questions (FAQ)
Delta Lake resources
Optimizations
Delta table properties reference
Introduction
7/21/2022 • 3 minutes to read

Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta
Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing
on top of existing data lakes.
For a quick overview and benefits of Delta Lake, watch this YouTube video (3 minutes).

Specifically, Delta Lake offers:


ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
Scalable metadata handling: Leverages Spark distributed processing power to handle all the metadata for
petabyte-scale tables with billions of files at ease.
Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink.
Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during
ingestion.
Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning
experiments.
Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-
data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
For a general introduction and demonstration of Delta Lake, watch this YouTube video (51 minutes).

Delta Engine optimizations make Delta Lake operations highly performant, supporting a variety of workloads
ranging from large-scale ETL processing to ad-hoc, interactive queries. For information on Delta Engine, see
Optimizations.

Quickstart
The Delta Lake quickstart provides an overview of the basics of working with Delta Lake. The quickstart shows
how to load data into a Delta table, modify the table, read the table, display table history, and optimize the table.
For Azure Databricks notebooks that demonstrate these features, see Introductory notebooks.
To try out Delta Lake, see Sign up for Azure Databricks.

Key tasks
The following list provides links to documentation for common Delta Lake tasks.
Create a Delta table: quick start, as part of batch data tasks
Load and write data into a Delta Lake table:
With COPY INTO
With Auto Loader
With the Create Table UI in Databricks SQL.
With streaming: quick start, as part of streaming
With third-party solutions: with partners, with third-party providers
Update data
Merge data updates and insertions (upserts): quick start, as part of table updates
Append data
Overwrite data
Convert a Parquet table to a Delta table
Read data from a Delta table: quick start, as part of batch data tasks, as part of streaming
Optimize a Delta table: quick start, as part of bin packing, as part of Z-ordering, as part of file size tuning
Create a view on top of a Delta table
Delete data from a Delta table
Display Delta table details
Display the history of a Delta table: quick start, as part of data utilities
Clean up Delta table snapshots (vacuum): quick start, as part of data utilities
Work with Delta table columns:
Work with column constraints
Partition data by columns: quick start, as part of batch data tasks
Use automatically-generated columns
Update columns (add, reorder, replace, rename, change type)
Map columns in Delta tables to columns in related Parquet tables
Track changes to a Delta table (change data feed)
Copy or clone a Delta table
Work with table constraints
Work with Delta table versions:
Query an earlier version of a Delta table (time travel): quick start, as part of batch data tasks
Restore or roll back a Delta table to an earlier version
Work with Delta table reader and writer versions
Work with Delta table metadata:
Read existing metadata
Add your own metadata
Use Delta Lake SQL statements
Use the Delta Lake API reference
Learn about Delta Lake concurrency control (ACID transactions)

Resources
For answers to frequently asked questions, see Frequently asked questions (FAQ).
For reference information on Delta Lake SQL commands, see Delta Lake statements.
For further resources, including blog posts, talks, and examples, see Delta Lake resources.
For deep-dive training on Delta Lake, watch this YouTube video (2 hours, 42 minutes).
Delta Lake quickstart
7/21/2022 • 14 minutes to read

The Delta Lake quickstart provides an overview of the basics of working with Delta Lake. The quickstart shows
how to load data into a Delta table, modify the table, read the table, display table history, and optimize the table.
For a demonstration of some of the features that are described in this article (and many more), watch this
YouTube video (9 minutes).

You can run the example Python, R, Scala, and SQL code in this article from within a notebook attached to an
Azure Databricks cluster. You can also run the SQL code in this article from within a query associated with a SQL
warehouse in Databricks SQL.
For existing Azure Databricks notebooks that demonstrate these features, see Introductory notebooks.

Create a table
To create a Delta table, you can use existing Apache Spark SQL code and change the write format from parquet ,
csv , json , and so on, to delta .

For all file types, you read the files into a DataFrame using the corresponding input format (for example,
parquet , csv , json , and so on) and then write out the data in Delta format. In this code example, the input
files are already in Delta format and are located in Sample datasets (databricks-datasets). This code saves the
data in Delta format in Databricks File System (DBFS) in the location specified by save_path .
Python

# Define the input and output formats and paths and the table name.
read_format = 'delta'
write_format = 'delta'
load_path = '/databricks-datasets/learning-spark-v2/people/people-10m.delta'
save_path = '/tmp/delta/people-10m'
table_name = 'default.people10m'

# Load the data from its source.


people = spark \
.read \
.format(read_format) \
.load(load_path)

# Write the data to its target.


people.write \
.format(write_format) \
.save(save_path)

# Create the table.


spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")

R
library(SparkR)
sparkR.session()

# Define the input and output formats and paths and the table name.
read_format = "delta"
write_format = "delta"
load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
save_path = "/tmp/delta/people-10m/"
table_name = "default.people10m"

# Load the data from its source.


people = read.df(load_path, source = read_format)

# Write the data to its target.


write.df(people, source = write_format, path = save_path)

# Create the table.


sql(paste("CREATE TABLE ", table_name, " USING DELTA LOCATION '", save_path, "'", sep = ""))

Scala

// Define the input and output formats and paths and the table name.
val read_format = "delta"
val write_format = "delta"
val load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
val save_path = "/tmp/delta/people-10m"
val table_name = "default.people10m"

// Load the data from its source.


val people = spark
.read
.format(read_format)
.load(load_path)

// Write the data to its target.


people.write
.format(write_format)
.save(save_path)

// Create the table.


spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")

SQL

-- The path for LOCATION must already exist
-- and must be in Delta format.

CREATE TABLE default.people10m
USING DELTA
LOCATION '/tmp/delta/people-10m'

The preceding operations create a new unmanaged table by using the schema that was inferred from the data.
For unmanaged tables, you control the location of the data. Azure Databricks tracks the table’s name and its
location. For information about available options when you create a Delta table, see Create a table and Write to a
table.
If your source files are in Parquet format, you can use the CONVERT TO DELTA statement to convert files in place.
If the corresponding table is unmanaged, the table remains unmanaged after the conversion:

CONVERT TO DELTA parquet.`/tmp/delta/people-10m`


To create a new managed table, you can use the CREATE TABLE statement to specify a table name, and then you
can load data into the table. Or you can use the saveAsTable method with Python, R, or Scala. For example:
Python

tableName = 'people10m'
sourceType = 'delta'
loadPath = '/databricks-datasets/learning-spark-v2/people/people-10m.delta'

people = spark \
.read \
.format(sourceType) \
.load(loadPath)

people.write \
.format(sourceType) \
.saveAsTable(tableName)

display(spark.sql("SELECT * FROM " + tableName))

R

library(SparkR)
sparkR.session()

tableName = "people10m"
sourceType = "delta"
loadPath = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"

people = read.df(
path = loadPath,
source = sourceType
)

saveAsTable(
df = people,
source = sourceType,
tableName = tableName
)

display(sql(paste("SELECT * FROM ", tableName, sep="")))

Scala

val tableName = "people10m"


val sourceType = "delta"
val loadPath = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"

val people = spark


.read
.format(sourceType)
.load(loadPath)

people.write
.format(sourceType)
.saveAsTable(tableName)

display(spark.sql("SELECT * FROM " + tableName))

SQL
CREATE TABLE people10m USING DELTA AS
SELECT * FROM delta.`/databricks-datasets/learning-spark-v2/people/people-10m.delta`;

SELECT * FROM people10m;

If your source files are in a format that Delta Lake supports, you can use the following shorthand to read from a
file directly, by using the <format>.`<path>` location specifier:
Python

# Define the input format and path.


read_format = 'delta'
load_path = '/databricks-datasets/learning-spark-v2/people/people-10m.delta'

# Load the data from its source.


people = spark.sql("SELECT * FROM " + read_format + ".`" + load_path + "`")

R

library(SparkR)
sparkR.session()

# Define the input format and path.


read_format = "delta"
load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"

# Load the data from its source.


people = sql(paste("SELECT * FROM ", read_format, ".`", load_path, "`", sep=""))

Scala

// Define the input format and path.


val read_format = "delta"
val load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"

// Load the data from its source.


val people = spark.sql("SELECT * FROM " + read_format + ".`" + load_path + "`")

SQL

SELECT * FROM delta.`/databricks-datasets/learning-spark-v2/people/people-10m.delta`

For managed tables, Azure Databricks determines the location for the data. To get the location, you can use the
DESCRIBE DETAIL statement, for example:
Python

display(spark.sql('DESCRIBE DETAIL people10m'))

R

display(sql("DESCRIBE DETAIL people10m"))

Scala
display(spark.sql("DESCRIBE DETAIL people10m"))

SQL

DESCRIBE DETAIL people10m;

See also Create a table and Control data location.


Partition data
To speed up queries that have predicates involving the partition columns, you can partition data. The following
code example is similar to the one in Create a table, but this example partitions the data.
Python

# Define the input and output formats and paths and the table name.
read_format = 'delta'
write_format = 'delta'
load_path = '/databricks-datasets/learning-spark-v2/people/people-10m.delta'
partition_by = 'gender'
save_path = '/tmp/delta/people-10m'
table_name = 'default.people10m'

# Load the data from its source.


people = spark \
.read \
.format(read_format) \
.load(load_path)

# Write the data to its target.


people.write \
.partitionBy(partition_by) \
.format(write_format) \
.save(save_path)

# Create the table.


spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")

If you already ran the Python code example in Create a table, you must first delete the existing table and the
saved data:

# Define the table name and the output path.


table_name = 'default.people10m'
save_path = '/tmp/delta/people-10m'

# Delete the table.


spark.sql("DROP TABLE " + table_name)

# Delete the saved data.


dbutils.fs.rm(save_path, True)

R
library(SparkR)
sparkR.session()

# Define the input and output formats and paths and the table name.
read_format = "delta"
write_format = "delta"
load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
partition_by = "gender"
save_path = "/tmp/delta/people-10m/"
table_name = "default.people10m"

# Load the data from its source.


people = read.df(load_path, source = read_format)

# Write the data to its target.


write.df(people, source = write_format, partitionBy = partition_by, path = save_path)

# Create the table.


sql(paste("CREATE TABLE ", table_name, " USING DELTA LOCATION '", save_path, "'", sep = ""))

If you already ran the R code example in Create a table, you must first delete the existing table and the saved
data:

library(SparkR)
sparkR.session()

# Define the table name and the output path.


table_name = "default.people10m"
save_path = "/tmp/delta/people-10m"

# Delete the table.


sql(paste("DROP TABLE ", table_name, sep = ""))

# Delete the saved data.


dbutils.fs.rm(save_path, TRUE)

Scala

// Define the input and output formats and paths and the table name.
val read_format = "delta"
val write_format = "delta"
val load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
val partition_by = "gender"
val save_path = "/tmp/delta/people-10m"
val table_name = "default.people10m"

// Load the data from its source.


val people = spark
.read
.format(read_format)
.load(load_path)

// Write the data to its target.


people.write
.partitionBy(partition_by)
.format(write_format)
.save(save_path)

// Create the table.


spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")

If you already ran the Scala code example in Create a table, you must first delete the existing table and the saved
data:
// Define the table name and the output path.
val table_name = "default.people10m"
val save_path = "/tmp/delta/people-10m"

// Delete the table.


spark.sql("DROP TABLE " + table_name)

// Delete the saved data.


dbutils.fs.rm(save_path, true)

SQL
To partition data when you create a Delta table using SQL, specify the PARTITIONED BY columns.

-- The path in LOCATION must already exist
-- and must be in Delta format.

CREATE TABLE default.people10m (
id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
ssn STRING,
salary INT
)
USING DELTA
PARTITIONED BY (gender)
LOCATION '/tmp/delta/people-10m'

If you already ran the SQL code example in Create a table, you must first delete the existing table:

DROP TABLE default.people10m

Modify a table
Delta Lake supports a rich set of operations to modify tables.
Stream writes to a table
You can write data into a Delta table using Structured Streaming. The Delta Lake transaction log guarantees
exactly-once processing, even when there are other streams or batch queries running concurrently against the
table. By default, streams run in append mode, which adds new records to the table.
The following code example starts Structured Streaming. It monitors the DBFS location specified in
json_read_path , scanning for JSON files that are uploaded to this location. As Structured Streaming notices a file
upload, it attempts to write the data to the DBFS location specified in save_path by using the schema specified
in read_schema . Structured Streaming continues monitoring for uploaded files until the code is stopped.
Structured Streaming uses the DBFS location specified in checkpoint_path to help ensure that uploaded files are
evaluated only once.
Python
# Define the schema and the input, checkpoint, and output paths.
read_schema = ("id int, " +
"firstName string, " +
"middleName string, " +
"lastName string, " +
"gender string, " +
"birthDate timestamp, " +
"ssn string, " +
"salary int")
json_read_path = '/FileStore/streaming-uploads/people-10m'
checkpoint_path = '/tmp/delta/people-10m/checkpoints'
save_path = '/tmp/delta/people-10m'

people_stream = (spark
.readStream
.schema(read_schema)
.option('maxFilesPerTrigger', 1)
.option('multiline', True)
.format("json")
.load(json_read_path)
)

(people_stream.writeStream
.format('delta')
.outputMode('append')
.option('checkpointLocation', checkpoint_path)
.start(save_path)
)

R

library(SparkR)
sparkR.session()

# Define the schema and the input, checkpoint, and output paths.
read_schema = "id int, firstName string, middleName string, lastName string, gender string, birthDate
timestamp, ssn string, salary int"
json_read_path = "/FileStore/streaming-uploads/people-10m"
checkpoint_path = "/tmp/delta/people-10m/checkpoints"
save_path = "/tmp/delta/people-10m"

people_stream = read.stream(
"json",
path = json_read_path,
schema = read_schema,
multiline = TRUE,
maxFilesPerTrigger = 1
)

write.stream(
people_stream,
path = save_path,
mode = "append",
checkpointLocation = checkpoint_path
)

Scala
// Define the schema and the input, checkpoint, and output paths.
val read_schema = ("id int, " +
"firstName string, " +
"middleName string, " +
"lastName string, " +
"gender string, " +
"birthDate timestamp, " +
"ssn string, " +
"salary int")
val json_read_path = "/FileStore/streaming-uploads/people-10m"
val checkpoint_path = "/tmp/delta/people-10m/checkpoints"
val save_path = "/tmp/delta/people-10m"

val people_stream = (spark


.readStream
.schema(read_schema)
.option("maxFilesPerTrigger", 1)
.option("multiline", true)
.format("json")
.load(json_read_path))

people_stream.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", checkpoint_path)
.start(save_path)

To test this behavior, upload the following conforming JSON file to the location specified in json_read_path , and
then query the location in save_path to see the data written by Structured Streaming.

[
{
"id": 10000021,
"firstName": "Joe",
"middleName": "Alexander",
"lastName": "Smith",
"gender": "M",
"birthDate": 188712000,
"ssn": "123-45-6789",
"salary": 50000
},
{
"id": 10000022,
"firstName": "Mary",
"middleName": "Jane",
"lastName": "Doe",
"gender": "F",
"birthDate": "1968-10-27T04:00:00.000+000",
"ssn": "234-56-7890",
"salary": 75500
}
]
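
For example, after Structured Streaming has processed the uploaded file, the following minimal Python sketch reads the output location (it assumes the save_path variable defined in the preceding Python example):

people = spark.read.format('delta').load(save_path)

display(people)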

For more information about Delta Lake integration with Structured Streaming, see Table streaming reads and
writes and Production considerations for Structured Streaming applications on Azure Databricks. See also the
Structured Streaming Programming Guide on the Apache Spark website.
Batch upserts
To merge a set of updates and insertions into an existing Delta table, you use the MERGE INTO statement. For
example, the following statement takes data from the source table and merges it into the target Delta table.
When there is a matching row in both tables, Delta Lake updates the data column using the given expression.
When there is no matching row, Delta Lake adds a new row. This operation is known as an upsert.
MERGE INTO default.people10m
USING default.people10m_upload
ON default.people10m.id = default.people10m_upload.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

If you specify * , this updates or inserts all columns in the target table. This assumes that the source table has
the same columns as those in the target table, otherwise the query will throw an analysis error.
You must specify a value for every column in your table when you perform an INSERT operation (for example,
when there is no matching row in the existing dataset). However, you do not need to update all values.
To test the preceding example, create the source table as follows:

CREATE TABLE default.people10m_upload (


id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
ssn STRING,
salary INT
) USING DELTA

To test the WHEN MATCHED clause, fill the source table with the following rows, and then run the preceding
MERGE INTO statement. Because both tables have rows that match the ON clause, the target table’s matching
rows are updated.

INSERT INTO default.people10m_upload VALUES


(9999998, 'Billy', 'Tommie', 'Luppitt', 'M', '1992-09-17T04:00:00.000+0000', '953-38-9452', 55250),
(9999999, 'Elias', 'Cyril', 'Leadbetter', 'M', '1984-05-22T04:00:00.000+0000', '906-51-2137', 48500),
(10000000, 'Joshua', 'Chas', 'Broggio', 'M', '1968-07-22T04:00:00.000+0000', '988-61-6247', 90000)

To see the results, query the table.

SELECT * FROM default.people10m WHERE id BETWEEN 9999998 AND 10000000 SORT BY id ASC

To test the WHEN NOT MATCHED clause, fill the source table with the following rows, and then run the preceding
MERGE INTO statement. Because the target table does not have the following rows, these rows are added to the
target table.

INSERT INTO default.people10m_upload VALUES


(20000001, 'John', '', 'Doe', 'M', '1978-01-14T04:00:00.000+0000', '345-67-8901', 55500),
(20000002, 'Mary', '', 'Smith', 'F', '1982-10-29T01:00:00.000+0000', '456-78-9012', 98250),
(20000003, 'Jane', '', 'Doe', 'F', '1981-06-25T04:00:00.000+0000', '567-89-0123', 89900)

To see the results, query the table.

SELECT * FROM default.people10m WHERE id BETWEEN 20000001 AND 20000003 SORT BY id ASC

To run any of the preceding SQL statements in Python, R, or Scala, pass the statement as a string argument to
the spark.sql function in Python or Scala or the sql function in R.
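For example, the following minimal Python sketch runs the preceding MERGE INTO statement:

spark.sql("""
  MERGE INTO default.people10m
  USING default.people10m_upload
  ON default.people10m.id = default.people10m_upload.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
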
Read a table
In this section:
Display table history
Query an earlier version of the table (time travel)
You access data in Delta tables either by specifying the path on DBFS ( "/tmp/delta/people-10m" ) or the table
name ( "default.people10m" ):
Python

people = spark.read.format('delta').load('/tmp/delta/people-10m')

display(people)

or

people = spark.table('default.people10m')

display(people)

R

library(SparkR)
sparkR.session()

people = read.df(path = "/tmp/delta/people-10m", source = "delta")

display(people)

or

library(SparkR)
sparkR.session()

people = tableToDF("default.people10m")

display(people)

Scala

val people = spark.read.format("delta").load("/tmp/delta/people-10m")

display(people)

or

val people = spark.table("default.people10m")

display(people)

SQL

SELECT * FROM delta.`/tmp/delta/people-10m`


or

SELECT * FROM default.people10m

Display table history


To view the history of a table, use the DESCRIBE HISTORY statement, which provides provenance information,
including the table version, operation, user, and so on, for each write to a table.
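For example, a minimal Python sketch that displays the history of the table created earlier:

display(spark.sql("DESCRIBE HISTORY default.people10m"))
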
Query an earlier version of the table (time travel)
Delta Lake time travel allows you to query an older snapshot of a Delta table.
To query an older version of a table, specify a version or timestamp in a SELECT statement. For example, to
query version 0 from the history above, use:
Python

spark.sql('SELECT * FROM default.people10m VERSION AS OF 0')

or

spark.sql("SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'")

R

library(SparkR)
sparkR.session()

sql("SELECT * FROM default.people10m VERSION AS OF 0")

or

library(SparkR)
sparkR.session()

sql("SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'")

Scala

spark.sql("SELECT * FROM default.people10m VERSION AS OF 0")

or

spark.sql("SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'")

SQL

SELECT * FROM default.people10m VERSION AS OF 0

or

SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'


For timestamps, only date or timestamp strings are accepted, for example, "2019-01-01" and
"2019-01-01'T'00:00:00.000Z" .

NOTE
Because version 1 is at timestamp '2019-01-29 00:38:10' , to query version 0 you can use any timestamp in the range
'2019-01-29 00:37:58' to '2019-01-29 00:38:09' inclusive.

DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version
of the table, for example in Python:

df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').load('/tmp/delta/people-10m')

display(df1)

or

df2 = spark.read.format('delta').option('versionAsOf', 2).load('/tmp/delta/people-10m')

display(df2)

For details, see Query an older snapshot of a table (time travel).

Optimize a table
Once you have performed multiple changes to a table, you might have a lot of small files. To improve the speed
of read queries, you can use OPTIMIZE to collapse small files into larger ones:
Python

spark.sql("OPTIMIZE delta.`/tmp/delta/people-10m`")

or

spark.sql('OPTIMIZE default.people10m')

R

library(SparkR)
sparkR.session()

sql("OPTIMIZE delta.`/tmp/delta/people-10m`")

or

library(SparkR)
sparkR.session()

sql("OPTIMIZE default.people10m")

Scala
spark.sql("OPTIMIZE delta.`/tmp/delta/people-10m`")

or

spark.sql("OPTIMIZE default.people10m")

SQL

OPTIMIZE delta.`/tmp/delta/people-10m`

or

OPTIMIZE default.people10m

Z-order by columns
To improve read performance further, you can co-locate related information in the same set of files by Z-
Ordering. This co-locality is automatically used by Delta Lake data-skipping algorithms to dramatically reduce
the amount of data that needs to be read. To Z-Order data, you specify the columns to order on in the
ZORDER BY clause. For example, to co-locate by gender , run:

Python

spark.sql("OPTIMIZE delta.`/tmp/delta/people-10m` ZORDER BY (gender)")

or

spark.sql('OPTIMIZE default.people10m ZORDER BY (gender)')

R

library(SparkR)
sparkR.session()

sql("OPTIMIZE delta.`/tmp/delta/people-10m` ZORDER BY (gender)")

or

library(SparkR)
sparkR.session()

sql("OPTIMIZE default.people10m ZORDER BY (gender)")

Scala

spark.sql("OPTIMIZE delta.`/tmp/delta/people-10m` ZORDER BY (gender)")

or

spark.sql("OPTIMIZE default.people10m ZORDER BY (gender)")

SQL
OPTIMIZE delta.`/tmp/delta/people-10m`
ZORDER BY (gender)

or

OPTIMIZE default.people10m
ZORDER BY (gender)

For the full set of options available when running OPTIMIZE , see Compaction (bin-packing).

Clean up snapshots
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other
users or jobs are querying the table. Eventually however, you should clean up old snapshots. You can do this by
running the VACUUM command:
Python

spark.sql('VACUUM default.people10m')

R

library(SparkR)
sparkR.session()

sql("VACUUM default.people10m")

Scala

spark.sql("VACUUM default.people10m")

SQL

VACUUM default.people10m

You control the age of the latest retained snapshot by using the RETAIN <N> HOURS option:
Python

spark.sql('VACUUM default.people10m RETAIN 24 HOURS')

R

library(SparkR)
sparkR.session()

sql("VACUUM default.people10m RETAIN 24 HOURS")

Scala

spark.sql("VACUUM default.people10m RETAIN 24 HOURS")


SQL

VACUUM default.people10m RETAIN 24 HOURS

For details on using VACUUM effectively, see Remove files no longer referenced by a Delta table.
Introductory notebooks
7/21/2022 • 2 minutes to read

These notebooks show how to load and save data in Delta Lake format, create a Delta table, optimize the
resulting table, and finally use Delta Lake metadata commands to show the table history, format, and details.
To try out Delta Lake, see Quickstart: Run a Spark job on Azure Databricks using the Azure portal.

Delta Lake Quickstart Python notebook


Get notebook

Delta Lake Quickstart R notebook


Get notebook

Delta Lake Quickstart Scala notebook


Get notebook

Delta Lake Quickstart SQL notebook


Get notebook
Ingest data into Delta Lake
7/21/2022 • 8 minutes to read

Azure Databricks offers a variety of ways to help you ingest data into Delta Lake.

Upload CSV files


You can securely create tables from CSV files using the Create Table UI in Databricks SQL.

Partner integrations
Databricks partner integrations enable you to easily load data into Azure Databricks. These integrations enable
low-code, easy-to-implement, and scalable data ingestion from a variety of sources into Azure Databricks. See
the Databricks integrations.

COPY INTO SQL command


The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and
idempotent operation; files in the source location that have already been loaded are skipped.
Use the COPY INTO SQL command instead of Auto Loader when:
You want to load data from a file location that contains files in the order of thousands or fewer.
Your data schema is not expected to evolve frequently.
You plan to load subsets of previously uploaded files.
For a brief overview and demonstration of the COPY INTO SQL command, as well as Auto Loader later in this
article, watch this YouTube video (2 minutes).

The following example shows how to create a Delta table and then use the COPY INTO SQL command to load
sample data from Sample datasets (databricks-datasets) into the table. You can run the example Python, R, Scala,
or SQL code from within a notebook attached to an Azure Databricks cluster. You can also run the SQL code
from within a query associated with a SQL warehouse in Databricks SQL.
Python
table_name = 'default.loan_risks_upload'
source_data = '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet'
source_format = 'PARQUET'

spark.sql("DROP TABLE IF EXISTS " + table_name)

spark.sql("CREATE TABLE " + table_name + " (" \


"loan_id BIGINT, " + \
"funded_amnt INT, " + \
"paid_amnt DOUBLE, " + \
"addr_state STRING)"
)

spark.sql("COPY INTO " + table_name + \


" FROM '" + source_data + "'" + \
" FILEFORMAT = " + source_format
)

loan_risks_upload_data = spark.sql("SELECT * FROM " + table_name)

display(loan_risks_upload_data)

'''
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
'''

R
library(SparkR)
sparkR.session()

table_name = "default.loan_risks_upload"
source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
source_format = "PARQUET"

sql(paste("DROP TABLE IF EXISTS ", table_name, sep = ""))

sql(paste("CREATE TABLE ", table_name, " (",


"loan_id BIGINT, ",
"funded_amnt INT, ",
"paid_amnt DOUBLE, ",
"addr_state STRING)",
sep = ""
))

sql(paste("COPY INTO ", table_name,


" FROM '", source_data, "'",
" FILEFORMAT = ", source_format,
sep = ""
))

loan_risks_upload_data = tableToDF(table_name)

display(loan_risks_upload_data)

# Result:
# +---------+-------------+-----------+------------+
# | loan_id | funded_amnt | paid_amnt | addr_state |
# +=========+=============+===========+============+
# | 0 | 1000 | 182.22 | CA |
# +---------+-------------+-----------+------------+
# | 1 | 1000 | 361.19 | WA |
# +---------+-------------+-----------+------------+
# | 2 | 1000 | 176.26 | TX |
# +---------+-------------+-----------+------------+
# ...

Scala
val table_name = "default.loan_risks_upload"
val source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
val source_format = "PARQUET"

spark.sql("DROP TABLE IF EXISTS " + table_name)

spark.sql("CREATE TABLE " + table_name + " (" +


"loan_id BIGINT, " +
"funded_amnt INT, " +
"paid_amnt DOUBLE, " +
"addr_state STRING)"
)

spark.sql("COPY INTO " + table_name +


" FROM '" + source_data + "'" +
" FILEFORMAT = " + source_format
)

val loan_risks_upload_data = spark.table(table_name)

display(loan_risks_upload_data)

/*
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
*/

SQL

DROP TABLE IF EXISTS default.loan_risks_upload;

CREATE TABLE default.loan_risks_upload (


loan_id BIGINT,
funded_amnt INT,
paid_amnt DOUBLE,
addr_state STRING
);

COPY INTO default.loan_risks_upload


FROM '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet'
FILEFORMAT = PARQUET;

SELECT * FROM default.loan_risks_upload;

-- Result:
-- +---------+-------------+-----------+------------+
-- | loan_id | funded_amnt | paid_amnt | addr_state |
-- +=========+=============+===========+============+
-- | 0 | 1000 | 182.22 | CA |
-- +---------+-------------+-----------+------------+
-- | 1 | 1000 | 361.19 | WA |
-- +---------+-------------+-----------+------------+
-- | 2 | 1000 | 176.26 | TX |
-- +---------+-------------+-----------+------------+
-- ...
To clean up, run the following code, which deletes the table:
Python

spark.sql("DROP TABLE " + table_name)

R

sql(paste("DROP TABLE ", table_name, sep = ""))

Scala

spark.sql("DROP TABLE " + table_name)

SQL

DROP TABLE default.loan_risks_upload

For more examples and details, see


Databricks Runtime 7.x and above: COPY INTO
Databricks Runtime 5.5 LTS and 6.x: Copy Into (Delta Lake on Azure Databricks)

Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any
additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles . Given an input
directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive,
with the option of also processing existing files in that directory.
Use Auto Loader instead of the COPY INTO SQL command when:
You want to load data from a file location that contains files in the order of millions or higher. Auto Loader
can discover files more efficiently than the COPY INTO SQL command and can split file processing into
multiple batches.
Your data schema evolves frequently. Auto Loader provides better support for schema inference and
evolution. See Configuring schema inference and evolution in Auto Loader.
You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to
reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files
while an Auto Loader stream is simultaneously running.
For a brief overview and demonstration of Auto Loader, as well as the COPY INTO SQL command earlier in this
article, watch this YouTube video (2 minutes).

For a longer overview and demonstration of Auto Loader, watch this YouTube video (59 minutes).

The following code example demonstrates how Auto Loader detects new data files as they arrive in cloud
storage. You can run the example code from within a notebook attached to an Azure Databricks cluster.
1. Create the file upload directory, for example:
Python
user_dir = '<my-name>@<my-organization.com>'
upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"

dbutils.fs.mkdirs(upload_path)

Scala

val user_dir = "<my-name>@<my-organization.com>"


val upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"

dbutils.fs.mkdirs(upload_path)

2. Create the following sample CSV files, and then upload them to the file upload directory by using the
DBFS file browser:
WA.csv :

city,year,population
Seattle metro,2019,3406000
Seattle metro,2020,3433000

OR.csv :

city,year,population
Portland metro,2019,2127000
Portland metro,2020,2151000

3. Run the following code to start Auto Loader.


Python

checkpoint_path = '/tmp/delta/population_data/_checkpoints'
write_path = '/tmp/delta/population_data'

# Set up the stream to begin reading incoming files from the


# upload_path location.
df = spark.readStream.format('cloudFiles') \
.option('cloudFiles.format', 'csv') \
.option('header', 'true') \
.schema('city string, year int, population long') \
.load(upload_path)

# Start the stream.


# Use the checkpoint_path location to keep a record of all files that
# have already been uploaded to the upload_path location.
# For those that have been uploaded since the last check,
# write the newly-uploaded files' data to the write_path location.
df.writeStream.format('delta') \
.option('checkpointLocation', checkpoint_path) \
.start(write_path)

Scala
val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
val write_path = "/tmp/delta/population_data"

// Set up the stream to begin reading incoming files from the


// upload_path location.
val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("header", "true")
.schema("city string, year int, population long")
.load(upload_path)

// Start the stream.


// Use the checkpoint_path location to keep a record of all files that
// have already been uploaded to the upload_path location.
// For those that have been uploaded since the last check,
// write the newly-uploaded files' data to the write_path location.
df.writeStream.format("delta")
.option("checkpointLocation", checkpoint_path)
.start(write_path)

4. With the code from step 3 still running, run the following code to query the data in the write directory:
Python

df_population = spark.read.format('delta').load(write_path)

display(df_population)

'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
'''

Scala

val df_population = spark.read.format("delta").load(write_path)

display(df_population)

/* Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
*/
5. With the code from step 3 still running, create the following additional CSV files, and then upload them to
the upload directory by using the DBFS file browser:
ID.csv :

city,year,population
Boise,2019,438000
Boise,2020,447000

MT.csv :

city,year,population
Helena,2019,81653
Helena,2020,82590

Misc.csv :

city,year,population
Seattle metro,2021,3461000
Portland metro,2021,2174000
Boise,2021,455000
Helena,2021,81653

6. With the code from step 3 still running, run the following code to query the existing data in the write
directory, in addition to the new data from the files that Auto Loader has detected in the upload directory
and then written to the write directory:
Python
df_population = spark.read.format('delta').load(write_path)

display(df_population)

'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
'''

Scala
val df_population = spark.read.format("delta").load(write_path)

display(df_population)

/* Result
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
*/

7. To clean up, cancel the running code in step 3, and then run the following code, which deletes the upload,
checkpoint, and write directories:
Python

dbutils.fs.rm(write_path, True)
dbutils.fs.rm(upload_path, True)

Scala

dbutils.fs.rm(write_path, true)
dbutils.fs.rm(upload_path, true)

For more details, see Auto Loader.


Table batch reads and writes
7/21/2022 • 30 minutes to read

Delta Lake supports most of the options provided by Apache Spark DataFrame read and write APIs for
performing batch reads and writes on tables.
For information on Delta Lake SQL commands, see
Databricks Runtime 7.x and above: Delta Lake statements
Databricks Runtime 5.5 LTS and 6.x: SQL reference for Databricks Runtime 5.5 LTS and 6.x

Create a table
Delta Lake supports creating two types of tables—tables defined in the metastore and tables defined by path.
You can create tables in the following ways.
SQL DDL commands : You can use standard SQL DDL commands supported in Apache Spark (for
example, CREATE TABLE and REPLACE TABLE ) to create Delta tables.

CREATE TABLE IF NOT EXISTS default.people10m (


id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
ssn STRING,
salary INT
) USING DELTA

CREATE OR REPLACE TABLE default.people10m (


id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
ssn STRING,
salary INT
) USING DELTA

NOTE
In Databricks Runtime 8.0 and above, Delta Lake is the default format and you don’t need USING DELTA .

In Databricks Runtime 7.0 and above, SQL also supports creating a table at a path without creating an
entry in the Hive metastore.
-- Create or replace table with path
CREATE OR REPLACE TABLE delta.`/tmp/delta/people10m` (
id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
ssn STRING,
salary INT
) USING DELTA

DataFrameWriter API : If you want to simultaneously create a table and insert data into it from Spark
DataFrames or Datasets, you can use the Spark DataFrameWriter (Scala or Java and Python).
Python

# Create table in the metastore using DataFrame's schema and write data to it
df.write.format("delta").saveAsTable("default.people10m")

# Create or replace partitioned table with path using DataFrame's schema and write/overwrite data to it
df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")

Scala

// Create table in the metastore using DataFrame's schema and write data to it
df.write.format("delta").saveAsTable("default.people10m")

// Create table with path using DataFrame's schema and write data to it
df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")

In Databricks Runtime 8.0 and above, Delta Lake is the default format and you don’t need to specify
USING DELTA , format("delta") , or using("delta") .
In Databricks Runtime 7.0 and above, you can also create Delta tables using the Spark
DataFrameWriterV2 API.
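The following is a minimal Python sketch of the DataFrameWriterV2 API. It assumes df is an existing DataFrame, and the table name default.people10m_v2 is a placeholder:

# Create (or replace) a Delta table from the DataFrame's schema and data.
df.writeTo("default.people10m_v2") \
  .using("delta") \
  .createOrReplace()
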
DeltaTableBuilder API : You can also use the DeltaTableBuilder API in Delta Lake to create tables.
Compared to the DataFrameWriter APIs, this API makes it easier to specify additional information like
column comments, table properties, and generated columns.

IMPORTANT
This feature is in Public Preview.

NOTE
This feature is available on Databricks Runtime 8.3 and above.

Python
# Create table in the metastore
DeltaTable.createIfNotExists(spark) \
.tableName("default.people10m") \
.addColumn("id", "INT") \
.addColumn("firstName", "STRING") \
.addColumn("middleName", "STRING") \
.addColumn("lastName", "STRING", comment = "surname") \
.addColumn("gender", "STRING") \
.addColumn("birthDate", "TIMESTAMP") \
.addColumn("ssn", "STRING") \
.addColumn("salary", "INT") \
.execute()

# Create or replace table with path and add properties


DeltaTable.createOrReplace(spark) \
.addColumn("id", "INT") \
.addColumn("firstName", "STRING") \
.addColumn("middleName", "STRING") \
.addColumn("lastName", "STRING", comment = "surname") \
.addColumn("gender", "STRING") \
.addColumn("birthDate", "TIMESTAMP") \
.addColumn("ssn", "STRING") \
.addColumn("salary", "INT") \
.property("description", "table with people data") \
.location("/tmp/delta/people10m") \
.execute()

Scala

// Create table in the metastore


DeltaTable.createOrReplace(spark)
.tableName("default.people10m")
.addColumn("id", "INT")
.addColumn("firstName", "STRING")
.addColumn("middleName", "STRING")
.addColumn(
DeltaTable.columnBuilder("lastName")
.dataType("STRING")
.comment("surname")
.build())
.addColumn("lastName", "STRING", comment = "surname")
.addColumn("gender", "STRING")
.addColumn("birthDate", "TIMESTAMP")
.addColumn("ssn", "STRING")
.addColumn("salary", "INT")
.execute()

// Create or replace table with path and add properties


DeltaTable.createOrReplace(spark)
.addColumn("id", "INT")
.addColumn("firstName", "STRING")
.addColumn("middleName", "STRING")
.addColumn(
DeltaTable.columnBuilder("lastName")
.dataType("STRING")
.comment("surname")
.build())
.addColumn("lastName", "STRING", comment = "surname")
.addColumn("gender", "STRING")
.addColumn("birthDate", "TIMESTAMP")
.addColumn("ssn", "STRING")
.addColumn("salary", "INT")
.property("description", "table with people data")
.location("/tmp/delta/people10m")
.execute()
See the API documentation for details.
See also Create a table.
Partition data
You can partition data to speed up queries or DML that have predicates involving the partition columns. To
partition data when you create a Delta table, specify the columns to partition by. The following example
partitions by gender.
SQL

-- Create table in the metastore


CREATE TABLE default.people10m (
id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
ssn STRING,
salary INT
)
USING DELTA
PARTITIONED BY (gender)

Python

df.write.format("delta").partitionBy("gender").saveAsTable("default.people10m")

DeltaTable.create(spark) \
.tableName("default.people10m") \
.addColumn("id", "INT") \
.addColumn("firstName", "STRING") \
.addColumn("middleName", "STRING") \
.addColumn("lastName", "STRING", comment = "surname") \
.addColumn("gender", "STRING") \
.addColumn("birthDate", "TIMESTAMP") \
.addColumn("ssn", "STRING") \
.addColumn("salary", "INT") \
.partitionedBy("gender") \
.execute()

Scala

df.write.format("delta").partitionBy("gender").saveAsTable("default.people10m")

DeltaTable.createOrReplace(spark)
.tableName("default.people10m")
.addColumn("id", "INT")
.addColumn("firstName", "STRING")
.addColumn("middleName", "STRING")
.addColumn(
DeltaTable.columnBuilder("lastName")
.dataType("STRING")
.comment("surname")
.build())
.addColumn("lastName", "STRING", comment = "surname")
.addColumn("gender", "STRING")
.addColumn("birthDate", "TIMESTAMP")
.addColumn("ssn", "STRING")
.addColumn("salary", "INT")
.partitionedBy("gender")
.execute()
To determine whether a table contains a specific partition, use the statement
SELECT COUNT(*) > 0 FROM <table-name> WHERE <partition-column> = <value> . If the partition exists, true is
returned. For example:
SQL

SELECT COUNT(*) > 0 AS `Partition exists` FROM default.people10m WHERE gender = "M"

Python

display(spark.sql("SELECT COUNT(*) > 0 AS `Partition exists` FROM default.people10m WHERE gender = 'M'"))

Scala

display(spark.sql("SELECT COUNT(*) > 0 AS `Partition exists` FROM default.people10m WHERE gender = 'M'"))

Control data location


For tables defined in the metastore, you can optionally specify the LOCATION as a path. Tables created with a
specified LOCATION are considered unmanaged by the metastore. Unlike a managed table, where no path is
specified, an unmanaged table’s files are not deleted when you DROP the table.
When you run CREATE TABLE with a LOCATION that already contains data stored using Delta Lake, Delta Lake
does the following:
If you specify only the table name and location, for example:

CREATE TABLE default.people10m


USING DELTA
LOCATION '/tmp/delta/people10m'

the table in the metastore automatically inherits the schema, partitioning, and table properties of the
existing data. This functionality can be used to “import” data into the metastore.
If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the
specification exactly matches the configuration of the existing data.

IMPORTANT
If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception
that describes the discrepancy.

NOTE
The metastore is not the source of truth about the latest information of a Delta table. In fact, the table definition in the
metastore may not contain all the metadata like schema and properties. It contains the location of the table, and the
table’s transaction log at the location is the source of truth. If you query the metastore from a system that is not aware of
this Delta-specific customization, you may see incomplete or stale table information.

Use generated columns

IMPORTANT
This feature is in Public Preview.
NOTE
This feature is available on Databricks Runtime 8.3 and above.

Delta Lake supports generated columns which are a special type of columns whose values are automatically
generated based on a user-specified function over other columns in the Delta table. When you write to a table
with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes
the values. For example, you can automatically generate a date column (for partitioning the table by date) from
the timestamp column; any writes into the table need only specify the data for the timestamp column. However,
if you explicitly provide values for them, the values must satisfy the constraint
(<value> <=> <generation expression>) IS TRUE or the write will fail with an error.

IMPORTANT
Tables created with generated columns have a higher table writer protocol version than the default. See Table protocol
versioning to understand table protocol versioning and what it means to have a higher version of a table protocol
version.

The following example shows how to create a table with generated columns:
SQL

CREATE TABLE default.people10m (


id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
dateOfBirth DATE GENERATED ALWAYS AS (CAST(birthDate AS DATE)),
ssn STRING,
salary INT
)
USING DELTA
PARTITIONED BY (gender)

Python

from delta.tables import DeltaTable
from pyspark.sql.types import DateType

DeltaTable.create(spark) \
  .tableName("default.people10m") \
  .addColumn("id", "INT") \
  .addColumn("firstName", "STRING") \
  .addColumn("middleName", "STRING") \
  .addColumn("lastName", "STRING", comment = "surname") \
  .addColumn("gender", "STRING") \
  .addColumn("birthDate", "TIMESTAMP") \
  .addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)") \
  .addColumn("ssn", "STRING") \
  .addColumn("salary", "INT") \
  .partitionedBy("gender") \
  .execute()

Scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.types.DateType

DeltaTable.create(spark)
  .tableName("default.people10m")
  .addColumn("id", "INT")
  .addColumn("firstName", "STRING")
  .addColumn("middleName", "STRING")
  .addColumn(
    DeltaTable.columnBuilder("lastName")
      .dataType("STRING")
      .comment("surname")
      .build())
  .addColumn("gender", "STRING")
  .addColumn("birthDate", "TIMESTAMP")
  .addColumn(
    DeltaTable.columnBuilder("dateOfBirth")
      .dataType(DateType)
      .generatedAlwaysAs("CAST(birthDate AS DATE)")
      .build())
  .addColumn("ssn", "STRING")
  .addColumn("salary", "INT")
  .partitionedBy("gender")
  .execute()

Generated columns are stored as if they were normal columns. That is, they occupy storage.
The following restrictions apply to generated columns:
A generation expression can use any SQL functions in Spark that always return the same result when
given the same argument values, except the following types of functions:
User-defined functions.
Aggregate functions.
Window functions.
Functions returning multiple rows.
For Databricks Runtime 9.1 and above, MERGE operations support generated columns when you set
spark.databricks.delta.schema.autoMerge.enabled to true.

In Databricks Runtime 8.4 and above with Photon support, Delta Lake may be able to generate partition filters
for a query whenever a partition column is defined by one of the following expressions:
CAST(col AS DATE) and the type of col is TIMESTAMP .
YEAR(col) and the type of col is TIMESTAMP .
Two partition columns defined by YEAR(col), MONTH(col) and the type of col is TIMESTAMP .
Three partition columns defined by YEAR(col), MONTH(col), DAY(col) and the type of col is TIMESTAMP .
Four partition columns defined by YEAR(col), MONTH(col), DAY(col), HOUR(col) and the type of col is
TIMESTAMP .
SUBSTRING(col, pos, len) and the type of col is STRING
DATE_FORMAT(col, format) and the type of col is TIMESTAMP .

If a partition column is defined by one of the preceding expressions, and a query filters data using the
underlying base column of a generation expression, Delta Lake looks at the relationship between the base
column and the generated column, and populates partition filters based on the generated partition column if
possible. For example, given the following table:
CREATE TABLE events(
eventId BIGINT,
data STRING,
eventType STRING,
eventTime TIMESTAMP,
eventDate date GENERATED ALWAYS AS (CAST(eventTime AS DATE))
)
USING DELTA
PARTITIONED BY (eventType, eventDate)

If you then run the following query:

SELECT * FROM events
WHERE eventTime >= "2020-10-01 00:00:00" AND eventTime <= "2020-10-01 12:00:00"

Delta Lake automatically generates a partition filter so that the preceding query only reads the data in partition
eventDate=2020-10-01 even if a partition filter is not specified.

As another example, given the following table:

CREATE TABLE events(
  eventId BIGINT,
  data STRING,
  eventType STRING,
  eventTime TIMESTAMP,
  year INT GENERATED ALWAYS AS (YEAR(eventTime)),
  month INT GENERATED ALWAYS AS (MONTH(eventTime)),
  day INT GENERATED ALWAYS AS (DAY(eventTime))
)
USING DELTA
PARTITIONED BY (eventType, year, month, day)

If you then run the following query:

SELECT * FROM events
WHERE eventTime >= "2020-10-01 00:00:00" AND eventTime <= "2020-10-01 12:00:00"

Delta Lake automatically generates a partition filter so that the preceding query only reads the data in partition
year=2020/month=10/day=01 even if a partition filter is not specified.

You can use an EXPLAIN clause and check the provided plan to see whether Delta Lake automatically generates
any partition filters.
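For example, a quick way to check from a notebook (a sketch; it assumes the events table defined above) is to run EXPLAIN through spark.sql and look for the generated partition filter in the plan:
Python

plan = spark.sql("""
  EXPLAIN SELECT * FROM events
  WHERE eventTime >= "2020-10-01 00:00:00" AND eventTime <= "2020-10-01 12:00:00"
""")
# The physical plan should show a partition filter on the generated column (eventDate).
plan.show(truncate=False)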
Use special characters in column names
By default, special characters such as spaces and any of the characters ,;{}()\n\t= are not supported in table
column names. To include these special characters in a table’s column name, enable column mapping.
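As a sketch of what enabling column mapping can look like (the protocol versions shown are an assumption; see Delta column mapping for the authoritative requirements):
Python

# Assumed minimum reader/writer protocol versions for column mapping; verify against Delta column mapping.
spark.sql("""
  ALTER TABLE default.people10m SET TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5'
  )
""")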

Read a table
You can load a Delta table as a DataFrame by specifying a table name or a path:
SQL

SELECT * FROM default.people10m -- query table in the metastore

SELECT * FROM delta.`/tmp/delta/people10m` -- query table by path


Python

spark.table("default.people10m") # query table in the metastore

spark.read.format("delta").load("/tmp/delta/people10m") # query table by path

Scala

spark.table("default.people10m") // query table in the metastore

spark.read.format("delta").load("/tmp/delta/people10m") // create table by path

import io.delta.implicits._
spark.read.delta("/tmp/delta/people10m")

The DataFrame returned automatically reads the most recent snapshot of the table for any query; you never
need to run REFRESH TABLE . Delta Lake automatically uses partitioning and statistics to read the minimum
amount of data when there are applicable predicates in the query.

Query an older snapshot of a table (time travel)


In this section:
Syntax
Examples
Data retention
Delta Lake time travel allows you to query an older snapshot of a Delta table. Time travel has many use cases,
including:
Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This could
be useful for debugging or auditing, especially in regulated industries.
Writing complex temporal queries.
Fixing mistakes in your data.
Providing snapshot isolation for a set of queries for fast changing tables.
This section describes the supported methods for querying older versions of tables, data retention concerns,
and provides examples.
Syntax
This section shows how to query an older version of a Delta table.
In this section:
SQL AS OF syntax
Example
DataFrameReader options
@ syntax

SQL AS OF syntax

SELECT * FROM table_name TIMESTAMP AS OF timestamp_expression


SELECT * FROM table_name VERSION AS OF version

where
timestamp_expression can be any one of:
'2018-10-18T22:15:12.013Z' , that is, a string that can be cast to a timestamp
cast('2018-10-18 13:36:32 CEST' as timestamp)
'2018-10-18' , that is, a date string
In Databricks Runtime 6.6 and above:
current_timestamp() - interval 12 hours
date_sub(current_date(), 1)
Any other expression that is or can be cast to a timestamp
version is a long value that can be obtained from the output of DESCRIBE HISTORY table_spec .

Neither timestamp_expression nor version can be subqueries.


Example

SELECT * FROM default.people10m TIMESTAMP AS OF '2018-10-18T22:15:12.013Z'


SELECT * FROM delta.`/tmp/delta/people10m` VERSION AS OF 123

DataFrameReader options
DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version
of the table.

df1 = spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/tmp/delta/people10m")


df2 = spark.read.format("delta").option("versionAsOf", version).load("/tmp/delta/people10m")

For timestamp_string , only date or timestamp strings are accepted. For example, "2019-01-01" and
"2019-01-01T00:00:00.000Z" .

A common pattern is to use the latest state of the Delta table throughout the execution of an Azure Databricks
job to update downstream applications.
Because Delta tables auto update, a DataFrame loaded from a Delta table may return different results across
invocations if the underlying data is updated. By using time travel, you can fix the data returned by the
DataFrame across invocations:

latest_version = spark.sql("SELECT max(version) FROM (DESCRIBE HISTORY delta.`/tmp/delta/people10m`)").collect()
df = spark.read.format("delta").option("versionAsOf", latest_version[0][0]).load("/tmp/delta/people10m")

@ syntax
You may have a parametrized pipeline, where the input path of your pipeline is a parameter of your job. After
the execution of your job, you may want to reproduce the output some time in the future. In this case, you can
use the @ syntax to specify the timestamp or version. The timestamp must be in yyyyMMddHHmmssSSS format. You
can specify a version after @ by prepending a v to the version. For example, to query version 123 for the
table people10m , specify people10m@v123 .
SQL

SELECT * FROM default.people10m@20190101000000000


SELECT * FROM default.people10m@v123

Python

spark.read.format("delta").load("/tmp/delta/people10m@20190101000000000") # table on 2019-01-01 00:00:00.000


spark.read.format("delta").load("/tmp/delta/people10m@v123") # table on version 123
Examples
Fix accidental deletes to a table for the user 111 :

INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
WHERE userId = 111

Fix accidental incorrect updates to a table:

MERGE INTO my_table target
USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
ON source.userId = target.userId
WHEN MATCHED THEN UPDATE SET *

Query the number of new customers added over the last week.

SELECT count(distinct userId) - (
  SELECT count(distinct userId)
  FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))

Data retention
To time travel to a previous version, you must retain both the log and the data files for that version.
The data files backing a Delta table are never deleted automatically; data files are deleted only when you run
VACUUM. VACUUM does not delete Delta log files; log files are automatically cleaned up after checkpoints are
written.
By default you can time travel to a Delta table up to 30 days old unless you have:
Run VACUUM on your Delta table.
Changed the data or log file retention periods using the following table properties:
delta.logRetentionDuration = "interval <interval>" : controls how long the history for a table is
kept. The default is interval 30 days .
Each time a checkpoint is written, Azure Databricks automatically cleans up log entries older than
the retention interval. If you set this config to a large enough value, many log entries are retained.
This should not impact performance as operations against the log are constant time. Operations
on history are parallel but will become more expensive as the log size increases.
delta.deletedFileRetentionDuration = "interval <interval>" : controls how long ago a file must
have been deleted before being a candidate for VACUUM . The default is interval 7 days .
To access 30 days of historical data even if you run VACUUM on the Delta table, set
delta.deletedFileRetentionDuration = "interval 30 days" . This setting may cause your storage
costs to go up.
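For example, a minimal sketch that sets both retention properties on the table used earlier in this article:
Python

# Retain 30 days of log history and keep deleted files eligible for time travel for 30 days.
spark.sql("""
  ALTER TABLE default.people10m SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 30 days'
  )
""")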

Write to a table
Append
To atomically add new data to an existing Delta table, use append mode:
SQL

INSERT INTO default.people10m SELECT * FROM morePeople


Python

df.write.format("delta").mode("append").save("/tmp/delta/people10m")
df.write.format("delta").mode("append").saveAsTable("default.people10m")

Scala

df.write.format("delta").mode("append").save("/tmp/delta/people10m")
df.write.format("delta").mode("append").saveAsTable("default.people10m")

import io.delta.implicits._
df.write.mode("append").delta("/tmp/delta/people10m")

Overwrite
To atomically replace all the data in a table, use overwrite mode:
SQL

INSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople

Python

df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")
df.write.format("delta").mode("overwrite").saveAsTable("default.people10m")

Scala

df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")
df.write.format("delta").mode("overwrite").saveAsTable("default.people10m")

import io.delta.implicits._
df.write.mode("overwrite").delta("/tmp/delta/people10m")

Using DataFrames, you can also selectively overwrite only the data that matches an arbitrary expression. This
feature is available in Databricks Runtime 9.1 LTS and above. The following command atomically replaces events
in January in the target table, which is partitioned by start_date , with the data in df :
Python

df.write \
.format("delta") \
.mode("overwrite") \
.option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'") \
.save("/tmp/delta/events")

Scala

df.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
.save("/tmp/delta/events")

This sample code writes out the data in df , validates that it all matches the predicate, and performs an atomic
replacement. If you want to write out data that doesn’t all match the predicate and still replace the matching rows in
the target table, you can disable the constraint check by setting
spark.databricks.delta.replaceWhere.constraintCheck.enabled to false:
Python

spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", False)

Scala

spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", false)

In Databricks Runtime 9.0 and below, replaceWhere overwrites data matching a predicate over partition
columns only. The following command atomically replaces records with a birthDate in January in the target table,
which is partitioned by birthDate , with the data in df :
Python

df.write \
.format("delta") \
.mode("overwrite") \
.option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'") \
.save("/tmp/delta/people10m")

Scala

df.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
.save("/tmp/delta/people10m")

In Databricks Runtime 9.1 and above, if you want to fall back to the old behavior, you can disable the
spark.databricks.delta.replaceWhere.dataColumns.enabled flag:

Python

spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", False)

Scala

spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)

Dynamic Partition Overwrites

IMPORTANT
This feature is in Public Preview.

Databricks Runtime 11.1 and above supports dynamic partition overwrite mode for partitioned tables.
When in dynamic partition overwrite mode, we overwrite all existing data in each logical partition for which the
write will commit new data. Any existing logical partitions for which the write does not contain data will remain
unchanged. This mode is only applicable when data is being written in overwrite mode: either INSERT OVERWRITE
in SQL, or a DataFrame write with df.write.mode("overwrite") .
Configure dynamic partition overwrite mode by setting the Spark session configuration
spark.sql.sources.partitionOverwriteMode to dynamic . You can also enable this by setting the DataFrameWriter
option partitionOverwriteMode to dynamic . If present, the query-specific option overrides the mode defined in
the session configuration. The default for partitionOverwriteMode is static .
SQL

SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople;

Python

df.write \
.format("delta") \
.mode("overwrite") \
.option("partitionOverwriteMode", "dynamic") \
.saveAsTable("default.people10m")

Scala

df.write
.format("delta")
.mode("overwrite")
.option("partitionOverwriteMode", "dynamic")
.saveAsTable("default.people10m")

NOTE
Dynamic partition overwrite conflicts with the option replaceWhere for partitioned tables.
If dynamic partition overwrite is enabled in the Spark session configuration, and replaceWhere is provided as a
DataFrameWriter option, then Delta Lake overwrites the data according to the replaceWhere expression (query-
specific options override session configurations).
You’ll receive an error if the DataFrameWriter options have both dynamic partition overwrite and replaceWhere
enabled.

IMPORTANT
Validate that the data written with dynamic partition overwrite touches only the expected partitions. A single row in the
incorrect partition can lead to unintentionally overwriting an entire partition. We recommend using replaceWhere to
specify which data to overwrite.
If a partition has been accidentally overwritten, you can use the steps in Find the last commit’s version in the Spark session to undo
the change.

For Delta Lake support for updating tables, see Table deletes, updates, and merges.
Limit rows written in a file
You can use the SQL session configuration spark.sql.files.maxRecordsPerFile to specify the maximum number
of records to write to a single file for a Delta Lake table. Specifying a value of zero or a negative value represents
no limit.
In Databricks Runtime 10.5 and above, you can also use the DataFrameWriter option maxRecordsPerFile when
using the DataFrame APIs to write to a Delta Lake table. When maxRecordsPerFile is specified, the value of the
SQL session configuration spark.sql.files.maxRecordsPerFile is ignored.
Python
df.write.format("delta") \
.mode("append") \
.option("maxRecordsPerFile", "10000") \
.save("/tmp/delta/people10m")

Scala

df.write.format("delta")
.mode("append")
.option("maxRecordsPerFile", "10000")
.save("/tmp/delta/people10m")

Idempotent writes
Sometimes a job that writes data to a Delta table is restarted due to various reasons (for example, job
encounters a failure). The failed job may or may not have written the data to Delta table before terminating. In
the case where the data is written to the Delta table, the restarted job writes the same data to the Delta table
which results in duplicate data.
To address this, Delta tables support the following DataFrameWriter options to make the writes idempotent:
txnAppId : A unique string that you can pass on each DataFrame write. For example, this can be the name of
the job.
txnVersion : A monotonically increasing number that acts as transaction version. This number needs to be
unique for data that is being written to the Delta table(s). For example, this can be the epoch seconds of the
instant when the query is attempted for the first time. Any subsequent restarts of the same job need to have
the same value for txnVersion .
The combination of these options needs to be unique for each new batch of data being ingested into the Delta
table, and the txnVersion needs to be higher than that of the last data that was ingested into the Delta table. For
example:
The last successfully written data contains option values as dailyETL:23423 ( txnAppId:txnVersion ).
The next write of data should have txnAppId = dailyETL and txnVersion as at least 23424 (one more than the
last written data’s txnVersion ).
Any attempt to write data with txnAppId = dailyETL and txnVersion as 23422 or less is ignored because the
txnVersion is less than the last recorded txnVersion in the table.
An attempt to write data with txnAppId:txnVersion as anotherETL:23424 succeeds, because it contains a
different txnAppId than the last ingested data.

WARNING
This solution assumes that the data being written to the Delta table(s) in multiple retries of the job is the same. If a write attempt
in a Delta table succeeds but, due to some downstream failure, there is a second write attempt with the same txn options but
different data, then that second write attempt is ignored. This can cause unexpected results.

Example
Python

app_id = ... # A unique string that is used as an application ID.


version = ... # A monotonically increasing number that acts as transaction version.

dataFrame.write.format(...).option("txnVersion", version).option("txnAppId", app_id).save(...)

Scala

val appId = ... // A unique string that is used as an application ID.
val version = ... // A monotonically increasing number that acts as transaction version.

dataFrame.write.format(...).option("txnVersion", version).option("txnAppId", appId).save(...)

Set user-defined commit metadata


You can specify user-defined strings as metadata in commits made by these operations, either using the
DataFrameWriter option userMetadata or the SparkSession configuration
spark.databricks.delta.commitInfo.userMetadata . If both of them have been specified, then the option takes
preference. This user-defined metadata is readable in the history operation.
SQL

SET spark.databricks.delta.commitInfo.userMetadata=overwritten-for-fixing-incorrect-data
INSERT OVERWRITE default.people10m SELECT * FROM morePeople

Python

df.write.format("delta") \
.mode("overwrite") \
.option("userMetadata", "overwritten-for-fixing-incorrect-data") \
.save("/tmp/delta/people10m")

Scala

df.write.format("delta")
.mode("overwrite")
.option("userMetadata", "overwritten-for-fixing-incorrect-data")
.save("/tmp/delta/people10m")

Schema validation
Delta Lake automatically validates that the schema of the DataFrame being written is compatible with the
schema of the table. Delta Lake uses the following rules to determine whether a write from a DataFrame to a
table is compatible:
All DataFrame columns must exist in the target table. If there are columns in the DataFrame not present in
the table, an exception is raised. Columns present in the table but not in the DataFrame are set to null.
DataFrame column data types must match the column data types in the target table. If they don’t match, an
exception is raised.
DataFrame column names cannot differ only by case. This means that you cannot have columns such as
“Foo” and “foo” defined in the same table. While you can use Spark in case sensitive or insensitive (default)
mode, Parquet is case sensitive when storing and returning column information. Delta Lake is case-
preserving but insensitive when storing the schema and has this restriction to avoid potential mistakes, data
corruption, or loss issues.
Delta Lake supports DDL to add new columns explicitly and the ability to update the schema automatically.
If you specify other options, such as partitionBy , in combination with append mode, Delta Lake validates that
they match and throws an error for any mismatch. When partitionBy is not present, appends automatically
follow the partitioning of the existing data.
NOTE
In Databricks Runtime 7.0 and above, INSERT syntax provides schema enforcement and supports schema evolution. If a
column’s data type cannot be safely cast to your Delta Lake table’s data type, then a runtime exception is thrown. If
schema evolution is enabled, new columns can exist as the last columns of your schema (or nested columns) for the
schema to evolve.

For more information about enforcing and evolving schemas in Delta Lake, watch this YouTube video (55
minutes).

Update table schema


Delta Lake lets you update the schema of a table. The following types of changes are supported:
Adding new columns (at arbitrary positions)
Reordering existing columns
Renaming existing columns
You can make these changes explicitly using DDL or implicitly using DML.

IMPORTANT
When you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue
you must restart it.

For recommended methods, see Production considerations for Structured Streaming applications on Azure
Databricks.
Explicitly update schema
You can use the following DDL to explicitly change the schema of a table.
Add columns

ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)

By default, nullability is true .


To add a column to a nested field, use:

ALTER TABLE table_name ADD COLUMNS (col_name.nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name], ...)

Example

If the schema before running ALTER TABLE boxes ADD COLUMNS (colB.nested STRING AFTER field1) is:

- root
| - colA
| - colB
|   +-field1
|   +-field2

the schema after is:


- root
| - colA
| - colB
|   +-field1
|   +-nested
|   +-field2

NOTE
Adding nested columns is supported only for structs. Arrays and maps are not supported.

Change column comment or ordering

ALTER TABLE table_name ALTER [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name]

To change a column in a nested field, use:

ALTER TABLE table_name ALTER [COLUMN] col_name.nested_col_name nested_col_name data_type [COMMENT
col_comment] [FIRST|AFTER colA_name]

Example

If the schema before running ALTER TABLE boxes CHANGE COLUMN colB.field2 field2 STRING FIRST is:

- root
| - colA
| - colB
|   +-field1
|   +-field2

the schema after is:

- root
| - colA
| - colB
|   +-field2
|   +-field1

Replace columns

ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)

Example

When running the following DDL:

ALTER TABLE boxes REPLACE COLUMNS (colC STRING, colB STRUCT<field2:STRING, nested:STRING, field1:STRING>,
colA STRING)

if the schema before is:


- root
| - colA
| - colB
|   +-field1
|   +-field2

the schema after is:

- root
| - colC
| - colB
|   +-field2
|   +-nested
|   +-field1
| - colA

Rename columns

IMPORTANT
This feature is in Public Preview.

NOTE
This feature is available in Databricks Runtime 10.2 and above.

To rename columns without rewriting any of the columns’ existing data, you must enable column mapping for
the table. See Delta column mapping.
To rename a column:

ALTER TABLE table_name RENAME COLUMN old_col_name TO new_col_name

To rename a nested field:

ALTER TABLE table_name RENAME COLUMN col_name.old_nested_field TO new_nested_field

Example

When you run the following command:

ALTER TABLE boxes RENAME COLUMN colB.field1 TO field001

If the schema before is:

- root
| - colA
| - colB
|   +-field1
|   +-field2

Then the schema after is:


- root
| - colA
| - colB
|   +-field001
|   +-field2

See Delta column mapping.


Drop columns

IMPORTANT
This feature is in Public Preview.

NOTE
This feature is available in Databricks Runtime 11.0 and above.

To drop columns as a metadata-only operation without rewriting any data files, you must enable column
mapping for the table. See Delta column mapping.

IMPORTANT
Dropping a column from metadata does not delete the underlying data for the column in files. To purge the dropped
column data, you can use REORG TABLE to rewrite files. You can then use VACUUM to physically delete the files that
contain the dropped column data.

To drop a column:

ALTER TABLE table_name DROP COLUMN col_name

To drop multiple columns:

ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2)

Change column type or name


You can change a column’s type or name or drop a column by rewriting the table. To do this, use the
overwriteSchema option:

Change a column type

from pyspark.sql.functions import col

spark.read.table(...) \
  .withColumn("birthDate", col("birthDate").cast("date")) \
  .write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .saveAsTable(...)

Change a column name
spark.read.table(...) \
.withColumnRenamed("dateOfBirth", "birthDate") \
.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable(...)

Automatic schema update


Delta Lake can automatically update the schema of a table as part of a DML transaction (either appending or
overwriting), and make the schema compatible with the data being written.
Add columns
Columns that are present in the DataFrame but missing from the table are automatically added as part of a
write transaction when:
write or writeStream have .option("mergeSchema", "true")
spark.databricks.delta.schema.autoMerge.enabled is true

When both options are specified, the option from the DataFrameWriter takes precedence. The added columns
are appended to the end of the struct they are present in. Case is preserved when appending a new column.
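For example, a minimal sketch of an append that merges any new DataFrame columns into the table schema:
Python

df.write.format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save("/tmp/delta/people10m")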

NOTE
mergeSchema is not supported when table access control is enabled (as it elevates a request that requires MODIFY to
one that requires ALL PRIVILEGES ).
mergeSchema cannot be used with INSERT INTO or .write.insertInto() .

NullType columns
Because Parquet doesn’t support NullType , NullType columns are dropped from the DataFrame when writing
into Delta tables, but are still stored in the schema. When a different data type is received for that column, Delta
Lake merges the schema to the new data type. If Delta Lake receives a NullType for an existing column, the old
schema is retained and the new column is dropped during the write.
NullType in streaming is not supported. Since you must set schemas when using streaming this should be very
rare. NullType is also not accepted for complex types such as ArrayType and MapType .

Replace table schema


By default, overwriting the data in a table does not overwrite the schema. When overwriting a table using
mode("overwrite") without replaceWhere , you may still want to overwrite the schema of the data being written.
You replace the schema and partitioning of the table by setting the overwriteSchema option to true :

df.write.option("overwriteSchema", "true")

Views on tables
Delta Lake supports the creation of views on top of Delta tables just like you might with a data source table.
These views integrate with table access control to allow for column and row level security.
The core challenge when you operate with views is resolving the schemas. If you alter a Delta table schema, you
must recreate derivative views to account for any additions to the schema. For instance, if you add a new
column to a Delta table, you must make sure that this column is available in the appropriate views built on top
of that base table.
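For example, a simple view over the table used earlier in this article (the view name and column list are illustrative):
Python

spark.sql("""
  CREATE OR REPLACE VIEW default.people10m_by_gender AS
  SELECT gender, COUNT(*) AS num_people
  FROM default.people10m
  GROUP BY gender
""")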

Table properties
You can store your own metadata as a table property using TBLPROPERTIES in CREATE and ALTER . You can then
SHOW that metadata. For example:

ALTER TABLE default.people10m SET TBLPROPERTIES ('department' = 'accounting', 'delta.appendOnly' = 'true');

-- Show the table's properties.


SHOW TBLPROPERTIES default.people10m;

-- Show just the 'department' table property.


SHOW TBLPROPERTIES default.people10m ('department');

TBLPROPERTIES are stored as part of Delta table metadata. You cannot define new TBLPROPERTIES in a CREATE
statement if a Delta table already exists in a given location.
In addition, to tailor behavior and performance, Delta Lake supports certain Delta table properties:
Block deletes and updates in a Delta table: delta.appendOnly=true .
Configure the time travel retention properties: delta.logRetentionDuration=<interval-string> and
delta.deletedFileRetentionDuration=<interval-string> . For details, see Data retention.
Configure the number of columns for which statistics are collected: delta.dataSkippingNumIndexedCols=n . This
property indicates to the writer that statistics are to be collected only for the first n columns in the table.
The data skipping code also ignores statistics for any column beyond this column index. This property takes
effect only for new data that is written out.

NOTE
Modifying a Delta table property is a write operation that will conflict with other concurrent write operations, causing
them to fail. We recommend that you modify a table property only when there are no concurrent write operations on
the table.

You can also set delta. -prefixed properties during the first commit to a Delta table using Spark configurations.
For example, to initialize a Delta table with the property delta.appendOnly=true , set the Spark configuration
spark.databricks.delta.properties.defaults.appendOnly to true :

SQL

spark.sql("SET spark.databricks.delta.properties.defaults.appendOnly = true")

Python

spark.conf.set("spark.databricks.delta.properties.defaults.appendOnly", "true")

Scala

spark.conf.set("spark.databricks.delta.properties.defaults.appendOnly", "true")

See also the Delta table properties reference.


Table metadata
Delta Lake has rich features for exploring table metadata.
It supports SHOW [PARTITIONS | COLUMNS] and DESCRIBE TABLE . See
Databricks Runtime 7.x and above: SHOW PARTITIONS, SHOW COLUMNS, DESCRIBE TABLE
Databricks Runtime 5.5 LTS and 6.x: Show Partitions, Show Columns, Describe Table
It also provides the following unique commands:
DESCRIBE DETAIL
DESCRIBE HISTORY

DESCRIBE DETAIL

Provides information about schema, partitioning, table size, and so on. For details, see Retrieve Delta table
details.
DESCRIBE HISTORY

Provides provenance information, including the operation, user, and so on, and operation metrics for each write
to a table. Table history is retained for 30 days. For details, see Retrieve Delta table history.
The Explore and create tables with the Data tab page provides a visual view of this detailed table information and
history for Delta tables. In addition to the table schema and sample data, you can click the History tab to see
the table history that displays with DESCRIBE HISTORY .

Configure storage credentials


Delta Lake uses Hadoop FileSystem APIs to access the storage systems. The credentials for storage systems
can usually be set through Hadoop configurations. Delta Lake provides multiple ways to set Hadoop
configurations, similar to Apache Spark.
Spark configurations
When you start a Spark application on a cluster, you can set Spark configurations in the form
spark.hadoop.* to pass your custom Hadoop configurations. For example, setting a value for
spark.hadoop.a.b.c passes the value as the Hadoop configuration a.b.c , and Delta Lake uses it to access
Hadoop FileSystem APIs.
See __ for more details.
SQL session configurations
Spark SQL will pass all of the current SQL session configurations to Delta Lake, and Delta Lake will use them to
access Hadoop FileSystem APIs. For example, SET a.b.c=x.y.z will tell Delta Lake to pass the value x.y.z as a
Hadoop configuration a.b.c , and Delta Lake will use it to access Hadoop FileSystem APIs.
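For example (a sketch based on the behavior described above; the storage account name and key are placeholders), a SQL session configuration can carry a storage key that Delta Lake then uses for file system access:
Python

spark.sql("SET fs.azure.account.key.<storage-account-name>.dfs.core.windows.net=<storage-account-access-key>")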
DataFrame options
Besides setting Hadoop file system configurations through the Spark (cluster) configurations or SQL session
configurations, Delta supports reading Hadoop file system configurations from DataFrameReader and
DataFrameWriter options (that is, option keys that start with the fs. prefix) when the table is read or written, by
using DataFrameReader.load(path) or DataFrameWriter.save(path) .

NOTE
This feature is available in Databricks Runtime 10.1 and above.

For example, you can pass your storage credentials through DataFrame options:
Python

df1 = spark.read.format("delta") \
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
1>") \
.read("...")
df2 = spark.read.format("delta") \
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
2>") \
.read("...")
df1.union(df2).write.format("delta") \
.mode("overwrite") \
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
3>") \
.save("...")

Scala

val df1 = spark.read.format("delta")


.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
1>")
.read("...")
val df2 = spark.read.format("delta")
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
2>")
.read("...")
df1.union(df2).write.format("delta")
.mode("overwrite")
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
3>")
.save("...")

You can find the details of the Hadoop file system configurations for your storage in Data sources.

Notebook
For an example of the various Delta table metadata commands, see the end of the following notebook:
Delta Lake batch commands notebook
Get notebook
Table streaming reads and writes

Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream . Delta
Lake overcomes many of the limitations typically associated with streaming systems and files, including:
Coalescing small files produced by low latency ingest
Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
Efficiently discovering which files are new when using files as the source for a stream
See also Production considerations for Structured Streaming applications on Azure Databricks.

Delta table as a source


When you load a Delta table as a stream source and use it in a streaming query, the query processes all of the
data present in the table as well as any new data that arrives after the stream is started.
You can load both paths and tables as a stream.

spark.readStream.format("delta")
.load("/tmp/delta/events")

import io.delta.implicits._
spark.readStream.delta("/tmp/delta/events")

or

import io.delta.implicits._

spark.readStream.format("delta").table("events")

In this section:
Limit input rate
Ignore updates and deletes
Specify initial position
Limit input rate
The following options are available to control micro-batches:
maxFilesPerTrigger : How many new files to be considered in every micro-batch. The default is 1000.
maxBytesPerTrigger : How much data gets processed in each micro-batch. This option sets a “soft max”,
meaning that a batch processes approximately this amount of data and may process more than the limit in
order to make the streaming query move forward in cases when the smallest input unit is larger than this
limit. If you use Trigger.Once for your streaming, this option is ignored. This is not set by default.
If you use maxBytesPerTrigger in conjunction with maxFilesPerTrigger , the micro-batch processes data until
either the maxFilesPerTrigger or maxBytesPerTrigger limit is reached.
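For example, a sketch that caps each micro-batch at both a file count and an approximate byte size:
Python

(spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", 500)    # at most 500 new files per micro-batch
  .option("maxBytesPerTrigger", "10g")  # soft cap of roughly 10 GB per micro-batch
  .load("/tmp/delta/events"))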
NOTE
In cases when the source table transactions are cleaned up due to the logRetentionDuration configuration and the
stream lags in processing, Delta Lake processes the data corresponding to the latest available transaction history of the
source table but does not fail the stream. This can result in data being dropped.

Ignore updates and deletes


Structured Streaming does not handle input that is not an append and throws an exception if any modifications
occur on the table being used as a source. There are two main strategies for dealing with changes that cannot be
automatically propagated downstream:
You can delete the output and checkpoint and restart the stream from the beginning.
You can set either of these two options:
ignoreDeletes : ignore transactions that delete data at partition boundaries.
ignoreChanges : re-process updates if files had to be rewritten in the source table due to a data
changing operation such as UPDATE , MERGE INTO , DELETE (within partitions), or OVERWRITE .
Unchanged rows may still be emitted, therefore your downstream consumers should be able to
handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes .
Therefore if you use ignoreChanges , your stream will not be disrupted by either deletions or updates
to the source table.
Example
For example, suppose you have a table user_events with date , user_email , and action columns that is
partitioned by date . You stream out of the user_events table and you need to delete data from it due to GDPR.
When you delete at partition boundaries (that is, the WHERE is on a partition column), the files are already
segmented by value so the delete just drops those files from the metadata. Thus, if you just want to delete data
from some partitions, you can use:

spark.readStream.format("delta")
.option("ignoreDeletes", "true")
.load("/tmp/delta/user_events")

However, if you have to delete data based on user_email , then you will need to use:

spark.readStream.format("delta")
.option("ignoreChanges", "true")
.load("/tmp/delta/user_events")

If you update a user_email with the UPDATE statement, the file containing the user_email in question is
rewritten. When you use ignoreChanges , the new record is propagated downstream with all other unchanged
records that were in the same file. Your logic should be able to handle these incoming duplicate records.
Specify initial position

NOTE
This feature is available on Databricks Runtime 7.3 LTS and above.

You can use the following options to specify the starting point of the Delta Lake streaming source without
processing the entire table.
startingVersion : The Delta Lake version to start from. All table changes starting from this version
(inclusive) will be read by the streaming source. You can obtain the commit versions from the version
column of the DESCRIBE HISTORY command output.
In Databricks Runtime 7.4 and above, to return only the latest changes, specify latest .
startingTimestamp : The timestamp to start from. All table changes committed at or after the timestamp
(inclusive) will be read by the streaming source. One of:
A timestamp string. For example, "2019-01-01T00:00:00.000Z" .
A date string. For example, "2019-01-01" .
You cannot set both options at the same time; you can use only one of them. They take effect only when starting
a new streaming query. If a streaming query has started and the progress has been recorded in its checkpoint,
these options are ignored.

IMPORTANT
Although you can start the streaming source from a specified version or timestamp, the schema of the streaming source
is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table
after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the
data with an incorrect schema.

Example
For example, suppose you have a table user_events . If you want to read changes since version 5, use:

spark.readStream.format("delta")
.option("startingVersion", "5")
.load("/tmp/delta/user_events")

If you want to read changes since 2018-10-18, use:

spark.readStream.format("delta")
.option("startingTimestamp", "2018-10-18")
.load("/tmp/delta/user_events")

Delta table as a sink


You can also write data into a Delta table using Structured Streaming. The transaction log enables Delta Lake to
guarantee exactly-once processing, even when there are other streams or batch queries running concurrently
against the table.

NOTE
The Delta Lake VACUUM function removes all files not managed by Delta Lake but skips any directories that begin with
_ . You can safely store checkpoints alongside other data and metadata for a Delta table using a directory structure such
as <table_name>/_checkpoints .

In this section:
Metrics
Append mode
Complete mode
Metrics
NOTE
Available in Databricks Runtime 8.1 and above.

You can find out the number of bytes and the number of files yet to be processed in a streaming query's progress as
the numBytesOutstanding and numFilesOutstanding metrics. If you are running the stream in a notebook, you can
see these metrics under the Raw Data tab in the streaming query progress dashboard:

{
"sources" : [
{
"description" : "DeltaSource[file:/path/to/source]",
"metrics" : {
"numBytesOutstanding" : "3456",
"numFilesOutstanding" : "8"
},
}
]
}

Append mode
By default, streams run in append mode, which adds new records to the table.
You can use the path method:
Python

(events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
  .start("/tmp/delta/events")
)

Scala

events.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
.start("/tmp/delta/events")

import io.delta.implicits._
events.writeStream
.outputMode("append")
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
.delta("/tmp/delta/events")

or the toTable method in Spark 3.1 and higher (Databricks Runtime 8.3 and above), as follows. (In Spark
versions before 3.1 (Databricks Runtime 8.2 and below), use the table method instead.)
Python

(events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
  .toTable("events")
)

Scala
events.writeStream
.outputMode("append")
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
.toTable("events")

Complete mode
You can also use Structured Streaming to replace the entire table with every batch. One example use case is to
compute a summary using aggregation:
Python

(spark.readStream
.format("delta")
.load("/tmp/delta/events")
.groupBy("customerId")
.count()
.writeStream
.format("delta")
.outputMode("complete")
.option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
.start("/tmp/delta/eventsByCustomer")
)

Scala

spark.readStream
.format("delta")
.load("/tmp/delta/events")
.groupBy("customerId")
.count()
.writeStream
.format("delta")
.outputMode("complete")
.option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
.start("/tmp/delta/eventsByCustomer")

The preceding example continuously updates a table that contains the aggregate number of events by customer.
For applications with more lenient latency requirements, you can save computing resources with one-time
triggers. Use these to update summary aggregation tables on a given schedule, processing only new data that
has arrived since the last update.
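For example, a sketch of the same aggregation run with a one-time trigger, which processes only the data that has arrived since the last run and then stops:
Python

(spark.readStream
  .format("delta")
  .load("/tmp/delta/events")
  .groupBy("customerId")
  .count()
  .writeStream
  .format("delta")
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
  .trigger(once=True)  # process available data once, then stop
  .start("/tmp/delta/eventsByCustomer")
)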

Idempotent table writes in foreachBatch

NOTE
Available in Databricks Runtime 8.4 and above.

The command foreachBatch allows you to specify a function that is executed on the output of every micro-batch
after arbitrary transformations in the streaming query. This allows implementing a foreachBatch function that
can write the micro-batch output to one or more target Delta table destinations. However, foreachBatch does
not make those writes idempotent as those write attempts lack the information of whether the batch is being re-
executed or not. For example, rerunning a failed batch could result in duplicate data writes.
To address this, Delta tables support the following DataFrameWriter options to make the writes idempotent:
txnAppId : A unique string that you can pass on each DataFrame write. For example, you can use the
StreamingQuery ID as txnAppId .
txnVersion : A monotonically increasing number that acts as transaction version.

Delta table uses the combination of txnAppId and txnVersion to identify duplicate writes and ignore them.
If a batch write is interrupted with a failure, rerunning the batch uses the same application and batch ID, which
would help the runtime correctly identify duplicate writes and ignore them. Application ID ( txnAppId ) can be
any user-generated unique string and does not have to be related to the stream ID.

WARNING
If you delete the streaming checkpoint and restart the query with a new checkpoint, you must provide a different appId
; otherwise, writes from the restarted query will be ignored because it will contain the same txnAppId and the batch ID
would start from 0.

The same DataFrameWriter options can be used to achieve idempotent writes in non-streaming jobs. For
details, see Idempotent writes.
Example
Python

app_id = ... # A unique string that is used as an application ID.

def writeToDeltaLakeTableIdempotent(batch_df, batch_id):
  batch_df.write.format(...).option("txnVersion", batch_id).option("txnAppId", app_id).save(...) # location 1
  batch_df.write.format(...).option("txnVersion", batch_id).option("txnAppId", app_id).save(...) # location 2

Scala

val appId = ... // A unique string that is used as an application ID.


streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.write.format(...).option("txnVersion", batchId).option("txnAppId", appId).save(...) // location 1
batchDF.write.format(...).option("txnVersion", batchId).option("txnAppId", appId).save(...) // location 2
}

Performing stream-static joins


You can rely on the transactional guarantees and versioning protocol of Delta Lake to perform stream-static
joins. A stream-static join joins the latest valid version of a Delta table (the static data) to a data stream using a
stateless join.
When Azure Databricks processes a micro-batch of data in a stream-static join, the latest valid version of data
from the static Delta table joins with the records present in the current micro-batch. Because the join is stateless,
you do not need to configure watermarking and can process results with low latency. The data in the static Delta
table used in the join should be slowly-changing.
streamingDF = spark.readStream.table("orders")
staticDF = spark.read.table("customers")

query = (streamingDF
.join(staticDF, streamingDF.customer_id==staticDF.id, "inner")
.writeStream
.option("checkpointLocation", checkpoint_path)
.table("orders_with_customer_info")
)
Table deletes, updates, and merges

Delta Lake supports several statements to facilitate deleting data from and updating data in Delta tables.
For an overview and demonstration of deleting and updating data in Delta Lake, watch this YouTube video (54
minutes).

For additional information about capturing change data from Delta Lake, watch this YouTube video (53 minutes).

Delete from a table


You can remove data that matches a predicate from a Delta table. For instance, in a table named people10m or a
path at /tmp/delta/people-10m , to delete all rows corresponding to people with a value in the birthDate column
from before 1955 , you can run the following:
SQL

DELETE FROM people10m WHERE birthDate < '1955-01-01'

DELETE FROM delta.`/tmp/delta/people-10m` WHERE birthDate < '1955-01-01'

Python

NOTE
The Python API is available in Databricks Runtime 6.1 and above.

from delta.tables import *


from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, '/tmp/delta/people-10m')

# Declare the predicate by using a SQL-formatted string.


deltaTable.delete("birthDate < '1955-01-01'")

# Declare the predicate by using Spark SQL functions.


deltaTable.delete(col('birthDate') < '1955-01-01')

Scala

NOTE
The Scala API is available in Databricks Runtime 6.0 and above.
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/people-10m")

// Declare the predicate by using a SQL-formatted string.


deltaTable.delete("birthDate < '1955-01-01'")

import org.apache.spark.sql.functions._
import spark.implicits._

// Declare the predicate by using Spark SQL functions and implicits.


deltaTable.delete(col("birthDate") < "1955-01-01")

Java

NOTE
The Java API is available in Databricks Runtime 6.0 and above.

import io.delta.tables.*;
import org.apache.spark.sql.functions;

DeltaTable deltaTable = DeltaTable.forPath(spark, "/tmp/delta/people-10m");

// Declare the predicate by using a SQL-formatted string.


deltaTable.delete("birthDate < '1955-01-01'");

// Declare the predicate by using Spark SQL functions.


deltaTable.delete(functions.col("birthDate").lt(functions.lit("1955-01-01")));

See the Delta Lake APIs for details.

IMPORTANT
delete removes the data from the latest version of the Delta table but does not remove it from the physical storage
until the old versions are explicitly vacuumed. See vacuum for details.

TIP
When possible, provide predicates on the partition columns for a partitioned Delta table as such predicates can
significantly speed up the operation.

Update a table
You can update data that matches a predicate in a Delta table. For example, in a table named people10m or a
path at /tmp/delta/people-10m , to change an abbreviation in the gender column from M or F to Male or
Female , you can run the following:

SQL

UPDATE people10m SET gender = 'Female' WHERE gender = 'F';


UPDATE people10m SET gender = 'Male' WHERE gender = 'M';

UPDATE delta.`/tmp/delta/people-10m` SET gender = 'Female' WHERE gender = 'F';


UPDATE delta.`/tmp/delta/people-10m` SET gender = 'Male' WHERE gender = 'M';
Python

NOTE
The Python API is available in Databricks Runtime 6.1 and above.

from delta.tables import *


from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, '/tmp/delta/people-10m')

# Declare the predicate by using a SQL-formatted string.


deltaTable.update(
condition = "gender = 'F'",
set = { "gender": "'Female'" }
)

# Declare the predicate by using Spark SQL functions.


deltaTable.update(
condition = col('gender') == 'M',
set = { 'gender': lit('Male') }
)

Scala

NOTE
The Scala API is available in Databricks Runtime 6.0 and above.

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/people-10m")

// Declare the predicate by using a SQL-formatted string.


deltaTable.updateExpr(
  "gender = 'F'",
  Map("gender" -> "'Female'"))
import org.apache.spark.sql.functions._
import spark.implicits._

// Declare the predicate by using Spark SQL functions and implicits.


deltaTable.update(
  col("gender") === "M",
  Map("gender" -> lit("Male")))

Java

NOTE
The Java API is available in Databricks Runtime 6.0 and above.
import io.delta.tables.*;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;
import java.util.HashMap;

DeltaTable deltaTable = DeltaTable.forPath(spark, "/data/events/");

// Declare the predicate by using a SQL-formatted string.


deltaTable.updateExpr(
"gender = 'F'",
new HashMap<String, String>() {{
put("gender", "'Female'");
}}
);

// Declare the predicate by using Spark SQL functions.


deltaTable.update(
  functions.col("gender").equalTo("M"),
  new HashMap<String, Column>() {{
    put("gender", functions.lit("Male"));
  }}
);

See the Delta Lake APIs for details.

TIP
Similar to delete, update operations can get a significant speedup with predicates on partitions.

Upsert into a table using merge


You can upsert data from a source table, view, or DataFrame into a target Delta table by using the MERGE SQL
operation. Delta Lake supports inserts, updates and deletes in MERGE , and it supports extended syntax beyond
the SQL standards to facilitate advanced use cases.
Suppose you have a source table named people10mupdates or a source path at /tmp/delta/people-10m-updates
that contains new data for a target table named people10m or a target path at /tmp/delta/people-10m . Some of
these new records may already be present in the target data. To merge the new data, you want to update rows
where the person’s id is already present and insert the new rows where no matching id is present. You can
run the following:
SQL
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN
UPDATE SET
id = people10mupdates.id,
firstName = people10mupdates.firstName,
middleName = people10mupdates.middleName,
lastName = people10mupdates.lastName,
gender = people10mupdates.gender,
birthDate = people10mupdates.birthDate,
ssn = people10mupdates.ssn,
salary = people10mupdates.salary
WHEN NOT MATCHED
THEN INSERT (
id,
firstName,
middleName,
lastName,
gender,
birthDate,
ssn,
salary
)
VALUES (
people10mupdates.id,
people10mupdates.firstName,
people10mupdates.middleName,
people10mupdates.lastName,
people10mupdates.gender,
people10mupdates.birthDate,
people10mupdates.ssn,
people10mupdates.salary
)

For syntax details, see


Databricks Runtime 7.x and above: MERGE INTO
Databricks Runtime 5.5 LTS and 6.x: Merge Into (Delta Lake on Azure Databricks)
Python
from delta.tables import *

deltaTablePeople = DeltaTable.forPath(spark, '/tmp/delta/people-10m')


deltaTablePeopleUpdates = DeltaTable.forPath(spark, '/tmp/delta/people-10m-updates')

dfUpdates = deltaTablePeopleUpdates.toDF()

deltaTablePeople.alias('people') \
.merge(
dfUpdates.alias('updates'),
'people.id = updates.id'
) \
.whenMatchedUpdate(set =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.whenNotMatchedInsert(values =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.execute()

Scala
import io.delta.tables._
import org.apache.spark.sql.functions._

val deltaTablePeople = DeltaTable.forPath(spark, "/tmp/delta/people-10m")


val deltaTablePeopleUpdates = DeltaTable.forPath(spark, "tmp/delta/people-10m-updates")
val dfUpdates = deltaTablePeopleUpdates.toDF()

deltaTablePeople
.as("people")
.merge(
dfUpdates.as("updates"),
"people.id = updates.id")
.whenMatched
.updateExpr(
Map(
"id" -> "updates.id",
"firstName" -> "updates.firstName",
"middleName" -> "updates.middleName",
"lastName" -> "updates.lastName",
"gender" -> "updates.gender",
"birthDate" -> "updates.birthDate",
"ssn" -> "updates.ssn",
"salary" -> "updates.salary"
))
.whenNotMatched
.insertExpr(
Map(
"id" -> "updates.id",
"firstName" -> "updates.firstName",
"middleName" -> "updates.middleName",
"lastName" -> "updates.lastName",
"gender" -> "updates.gender",
"birthDate" -> "updates.birthDate",
"ssn" -> "updates.ssn",
"salary" -> "updates.salary"
))
.execute()

Java
import io.delta.tables.*;
import org.apache.spark.sql.functions;
import java.util.HashMap;

DeltaTable deltaTable = DeltaTable.forPath(spark, "/tmp/delta/people-10m")


Dataset<Row> dfUpdates = spark.read("delta").load("/tmp/delta/people-10m-updates")

deltaTable
.as("people")
.merge(
dfUpdates.as("updates"),
"people.id = updates.id")
.whenMatched()
.updateExpr(
new HashMap<String, String>() {{
put("id", "updates.id");
put("firstName", "updates.firstName");
put("middleName", "updates.middleName");
put("lastName", "updates.lastName");
put("gender", "updates.gender");
put("birthDate", "updates.birthDate");
put("ssn", "updates.ssn");
put("salary", "updates.salary");
}})
.whenNotMatched()
.insertExpr(
new HashMap<String, String>() {{
put("id", "updates.id");
put("firstName", "updates.firstName");
put("middleName", "updates.middleName");
put("lastName", "updates.lastName");
put("gender", "updates.gender");
put("birthDate", "updates.birthDate");
put("ssn", "updates.ssn");
put("salary", "updates.salary");
}})
.execute();

See the Delta Lake APIs for Scala, Java, and Python syntax details.
Delta Lake merge operations typically require two passes over the source data. If your source data contains
nondeterministic expressions, multiple passes over the source data can produce different rows, causing incorrect
results. Common examples of nondeterministic expressions include the current_date and
current_timestamp functions. If you cannot avoid using nondeterministic functions, consider saving the source
data to storage, for example as a temporary Delta table. Caching the source data may not address this issue, because
cache invalidation can cause the source data to be recomputed partially or completely (for example, when a
cluster loses some of its executors while scaling down).
Operation semantics
Here is a detailed description of the merge programmatic operation.
There can be any number of whenMatched and whenNotMatched clauses.

NOTE
In Databricks Runtime 7.2 and below, merge can have at most 2 whenMatched clauses and at most 1
whenNotMatched clause.

whenMatched clauses are executed when a source row matches a target table row based on the match
condition. These clauses have the following semantics.
whenMatched clauses can have at most one update and one delete action. The update action in
merge only updates the specified columns (similar to the update operation) of the matched target
row. The delete action deletes the matched row.
Each whenMatched clause can have an optional condition. If this clause condition exists, the update
or delete action is executed for any matching source-target row pair only when the clause
condition is true.
If there are multiple whenMatched clauses, then they are evaluated in the order they are specified.
All whenMatched clauses, except the last one, must have conditions.
If none of the whenMatched conditions evaluate to true for a source and target row pair that
matches the merge condition, then the target row is left unchanged.
To update all the columns of the target Delta table with the corresponding columns of the source
dataset, use whenMatched(...).updateAll() . This is equivalent to:

whenMatched(...).updateExpr(Map("col1" -> "source.col1", "col2" -> "source.col2", ...))

for all the columns of the target Delta table. Therefore, this action assumes that the source table
has the same columns as those in the target table; otherwise, the query throws an analysis error.

NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.

whenNotMatched clauses are executed when a source row does not match any target row based on the
match condition. These clauses have the following semantics.
whenNotMatched clauses can have only the insert action. The new row is generated based on the
specified columns and corresponding expressions. You do not need to specify all the columns in the
target table. For unspecified target columns, NULL is inserted.

NOTE
In Databricks Runtime 6.5 and below, you must provide all the columns in the target table for the
INSERT action.

Each whenNotMatched clause can have an optional condition. If the clause condition is present, a
source row is inserted only if that condition is true for that row. Otherwise, the source row is
ignored.
If there are multiple whenNotMatched clauses, then they are evaluated in the order they are
specified. All whenNotMatched clauses, except the last one, must have conditions.
To insert all the columns of the target Delta table with the corresponding columns of the source
dataset, use whenNotMatched(...).insertAll() . This is equivalent to:

whenNotMatched(...).insertExpr(Map("col1" -> "source.col1", "col2" -> "source.col2", ...))

for all the columns of the target Delta table. Therefore, this action assumes that the source table
has the same columns as those in the target table; otherwise, the query throws an analysis error.
NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.
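As an illustration of these clause semantics, here is a minimal Scala sketch that combines a conditional delete, an unconditional final update, and an unconditional insert. The table paths and column names ( eventId , deleted ) are hypothetical placeholders, not part of the reference material above.
Scala

import io.delta.tables._

// Hypothetical paths; substitute your own tables.
val target = DeltaTable.forPath(spark, "/tmp/delta/events")
val updates = spark.read.format("delta").load("/tmp/delta/events-updates")

target.as("t")
  .merge(updates.as("s"), "t.eventId = s.eventId")
  // Clauses are evaluated in order; every whenMatched clause except the last one needs a condition.
  .whenMatched("s.deleted = true")
  .delete()
  // Unconditional final whenMatched clause: update all remaining matches.
  .whenMatched()
  .updateAll()
  // Source rows with no match in the target are inserted.
  .whenNotMatched()
  .insertAll()
  .execute()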

IMPORTANT
A merge operation can fail if multiple rows of the source dataset match and the merge attempts to update the same
rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it
is unclear which source row should be used to update the matched target row. You can preprocess the source table to
eliminate the possibility of multiple matches. See the change data capture example—it shows how to preprocess the
change dataset (that is, the source dataset) to retain only the latest change for each key before applying that change
into the target Delta table.
A merge operation can produce incorrect results if the source dataset is non-deterministic. This is because merge
may perform two scans of the source dataset, and if the data produced by the two scans is different, the final
changes made to the table can be incorrect. Non-determinism in the source can arise in many ways. Some of them are
as follows:
Reading from non-Delta tables. For example, reading from a CSV table where the underlying files can change
between the multiple scans.
Using non-deterministic operations. For example, Dataset.filter() operations that use the current timestamp
to filter data can produce different results between the multiple scans.
You can apply a SQL MERGE operation on a SQL VIEW only if the view has been defined as
CREATE VIEW viewName AS SELECT * FROM deltaTable .

NOTE
In Databricks Runtime 7.3 LTS and above, multiple matches are allowed when matches are unconditionally deleted (since
unconditional delete is not ambiguous even if there are multiple matches).

Schema validation
merge automatically validates that the schema of the data generated by insert and update expressions are
compatible with the schema of the table. It uses the following rules to determine whether the merge operation
is compatible:
For update and insert actions, the specified target columns must exist in the target Delta table.
For updateAll and insertAll actions, the source dataset must have all the columns of the target Delta table.
The source dataset can have extra columns and they are ignored.
For all actions, if the data type generated by the expressions producing the target columns are different from
the corresponding columns in the target Delta table, merge tries to cast them to the types in the table.
Automatic schema evolution

NOTE
Schema evolution in merge is available in Databricks Runtime 6.6 and above.

By default, updateAll and insertAll assign all the columns in the target Delta table with columns of the same
name from the source dataset. Any columns in the source dataset that don’t match columns in the target table
are ignored. However, in some use cases, it is desirable to automatically add source columns to the target Delta
table. To automatically update the table schema during a merge operation with updateAll and insertAll (at
least one of them), you can set the Spark session configuration
spark.databricks.delta.schema.autoMerge.enabled to true before running the merge operation.
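For example, here is a minimal Scala sketch of enabling this setting for the current session before a merge; targetDeltaTable and sourceDataFrame are placeholder names, as in the examples below.
Scala

// Enable automatic schema evolution for merges in this Spark session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// With the setting enabled, updateAll/insertAll can add new source columns to the target table.
targetDeltaTable.as("t")
  .merge(sourceDataFrame.as("s"), "t.key = s.key")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()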

NOTE
Schema evolution occurs only when there is either an updateAll ( UPDATE SET * ) or an insertAll ( INSERT * )
action, or both.
update and insert actions cannot explicitly refer to target columns that do not already exist in the target table
(even if updateAll or insertAll is one of the clauses). See the examples below.

NOTE
In Databricks Runtime 7.4 and below, merge supports schema evolution of only top-level columns, and not of nested
columns.

Here are a few examples of the effects of merge operation with and without schema evolution.

Example 1
Target columns: key, value
Source columns: key, value, newValue
Query (in Scala):
  targetDeltaTable.alias("t")
    .merge(
      sourceDataFrame.alias("s"),
      "t.key = s.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
Behavior without schema evolution (default): The table schema remains unchanged; only columns key, value are updated/inserted.
Behavior with schema evolution: The table schema is changed to (key, value, newValue). updateAll updates columns value and newValue, and insertAll inserts rows (key, value, newValue).

Example 2
Target columns: key, oldValue
Source columns: key, newValue
Query (in Scala):
  targetDeltaTable.alias("t")
    .merge(
      sourceDataFrame.alias("s"),
      "t.key = s.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
Behavior without schema evolution (default): updateAll and insertAll actions throw an error because the target column oldValue is not in the source.
Behavior with schema evolution: The table schema is changed to (key, oldValue, newValue). updateAll updates columns key and newValue, leaving oldValue unchanged, and insertAll inserts rows (key, NULL, newValue) (that is, oldValue is inserted as NULL).

Example 3
Target columns: key, oldValue
Source columns: key, newValue
Query (in Scala):
  targetDeltaTable.alias("t")
    .merge(
      sourceDataFrame.alias("s"),
      "t.key = s.key")
    .whenMatched().update(Map(
      "newValue" -> col("s.newValue")))
    .whenNotMatched().insertAll()
    .execute()
Behavior without schema evolution (default): update throws an error because column newValue does not exist in the target table.
Behavior with schema evolution: update still throws an error because column newValue does not exist in the target table.

Example 4
Target columns: key, oldValue
Source columns: key, newValue
Query (in Scala):
  targetDeltaTable.alias("t")
    .merge(
      sourceDataFrame.alias("s"),
      "t.key = s.key")
    .whenMatched().updateAll()
    .whenNotMatched().insert(Map(
      "key" -> col("s.key"),
      "newValue" -> col("s.newValue")))
    .execute()
Behavior without schema evolution (default): insert throws an error because column newValue does not exist in the target table.
Behavior with schema evolution: insert still throws an error because column newValue does not exist in the target table.

Special considerations for schemas that contain arrays of structs


Delta MERGE INTO supports resolving struct fields by name and evolving schemas for arrays of structs. With
schema evolution enabled, target table schemas will evolve for arrays of structs, which also works with any
nested structs inside of arrays.

NOTE
This feature is available in Databricks Runtime 9.1 and above. For Databricks Runtime 9.0 and below, implicit Spark casting
is used for arrays of structs to resolve struct fields by position, and the effects of merge operations with and without
schema evolution of structs in arrays are inconsistent with the behaviors of structs outside of arrays.

Here are a few examples of the effects of merge operations with and without schema evolution for arrays of
structs.

Example 1
Source schema: array<struct<b: string, a: string>>
Target schema: array<struct<a: int, b: int>>
Behavior without schema evolution (default): The table schema remains unchanged. Columns are resolved by name and updated or inserted.
Behavior with schema evolution: The table schema remains unchanged. Columns are resolved by name and updated or inserted.

Example 2
Source schema: array<struct<a: int, c: string, d: string>>
Target schema: array<struct<a: string, b: string>>
Behavior without schema evolution (default): update and insert throw errors because c and d do not exist in the target table.
Behavior with schema evolution: The table schema is changed to array<struct<a: string, b: string, c: string, d: string>>. c and d are inserted as NULL for existing entries in the target table. update and insert fill entries in the source table with a cast to string and b as NULL.

Example 3
Source schema: array<struct<a: string, b: struct<c: string, d: string>>>
Target schema: array<struct<a: string, b: struct<c: string>>>
Behavior without schema evolution (default): update and insert throw errors because d does not exist in the target table.
Behavior with schema evolution: The target table schema is changed to array<struct<a: string, b: struct<c: string, d: string>>>. d is inserted as NULL for existing entries in the target table.
Performance tuning
You can reduce the time taken by merge using the following approaches:
Reduce the search space for matches : By default, the merge operation searches the entire Delta table
to find matches in the source table. One way to speed up merge is to reduce the search space by adding
known constraints in the match condition. For example, suppose you have a table that is partitioned by
country and date and you want to use merge to update information for the last day and a specific
country. Adding the condition

events.date = current_date() AND events.country = 'USA'

will make the query faster as it looks for matches only in the relevant partitions. Furthermore, it will also
reduce the chances of conflicts with other concurrent operations. See Concurrency control for more
details.
Compact files : If the data is stored in many small files, reading the data to search for matches can
become slow. You can compact small files into larger files to improve read throughput. See Compact files
for details.
Control the shuffle partitions for writes : The merge operation shuffles data multiple times to
compute and write the updated data. The number of tasks used to shuffle is controlled by the Spark
session configuration spark.sql.shuffle.partitions . Setting this parameter not only controls the
parallelism but also determines the number of output files. Increasing the value increases parallelism but
also generates a larger number of smaller data files. (See the configuration sketch after this list.)
Enable optimized writes : For partitioned tables, merge can produce a much larger number of small
files than the number of shuffle partitions. This is because every shuffle task can write multiple files in
multiple partitions, and can become a performance bottleneck. You can reduce the number of files by
enabling Optimized Write.

NOTE
In Databricks Runtime 7.4 and above, Optimized Write is automatically enabled in merge operations on partitioned
tables.

Tune file sizes in table : In Databricks Runtime 8.2 and above, Azure Databricks can automatically detect if
a Delta table has frequent merge operations that rewrite files and may choose to reduce the size of rewritten
files in anticipation of further file rewrites in the future. See the section on tuning file sizes for details.
Low Shuffle Merge : In Databricks Runtime 9.0 and above, Low Shuffle Merge provides an optimized
implementation of MERGE that provides better performance for most common workloads. In addition, it
preserves existing data layout optimizations such as Z-ordering on unmodified data.
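As a minimal sketch, the shuffle partition and Optimized Write settings mentioned in this list can be adjusted from Scala before running merge . The values shown are placeholders, not recommendations, and the Optimized Write session setting shown is an assumption; check the Optimized Write documentation for the authoritative property names.
Scala

// Control the number of shuffle tasks, and therefore the number of output files, used by merge.
spark.conf.set("spark.sql.shuffle.partitions", "200")

// Explicitly enable Optimized Write for partitioned tables
// (enabled automatically for merge in Databricks Runtime 7.4 and above).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")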

Merge examples
Here are a few examples on how to use merge in different scenarios.
In this section:
Data deduplication when writing into Delta tables
Slowly changing data (SCD) Type 2 operation into Delta tables
Write change data into a Delta table
Upsert from streaming queries using foreachBatch
Data deduplication when writing into Delta tables
A common ETL use case is to collect logs into a Delta table by appending them to the table. However, the
sources can often generate duplicate log records, and downstream deduplication steps are needed to take care of
them. With merge , you can avoid inserting the duplicate records.
SQL

MERGE INTO logs


USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED
THEN INSERT *

Python

deltaTable.alias("logs").merge(
newDedupedLogs.alias("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId") \
.whenNotMatchedInsertAll() \
.execute()

Scala

deltaTable
.as("logs")
.merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId")
.whenNotMatched()
.insertAll()
.execute()

Java

deltaTable
.as("logs")
.merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId")
.whenNotMatched()
.insertAll()
.execute();

NOTE
The dataset containing the new logs needs to be deduplicated within itself. By the SQL semantics of merge, it matches
and deduplicates the new data with the existing data in the table, but if there is duplicate data within the new dataset, it is
inserted. Hence, deduplicate the new data before merging into the table.

If you know that you may get duplicate records only for a few days, you can optimize your query further by
partitioning the table by date, and then specifying the date range of the target table to match on.
SQL

MERGE INTO logs


USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS
WHEN NOT MATCHED AND newDedupedLogs.date > current_date() - INTERVAL 7 DAYS
THEN INSERT *
Python

deltaTable.alias("logs").merge(
newDedupedLogs.alias("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS") \
.whenNotMatchedInsertAll("newDedupedLogs.date > current_date() - INTERVAL 7 DAYS") \
.execute()

Scala

deltaTable.as("logs").merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS")
.whenNotMatched("newDedupedLogs.date > current_date() - INTERVAL 7 DAYS")
.insertAll()
.execute()

Java

deltaTable.as("logs").merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS")
.whenNotMatched("newDedupedLogs.date > current_date() - INTERVAL 7 DAYS")
.insertAll()
.execute();

This is more efficient than the previous command as it looks for duplicates only in the last 7 days of logs, not the
entire table. Furthermore, you can use this insert-only merge with Structured Streaming to perform continuous
deduplication of the logs.
In a streaming query, you can use the merge operation in foreachBatch to continuously write any streaming
data to a Delta table with deduplication. See the following streaming example for more information on
foreachBatch .
In another streaming query, you can continuously read deduplicated data from this Delta table. This is
possible because an insert-only merge only appends new data to the Delta table.

NOTE
Insert-only merge is optimized to only append data in Databricks Runtime 6.2 and above. In Databricks Runtime 6.1 and
below, writes from insert-only merge operations cannot be read as a stream.

Slowly changing data (SCD) Type 2 operation into Delta tables


Another common operation is SCD Type 2, which maintains a history of all changes made to each key in a
dimensional table. Such operations require updating existing rows to mark previous values of keys as old, and
then inserting the new rows as the latest values. Given a source table with updates and the target table with the
dimensional data, SCD Type 2 can be expressed with merge .
Here is a concrete example of maintaining the history of addresses for a customer along with the active date
range of each address. When a customer’s address needs to be updated, you have to mark the previous address
as not the current one, update its active date range, and add the new address as the current one.
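The full example is in the notebook below; the following is only a minimal Scala sketch of the pattern, with hypothetical customers and updates schemas ( customerId , address , current , effectiveDate , endDate ) and hypothetical paths.
Scala

import io.delta.tables._
import org.apache.spark.sql.functions.expr

// Hypothetical inputs: the target dimension table and a DataFrame of new addresses.
val customersTable = DeltaTable.forPath(spark, "/tmp/delta/customers")
val updates = spark.read.format("delta").load("/tmp/delta/customer-updates")

// Customers whose address changed: these become the new current rows and must be inserted.
val newAddressesToInsert = updates.as("u")
  .join(customersTable.toDF.as("c"), expr("u.customerId = c.customerId"))
  .where("c.current = true AND u.address <> c.address")
  .selectExpr("u.customerId", "u.address", "u.effectiveDate")

// Stage two kinds of rows: inserts (mergeKey = NULL, so they never match) and updates.
val stagedUpdates = newAddressesToInsert
  .selectExpr("NULL as mergeKey", "customerId", "address", "effectiveDate")
  .union(updates.selectExpr("customerId as mergeKey", "customerId", "address", "effectiveDate"))

customersTable.as("c")
  .merge(stagedUpdates.as("s"), "c.customerId = s.mergeKey")
  // Close out the previous address version and end its active date range.
  .whenMatched("c.current = true AND c.address <> s.address")
  .updateExpr(Map("current" -> "false", "endDate" -> "s.effectiveDate"))
  // Insert brand-new customers and the new address versions staged with mergeKey = NULL.
  .whenNotMatched()
  .insertExpr(Map(
    "customerId" -> "s.customerId",
    "address" -> "s.address",
    "current" -> "true",
    "effectiveDate" -> "s.effectiveDate",
    "endDate" -> "null"))
  .execute()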
SCD Type 2 using merge notebook
Get notebook
Write change data into a Delta table
Similar to SCD, another common use case, often called change data capture (CDC), is to apply all data changes
generated from an external database into a Delta table. In other words, a set of updates, deletes, and inserts
applied to an external table needs to be applied to a Delta table. You can do this using merge as follows.
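A minimal Scala sketch of the idea follows; the change set schema (a changeType column marking each row as 'insert', 'update', or 'delete') and the table paths are hypothetical, and the notebook below shows a complete example.
Scala

import io.delta.tables._

// Hypothetical target table and a change set that is already deduplicated to one row per key.
val target = DeltaTable.forPath(spark, "/tmp/delta/target")
val changes = spark.read.format("delta").load("/tmp/delta/changes")

target.as("t")
  .merge(changes.as("c"), "t.key = c.key")
  // Apply deletes captured from the external database.
  .whenMatched("c.changeType = 'delete'")
  .delete()
  // Apply updates.
  .whenMatched("c.changeType = 'update'")
  .updateExpr(Map("value" -> "c.value"))
  // Apply inserts, ignoring deletes for keys that are not in the target.
  .whenNotMatched("c.changeType != 'delete'")
  .insertExpr(Map("key" -> "c.key", "value" -> "c.value"))
  .execute()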
Write change data using MERGE notebook
Get notebook
Upsert from streaming queries using foreachBatch

You can use a combination of merge and foreachBatch (see foreachbatch for more information) to write
complex upserts from a streaming query into a Delta table. For example (a minimal sketch follows this list):
Write streaming aggregates in Update Mode : This is much more efficient than Complete Mode.
Write a stream of database changes into a Delta table : The merge query for writing change data can
be used in foreachBatch to continuously apply a stream of changes to a Delta table.
Write stream data into a Delta table with deduplication : The insert-only merge query for
deduplication can be used in foreachBatch to continuously write data (with duplicates) to a Delta table with
automatic deduplication.
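The following is a minimal Scala sketch of this pattern; streamingDF stands for any streaming DataFrame, and the table path, key column, and checkpoint location are hypothetical.
Scala

import io.delta.tables._
import org.apache.spark.sql.DataFrame

// Hypothetical target Delta table to upsert into.
val target = DeltaTable.forPath(spark, "/tmp/delta/target")

// Idempotent upsert applied to each micro-batch.
def upsertToDelta(microBatchDF: DataFrame, batchId: Long): Unit = {
  target.as("t")
    .merge(microBatchDF.as("s"), "t.key = s.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}

streamingDF.writeStream
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .option("checkpointLocation", "/tmp/checkpoints/upsert")
  .start()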

NOTE
Make sure that your merge statement inside foreachBatch is idempotent as restarts of the streaming query can
apply the operation on the same batch of data multiple times.
When merge is used in foreachBatch , the input data rate of the streaming query (reported through
StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at
which data is generated at the source. This is because merge reads the input data multiple times causing the input
metrics to be multiplied. If this is a bottleneck, you can cache the batch DataFrame before merge and then uncache it
after merge .

Write streaming aggregates in update mode using merge and foreachBatch notebook
Get notebook
Change data feed
7/21/2022 • 6 minutes to read

NOTE
Delta change data feed is available in Databricks Runtime 8.4 and above.
This article describes how to record and query row-level change information for Delta tables using the change data
feed feature. To learn how to update tables in a Delta Live Tables pipeline based on changes in source data, see Change
data capture with Delta Live Tables.

The change data feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta
table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table.
This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or
updated.
You can read the change events in batch queries using SQL and DataFrame APIs (that is, df.read ), and in
streaming queries using DataFrame APIs (that is, df.readStream ).

Use cases
Change Data Feed is not enabled by default. The following use cases should drive when you enable the change
data feed.
Silver and Gold tables : Improve Delta performance by processing only row-level changes following initial
MERGE , UPDATE , or DELETE operations to accelerate and simplify ETL and ELT operations.
Materialized views : Create up-to-date, aggregated views of information for use in BI and analytics without
having to reprocess the full underlying tables, instead updating only where changes have come through.
Transmit changes : Send a change data feed to downstream systems such as Kafka or RDBMS that can use
it to incrementally process in later stages of data pipelines.
Audit trail table : Capturing the change data feed as a Delta table provides perpetual storage and efficient
query capability to see all changes over time, including when deletes occur and what updates were made.

Enable change data feed


You must explicitly enable the change data feed option using one of the following methods:
New table : Set the table property delta.enableChangeDataFeed = true in the CREATE TABLE command.

CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES (delta.enableChangeDataFeed = true)

Existing table : Set the table property delta.enableChangeDataFeed = true in the ALTER TABLE command.

ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

All new tables :

set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;


IMPORTANT
Once you enable the change data feed option for a table, you can no longer write to the table using Databricks
Runtime 8.1 or below. You can always read the table.
Only changes made after you enable the change data feed are recorded; past changes to a table are not captured.

Change data storage


Databricks records change data for UPDATE , DELETE , and MERGE operations in the _change_data folder under
the Delta table directory. These records may be skipped when Databricks detects it can efficiently compute the
change data feed directly from the transaction log. In particular, insert-only operations and full partition deletes
will not generate data in the _change_data directory.
The files in the _change_data folder follow the retention policy of the table. Therefore, if you run the VACUUM
command, change data feed data is also deleted.

Read changes in batch queries


You can provide either version or timestamp for the start and end. The start and end versions and timestamps
are inclusive in the queries. To read the changes from a particular start version to the latest version of the table,
specify only the starting version or timestamp.
You specify a version as an integer and a timestamp as a string in the format yyyy-MM-dd[ HH:mm:ss[.SSS]] .
If you provide a version lower than, or a timestamp older than, the earliest recorded change events (that is, earlier
than when the change data feed was enabled), an error is thrown indicating that the change data feed was not enabled.
SQL

-- version as ints or longs e.g. changes from version 0 to 10


SELECT * FROM table_changes('tableName', 0, 10)

-- timestamp as string formatted timestamps


SELECT * FROM table_changes('tableName', '2021-04-21 05:45:46', '2021-05-21 12:00:00')

-- providing only the startingVersion/timestamp


SELECT * FROM table_changes('tableName', 0)

-- database/schema names inside the string for table name, with backticks for escaping dots and special characters
SELECT * FROM table_changes('dbName.`dotted.tableName`', '2021-04-21 06:45:46', '2021-05-21 12:00:00')

-- path based tables


SELECT * FROM table_changes_by_path('/path', '2021-04-21 05:45:46')

Python
# version as ints or longs
spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 0) \
.option("endingVersion", 10) \
.table("myDeltaTable")

# timestamps as formatted timestamp


spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingTimestamp", '2021-04-21 05:45:46') \
.option("endingTimestamp", '2021-05-21 12:00:00') \
.table("myDeltaTable")

# providing only the startingVersion/timestamp


spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 0) \
.table("myDeltaTable")

# path based tables


spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingTimestamp", '2021-04-21 05:45:46') \
.load("pathToMyDeltaTable")

Scala

// version as ints or longs


spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 0)
.option("endingVersion", 10)
.table("myDeltaTable")

// timestamps as formatted timestamp


spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingTimestamp", "2021-04-21 05:45:46")
.option("endingTimestamp", "2021-05-21 12:00:00")
.table("myDeltaTable")

// providing only the startingVersion/timestamp


spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 0)
.table("myDeltaTable")

// path based tables


spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingTimestamp", "2021-04-21 05:45:46")
.load("pathToMyDeltaTable")

Read changes in streaming queries


Python
# providing a starting version
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 0) \
.table("myDeltaTable")

# providing a starting timestamp


spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.option("startingTimestamp", "2021-04-21 05:35:43") \
.load("/pathToMyDeltaTable")

# not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.table("myDeltaTable")

Scala

// providing a starting version


spark.readStream.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 0)
.table("myDeltaTable")

// providing a starting timestamp


spark.readStream.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", "2021-04-21 05:35:43")
.load("/pathToMyDeltaTable")

// not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta")
.option("readChangeFeed", "true")
.table("myDeltaTable")

To get the change data while reading the table, set the option readChangeFeed to true . The startingVersion and
startingTimestamp options are optional; if they are not provided, the stream returns the latest snapshot of the table at the
time of streaming as an INSERT and future changes as change data. Options like rate limits ( maxFilesPerTrigger
, maxBytesPerTrigger ) and excludeRegex are also supported when reading change data.

NOTE
Rate limiting can be atomic for versions other than the starting snapshot version; that is, the entire commit version will
be rate limited or the entire commit will be returned.
By default, if a user passes in a version or timestamp exceeding the last commit on a table, the error
timestampGreaterThanLatestCommit is thrown. CDF can handle the out-of-range version case if the user sets the
following configuration to true .

set spark.databricks.delta.changeDataFeed.timestampOutOfRange.enabled = true;

If you provide a start version greater than the last commit on a table or a start timestamp newer than the last commit on
a table, then when the preceding configuration is enabled, an empty read result is returned.
If you provide an end version greater than the last commit on a table or an end timestamp newer than the last commit
on a table, then when the preceding configuration is enabled in batch read mode, all changes between the start version
and the last commit are returned.
Change data event schema
In addition to the data columns, change data contains metadata columns that identify the type of change event:

_change_type (String): insert , update_preimage , update_postimage , delete (1)
_commit_version (Long): The Delta log or table version containing the change.
_commit_timestamp (Timestamp): The timestamp associated with when the commit was created.

(1) preimage is the value before the update, postimage is the value after the update.
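For example, here is a minimal Scala sketch that reuses the hypothetical myDeltaTable from the read examples above and keeps only the rows as they look after each change (inserts and update post-images):
Scala

import org.apache.spark.sql.functions.col

val latestRowImages = spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 1)
  .table("myDeltaTable")
  // Keep inserted rows and the post-update image of updated rows.
  .filter(col("_change_type").isin("insert", "update_postimage"))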

Frequently asked questions (FAQ)


What is the overhead of enabling the change data feed?
There is no significant impact. The change data records are generated inline during query execution,
and are generally much smaller than the total size of the rewritten files.
What is the retention policy for change records?
Change records follow the same retention policy as out-of-date table versions, and will be cleaned up through
VACUUM if they are outside the specified retention period.
When do new records become available in the change data feed?
Change data is committed along with the Delta Lake transaction, and will become available at the same time as
the new data is available in the table.

Notebook
The notebook shows how to propagate changes made to a silver table of absolute number of vaccinations to a
gold table of vaccination rates.
Change data feed notebook
Get notebook
Table utility commands
7/21/2022 • 23 minutes to read

Delta tables support a number of utility commands.

Remove files no longer referenced by a Delta table


You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by
running the vacuum command on the table. vacuum is not triggered automatically. The default retention
threshold for the files is 7 days. To change this behavior, see Data retention.

IMPORTANT
vacuum removes all files from directories not managed by Delta Lake, ignoring directories beginning with _ . If you
are storing additional metadata like Structured Streaming checkpoints within a Delta table directory, use a directory
name such as _checkpoints .
vacuum deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint
operations. The default retention period of log files is 30 days, configurable through the
delta.logRetentionDuration property which you set with the ALTER TABLE SET TBLPROPERTIES SQL method. See
Table properties.
The ability to time travel back to a version older than the retention period is lost after running vacuum .

NOTE
When the Delta cache is enabled, a cluster might contain data from Parquet files that have been deleted with vacuum .
Therefore, it may be possible to query the data of previous table versions whose files have been deleted. Restarting the
cluster will remove the cached data. See Configure the Delta cache.

SQL

VACUUM eventsTable -- vacuum files not required by versions older than the default retention period

VACUUM '/data/events' -- vacuum files in path-based table

VACUUM delta.`/data/events/`

VACUUM delta.`/data/events/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old

VACUUM eventsTable DRY RUN -- do dry run to get the list of files to be deleted

For Spark SQL syntax details, see


Databricks Runtime 7.x and above: VACUUM
Databricks Runtime 5.5 LTS and 6.x: Vacuum
Python

NOTE
The Python API is available in Databricks Runtime 6.1 and above.
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable) # path-based tables, or


deltaTable = DeltaTable.forName(spark, tableName) # Hive metastore-based tables

deltaTable.vacuum() # vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100) # vacuum files not required by versions more than 100 hours old

Scala

NOTE
The Scala API is available in Databricks Runtime 6.0 and above.

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

deltaTable.vacuum() // vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100) // vacuum files not required by versions more than 100 hours old

Java

NOTE
The Java API is available in Databricks Runtime 6.0 and above.

import io.delta.tables.*;
import org.apache.spark.sql.functions;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);

deltaTable.vacuum(); // vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100); // vacuum files not required by versions more than 100 hours old

See the Delta Lake APIs for Scala, Java, and Python syntax details.

WARNING
It is recommended that you set the retention interval to at least 7 days, because old snapshots and uncommitted files
can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can
fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an
interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag
behind the most recent update to the table.

Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain
that there are no operations being performed on this table that take longer than the retention interval you plan
to specify, you can turn off this safety check by setting the Spark configuration property
spark.databricks.delta.retentionDurationCheck.enabled to false .
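For example, from Scala:
Scala

// Turn off the VACUUM retention safety check for this session; do this only if you are certain
// that no concurrent operation exceeds the retention interval you plan to specify.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")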

Audit information
VACUUM commits to the Delta transaction log contain audit information. You can query the audit events using
DESCRIBE HISTORY .

Retrieve Delta table history


You can retrieve information on the operations, user, timestamp, and so on for each write to a Delta table by
running the history command. The operations are returned in reverse chronological order. By default table
history is retained for 30 days.
SQL

DESCRIBE HISTORY '/data/events/' -- get the full history of the table

DESCRIBE HISTORY delta.`/data/events/`

DESCRIBE HISTORY '/data/events/' LIMIT 1 -- get the last operation only

DESCRIBE HISTORY eventsTable

For Spark SQL syntax details, see


Databricks Runtime 7.x and above: DESCRIBE HISTORY (Delta Lake on Azure Databricks)
Databricks Runtime 5.5 LTS and 6.x: Describe History (Delta Lake on Azure Databricks)
Python

NOTE
The Python API is available in Databricks Runtime 6.1 and above.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)

fullHistoryDF = deltaTable.history() # get the full history of the table

lastOperationDF = deltaTable.history(1) # get the last operation

Scala

NOTE
The Scala API is available in Databricks Runtime 6.0 and above.

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

val fullHistoryDF = deltaTable.history() // get the full history of the table

val lastOperationDF = deltaTable.history(1) // get the last operation

Java

NOTE
The Java API is available in Databricks Runtime 6.0 and above.
import io.delta.tables.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);

Dataset<Row> fullHistoryDF = deltaTable.history();    // get the full history of the table

Dataset<Row> lastOperationDF = deltaTable.history(1); // fetch the last operation on the DeltaTable

See the Delta Lake APIs for Scala/Java/Python syntax details.


History schema
The output of the history operation has the following columns.

version (long): Table version generated by the operation.
timestamp (timestamp): When this version was committed.
userId (string): ID of the user that ran the operation.
userName (string): Name of the user that ran the operation.
operation (string): Name of the operation.
operationParameters (map): Parameters of the operation (for example, predicates).
job (struct): Details of the job that ran the operation.
notebook (struct): Details of the notebook from which the operation was run.
clusterId (string): ID of the cluster on which the operation ran.
readVersion (long): Version of the table that was read to perform the write operation.
isolationLevel (string): Isolation level used for this operation.
isBlindAppend (boolean): Whether this operation appended data.
operationMetrics (map): Metrics of the operation (for example, number of rows and files modified).
userMetadata (string): User-defined commit metadata if it was specified.
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+--------
---+-----------------+-------------+--------------------+
|version| timestamp|userId|userName|operation| operationParameters|
job|notebook|clusterId|readVersion| isolationLevel|isBlindAppend| operationMetrics|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+--------
---+-----------------+-------------+--------------------+
| 5|2019-07-29 14:07:47| ###| ###| DELETE|[predicate -> ["(...|null| ###| ###|
4|WriteSerializable| false|[numTotalRows -> ...|
| 4|2019-07-29 14:07:41| ###| ###| UPDATE|[predicate -> (id...|null| ###| ###|
3|WriteSerializable| false|[numTotalRows -> ...|
| 3|2019-07-29 14:07:29| ###| ###| DELETE|[predicate -> ["(...|null| ###| ###|
2|WriteSerializable| false|[numTotalRows -> ...|
| 2|2019-07-29 14:06:56| ###| ###| UPDATE|[predicate -> (id...|null| ###| ###|
1|WriteSerializable| false|[numTotalRows -> ...|
| 1|2019-07-29 14:04:31| ###| ###| DELETE|[predicate -> ["(...|null| ###| ###|
0|WriteSerializable| false|[numTotalRows -> ...|
| 0|2019-07-29 14:01:40| ###| ###| WRITE|[mode -> ErrorIfE...|null| ###| ###|
null|WriteSerializable| true|[numFiles -> 2, n...|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+--------
---+-----------------+-------------+--------------------+

NOTE
Operation metrics are available only when the history command and the operation in the history were run using
Databricks Runtime 6.5 or above.
A few of the other columns are not available if you write into a Delta table using the following methods:
JDBC or ODBC
JAR job
spark-submit job
Run a command using the REST API
Columns added in the future will always be added after the last column.

Operation metrics keys


The history operation returns a collection of operation metrics in the operationMetrics column map.
The following tables list the map key definitions by operation; a short example of querying these metrics follows the tables.

WRITE, CREATE TABLE AS SELECT, REPLACE TABLE AS SELECT, COPY INTO
  numFiles: Number of files written.
  numOutputBytes: Size in bytes of the written contents.
  numOutputRows: Number of rows written.

STREAMING UPDATE
  numAddedFiles: Number of files added.
  numRemovedFiles: Number of files removed.
  numOutputRows: Number of rows written.
  numOutputBytes: Size of write in bytes.

DELETE
  numAddedFiles: Number of files added. Not provided when partitions of the table are deleted.
  numRemovedFiles: Number of files removed.
  numDeletedRows: Number of rows removed. Not provided when partitions of the table are deleted.
  numCopiedRows: Number of rows copied in the process of deleting files.
  executionTimeMs: Time taken to execute the entire operation.
  scanTimeMs: Time taken to scan the files for matches.
  rewriteTimeMs: Time taken to rewrite the matched files.

TRUNCATE
  numRemovedFiles: Number of files removed.
  executionTimeMs: Time taken to execute the entire operation.

MERGE
  numSourceRows: Number of rows in the source DataFrame.
  numTargetRowsInserted: Number of rows inserted into the target table.
  numTargetRowsUpdated: Number of rows updated in the target table.
  numTargetRowsDeleted: Number of rows deleted in the target table.
  numTargetRowsCopied: Number of target rows copied.
  numOutputRows: Total number of rows written out.
  numTargetFilesAdded: Number of files added to the sink (target).
  numTargetFilesRemoved: Number of files removed from the sink (target).
  executionTimeMs: Time taken to execute the entire operation.
  scanTimeMs: Time taken to scan the files for matches.
  rewriteTimeMs: Time taken to rewrite the matched files.

UPDATE
  numAddedFiles: Number of files added.
  numRemovedFiles: Number of files removed.
  numUpdatedRows: Number of rows updated.
  numCopiedRows: Number of rows just copied over in the process of updating files.
  executionTimeMs: Time taken to execute the entire operation.
  scanTimeMs: Time taken to scan the files for matches.
  rewriteTimeMs: Time taken to rewrite the matched files.

FSCK
  numRemovedFiles: Number of files removed.

CONVERT
  numConvertedFiles: Number of Parquet files that have been converted.

OPTIMIZE
  numAddedFiles: Number of files added.
  numRemovedFiles: Number of files optimized.
  numAddedBytes: Number of bytes added after the table was optimized.
  numRemovedBytes: Number of bytes removed.
  minFileSize: Size of the smallest file after the table was optimized.
  p25FileSize: Size of the 25th percentile file after the table was optimized.
  p50FileSize: Median file size after the table was optimized.
  p75FileSize: Size of the 75th percentile file after the table was optimized.
  maxFileSize: Size of the largest file after the table was optimized.

CLONE (1)
  sourceTableSize: Size in bytes of the source table at the version that’s cloned.
  sourceNumOfFiles: Number of files in the source table at the version that’s cloned.
  numRemovedFiles: Number of files removed from the target table if a previous Delta table was replaced.
  removedFilesSize: Total size in bytes of the files removed from the target table if a previous Delta table was replaced.
  numCopiedFiles: Number of files that were copied over to the new location. 0 for shallow clones.
  copiedFilesSize: Total size in bytes of the files that were copied over to the new location. 0 for shallow clones.

RESTORE (2)
  tableSizeAfterRestore: Table size in bytes after restore.
  numOfFilesAfterRestore: Number of files in the table after restore.
  numRemovedFiles: Number of files removed by the restore operation.
  numRestoredFiles: Number of files that were added as a result of the restore.
  removedFilesSize: Size in bytes of files removed by the restore.
  restoredFilesSize: Size in bytes of files added by the restore.

VACUUM (3)
  numDeletedFiles: Number of deleted files.
  numVacuumedDirectories: Number of vacuumed directories.
  numFilesToDelete: Number of files to delete.
(1) Requires Databricks Runtime 7.3 LTS or above.


(2) Requires Databricks Runtime 7.4 or above.
(3) Requires Databricks Runtime 8.2 or above.
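For example, here is a minimal Scala sketch that pulls a couple of merge metrics out of the operationMetrics map for a hypothetical table path:
Scala

import io.delta.tables._
import org.apache.spark.sql.functions.col

val history = DeltaTable.forPath(spark, "/tmp/delta/people-10m").history()

history
  .filter(col("operation") === "MERGE")
  .select(
    col("version"),
    col("operationMetrics").getItem("numTargetRowsUpdated").as("rowsUpdated"),
    col("operationMetrics").getItem("numTargetRowsInserted").as("rowsInserted"))
  .show(false)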

Retrieve Delta table details


You can retrieve detailed information about a Delta table (for example, number of files, data size) using
DESCRIBE DETAIL .

DESCRIBE DETAIL '/data/events/'

DESCRIBE DETAIL eventsTable

For Spark SQL syntax details, see


Databricks Runtime 7.x and above: DESCRIBE DETAIL
Databricks Runtime 5.5 LTS and 6.x: Describe Detail (Delta Lake on Azure Databricks)
Detail schema
The output of this operation has only one row with the following schema.

format (string): Format of the table, that is, delta .
id (string): Unique ID of the table.
name (string): Name of the table as defined in the metastore.
description (string): Description of the table.
location (string): Location of the table.
createdAt (timestamp): When the table was created.
lastModified (timestamp): When the table was last modified.
partitionColumns (array of strings): Names of the partition columns if the table is partitioned.
numFiles (long): Number of the files in the latest version of the table.
sizeInBytes (int): The size of the latest snapshot of the table in bytes.
properties (string-string map): All the properties set for this table.
minReaderVersion (int): Minimum version of readers (according to the log protocol) that can read the table.
minWriterVersion (int): Minimum version of writers (according to the log protocol) that can write to the table.

+------+--------------------+------------------+-----------+--------------------+--------------------+------
-------------+----------------+--------+-----------+----------+----------------+----------------+
|format| id| name|description| location| createdAt|
lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|
+------+--------------------+------------------+-----------+--------------------+--------------------+------
-------------+----------------+--------+-----------+----------+----------------+----------------+
| delta|d31f82d2-a69f-42e...|default.deltatable| null|file:/Users/tuor/...|2020-06-05 12:20:...|2020-
06-05 12:20:20| []| 10| 12345| []| 1| 2|
+------+--------------------+------------------+-----------+--------------------+--------------------+------
-------------+----------------+--------+-----------+----------+----------------+----------------+

Convert a Parquet table to a Delta table


Convert a Parquet table to a Delta table in-place. This command lists all the files in the directory, creates a Delta
Lake transaction log that tracks these files, and automatically infers the data schema by reading the footers of all
Parquet files. If your data is partitioned, you must specify the schema of the partition columns as a DDL-
formatted string (that is, <column-name1> <type>, <column-name2> <type>, ... ).

NOTE
If a Parquet table was created by Structured Streaming, you can avoid listing the files by setting the SQL configuration
spark.databricks.delta.convert.useMetadataLog to true , which uses the _spark_metadata sub-directory as the
source of truth for files contained in the table.

SQL

-- Convert unpartitioned Parquet table at path '<path-to-table>'


CONVERT TO DELTA parquet.`<path-to-table>`

-- Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
CONVERT TO DELTA parquet.`<path-to-table>` PARTITIONED BY (part int, part2 int)

For syntax details, see


Databricks Runtime 7.x and above: CONVERT TO DELTA
Databricks Runtime 5.5 LTS and 6.x: Convert To Delta (Delta Lake on Azure Databricks)
Python

NOTE
The Python API is available in Databricks Runtime 6.1 and above.

from delta.tables import *

# Convert unpartitioned Parquet table at path '<path-to-table>'


deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`")

# Convert partitioned parquet table at path '<path-to-table>' and partitioned by integer column named 'part'
partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int")

Scala

NOTE
The Scala API is available in Databricks Runtime 6.0 and above.

import io.delta.tables._

// Convert unpartitioned Parquet table at path '<path-to-table>'


val deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`")

// Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
val partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int, part2 int")

Java

NOTE
The Java API is available in Databricks Runtime 6.0 and above.

import io.delta.tables.*;

// Convert unpartitioned Parquet table at path '<path-to-table>'


DeltaTable deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`");

// Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
DeltaTable partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int, part2 int");

NOTE
Any file not tracked by Delta Lake is invisible and can be deleted when you run vacuum . You should avoid updating or
appending data files during the conversion process. After the table is converted, make sure all writes go through Delta
Lake.
Convert an Iceberg table to a Delta table
NOTE
This feature is in Public Preview.
This feature is supported in Databricks Runtime 10.4 and above.

You can convert an Iceberg table to a Delta table in place if the underlying file format of the Iceberg table is
Parquet. The following command creates a Delta Lake transaction log based on the Iceberg table’s native file
manifest, schema and partitioning information. The converter also collects column stats during the conversion,
unless NO STATISTICS is specified.

-- Convert the Iceberg table in the path <path-to-table>.


CONVERT TO DELTA iceberg.`<path-to-table>`

-- Convert the Iceberg table in the path <path-to-table> without collecting statistics.
CONVERT TO DELTA iceberg.`<path-to-table>` NO STATISTICS

NOTE
Converting Iceberg metastore tables is not supported.

Convert a Delta table to a Parquet table


You can convert a Delta table back to a Parquet table using the following steps (a minimal sketch follows the list):
1. If you have performed Delta Lake operations that can change the data files (for example, delete or merge ),
run vacuum with a retention of 0 hours to delete all data files that do not belong to the latest version of the
table.
2. Delete the _delta_log directory in the table directory.
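The following is a minimal Scala sketch of these steps for a hypothetical path-based table at /data/events/ ; it assumes a Databricks notebook environment where dbutils is available.
Scala

// Step 1: remove all data files that are not in the latest version.
// The retention safety check must be disabled to allow a 0-hour retention (see the VACUUM warning above).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/data/events/` RETAIN 0 HOURS")

// Step 2: delete the transaction log so the directory is read as a plain Parquet table.
dbutils.fs.rm("/data/events/_delta_log", true)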

Restore a Delta table to an earlier state


NOTE
Available in Databricks Runtime 7.4 and above.

You can restore a Delta table to its earlier state by using the RESTORE command. A Delta table internally
maintains historic versions of the table that enable it to be restored to an earlier state. A version corresponding
to the earlier state or a timestamp of when the earlier state was created are supported as options by the
RESTORE command.
IMPORTANT
You can restore an already restored table.
You can restore a cloned table.
Restoring a table to an older version where the data files were deleted manually or by vacuum will fail. Partially
restoring to this version is still possible if spark.sql.files.ignoreMissingFiles is set to true .
The timestamp format for restoring to an earlier state is yyyy-MM-dd HH:mm:ss . Providing only a date
( yyyy-MM-dd ) string is also supported.

SQL

RESTORE TABLE db.target_table TO VERSION AS OF <version>


RESTORE TABLE delta.`/data/target/` TO TIMESTAMP AS OF <timestamp>

Python

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, <path-to-table>) # path-based tables, or


deltaTable = DeltaTable.forName(spark, <table-name>) # Hive metastore-based tables

deltaTable.restoreToVersion(0) # restore table to oldest version

deltaTable.restoreToTimestamp('2019-02-14') # restore to a specific timestamp

Scala

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, <path-to-table>)


val deltaTable = DeltaTable.forName(spark, <table-name>)

deltaTable.restoreToVersion(0) // restore table to oldest version

deltaTable.restoreToTimestamp("2019-02-14") // restore to a specific timestamp

Java

import io.delta.tables.*;

DeltaTable deltaTable = DeltaTable.forPath(spark, <path-to-table>);


DeltaTable deltaTable = DeltaTable.forName(spark, <table-name>);

deltaTable.restoreToVersion(0); // restore table to oldest version

deltaTable.restoreToTimestamp("2019-02-14"); // restore to a specific timestamp

For syntax details, see RESTORE (Delta Lake on Azure Databricks).


IMPORTANT
Restore is considered a data-changing operation. Delta Lake log entries added by the RESTORE command contain
dataChange set to true. If there is a downstream application, such as a Structured Streaming job that processes the
updates to a Delta Lake table, the data change log entries added by the restore operation are considered new data
updates, and processing them may result in duplicate data.
For example:

Table version 0 (operation: INSERT)
  Delta log updates: AddFile(/path/to/file-1, dataChange = true)
  Records in data change log updates: (name = Viktor, age = 29), (name = George, age = 55)

Table version 1 (operation: INSERT)
  Delta log updates: AddFile(/path/to/file-2, dataChange = true)
  Records in data change log updates: (name = George, age = 39)

Table version 2 (operation: OPTIMIZE)
  Delta log updates: AddFile(/path/to/file-3, dataChange = false), RemoveFile(/path/to/file-1), RemoveFile(/path/to/file-2)
  Records in data change log updates: (No records, as OPTIMIZE compaction does not change the data in the table)

Table version 3 (operation: RESTORE(version=1))
  Delta log updates: RemoveFile(/path/to/file-3), AddFile(/path/to/file-1, dataChange = true), AddFile(/path/to/file-2, dataChange = true)
  Records in data change log updates: (name = Viktor, age = 29), (name = George, age = 55), (name = George, age = 39)

In the preceding example, the RESTORE command results in updates that were already seen when reading Delta
table versions 0 and 1. If a streaming query was reading this table, these files will be considered newly added data
and will be processed again.

Restore metrics

NOTE
Available in Databricks Runtime 8.2 and above.

RESTORE reports the following metrics as a single row DataFrame once the operation is complete:
table_size_after_restore : The size of the table after restoring.
num_of_files_after_restore : The number of files in the table after restoring.
num_removed_files : Number of files removed (logically deleted) from the table.
num_restored_files : Number of files restored due to rolling back.
removed_files_size : Total size in bytes of the files that are removed from the table.
restored_files_size : Total size in bytes of the files that are restored.

Table access control


You must have MODIFY permission on the table being restored.
Clone a Delta table
NOTE
Available in Databricks Runtime 7.2 and above.

You can create a copy of an existing Delta table at a specific version using the clone command. Clones can be
either deep or shallow.
In this section:
Clone types
Clone metrics
Permissions
Clone use cases
Clone types
A deep clone is a clone that copies the source table data to the clone target in addition to the metadata of the
existing table. Additionally, stream metadata is also cloned such that a stream that writes to the Delta table
can be stopped on a source table and continued on the target of a clone from where it left off.
A shallow clone is a clone that does not copy the data files to the clone target. The table metadata is
equivalent to the source. These clones are cheaper to create.
Any changes made to either deep or shallow clones affect only the clones themselves and not the source table.
The metadata that is cloned includes: schema, partitioning information, invariants, nullability. For deep clones
only, stream and COPY INTO metadata are also cloned. Metadata not cloned are the table description and user-
defined commit metadata.

IMPORTANT
Shallow clones reference data files in the source directory. If you run vacuum on the source table, clients will no longer
be able to read the referenced data files and a FileNotFoundException will be thrown. In this case, running clone
with replace over the shallow clone will repair the clone. If this occurs often, consider using a deep clone instead, which
does not depend on the source table.
Deep clones do not depend on the source from which they were cloned, but are expensive to create because a deep
clone copies the data as well as the metadata.
Cloning with replace to a target that already has a table at that path creates a Delta log if one does not exist at that
path. You can clean up any existing data by running vacuum .
If a Delta table already exists at the target, a new commit is created that includes the new metadata and new data from the source
table. This new commit is incremental, meaning that only new changes since the last clone are committed to the table.
Cloning a table is not the same as Create Table As Select or CTAS . A clone copies the metadata of the source
table in addition to the data. Cloning also has simpler syntax: you don’t need to specify partitioning, format, invariants,
nullability and so on as they are taken from the source table.
A cloned table has an independent history from its source table. Time travel queries on a cloned table will not work
with the same inputs as they work on its source table.

SQL
CREATE TABLE delta.`/data/target/` CLONE delta.`/data/source/`  -- Create a deep clone of /data/source at /data/target

CREATE OR REPLACE TABLE db.target_table CLONE db.source_table  -- Replace the target

CREATE TABLE IF NOT EXISTS delta.`/data/target/` CLONE db.source_table  -- No-op if the target table exists

CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source`

CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` VERSION AS OF version

CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` TIMESTAMP AS OF timestamp_expression  -- timestamp can be like "2019-01-01" or like date_sub(current_date(), 1)

Python

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)  # path-based tables, or
deltaTable = DeltaTable.forName(spark, tableName)    # Hive metastore-based tables

deltaTable.clone(target, isShallow, replace)  # clone the source at latest version

deltaTable.cloneAtVersion(version, target, isShallow, replace)  # clone the source at a specific version

# clone the source at a specific timestamp such as timestamp="2019-01-01"
deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace)

Scala

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)  // path-based tables, or
val deltaTable = DeltaTable.forName(spark, tableName)    // Hive metastore-based tables

deltaTable.clone(target, isShallow, replace)  // clone the source at latest version

deltaTable.cloneAtVersion(version, target, isShallow, replace)  // clone the source at a specific version

deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace)  // clone the source at a specific timestamp

Java

import io.delta.tables.*;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);  // path-based tables, or
DeltaTable deltaTable = DeltaTable.forName(spark, tableName);    // Hive metastore-based tables

deltaTable.clone(target, isShallow, replace);  // clone the source at latest version

deltaTable.cloneAtVersion(version, target, isShallow, replace);  // clone the source at a specific version

deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace);  // clone the source at a specific timestamp

For syntax details, see CREATE TABLE CLONE.


Clone metrics
NOTE
Available in Databricks Runtime 8.2 and above.

CLONE reports the following metrics as a single row DataFrame once the operation is complete:
source_table_size : Size of the source table that’s being cloned in bytes.
source_num_of_files : The number of files in the source table.
num_removed_files : If the table is being replaced, how many files are removed from the current table.
num_copied_files : Number of files that were copied from the source (0 for shallow clones).
removed_files_size : Size in bytes of the files that are being removed from the current table.
copied_files_size : Size in bytes of the files copied to the table.
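
As a minimal sketch (the database and table names are hypothetical), the metrics can be captured by running the clone through spark.sql, which returns them as a single-row DataFrame:

Python

# Clone and inspect the reported metrics (db.source_table and db.target_table are hypothetical).
metrics = spark.sql("CREATE OR REPLACE TABLE db.target_table CLONE db.source_table")
metrics.select("source_num_of_files", "num_copied_files", "copied_files_size").show()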

Permissions
You must configure permissions for Azure Databricks table access control and your cloud provider.
Table access control
The following permissions are required for both deep and shallow clones:
SELECT permission on the source table.
If you are using CLONE to create a new table, CREATE permission on the database in which you are creating
the table.
If you are using CLONE to replace a table, you must have MODIFY permission on the table.
Cloud provider permissions
If you have created a deep clone, any user that reads the deep clone must have read access to the clone’s
directory. To make changes to the clone, users must have write access to the clone’s directory.
If you have created a shallow clone, any user that reads the shallow clone needs permission to read the files in
the original table (since the data files remain in the source table with shallow clones) as well as the clone's
directory. To make changes to the clone, users need write access to the clone's directory.
Clone use cases
In this section:
Data archiving
Machine learning flow reproduction
Short-term experiments on a production table
Data sharing
Table property overrides
Data archiving
Data may need to be kept for longer than is feasible with time travel or for disaster recovery. In these cases, you
can create a deep clone to preserve the state of a table at a certain point in time for archival. Incremental
archiving is also possible to keep a continually updating state of a source table for disaster recovery.

-- Every month run
CREATE OR REPLACE TABLE delta.`/some/archive/path` CLONE my_prod_table

Machine learning flow reproduction
When doing machine learning, you may want to archive a certain version of a table on which you trained an ML
model. Future models can be tested using this archived data set.

-- Trained model on version 15 of Delta table
CREATE TABLE delta.`/model/dataset` CLONE entire_dataset VERSION AS OF 15

Short-term experiments on a production table


To test a workflow on a production table without corrupting the table, you can easily create a shallow clone. This
allows you to run arbitrary workflows on the cloned table that contains all the production data but does not
affect any production workloads.

-- Perform shallow clone
CREATE OR REPLACE TABLE my_test SHALLOW CLONE my_prod_table;

UPDATE my_test SET invalid = true WHERE user_id is null;

-- Run a bunch of validations. Once happy:

-- This should leverage the update information in the clone to prune to only
-- changed files in the clone if possible
MERGE INTO my_prod_table
USING my_test
ON my_test.user_id <=> my_prod_table.user_id
WHEN MATCHED AND my_test.user_id is null THEN UPDATE *;

DROP TABLE my_test;

Data sharing
Other business units within a single organization may want to access the same data but may not require the
latest updates. Instead of giving access to the source table directly, you can provide clones with different
permissions for different business units. The performance of the clone can exceed that of a simple view.

-- Perform deep clone
CREATE OR REPLACE TABLE shared_table CLONE my_prod_table;

-- Grant other users access to the shared table
GRANT SELECT ON shared_table TO `<user-name>@<user-domain>.com`;

Table property overrides

NOTE
Available in Databricks Runtime 7.5 and above.

Table property overrides are particularly useful for:

Annotating tables with owner or user information when sharing data with different business units.
Archiving Delta tables when time travel is required. You can specify the log retention period independently
for the archive table. For example:
SQL
CREATE OR REPLACE TABLE archive.my_table CLONE prod.my_table
TBLPROPERTIES (
delta.logRetentionDuration = '3650 days',
delta.deletedFileRetentionDuration = '3650 days'
)
LOCATION 'xx://archive/my_table'

Python

dt = DeltaTable.forName(spark, "prod.my_table")
tblProps = {
  "delta.logRetentionDuration": "3650 days",
  "delta.deletedFileRetentionDuration": "3650 days"
}
dt.clone('xx://archive/my_table', isShallow=False, replace=True, properties=tblProps)

Scala

val dt = DeltaTable.forName(spark, "prod.my_table")
val tblProps = Map(
  "delta.logRetentionDuration" -> "3650 days",
  "delta.deletedFileRetentionDuration" -> "3650 days"
)
dt.clone("xx://archive/my_table", isShallow = false, replace = true, properties = tblProps)

Find the last commit’s version in the Spark session


NOTE
Available in Databricks Runtime 7.1 and above.

To get the version number of the last commit written by the current SparkSession across all threads and all
tables, query the SQL configuration spark.databricks.delta.lastCommitVersionInSession .
SQL

SET spark.databricks.delta.lastCommitVersionInSession

Python

spark.conf.get("spark.databricks.delta.lastCommitVersionInSession")

Scala

spark.conf.get("spark.databricks.delta.lastCommitVersionInSession")

If no commits have been made by the SparkSession , querying the key returns an empty value.

NOTE
If you share the same SparkSession across multiple threads, it’s similar to sharing a variable across multiple threads;
you may hit race conditions as the configuration value is updated concurrently.
Delta Lake APIs
7/21/2022 • 2 minutes to read

For most read and write operations on Delta tables, you can use Apache Spark reader and writer APIs. For
examples, see Table batch reads and writes and Table streaming reads and writes.
However, there are some operations that are specific to Delta Lake and you must use Delta Lake APIs. For
examples, see Table utility commands.

NOTE
Some Delta Lake APIs are still evolving and are indicated with the Evolving qualifier in the API docs.

Azure Databricks ensures binary compatibility between the Delta Lake project and Delta Lake in Databricks
Runtime. To view the Delta Lake API version packaged in each Databricks Runtime version and links to the API
documentation, see the Delta Lake API compatibility matrix.
Concurrency control
7/21/2022 • 4 minutes to read

Delta Lake provides ACID transaction guarantees between reads and writes. This means that:
Multiple writers across multiple clusters can simultaneously modify a table partition and see a consistent
snapshot view of the table, and there will be a serial order for these writes.
Readers continue to see a consistent snapshot view of the table that the Azure Databricks job started with,
even when a table is modified during a job.

Optimistic concurrency control


Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes. Under this
mechanism, writes operate in three stages:
1. Read : Reads (if needed) the latest available version of the table to identify which files need to be modified
(that is, rewritten).
2. Write : Stages all the changes by writing new data files.
3. Validate and commit : Before committing the changes, checks whether the proposed changes conflict with
any other changes that may have been concurrently committed since the snapshot that was read. If there are
no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation
succeeds. However, if there are conflicts, the write operation fails with a concurrent modification exception
rather than corrupting the table as would happen with the write operation on a Parquet table.
The isolation level of a table defines the degree to which a transaction must be isolated from modifications
made by concurrent operations. For information on the isolation levels supported by Delta Lake on Azure
Databricks, see Isolation levels.
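
Because a conflicting transaction fails the write rather than corrupting the table, a common pattern is to catch the concurrent modification exception and retry. The sketch below is a hedged illustration only: the update_with_retry helper is hypothetical, and the delta.exceptions import path follows the open source delta-spark package and is an assumption here.

Python

# Hedged sketch: retry a Delta write that fails during the validate-and-commit stage.
from delta.exceptions import ConcurrentAppendException  # assumed delta-spark import path

def update_with_retry(sql_statement, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return spark.sql(sql_statement)  # for example, an UPDATE or MERGE on a Delta table
        except ConcurrentAppendException:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Another writer committed first; the statement is simply retried
            # against the new snapshot of the table.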

Write conflicts
The following table describes which pairs of write operations can conflict in each isolation level.

                           | INSERT | UPDATE, DELETE, MERGE INTO | OPTIMIZE
INSERT                     | Cannot conflict | |
UPDATE, DELETE, MERGE INTO | Can conflict in Serializable, cannot conflict in WriteSerializable if it writes to the table without reading first | Can conflict in Serializable and WriteSerializable |
OPTIMIZE                   | Cannot conflict | Can conflict in Serializable and WriteSerializable | Can conflict in Serializable and WriteSerializable

Avoid conflicts using partitioning and disjoint command conditions


In all cases marked “can conflict”, whether the two operations will conflict depends on whether they operate on
the same set of files. You can make the two sets of files disjoint by partitioning the table by the same columns as
those used in the conditions of the operations. For example, the two commands
UPDATE table WHERE date > '2010-01-01' ... and DELETE table WHERE date < '2010-01-01' will conflict if the
table is not partitioned by date, as both can attempt to modify the same set of files. Partitioning the table by
date will avoid the conflict. Hence, partitioning a table according to the conditions commonly used on the
command can reduce conflicts significantly. However, partitioning a table by a column that has high cardinality
can lead to other performance issues due to a large number of subdirectories.
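
As a minimal sketch (the DataFrame df and the path are hypothetical), writing the table partitioned by the column used in the command conditions keeps the files touched by date-restricted commands disjoint:

Python

# Partition the table by `date` so that commands restricted to disjoint date ranges
# operate on disjoint sets of files (df and the path are hypothetical).
(df.write
   .format("delta")
   .partitionBy("date")
   .save("/data/events_by_date"))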

Conflict exceptions
When a transaction conflict occurs, you will observe one of the following exceptions:
ConcurrentAppendException
ConcurrentDeleteReadException
ConcurrentDeleteDeleteException
MetadataChangedException
ConcurrentTransactionException
ProtocolChangedException
ConcurrentAppendException
This exception occurs when a concurrent operation adds files in the same partition (or anywhere in an
unpartitioned table) that your operation reads. The file additions can be caused by INSERT , DELETE , UPDATE , or
MERGE operations.

With the default isolation level of WriteSerializable , files added by blind INSERT operations (that is, operations
that blindly append data without reading any data) do not conflict with any operation, even if they touch the
same partition (or anywhere in an unpartitioned table). If the isolation level is set to Serializable , then blind
appends may conflict.
This exception is often thrown during concurrent DELETE , UPDATE , or MERGE operations. While the concurrent
operations may be physically updating different partition directories, one of them may read the same partition
that the other one concurrently updates, thus causing a conflict. You can avoid this by making the separation
explicit in the operation condition. Consider the following example.

// Target 'deltaTable' is partitioned by date and country
deltaTable.as("t").merge(
    source.as("s"),
    "s.user_id = t.user_id AND s.date = t.date AND s.country = t.country")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

Suppose you run the above code concurrently for different dates or countries. Since each job is working on an
independent partition on the target Delta table, you don’t expect any conflicts. However, the condition is not
explicit enough and can scan the entire table and can conflict with concurrent operations updating any other
partitions. Instead, you can rewrite your statement to add specific date and country to the merge condition, as
shown in the following example.

// Target 'deltaTable' is partitioned by date and country
deltaTable.as("t").merge(
    source.as("s"),
    "s.user_id = t.user_id AND s.date = t.date AND s.country = t.country AND t.date = '" + <date> + "' AND t.country = '" + <country> + "'")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
This operation is now safe to run concurrently on different dates and countries.
ConcurrentDeleteReadException
This exception occurs when a concurrent operation deleted a file that your operation read. Common causes are
a DELETE , UPDATE , or MERGE operation that rewrites files.
ConcurrentDeleteDeleteException
This exception occurs when a concurrent operation deleted a file that your operation also deletes. This could be
caused by two concurrent compaction operations rewriting the same files.
MetadataChangedException
This exception occurs when a concurrent transaction updates the metadata of a Delta table. Common causes are
ALTER TABLE operations or writes to your Delta table that update the schema of the table.

ConcurrentTransactionException
This exception occurs if a streaming query using the same checkpoint location is started multiple times
concurrently and the queries try to write to the Delta table at the same time. You should never have two
streaming queries use the same checkpoint location and run at the same time.
ProtocolChangedException
This exception can occur in the following cases:
When your Delta table is upgraded to a new version. For future operations to succeed you may need to
upgrade your Delta Lake version.
When multiple writers are creating or replacing a table at the same time.
When multiple writers are writing to an empty path at the same time.
See Table protocol versioning for more details.
Best practices: Delta Lake
7/21/2022 • 3 minutes to read

This article describes best practices when using Delta Lake.

Provide data location hints


If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is,
a large number of distinct values), then use Z-ORDER BY . Delta Lake automatically lays out the data in the files
based on the column values and uses the layout information to skip irrelevant data while querying.
For details, see Z-Ordering (multi-dimensional clustering).

Choose the right partition column


You can partition a Delta table by a column. The most commonly used partition column is date . Follow these
two rules of thumb for deciding on what column to partition by:
If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you
partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning
strategy.
Amount of data in each partition: You can partition by a column if you expect data in that partition to be at
least 1 GB.
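
As a hedged sketch (the DataFrame and column names are hypothetical), you can estimate a candidate column's cardinality before committing to it as a partition column:

Python

from pyspark.sql import functions as F

# Millions of distinct user IDs would make `userId` a poor partition column;
# a bounded column such as `date` is usually a safer choice.
df.agg(F.countDistinct("userId").alias("distinct_user_ids"),
       F.countDistinct("date").alias("distinct_dates")).show()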

Compact files
If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if
you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also
affect the performance of your file system. Ideally, a large number of small files should be rewritten into a
smaller number of larger files on a regular basis. This is known as compaction.
You can compact a table using the OPTIMIZE command.

Replace the content or schema of a table


Sometimes you may want to replace a Delta table. For example:
You discover the data in the table is incorrect and want to replace the content.
You want to rewrite the whole table to do incompatible schema changes (such as changing column types).
While you can delete the entire directory of a Delta table and create a new table on the same path, it’s not
recommended because:
Deleting a directory is not efficient. A directory containing very large files can take hours or even days to
delete.
You lose all of the content in the deleted files; it's hard to recover if you delete the wrong table.
The directory deletion is not atomic. While you are deleting the table, a concurrent query reading the table
can fail or see a partial table.
If you don’t need to change the table schema, you can delete data from a Delta table and insert your new data,
or update the table to fix the incorrect values.
If you want to change the table schema, you can replace the whole table atomically. For example:
Python

dataframe.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.partitionBy(<your-partition-columns>) \
.saveAsTable("<your-table>") # Managed table
dataframe.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.option("path", "<your-table-path>") \
.partitionBy(<your-partition-columns>) \
.saveAsTable("<your-table>") # External table

SQL

REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) AS SELECT ... -- Managed table
REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) LOCATION "<your-table-path>" AS SELECT ... -- External table

Scala

dataframe.write
.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy(<your-partition-columns>)
.saveAsTable("<your-table>") // Managed table
dataframe.write
.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.option("path", "<your-table-path>")
.partitionBy(<your-partition-columns>)
.saveAsTable("<your-table>") // External table

There are multiple benefits with this approach:


Overwriting a table is much faster because it doesn’t need to list the directory recursively or delete any files.
The old version of the table still exists. If you delete the wrong table you can easily retrieve the old data using
Time Travel.
It’s an atomic operation. Concurrent queries can still read the table while you are deleting the table.
Because of Delta Lake ACID transaction guarantees, if overwriting the table fails, the table will be in its
previous state.
In addition, if you want to delete old files to save storage cost after overwriting the table, you can use VACUUM
to delete them. It’s optimized for file deletion and usually faster than deleting the entire directory.
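
As a minimal sketch (the table name my_table is hypothetical), VACUUM can be run after the overwrite with its default retention threshold:

Python

# Remove files no longer referenced by the table, using the default retention threshold.
spark.sql("VACUUM my_table")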

Spark caching
Databricks does not recommend that you use Spark caching for the following reasons:
You lose any data skipping that can come from additional filters added on top of the cached DataFrame .
The data that gets cached may not be updated if the table is accessed using a different identifier (for example,
you do spark.table(x).cache() but then write to the table using spark.write.save(/some/path)).
Frequently asked questions (FAQ)
7/21/2022 • 5 minutes to read

What is Delta Lake?


Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID
transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on
top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake on Azure Databricks allows you to configure Delta Lake based on your workload patterns and
provides optimized layouts and indexes for fast interactive queries.

How is Delta Lake related to Apache Spark?


Delta Lake sits on top of Apache Spark. The format and the compute layer helps to simplify building big data
pipelines and increase the overall efficiency of your pipelines.

What format does Delta Lake use to store data?


Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta
Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory to
provide ACID transactions.

How can I read and write data with Delta Lake?


You can use your favorite Apache Spark APIs to read and write data with Delta Lake. See Read a table and Write
to a table.

Where does Delta Lake store the data?


When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that location
in Parquet format.

Can I copy my Delta Lake table to another location?


Yes you can copy your Delta Lake table to another location. Remember to copy files without changing the
timestamps to ensure that the time travel with timestamps will be consistent.

Can I stream data directly into and from Delta tables?


Yes, you can use Structured Streaming to directly write data into Delta tables and read from Delta tables. See
Stream data into Delta tables and Stream data from Delta tables.

Does Delta Lake support writes or reads using the Spark Streaming
DStream API?
Delta does not support the DStream API. We recommend Table streaming reads and writes.

When I use Delta Lake, will I be able to port my code to other Spark
platforms easily?
Yes. When you use Delta Lake, you are using open Apache Spark APIs so you can easily port your code to other
Spark platforms. To port your code, replace delta format with parquet format.

How do Delta tables compare to Hive SerDe tables?


Delta tables are managed to a greater degree. In particular, there are several Hive SerDe parameters that Delta
Lake manages on your behalf that you should never specify manually:
ROWFORMAT
SERDE
OUTPUTFORMAT AND INPUTFORMAT
COMPRESSION
STORED AS

What DDL and DML features does Delta Lake not support?
Unsupported DDL features:
ANALYZE TABLE PARTITION
ALTER TABLE [ADD|DROP] PARTITION
ALTER TABLE RECOVER PARTITIONS
ALTER TABLE SET SERDEPROPERTIES
CREATE TABLE LIKE
INSERT OVERWRITE DIRECTORY
LOAD DATA
Unsupported DML features:
INSERT INTO [OVERWRITE] table with static partitions
INSERT OVERWRITE TABLE for table with dynamic partitions
Bucketing
Specifying a schema when reading from a table
Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE

Does Delta Lake support multi-table transactions?


Delta Lake does not support multi-table transactions and foreign keys. Delta Lake supports transactions at the
table level.

How can I change the type of a column?


Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change
column type.
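
As a hedged sketch (the table and column names are hypothetical), one way to do this is to read the table, cast the column, and overwrite the table with overwriteSchema, as in the best practices above:

Python

from pyspark.sql.functions import col

# Rewrite the table with `event_count` cast to BIGINT (names are hypothetical).
(spark.read.table("my_table")
    .withColumn("event_count", col("event_count").cast("bigint"))
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("my_table"))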

What does it mean that Delta Lake supports multi-cluster writes?


It means that Delta Lake does locking to make sure that queries writing to a table from multiple clusters at the
same time won’t corrupt the table. However, it does not mean that if there is a write conflict (for example, update
and delete the same thing) that they will both succeed. Instead, one of writes will fail atomically and the error
will tell you to retry the operation.
Can I modify a Delta table from different workspaces?
Yes, you can concurrently modify the same Delta table from different workspaces. Moreover, if one process is
writing from a workspace, readers in other workspaces will see a consistent view.

Can I access Delta tables outside of Databricks Runtime?


There are two cases to consider: external reads and external writes.
External reads: Delta tables store data encoded in an open format (Parquet), allowing other tools that
understand this format to read the data. However, since other tools do not support the Delta Lake
transaction log, it is likely that they will incorrectly read stale deleted data, uncommitted data, or the
partial results of failed transactions.
In cases where the data is static (that is, there are no active jobs writing to the table), you can use VACUUM
with a retention of ZERO HOURS to clean up any stale Parquet files that are not currently part of the table.
This operation puts the Parquet files present in DBFS into a consistent state such that they can now be
read by external tools.
However, Delta Lake relies on stale snapshots for the following functionality, which will fail when using
VACUUM with zero retention allowance:

Snapshot isolation for readers: Long running jobs will continue to read a consistent snapshot from the
moment the jobs started, even if the table is modified concurrently. Running VACUUM with a retention
less than length of these jobs can cause them to fail with a FileNotFoundException .
Streaming from Delta tables: Streams read from the original files written into a table in order to
ensure exactly once processing. When combined with OPTIMIZE , VACUUM with zero retention can
remove these files before the stream has time to process them, causing it to fail.
For these reasons Databricks recommends using this technique only on static data sets that must be read
by external tools.
External writes: Delta Lake maintains additional metadata in a transaction log to enable ACID transactions
and snapshot isolation for readers. To ensure the transaction log is updated correctly and the proper
validations are performed, writer implementations must strictly adhere to the Delta Transaction Protocol.
Delta Lake in Databricks Runtime ensures ACID guarantees based on the Delta Transaction Protocol.
Whether non-Spark Delta connectors that write to Delta tables can write with ACID guarantees depends
on the connector implementation. For information, see the integration-specific documentation on each
connector's write guarantees.
Delta Lake resources
7/21/2022 • 2 minutes to read

Blog posts and talks


Databricks blog posts and talks

VLDB 2020 paper


Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Examples
The Delta Lake GitHub repository has Scala and Python examples.

Delta Lake transaction log specification


The Delta Lake transaction log has a well-defined open protocol that can be used by any system to read the log.
See Delta Transaction Log Protocol.
For more information about the Delta Lake transaction log, watch this YouTube video (53 minutes).

Data types
Delta Lake supports all of the data types listed in Data types except for Interval data types.
Optimizations
7/21/2022 • 2 minutes to read

Azure Databricks provides optimizations for Delta Lake that accelerate data lake operations, supporting a variety
of workloads ranging from large-scale ETL processing to ad-hoc, interactive queries. Many of these
optimizations take place automatically; you get their benefits simply by using Azure Databricks for your data
lakes.
Optimize performance with file management
Compaction (bin-packing)
Data skipping
Z-Ordering (multi-dimensional clustering)
Tune file size
Notebooks
Improve interactive query performance
Frequently asked questions (FAQ)
Auto Optimize
How Auto Optimize works
Enable Auto Optimize
When to opt in and opt out
Example workflow: Streaming ingest with concurrent deletes or updates
Frequently asked questions (FAQ)
Optimize performance with caching
Delta and Apache Spark caching
Delta cache consistency
Use Delta caching
Cache a subset of the data
Monitor the Delta cache
Configure the Delta cache
Dynamic file pruning
Isolation levels
Set the isolation level
Bloom filter indexes
How Bloom filter indexes work
Configuration
Create a Bloom filter index
Drop a Bloom filter index
Display the list of Bloom filter indexes
Notebook
Low Shuffle Merge
Optimized performance
Optimized data layout
Availability
Optimize join performance
Range join optimization
Skew join optimization
Optimized data transformation
Higher-order functions
Transform complex data types

Additional resources
Cost-based optimizer
Optimize performance with file management
7/21/2022 • 16 minutes to read

To improve query speed, Delta Lake on Azure Databricks supports the ability to optimize the layout of data
stored in cloud storage. Delta Lake on Azure Databricks supports two layout algorithms: bin-packing and Z-
Ordering.
This article describes:
How to run the optimization commands.
How the two layout algorithms work.
How to clean up stale table snapshots.
The FAQ explains why optimization is not automatic and includes recommendations for how often to run
optimize commands.
For notebooks that demonstrate the benefits of optimization, see Optimization examples.
For Delta Lake on Azure Databricks SQL optimization command reference information, see
Databricks Runtime 7.x and above: OPTIMIZE (Delta Lake on Azure Databricks)
Databricks Runtime 5.5 LTS and 6.x: Optimize (Delta Lake on Azure Databricks)

Compaction (bin-packing)
Delta Lake on Azure Databricks can improve the speed of read queries from a table. One way to improve this
speed is to coalesce small files into larger ones. You trigger compaction by running the OPTIMIZE command:
SQL

OPTIMIZE delta.`/data/events`

Python

from delta.tables import *


deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()

Scala

import io.delta.tables._
val deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()

or
SQL

OPTIMIZE events

Python
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().executeCompaction()

Scala

import io.delta.tables._
val deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().executeCompaction()

If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition
predicate using WHERE :
SQL

OPTIMIZE events WHERE date >= '2022-11-18'

Python

from delta.tables import *


deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

Scala

import io.delta.tables._
val deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

NOTE
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect.
Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number
of tuples per file. However, the two measures are most often correlated.
Python and Scala APIs for executing OPTIMIZE operation are available from Databricks Runtime 11.0 and above.

Readers of Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE
removes unnecessary files from the transaction log. OPTIMIZE makes no data-related changes to the table, so a
read before and after an OPTIMIZE has the same results. Performing OPTIMIZE on a table that is a streaming
source does not affect any current or future streams that treat this table as a source. OPTIMIZE returns the file
statistics (min, max, total, and so on) for the files removed and the files added by the operation. The OPTIMIZE
stats also include the Z-Ordering statistics, the number of batches, and the partitions optimized.
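
As a minimal sketch (the table name events matches the examples above), the returned statistics can be displayed directly:

Python

# OPTIMIZE run through spark.sql returns the file statistics described above.
spark.sql("OPTIMIZE events").show(truncate=False)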

NOTE
Available in Databricks Runtime 6.0 and above.

You can also compact small files automatically using Auto Optimize.

Data skipping
Data skipping information is collected automatically when you write data into a Delta table. Delta Lake on Azure
Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster
queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its
effectiveness depends on the layout of your data. For best results, apply Z-Ordering.
For an example of the benefits of Delta Lake on Azure Databricks data skipping and Z-Ordering, see the
notebooks in Optimization examples. By default Delta Lake on Azure Databricks collects statistics on the first 32
columns defined in your table schema. You can change this value using the table property
delta.dataSkippingNumIndexedCols . Adding more columns to collect statistics would add more overhead as you
write files.
Collecting statistics on long strings is an expensive operation. To avoid collecting statistics on long strings, you
can either configure the table property delta.dataSkippingNumIndexedCols to avoid columns containing long
strings, or move columns containing long strings to a position later in the schema than the first
delta.dataSkippingNumIndexedCols columns using ALTER TABLE ALTER COLUMN . See:
Databricks Runtime 7.x and above: ALTER TABLE
Databricks Runtime 5.5 LTS and 6.x: Change columns
For the purposes of collecting statistics, each field within a nested column is considered as an individual column.
You can read more on this article in the blog post: Processing Petabytes of Data in Seconds with Databricks
Delta.
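
As a hedged sketch (the table name events is hypothetical), the number of indexed columns can be lowered so that statistics are collected on only the first few columns of the schema:

Python

# Collect data skipping statistics on only the first 5 columns instead of the default 32.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')")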

Z-Ordering (multi-dimensional clustering)


Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically
used by Delta Lake on Azure Databricks data-skipping algorithms. This behavior dramatically reduces the
amount of data that Delta Lake on Azure Databricks needs to read. To Z-Order data, you specify the columns to
order on in the ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is,
a large number of distinct values), then use ZORDER BY .
You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the
locality drops with each extra column. Z-Ordering on columns that do not have statistics collected on them
would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as
min, max, and count. You can configure statistics collection on certain columns by reordering columns in the
schema, or you can increase the number of columns to collect statistics on. See Data skipping.
NOTE
Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not
guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-
Ordered, another Z-Ordering of that partition will not have any effect.
Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily
data size on disk. The two measures are most often correlated, but there can be situations when that is not the
case, leading to skew in optimize task times.
For example, if you ZORDER BY date and your most recent records are all much wider (for example longer arrays
or string values) than the ones in the past, it is expected that the OPTIMIZE job’s task durations will be skewed, as
well as the resulting file sizes. This is, however, only a problem for the OPTIMIZE command itself; it should not
have any negative impact on subsequent queries.

Tune file size


This section describes how to tune the size of files in Delta tables.
Set a target size
Autotune based on workload
Autotune based on table size
Set a target size

NOTE
Available in Databricks Runtime 8.2 and above.

If you want to tune the size of files in your Delta table, set the table property delta.targetFileSize to the
desired size. If this property is set, all data layout optimization operations will make a best-effort attempt to
generate files of the specified size. Examples here include optimize with Compaction (bin-packing) or Z-
Ordering (multi-dimensional clustering), Auto Compaction, and Optimized Writes.

TABLE PROPERTY

delta.targetFileSize

Type: Size in bytes or higher units.

The target file size. For example, 104857600 (bytes) or 100mb .

Default value: None

For existing tables, you can set and unset properties using the SQL command ALTER TABLE SET TBL
PROPERTIES. You can also set these properties automatically when creating new tables using Spark session
configurations. See Table properties for details.
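
As a minimal sketch (the table name is hypothetical):

Python

# Pin the target file size for all data layout optimization operations on an existing table.
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES ('delta.targetFileSize' = '100mb')")
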
Autotune based on workload

NOTE
Available in Databricks Runtime 8.2 and above.

To minimize the need for manual tuning, Azure Databricks can automatically tune the file size of Delta tables,
based on workloads operating on the table. Azure Databricks can automatically detect if a Delta table has
frequent MERGE operations that rewrite files and may choose to reduce the size of rewritten files in anticipation
of further file rewrites in the future. For example, when executing a MERGE operation, if 9 out of last 10 previous
operations on the table were also MERGEs, then Optimized Writes and Auto Compaction used by MERGE (if
enabled) will generate smaller file sizes than it would otherwise. This helps in reducing the duration of future
MERGE operations.

Autotune is activated after a few rewrite operations have occurred. However, if you anticipate a Delta table will
experience frequent MERGE , UPDATE , or DELETE operations and want this tuning immediately, you can explicitly
tune file sizes for rewrites by setting the table property delta.tuneFileSizesForRewrites . Set this property to
true to always use lower file sizes for all data layout optimization operations on the table. Set it to false to
never tune to lower file sizes, that is, prevent auto-detection from being activated.

TABLE PROPERTY

delta.tuneFileSizesForRewrites

Type: Boolean

Whether to tune file sizes for data layout optimization.

Default value: None

For existing tables, you can set and unset properties using the SQL command ALTER TABLE SET TBL
PROPERTIES. You can also set these properties automatically when creating new tables using Spark session
configurations. See Table properties for details.
Autotune based on table size

NOTE
Available in Databricks Runtime 8.4 and above.

To minimize the need for manual tuning, Azure Databricks automatically tunes the file size of Delta tables based
on the size of the table. Azure Databricks will use smaller file sizes for smaller tables and larger file sizes for
larger tables so that the number of files in the table does not grow too large. Azure Databricks does not
autotune tables that you have tuned with a specific target size or based on a workload with frequent rewrites.
The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned
target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly
from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB.

NOTE
When the target file size for a table grows, existing files are not re-optimized into larger files by the OPTIMIZE command.
A large table can therefore always have some files that are smaller than the target size. If it is required to optimize those
smaller files into larger files as well, you can configure a fixed target file size for the table using the
delta.targetFileSize table property.

When a table is written incrementally, the target file sizes and file counts will be close to the following numbers,
based on table size. The file counts in this table are only an example. The actual results will be different
depending on many factors.
TABLE SIZE | TARGET FILE SIZE | APPROXIMATE NUMBER OF FILES IN TABLE

10 GB | 256 MB | 40
1 TB | 256 MB | 4096
2.56 TB | 256 MB | 10240
3 TB | 307 MB | 12108
5 TB | 512 MB | 17339
7 TB | 716 MB | 20784
10 TB | 1 GB | 24437
20 TB | 1 GB | 34437
50 TB | 1 GB | 64437
100 TB | 1 GB | 114437


Notebooks
For an example of the benefits of optimization, see the following notebooks:
Optimization examples
Delta Lake on Databricks optimizations Python notebook
Delta Lake on Databricks optimizations Scala notebook
Delta Lake on Databricks optimizations SQL notebook

Improve interactive query performance


Delta Lake offers more mechanisms to improve query performance.
Manage data recency
Enhanced checkpoints for low-latency queries
Manage data recency
At the beginning of each query, Delta tables auto-update to the latest version of the table. This process can be
observed in notebooks when the command status reports: Updating the Delta table's state . However, when
running historical analysis on a table, you may not necessarily need up-to-the-last-minute data, especially for
tables where streaming data is being ingested frequently. In these cases, queries can be run on stale snapshots
of your Delta table. This approach can lower latency in getting results from queries.
You can configure how stale your table data is by setting the Spark session configuration
spark.databricks.delta.stalenessLimit with a time string value, for example, 1h , 15m , 1d for 1 hour, 15
minutes, and 1 day respectively. This configuration is session specific, therefore won’t affect other users
accessing this table from other notebooks, jobs, or BI tools. In addition, this setting doesn’t prevent your table
from updating; it only prevents a query from having to wait for the table to update. The update still occurs in the
background, and will share resources fairly across the cluster. If the staleness limit is exceeded, then the query
will block on the table state update.
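
As a minimal sketch, allowing queries in the current session to use a snapshot that is up to one hour stale:

Python

# Session-scoped setting; other users, jobs, and BI tools are unaffected.
spark.conf.set("spark.databricks.delta.stalenessLimit", "1h")
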
Enhanced checkpoints for low-latency queries
Delta Lake writes checkpoints as an aggregate state of a Delta table at an optimized frequency. These
checkpoints serve as the starting point to compute the latest state of the table. Without checkpoints, Delta Lake
would have to read a large collection of JSON files (“delta” files) representing commits to the transaction log to
compute the state of a table. In addition, the column-level statistics Delta Lake uses to perform data skipping are
stored in the checkpoint.

IMPORTANT
Delta Lake checkpoints are different than Structured Streaming checkpoints.

In Databricks Runtime 7.2 and below, column-level statistics are stored in Delta Lake checkpoints as a JSON
column.
In Databricks Runtime 7.3 LTS and above, column-level statistics are stored as a struct. The struct format makes
Delta Lake reads much faster, because:
Delta Lake doesn’t perform expensive JSON parsing to obtain column-level statistics.
Parquet column pruning capabilities significantly reduce the I/O required to read the statistics for a column.
The struct format enables a collection of optimizations that reduce the overhead of Delta Lake read operations
from seconds to tens of milliseconds, which significantly reduces the latency for short queries.
Manage column-level statistics in checkpoints
You manage how statistics are written in checkpoints using the table properties
delta.checkpoint.writeStatsAsJson and delta.checkpoint.writeStatsAsStruct . If both table properties are
false , Delta Lake cannot perform data skipping.

In Databricks Runtime 7.3 LTS and above:


Batch writes write statistics in both JSON and struct format. delta.checkpoint.writeStatsAsJson is true .
delta.checkpoint.writeStatsAsStruct is undefined by default.
Readers use the struct column when available and otherwise fall back to using the JSON column.
For streaming writes:
Databricks Runtime 7.5 and above: write statistics in both JSON format and struct format.
Databricks Runtime 7.3 LTS and 7.4: write statistics in only JSON format (to minimize the impact of
checkpoints on write latency). To also write the struct format, see Trade-offs with statistics in checkpoints.
In Databricks Runtime 7.2 and below, readers only use the JSON column. Therefore, if
delta.checkpoint.writeStatsAsJson is false , such readers cannot perform data skipping.

IMPORTANT
Enhanced checkpoints do not break compatibility with open source Delta Lake readers. However, setting
delta.checkpoint.writeStatsAsJson to false may have implications on proprietary Delta Lake readers. Contact
your vendors to learn more about performance implications.

Trade-offs with statistics in checkpoints


Since writing statistics in a checkpoint has a cost (usually < a minute even for large tables), there is a tradeoff
between the time taken to write a checkpoint and compatibility with Databricks Runtime 7.2 and below. If you
are able to upgrade all of your workloads to Databricks Runtime 7.3 LTS or above you can reduce the cost of
writing a checkpoint by disabling the legacy JSON statistics. This tradeoff is summarized in the following table.
If data skipping is not useful in your application, you can set both properties to false, and no statistics are
collected or written. We do not recommend this configuration.

Enable enhanced checkpoints for Structured Streaming queries


If your Structured Streaming workloads don’t have low latency requirements (subminute latencies), you can
enable enhanced checkpoints by running the following SQL command:

ALTER TABLE [<table-name>|delta.`<path-to-table>`] SET TBLPROPERTIES ('delta.checkpoint.writeStatsAsStruct' = 'true')

If you do not use Databricks Runtime 7.2 or below to query your data, you can also improve the checkpoint
write latency by setting the following table properties:

ALTER TABLE [<table-name>|delta.`<path-to-table>`] SET TBLPROPERTIES (
  'delta.checkpoint.writeStatsAsStruct' = 'true',
  'delta.checkpoint.writeStatsAsJson' = 'false'
)

Disable writes from clusters that write checkpoints without the stats struct
Writers in Databricks Runtime 7.2 and below write checkpoints without the stats struct, which prevents
optimizations for Databricks Runtime 7.3 LTS readers.
To block clusters running Databricks Runtime 7.2 and below from writing to a Delta table, you can upgrade the
Delta table using the upgradeTableProtocol method:
Python

from delta.tables import DeltaTable


delta = DeltaTable.forPath(spark, "path_to_table") # or DeltaTable.forName
delta.upgradeTableProtocol(1, 3)

Scala

import io.delta.tables.DeltaTable
val delta = DeltaTable.forPath(spark, "path_to_table") // or DeltaTable.forName
delta.upgradeTableProtocol(1, 3)

WARNING
Applying the upgradeTableProtocol method prevents clusters running Databricks Runtime 7.2 and below from writing
to your table and this change is irreversible. We recommend upgrading your tables only after you are committed to the
new format. You can try out these optimizations by creating a shallow CLONE of your tables using Databricks Runtime 7.3
LTS.

Once you upgrade the table writer version, writers must obey your settings for
'delta.checkpoint.writeStatsAsStruct' and 'delta.checkpoint.writeStatsAsJson' .

The following table summarizes how to take advantage of enhanced checkpoints in various versions of
Databricks Runtime, table protocol versions, and writer types.

Disable writes from clusters using old checkpoint formats


Writers from Databricks Runtime 7.2 and below can write old format checkpoints, which would prevent
optimizations for Databricks Runtime 7.3 LTS writers. To block clusters running Databricks Runtime 7.2 and
below from writing to a Delta table, you can upgrade the Delta table using the upgradeTableProtocol method:
Python

from delta.tables import DeltaTable


delta = DeltaTable.forPath(spark, "path_to_table") # or DeltaTable.forName
delta.upgradeTableProtocol(1, 3)

Scala

import io.delta.tables.DeltaTable
val delta = DeltaTable.forPath(spark, "path_to_table") // or DeltaTable.forName
delta.upgradeTableProtocol(1, 3)
WARNING
Applying the upgradeTableProtocol method prevents clusters running Databricks Runtime 7.2 and below from writing
to your table. The change is irreversible. Therefore, we recommend upgrading your tables only after you are committed to
the new format. You can try out these optimizations by creating a shallow CLONE of your tables using Databricks Runtime
7.3 LTS:
Databricks Runtime 7.x and above: CREATE TABLE CLONE
Databricks Runtime 5.5 LTS and 6.x: Clone (Delta Lake on Azure Databricks)

Frequently asked questions (FAQ)


Why isn’t OPTIMIZE automatic?
The OPTIMIZE operation starts up many Spark jobs in order to optimize the file sizing via compaction (and
optionally perform Z-Ordering). Since much of what OPTIMIZE does is compact small files, you must first
accumulate many small files before this operation has an effect. Therefore, the OPTIMIZE operation is not run
automatically.
Moreover, running OPTIMIZE , especially with ZORDER , is an expensive operation in time and resources. If
Databricks ran OPTIMIZE automatically or waited to write out data in batches, it would remove the ability to run
low-latency Delta Lake streams (where a Delta table is the source). Many customers have Delta tables that are
never optimized because they only stream data from these tables, obviating the query benefits that OPTIMIZE
would provide.
Lastly, Delta Lake automatically collects statistics about the files that are written to the table (whether through an
OPTIMIZE operation or not). This means that reads from Delta tables leverage this information whether or not
the table or a partition has had the OPTIMIZE operation run on it.
How often should I run OPTIMIZE ?
When you choose how often to run OPTIMIZE , there is a trade-off between performance and cost. You should
run OPTIMIZE more often if you want better end-user query performance (necessarily at a higher cost because
of resource usage). You should run it less often if you want to optimize cost.
We recommend you start by running OPTIMIZE on a daily basis. Then modify your job from there.
What’s the best instance type to run OPTIMIZE (bin-packing and Z-Ordering) on?
Both operations are CPU intensive operations doing large amounts of Parquet decoding and encoding.
For these workloads we recommend the F or Fsv2 series.
Auto Optimize
7/21/2022 • 6 minutes to read

Auto Optimize is an optional set of features that automatically compact small files during individual writes to a
Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively. Auto
Optimize is particularly useful in the following scenarios:
Streaming use cases where latency in the order of minutes is acceptable
MERGE INTO is the preferred method of writing into Delta Lake
CREATE TABLE AS SELECT or INSERT INTO are commonly used operations

How Auto Optimize works


Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.
How Optimized Writes work
Azure Databricks dynamically optimizes Apache Spark partition sizes based on the actual data, and attempts to
write out 128 MB files for each table partition. This is an approximate size and can vary depending on dataset
characteristics.

How Auto Compaction works


After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job
(with 128 MB file sizes instead of the 1 GB file size used in the standard OPTIMIZE ) to further compact files for
partitions that have the most number of small files.

Enable Auto Optimize


You must explicitly enable Optimized Writes and Auto Compaction using one of the following methods:
New table : Set the table properties delta.autoOptimize.optimizeWrite = true and
delta.autoOptimize.autoCompact = true in the CREATE TABLE command.

CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
Existing tables : Set the table properties delta.autoOptimize.optimizeWrite = true and
delta.autoOptimize.autoCompact = true in the ALTER TABLE command.

ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)

All new tables :

set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;

In Databricks Runtime 10.1 and above, the table property delta.autoOptimize.autoCompact also accepts the
values auto and legacy in addition to true and false . When set to auto (recommended), Auto Compaction
uses better defaults, such as setting 32 MB as the target file size (although default behaviors are subject to
change in the future). When set to legacy or true , Auto Compaction uses 128 MB as the target file size.
In addition, you can enable and disable both of these features for Spark sessions with the configurations:
spark.databricks.delta.optimizeWrite.enabled
spark.databricks.delta.autoCompact.enabled

The session configurations take precedence over the table properties allowing you to better control when to opt
in or opt out of these features.
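
As a minimal sketch, enabling both features for the current session with the configurations listed above:

Python

# Session-level settings take precedence over the table properties.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")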

When to opt in and opt out


This section provides guidance on when to opt in and opt out of Auto Optimize features.
When to opt in to Optimized Writes
Optimized Writes aim to maximize the throughput of data being written to a storage service. This can be
achieved by reducing the number of files being written, without sacrificing too much parallelism.
Optimized Writes require the shuffling of data according to the partitioning structure of the target table. This
shuffle naturally incurs additional cost. However, the throughput gains during the write may pay off the cost of
the shuffle. If not, the throughput gains when querying the data should still make this feature worthwhile.
The key part of Optimized Writes is that it is an adaptive shuffle. If you have a streaming ingest use case and
input data rates change over time, the adaptive shuffle will adjust itself accordingly to the incoming data rates
across micro-batches. If you have code snippets where you coalesce(n) or repartition(n) just before you
write out your stream, you can remove those lines.
When to opt in
Streaming use cases where minutes of latency is acceptable
When using SQL commands like MERGE , UPDATE , DELETE , INSERT INTO , CREATE TABLE AS SELECT

When to opt out


When the written data is on the order of terabytes and storage optimized instances are unavailable.
When to opt in to Auto Compaction
Auto Compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has
performed the write. This means that if you have code patterns where you make a write to Delta Lake, and then
immediately call OPTIMIZE , you can remove the OPTIMIZE call if you enable Auto Compaction.
Auto Compaction uses different heuristics than OPTIMIZE . Since it runs synchronously after a write, we have
tuned Auto Compaction to run with the following properties:
Azure Databricks does not support Z-Ordering with Auto Compaction as Z-Ordering is significantly more
expensive than just compaction.
Auto Compaction generates smaller files (128 MB) than OPTIMIZE (1 GB).
Auto Compaction greedily chooses a limited set of partitions that would best leverage compaction. The
number of partitions selected will vary depending on the size of cluster it is launched on. If your cluster has
more CPUs, more partitions can be optimized.
To control the output file size, set the Spark configuration spark.databricks.delta.autoCompact.maxFileSize .
The default value is 134217728 , which sets the size to 128 MB. Specifying the value 104857600 sets the file
size to 100MB.
When to opt in
Streaming use cases where minutes of latency is acceptable.
When you don’t have regular OPTIMIZE calls on your table.
When to opt out
For DBR 10.3 and below: When other writers perform operations like DELETE , MERGE , UPDATE , or
OPTIMIZE concurrently, because auto compaction can cause a transaction conflict for those jobs.

If Auto Compaction fails due to a transaction conflict, Azure Databricks does not fail or retry the
compaction. The corresponding write query (which triggered the Auto Compaction) will succeed even if
the Auto Compaction does not succeed.
In DBR 10.4 and above, this is not an issue: Auto Compaction does not cause transaction conflicts to other
concurrent operations like DELETE , MERGE , or UPDATE . The other concurrent transactions are given
higher priority and will not fail due to Auto Compaction.

Example workflow: Streaming ingest with concurrent deletes or updates

This workflow assumes that you have one cluster running a 24/7 streaming job ingesting data, and one cluster
that runs on an hourly, daily, or ad-hoc basis to delete or update a batch of records. For this use case, Azure
Databricks recommends that you:
Enable Optimized Writes on the table level using

ALTER TABLE <table_name|delta.`table_path`> SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)

This ensures that the files written by the stream and by the delete and update jobs are of optimal size.
Enable Auto Compaction on the session level using the following setting on the job that performs the
delete or update.

spark.sql("set spark.databricks.delta.autoCompact.enabled = true")

This allows files to be compacted across your table. Since it happens after the delete or update, you
mitigate the risks of a transaction conflict.

Frequently asked questions (FAQ)


Does Auto Optimize Z-Order files?
Does Auto Optimize corrupt Z-Ordered files?
If I have Auto Optimize enabled on a table that I’m streaming into, and a concurrent transaction conflicts with
the optimize, will my job fail?
Do I need to schedule OPTIMIZE jobs if Auto Optimize is enabled on my table?
I have many small files. Why is Auto Optimize not compacting them?
Does Auto Optimize Z-Order files?
Auto Optimize performs compaction only on small files. It does not Z-Order files.
Does Auto Optimize corrupt Z-Ordered files?
Auto Optimize ignores files that are Z-Ordered. It only compacts new files.
If I have Auto Optimize enabled on a table that I’m streaming into, and a concurrent transaction conflicts with
the optimize, will my job fail?
No. Transaction conflicts that cause Auto Optimize to fail are ignored, and the stream will continue to operate
normally.
Do I need to schedule OPTIMIZE jobs if Auto Optimize is enabled on my table?
For tables with size greater than 10 TB, we recommend that you keep OPTIMIZE running on a schedule to
further consolidate files, and reduce the metadata of your Delta table. Since Auto Optimize does not support Z-
Ordering, you should still schedule OPTIMIZE ... ZORDER BY jobs to run periodically.
I have many small files. Why is Auto Optimize not compacting them?
By default, Auto Optimize does not begin compacting until it finds more than 50 small files in a directory. You
can change this behavior by setting spark.databricks.delta.autoCompact.minNumFiles . Having many small files is
not always a problem, since it can lead to better data skipping, and it can help minimize rewrites during merges
and deletes. However, having too many small files might be a sign that your data is over-partitioned.
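For example, to have Auto Compaction start once a directory accumulates at least 10 small files in the current session (an illustrative value), run:

SET spark.databricks.delta.autoCompact.minNumFiles = 10;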
Optimize performance with caching

The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast
intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote
location. Successive reads of the same data are then performed locally, which results in significantly improved
reading speed.
The Delta cache works for all Parquet files and is not limited to Delta Lake format files. The Delta cache supports
reading Parquet files in DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake
Storage Gen2. It does not support other storage formats such as CSV, JSON, and ORC.

Delta and Apache Spark caching


There are two types of caching available in Azure Databricks: Delta caching and Spark caching. Here are the
characteristics of each type:
Type of stored data: The Delta cache contains local copies of remote data. It can improve the performance
of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can
store the result of any subquery data and data stored in formats other than Parquet (such as CSV, JSON, and
ORC).
Performance: The data stored in the Delta cache can be read and operated on faster than the data in the
Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the
optimal format for further processing using whole-stage code generation.
Automatic vs manual control: When the Delta cache is enabled, data that has to be fetched from a remote
source is automatically added to the cache. This process is fully transparent and does not require any action.
However, to preload data into the cache beforehand, you can use the CACHE SELECT command (see Cache a
subset of the data). When you use the Spark cache, you must manually specify the tables and queries to
cache.
Disk vs memory-based: The Delta cache is stored on the local disk, so that memory is not taken away
from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be
fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.

NOTE
You can use Delta caching and Apache Spark caching at the same time.

Summary
The following table summarizes the key differences between Delta and Apache Spark caching so that you can
choose the best tool for your workflow:

FEATURE | DELTA CACHE | APACHE SPARK CACHE
Stored as | Local files on a worker node. | In-memory blocks, but it depends on storage level.
Applied to | Any Parquet table stored on WASB and other file systems. | Any DataFrame or RDD.
Triggered | Automatically, on the first read (if cache is enabled). | Manually, requires code changes.
Evaluated | Lazily. | Lazily.
Force cache | CACHE SELECT command | .cache + any action to materialize the cache and .persist .
Availability | Can be enabled or disabled with configuration flags, disabled on certain node types. | Always available.
Evicted | Automatically in LRU fashion or on any file change, manually when restarting a cluster. | Automatically in LRU fashion, manually with unpersist .

Delta cache consistency


The Delta cache automatically detects when data files are created or deleted and updates its content accordingly.
You can write, modify, and delete table data with no need to explicitly invalidate cached data.
The Delta cache automatically detects files that have been modified or overwritten after being cached. Any stale
entries are automatically invalidated and evicted from the cache.

Use Delta caching


The recommended (and easiest) way to use Delta caching is to choose a Delta Cache Accelerated worker type
when you configure your cluster. Such workers are enabled and configured for Delta caching.
The Delta cache is configured to use at most half of the space available on the local SSDs provided with the
worker nodes. For configuration options, see Configure the Delta cache.

Cache a subset of the data


To explicitly select a subset of data to be cached, use the following syntax:

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

You don’t need to use this command for the Delta cache to work correctly (the data will be cached automatically
when first accessed). But it can be helpful when you require consistent query performance.
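For instance, to warm the cache ahead of time with a commonly queried slice of a table (the table and column names here are placeholders), you could run:

-- Table and column names are illustrative
CACHE SELECT id, amount FROM sales.transactions WHERE event_date >= '2022-01-01'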
For examples and more details, see
Databricks Runtime 7.x and above: CACHE SELECT
Databricks Runtime 5.5 LTS and 6.x: Cache Select (Delta Lake on Azure Databricks)

Monitor the Delta cache


You can check the current state of the Delta cache for each of the executors in the Storage tab of the Spark UI.
The first table summarizes the following metrics for each of the active executor nodes:
Disk Usage: The total size used by the Delta cache manager for storing Parquet data pages.
Max Disk Usage Limit: The maximum size of the disk that can be allocated to the Delta cache manager for storing Parquet data pages.
Percent Disk Usage: The fraction of disk space used by the Delta cache manager out of the maximum size that can be allocated for storing Parquet data pages. When a node reaches 100% disk usage, the cache manager discards the least recently used cache entries to make space for new data.
Metadata Cache Size: The total size used for caching Parquet metadata (file footers).
Max Metadata Cache Size Limit: The maximum size of the disk that can be allocated to the Delta cache manager for caching Parquet metadata (file footers).
Percent Metadata Usage: The fraction of disk space used by the Delta cache manager out of the maximum size that can be allocated for Parquet metadata (file footers).
Data Read from IO Cache (Cache Hits): The total size of Parquet data read from the IO cache for this node.
Data Written to IO Cache (Cache Misses): The total size of Parquet data not found in and consequently written to the IO cache for this node.
Cache Hit Ratio: The fraction of Parquet data read from IO cache out of all Parquet data read for this node.
The second table summarizes the following metrics for all nodes across the cluster runtime, including nodes not
currently active:
Data Read from External Filesystem (All Formats): The total size of data read of any format from an external filesystem, that is, not from the IO cache.
Data Read from IO Cache (Cache Hits): The total size of Parquet data read from the IO cache across the cluster runtime.
Data Written to IO Cache (Cache Misses): The total size of Parquet data not found in and consequently written to the IO cache across the cluster runtime.
Cache Hit Ratio: The fraction of total Parquet data read from IO cache out of all Parquet data read across the cluster runtime.
Estimated Size of Repeatedly Read Data: The approximate size of data read two or more times. This column is displayed only if spark.databricks.io.cache.estimateRepeatedReads is true .
Cache Metadata Manager Peak Disk Usage: The peak total size used by the Delta cache manager to run the IO cache.

Configure the Delta cache


Azure Databricks recommends that you choose cache-accelerated worker instance types for your clusters. Such
instances are automatically configured optimally for the Delta cache.

NOTE
When a worker is decommissioned, the Spark cache stored on that worker is lost. So if autoscaling is enabled, there is
some instability with the cache. Spark would then need to reread missing partitions from source as needed.

Configure disk usage


To configure how the Delta cache uses the worker nodes’ local storage, specify the following Spark configuration
settings during cluster creation:
spark.databricks.io.cache.maxDiskUsage: disk space per node reserved for cached data in bytes
spark.databricks.io.cache.maxMetaDataCache: disk space per node reserved for cached metadata in bytes
spark.databricks.io.cache.compression.enabled: whether the cached data is stored in compressed format

Example configuration:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false

Enable or disable the Delta cache


To enable and disable the Delta cache, run:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents
queries from adding new data to the cache and reading data from the cache.
Dynamic file pruning

Dynamic file pruning (DFP) can significantly improve the performance of many queries on Delta tables. DFP is
especially efficient for non-partitioned tables, or for joins on non-partitioned columns. The performance impact
of DFP is often correlated with the clustering of data, so consider using Z-Ordering to maximize the benefit of DFP.
For background and use cases for DFP, see Faster SQL Queries on Delta Lake with Dynamic File Pruning.

NOTE
Available in Databricks Runtime 6.1 and above.

DFP is controlled by the following Apache Spark configuration options:


spark.databricks.optimizer.dynamicFilePruning (default is true ): The main flag that directs the optimizer to
push down DFP filters. When set to false , DFP is not in effect.
spark.databricks.optimizer.deltaTableSizeThreshold (default is 10,000,000,000 bytes (10 GB)): The minimum
size (in bytes) of the Delta table on the probe side of the join required to trigger DFP. If the probe side is not
very large, it is probably not worthwhile to push down the filters; you can simply scan the whole table. You
can find the size of a Delta table by running the DESCRIBE DETAIL table_name command and then looking at
the sizeInBytes column (see the example after this list).
spark.databricks.optimizer.deltaTableFilesThreshold (default is 10 in Databricks Runtime 8.4 and above,
1000 in Databricks Runtime 8.3 and below): The number of files of the Delta table on the probe side of the
join required to trigger dynamic file pruning. When the probe side table contains fewer files than the
threshold value, dynamic file pruning is not triggered. If a table has only a few files, it is probably not
worthwhile to enable dynamic file pruning. You can find the number of files in a Delta table by running the
DESCRIBE DETAIL table_name command and then looking at the numFiles column.
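For example, the following sketch (the table name is a placeholder and the threshold value is illustrative) first inspects the probe-side table and then lowers the size threshold for the current session so that a smaller Delta table can still trigger DFP:

-- Inspect sizeInBytes and numFiles for the probe-side table
DESCRIBE DETAIL probe_table;

-- Lower the probe-side size threshold for this session (value in bytes)
SET spark.databricks.optimizer.deltaTableSizeThreshold = 1000000000;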
Isolation levels

The isolation level of a table defines the degree to which a transaction must be isolated from modifications
made by concurrent transactions. Delta Lake on Azure Databricks supports two isolation levels: Serializable and
WriteSerializable.
Serializable: The strongest isolation level. It ensures that committed write operations and all reads are
Serializable. Operations are allowed as long as there exists a serial sequence of executing them one at a
time that generates the same outcome as that seen in the table. For the write operations, the serial
sequence is exactly the same as that seen in the table’s history.
WriteSerializable (default): A weaker isolation level than Serializable. It ensures only that the write
operations (that is, not reads) are serializable. However, this is still stronger than Snapshot isolation.
WriteSerializable is the default isolation level because it provides a good balance of data consistency and
availability for the most common operations.
In this mode, the content of the Delta table may be different from that which is expected from the
sequence of operations seen in the table history. This is because this mode allows certain pairs of
concurrent writes (say, operations X and Y) to proceed such that the result would be as if Y was
performed before X (that is, serializable between them) even though the history would show that Y was
committed after X. To disallow this reordering, set the table isolation level to be Serializable to cause
these transactions to fail.
Read operations always use snapshot isolation. The write isolation level determines whether or not it is possible
for a reader to see a snapshot of a table that, according to the history, “never existed”.
For the Serializable level, a reader always sees only tables that conform to the history. For the WriteSerializable
level, a reader could see a table that does not exist in the Delta log.
For example, consider txn1, a long running delete and txn2, which inserts data deleted by txn1. txn2 and txn1
complete and they are recorded in that order in the history. According to the history, the data inserted in txn2
should not exist in the table. For Serializable level, a reader would never see data inserted by txn2. However, for
the WriteSerializable level, a reader could at some point see the data inserted by txn2.
For more information on which types of operations can conflict with each other in each isolation level and the
possible errors, see Concurrency control.

Set the isolation level


You set the isolation level using the ALTER TABLE command.

ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.isolationLevel' = <level-name>)

where <level-name> is Serializable or WriteSerializable .


For example, to change the isolation level from the default WriteSerializable to Serializable , run:

ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')


Bloom filter indexes

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns,
particularly for fields containing arbitrary text.

How Bloom filter indexes work


The Bloom filter operates by either stating that data is definitively not in the file, or that it is probably in the file,
with a defined false positive probability (FPP).
Azure Databricks supports file level Bloom filters; each data file can have a single Bloom filter index file
associated with it. Before reading a file Azure Databricks checks the index file and the file is read only if the index
indicates that the file might match a data filter. Azure Databricks always reads the data file if an index does not
exist or if a Bloom filter is not defined for a queried column.
The size of a Bloom filter depends on the number of elements in the set for which the Bloom filter has been
created and the required FPP. The lower the FPP, the higher the number of bits used per element and the more
accurate the filter is, at the cost of more disk space and slower downloads. For example, an FPP of 10% requires
5 bits per element.
A Bloom filter index is an uncompressed Parquet file that contains a single row. Indexes are stored in the
_delta_index subdirectory relative to the data file and use the same name as the data file with the suffix
index.v1.parquet . For example, the index for data file dbfs:/db1/data.0001.parquet.snappy would be named
dbfs:/db1/_delta_index/data.0001.parquet.snappy.index.v1.parquet .

Bloom filters support columns with the following (input) data types: byte , short , int , long , float , double ,
date , timestamp , and string . Nulls are not added to the Bloom filter, so any null related filter requires reading
the data file. Azure Databricks supports the following data source filters: and , or , in , equals , and
equalsnullsafe . Bloom filters are not supported on nested columns.

Configuration
Bloom filters are enabled by default. To disable Bloom filters, set the session level
spark.databricks.io.skipping.bloomFilter.enabled configuration to false .
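For example, to opt the current session out of Bloom filter indexing, run:

SET spark.databricks.io.skipping.bloomFilter.enabled = false;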

Create a Bloom filter index


Databricks Runtime 7.x and above: CREATE BLOOM FILTER INDEX (Delta Lake on Azure Databricks)
Databricks Runtime 5.5 LTS and 6.x: Create Bloom Filter Index (Delta Lake on Azure Databricks)
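As a quick orientation, the statement has the following general shape (a sketch; the table name, column name, and option values are placeholders, and the linked reference pages document the full syntax):

CREATE BLOOMFILTER INDEX
ON TABLE table_name
FOR COLUMNS(column_name OPTIONS (fpp=0.1, numItems=50000000))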

Drop a Bloom filter index


Databricks Runtime 7.x and above: DROP BLOOM FILTER INDEX (Delta Lake on Azure Databricks)
Databricks Runtime 5.5 LTS and 6.x: Drop Bloom Filter Index (Delta Lake on Azure Databricks)

Display the list of Bloom filter indexes


To display the list of indexes, run:
spark.table("<table-with-indexes>").schema.foreach(field => println(s"${field.name}:
metadata=${field.metadata}"))


Notebook
The following notebook demonstrates how defining a Bloom filter index speeds up “needle in a haystack”
queries.
Bloom filter demo notebook
Get notebook
Optimize join performance

Delta Lake on Azure Databricks optimizes range and skew joins. Range join optimizations require tuning based
on your query patterns and you can make your skew joins efficient with skew hints. See the following articles to
learn how to make best use of these join optimizations:
Range join optimization
Skew join optimization
See also:
Join hints
Join Hints on the Apache Spark website
Range join optimization

A range join occurs when two relations are joined using a point in interval or interval overlap condition. The
range join optimization support in Databricks Runtime can bring orders of magnitude improvement in query
performance, but requires careful manual tuning.

Point in interval range join


A point in interval range join is a join in which the condition contains predicates specifying that a value from one
relation is between two values from the other relation. For example:

-- using BETWEEN expressions


SELECT *
FROM points JOIN ranges ON points.p BETWEEN ranges.start and ranges.end;

-- using inequality expressions


SELECT *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;

-- with fixed length interval


SELECT *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.start + 100;

-- join two sets of point values within a fixed distance from each other
SELECT *
FROM points1 p1 JOIN points2 p2 ON p1.p >= p2.p - 10 AND p1.p <= p2.p + 10;

-- a range condition together with other join conditions


SELECT *
FROM points, ranges
WHERE points.symbol = ranges.symbol
AND points.p >= ranges.start
AND points.p < ranges.end;

Interval overlap range join


An interval overlap range join is a join in which the condition contains predicates specifying an overlap of
intervals between two values from each relation. For example:

-- overlap of [r1.start, r1.end] with [r2.start, r2.end]


SELECT *
FROM r1 JOIN r2 ON r1.start < r2.end AND r2.start < r1.end;

-- overlap of fixed length intervals


SELECT *
FROM r1 JOIN r2 ON r1.start < r2.start + 100 AND r2.start < r1.start + 100;

-- a range condition together with other join conditions


SELECT *
FROM r1 JOIN r2 ON r1.symbol = r2.symbol
AND r1.start <= r2.end
AND r1.end >= r2.start;
Range join optimization
The range join optimization is performed for joins that:
Have a condition that can be interpreted as a point in interval or interval overlap range join.
All values involved in the range join condition are of a numeric type (integral, floating point, decimal), DATE ,
or TIMESTAMP .
All values involved in the range join condition are of the same type. In the case of the decimal type, the values
also need to be of the same scale and precision.
It is an INNER JOIN , or in case of point in interval range join, a LEFT OUTER JOIN with point value on the left
side, or RIGHT OUTER JOIN with point value on the right side.
Have a bin size tuning parameter.
Bin size
The bin size is a numeric tuning parameter that splits the values domain of the range condition into multiple
bins of equal size. For example, with a bin size of 10, the optimization splits the domain into bins that are
intervals of length 10. If you have a point in range condition of p BETWEEN start AND end , and start is 8 and
end is 22, this value interval overlaps with three bins of length 10 – the first bin from 0 to 10, the second bin
from 10 to 20, and the third bin from 20 to 30. Only the points that fall within the same three bins need to be
considered as possible join matches for that interval. For example, if p is 32, it can be ruled out as falling
between start of 8 and end of 22, because it falls in the bin from 30 to 40.

NOTE
For DATE values, the value of the bin size is interpreted as days. For example, a bin size value of 7 represents a week.
For TIMESTAMP values, the value of the bin size is interpreted as seconds. If a sub-second value is required, fractional
values can be used. For example, a bin size value of 60 represents a minute, and a bin size value of 0.1 represents 100
milliseconds.

You can specify the bin size either by using a range join hint in the query or by setting a session configuration
parameter. The range join optimization is applied only if you manually specify the bin size. Section Choose the
bin size describes how to choose an optimal bin size.

Enable range join using a range join hint


To enable the range join optimization in a SQL query, you can use a range join hint to specify the bin size. The
hint must contain the relation name of one of the joined relations and the numeric bin size parameter. The
relation name can be a table, a view, or a subquery.

SELECT /*+ RANGE_JOIN(points, 10) */ *


FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;

SELECT /*+ RANGE_JOIN(r1, 0.1) */ *


FROM (SELECT * FROM ranges WHERE ranges.amount < 100) r1, ranges r2
WHERE r1.start < r2.start + 100 AND r2.start < r1.start + 100;

SELECT /*+ RANGE_JOIN(c, 500) */ *


FROM a
JOIN b ON (a.b_key = b.id)
JOIN c ON (a.ts BETWEEN c.start_time AND c.end_time)
NOTE
In the third example, you must place the hint on c . This is because joins are left associative, so the query is interpreted
as (a JOIN b) JOIN c , and the hint on a applies to the join of a with b and not the join with c .

The following Python examples show how to place the range join hint with the DataFrame API (the minutes and events tables are small illustrative DataFrames):

from pyspark.sql.types import StructType, StructField, IntegerType

# Create the minutes table
minutes = spark.sparkContext\
    .parallelize(((0, 60),
                  (60, 120)))\
    .toDF(StructType([
        StructField('minute_start', IntegerType()),
        StructField('minute_end', IntegerType())
    ]))

# Create the events table


events = spark.sparkContext\
.parallelize(((12, 33),
(0, 120),
(33, 72),
(65, 178)))\
.toDF(StructType([
StructField('event_start', IntegerType()),
StructField('event_end', IntegerType())
]))

# Range join hint placed on the left (events) table


events.hint("range_join", 60)\
.join(minutes,
on=[events.event_start < minutes.minute_end,
minutes.minute_start < events.event_end])\
.orderBy(events.event_start,
events.event_end,
minutes.minute_start)\
.show()

# Range join hint placed on the joined (minutes) table


events.join(minutes.hint("range_join", 60),
on=[events.event_start < minutes.minute_end,
minutes.minute_start < events.event_end])\
.orderBy(events.event_start,
events.event_end,
minutes.minute_start)\
.show()

You can also place a range join hint on one of the joined DataFrames. In that case, the hint contains just the
numeric bin size parameter.

val df1 = spark.table("ranges").as("left")


val df2 = spark.table("ranges").as("right")

val joined = df1.hint("range_join", 10)


.join(df2, $"left.type" === $"right.type" &&
$"left.end" > $"right.start" &&
$"left.start" < $"right.end")

val joined2 = df1


.join(df2.hint("range_join", 0.5), $"left.type" === $"right.type" &&
$"left.end" > $"right.start" &&
$"left.start" < $"right.end")

Enable range join using session configuration


If you don’t want to modify the query, you can specify the bin size as a configuration parameter.

SET spark.databricks.optimizer.rangeJoin.binSize=5

This configuration parameter applies to any join with a range condition. However, a different bin size set through
a range join hint always overrides the one set through the parameter.

Choose the bin size


The effectiveness of the range join optimization depends on choosing the appropriate bin size.
A small bin size results in a larger number of bins, which helps in filtering the potential matches. However, it
becomes inefficient if the bin size is significantly smaller than the encountered value intervals, and the value
intervals overlap multiple bin intervals. For example, with a condition p BETWEEN start AND end , where start is
1,000,000 and end is 1,999,999, and a bin size of 10, the value interval overlaps with 100,000 bins.
If the length of the interval is fairly uniform and known, we recommend that you set the bin size to the typical
expected length of the value interval. However, if the length of the interval is varying and skewed, a balance
must be found to set a bin size that filters the short intervals efficiently, while preventing the long intervals from
overlapping too many bins. Assuming a table ranges , with intervals that are between columns start and end ,
you can determine different percentiles of the skewed interval length value with the following query:

SELECT APPROX_PERCENTILE(CAST(end - start AS DOUBLE), ARRAY(0.5, 0.9, 0.99, 0.999, 0.9999)) FROM ranges

A recommended bin size is the maximum of: the value at the 90th percentile, the value at the 99th percentile
divided by 10, the value at the 99.9th percentile divided by 100, and so on. The rationale is:
If the value at the 90th percentile is the bin size, only 10% of the value interval lengths are longer than the
bin interval and therefore span more than 2 adjacent bin intervals.
If the value at the 99th percentile is the bin size, only 1% of the value interval lengths span more than 11
adjacent bin intervals.
If the value at the 99.9th percentile is the bin size, only 0.1% of the value interval lengths span more than 101
adjacent bin intervals.
The same can be repeated for the values at the 99.99th, the 99.999th percentile, and so on if needed.
The described method limits the amount of skewed long value intervals that overlap multiple bin intervals. The
bin size value obtained this way is only a starting point for fine tuning; actual results may depend on the specific
workload.
Skew join optimization

Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. Data
skew can severely downgrade performance of queries, especially those with joins. Joins between big tables
require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. It’s likely that data
skew is affecting a query if a query appears to be stuck finishing very few tasks (for example, the last 3 tasks out
of 200). To verify that data skew is affecting a query:
1. Click the stage that is stuck and verify that it is doing a join.
2. After the query finishes, find the stage that does a join and check the task duration distribution.
3. Sort the tasks by decreasing duration and check the first few tasks. If one task took much longer to complete
than the other tasks, there is skew.
To ameliorate skew, Delta Lake on Azure Databricks SQL accepts skew hints in queries. With the information
from a skew hint, Databricks Runtime can construct a better query plan, one that does not suffer from data
skew.

NOTE
With Databricks Runtime 7.3 and above, skew join hints are not required. Skew is automatically taken care of if adaptive
query execution (AQE) and spark.sql.adaptive.skewJoin.enabled are both enabled. See Adaptive query execution.

Configure skew hint with relation name


A skew hint must contain at least the name of the relation with skew. A relation is a table, view, or a subquery. All
joins with this relation then use skew join optimization.

-- table with skew


SELECT /*+ SKEW('orders') */ * FROM orders, customers WHERE c_custId = o_custId

-- subquery with skew


SELECT /*+ SKEW('C1') */ *
FROM (SELECT * FROM customers WHERE c_custId < 100) C1, orders
WHERE C1.c_custId = o_custId

Configure skew hint with relation name and column names


There might be multiple joins on a relation and only some of them will suffer from skew. Skew join optimization
has some overhead so it is better to use it only when needed. For this purpose, the skew hint accepts column
names. Only joins with these columns use skew join optimization.
-- single column
SELECT /*+ SKEW('orders', 'o_custId') */ *
FROM orders, customers
WHERE o_custId = c_custId

-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */ *
FROM orders, customers
WHERE o_custId = c_custId AND o_storeRegionId = c_regionId

Configure skew hint with relation name, column names, and skew
values
You can also specify skew values in the hint. Depending on the query and data, the skew values might be known
(for example, because they never change) or might be easy to find out. Doing this reduces the overhead of skew
join optimization. Otherwise, Delta Lake detects them automatically.

-- single column, single skew value


SELECT /*+ SKEW('orders', 'o_custId', 0) */ *
FROM orders, customers
WHERE o_custId = c_custId

-- single column, multiple skew values


SELECT /*+ SKEW('orders', 'o_custId', (0, 1, 2)) */ *
FROM orders, customers
WHERE o_custId = c_custId

-- multiple columns, multiple skew values


SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId'), ((0, 1001), (1, 1002))) */ *
FROM orders, customers
WHERE o_custId = c_custId AND o_storeRegionId = c_regionId
Optimized data transformation

Azure Databricks optimizes the performance of higher-order functions and DataFrame operations using nested
types. See the following articles to learn how to get started with these optimized higher-order functions and
complex data types:
Higher-order functions
Transform complex data types
Higher-order functions

Azure Databricks provides dedicated primitives for manipulating arrays in Apache Spark SQL; these make
working with arrays much easier and more concise and do away with the large amounts of boilerplate code
typically required. The primitives revolve around two functional programming constructs: higher-order
functions and anonymous (lambda) functions. These work together to allow you to define functions that
manipulate arrays in SQL. A higher-order function takes an array and defines how the array is processed and
what the result of the computation is. It delegates the processing of each item in the array to a lambda
function.
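For example, the following query (a minimal sketch using the built-in transform higher-order function) applies a lambda function to each element of an array:

-- Increment every element of the array; returns [2, 3, 4]
SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one;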

Introduction to higher-order functions notebook


Get notebook

Higher-order functions tutorial Python notebook


Get notebook

Apache Spark built-in functions


Apache Spark has built-in functions for manipulating complex types (for example, array types), including higher-
order functions.
The following notebook illustrates Apache Spark built-in functions.
Apache Spark built-in functions notebook
Get notebook
Transform complex data types

While working with nested data types, Delta Lake on Azure Databricks optimizes certain transformations out-of-
the-box. The following notebooks contain many examples on how to convert between complex and primitive
data types using functions natively supported in Apache Spark SQL.
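As a small illustration of this kind of conversion (a sketch that uses only built-in Spark SQL functions; the values are made up), the following query serializes a struct to a JSON string and parses a JSON string back into a struct:

SELECT
  to_json(named_struct('id', 1, 'tags', array('a', 'b'))) AS json_string,
  from_json('{"id": 1, "tags": ["a", "b"]}', 'id INT, tags ARRAY<STRING>') AS parsed_struct;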

Transforming complex data types Python notebook


Get notebook

Transforming complex data types Scala notebook


Get notebook

Transforming complex data types SQL notebook


Get notebook
Table protocol versioning

The transaction log for a Delta table contains protocol versioning information that supports Delta Lake
evolution. Delta Lake tracks minimum reader and writer versions separately.
Delta Lake guarantees backward compatibility. A higher protocol version of Delta Lake reader is always able to
read data that was written by a lower protocol version.
Delta Lake will occasionally break forward compatibility. Lower protocol versions of Delta Lake may not be able
to read and write data that was written by a higher protocol version of Delta Lake. If you try to read and write to
a table with a protocol version of Delta Lake that is too low, you’ll get an error telling you that you need to
upgrade.
When creating a table, Delta Lake chooses the minimum required protocol version based on table characteristics
such as the schema or table properties. You can also set the default protocol versions by setting the SQL
configurations:
spark.databricks.delta.properties.defaults.minWriterVersion = 2 (default)
spark.databricks.delta.properties.defaults.minReaderVersion = 1 (default)
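For example, to make tables created in the current session default to a higher writer protocol version (an illustrative value; see the feature table below for the versions each feature requires), you could run:

SET spark.databricks.delta.properties.defaults.minWriterVersion = 5;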
To upgrade a table to a newer protocol version, use the DeltaTable.upgradeTableProtocol method:

WARNING
Protocol version upgrades are irreversible, and upgrading the protocol version may break the existing Delta Lake table
readers, writers, or both. Therefore, we recommend you upgrade specific tables only when needed, such as to opt-in to
new features in Delta Lake. You should also check to make sure that all of your current and future production tools
support Delta Lake tables with the new protocol version.

SQL
-- Upgrades the reader protocol version to 1 and the writer protocol version to 3.
ALTER TABLE <table_identifier> SET TBLPROPERTIES('delta.minReaderVersion' = '1', 'delta.minWriterVersion' =
'3')

Python
from delta.tables import DeltaTable
delta = DeltaTable.forPath(spark, "path_to_table") # or DeltaTable.forName
delta.upgradeTableProtocol(1, 3) # upgrades to readerVersion=1, writerVersion=3

Scala
import io.delta.tables.DeltaTable
val delta = DeltaTable.forPath(spark, "path_to_table") // or DeltaTable.forName
delta.upgradeTableProtocol(1, 3) // Upgrades to readerVersion=1, writerVersion=3.
Features by protocol version
FEATURE | MINWRITERVERSION | MINREADERVERSION | INTRODUCED IN | DOCUMENTATION
Basic functionality | 2 | 1 | – | Delta Lake guide
CHECK constraints | 3 | 1 | Databricks Runtime 7.4 (Unsupported) | CHECK constraint
Change data feed | 4 | 1 | Databricks Runtime 8.4 (Unsupported) | Change data feed
Generated columns | 4 | 1 | Databricks Runtime 8.3 (Unsupported) | Use generated columns
Column mapping | 5 | 2 | Databricks Runtime 10.2 (Unsupported) | Delta column mapping
Optimization examples

For an example of the benefits of optimization, see the following notebooks:

Delta Lake on Databricks optimizations Python notebook


Get notebook

Delta Lake on Databricks optimizations Scala notebook


Get notebook

Delta Lake on Databricks optimizations SQL notebook


Get notebook
Genomics guide

To learn how to get started with genomics on Azure Databricks, see:


Tertiary analytics with Apache Spark
ADAM
Hail
Glow
Databricks Runtime for Genomics (Deprecated) offers secondary analysis pipelines parallelized with Apache
Spark.

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

For documentation on these features along with associated use cases and Azure Databricks example
notebooks, see:
Secondary analysis
DNASeq pipeline
RNASeq pipeline
Tumor/Normal pipeline
Variant annotation methods
Joint genotyping
Joint genotyping pipeline
Features of Databricks Runtime for Genomics have been open-sourced as part of the Databricks-Regeneron
project Glow. For information on Glow, see the Glow documentation.
Tertiary analytics with Apache Spark

Use the following guides to get started with open source libraries that extend Apache Spark for genomics on
Databricks Runtime.
ADAM
Hail
Create a cluster
Use Hail in a notebook
Glow
Sync Glow notebooks to your workspace
Set up a Glow environment
Get started with Glow
Set up automated jobs
ADAM

ADAM is a library for genomic data processing on Apache Spark. It is used to implement pipelines that operate
on genomic read data such as BAM, SAM, and CRAM files.
To use ADAM in Azure Databricks:
1. Launch a Databricks Runtime cluster with these Spark configurations:

# Hadoop configs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.bdgenomics.adam.serialization.ADAMKryoRegistrator
spark.hadoop.hadoopbam.bam.enable-bai-splitter true

2. Install the cluster libraries:


Maven: org.bdgenomics.adam:adam-apis-spark3_2.12:<version>
PyPI: bdgenomics.adam
Hail

Hail is a library built on Apache Spark for analyzing large genomic datasets.

IMPORTANT
When you use Hail 0.2.65 and above, use Apache Spark version 3.1 (Databricks Runtime 8.x or 9.x)
Install Hail on Databricks Runtime, not Databricks Runtime for Genomics (deprecated)
Hail is not supported with Credential passthrough
Hail is not supported with Glow, except when exporting from Hail to Glow

Create a cluster
Install Hail via Docker with Databricks Container Services.
For containers to set up a Hail environment, see the ProjectGlow Dockerhub page. Use
projectglow/databricks-hail:<hail_version> , replacing the tag with an available Hail version.

1. Create a jobs cluster with Hail

a. Set up the Databricks CLI.
b. Create a cluster using the Hail Docker container, setting the tag to the desired <hail_version> .
c. An example job definition is given below; edit notebook_path, the Databricks Runtime version
<databricks_runtime_version> , and <hail_version> as needed.

databricks jobs create --json-file hail-create-job.json

hail-create-job.json :

{
  "name": "hail",
  "notebook_task": {
    "notebook_path": "/Users/<user@organization.com>/hail/docs/hail-tutorial"
  },
  "new_cluster": {
    "spark_version": "<databricks_runtime_version>.x-scala2.12",
    "azure_attributes": {
      "availability": "SPOT_WITH_FALLBACK_AZURE",
      "spot_bid_max_price": -1
    },
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 32,
    "docker_image": {
      "url": "projectglow/databricks-hail:<hail_version>"
    }
  }
}

Use Hail in a notebook


For the most part, Hail in Azure Databricks works as described in the Hail documentation. However, a few
modifications are necessary for the Azure Databricks environment.
Initialize Hail
When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This
setting enables multiple Azure Databricks notebooks to use the same Hail context.

NOTE
Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is supported only in
Hail 0.2.39 and above.

import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)

Display Bokeh plots


Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Azure
Databricks. To display a Bokeh plot generated by Hail, you can run a command like:

from bokeh.embed import components, file_html


from bokeh.resources import CDN
plot = hl.plot.histogram(mt.DP, range=(0,30), bins=30, title='DP Histogram', legend='DP')
html = file_html(plot, CDN, "Chart")
displayHTML(html)

See Bokeh for more information.


Glow

Glow is an open source project created in collaboration between Databricks and the Regeneron Genetics Center.
For information on features in Glow, see the Glow documentation.

Sync Glow notebooks to your workspace


1. Fork the Glow github repo.
2. Clone your fork to your Databricks workspace using Repos.
3. The notebooks are located under docs/source/_static .

Set up a Glow environment


Install Glow on an Azure Databricks cluster via Docker with Databricks Container Services.
You can find containers on the ProjectGlow Dockerhub page. These set up environments with Glow and other
libraries that were included in Databricks Runtime for Genomics (deprecated). Use
projectglow/databricks-glow:<databricks_runtime_version> , replacing the tag with an available Databricks
Runtime version.
Or install both of these cluster libraries:
Maven: io.projectglow:glow-spark3_2.12:<version>
PyPI: glow.py==<version>
IMPORTANT
If you install Glow as a stand-alone PyPI package, install it as a cluster library, not as a notebook-scoped library
using the %pip magic command.
Ensure that both Maven coordinates and PyPI package are included on the cluster, and that the versions for each
match.
Install the latest version of Glow on Databricks Runtime, not Databricks Runtime for Genomics (deprecated), which has
Glow v0.6 installed by default.
Do not install Hail on a cluster with Glow, except when extracting genotypes from a Hail Matrix Table.

Get started with Glow


Databricks recommends that you run the test notebooks on the test data provided by the notebooks before
moving on to real data. These notebooks are tested nightly with the latest version of the Glow Docker container.

IMPORTANT
Checkpoint to Delta Lake after ingesting or transforming genotype data.

Set up automated jobs


After you run the sample notebooks, and then apply the code to real data, you are ready to automate the steps
in your pipeline by using jobs.

IMPORTANT
Start small. Experiment on individual variants, samples or chromosomes.
Steps in your pipeline might require a different cluster configuration, depending on the type of computation
performed.

TIP
Use compute-optimized virtual machines to read variant data from cloud object stores.
Use Delta Cache accelerated virtual machines to query variant data.
Use memory-optimized virtual machines for genetic association studies.
Clusters with small machines have a better price-performance ratio when compared with large machines.
The Glow Pipe Transformer supports parallelization of deep learning tools that run on GPUs.

The following example cluster configuration runs a genetic association study on a single chromosome. Edit the
notebook_path and <databricks_runtime_version> as needed.

databricks jobs create --json-file glow-create-job.json

glow-create-job.json :
{
  "name": "glow_gwas",
  "notebook_task": {
    "notebook_path": "/Users/<user@organization.com>/glow/docs/source/_static/notebooks/tertiary/gwas-quantitative",
    "base_parameters": {
      "allele_freq_cutoff": 0.01
    }
  },
  "new_cluster": {
    "spark_version": "<databricks_runtime_version>.x-scala2.12",
    "azure_attributes": {
      "first_on_demand": 1,
      "availability": "SPOT_WITH_FALLBACK_AZURE",
      "spot_bid_max_price": -1
    },
    "node_type_id": "Standard_E8s_v3",
    "num_workers": 32,
    "spark_conf": {
      "spark.sql.execution.arrow.maxRecordsPerBatch": 100
    },
    "docker_image": {
      "url": "projectglow/databricks-glow:<databricks_runtime_version>"
    }
  }
}
Secondary analysis

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

Databricks Runtime for Genomics (Deprecated) contains pre-packaged pipelines to align reads and detect and
annotate variants in individual samples, parallelized using Apache Spark.
DNASeq pipeline
Setup
Reference genomes
Parameters
Customization
Manifest format
Supported input formats
Output
Troubleshooting
Run programmatically
RNASeq pipeline
Setup
Reference genomes
Parameters
Walkthrough
Additional usage info and troubleshooting
Tumor/Normal pipeline
Walkthrough
Setup
Reference genomes
Parameters
Manifest format
Additional usage info and troubleshooting
Variant annotation methods
Pre-packaged SnpEff annotation pipeline
Pre-packaged VEP annotation pipeline
Variant Annotation using Pipe Transformer
DNASeq pipeline

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

NOTE
The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in lower
versions of Databricks Runtime for Genomics, see the release notes.

The Azure Databricks DNASeq pipeline is a GATK best practices compliant pipeline for short read alignment,
variant calling, and variant annotation. It uses the following software packages, parallelized using Spark.
BWA v0.7.17
ADAM v0.32.0
GATK HaplotypeCaller v4.1.4.1
SnpEff v4.3
For more information about the pipeline implementation and expected runtimes and costs for various option
combinations, see Building the Fastest DNASeq Pipeline at Scale.

Setup
The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:

{
"num_workers": {
"type": "unlimited",
"defaultValue": 13
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_F32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}
The cluster configuration should use Databricks Runtime for Genomics.
The task should be the DNASeq notebook found at the bottom of this page.
For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend
Standard_F32s_v2 VMs.
If you’re running base quality score recalibration, use general purpose (Standard_D32s_v3) instances
instead, since this operation requires more memory.

Reference genomes
You must configure the reference genome using an environment variable. To use GRCh37, set the environment
variable:

refGenomeId=grch37

To use GRCh38 instead, replace grch37 with grch38 .


Custom reference genomes

NOTE
Custom reference genome support is available in Databricks Runtime 6.6 for Genomics and above.

To use a reference build other than GRCh37 or GRCh38, follow these steps:
1. Prepare the reference for use with BWA and GATK.
The reference genome directory contents should include these files:

<reference_name>.dict
<reference_name>.fa
<reference_name>.fa.amb
<reference_name>.fa.ann
<reference_name>.fa.bwt
<reference_name>.fa.fai
<reference_name>.fa.pac
<reference_name>.fa.sa

2. Upload the reference genome files to a directory in cloud storage or DBFS. If you upload the files to cloud
storage, you must mount the directory to a location in DBFS.
3. In your cluster configuration, set an environment variable REF_GENOME_PATH that points to the path of the
fasta file in DBFS. For example,

REF_GENOME_PATH=/mnt/reference-genome/reference.fa

The path must not include a dbfs: prefix.


When you use a custom reference genome, the SnpEff annotation stage is skipped.
TIP
During cluster initialization, the Azure Databricks DNASeq pipeline uses the provided BWA index files to generate an index
image file. If you plan to use the same reference genome many times, you can accelerate cluster startup by building the
index image file ahead of time. This process will reduce cluster startup time by about 30 seconds.
1. Copy the reference genome directory to the driver node of a Databricks Runtime for Genomics cluster.

%sh cp -r /dbfs/<reference_dir_path> /local_disk0/reference-genome

2. Generate the index image file from the BWA index files.

import org.broadinstitute.hellbender.utils.bwa._
BwaMemIndex.createIndexImageFromIndexFiles("/local_disk0/reference-genome/<reference_name>.fa",
"/local_disk0/reference-genome/<reference_name>.fa.img")

3. Copy the index image file to the same directory as the reference fasta files.

%sh cp /local_disk0/reference-genome/<reference_name>.fa.img /dbfs/<reference_dir_path>

4. Delete the unneeded BWA index files ( .amb , .ann , .bwt , .pac , .sa ) from DBFS.

%fs rm <file>

Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed
parameters are documented here; the rest can be found in the DNASeq notebook. After importing the notebook
and setting it as a job task, you can set these parameters for all runs or per-run.

PARAMETER | DEFAULT | DESCRIPTION
manifest | n/a | The manifest describing the input.
output | n/a | The path where pipeline output should be written.
replayMode | skip | One of: skip (stages are skipped if output already exists) or overwrite (existing output is deleted).
exportVCF | false | If true, the pipeline writes results in VCF as well as Delta Lake.
referenceConfidenceMode | NONE | One of: NONE (only variant sites are included in the output), GVCF (all sites are included, with adjacent reference sites banded), or BP_RESOLUTION (all sites are included).
perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: ‘s’ for seconds, ‘m’ for minutes, or ‘h’ for hours. For example, ‘60m’ results in a timeout of 60 minutes.

TIP
To optimize run time, set the spark.sql.shuffle.partitions Spark configuration to three times the number of cores of
the cluster.

Customization
You can customize the DNASeq pipeline by disabling read alignment, variant calling, and variant annotation. By
default, all three stages are enabled.

val pipeline = new DNASeqPipeline(align = true, callVariants = true, annotate = true)

To disable variant annotation, set the pipeline as follows:

val pipeline = new DNASeqPipeline(align = true, callVariants = true, annotate = false)

The permitted stage combinations are:

READ ALIGNMENT | VARIANT CALLING | VARIANT ANNOTATION
true | true | true
true | true | false
true | false | false
false | true | true
false | true | false

Manifest format
NOTE
Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.

The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. For example:

file_path,sample_id,paired_end,read_group_id
*_R1_*.fastq.bgz,HG001,1,read_group
*_R2_*.fastq.bgz,HG001,2,read_group

If your input consists of unaligned BAM files, you should omit the paired_end field:

file_path,sample_id,paired_end,read_group_id
*.bam,HG001,,read_group

TIP
If the provided manifest is a file, the file_path field in each row can be an absolute path or a path relative to the
manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs
(*) to match many files.

Supported input formats


SAM
BAM
CRAM
Parquet
FASTQ
bgzip *.fastq.bgz (recommended); bgzipped files with the *.fastq.gz extension are recognized as bgz.
uncompressed *.fastq
gzip *.fastq.gz

IMPORTANT
Gzipped files are not splittable. Choose autoscaling clusters to minimize cost for these files.

To block compress a FASTQ, install htslib, which includes the bgzip executable.

Output
The aligned reads, called variants, and annotated variants are all written out to Delta tables inside the provided
output directory if the corresponding stages are enabled. Each table is partitioned by sample ID. In addition, if
you configured the pipeline to export BAMs or VCFs, they’ll appear under the output directory as well.
|---alignments
|---sampleId=HG001
|---Parquet files
|---alignments.bam
|---HG001.bam
|---annotations
|---Delta files
|---annotations.vcf
|---HG001.vcf
|---genotypes
|---Delta files
|---genotypes.vcf
|---HG001.vcf

When you run the pipeline on a new sample, it’ll appear as a new partition. If you run the pipeline for a sample
that already appears in the output directory, that partition will be overwritten.
Since all the information is available in Delta Lake, you can easily analyze it with Spark in Python, R, Scala, or
SQL. For example:
Python

# Load the data


df = spark.read.format("delta").load("/genomics/output_dir/genotypes")
# Show all variants from chromosome 12
display(df.where("contigName == '12'").orderBy("sampleId", "start"))

SQL

-- Register the table in the catalog


CREATE TABLE genotypes
USING delta
LOCATION '/genomics/output_dir/genotypes'

Troubleshooting
Job is slow and few tasks are running
Usually indicates that the input FASTQ files are compressed with gzip instead of bgzip . Gzipped files are not
splittable, so the input cannot be processed in parallel.

Run programmatically
In addition to using the UI, you can start runs of the pipeline programmatically using the Databricks CLI.

After setting up the pipeline job in the UI, copy the Job ID so that you can pass it to the jobs run-now CLI command.
Here’s an example bash script that you can adapt for your workflow:
# Generate a manifest file
cat <<HERE >manifest.csv
file_path,sample_id,paired_end,read_group_id
dbfs:/genomics/my_new_sample/*_R1_*.fastq.bgz,my_new_sample,1,read_group
dbfs:/genomics/my_new_sample/*_R2_*.fastq.bgz,my_new_sample,2,read_group
HERE

# Upload the file to DBFS
DBFS_PATH=dbfs:/genomics/manifests/$(date +"%Y-%m-%dT%H-%M-%S")-manifest.csv
databricks fs cp manifest.csv $DBFS_PATH

# Start a new run
databricks jobs run-now --job-id <job-id> --notebook-params "{\"manifest\": \"$DBFS_PATH\"}"

In addition to starting runs from the command line, you can use this pattern to invoke the pipeline from
automated systems like Jenkins.
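Such systems can also call the Jobs REST API directly instead of shelling out to the CLI. The following is a hedged sketch, not part of the pipeline itself; the workspace URL, personal access token, job ID, and manifest path are placeholders for your own values:

import requests

# Placeholders: substitute your workspace URL, a personal access token, and the pipeline job ID.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # the DNASeq pipeline job set up in the UI
        "notebook_params": {"manifest": "dbfs:/genomics/manifests/my-manifest.csv"},
    },
)
resp.raise_for_status()
print("Started run", resp.json()["run_id"])
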
DNASeq pipeline notebook
Get notebook
RNASeq pipeline
7/21/2022 • 2 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

NOTE
The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in lower
versions of Databricks Runtime for Genomics, see the release notes.

The Databricks RNASeq pipeline handles short read alignment and quantification using STAR v2.6.1a and ADAM
v0.32.0.

Setup
The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:

{
"num_workers": {
"type": "unlimited",
"defaultValue": 13
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_F32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38_star"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}

The task should be the RNASeq notebook provided at the bottom of this page.
For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend
Standard_F32s_v2 VMs.

Reference genomes
You must configure the reference genome using environment variables. To use GRCh37, set the environment
variable:
refGenomeId=grch37_star

To use GRCh38 instead, set the environment variable:

refGenomeId=grch38_star

Parameters
The pipeline accepts a number of parameters that control its behavior. The most important and commonly
changed parameters are documented here; the rest can be found in the RNASeq notebook. After importing the
notebook and setting it as a job task, you can set these parameters for all runs or per-run.

PARAMETER           DEFAULT   DESCRIPTION

manifest            n/a       The manifest describing the input.

output              n/a       The path where pipeline output should be written.

replayMode          skip      One of:
                              * skip: stages are skipped if output already exists.
                              * overwrite: existing output is deleted.

perSampleTimeout    12h       A timeout applied per sample. After reaching this timeout, the pipeline
                              continues on to the next sample. The value of this parameter must include
                              a timeout unit: 's' for seconds, 'm' for minutes, or 'h' for hours. For
                              example, '60m' results in a timeout of 60 minutes.

Walkthrough
The pipeline consists of two steps:
1. Alignment: Map each short read to the reference genome using the STAR aligner.
2. Quantification: Count how many reads correspond to each reference transcript.

Additional usage info and troubleshooting


The operational aspects of the RNASeq pipeline are very similar to the DNASeq pipeline. For more information
about manifest format, output structure, programmatic usage, and common issues, see DNASeq pipeline.
RNASeq pipeline notebook
Get notebook
Tumor/Normal pipeline
7/21/2022 • 3 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

The Azure Databricks tumor/normal pipeline is a GATK best practices compliant pipeline for short read
alignment and somatic variant calling using the MuTect2 variant caller.

Walkthrough
The pipeline consists of the following steps:
1. Normal sample alignment using BWA-MEM.
2. Tumor sample alignment using BWA-MEM.
3. Variant calling with MuTect2.

Setup
The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:

{
"num_workers": {
"type": "unlimited",
"defaultValue": 13
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_F32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}

The cluster configuration should use Databricks Runtime for Genomics.


The task should be the tumor/normal notebook found at the bottom of this page.
For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend
Standard_F32s_v2 VMs.
If you're running base quality score recalibration, use general purpose (Standard_D32s_v3) instances
instead, since this operation requires more memory.

Reference genomes
You must configure the reference genome using an environment variable. To use GRCh37, set the environment
variable:

refGenomeId=grch37

To use GRCh38, change grch37 to grch38 .


To use a custom reference genome, see instructions in Custom reference genomes.

Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed
parameters are documented here. To view all available parameters and their usage information, run the first cell
of the pipeline notebook. New parameters are added regularly. After importing the notebook and setting it as a
job task, you can set these parameters for all runs or per-run.

PARAMETER           DEFAULT   DESCRIPTION

manifest            n/a       The manifest describing the input.

output              n/a       The path where pipeline output should be written.

replayMode          skip      * If skip, stages will be skipped if output already exists.
                              * If overwrite, existing output will be deleted.

exportVCF           false     If true, the pipeline writes results to a VCF file as well as Delta.

perSampleTimeout    12h       A timeout applied per sample. After reaching this timeout, the pipeline
                              continues on to the next sample. The value of this parameter must include
                              a timeout unit: 's' for seconds, 'm' for minutes, or 'h' for hours. For
                              example, '60m' results in a timeout of 60 minutes.

TIP
To optimize run time, set the spark.sql.shuffle.partitions Spark configuration to three times the number of cores of
the cluster.
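For example, from a notebook cell, a minimal sketch of applying this tip (sc.defaultParallelism generally reflects the total number of worker cores on Databricks):

# Set shuffle partitions to roughly three times the total worker cores.
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism * 3)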

Manifest format
NOTE
Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.

The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. For example:

pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,tumor,2,read_group_tumor

If your input consists of unaligned BAM files, leave the paired_end field empty:

pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*.normal.bam,HG001_normal,normal,,read_group_normal
HG001,*.tumor.bam,HG001_tumor,tumor,,read_group_tumor

The tumor and normal samples for a given individual are grouped by the pair_id field. The tumor and normal
sample names and read group names must be different within a pair.

TIP
If the provided manifest is a file, the file_path field in each row may be an absolute path or a path relative to the
manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs
(*) to match many files.

Additional usage info and troubleshooting


The tumor/normal pipeline shares many operational details with the other Azure Databricks pipelines. For more
detailed usage information, such as output format structure, tips for running programmatically, steps for setting
up custom reference genomes, and common issues, see DNASeq pipeline.

NOTE
The pipeline was renamed from TNSeq to MutSeq in Databricks Runtime 7.3 LTS for Genomics and above.

MutSeq pipeline notebook


Get notebook
TNSeq pipeline notebook (Legacy)
Get notebook
Variant Annotation using Pipe Transformer
7/21/2022 • 2 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

Any annotation method can be used on variant data using Glow’s Pipe Transformer.
For example, VEP annotation is performed by downloading annotation data sources (the cache) to each node in
a cluster and calling the VEP command line script with the Pipe Transformer using a script similar to the
following cell.

import glow
import json

input_vcf = "/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz"
input_df = spark.read.format("vcf").load(input_vcf)
cmd = json.dumps([
"/opt/vep/src/ensembl-vep/vep",
"--dir_cache", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96",
"--fasta", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96/data/human_g1k_v37.fa",
"--assembly", "GRCh37",
"--format", "vcf",
"--output_file", "STDOUT",
"--no_stats",
"--cache",
"--offline",
"--vcf",
"--merged"])
output_df = glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf',
                           in_vcf_header=input_vcf, output_formatter='vcf')
output_df.write.format("delta").save("dbfs:/mnt/vep-pipe")
Variant annotation methods
7/21/2022 • 2 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

This article describes how to use Databricks Runtime for Genomics (Deprecated) to parallelize variant
annotation methods with Apache Spark using Azure Databricks notebooks.
Pre-packaged SnpEff annotation pipeline
Pre-packaged VEP annotation pipeline
Variant Annotation using Pipe Transformer
Pre-packaged SnpEff annotation pipeline
7/21/2022 • 2 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

Setup
Run SnpEff (v4.3) as an Azure Databricks job. Most likely, an Azure Databricks solutions architect will set up the
initial job for you. The necessary details are:
Cluster configuration
Databricks Runtime for Genomics (Deprecated)
Set the task to the SnpEffAnnotationPipeline notebook imported into your workspace.

Benchmarks
The pipeline has been tested on 85.2 million variant sites from the 1000 Genomes project using the following
cluster configurations:
Driver: Standard_DS13_v2
Workers: Standard_D32s_v3 * 7 (224 cores)
Runtime: 2.5 hours

Reference genomes
You must configure the reference genome using environment variables. To use GRCh37, set the environment
variable:

refGenomeId=grch37

To use GRCh38 instead, set the environment variable:

refGenomeId=grch38

Parameters
The pipeline accepts a number of parameters that control its behavior. The most important and commonly
changed parameters are documented here; the rest can be found in the SnpEff Annotation pipeline notebook.
After importing the notebook and setting it as a job task, you can set these parameters for all runs or per-run.
PARAMETER                DEFAULT   DESCRIPTION

inputVariants            n/a       Path of input variants (VCF or Delta Lake).

output                   n/a       The path where pipeline output should be written.

exportVCF                false     If true, the pipeline writes results in VCF as well as Delta Lake.

exportVCFAsSingleFile    false     If true, exports the VCF as a single file.

Output
The annotated variants are written out to Delta tables inside the provided output directory. If you configured the
pipeline to export to VCF, they’ll appear under the output directory as well.

output
|---annotations
    |---Delta files
|---annotations.vcf
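
As with the other pipelines, you can query the annotated variants with Spark. A minimal sketch, assuming the output parameter was set to /genomics/snpeff_output:

# Load the annotated variants written by the pipeline and preview a few rows.
df = spark.read.format("delta").load("/genomics/snpeff_output/annotations")
display(df.limit(10))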

SnpEff annotation pipeline notebook


Get notebook
Pre-packaged VEP annotation pipeline
7/21/2022 • 3 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

Setup
Run VEP (release 96) as an Azure Databricks job.
The necessary details are:
Cluster configuration
Databricks Runtime for Genomics (Deprecated)
For best performance, set the Spark configuration spark.executor.cores to 1 and use memory-optimized
instances with at least 200 GB of memory. We recommend Standard_L32s VMs.
Set the task to the VEPPipeline notebook imported into your workspace.

Reference genomes
You must configure the reference genome and transcripts using environment variables. To use GRCh37 with
merged Ensembl and RefSeq transcripts, set the environment variable:

refGenomeId=grch37_merged_vep_96

The refGenomeId values for all pairs of reference genomes and transcripts are listed below:

               GRCH37                  GRCH38

Ensembl        grch37_vep_96           grch38_vep_96
RefSeq         grch37_refseq_vep_96    grch38_refseq_vep_96
Merged         grch37_merged_vep_96    grch38_merged_vep_96

Parameters
The pipeline accepts a number of parameters that control its behavior. After importing the notebook and setting
it as a job task, you can set these parameters for all runs or per-run.
PARAMETER          DEFAULT                     DESCRIPTION

inputVcf           n/a                         Path of the VCF file to annotate with VEP.

output             n/a                         Path where pipeline output should be written.

replayMode         skip                        One of:
                                               * skip: if output already exists, stages are skipped.
                                               * overwrite: existing output is deleted.

exportVCF          false                       If true, the pipeline writes results as both VCF and
                                               Delta Lake.

extraVepOptions    --everything --minimal      Additional command line options to pass to VEP. Some
                   --allele_number --fork 4    options are set by the pipeline and cannot be overridden:
                                               --assembly, --cache, --dir_cache, --fasta, --format,
                                               --merged, --no_stats, --offline, --output_file, --refseq,
                                               --vcf. See all possible options on the VEP site.

LOFTEE
You can run VEP with plugins in order to extend, filter, or manipulate the VEP output. Set up LOFTEE with the
following instructions according to the desired reference genome.
grch37
Create a LOFTEE cluster using an init script.

#!/bin/bash
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch master https://github.com/konradjk/loftee.git

We recommend creating a mount point to store any additional files in cloud storage; these files can then be
accessed using the FUSE mount. Replace the values in the scripts with your mount point.
If desired, save the ancestral sequence at the mount point.

cd <mount-point>
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.fai
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.


cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/phylocsf_gerp.sql.gz
gunzip phylocsf_gerp.sql.gz

When running the VEP pipeline, provide the corresponding extra options.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/phylocsf_gerp.sql

grch38
Create a LOFTEE cluster that can parse BigWig files using an init script.

#!/bin/bash

# Download LOFTEE
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch grch38 https://github.com/konradjk/loftee.git

# Download Kent source tree


mkdir -p /tmp/bigfile
cd /tmp/bigfile
wget https://github.com/ucscGenomeBrowser/kent/archive/v335_base.tar.gz
tar xzf v335_base.tar.gz

# Build Kent source


export KENT_SRC=$PWD/kent-335_base/src
export MACHTYPE=$(uname -m)
export CFLAGS="-fPIC"
export MYSQLINC=`mysql_config --include | sed -e 's/^-I//g'`
export MYSQLLIBS=`mysql_config --libs`
cd $KENT_SRC/lib
echo 'CFLAGS="-fPIC"' > ../inc/localEnvironment.mk
make clean
make
cd ../jkOwnLib
make clean
make

# Install Bio::DB::BigFile
cpanm --notest Bio::Perl
cpanm --notest Bio::DB::BigFile

We recommend creating a mount point to store any additional files in cloud storage; these files can then be
accessed using the FUSE mount. Replace the values in the scripts with your mount point.
Save the GERP scores BigWig at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw

If desired, save the ancestral sequence at the mount point.


cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.fai
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/loftee.sql.gz
gunzip loftee.sql.gz

When running the VEP pipeline, provide the corresponding extra options.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,gerp_bigwig:<mount-point>/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/loftee.sql

VEP pipeline notebook


Get notebook
Joint genotyping
7/21/2022 • 2 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

Aggregate genetic variants using the GATK’s GenotypeGVCF implemented on Apache Spark.
Joint genotyping pipeline
Walkthrough
Setup
Reference genomes
Parameters
Output
Manifest format
Troubleshooting
Additional usage info
Joint genotyping pipeline
7/21/2022 • 4 minutes to read

NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

The Azure Databricks joint genotyping pipeline is a GATK best practices compliant pipeline for joint genotyping
using GenotypeGVCFs.

Walkthrough
The pipeline typically consists of the following steps:
1. Ingest variants into Delta Lake.
2. Joint-call the cohort with GenotypeGVCFs.
During variant ingest, single-sample gVCFs are processed in batches and the rows are stored in Delta Lake to
provide fault tolerance, fast querying, and incremental joint genotyping. In the joint genotyping step, the gVCF
rows are ingested from Delta Lake, split into bins, and distributed to partitions. For each variant site, the relevant
gVCF rows per sample are identified and used for regenotyping.

Setup
The pipeline is run as an Azure Databricks job. Most likely an Azure Databricks solutions architect will work with
you to set up the initial job. The necessary details are:
{
"autoscale.min_workers": {
"type": "unlimited",
"defaultValue": 1
},
"autoscale.max_workers": {
"type": "unlimited",
"defaultValue": 25
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_L32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}

The cluster configuration should use Databricks Runtime for Genomics (Deprecated).
The task should be the joint genotyping pipeline notebook found at the bottom of this page.
For best performance, use the storage-optimized VMs. We recommend Standard_L32s_v2 .
To reduce costs, enable autoscaling with a minimum of 1 worker and a maximum of 10-50 depending on
latency requirements.

Reference genomes
You must configure the reference genome using environment variables. To use GRCh37, set the environment
variable:

refGenomeId=grch37

To use GRCh38, change grch37 to grch38 .


To use a custom reference genome, see instructions in Custom reference genomes.

Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed
parameters are documented here. To view all available parameters and their usage information, run the first cell
of the pipeline notebook. New parameters are added regularly. After importing the notebook and setting it as a
job task, you can set these parameters for all runs or per-run.

PARAMETER              DEFAULT   DESCRIPTION

manifest               n/a       The manifest describing the input.

output                 n/a       The path where pipeline output is written.

replayMode             skip      One of:
                                 * skip: stages are skipped if output already exists.
                                 * overwrite: existing output is deleted.

exportVCF              false     If true, the pipeline writes results in VCF as well as Delta Lake.

targetedRegions        n/a       Path to files containing regions to call. If omitted, calls all regions.

gvcfDeltaOutput        n/a       If specified, gVCFs are ingested to a Delta table before genotyping.
                                 You should specify this parameter only if you expect to joint call the
                                 same gVCFs many times.

performValidation      false     If true, the system verifies that each record contains the necessary
                                 information for joint genotyping. In particular, it checks that the
                                 correct number of genotype probabilities are present.

validationStringency   STRICT    How to handle malformed records, both during loading and validation.
                                 * STRICT: fail the job
                                 * LENIENT: log a warning and drop the record
                                 * SILENT: drop the record without a warning

TIP
To perform joint calling from an existing Delta table, set gvcfDeltaOutput to the table path and replayMode to skip .
You can also provide the manifest , which will be used to define the VCF schema and samples; these will be inferred from
the Delta table otherwise. The targetedRegions and performValidation parameters are ignored in this setup.
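
For instance, the per-run notebook parameters for this replay setup might look like the following sketch; the paths are placeholders for your own tables and output locations:

import json

# Hypothetical per-run parameters for re-running joint calling from an existing gVCF Delta table.
notebook_params = {
    "gvcfDeltaOutput": "dbfs:/genomics/gvcf_delta",        # existing Delta table of ingested gVCFs
    "replayMode": "skip",
    "output": "dbfs:/genomics/joint_genotyping_output",
}

# Pass the parameters per run, for example:
#   databricks jobs run-now --job-id <job-id> --notebook-params '<the JSON printed below>'
print(json.dumps(notebook_params))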

Output
The regenotyped variants are all written out to Delta tables inside the provided output directory. In addition, if
you configured the pipeline to export VCFs, they’ll appear under the output directory as well.

output
|---genotypes
    |---Delta files
|---genotypes.vcf
    |---VCF files

Manifest format
NOTE
Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.

The manifest is a file or blob describing where to find the input single-sample GVCF files, with each file path on
a new row. For example:

HG00096.g.vcf.bgz
HG00097.g.vcf.bgz

TIP
If the provided manifest is a file, each row may be an absolute path or a path relative to the manifest file. If the provided
manifest is a blob, the row field must be an absolute path. You can include globs (*) to match many files.

Troubleshooting
Job fails with an ArrayIndexOutOfBoundsException

This error usually indicates that an input record has an incorrect number of genotype probabilities. Try setting
the performValidation option to true and the validationStringency option to LENIENT or SILENT .

Additional usage info


The joint genotyping pipeline shares many operational details with the other Azure Databricks pipelines. For
more detailed usage information, such as output format structure, tips for running programmatically, and steps
for setting up custom reference genomes, see DNASeq pipeline.
Joint genotyping pipeline notebook
Get notebook
Security guide
7/21/2022 • 2 minutes to read

Azure Databricks provides many tools for securing your network infrastructure. This guide covers general
security functionality. For information about securing access to your data, see Data governance guide.
Enterprise security for Azure Databricks
Access control
Secret management
Credential passthrough
Customer-managed keys for encryption
Configure double encryption for DBFS root
Secure cluster connectivity (No Public IP / NPIP)
Encrypt traffic between cluster worker nodes
IP access lists
Configure domain name firewall rules
Best practices: GDPR and CCPA compliance using Delta Lake
Configure access to Azure storage with an Azure Active Directory service principal
For security information specific to Databricks SQL, see the Databricks SQL security guide.
Enterprise security for Azure Databricks
7/21/2022 • 16 minutes to read

This article provides an overview of the most important security-related controls and configurations for the
deployment of Azure Databricks.
This article illustrates some scenarios using example companies to compare how small and large organizations
might handle deployment differently. There are references to the fictional large corporation LargeCorp and the
fictional small company SmallCorp . Use these examples as general guides, but every company is different. If
you have questions, contact your Azure Databricks representative.
For detailed information about specific security features, see Security guide. Your Azure Databricks
representative can also provide you with additional security and compliance documentation.

Security and Trust Center


The Databricks Security and Trust Center provides information about the ways in which security is built into
every layer of the Databricks Lakehouse Platform. The Security and Trust Center provides information that
enables you to meet your regulatory needs while taking advantage of the Databricks Lakehouse Platform. Find
the following types of information in the Security and Trust Center:
An overview and list of the security and governance features built into the platform.
Information about the compliance standards the platform meets on each cloud provider.
A due-diligence package to help you evaluate how Azure Databricks helps you meet your compliance and
regulatory needs.
An overview of Databricks’ privacy guidelines and how they are enforced.
The information in this article supplements the Security and Trust Center.

Account plan
Talk to your Azure Databricks representative about the features you want. They will help you choose the pricing
plan (tier) that suits your needs.
The following features are common for security-conscious organizations. All but SSO require the Premium plan.
Single sign-on (SSO): Authenticate users using Azure Active Directory. This is always enabled and is the
only option. If you use a different IdP, federate your IdP with Azure Active Directory.
Role-based access control: Control access to clusters, jobs, data tables, APIs, and workspace resources
such as notebooks, folders, jobs, and registered models.
Credential passthrough: Control access to Azure Data Lake Storage using users’ Azure Active Directory
credentials.
VNet injection: Deploy an Azure Databricks workspace in your own VNet that you manage in your Azure
subscription. Enables you to implement custom network configuration with custom Network Security
Group rules.
Secure cluster connectivity: Enable secure cluster connectivity on the workspace, which means that your
VNet’s Network Security Group (NSG) has no open inbound ports and Databricks Runtime cluster nodes
have no public IP addresses. Also known as “No Public IPs.” You can enable this feature for a workspace
during deployment.
NOTE
Independent of whether secure cluster connectivity is enabled, all Azure Databricks network traffic between the
data plane VNet and the Azure Databricks control plane goes across the Microsoft network backbone, not the
public Internet.

IP access lists: Enforce network location of workspace users.


Customer-managed keys to encrypt your managed services data in the control plane: Encrypt notebook
and secret data in the control plane using an Azure Key Vault key that you manage.
Customer-managed keys to encrypt your root Azure Blob storage (root DBFS and workspace system
data): Encrypt the Databricks File Storage (DBFS) root and other workspace system data in the same
Azure storage blob using an Azure Key Vault key that you manage. DBFS is a distributed file system
mounted into an Azure Databricks workspace and available on Azure Databricks clusters.
You may also be interested in:
Cluster policies: Limit your users’ ability to configure clusters based on a set of rules. Cluster policies let you
enforce particular cluster settings (such as instance types, attached libraries, compute cost) and display
different cluster-creation interfaces for different user levels.

Workspaces
An Azure Databricks workspace is an environment for accessing your Azure Databricks assets. The workspace
organizes your objects (notebooks, libraries, and experiments) into folders. Your workspace provides access to
data and computational resources such as clusters and jobs.
Determine how many workspaces your organization will need, what teams need to collaborate, and your
requirements for geographic regions.
A small organization such as our example SmallCorp might only need one or a small number of workspaces.
This might also be true of a single division of a larger company that is relatively self-contained. The workspace
administrators could be regular users of the workspace. In some cases, a separate department (IT/OpSec) might
take on the role of workspace administrator to deploy according to enterprise governance policies and manage
permissions, users, and groups.
A large organization such as our example LargeCorp typically requires many workspaces. LargeCorp already has
a centralized group (IT/OpSec) that handles all security and administrative functions. That centralized group
typically sets up new workspaces and enforces security controls across the company.
Common reasons a large corporation might create separate workspaces:
Teams handle different levels of confidential information, possibly including personally identifying
information. By separating workspaces, teams keep different levels of confidential assets separate without
additional complexity such as access control lists. For example, the LargeCorp finance team can easily store
its finance-related notebooks separate from the workspaces of other departments.
Simplify billing for Databricks usage (DBUs) and Cloud compute to be charged back to different budgets.
Geographic region variations of teams or data sources. Teams in one region may prefer cloud resources
based in a different region for cost, network latency, or legal compliance. Each workspace can be defined in a
different supported region.
Although workspaces are a common approach to segregate access to resources by team, project, or geography,
there are other options. Workspaces administrators can use access control lists (ACLs) within a workspace to
limit access to resources such as notebooks, folders, jobs, and more, based on user and group memberships.
Another option for controlling differential access to data source in a single workspace is credential passthrough.
Plan your virtual network configuration
The default deployment of Azure Databricks creates a new virtual network that is managed by Microsoft. You
can create a new workspace in your own customer-managed virtual network (also known as VNet injection)
instead. To learn why you might want to do this, see Deploy Azure Databricks in your Azure virtual network
(VNet injection).
Smaller companies might use the default VNet or have only a single customer-managed VNet for a single
Databricks workspace. If there are two or three workspaces, depending on the network architecture and regions,
they might share a single customer-managed VNet with multiple workspaces.
Larger companies often want to create and specify a customer-managed VNet for Databricks to use. They could
have some workspaces share a VNet to simplify Azure resource allocation. One organization might have VNets
in different Azure subscriptions and possibly different regions, and that could affect the planning of the number
of workspaces and your planned VNets to support them.

For each workspace you must provide two subnets: one for the container and one for the host. You cannot share
these subnets (or their address space) across workspaces. If you have multiple workspaces in one VNet, it’s
critical to plan your address space within the VNet. For the supported VNet and subnet sizes and the maximum
Azure Databricks nodes for various subnet sizes, see Deploy Azure Databricks in your Azure virtual network
(VNet injection).
Customer-managed keys for root Azure Blob storage (root DBFS and workspace system data)
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is implemented as a storage account in your Azure Databricks
workspace’s managed resource group. The default storage location in DBFS is known as the DBFS root. By
default, the storage account is encrypted with Microsoft-managed keys. This root Azure Blob storage in your
own subscription also stores other workspace system data like cluster or job logs and notebook version history.
Optionally, you can secure your workspace’s root Azure Blob storage using customer-managed keys. For details,
see Configure customer-managed keys for DBFS root
Customer-managed keys for managed services in the control plane
An Azure Databricks workspace comprises a control plane that is hosted in an Azure Databricks-managed
subscription and a data plane that is deployed in a virtual network in your subscription. The control plane stores
your notebook source code, partial notebook results, secrets stored with the secrets manager, and other
workspace configuration data. By default, the managed services data in the control plane is encrypted at rest
with a Databricks-managed key.
If your security and compliance requirements specify that you must own and manage the key used for
encrypting your notebooks and all results yourself, you can provide your own key for encrypting the notebook
data that is stored in the Azure Databricks control plane. For details, see Enable customer-managed keys for
managed services.
This feature does not encrypt data stored outside of the control plane. For example, it does not encrypt data in
your root Azure Blob storage, which stores job results and other parts of what is called your root DBFS. You can
encrypt that information with your own customer-managed keys. See Customer-managed keys for root Azure
Blob storage (root DBFS and workspace system data)

Authentication and user account provisioning


Single sign-on (SSO) refers to the ability to authenticate users with the Azure Active Directory (Azure AD)
instance associated with the workspace. SSO is always enabled as part of the built-in support for Azure AD.
Azure Databricks does not have access to user credentials. Azure Databricks authenticates with OpenID Connect,
which is built on top of OAuth 2.0. You can optionally use Azure AD’s support for multi-factor authentication
(MFA).
Azure Databricks behavior for auto-provisioning of local user accounts for Azure Databricks using SSO depends
on whether the user is an admin:
Admin users : If an Azure AD user or service principal has the Contributor or Owner role on the
Databricks resource or a child group, the Azure Databricks local account is provisioned during sign-in.
This is sometimes called just-in-time (JIT) provisioning. The user is added to the built-in Azure Databricks
group for administrators ( admins ).
Non-admin users : There is no JIT provisioning of non-admin user accounts in the Azure Databricks
workspace. For automated provisioning of non-admin users, use System for Cross-domain Identity
Management (SCIM) user synchronization with Azure AD. Azure Databricks and Azure AD both support
SCIM for user and group synchronization, including provisioning and deprovisioning local user accounts.
Although you can use user names to manage access control lists (ACLs) for Databricks resources such as
notebooks, SCIM synchronization of groups with Azure AD makes it even easier to manage Databricks
ACLs. See Access control lists.
NOTE
Azure AD SCIM synchronization of users and groups requires that your Azure AD account be Premium Edition.
See Configure SCIM provisioning for Microsoft Azure Active Directory. If you don’t want to use Azure AD SCIM
synchronization, you can provision users and groups in the Admin Console or directly invoke the Databricks SCIM
REST API to manage users and groups. SCIM is available as Public Preview.

A small company, such as our example SmallCorp, might create new users by email address manually using the
Azure Databricks Admin console, in which case the users must also be members of the Azure AD tenant that is
associated with that workspace. However, Databricks recommends that you use Azure AD user and group
synchronization to automate user provisioning.
A large company, such as our example LargeCorp, typically manages Azure Databricks users and groups using
Azure AD SCIM synchronization of users and groups.
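
If you choose to manage users directly with the Databricks SCIM REST API mentioned in the note above, a hedged sketch of provisioning a single non-admin user looks like the following; the workspace URL, admin token, and user name are placeholders:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
admin_token = "<admin-personal-access-token>"

# Create a local Azure Databricks user with the SCIM Users endpoint.
resp = requests.post(
    f"{host}/api/2.0/preview/scim/v2/Users",
    headers={
        "Authorization": f"Bearer {admin_token}",
        "Content-Type": "application/scim+json",
    },
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": "new.user@example.com",
    },
)
resp.raise_for_status()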

Secure API access


For REST API authentication, use either built-in revocable Azure Databricks personal access tokens or use
revocable Azure Active Directory access tokens.
There is a Token Management API that you can use to review current Azure Databricks personal access tokens,
delete tokens, and set the maximum lifetime of new tokens. Use the related Permissions API to set token
permissions that define which users can create and use tokens to access workspace REST APIs.
Use token permissions to enforce the principle of least privilege so that any individual user or group has access
to REST APIs only if they have a legitimate need.
For workspaces created after the release of Azure Databricks platform version 3.28 (Sept 9-15, 2020), by default
only admin users have the ability to generate personal access tokens. Admins must explicitly grant those
permissions, whether to the entire users group or on a user-by-user or group-by-group basis. Workspaces
created before 3.28 maintain the permissions that were already in place before this change but had different
defaults. If you are not sure when a workspace was created, review the tokens permissions for the workspace.
For a complete list of APIs and admin console tools for tokens, see Manage personal access tokens.
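
As a hedged illustration, an admin could list the workspace's existing personal access tokens with the Token Management API before tightening permissions; the workspace URL and admin token below are placeholders:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
admin_token = "<admin-personal-access-token>"

# List all personal access tokens in the workspace.
resp = requests.get(
    f"{host}/api/2.0/token-management/tokens",
    headers={"Authorization": f"Bearer {admin_token}"},
)
resp.raise_for_status()
for t in resp.json().get("token_infos", []):
    print(t["token_id"], t.get("created_by_username"), t.get("comment"), t.get("expiry_time"))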

IP access lists
Authentication proves user identity, but it does not enforce the network location of the users. Accessing a cloud
service from an unsecured network poses security risks, especially when the user may have authorized access to
sensitive or personal data. Enterprise network perimeters (for example, firewalls, proxies, DLP, and logging)
apply security policies and limit access to external services, so access beyond these controls is assumed to be
untrusted.
For example, if an employee walks from the office to a coffee shop, the company can block connections to the
Azure Databricks workspace even if the user has the correct credentials to access the web application and the
REST API.
Specify the IP addresses (or CIDR ranges) on the public network that are allowed access. These IP addresses
could belong to egress gateways or specific user environments. You can also specify IP addresses or subnets to
block even if they are included in the allow list. For example, an allowed IP address range might include a
smaller range of infrastructure IP addresses that in practice are outside the actual secure network perimeter.
For details, see IP access lists.
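As a hedged sketch, an allow list for an office egress range could be created with the IP Access Lists REST API; the workspace URL, admin token, and CIDR ranges are placeholders, and the feature must be enabled for the workspace:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
admin_token = "<admin-personal-access-token>"

# Allow access only from the listed addresses and ranges.
resp = requests.post(
    f"{host}/api/2.0/ip-access-lists",
    headers={"Authorization": f"Bearer {admin_token}"},
    json={
        "label": "office-egress",
        "list_type": "ALLOW",
        "ip_addresses": ["203.0.113.0/24", "198.51.100.10"],
    },
)
resp.raise_for_status()
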
Audit logs
Audit logs are available automatically for Azure Databricks workspaces. You must configure these to flow to
your Azure Storage Account, Azure Log Analytics workspace, or Azure Event Hub. In the Azure portal, the user
interface refers to them as Diagnostic Logs. After experimenting with your new workspace, try examining the
audit logs. See Diagnostic logging in Azure Databricks.

Cluster policies
Use cluster policies to enforce particular cluster settings, such as instance types, number of nodes, attached
libraries, and compute cost, and display different cluster-creation interfaces for different user levels. Managing
cluster configurations using policies can help enforce universal governance controls and manage the costs of
your compute infrastructure.
A small organization like SmallCorp might have a single cluster policy for all clusters.
A large organization like LargeCorp might have more complex policies, for example:
Customer data analysts who work on extremely large data sets and complex calculations might be allowed to
have clusters of up to one hundred nodes.
Finance team data analysts might be allowed to use clusters of up to ten nodes.
The human resources department, which works with smaller datasets and simpler notebooks, might only be
allowed to have autoscaling clusters of four to eight nodes.
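
As a hedged sketch, the LargeCorp policy for the human resources department could be created with the Cluster Policies API. The policy definition below is illustrative only, and the workspace URL and admin token are placeholders:

import json
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
admin_token = "<admin-personal-access-token>"

# Cap autoscaling clusters at four to eight workers.
definition = {
    "autoscale.min_workers": {"type": "range", "minValue": 4, "maxValue": 8, "defaultValue": 4},
    "autoscale.max_workers": {"type": "range", "minValue": 4, "maxValue": 8, "defaultValue": 8},
}

resp = requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {admin_token}"},
    json={"name": "hr-small-clusters", "definition": json.dumps(definition)},
)
resp.raise_for_status()
print("Created policy", resp.json()["policy_id"])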

Access control lists


Azure Databricks has several solutions for managing access to your data and objects. In addition to the
following information, see Data governance guide for more details about managing access to your
organization’s data.
Access control lets you apply data governance controls for data and objects in a workspace. Admins can enable
ACLs using the Admin Console or Permissions API.
Admins can also delegate some ACL configurations to non-admin users by granting Can manage permission (for
example, any user with Can Manage permission for a cluster can give other users permission to attach to, restart,
resize, and manage that cluster.)
The following ACLs can be set for users, groups, and Azure service principals. Unless otherwise specified, you
can modify ACLs using both the web application and REST API:

OBJECT                                    DESCRIPTION

Table                                     Enforce access to data tables.

Cluster                                   Manage which users can manage, restart, or attach to clusters.
                                          Access to clusters affects security when a cluster is configured
                                          with passthrough authentication for data sources; see Data source
                                          access using Azure Databricks login credentials.

Pool                                      Manage which users can manage or attach to pools. Some APIs and
                                          documentation refer to pools as instance pools. Pools reduce
                                          cluster start and auto-scaling times by maintaining a set of
                                          idle, ready-to-use cloud instances. When a cluster attached to a
                                          pool needs an instance, it first attempts to allocate one of the
                                          idle instances of the pool. If the pool has no idle instances, it
                                          expands by allocating a new instance from the instance provider
                                          in order to accommodate the cluster's request. When a cluster
                                          releases an instance, it returns to the pool and is free for
                                          another cluster to use. Only clusters attached to a pool can use
                                          the idle instances of the pool.

Jobs                                      Manage which users can view, manage, trigger, cancel, or own a
                                          job.

Notebook                                  Manage which users can read, run, edit, or manage a notebook.

Folder (directory)                        Manage which users can read, run, edit, or manage all notebooks
                                          in a folder.

MLflow registered model and experiment    Manage which users can read, edit, or manage MLflow registered
                                          models and experiments.

Token                                     Manage which users can create or use tokens. See also Secure API
                                          access.

Depending on team size and sensitivity of the information, a small company like SmallCorp or a small team
within LargeCorp with its own workspace might allow all non-admin users access to the same objects, like
clusters, jobs, notebooks, and directories.
A larger team or organization with very sensitive information would likely want to use all of these access
controls to enforce the Principle of Least Privilege so that any individual user has access only to the resources
for which they have a legitimate need.
For example, suppose that LargeCorp has three people who need access to a specific workspace folder (which
contains notebooks and experiments) for the finance team. LargeCorp can use these APIs to grant directory
access only to the finance data team group.
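
As a hedged sketch, that grant could be automated with the Permissions API; the workspace URL, admin token, directory ID, and group name are placeholders, and you can look up a folder's directory ID with the Workspace API (GET /api/2.0/workspace/get-status):

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
admin_token = "<admin-personal-access-token>"
directory_id = 1234567890  # object_id of the finance folder

# Grant the finance data team group read access to the folder.
resp = requests.patch(
    f"{host}/api/2.0/permissions/directories/{directory_id}",
    headers={"Authorization": f"Bearer {admin_token}"},
    json={
        "access_control_list": [
            {"group_name": "finance-data-team", "permission_level": "CAN_READ"}
        ]
    },
)
resp.raise_for_status()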

Data source access using Azure Databricks login credentials


You can authenticate automatically to Azure Data Lake Storage Gen1 and Azure Data Lake Storage Gen2 from
Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into
Azure Databricks. When you enable your cluster for Azure Data Lake Storage credential passthrough,
commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring
you to configure service principal credentials for access to storage.
Credential passthrough applies only to Azure Data Lake Storage Gen1 or Gen2 storage accounts. Azure Data
Lake Storage Gen2 storage accounts must use the hierarchical namespace to work with Azure Data Lake Storage
credential passthrough.
For details, see Access Azure Data Lake Storage using Azure Active Directory credential passthrough.
For more information and best practices, see Data governance guide.

Secrets
You may also want to use the secret manager to set up secrets that you expect your notebooks to need. A secret
is a key-value pair that stores secret material for an external data source or other calculation, with a key name
unique within a secret scope.
You create secrets using either the REST API or CLI, but you must use the Secrets utility (dbutils.secrets) in a
notebook or job to read your secrets.
Secrets are stored encrypted at rest, but you can add a customer-managed key to add additional security. See
Customer-managed keys for managed services in the control plane
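
For example, a notebook or job can read a secret like this; the scope and key names are placeholders that you create beforehand with the Secrets API or the Databricks CLI:

# Read a secret from a notebook or job; dbutils is available in Databricks notebooks.
jdbc_password = dbutils.secrets.get(scope="my-scope", key="jdbc-password")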

Automation template options


Using Databricks REST APIs, some of your security configuration tasks can be automated. To assist Databricks
customers with typical deployment scenarios, there are ARM templates that you can use to deploy workspaces.
Particularly for large companies with dozens of workspaces, using templates can enable fast and consistent
automated workspace deployments.
Configure access to Azure storage with an Azure
Active Directory service principal
7/21/2022 • 2 minutes to read

Registering an application with Azure Active Directory (Azure AD) creates a service principal you can use to
provide access to Azure storage accounts. You can then configure access to these service principals using
credentials stored with secrets.
Databricks recommends using Azure Active Directory service principals scoped to clusters or Databricks SQL
endpoints to configure data access. See Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks and Configure access to cloud storage.

Register an Azure Active Directory application


Registering an Azure AD application and assigning appropriate permissions will create a service principal that
can access Azure Data Lake Storage Gen2 or Blob Storage resources.
1. In the Azure portal, go to the Azure Active Directory service.
2. Under Manage, click App Registrations.
3. Click + New registration. Enter a name for the application and click Register.
4. Click Certificates & Secrets.
5. Click + New client secret.
6. Add a description for the secret and click Add.
7. Copy and save the value for the new secret.
8. In the application registration overview, copy and save the Application (client) ID and Directory (tenant) ID.
Databricks recommends storing these credentials using secrets.

Assign roles
You control access to storage resources by assigning roles to an Azure AD application registration associated
with the storage account. This example assigns the Storage Blob Data Contributor to an Azure storage
account. You may need to assign other roles depending on specific requirements.
1. In the Azure portal, go to the Storage accounts service.
2. Select an Azure storage account to use with this application registration.
3. Click Access Control (IAM) .
4. Click + Add and select Add role assignment from the dropdown menu.
5. Set the Select field to the Azure AD application name and set Role to Storage Blob Data Contributor .
6. Click Save .
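
As a hedged sketch, a notebook could then use this service principal to access an ADLS Gen2 account by setting Spark configuration and reading the client secret from a secret scope; the storage account, tenant ID, container, and secret scope and key names are placeholders:

# Placeholders: substitute your own values.
storage_account = "<storage-account-name>"
tenant_id = "<directory-tenant-id>"
client_id = "<application-client-id>"
client_secret = dbutils.secrets.get(scope="my-scope", key="service-principal-secret")

# Configure OAuth access to ABFS with the service principal.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

# Example read once the Storage Blob Data Contributor role is assigned:
# df = spark.read.csv(f"abfss://<container>@{storage_account}.dfs.core.windows.net/path/to/data")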
Access control
7/21/2022 • 2 minutes to read

In Azure Databricks, you can use access control lists (ACLs) to configure permission to access workspace objects
(folders, notebooks, experiments, and models), clusters, pools, jobs, Delta Live Tables pipelines, and data tables.
All admin users can manage access control lists, as can users who have been given delegated permissions to
manage access control lists.
Admin users enable and disable access control at the Azure Databricks workspace level. See Enable access
control.

NOTE
Workspace object, cluster, pool, job, Delta Live Tables pipelines, and table access control are available only in the Premium
Plan.

This section covers:


Workspace object access control
Cluster access control
Pool access control
Jobs access control
Delta Live Tables access control
Table access control
Secret access control
Workspace object access control
7/21/2022 • 9 minutes to read

NOTE
Access control is available only in the Premium Plan.

By default, all users can create and modify workspace objects—including folders, notebooks, experiments, and
models—unless an administrator enables workspace access control. With workspace object access control,
individual permissions determine a user’s abilities. This article describes the individual permissions and how to
configure workspace object access control.

TIP
You can manage permissions in a fully automated setup using Databricks Terraform provider and databricks_permissions.

Before you can use workspace object access control, an Azure Databricks admin must enable it for the
workspace. See Enable workspace object access control.

Folder permissions
You can assign five permission levels to folders: No Permissions , Can Read , Can Run , Can Edit , and Can
Manage . The table lists the abilities for each permission.

ABILITY                            NO PERMISSIONS   CAN READ   CAN RUN   CAN EDIT   CAN MANAGE

List items in folder               x                x          x         x          x
View items in folder                                x          x         x          x
Clone and export items                              x          x         x          x
Create, import, and delete items                                                    x
Move and rename items                                                               x
Change permissions                                                                  x

Notebooks and experiments in a folder inherit all permissions settings of that folder. For example, a user that
has Can Run permission on a folder has Can Run permission on the notebooks in that folder.
Default folder permissions
Independent of workspace object access control, the following permissions exist:
All users have Can Manage permission for items in the Workspace > Shared folder. You can
grant Can Manage permission to notebooks and folders by moving them to the Shared folder.
All users have Can Manage permission for objects the user creates.
With workspace object access control disabled, the following permissions exist:
All users have Can Edit permission for items in the Workspace folder.
With workspace object access control enabled, the following permissions exist:
Workspace folder
Only administrators can create new items in the Workspace folder.
Existing items in the Workspace folder - Can Manage . For example, if the Workspace folder
contained the Documents and Temp folders, all users continue to have the Can
Manage permission for these folders.
New items in the Workspace folder - No Permissions .
A user has the same permission for all items in a folder, including items created or moved into the
folder after you set the permissions, as the permission the user has on the folder.
User home directory - The user has Can Manage permission. All other users have No Permissions
permission.

Notebook permissions
You can assign five permission levels to notebooks: No Permissions , Can Read , Can Run , Can Edit , and Can
Manage . The table lists the abilities for each permission.

ABILITY                              NO PERMISSIONS   CAN READ   CAN RUN   CAN EDIT   CAN MANAGE

View cells                                            x          x         x          x
Comment                                               x          x         x          x
Run via %run or notebook workflows                    x          x         x          x
Attach and detach notebooks                                      x         x          x
Run commands                                                     x         x          x
Edit cells                                                                 x          x
Change permissions                                                                    x

Repos permissions
You can assign five permission levels to repos: No Permissions , Can Read , Can Run , Can Edit , and Can
Manage . The table lists the abilities for each permission.
ABILITY                            NO PERMISSIONS   CAN READ   CAN RUN   CAN EDIT   CAN MANAGE

List items in repo                 x                x          x         x          x
View items in repo                                  x          x         x          x
Clone and export items                              x          x         x          x
Run notebooks in repo                                          x         x          x
Edit notebooks in repo                                                   x          x
Create, import, and delete items                                                    x
Move and rename items                                                               x
Change permissions                                                                  x

Configure notebook, folder, and repos permissions


NOTE
This section describes how to manage permissions using the UI. You can also use the Permissions API 2.0.

1. Select Permissions from the drop-down menu for the notebook, folder, or repo:

2. To grant permissions to a user or group, select from the Add Users, Groups, and Service Principals
drop-down, select the permission, and click Add :
To change the permissions of a user or group, select the new permission from the permission drop-down:

3. After you make changes in the dialog, Done changes to Save Changes and a Cancel button appears.
Click Save Changes or Cancel .
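
You can script the same assignments with the Permissions API 2.0. The following is a minimal sketch using Python and the requests library; the workspace URL, token, notebook ID, and group name are placeholders, and you should confirm the payload shape against the Permissions API 2.0 reference:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"                            # placeholder token
notebook_id = "<notebook-id>"                                # numeric ID of the notebook

# PATCH adds or updates entries without replacing the existing access control list.
resp = requests.patch(
    f"{host}/api/2.0/permissions/notebooks/{notebook_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"access_control_list": [
        {"group_name": "data-scientists", "permission_level": "CAN_RUN"}
    ]},
)
resp.raise_for_status()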

MLflow Experiment permissions


You can assign four permission levels to MLflow Experiments: No Permissions , Can Read , Can Edit , and Can
Manage . The table lists the abilities for each permission.

| Ability                                | No Permissions | Can Read | Can Edit | Can Manage |
|----------------------------------------|----------------|----------|----------|------------|
| View run info, search, compare runs    |                | x        | x        | x          |
| View, list, and download run artifacts |                | x        | x        | x          |
| Create, delete, and restore runs       |                |          | x        | x          |
| Log run params, metrics, tags          |                |          | x        | x          |
| Log run artifacts                      |                |          | x        | x          |
| Edit experiment tags                   |                |          | x        | x          |
| Purge runs and experiments             |                |          |          | x          |
| Grant permissions                      |                |          |          | x          |

NOTE
Experiment permissions are only enforced on artifacts stored in DBFS locations managed by MLflow. For more
information, see MLflow Artifact permissions.
Create, delete, and restore experiment requires Can Edit or Can Manage access to the folder containing the
experiment.
You can specify the Can Run permission for experiments. It is enforced the same way as Can Edit .

Configure MLflow experiment permissions


You can configure MLflow experiment permissions from the experiments page, the experiment page, or from the
workspace.
Configure MLflow experiment permissions from the experiments page

You can change permissions for an experiment that you own from the experiments page. Click in the
Actions column and select Permissions .
Configure MLflow experiment permissions from the experiment page
All users in your account belong to the group all users . Administrators belong to the group admins , which
has Manage permissions on all objects.

NOTE
Permissions you set in this dialog apply to the notebook that corresponds to this experiment.

1. Click Share .

2. In the dialog, click the Select User, Group or Service Principal… drop-down and select a user, group,
or service principal.
3. Select a permission from the permission drop-down.
4. Click Add .

5. If you want to remove a permission, click for that user, group, or service principal.

6. Click Save or Cancel .


Configure MLflow experiment permissions from the workspace
1. To open the permissions dialog, select Permissions in the experiment’s drop-down menu.

2. Grant or remove permissions. All users in your account belong to the group all users . Administrators
belong to the group admins , which has Can Manage permissions on all items.
To grant permissions, select from the Select User, Group, or Service Principal drop-down, select the
permission, and click Add :
To change existing permissions, select the new permission from the permission drop-down:

To remove a permission, click for that user, group, or service principal.


3. After you make changes in the dialog, click Save or Cancel .
MLflow Artifact permissions
Each MLflow Experiment has an Artifact Location that is used to store artifacts logged to MLflow runs.
Starting in MLflow 1.11, artifacts are stored in an MLflow-managed subdirectory of the Databricks File System
(DBFS) by default. MLflow experiment permissions apply to artifacts stored in these managed locations, which
have the prefix dbfs:/databricks/mlflow-tracking . To download or log an artifact, you must have the
appropriate level of access to its associated MLflow experiment.

NOTE
Artifacts stored in MLflow-managed locations can only be accessed using the MLflow Client (version 1.9.1 or later),
which is available for Python, Java, and R. Other access mechanisms, such as dbutils and the DBFS API 2.0, are not
supported for MLflow-managed locations.
You can also specify your own artifact location when creating an MLflow experiment. Experiment access controls are
not enforced on artifacts stored outside of the default MLflow-managed DBFS directory.
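
As an illustration of the MLflow Client path, the following Python sketch downloads the artifacts of a run stored under dbfs:/databricks/mlflow-tracking ; the run ID and destination path are placeholders, and the call succeeds only if you have the required permission on the run's experiment:

from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = "<run-id>"  # placeholder run ID

# Downloads every artifact logged to the run to a local path on the driver.
# Experiment permissions are enforced on the MLflow-managed DBFS location.
local_dir = client.download_artifacts(run_id, path="", dst_path="/tmp/run-artifacts")
print(local_dir)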
MLflow Model permissions
You can assign six permission levels to MLflow Models registered in the MLflow Model Registry: No
Permissions , Can Read , Can Edit , Can Manage Staging Versions , Can Manage Production Versions ,
and Can Manage . The table lists the abilities for each permission.

NOTE
A model version inherits permissions from its parent model; you cannot set permissions for model versions.

| Ability | No Permissions | Can Read | Can Edit | Can Manage Staging Versions | Can Manage Production Versions | Can Manage |
|---|---|---|---|---|---|---|
| Create a model | x | x | x | x | x | x |
| View model details, versions, stage transition requests, activities, and artifact download URIs | | x | x | x | x | x |
| Request a model version stage transition | | x | x | x | x | x |
| Add a version to a model | | | x | x | x | x |
| Update model and version description | | | x | x | x | x |
| Add or edit tags for a model or model version | | | x | x | x | x |
| Transition model version between stages | | | | x (between None, Archived, and Staging) | x | x |
| Approve or reject a model version stage transition request | | | | x (between None, Archived, and Staging) | x | x |
| Cancel a model version stage transition request (see Note) | | | | | | x |
| Modify permissions | | | | | | x |
| Rename model | | | | | | x |
| Delete model and model versions | | | | | | x |

NOTE
The creator of a stage transition request can also cancel the request.

Default MLflow Model permissions


Independent of workspace object access control, the following permissions exist:
All users have permission to create a new registered model.
All administrators have Manage permission for all models.
With workspace object access control disabled, the following permissions exist:
All users have Manage permission for all models.
With workspace object access control enabled, the following default permissions exist:
All users have Manage permission for models the user creates.
Non-administrator users have No Permissions on models they did not create.
Configure MLflow Model permissions
All users in your account belong to the group all users . Administrators belong to the group admins , which
has Manage permissions on all objects.

NOTE
This section describes how to manage permissions using the UI. You can also use the Permissions API 2.0.

1. Click Models in the sidebar.


2. Click a model name.
3. Click Permissions .
4. In the dialog, click the Select User, Group or Service Principal… drop-down and select a user, group,
or service principal.

5. Select a permission from the permission drop-down.


6. Click Add .
7. Click Save or Cancel .
Configure permissions for all MLflow Models in Model Registry
Workspace administrators can set permission levels on all models for specific users or groups in Model Registry
using the UI.

1. Click Models in the sidebar.


2. Click Permissions .

3. Follow the steps listed in Configure MLflow Model permissions, starting at step 4.
When you navigate to a specific model page, permissions set at the registry-wide level are marked
“inherited”.
NOTE
A user with Can Manage permission at the registry-wide level can change registry-wide permissions for all other users.

MLflow Model Artifact permissions


The model files for each MLflow model version are stored in an MLflow-managed location, with the prefix
dbfs:/databricks/model-registry/ .

To get the exact location of the files for a model version, you must have Read access to the model. Use the REST
API endpoint /api/2.0/mlflow/model-versions/get-download-uri .
After obtaining the URI, you can use the DBFS API 2.0 to download the files.
The MLflow Client (for Python, Java, and R) provides several convenience methods that wrap this workflow to
download and load the model, such as mlflow.<flavor>.load_model() .
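
For example, the following Python sketch resolves the download URI for a model version and then loads the model through the client; the model name and version are placeholders, and you need read access to the model for both calls:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Placeholders: substitute a registered model name and version you can read.
uri = client.get_model_version_download_uri(name="my-model", version="1")
print(uri)  # typically a dbfs:/databricks/model-registry/... path

# The client wraps download-and-load; the pyfunc flavor is used here as an example.
model = mlflow.pyfunc.load_model("models:/my-model/1")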

NOTE
Other access mechanisms, such as dbutils and %fs are not supported for MLflow-managed file locations.

Library and jobs access control


All users can view libraries. To control who can attach libraries to clusters, see Cluster access control.

To control who can run jobs and see the results of job runs, see Jobs access control.
Cluster access control

NOTE
Access control is available only in the Premium Plan.

By default, all users can create and modify clusters unless an administrator enables cluster access control. With
cluster access control, permissions determine a user’s abilities. This article describes the permissions.
Before you can use cluster access control, an Azure Databricks admin must enable it for the workspace. See
Enable cluster access control for your workspace.

Types of permissions
You can configure two types of cluster permissions:
The Allow unrestricted cluster creation entitlement controls your ability to create clusters.
Cluster-level permissions control your ability to use and modify a specific cluster.
When cluster access control is enabled:
An administrator can configure whether a user can create clusters.
Any user with Can Manage permission for a cluster can configure whether a user can attach to, restart,
resize, and manage that cluster.

Cluster-level permissions
There are four permission levels for a cluster: No Permissions , Can Attach To , Can Restart , and Can
Manage . The table lists the abilities for each permission.

IMPORTANT
Users with Can Attach To permissions can view the service account keys in the log4j file. Use caution when granting this
permission level.

| Ability                    | No Permissions | Can Attach To | Can Restart  | Can Manage |
|----------------------------|----------------|---------------|--------------|------------|
| Attach notebook to cluster |                | x             | x            | x          |
| View Spark UI              |                | x             | x            | x          |
| View cluster metrics       |                | x             | x            | x          |
| View driver logs           |                | x (see note)  | x (see note) | x          |
| Terminate cluster          |                |               | x            | x          |
| Start cluster              |                |               | x            | x          |
| Restart cluster            |                |               | x            | x          |
| Edit cluster               |                |               |              | x          |
| Attach library to cluster  |                |               |              | x          |
| Resize cluster             |                |               |              | x          |
| Modify permissions         |                |               |              | x          |

NOTE
Secrets are not redacted from the Spark driver log streams stdout and stderr . To protect secrets that might
appear in those driver log streams such that only users with the Can Manage permission on the cluster can view them,
set the cluster’s Spark configuration property spark.databricks.acl.needAdminPermissionToViewLogs true .
You have Can Manage permission for any cluster that you create.

Configure cluster-level permissions


NOTE
This section describes how to manage permissions using the UI. You can also use the Permissions API 2.0.

Cluster access control must be enabled and you must have Can Manage permission for the cluster.

1. Click Compute in the sidebar.


2. Click the name of the cluster you want to modify.
3. Click Permissions at the top of the page.
4. In the Permission settings for dialog, you can:
Select users and groups from the Add Users and Groups drop-down and assign permission levels
for them.
Update cluster permissions for users and groups that have already been added, using the drop-down
menu beside a user or group name.
5. Click Done .

Example: using cluster-level permissions to enforce cluster configurations
One benefit of cluster access control is the ability to enforce cluster configurations so that users cannot change
them.
For example, configurations that admins might want to enforce include:
Tags to charge back costs
Azure AD credential passthrough to Azure Data Lake Storage to control access to data
Standard libraries
Azure Databricks recommends the following workflow for organizations that need to lock down cluster
configurations:
1. Disable Allow unrestricted cluster creation for all users.

NOTE
This entitlement cannot be removed from admin users.

2. After you create all of the cluster configurations that you want your users to use, give the users who need
access to a given cluster Can Restart permission. This allows a user to freely start and stop the cluster
without having to set up all of the configurations manually.
Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_permissions:
resource "databricks_group" "auto" {
display_name = "Automation"
}

resource "databricks_group" "eng" {


display_name = "Engineering"
}

resource "databricks_group" "ds" {


display_name = "Data Science"
}

data "databricks_spark_version" "latest" {}

data "databricks_node_type" "smallest" {


local_disk = true
}

resource "databricks_cluster" "shared_autoscaling" {


cluster_name = "Shared Autoscaling"
spark_version = data.databricks_spark_version.latest.id
node_type_id = data.databricks_node_type.smallest.id
autotermination_minutes = 60
autoscale {
min_workers = 1
max_workers = 10
}
}

resource "databricks_permissions" "cluster_usage" {


cluster_id = databricks_cluster.shared_autoscaling.cluster_id

access_control {
group_name = databricks_group.auto.display_name
permission_level = "CAN_ATTACH_TO"
}

access_control {
group_name = databricks_group.eng.display_name
permission_level = "CAN_RESTART"
}

access_control {
group_name = databricks_group.ds.display_name
permission_level = "CAN_MANAGE"
}
}
Pool access control

IMPORTANT
This feature is in Public Preview.

NOTE
Access control is available only in the Premium Plan.

By default, all users can create and modify pools unless an administrator enables pool access control. With pool
access control, permissions determine a user’s abilities. This article describes the individual permissions and
how to configure pool access control.
Before you can use pool access control, an Azure Databricks admin must enable it for the workspace. See Enable
pool access control for your workspace.

Pool permissions
There are three permission levels for a pool: No Permissions , Can Attach To , and Can Manage . The table
lists the abilities for each permission.

| Ability                 | No Permissions | Can Attach To | Can Manage |
|-------------------------|----------------|---------------|------------|
| Attach cluster to pool  |                | x             | x          |
| Delete pool             |                |               | x          |
| Edit pool               |                |               | x          |
| Modify pool permissions |                |               | x          |

Configure pool permissions


To give a user or group permission to manage pools or attach a cluster to a pool using the UI, at the bottom of
the pool configuration page, select the Permissions tab. You can:
Select users and groups from the Select User or Group drop-down and assign permission levels for them.
Update pool permissions for users and groups that have already been added, using the drop-down menu
beside a user or group name.
NOTE
You can also give a user or group permission to manage pools or attach a cluster to a pool using the Permissions API 2.0.

The only way to grant a user or group permission to create a pool is through the SCIM API. Follow the SCIM API
2.0 documentation and grant the user the allow-instance-pool-create entitlement.
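
As a hedged sketch of that SCIM call using Python and the requests library (the workspace URL, token, and SCIM user ID are placeholders, and the payload shape should be confirmed against the SCIM API 2.0 documentation):

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"                            # placeholder admin token
user_id = "<scim-user-id>"                                   # placeholder SCIM user ID

payload = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [{
        "op": "add",
        "path": "entitlements",
        "value": [{"value": "allow-instance-pool-create"}],
    }],
}

resp = requests.patch(
    f"{host}/api/2.0/preview/scim/v2/Users/{user_id}",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()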

Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_permissions:

resource "databricks_group" "auto" {


display_name = "Automation"
}

resource "databricks_group" "eng" {


display_name = "Engineering"
}

data "databricks_node_type" "smallest" {


local_disk = true
}

resource "databricks_instance_pool" "this" {


instance_pool_name = "Reserved Instances"
idle_instance_autotermination_minutes = 60
node_type_id = data.databricks_node_type.smallest.id
min_idle_instances = 0
max_capacity = 10
}

resource "databricks_permissions" "pool_usage" {


instance_pool_id = databricks_instance_pool.this.id

access_control {
group_name = databricks_group.auto.display_name
permission_level = "CAN_ATTACH_TO"
}

access_control {
group_name = databricks_group.eng.display_name
permission_level = "CAN_MANAGE"
}
}
Jobs access control

NOTE
Access control is available only in the Premium Plan.

Enabling access control for jobs allows job owners to control who can view job results or manage runs of a job.
This article describes the individual permissions and how to configure jobs access control.
Before you can use jobs access control, an Azure Databricks admin must enable it for the workspace. See Enable
jobs access control for your workspace.

Job permissions
There are five permission levels for jobs: No Permissions , Can View , Can Manage Run , Is Owner , and Can
Manage . Admins are granted the Can Manage permission by default, and they can assign that permission to
non-admin users.

NOTE
The job owner can be changed only by an admin.

The table lists the abilities for each permission.

| Ability                                   | No Permissions | Can View | Can Manage Run | Is Owner | Can Manage |
|-------------------------------------------|----------------|----------|----------------|----------|------------|
| View job details and settings             | x              | x        | x              | x        | x          |
| View results, Spark UI, logs of a job run |                | x        | x              | x        | x          |
| Run now                                   |                |          | x              | x        | x          |
| Cancel run                                |                |          | x              | x        | x          |
| Edit job settings                         |                |          |                | x        | x          |
| Modify permissions                        |                |          |                | x        | x          |
| Delete job                                |                |          |                | x        | x          |
| Change owner                              |                |          |                |          |            |
NOTE
The creator of a job has Is Owner permission.
A job cannot have more than one owner.
A job cannot have a group as an owner.
Jobs triggered through Run Now assume the permissions of the job owner and not the user who issued Run Now .
For example, even if job A is configured to run on an existing cluster accessible only to the job owner (user A), a user
(user B) with Can Manage Run permission can start a new run of the job.
You can view notebook run results only if you have the Can View or higher permission on the job. This allows jobs
access control to be intact even if the job notebook was renamed, moved, or deleted.
Jobs access control applies to jobs displayed in the Databricks Jobs UI and their runs. It doesn’t apply to runs spawned
by modularized or linked code in notebooks or runs submitted by API whose ACLs are bundled with the notebooks.

Configure job permissions


NOTE
This section describes how to manage permissions using the UI. You can also use the Permissions API 2.0.

You must have Can Manage or Is Owner permission.


1. Go to the details page for a job.
2. Click the Edit permissions button in the Job details panel.
3. In the pop-up dialog box, assign job permissions via the drop-down menu beside a user’s name.

4. Click Save Changes .

Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_permissions:
resource "databricks_group" "auto" {
display_name = "Automation"
}

resource "databricks_group" "eng" {


display_name = "Engineering"
}

data "databricks_spark_version" "latest" {}

data "databricks_node_type" "smallest" {


local_disk = true
}

resource "databricks_job" "this" {


name = "Featurization"
max_concurrent_runs = 1

new_cluster {
num_workers = 300
spark_version = data.databricks_spark_version.latest.id
node_type_id = data.databricks_node_type.smallest.id
}

notebook_task {
notebook_path = "/Production/MakeFeatures"
}
}

resource "databricks_permissions" "job_usage" {


job_id = databricks_job.this.id

access_control {
group_name = "users"
permission_level = "CAN_VIEW"
}

access_control {
group_name = databricks_group.auto.display_name
permission_level = "CAN_MANAGE_RUN"
}

access_control {
group_name = databricks_group.eng.display_name
permission_level = "CAN_MANAGE"
}
}
Delta Live Tables access control

NOTE
Access control is available only in the Premium Plan.

Enabling access control for Delta Live Tables allows pipeline owners to control access to pipelines, including
permissions to view pipeline details, start and stop pipeline updates, and manage pipeline settings. This article
describes the individual permissions and how to configure pipeline access control.

Delta Live Tables permissions


There are five permissions for Delta Live Tables: No Permissions , Can View , Can Run , Can Manage , and Is
Owner . Admins are granted the Can Manage permission by default, and they can assign that permission to
non-admin users.

NOTE
The pipeline owner can be changed only by an admin.

The following table lists the access for each permission. The Admin column specifies access for workspace
admins.

| Ability                                 | No Permissions | Can View | Can Run | Can Manage | Is Owner | Admin |
|-----------------------------------------|----------------|----------|---------|------------|----------|-------|
| View pipeline details and list pipeline |                | x        | x       | x          | x        | x     |
| View Spark UI and driver logs           |                | x        | x       | x          | x        | x     |
| Start and stop a pipeline update        |                |          | x       | x          | x        | x     |
| Stop pipeline clusters directly         |                |          | x       | x          | x        | x     |
| Edit pipeline settings                  |                |          |         | x          | x        | x     |
| Delete the pipeline                     |                |          |         | x          | x        | x     |
| Modify pipeline permissions             |                |          |         | x          | x        | x     |
| Change owner                            |                |          |         |            |          | x     |

NOTE
By default, the creator of a pipeline has Is Owner permission.
Each execution of a pipeline is called an update. Updates assume the permissions of the user with Is owner
permission on the pipeline and not the user who started the update. All update actions, for example, fetching
notebooks, creating clusters, and running queries, are executed as the pipeline owner.
Clusters are not re-used between pipeline owners.
A pipeline owner can only be a workspace user or service principal.
A pipeline cannot have more than one owner.
A pipeline owner can be changed only by a workspace admin.

Configure pipeline permissions


You must have Can Manage or Is Owner permission.
1. Go to the details page for a pipeline.
2. Click the Permissions button in the Pipeline Details panel.
3. In the pop-up dialog box, assign permissions by clicking the drop-down menu beside a user’s name.
4. To assign permissions for another user, group, or service principal, click the Select User, Group, or Service
Principal drop-down, select the permission, and click the + Add button.
5. Click Save .
Secret access control

By default, all users in all pricing plans can create secrets and secret scopes. Using secret access control,
available with the Premium Plan, you can configure fine-grained permissions for managing access control. This
guide describes how to set up these controls.

NOTE
Access control is available only in the Premium Plan. If your account has the Standard Plan, you must explicitly
grant MANAGE permission to the “users” (all users) group when you create secret scopes.
This article describes how to manage secret access control using the Databricks CLI (version 0.7.1 and above).
Alternatively, you can use the Secrets API 2.0.

Secret access control


Access control for secrets is managed at the secret scope level. An access control list (ACL) defines a relationship
between an Azure Databricks principal (user or group), secret scope, and permission level. In general, a user will
use the most powerful permission available to them (see Permission Levels).
When a secret is read via a notebook using the Secrets utility (dbutils.secrets), the user’s permission will be
applied based on who is executing the command, and they must at least have READ permission.
When a scope is created, an initial MANAGE permission level ACL is applied to the scope. Subsequent access
control configurations can be performed by that principal.

Permission levels
The secret access permissions are as follows:
MANAGE - Allowed to change ACLs, and read and write to this secret scope.
WRITE - Allowed to read and write to this secret scope.
READ - Allowed to read this secret scope and list what secrets are available.
Each permission level is a subset of the previous level’s permissions (that is, a principal with WRITE permission
for a given scope can perform all actions that require READ permission).

NOTE
Databricks admins have MANAGE permissions to all secret scopes in the workspace.

Create a secret ACL


To create a secret ACL for a given secret scope using the Databricks CLI (version 0.7.1 and above):

databricks secrets put-acl --scope <scope-name> --principal <principal> --permission <permission>

Making a put request for a principal that already has an applied permission overwrites the existing permission
level.
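
The equivalent call against the Secrets API 2.0, sketched here with Python and the requests library (the workspace URL and token are placeholders; the scope and principal mirror the CLI example above):

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"                            # placeholder token

# Grants READ on the scope to a group; repeating the call overwrites the existing ACL.
resp = requests.post(
    f"{host}/api/2.0/secrets/acls/put",
    headers={"Authorization": f"Bearer {token}"},
    json={"scope": "<scope-name>", "principal": "data-scientists", "permission": "READ"},
)
resp.raise_for_status()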
View secret ACLs
To view all secret ACLs for a given secret scope:

databricks secrets list-acls --scope <scope-name>

To get the secret ACL applied to a principal for a given secret scope:

databricks secrets get-acl --scope <scope-name> --principal <principal>

If no ACL exists for the given principal and scope, this request will fail.

Delete a secret ACL


To delete a secret ACL applied to a principal for a given secret scope:

databricks secrets delete-acl --scope <scope-name> --principal <principal>

Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_secret_acl:

resource "databricks_group" "ds" {


display_name = "data-scientists"
}

resource "databricks_secret_scope" "app" {


name = "app-secret-scope"
}

resource "databricks_secret_acl" "my_secret_acl" {


principal = databricks_group.ds.display_name
permission = "READ"
scope = databricks_secret_scope.app.name
}

resource "databricks_secret" "publishing_api" {


key = "publishing_api"
string_value = "SECRET_API_TOKEN_HERE"
scope = databricks_secret_scope.app.name
}
Secret management

Sometimes accessing data requires that you authenticate to external data sources through JDBC. Instead of
directly entering your credentials into a notebook, use Azure Databricks secrets to store your credentials and
reference them in notebooks and jobs. To manage secrets, you can use the Databricks CLI to access the Secrets
API 2.0.

WARNING
Administrators, secret creators, and users granted permission can read Azure Databricks secrets. While Azure Databricks
makes an effort to redact secret values that might be displayed in notebooks, it is not possible to prevent such users from
reading secrets. For more information, see Secret redaction.

To set up secrets you:


1. Create a secret scope. Secret scope names are case insensitive.
2. Add secrets to the scope. Secret names are case insensitive.
3. If you have the Premium Plan, assign access control to the secret scope.
This guide shows you how to perform these setup tasks and manage secrets. For more information, see:
An end-to-end example of how to use secrets in your workflows.
Reference for the Secrets CLI.
Reference for the Secrets API 2.0.
How to use Secrets utility (dbutils.secrets) to reference secrets in notebooks and jobs.
In this guide:
Secret scopes
Overview
Azure Key Vault-backed scopes
Databricks-backed scopes
Scope permissions
Best practices
Create an Azure Key Vault-backed secret scope
Create an Azure Key Vault-backed secret scope using the UI
Create an Azure Key Vault-backed secret scope using the Databricks CLI
Create a Databricks-backed secret scope
List secret scopes
Delete a secret scope
Secrets
Create a secret
Create a secret in an Azure Key Vault-backed scope
Create a secret in a Databricks-backed scope
List secrets
Read a secret
Delete a secret
Use a secret in a Spark configuration property or environment variable
Requirements and limitations
Syntax for referencing secrets in a Spark configuration property or environment variable
Reference a secret with a Spark configuration property
Reference a secret in an environment variable
Secret redaction
Secret workflow example
Create a secret scope
Create secrets
Create the secrets in an Azure Key Vault-backed scope
Create the secrets in a Databricks-backed scope
Use the secrets in a notebook
Grant access to another group
Secret scopes

Managing secrets begins with creating a secret scope. A secret scope is a collection of secrets identified by a
name. A workspace is limited to a maximum of 100 secret scopes.

NOTE
Databricks recommends aligning secret scopes to roles or applications rather than individuals.

Overview
There are two types of secret scope: Azure Key Vault-backed and Databricks-backed.
Azure Key Vault-backed scopes
To reference secrets stored in an Azure Key Vault, you can create a secret scope backed by Azure Key Vault. You
can then leverage all of the secrets in the corresponding Key Vault instance from that secret scope. Because the
Azure Key Vault-backed secret scope is a read-only interface to the Key Vault, the PutSecret and DeleteSecret
Secrets API 2.0 operations are not allowed. To manage secrets in Azure Key Vault, you must use the Azure
SetSecret REST API or Azure portal UI.
Databricks-backed scopes
A Databricks-backed secret scope is stored in (backed by) an encrypted database owned and managed by Azure
Databricks. The secret scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace.
You create a Databricks-backed secret scope using the Databricks CLI (version 0.7.1 and above). Alternatively,
you can use the Secrets API 2.0.
Scope permissions
Scopes are created with permissions controlled by ACLs. By default, scopes are created with MANAGE permission
for the user who created the scope (the “creator”), which lets the creator read secrets in the scope, write secrets
to the scope, and change ACLs for the scope. If your account has the Premium Plan, you can assign granular
permissions at any time after you create the scope. For details, see Secret access control.
You can also override the default and explicitly grant MANAGE permission to all users when you create the scope.
In fact, you must do this if your account does not have the Premium Plan.
Best practices
As a team lead, you might want to create different scopes for Azure Synapse Analytics and Azure Blob storage
credentials and then provide different subgroups in your team access to those scopes. You should consider how
to achieve this using the different scope types:
If you use a Databricks-backed scope and add the secrets in those two scopes, they will be different secrets
(Azure Synapse Analytics in scope 1, and Azure Blob storage in scope 2).
If you use an Azure Key Vault-backed scope with each scope referencing a different Azure Key Vault and
add your secrets to those two Azure Key Vaults, they will be different sets of secrets (Azure Synapse Analytics
ones in scope 1, and Azure Blob storage in scope 2). These will work like Databricks-backed scopes.
If you use two Azure Key Vault-backed scopes with both scopes referencing the same Azure Key Vault and
add your secrets to that Azure Key Vault, all Azure Synapse Analytics and Azure Blob storage secrets will be
available. Since ACLs are at the scope level, all members across the two subgroups will see all secrets. This
arrangement does not satisfy your use case of restricting access to a set of secrets to each group.

Create an Azure Key Vault-backed secret scope


You can create an Azure Key Vault-backed secret scope using the UI or using the Databricks CLI.
Create an Azure Key Vault-backed secret scope using the UI
1. Verify that you have Contributor permission on the Azure Key Vault instance that you want to use to back
the secret scope.
If you do not have a Key Vault instance, follow the instructions in Quickstart: Create a Key Vault using the
Azure portal.
2. Go to https://<databricks-instance>#secrets/createScope . This URL is case sensitive; scope in
createScope must be uppercase.

3. Enter the name of the secret scope. Secret scope names are case insensitive.
4. Use the Manage Principal drop-down to specify whether All Users have MANAGE permission for this
secret scope or only the Creator of the secret scope (that is to say, you).
MANAGE permission allows users to read and write to this secret scope, and, in the case of accounts on the
Premium Plan, to change permissions for the scope.
Your account must have the Premium Plan for you to be able to select Creator. This is the recommended
approach: grant MANAGE permission to the Creator when you create the secret scope, and then assign
more granular access permissions after you have tested the scope. For an example workflow, see Secret
workflow example.
If your account has the Standard Plan, you must set the MANAGE permission to the “All Users” group. If
you select Creator here, you will see an error message when you try to save the scope.
For more information about the MANAGE permission, see Secret access control.
5. Enter the DNS Name (for example, https://databrickskv.vault.azure.net/ ) and Resource ID , for
example:

/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/databricks-rg/providers/Microsoft.KeyVault/vaults/databricksKV

These properties are available from the Properties tab of an Azure Key Vault in your Azure portal.

6. Click the Create button.


7. Use the Databricks CLI databricks secrets list-scopes command to verify that the scope was created
successfully.
For an example of using secrets when accessing Azure Blob storage, see Mounting cloud object storage on
Azure Databricks.
Create an Azure Key Vault-backed secret scope using the Databricks CLI
1. Install the CLI and configure it to use an Azure Active Directory (Azure AD) token for authentication.

IMPORTANT
You need an Azure AD user token to create an Azure Key Vault-backed secret scope with the Databricks CLI. You
cannot use an Azure Databricks personal access token or an Azure AD application token that belongs to a service
principal.
If the key vault exists in a different tenant than the Azure Databricks workspace, the Azure AD user who creates
the secret scope must have permission to create service principals in the key vault’s tenant. Otherwise, the
following error occurs:

Unable to grant read/list permission to Databricks service principal to KeyVault
'https://xxxxx.vault.azure.net/': Status code 403, {"odata.error":
{"code":"Authorization_RequestDenied","message":{"lang":"en","value":"Insufficient privileges to
complete the operation."},"requestId":"XXXXX","date":"YYYY-MM-DDTHH:MM:SS"}}

2. Create the Azure Key Vault scope:


databricks secrets create-scope --scope <scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <azure-keyvault-resource-id> --dns-name <azure-keyvault-dns-name>

By default, scopes are created with MANAGE permission for the user who created the scope. If your
account does not have the Premium Plan, you must override that default and explicitly grant the MANAGE
permission to the users (all users) group when you create the scope:

databricks secrets create-scope --scope <scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <azure-keyvault-resource-id> --dns-name <azure-keyvault-dns-name> --initial-manage-principal users

If your account is on the Premium Plan, you can change permissions at any time after you create the
scope. For details, see Secret access control.
Once you have created the secret scope, you can add secrets.
For an example of using secrets when accessing Azure Blob storage, see Mounting cloud object storage on
Azure Databricks.

Create a Databricks-backed secret scope


Secret scope names are case insensitive.
To create a scope using the Databricks CLI:

databricks secrets create-scope --scope <scope-name>

By default, scopes are created with MANAGE permission for the user who created the scope. If your account does
not have the Premium Plan, you must override that default and explicitly grant the MANAGE permission to
“users” (all users) when you create the scope:

databricks secrets create-scope --scope <scope-name> --initial-manage-principal users

You can also create a Databricks-backed secret scope using the Secrets API Put secret operation.
If your account has the Premium Plan, you can change permissions at any time after you create the scope. For
details, see Secret access control.
Once you have created a Databricks-backed secret scope, you can add secrets.

List secret scopes


To list the existing scopes in a workspace using the CLI:

databricks secrets list-scopes

You can also list existing scopes using the Secrets API List secrets operation.

Delete a secret scope


Deleting a secret scope deletes all secrets and ACLs applied to the scope. To delete a scope using the CLI:
databricks secrets delete-scope --scope <scope-name>

You can also delete a secret scope using the Secrets API Delete secret scope operation.
Secrets

A secret is a key-value pair that stores secret material, with a key name unique within a secret scope. Each scope
is limited to 1000 secrets. The maximum allowed secret value size is 128 KB.

Create a secret
Secret names are case insensitive.
The method for creating a secret depends on whether you are using an Azure Key Vault-backed scope or a
Databricks-backed scope.
Create a secret in an Azure Key Vault-backed scope
To create a secret in Azure Key Vault you use the Azure SetSecret REST API or Azure portal UI.

Create a secret in a Databricks-backed scope


To create a secret in a Databricks-backed scope using the Databricks CLI (version 0.7.1 and above):

databricks secrets put --scope <scope-name> --key <key-name>

An editor opens and displays content like this:

# ----------------------------------------------------------------------
# Do not edit the above line. Everything that follows it will be ignored.
# Please input your secret value above the line. Text will be stored in
# UTF-8 (MB4) form and any trailing new line will be stripped.
# Exit without saving will abort writing secret.

Paste your secret value above the line and save and exit the editor. Your input is stripped of the comments and
stored associated with the key in the scope.
If you issue a write request with a key that already exists, the new value overwrites the existing value.
You can also provide a secret from a file or from the command line. For more information about writing secrets,
see Secrets CLI.
List secrets
To list secrets in a given scope:

databricks secrets list --scope <scope-name>

The response displays metadata information about the secret, such as the secret key name and last updated at
timestamp (in milliseconds since epoch). You use the Secrets utility (dbutils.secrets) in a notebook or job to read
a secret. For example:

databricks secrets list --scope jdbc

Key name    Last updated
----------  --------------
password    1531968449039
username    1531968408097

Read a secret
You create secrets using the REST API or CLI, but you must use the Secrets utility (dbutils.secrets) in a notebook
or job to read a secret.
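
For example, in a Python notebook cell (assuming a scope named jdbc with a secret named username already exists):

# dbutils is available automatically in Databricks notebooks.
jdbc_username = dbutils.secrets.get(scope = "jdbc", key = "username")

# Listing returns only metadata (key names and timestamps), never secret values.
for secret in dbutils.secrets.list("jdbc"):
    print(secret.key)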

Delete a secret
To delete a secret from a scope with the Databricks CLI:

databricks secrets delete --scope <scope-name> --key <key-name>

You can also use the Secrets API 2.0.


To delete a secret from a scope backed by Azure Key Vault, use the Azure Key Vault REST API or Azure portal UI.

Use a secret in a Spark configuration property or environment


variable
IMPORTANT
This feature is in Public Preview.

NOTE
Available in Databricks Runtime 6.4 Extended Support and above.

You can reference a secret in a Spark configuration property or environment variable. Retrieved secrets are
redacted from notebook output and Spark driver and executor logs.
IMPORTANT
Keep the following security implications in mind when referencing secrets in a Spark configuration property or
environment variable:
If table access control is not enabled on a cluster, any user with Can Attach To permissions on a cluster or Run
permissions on a notebook can read Spark configuration properties from within the notebook. This includes users
who do not have direct permission to read a secret. Databricks recommends enabling table access control on all
clusters or managing access to secrets using secret scopes.
Even when table access control is enabled, users with Can Attach To permissions on a cluster or Run permissions
on a notebook can read cluster environment variables from within the notebook. Databricks does not recommend
storing secrets in cluster environment variables if they must not be available to all users on the cluster.
Secrets are not redacted from the Spark driver log stdout and stderr streams. By default, Spark driver logs are
viewable by users with any of the following cluster level permissions:
Can Attach To
Can Restart
Can Manage
You can optionally limit who can read Spark driver logs to users with the Can Manage permission by setting the
cluster’s Spark configuration property spark.databricks.acl.needAdminPermissionToViewLogs true

Requirements and limitations


The following requirements and limitations apply to referencing secrets in Spark configuration properties and
environment variables:
Cluster owners must have Can Read permission on the secret scope.
Only cluster owners can add a reference to a secret in a Spark configuration property or environment
variable and edit the existing scope and name. Owners change a secret using the Put secret API. You must
restart your cluster to fetch the secret again.
Users with the Can Manage permission on the cluster can delete a secret Spark configuration property or
environment variable.
Syntax for referencing secrets in a Spark configuration property or environment variable
You can refer to a secret using any valid variable name or Spark configuration property. Azure Databricks
enables special behavior for variables referencing secrets based on the syntax of the value being set, not the
variable name.
The syntax of the Spark configuration property or environment variable value must be
{{secrets/<scope-name>/<secret-name>}} . The value must start with {{secrets/ and end with }} .
The variable portions of the Spark configuration property or environment variable are:
<scope-name> : The name of the scope with which the secret is associated.
<secret-name> : The unique name of the secret in the scope.

For example, {{secrets/scope1/key1}} .

NOTE
There should be no spaces between the curly brackets. If there are spaces, they are treated as part of the scope or
secret name.

Reference a secret with a Spark configuration property


You specify a reference to a secret in a Spark configuration property in the following format:
spark.<property-name> {{secrets/<scope-name>/<secret-name>}}

Any Spark configuration <property-name> can reference a secret. Each Spark configuration property can only
reference one secret, but you can configure multiple Spark properties to reference secrets.
Example
You set a Spark configuration to reference a secret:

spark.password {{secrets/scope1/key1}}

To fetch the secret in the notebook and use it:


Python

spark.conf.get("spark.password")

SQL

SELECT ${spark.password};

Reference a secret in an environment variable


You specify a secret path in an environment variable in the following format:

<variable-name>={{secrets/<scope-name>/<secret-name>}}

Environment variables that reference secrets have special behavior. These environment variables are accessible
from a cluster-scoped init script, but are not accessible from a program running in Spark.
Example
You set an environment variable to reference a secret:

SPARKPASSWORD={{secrets/scope1/key1}}

To fetch the secret in an init script, access $SPARKPASSWORD :

if [ -n "$SPARKPASSWORD" ]; then
use ${SPARKPASSWORD}
fi
Secret redaction

Storing credentials as Azure Databricks secrets makes it easy to protect your credentials when you run
notebooks and jobs. However, it is easy to accidentally print a secret to standard output buffers or display the
value during variable assignment.
To prevent this, Azure Databricks redacts secret values that are read using dbutils.secrets.get() . When
displayed in notebook cell output, the secret values are replaced with [REDACTED] .
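
For example, in a Python notebook cell (assuming a scope named jdbc with a secret named password exists):

password = dbutils.secrets.get(scope = "jdbc", key = "password")

print(password)       # the literal value is redacted: [REDACTED]
print(len(password))  # derived values, such as the length, are not literals and are shown as-is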

WARNING
Secret redaction for notebook cell output applies only to literals. The secret redaction functionality therefore does not
prevent deliberate and arbitrary transformations of a secret literal. To ensure the proper control of secrets, you should use
Workspace object access control (limiting permission to run commands) to prevent unauthorized access to shared
notebook contexts.
Secret workflow example

In this workflow example, we use secrets to set up JDBC credentials for connecting to an Azure Data Lake Store.

Create a secret scope


Create a secret scope called jdbc .
To create a Databricks-backed secret scope:

databricks secrets create-scope --scope jdbc

To create an Azure Key Vault-backed secret scope, follow the instructions in Create an Azure Key Vault-backed
secret scope.

NOTE
If your account does not have the Premium Plan, you must create the scope with MANAGE permission granted to all users
(“users”). For example:

databricks secrets create-scope --scope jdbc --initial-manage-principal users

Create secrets
The method for creating the secrets depends on whether you are using an Azure Key Vault-backed scope or a
Databricks-backed scope.
Create the secrets in an Azure Key Vault-backed scope
Add the secrets username and password using the Azure SetSecret REST API or Azure portal UI:

Create the secrets in a Databricks-backed scope


Add the secrets username and password . Run the following commands and enter the secret values in the
opened editor.
databricks secrets put --scope jdbc --key username
databricks secrets put --scope jdbc --key password

Use the secrets in a notebook


In a notebook, read the secrets that are stored in the secret scope jdbc to configure a JDBC connector:

val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

val connectionProperties = new java.util.Properties()
connectionProperties.setProperty("Driver", driverClass)

val jdbcUsername = dbutils.secrets.get(scope = "jdbc", key = "username")
val jdbcPassword = dbutils.secrets.get(scope = "jdbc", key = "password")
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")

You can now use these connectionProperties with the JDBC connector to talk to your data source. The values
fetched from the scope are never displayed in the notebook (see Secret redaction).

Grant access to another group


NOTE
This step requires that your account have the Premium Plan.

After verifying that the credentials were configured correctly, share these credentials with the datascience
group to use for their analysis.
Grant the datascience group read-only permission to these credentials by making the following request:

databricks secrets put-acl --scope jdbc --principal datascience --permission READ


Best practices: GDPR and CCPA compliance using
Delta Lake

This article describes how you can use Delta Lake on Azure Databricks to manage General Data Protection
Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for your data lake. Because Delta
Lake adds a transactional layer that provides structured data management on top of your data lake, it can
dramatically simplify and speed up your ability to locate and remove personal information (also known as
“personal data”) in response to consumer GDPR or CCPA requests.

The challenge
Your organization may manage hundreds of terabytes worth of personal information in your cloud. Bringing
these datasets into GDPR and CCPA compliance is of paramount importance, but this can be a big challenge,
especially for larger datasets stored in data lakes.
The challenge typically arises from the following factors:
When you have large amounts (petabyte scale) of data in the cloud, user data can be stored and distributed
across multiple datasets and locations.
Point or ad-hoc queries to find data for specific users are expensive (akin to finding a needle in a haystack),
because they often require full table scans. Taking a brute force approach to GDPR/CCPA compliance can result
in multiple jobs operating over different tables, resulting in weeks of engineering and operational effort.
Data lakes are inherently append-only and do not support the ability to perform row level “delete” or
“update” operations natively, which means that you must rewrite partitions of data. Typical data lake
offerings do not provide ACID transactional capabilities or efficient methods to find relevant data. Moreover,
read/write consistency is also a concern: while user data is being redacted from the data lake, processes that
read data should be protected from material impacts the way they would with a traditional RDBMS.
Data hygiene in the data lake is challenging, given that data lakes by design support availability and partition
tolerance with eventual consistency. Enforceable and rigorous practices and standards are required to assure
cleansed data.
As a result, organizations that manage user data at this scale often end up writing computationally difficult,
expensive, and time-consuming data pipelines to deal with GDPR and CCPA. For example, you might upload
portions of your data lake into proprietary data warehousing technologies, where GDPR and CCPA compliance-
related deletion activities are performed. This adds complexity and reduces data fidelity by forcing multiple
copies of the data. Moreover, exporting data from such warehouse technologies back into a data lake may
require re-optimization to improve query performance. This too results in multiple copies of data being created
and maintained.

How Delta Lake addresses the challenge


To resolve the issues listed above, the optimal approach to making a data lake GDPR- and CCPA-compliant
requires:
“Pseudonymization,” or reversible tokenization of personal information elements ( identifiers ) to keys (
pseudonyms ) that cannot be externally identified.
Storage of information in a manner linked to pseudonyms rather than identifiers;
Maintenance of strict access and use policies on the combination of the identifiers and pseudonyms;
Pipelines or bucket policies to remove raw data on timelines that help you comply with applicable law;
Structuring pipelines to locate and remove the identifier to destroy the linkage between the pseudonyms and
identifiers
ACID capabilities overlaid on top of the data lake to prevent readers from being negatively affected when
delete or update operations are carried out on the data lake.
High-performance pipeline, supporting, for example, the cleanup of 5TB of data within 10 minutes.
Delta Lake is a very effective tool for addressing these GDPR and CCPA compliance requirements, because its
structured data management system adds transactional capabilities to your data lake. Delta Lake’s well-
organized, well-sized, well-indexed, stats-enabled datasets enable quick and easy search, modification, and
cleanup of your data using standard SQL DML statements like DELETE , UPDATE , and MERGE INTO .
The two use cases described in the following sections demonstrate how to convert your existing data to Delta
Lake and how to delete and clean up personal information quickly and efficiently. This article also suggests
options for pseudonymizing personal information and for improving query performance with Delta Lake.

Delete personal data


This use case demonstrates how efficient Delta Lake can be when deleting personal data from your data lake.
The sample dataset
The workflow described in this article references a database gdpr containing a sample dataset with 65,000,000
rows and as many distinct customer IDs, amounting to 3.228 GB of data. Customer personal information is
captured in the customers table in this database.
The schema of the gdpr.customers table is:

|-- c_customer_sk: integer (nullable = true)
|-- c_customer_id: string (nullable = true)
|-- c_current_cdemo_sk: integer (nullable = true)
|-- c_current_hdemo_sk: integer (nullable = true)
|-- c_current_addr_sk: integer (nullable = true)
|-- c_first_shipto_date_sk: integer (nullable = true)
|-- c_first_sales_date_sk: integer (nullable = true)
|-- c_salutation: string (nullable = true)
|-- c_first_name: string (nullable = true)
|-- c_last_name: string (nullable = true)
|-- c_preferred_cust_flag: string (nullable = true)
|-- c_birth_day: integer (nullable = true)
|-- c_birth_month: integer (nullable = true)
|-- c_birth_year: integer (nullable = true)
|-- c_birth_country: string (nullable = true)
|-- c_email_address: string (nullable = true)
|-- c_last_review_date: string (nullable = true)

The list of customers requesting to be forgotten per GDPR and CCPA come from a transactional database table,
gdpr.customer_delete_keys , that is populated using an online portal. The keys (distinct users) to be deleted
represent roughly 10% (337.615 MB) of the original keys sampled from the original dataset in gdpr.customers .
The schema of the gdpr.customer_delete_keys table contains the following fields:

|-- c_customer_sk: integer (nullable = true)
|-- c_customer_id: string (nullable = true)

The key c_customer_id identifies customers to be deleted.


Step 1: Convert tables to Delta format
To get started with Delta Lake, you need to ingest your raw data (Parquet, CSV, JSON, and so on) and write it out
as managed Delta tables. If your data is already in Parquet format, you can use CONVERT TO DELTA to convert the
Parquet files in place to Delta tables without rewriting any data. If not, you can use the Apache Spark APIs you’re
familiar with to rewrite the format to Delta. Because Delta Lake uses Parquet, which is an open file format, your
converted data won’t be locked in: you can quickly and easily convert your data back into another format if you
need to.
This example converts the Parquet table customers in the gdpr database.

CONVERT TO DELTA gdpr.customers

Step 2: Perform deletes


After you convert your tables to Delta Lake, you can delete the personal information of the users who have
requested to be forgotten.

NOTE
The following example involves a straightforward delete of customer personal data from the customers table. A better
practice is to pseudonymize all customer personal information in your working tables (prior to receiving a data subject
request) and delete the customer entry from the “lookup table” that maps the customer to the pseudonym, while
ensuring that data in working tables cannot be used to reconstruct the customer’s identity. For details, see Pseudonymize
data.

NOTE
The following examples make reference to performance numbers as a way of illustrating the impact of certain
performance options. These numbers were recorded on the dataset described above, on a cluster with 3 worker nodes,
each with 90 GB memory and 12 cores; the driver had 30GB memory and 4 cores.

Here is a simple Delta Lake DELETE FROM operation, deleting the customers included in the
customer_delete_keys table from our sample gdpr.customers table:

DELETE FROM `gdpr.customers` AS t1 WHERE EXISTS (SELECT c_customer_id FROM gdpr.customer_delete_keys WHERE
t1.c_customer_id = c_customer_id)

During testing, this operation took too long to complete: finding files took 32 seconds and rewriting files took
2.6 min. To reduce the time to find the relevant files, you can increase the broadcast threshold:

set spark.sql.autoBroadcastJoinThreshold = 104857600;

Raising this threshold instructs Spark to broadcast the smaller side of the join when it fits under the configured size, rather than shuffling both tables.
This setting dropped file-finding to 8 seconds and writing to 1.6 minutes.
You can speed up performance even more with Delta Lake Z-Ordering (multi-dimensional clustering). Z-
Ordering creates a range partition-based arrangement of data and indexes this information in the Delta table.
Delta Lake uses this z-index to find files impacted by the DELETE operation.
To take advantage of Z-Ordering, you must understand how the data you expect to be deleted is spread across
the target table. For example, if the data, even for a few keys, is spread across 90% of the files for the dataset,
you’ll be rewriting more than 90% of your data. Z-Ordering by relevant key columns reduces the number of files
touched and can make rewrites much more efficient.
In this case, you should Z-Order by the c_customer_id column before running delete:

OPTIMIZE gdpr.customers Z-ORDER BY c_customer_id

After Z-Ordering, finding files took 7 secs and writing dropped to 50 seconds.
Step 3: Clean up stale data
Depending on how long after a consumer request you delete your data and on your underlying data lake, you
may need to delete table history and underlying raw data.
By default, Delta Lake retains table history for 30 days and makes it available for “time travel” and rollbacks.
That means that, even after you have deleted personal information from a Delta table, users in your organization
may be able to view that historical data and roll back to a version of the table in which the personal information
is still stored. If you determine that GDPR or CCPA compliance requires that these stale records be made
unavailable for querying before the default retention period is up, you can use the VACUUM function to remove
files that are no longer referenced by a Delta table and are older than a specified retention threshold. Once you
have removed table history using the VACUUM command, all users lose the ability to view that history and roll
back.
After you have deleted the customers who requested that their information be deleted, remove all unreferenced
files older than the default 7-day retention period by running:

VACUUM gdpr.customers

To also remove unreferenced files that are less than 7 days old, specify a shorter retention threshold with the RETAIN num HOURS option:

VACUUM gdpr.customers RETAIN 100 HOURS

In addition, if you created Delta tables using Spark APIs to rewrite non-Parquet files to Delta (as opposed to
converting Parquet files to Delta Lake in-place), your raw data may still contain personal information that you
have deleted or anonymized. Databricks recommends that you set up a retention policy of thirty days or less with
your cloud provider to remove raw data automatically.

Pseudonymize data
While the deletion method described above can, strictly speaking, allow your organization to comply with the GDPR and
CCPA requirement to delete personal information, it comes with a number of downsides. The first
is that the GDPR does not permit any additional processing of personal information once a valid request to
delete has been received. As a consequence, if the data is not stored in a pseudonymized fashion—that is,
replacing personally identifiable information with an artificial identifier or pseudonym—prior to the receipt of
the data subject request, you are obligated to simply delete all of the linked information. If, however, you have
previously pseudonymized the underlying data, your obligations to delete are satisfied by the simple destruction
of any record that links the identifier to the pseudonym (assuming the remaining data is not itself identifiable),
and you may retain the remainder of the data.
In a typical pseudonymization scenario, you keep a secured “lookup table” that maps the customer’s personal
identifiers (name, email address, etc) to the pseudonym. This has the advantage not only of making deletion
easier, but also of allowing you to “restore” the user identity temporarily to update user data over time, an
advantage denied in an anonymization scenario, in which by definition a customer’s identity can never be
restored, and all customer data is by definition static and historical.
For a simple pseudonymization example, consider the customer table updated in the deletion example. In the
pseudonymization scenario, you can create a gdpr.customers_lookup table that contains all customer data that
could be used to identify the customer, with an additional column for a pseudonymized email address. Now, you
can use the pseudo email address as the key in any data tables that reference customers, and when there is a
request to forget this information, you can simply delete that information from the gdpr.customers_lookup table
and the rest of the information can remain non-identifiable forever.
The schema of the gdpr.customers_lookup table is:

|-- c_customer_id: string (nullable = true)


|-- c_email_address: string (nullable = true)
|-- c_email_address_pseudonym: string (nullable = true)
|-- c_first_name: string (nullable = true)
|-- c_last_name: string (nullable = true)

In this scenario, put the remaining customer data, which cannot be used to identify the customer, in a
pseudonymized table called gdpr.customers_pseudo :

|-- c_email_address_pseudonym: string (nullable = true)


|-- c_customer_sk: integer (nullable = true)
|-- c_current_cdemo_sk: integer (nullable = true)
|-- c_current_hdemo_sk: integer (nullable = true)
|-- c_current_addr_sk: integer (nullable = true)
|-- c_first_shipto_date_sk: integer (nullable = true)
|-- c_first_sales_date_sk: integer (nullable = true)
|-- c_salutation: string (nullable = true)
|-- c_preferred_cust_flag: string (nullable = true)
|-- c_birth_year: integer (nullable = true)
|-- c_birth_country: string (nullable = true)
|-- c_last_review_date: string (nullable = true)

Use Delta Lake to pseudonymize customer data


A strong way to pseudonymize personal information is one-way cryptographic hashing and salting with a
remembered salt or salts. Hashing turns data into a fixed-length fingerprint that cannot be computationally
reversed. Salting adds a random string to the data that will be hashed as a way to frustrate attackers who are
using lookup or “rainbow” tables that contain hashes of millions of known email addresses or passwords.
You can salt the column c_email_address by adding a random secret string literal before hashing. This secret
string can be stored using Azure Databricks secrets to add additional security to your salt. If unauthorized Azure
Databricks users try to access the secret, they will see redacted values.

dbutils.secrets.get(scope = "salt", key = "useremail")

res0: String = [REDACTED]

NOTE
This is a simple example to illustrate salting. Using the same salt for all of your customer keys is not a good way to
mitigate attacks; it just makes the customer keys longer. A more secure approach would be to generate a random salt for
each user. See Make your pseudonymization stronger.

Once you salt the column c_email_address , you can hash it and add the hash to the gdpr.customers_lookup
table as c_email_address_pseudonym :

UPDATE gdpr.customers_lookup SET c_email_address_pseudonym = sha2(c_email_address,256)

Now you can use this value for all of your customer-keyed tables.
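
The UPDATE above shows hashing only. The following is a minimal PySpark sketch of applying the salt before hashing; it is an illustrative assumption rather than a prescribed implementation, and it assumes a Databricks notebook context where spark and dbutils are available, the salt/useremail secret shown above exists, and the salt value contains no single quotes:

# Retrieve the secret salt; the value is redacted in notebook output but usable in code.
salt = dbutils.secrets.get(scope="salt", key="useremail")

# Prepend the salt to the email address and hash the result, so the pseudonym
# depends on both values. The salt itself is never stored in the table.
spark.sql(
    "UPDATE gdpr.customers_lookup "
    f"SET c_email_address_pseudonym = sha2(concat('{salt}', c_email_address), 256)"
)
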
Make your pseudonymization stronger
To reduce the risk that a compromise of a single salt could have on your database, it is advisable where practical
to use different salts (one per customer, or even per user). Provided that the data attached to the pseudonymous
identifier does not itself contain any information that can identify an individual, if you delete your record of
which salt is related to which user and cannot recreate it, the remaining data should be rendered fully
anonymous and therefore fall outside of the scope of the GDPR and the CCPA. Many organizations choose to
create multiple salts per user and create fully anonymized data outside of the scope of data protection law by
rotating these salts periodically according to business need.
And don’t forget that whether data is “personal” or “identifiable” is not an element-level analysis, but essentially
an array-level analysis. So while obvious things like email addresses are clearly personal, combinations of things
that by themselves would not be personal can also be personal. See, for example
https://aboutmyinfo.org/identity/about: based on an analysis of the 1990 US Census, 87% of the United States
population is uniquely identifiable by the three attributes of zip code, date of birth, and gender. So when you’re
deciding what should be stored as part of the personal identifiers table or the working tables with only
pseudonymous information, make sure to think about whether or not the collision of the seemingly non-
identifiable information might itself be identifiable. And make sure for your own privacy compliance that you
have internal processes that prevent attempts to re-identify individuals with the information you intended to be
non-identifiable (for example differential privacy, privacy preserving histograms, etc.). While it may never be
possible to completely prevent re-identification, following these steps will go a long way towards helping.

Improve query performance


Step 2: Perform deletes showed how to improve Delta Lake query performance by increasing the broadcast
threshold and Z-Ordering, and there are additional performance improvement practices that you should also be
aware of:
Ensure that key columns are within the first 32 columns in a table. Delta Lake collects stats on the first 32
columns, and these stats help with identification of files for deletion or update.
Use the Auto Optimize feature, available in Delta Lake on Azure Databricks, which automatically compacts
small files during individual writes to a Delta table and offers significant benefits for tables that are queried
actively, especially in situations when Delta Lake would otherwise be encountering multiple small files. See
Auto Optimize for guidance about when to use it.
Reduce the size of the source table (for the BroadcastHashJoin ). This helps Delta Lake leverage dynamic file
pruning when determining relevant data for deletes. This will help particularly if the delete operations are not
on partition boundaries.
For any modify operation, such as DELETE , be as specific as possible, providing all of your qualifying
conditions in the search clause. This narrows down the number of files you hit and prevents transaction
conflicts (see the sketch after this list).
Continuously tune Spark shuffle, cluster utilization, and storage-system write throughput.
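
The following sketch illustrates the point about specificity. It uses a hypothetical partition column region that is not part of the sample schema above; the idea is that every additional qualifying condition lets Delta Lake prune more files:

# `region` is an assumed partition column, shown for illustration only.
spark.sql("""
  DELETE FROM gdpr.customers
  WHERE region = 'EU'
    AND c_customer_id IN (SELECT c_customer_id FROM gdpr.customer_delete_keys)
""")
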

Learn more
To learn more about Delta Lake on Azure Databricks, see Delta Lake guide.
For blogs about using Delta Lake for GDPR and CCPA compliance written by Databricks experts, see:
How to Avoid Drowning in GDPR Data Subject Requests in a Data Lake
Make Your Data Lake CCPA Compliant with a Unified Approach to Data and Analytics
Efficient Upserts into Data Lakes with Databricks Delta
To learn about purging personal information in the Azure Databricks workspace, see Manage workspace
storage.
IP access lists

Security-conscious enterprises that use cloud SaaS applications need to restrict access to their own employees.
Authentication helps to prove user identity, but that does not enforce network location of the users. Accessing a
cloud service from an unsecured network can pose security risks to an enterprise, especially when the user may
have authorized access to sensitive or personal data. Enterprise network perimeters apply security policies and
limit access to external services (for example, firewalls, proxies, DLP, and logging), so access beyond these
controls is assumed to be untrusted.
For example, suppose a hospital employee accesses an Azure Databricks workspace. If the employee walks from
the office to a coffee shop, the hospital can block connections to the Azure Databricks workspace even if the
employee has the correct credentials to access the web application and the REST API.
You can configure Azure Databricks workspaces so that employees connect to the service only through existing
corporate networks with a secure perimeter. Azure Databricks customers can use the IP access lists feature to
define a set of approved IP addresses. All incoming access to the web application and REST APIs requires that
users connect from an authorized IP address.
Employees who are remote or traveling can use a VPN to connect to the corporate network, which in turn enables
access to the workspace. Using the previous example, the hospital could allow an employee to use a VPN from the
coffee shop to access the Azure Databricks workspace.
Requirements
This feature requires the Premium Plan.

Flexible configuration
The IP access lists feature is flexible:
Your own workspace administrators control the set of IP addresses on the public Internet that are allowed
access. This is known as the allow list. Allow multiple IP addresses explicitly or as entire subnets (for example
216.58.195.78/28).
Workspace administrators can optionally specify IP addresses or subnets to block even if they are included in
the allow list. This is known as the block list. You might use this feature if an allowed IP address range
includes a smaller range of infrastructure IP addresses that in practice are outside the actual secure network
perimeter.
Workspace administrators use REST APIs to update the list of allowed and blocked IP addresses and subnets.

Feature details
The IP Access List API enables Azure Databricks admins to configure IP allow lists and block lists for a
workspace. If the feature is disabled for a workspace, all access is allowed. There is support for allow lists
(inclusion) and block lists (exclusion).
When a connection is attempted:
1. First all block lists are checked. If the connection IP address matches any block list, the connection is
rejected.
2. If the connection was not rejected by block lists , the IP address is compared with the allow lists. If
there is at least one allow list for the workspace, the connection is allowed only if the IP address matches
an allow list. If there are no allow lists for the workspace, all IP addresses are allowed.
For all allow lists and block lists combined, the workspace supports a maximum of 1000 IP/CIDR values, where
one CIDR counts as a single value.
After changes to the IP access list feature, it can take a few minutes for changes to take effect.
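
The evaluation order above can be illustrated with a short Python sketch. This is for illustration only and is not how the service implements the check:

from ipaddress import ip_address, ip_network

def is_access_allowed(addr, allow_lists, block_lists):
    ip = ip_address(addr)
    # 1. Block lists are checked first; any match rejects the connection.
    if any(ip in ip_network(cidr, strict=False) for lst in block_lists for cidr in lst):
        return False
    # 2. If the workspace has no allow lists, all remaining addresses are allowed.
    if not allow_lists:
        return True
    # 3. Otherwise the address must match at least one allow list.
    return any(ip in ip_network(cidr, strict=False) for lst in allow_lists for cidr in lst)

print(is_access_allowed("1.1.1.1", allow_lists=[["1.1.1.1", "2.2.2.2/21"]], block_lists=[]))  # True
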
How to use the IP access list API
This article discusses the most common tasks you can perform with the API. For the complete REST API
reference, download the OpenAPI spec and view it directly or using an application that reads OpenAPI 3.0. For
more details on using the OpenAPI spec, see IP Access List API 2.0.
To learn about authenticating to Azure Databricks APIs, see Authentication using Azure Databricks personal
access tokens.
The base path for the endpoints described in this article is https://<databricks-instance>/api/2.0 , where
<databricks-instance> is the adb-<workspace-id>.<random-number>.azuredatabricks.net domain name of your
Azure Databricks deployment.
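
If you prefer to script these calls rather than use curl, the following minimal Python sketch shows the base path and authentication header used by the examples below, using the feature-status check from the next section as an example. The requests library and the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are assumptions for illustration:

import os
import requests

# Hypothetical environment variables; any secure storage for the token works.
HOST = os.environ["DATABRICKS_HOST"]    # https://adb-<workspace-id>.<random-number>.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token

BASE = f"{HOST}/api/2.0"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Check whether the IP access list feature is enabled for the workspace.
resp = requests.get(f"{BASE}/workspace-conf", headers=HEADERS,
                    params={"keys": "enableIpAccessLists"})
resp.raise_for_status()
print(resp.json())  # for example: {"enableIpAccessLists": "true"}
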

Check if your workspace has the IP access list feature enabled


To check if your workspace has the IP access list feature enabled, call the get feature status API (
GET /workspace-conf ). Pass keys=enableIpAccessLists as arguments to the request.

In the response, the enableIpAccessLists field specifies either true or false .


For example:

curl -X GET -n \
https://<databricks-instance>/api/2.0/workspace-conf?keys=enableIpAccessLists

Example response:

{
"enableIpAccessLists": "true"
}

Enable or disable the IP access list feature for a workspace


To enable or disable the IP access list feature for a workspace, call the enable or disable the IP access list API (
PATCH /workspace-conf ).

In a JSON request body, specify enableIpAccessLists as true (enabled) or false (disabled).


For example, to enable the feature:

curl -X PATCH -n \
https://<databricks-instance>/api/2.0/workspace-conf \
-d '{
"enableIpAccessLists": "true"
}'

Example response:

{
"enableIpAccessLists": "true"
}

Add an IP access list


To add an IP access list, call the add an IP access list API ( POST /ip-access-lists ).
In the JSON request body, specify:
label — Label for this list.
list_type — Either ALLOW (allow list) or BLOCK (a block list, which means exclude even if in allow list).
ip_addresses — A JSON array of IP addresses and CIDR ranges, as String values.

The response is a copy of the object that you passed in, but with some additional fields, most importantly the
list_id field. You may want to save that value so you can update or delete the list later. If you do not save it,
you are still able to get the ID later by querying the full set of IP access lists with a GET request to the
/ip-access-lists endpoint.

For example, to add an allow list:

curl -X POST -n \
https://<databricks-instance>/api/2.0/ip-access-lists \
-d '{
"label": "office",
"list_type": "ALLOW",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21"
]
}'

Example response:

{
"ip_access_list": {
"list_id": "<list-id>",
"label": "office",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21"
],
"address_count": 2,
"list_type": "ALLOW",
"created_at": 1578423494457,
"created_by": 6476783916686816,
"updated_at": 1578423494457,
"updated_by": 6476783916686816,
"enabled": true
}
}

To add a block list, do the same thing but with list_type set to BLOCK .
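
If you did not save the list_id from the response, you can look it up by label. The following is a minimal Python sketch that assumes the list endpoint returns its entries under an ip_access_lists key and reuses the hypothetical environment variables from the earlier sketch:

import os
import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.get(f"{HOST}/api/2.0/ip-access-lists", headers=HEADERS)
resp.raise_for_status()

# Collect the list_id of every access list labeled "office".
office_ids = [entry["list_id"]
              for entry in resp.json().get("ip_access_lists", [])
              if entry["label"] == "office"]
print(office_ids)
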

Update an IP access list


To update an IP access list:
1. Call the list all IP access lists API ( GET /ip-access-lists ), and find the ID of the list you want to update.
2. Call the update an IP access list API ( PATCH /ip-access-lists/<list-id> ).
In the JSON request body, specify at least one of the following values to update:
label — Label for this list.
list_type — Either ALLOW (allow list) or BLOCK (block list, which means exclude even if in allow list).
ip_addresses — A JSON array of IP addresses and CIDR ranges, as String values.
enabled — Specifies whether this list is enabled. Pass true or false .
The response is a copy of the object that you passed in with additional fields for the ID and modification dates.
For example, to update a list to disable it:

curl -X PATCH -n \
https://<databricks-instance>/api/2.0/ip-access-lists/<list-id> \
-d '{ "enabled": "false" }'

Replace an IP access list


To replace an IP access list:
1. Call the list all IP access lists API ( GET /ip-access-lists ), and find the ID of the list you want to replace.
2. Call the replace an IP access list API ( PUT /ip-access-lists/<list-id> ).

In the JSON request body, specify:


label — Label for this list.
list_type — Either ALLOW (allow list) or BLOCK (block list, which means exclude even if in allow list).
ip_addresses — A JSON array of IP addresses and CIDR ranges, as String values.
enabled — Specifies whether this list is enabled. Pass true or false .

The response is a copy of the object that you passed in with additional fields for the ID and modification dates.
For example, to replace the contents of the specified list with the following values:

curl -X PUT -n \
https://<databricks-instance>/api/2.0/ip-access-lists/<list-id> \
-d '{
"label": "office",
"list_type": "ALLOW",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21"
],
"enabled": "false"
}'

Delete an IP access list


To delete an IP access list:
1. Call the list all IP access lists API ( GET /ip-access-lists ), and find the ID of the list you want to delete.
2. Call the delete an IP access list API ( DELETE /ip-access-lists/<list-id> ). There is no request body.

curl -X DELETE -n \
https://<databricks-instance>/api/2.0/ip-access-lists/<list-id>
Configure domain name firewall rules

If your corporate firewall blocks traffic based on domain names, you must allow HTTPS and WebSocket traffic to
Azure Databricks domain names to ensure access to Azure Databricks resources. You can choose between two
options, one more permissive but easier to configure, the other specific to your workspace domains.

Option 1: Allow traffic to *.azuredatabricks.net


Update your firewall rules to allow HTTPS and WebSocket traffic to *.azuredatabricks.net (or
*.databricks.azure.us if your workspace is an Azure Government resource). This is more permissive than
option 2, but it saves you the effort of updating firewall rules for each Azure Databricks workspace in your
account.

Option 2: Allow traffic to your Azure Databricks workspaces only


If you choose to configure firewall rules for each workspace in your account, you must:
1. Identify your workspace domains.
Every Azure Databricks resource has two unique domain names. You can find the first by going to the
Azure Databricks resource in the Azure Portal.

The URL field displays a URL in the format https://adb-<digits>.<digits>.azuredatabricks.net , for


example https://adb-1666506161514800.0.azuredatabricks.net . Remove https:// to get the first domain
name.
The second domain name is exactly the same as the first, except that it has an adb-dp- prefix instead of
adb- . For example, if your first domain name is adb-1666506161514800.0.azuredatabricks.net , the second
domain name is adb-dp-1666506161514800.0.azuredatabricks.net . (A short sketch of this derivation appears after these steps.)
2. Update your firewall rules.
Update your firewall rules to allow HTTPS and WebSocket traffic to the two domains identified in step 1.
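
The derivation in step 1 can be scripted. The following is a minimal sketch (Python 3.9 or later for removeprefix), using the example URL shown above:

url = "https://adb-1666506161514800.0.azuredatabricks.net"  # example from step 1

webapp_domain = url.removeprefix("https://")
dataplane_domain = webapp_domain.replace("adb-", "adb-dp-", 1)

print(webapp_domain)     # adb-1666506161514800.0.azuredatabricks.net
print(dataplane_domain)  # adb-dp-1666506161514800.0.azuredatabricks.net
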
Secure cluster connectivity (No Public IP / NPIP)

With secure cluster connectivity enabled, customer virtual networks have no open ports and Databricks
Runtime cluster nodes have no public IP addresses. Secure cluster connectivity is also known as No Public IP
(NPIP).
At a network level, each cluster initiates a connection to the control plane secure cluster connectivity relay
during cluster creation. The cluster establishes this connection using port 443 (HTTPS) and uses a different IP
address than is used for the Web application and REST API.
When the control plane logically starts new Databricks Runtime jobs or performs other cluster
administration tasks, these requests are sent to the cluster through this tunnel.
The data plane (the VNet) has no open ports, and Databricks Runtime cluster nodes have no public IP
addresses.
Benefits:
Easy network administration, with no need to configure ports on security groups or to configure network
peering.
With enhanced security and simple network administration, information security teams can expedite
approval of Databricks as a PaaS provider.

NOTE
All Azure Databricks network traffic between the data plane VNet and the Azure Databricks control plane goes across the
Microsoft network backbone, not the public Internet. This is true even if secure cluster connectivity is disabled.
Use secure cluster connectivity
To use secure cluster connectivity with a new Azure Databricks workspace, use any of the following options.
Azure Portal: When you provision the workspace, go to the Networking tab and set the option Deploy
Azure Databricks workspace with Secure Cluster Connectivity (No Public IP) to Yes .
ARM Templates: For the Microsoft.Databricks/workspaces resource that creates your new workspace, set the
enableNoPublicIp Boolean parameter to true .

IMPORTANT
In either case, you must register the Azure Resource Provider Microsoft.ManagedIdentity in the Azure subscription
that is used to launch workspaces with secure cluster connectivity. This is a one-time operation per subscription. For
instructions, see Azure resource providers and types.

You cannot add secure cluster connectivity to an existing workspace. For information about migrating your
resources to the new workspaces, contact your Microsoft or Databricks account team for details.
If you’re using ARM templates, add the parameter to one of the following templates, based on whether you
want Azure Databricks to create a default (managed) virtual network for the workspace, or if you want to use
your own virtual network, also known as VNet injection. VNet injection is an optional feature that allows you to
provide your own VNet to host new Azure Databricks clusters.
ARM template to set up a workspace using the default (managed) VNet.
ARM template to set up a workspace using VNet injection.
Egress from workspace subnets
When you enable secure cluster connectivity, both of your workspace subnets are private subnets, since cluster
nodes do not have public IP addresses.
The implementation details of network egress vary based on whether you use the default (managed) VNet or
whether you use the optional VNet injection feature to provide your own VNet in which to deploy your
workspace. See the following sections for details.

IMPORTANT
Additional costs may be incurred due to increased egress traffic when you use secure cluster connectivity. For a smaller
organization that needs a cost-optimized solution, it may be acceptable to disable secure cluster connectivity when you
deploy your workspace. However, for the most secure deployment, Microsoft and Databricks strongly recommend that
you enable secure cluster connectivity.

Egress with default (managed) VNet


If you use secure cluster connectivity with the default VNet that Azure Databricks creates, Azure Databricks
automatically creates a NAT gateway for outbound traffic from your workspace’s subnets to the Azure backbone
and public network. The NAT gateway is created within the managed resource group managed by Azure
Databricks. You cannot modify this resource group or any resources provisioned within it.
The automatically-created NAT gateway incurs additional cost.
Egress with VNet injection
If you use secure cluster connectivity with optional VNet injection to provide your own VNet, ensure that your
workspace has a stable egress public IP and choose one of the following options:
For simple deployments, choose an egress load balancer, also called an outbound load balancer. The load
balancer’s configuration is managed by Azure Databricks. Clusters have a stable public IP, but you cannot
modify the configuration for custom egress needs. This Azure template-only solution has the following
requirements:
Azure Databricks expects additional fields to the ARM template that creates the workspace:
loadBalancerName (load balancer name), loadBalancerBackendPoolName (load balancer backend pool
name), loadBalancerFrontendConfigName (load balancer frontend configuration name) and
loadBalancerPublicIpName (load balancer public IP name).
Azure Databricks expects the Microsoft.Databricks/workspaces resource to have parameters
loadBalancerId (load balancer ID) and loadBalancerBackendPoolName (load balancer backend pool
name).
Azure Databricks does not support changing the configuration of the load balancer.
For deployments that need some customization, choose an Azure NAT gateway. Configure the gateway on
both of the workspace’s subnets to ensure that all outbound traffic to the Azure backbone and public
network transits through it. Clusters have a stable egress public IP, and you can modify the configuration for
custom egress needs. You can implement this solution using either an Azure template or from the Azure
portal.
For deployments with complex routing requirements or deployments that use VNet injection with an egress
firewall such as Azure Firewall or other custom networking architectures, you can use custom routes called
user-defined routes (UDRs). UDRs ensure that network traffic is routed correctly for your workspace, either
directly to the required endpoints or through an egress firewall. If you use such a solution, you must add
direct routes or allowed firewall rules for the Azure Databricks secure cluster connectivity relay and other
required endpoints listed at User-defined route settings for Azure Databricks.
Encrypt traffic between cluster worker nodes

IMPORTANT
The example init script that is referenced in this article derives its shared encryption secret from the hash of the keystore
stored in DBFS. If you rotate the secret by updating the keystore file in DBFS, all running clusters must be restarted.
Otherwise, Spark workers may fail to authenticate with the Spark driver due to an inconsistent shared secret, causing jobs
to slow down. Furthermore, since the shared secret is stored in DBFS, any user with DBFS access can retrieve the secret
using a notebook. For further guidance, contact your representative.

Requirements
This feature requires the Premium plan. Contact your Databricks account representative for more
information.

How the init script works



User queries and transformations are typically sent to your clusters over an encrypted channel. By default,
however, the data exchanged between worker nodes in a cluster is not encrypted. If your environment requires
that data be encrypted at all times, whether at rest or in transit, you can create an init script that configures your
clusters to encrypt traffic between worker nodes, using AES 128-bit encryption over a TLS 1.2 connection.

NOTE
Although AES enables cryptographic routines to take advantage of hardware acceleration, there’s a performance penalty
compared to unencrypted traffic. This penalty can result in queries taking longer on an encrypted cluster, depending on
the amount of data shuffled between nodes.

Enabling encryption of traffic between worker nodes requires setting Spark configuration parameters through
an init script. You can use a cluster-scoped init script for a single cluster or a global init script if you want all
clusters in your workspace to use worker-to-worker encryption.
Copy the keystore file to a directory in DBFS once, and then create the init script that applies the encryption
settings.
The init script must perform the following tasks:
1. Get the JKS keystore file and password.
2. Set the Spark executor configuration.
3. Set the Spark driver configuration.
NOTE
The JKS keystore file used for enabling SSL/HTTPS is dynamically generated for each workspace. The JKS keystore file’s
password is hardcoded and not intended to protect the confidentiality of the keystore.

The following is an example init script that implements these three tasks to generate the cluster encryption
configuration.

Example init script


#!/bin/bash

set -euo pipefail

keystore_dbfs_file="/dbfs/<keystore_directory>/jetty_ssl_driver_keystore.jks"

## Wait till keystore file is available via Fuse

max_attempts=30
while [ ! -f ${keystore_dbfs_file} ];
do
if [ "$max_attempts" == 0 ]; then
echo "ERROR: Unable to find the file : $keystore_dbfs_file .Failing the script."
exit 1
fi
sleep 2s
((max_attempts--))
done
## Derive shared internode encryption secret from the hash of the keystore file
sasl_secret=$(sha256sum $keystore_dbfs_file | cut -d' ' -f1)

if [ -z "${sasl_secret}" ]; then
echo "ERROR: Unable to derive the secret.Failing the script."
exit 1
fi

# The JKS keystore file used for enabling SSL/HTTPS


local_keystore_file="$DB_HOME/keys/jetty_ssl_driver_keystore.jks"
# Password of the JKS keystore file. This jks password is hardcoded and is not
# intended to protect the confidentiality of the keystore. Do not assume the
# keystore file itself is protected.
local_keystore_password="gb1gQqZ9ZIHS"

## Updating spark-branch.conf is only needed for driver

if [[ $DB_IS_DRIVER = "TRUE" ]]; then


driver_conf=${DB_HOME}/driver/conf/spark-branch.conf
echo "Configuring driver conf at $driver_conf"

if [ ! -e $driver_conf ] ; then
touch $driver_conf
fi

cat << EOF >> $driver_conf


[driver] {
// Configure inter-node authentication
"spark.authenticate" = true
"spark.authenticate.secret" = "$sasl_secret"
// Configure AES encryption
"spark.network.crypto.enabled" = true
"spark.network.crypto.saslFallback" = false
// Configure SSL
"spark.ssl.enabled" = true
"spark.ssl.keyPassword" = "$local_keystore_password"
"spark.ssl.keyStore" = "$local_keystore_file"
"spark.ssl.keyStore" = "$local_keystore_file"
"spark.ssl.keyStorePassword" = "$local_keystore_password"
"spark.ssl.protocol" ="TLSv1.3"
"spark.ssl.standalone.enabled" = true
"spark.ssl.ui.enabled" = true
}
EOF
echo "Successfully configured driver conf at $driver_conf"
fi

# Setting configs in spark-defaults.conf for the spark master and worker

spark_defaults_conf="$DB_HOME/spark/conf/spark-defaults.conf"
echo "Configuring spark defaults conf at $spark_defaults_conf"
if [ ! -e $spark_defaults_conf ] ; then
touch $spark_defaults_conf
fi

cat << EOF >> $spark_defaults_conf


spark.authenticate true
spark.authenticate.secret $sasl_secret
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false

spark.ssl.enabled true
spark.ssl.keyPassword $local_keystore_password
spark.ssl.keyStore $local_keystore_file
spark.ssl.keyStorePassword $local_keystore_password
spark.ssl.protocol TLSv1.3
spark.ssl.standalone.enabled true
spark.ssl.ui.enabled true
EOF

echo "Successfully configured spark defaults conf at $spark_defaults_conf"

Once the initialization of the driver and worker nodes is complete, all traffic between these nodes is encrypted
using the keystore file.
The following notebook copies the keystore file and generates the init script in DBFS. You can use the init script
to create new clusters with encryption enabled.
Install an encryption init script notebook
Get notebook

Disable encryption between worker nodes


To disable encryption between worker nodes, remove the init script from the cluster configuration, then restart
the cluster.
Customer-managed keys for encryption

IMPORTANT
This feature is in Public Preview.

NOTE
This feature requires the Premium Plan.

For some types of data, Azure Databricks supports adding a customer-managed key to help protect and control
access to encrypted data. Azure Databricks has two customer-managed key features for different types of data:
Enable customer-managed keys for managed services
Configure customer-managed keys for DBFS root
The following table lists which customer-managed key features are used for which types of data.

TYPE OF DATA | LOCATION | CUSTOMER-MANAGED KEY FEATURE

Notebook source and metadata | Control plane | Managed services

Secrets stored by the secret manager APIs | Control plane | Managed services

Databricks SQL queries and query history | Control plane | Managed services

Customer-accessible DBFS root data | Your workspace’s DBFS root in your workspace root Blob storage in your Azure subscription. This also includes workspace libraries and the FileStore area. | DBFS root

Job results | Workspace root Blob storage instance in your Azure subscription | DBFS root

Databricks SQL results | Workspace root Blob storage instance in your Azure subscription | DBFS root

Interactive notebook results | By default, when you run a notebook interactively (rather than as a job), results are stored in the control plane for performance, with some large results stored in your workspace root Blob storage in your Azure subscription. You can choose to configure Azure Databricks to store all interactive notebook results in your Azure subscription. | For partial results in the control plane, use a customer-managed key for managed services. For results in the root Blob storage, which you can configure for all result storage, use a customer-managed key for DBFS root.

Other workspace system data in the root Blob storage that is inaccessible through DBFS, such as notebook revisions | Workspace root Blob storage in your Azure subscription | DBFS root

For additional security for your workspace’s root Blob storage instance in your Azure subscription, you can
enable double encryption for the DBFS root.
Enable customer-managed keys for managed
services

IMPORTANT
This feature is in Public Preview.

NOTE
This feature requires the Premium Plan.

For additional control of your data, you can add your own key to protect and control access to some types of
data. Azure Databricks has two customer-managed key features for different types of data and locations. To
compare them, see Customer-managed keys for encryption.
Managed services data in the Azure Databricks control plane is encrypted at rest. You can add a customer-
managed key for managed services to help protect and control access to the following types of encrypted data:
Notebook source in the Azure Databricks control plane.
Notebook results for notebooks run interactively (not as jobs) that are stored in the control plane. By default,
larger results are also stored in your workspace root Blob storage. You can configure Azure Databricks to store all
interactive notebook results in your cloud account.
Secrets stored by the secret manager APIs.
Databricks SQL queries and query history.
After you add customer-managed key encryption for a workspace, Azure Databricks uses your key to control
access to the key that encrypts future write operations to your workspace’s managed services data. Existing data
is not re-encrypted. The data encryption key is cached in memory for several read and write operations and
evicted from memory at a regular interval. New requests for that data require another request to your cloud
service’s key management system. If you delete or revoke your key, reading or writing to the protected data fails
at the end of the cache time interval.
You can rotate (update) the customer-managed key at a later time. See Rotate the key.

IMPORTANT
After you run the key rotation command, you must keep your old KMS key available to Azure Databricks for 24 hours.

NOTE
This feature does not encrypt data stored outside of the control plane. To encrypt data in your workspace’s root Blob
storage, see Configure customer-managed keys for DBFS root.

Step 1: Create a key vault or use an existing one


You must create an Azure Key Vault instance and set its permissions. You can do this through the Azure portal,
but the following instructions use the Azure CLI.
1. Create a key vault or select an existing key vault
To create a key vault, replace the items in brackets with your region, key vault name, and resource
group name:

az keyvault create --location <region> \
--name <key-vault-name> \
--resource-group <resource-group-name>

To use an existing key vault, copy the key vault name for the next step.
2. Get the object ID of the AzureDatabricks application:

az ad sp show --id "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d" \
--query "objectId" \
--output tsv

Instead of using the Azure CLI, you can get the object ID from within the Azure portal:
a. In Azure Active Directory, select Enterprise Applications from the sidebar menu.
b. Search for AzureDatabricks and click the Enterprise application in the results.
c. From Properties, copy the object ID.
3. Set the required permissions for your key vault. Replace <key-vault-name> with the vault name that you
used in the previous step and replace <object-id> with the object ID of the AzureDatabricks application.

az keyvault set-policy -n <key-vault-name> \
--key-permissions get wrapKey unwrapKey \
--object-id <object-id>

Step 2: Create a new key or use an existing key


Create a key under the key vault. The KeyType must be RSA , but RSA Key Size and HSM do not matter. The
KeyVault must be in the same Azure tenant as your Azure Databricks workspace. Use whatever tooling you
prefer to use: Azure portal, Azure CLI, or other tooling.
To create the key in CLI, run this command:

az keyvault key create --name <key name> \
--vault-name <key vault name>

Make note of the following values, which you can get from the key ID in the kid property in the response. You
will use them in subsequent steps:
Key vault URL: The beginning part of the key ID that includes the key vault name. It has the form
https://<key-vault-name>.vault.azure.net .
Key name: Name of your key.
Key version: Version of the key.
The full key ID has the form <key-vault-URL>/keys/<key-name>/<key-version> .
If instead you use an existing key, get and copy these values for your key so you can use them in the next steps.
Check to confirm that your existing key is enabled before proceeding.
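
If it is helpful, the key ID can be split into these three values with a short sketch; the kid value below is hypothetical:

# Split a Key Vault key ID (the `kid` property) into the values used in later steps.
kid = "https://my-vault.vault.azure.net/keys/databricks-cmk/0123456789abcdef0123456789abcdef"

vault_url, _, remainder = kid.partition("/keys/")
key_name, _, key_version = remainder.partition("/")

print(vault_url)    # https://my-vault.vault.azure.net
print(key_name)     # databricks-cmk
print(key_version)  # 0123456789abcdef0123456789abcdef
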
Step 3: Create or update a workspace with your key
You can deploy a new workspace with customer-managed key for managed services or add customer-managed
key to an existing workspace. You can do both with ARM templates. Use whatever tooling you prefer to use:
Azure portal, Azure CLI, or other tooling.
The following ARM template creates a new workspace with a customer-managed key, using the preview API
version for resource Microsoft.Databricks/workspaces . Save this text locally to a file named
databricks-cmk-template.json .

NOTE
This example template does not include all possible features such as providing your own VNet. If you already use a
template, merge this template’s parameters, resources, and outputs into your existing template.

{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"workspaceName": {
"type": "string",
"metadata": {
"description": "The name of the Azure Databricks workspace to create."
}
},
"pricingTier": {
"type": "string",
"defaultValue": "premium",
"allowedValues": [
"standard",
"premium"
],
"metadata": {
"description": "The pricing tier of workspace."
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location for all resources."
}
},
"apiVersion": {
"type": "string",
"defaultValue": "2021-04-01-preview",
"allowedValues":[
"2021-04-01-preview"
],
"metadata": {
"description": "The api version to create the workspace resources"
}
},
"keyvaultUri": {
"type": "string",
"metadata": {
"description": "The key vault URI for customer-managed key for managed services"
}
},
"keyName": {
"type": "string",
"metadata": {
"description": "The key name used for customer-managed key for managed services"
}
},
"keyVersion": {
"type": "string",
"metadata": {
"description": "The key version used for customer-managed key for managed services"
}
}
},
"variables": {
"managedResourceGroupName": "[concat('databricks-rg-', parameters('workspaceName'), '-',
uniqueString(parameters('workspaceName'), resourceGroup().id))]"
},
"resources": [
{
"type": "Microsoft.Databricks/workspaces",
"name": "[parameters('workspaceName')]",
"location": "[parameters('location')]",
"apiVersion": "[parameters('apiVersion')]",
"sku": {
"name": "[parameters('pricingTier')]"
},
"properties": {
"ManagedResourceGroupId": "[concat(subscription().id, '/resourceGroups/',
variables('managedResourceGroupName'))]",
"encryption": {
"entities": {
"managedServices": {
"keySource": "Microsoft.Keyvault",
"keyVaultProperties": {
"keyVaultUri": "[parameters('keyvaultUri')]",
"keyName": "[parameters('keyName')]",
"keyVersion": "[parameters('keyVersion')]"
}
}
}
}
}
}
],
"outputs": {
"workspace": {
"type": "object",
"value": "[reference(resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName')))]"
}
}
}

If you use another template already, you can merge this template’s parameters, resources, and outputs into your
existing template.
To use this template to create or update a workspace, you have several options depending on your tooling.
Create workspace with Azure CLI
To create a new workspace with Azure CLI, run the following command:

az deployment group create --resource-group <resource-group-name> \
--template-file <file-name>.json \
--parameters workspaceName=<new-workspace-name> \
keyvaultUri=<keyvaultUrl> \
keyName=<keyName> keyVersion=<keyVersion>

Update workspace with Azure CLI


To update an existing workspace to use a customer-managed key (or to rotate the existing key) using the
Azure CLI:
1. If your ARM template that deployed the workspace never added customer-managed keys, add the
resources.properties.encryption section and its related parameters. See the template earlier in this
article.
a. Add the resources.properties.encryption section from the template.
b. In the parameters section, add three new parameters keyvaultUri , keyName , and keyVersion from
the template.
2. Run the same command as for creating a new workspace. As long as the resource group name and the
workspace name are identical to your existing workspace, this command updates the existing workspace
rather than creating a new workspace.

IMPORTANT
Other than changes in the key-related parameters, use the same parameters that were used for creating the
workspace.

az deployment group create --resource-group <existing-resource-group-name> \
--template-file <file-name>.json \
--parameters workspaceName=<existing-workspace-name> \
keyvaultUri=<keyvaultUrl> \
keyName=<keyName> keyVersion=<keyVersion>

IMPORTANT
After you run the key rotation command, you must keep your old KMS key available to Azure Databricks for 24
hours.

Create or update workspace with Azure portal


To use the template in the Azure portal to create or update a workspace:
1. Go to the Custom deployment page.
2. Click Build your own template in the editor .
3. Paste in the JSON.
4. Click Save .
5. Fill in the parameters.
To update an existing workspace, use the same parameters that you used to create the workspace. To add
a key for the first time, add the three key-related parameters. To rotate the key, change some or all of the
key-related parameters. Ensure the resource group name and the workspace name are identical to your
existing workspace. If they are the same, this command updates the existing workspace rather than
creating a new workspace.

IMPORTANT
Other than changes in the key-related parameters, use the same parameters that were used for creating the
workspace.

6. Click Review + Create .


7. If there are no validation issues, click Create .

IMPORTANT
After you run the key rotation command, you must keep your old KMS key available to Azure Databricks for 24
hours.

For more details, see the Azure article Quickstart: Create and deploy ARM templates by using the Azure portal.

Step 4: Optionally export and re-import existing notebooks


After you initially add a key for managed services for an existing workspace, only future write operations use
your key. Existing data is not re-encrypted.
You can export all notebooks and then re-import them so the key that encrypts the data is protected and
controlled by your key. You can use the Export and Import Workspace APIs.
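
The following is a minimal Python sketch of that export and re-import round trip for a single notebook, using the Workspace API 2.0 export and import endpoints. The host and token environment variables, the notebook path, and the PYTHON language are assumptions for illustration:

import os
import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
path = "/Users/someone@example.com/my-notebook"  # hypothetical notebook path

# Export the notebook source; the content is returned base64-encoded.
exported = requests.get(f"{HOST}/api/2.0/workspace/export", headers=HEADERS,
                        params={"path": path, "format": "SOURCE"})
exported.raise_for_status()

# Re-import it over the original path so that it is written under the new key.
reimported = requests.post(f"{HOST}/api/2.0/workspace/import", headers=HEADERS,
                           json={"path": path,
                                 "format": "SOURCE",
                                 "language": "PYTHON",  # assumed notebook language
                                 "content": exported.json()["content"],
                                 "overwrite": True})
reimported.raise_for_status()
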

Rotate the key


If you are already using a customer-managed key for managed services, you can update the workspace with a
new key version, or an entirely new key. This is called key rotation.
1. Create a new key or rotate your existing key in the Key Vault. See Step 1: Create a key vault or use an
existing one.

IMPORTANT
Ensure the new key has the proper permission.

2. Confirm that your template has the correct API version 2021-04-01-preview .
3. Update the workspace:

IMPORTANT
After you run the key rotation command, you must keep your old KMS key available to Azure Databricks for 24
hours.

To use the Azure portal, apply the template using the Custom deployment tool. See Create or
update workspace with Azure portal. Ensure that you use the same values for the resource group
name and the workspace name so it updates the existing workspace, rather than creating a new
workspace.
To use the Azure CLI, run the following command. Ensure that you use the same values for the
resource group name and the workspace name so it updates the existing workspace, rather than
creating a new workspace.

IMPORTANT
Other than changes in the key-related parameters, use the same parameters that were used for creating
the workspace.
az deployment group create --resource-group <existing-resource-group-name> \
--template-file <file-name>.json \
--parameters workspaceName=<existing-workspace-name> \
keyvaultUri=<keyvaultUrl> \
keyName=<keyName> keyVersion=<keyVersion>

4. Optionally export and re-import existing notebooks to ensure all existing notebooks use your new key.

Troubleshooting and best practices


Accidental deletion of a key
If you delete your key in the Azure Key Vault, the workspace login will start failing and no notebooks will be
readable by Azure Databricks. To avoid this, we recommend that you enable soft deletes. This option ensures
that if a key is deleted, it can be recovered within a 30-day period. If soft delete is enabled, you can simply
recover the key to resolve the issue.
Lost keys are unrecoverable
If you lose your key and cannot recover, all the notebook data encrypted by the key is unrecoverable.
Key update failure due to key vault permissions
If you have trouble creating your workspace, check if your key vault has correct permissions. The error that is
returned from Azure may not correctly indicate this as the root cause. Also, the required permissions are get ,
wrapKey , and unwrapKey . See Step 1: Create a key vault or use an existing one.
Configure customer-managed keys for DBFS root

NOTE
This feature is available only in the Premium Plan.

For additional control of your data, you can add your own key to protect and control access to some types of
data. Azure Databricks has two customer-managed key features that involve different types of data and
locations. For a comparison, see Customer-managed keys for encryption.
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is implemented as a Blob storage instance in your Azure Databricks
workspace’s managed resource group. The default storage location in DBFS is known as the DBFS root. By
default, the storage account is encrypted with Microsoft-managed keys.
After you add customer-managed key encryption for a workspace, Azure Databricks uses your key to encrypt
future write operations to your workspace’s root Blob storage. Existing data is not re-encrypted.

IMPORTANT
This feature affects your DBFS root but is not used for encrypting data on any additional DBFS mounts such as DBFS
mounts of additional Blob or ADLS storage.

You must use Azure Key Vault to store your customer-managed keys. You can either create your own keys and
store them in the key vault, or you can use the Azure Key Vault APIs to generate keys.
There are three ways of enabling customer-managed keys for your DBFS storage:
Configure customer-managed keys for DBFS using the Azure portal
Configure customer-managed keys for DBFS using the Azure CLI
Configure customer-managed keys for DBFS using PowerShell
Configure customer-managed keys for DBFS using
the Azure portal

NOTE
This feature is available only in the Premium Plan.

You can use the Azure portal to configure your own encryption key to encrypt the DBFS root storage account.
You must use Azure Key Vault to store the key.
For more information about customer-managed keys for DBFS, see Configure customer-managed keys for DBFS
root.

Create a key in Azure Key Vault


NOTE
If you already have an existing key vault in the same region and same Azure Active Directory (Azure AD) tenant as your
Azure Databricks workspace, you can skip the first step in this procedure. However, be aware that when you use the Azure
portal to assign a customer-managed key for DBFS root encryption, the system enables the Soft Delete and Do Not
Purge properties by default for your key vault. For more information about these properties, see Azure Key Vault soft-
delete overview.

1. Create a key vault following the instructions in Quickstart: Set and retrieve a key from Azure Key Vault
using the Azure portal.
The Azure Databricks workspace and the key vault must be in the same region and the same Azure Active
Directory (Azure AD) tenant, but they can be in different subscriptions.
2. Create a key in the key vault, continuing to follow the instructions in the Quickstart.
DBFS root storage supports RSA and RSA-HSM keys of sizes 2048, 3072 and 4096. For more information
about keys, see About Key Vault keys.
3. Once your key is created, copy and paste the Key Identifier into a text editor. You will need it when you
configure your key for Azure Databricks.

Encrypt the DBFS root storage account using your key


1. Go to your Azure Databricks service resource in the Azure portal.
2. In the left menu, under Settings, select Encryption.
3. Select Use your own key , enter your key’s Key Identifier , and select the Subscription that contains
the key.

4. Click Save to save your key configuration.

NOTE
Only users with the key vault Contributor role or higher for the key vault can save.

When the encryption is enabled, the system enables Soft-Delete and Purge Protection on the key vault,
creates a managed identity on the DBFS root, and adds an access policy for this identity in the key vault.

Regenerate (rotate) keys


When you regenerate a key, you must return to the Encryption page in your Azure Databricks service resource,
update the Key Identifier field with your new key identifier, and click Save . This applies to new versions of the
same key as well as new keys.

IMPORTANT
If you delete the key that is used for encryption, the data in the DBFS root cannot be accessed. You can use the Azure Key
Vault APIs to recover deleted keys.
Configure customer-managed keys for DBFS using
the Azure CLI

NOTE
This feature is available only in the Premium Plan.

You can use the Azure CLI to configure your own encryption key to encrypt the DBFS root storage account. You
must use Azure Key Vault to store the key.
For more information about customer-managed keys for DBFS, see Configure customer-managed keys for DBFS
root.

Install the Azure Databricks CLI extension


1. Install the Azure CLI.
2. Install the Azure Databricks CLI extension.

az extension add --name databricks

Prepare a new or existing Azure Databricks workspace for encryption


Replace the placeholder values in brackets with your own values. The <workspace-name> is the resource name as
displayed in the Azure portal.

az login
az account set --subscription <subscription-id>

Prepare for encryption during workspace creation:

az databricks workspace create --name <workspace-name> --location <workspace-location> \
--resource-group <resource-group> --sku premium --prepare-encryption

Prepare an existing workspace for encryption:

az databricks workspace update --name <workspace-name> --resource-group <resource-group> --prepare-encryption

Note the principalId field in the storageAccountIdentity section of the command output. You will provide it as
the managed identity value when you configure your key vault.
For more information about Azure CLI commands for Azure Databricks workspaces, see the az databricks
workspace command reference.

Create a new key vault


The key vault that you use to store customer-managed keys for root DBFS must have two key protection
settings enabled, Soft Delete and Purge Protection . To create a new key vault with these settings enabled, run
the following commands.
Replace the placeholder values in brackets with your own values.

az keyvault create \
--name <key-vault> \
--resource-group <resource-group> \
--location <region> \
--enable-soft-delete \
--enable-purge-protection

For more information about enabling Soft Delete and Purge Protection using the Azure CLI, see How to use Key
Vault soft-delete with CLI.

Configure the key vault access policy


Set the access policy for the key vault so that the Azure Databricks workspace has permission to access it, using
the az keyvault set-policy command.
Replace the placeholder values in brackets with your own values.

az keyvault set-policy \
--name <key-vault> \
--resource-group <resource-group> \
--object-id <managed-identity> \
--key-permissions get unwrapKey wrapKey

Replace <managed-identity> with the principalId value that you noted when you prepared your workspace for
encryption.

Create a new key


Create a key in the key vault using the az keyvault key create command.
Replace the placeholder values in brackets with your own values.

az keyvault key create \
--name <key> \
--vault-name <key-vault>

DBFS root storage supports RSA and RSA-HSM keys of sizes 2048, 3072 and 4096. For more information about
keys, see About Key Vault keys.

Configure DBFS encryption with customer-managed keys


Configure your Azure Databricks workspace to use the key you created in your Azure Key Vault.
Replace the placeholder values in brackets with your own values.

key_vault_uri=$(az keyvault show \
--name <key-vault> \
--resource-group <resource-group> \
--query properties.vaultUri \
--output tsv)
key_version=$(az keyvault key list-versions \
--name <key> \
--vault-name <key-vault> \
--query [-1].kid \
--output tsv | cut -d '/' -f 6)

az databricks workspace update --name <workspace-name> --resource-group <resource-group> \
--key-source Microsoft.KeyVault --key-name <key> --key-vault $key_vault_uri --key-version $key_version

Disable customer-managed keys


When you disable customer-managed keys, your storage account is once again encrypted with Microsoft-
managed keys.
Replace the placeholder values in brackets with your own values and use the variables defined in the previous
steps.

az databricks workspace update --name <workspace-name> --resource-group <resource-group> --key-source Default
Configure customer-managed keys for DBFS using
PowerShell

NOTE
This feature is available only in the Premium Plan.

You can use PowerShell to configure your own encryption key to encrypt the DBFS root storage account. You
must use Azure Key Vault to store the key.
For more information about customer-managed keys for DBFS, see Configure customer-managed keys for DBFS
root.

Install the Azure Databricks PowerShell module


1. Install Azure PowerShell.
2. Install the Azure Databricks PowerShell module.

Prepare a new or existing Azure Databricks workspace for encryption


Replace the placeholder values in brackets with your own values. The <workspace-name> is the resource name as
displayed in the Azure portal.
Prepare encryption when you create a workspace:

$workSpace = New-AzDatabricksWorkspace -Name <workspace-name> -Location <workspace-location> -ResourceGroupName <resource-group> -Sku premium -PrepareEncryption

Prepare an existing workspace for encryption:

$workSpace = Update-AzDatabricksWorkspace -Name <workspace-name> -ResourceGroupName <resource-group> -PrepareEncryption

For more information about PowerShell cmdlets for Azure Databricks workspaces, see the Az.Databricks
reference.

Create a new key vault


The key vault that you use to store customer-managed keys for default (root) DBFS must have two key
protection settings enabled, Soft Delete and Purge Protection .
In version 2.0.0 and later of the Az.KeyVault module, soft delete is enabled by default when you create a new
key vault.
The following example creates a new key vault with the Soft Delete and Purge Protection properties enabled.
Replace the placeholder values in brackets with your own values.
$keyVault = New-AzKeyVault -Name <key-vault> `
-ResourceGroupName <resource-group> `
-Location <location> `
-EnablePurgeProtection

To learn how to enable Soft Delete and Purge Protection on an existing key vault with PowerShell, see “Enabling
soft-delete” and “Enabling Purge Protection” in How to use Key Vault soft-delete with PowerShell.

Configure the key vault access policy

Set the access policy for the key vault so that the Azure Databricks workspace has permission to access it, using
Set-AzKeyVaultAccessPolicy.

Set-AzKeyVaultAccessPolicy `
-VaultName $keyVault.VaultName `
-ObjectId $workSpace.StorageAccountIdentity.PrincipalId `
-PermissionsToKeys wrapkey,unwrapkey,get

Create a new key


Create a new key in the key vault using the Add-AzKeyVaultKey cmdlet. Replace the placeholder values in
brackets with your own values.

$key = Add-AzKeyVaultKey -VaultName $keyVault.VaultName -Name <key> -Destination 'Software'

DBFS root storage supports RSA and RSA-HSM keys of sizes 2048, 3072 and 4096. For more information about
keys, see About Key Vault keys.

Configure DBFS encryption with customer-managed keys


Configure your Azure Databricks workspace to use the key you created in your Azure Key Vault. Replace the
placeholder values in brackets with your own values.

Update-AzDatabricksWorkspace -ResourceGroupName <resource-group> `
-Name <workspace-name> `
-EncryptionKeySource Microsoft.Keyvault `
-EncryptionKeyName $key.Name `
-EncryptionKeyVersion $key.Version `
-EncryptionKeyVaultUri $keyVault.VaultUri

Disable customer-managed keys


When you disable customer-managed keys, your storage account is once again encrypted with Microsoft-
managed keys.
Replace the placeholder values in brackets with your own values and use the variables defined in the previous
steps.

Update-AzDatabricksWorkspace -Name <workspace-name> -ResourceGroupName <resource-group> -EncryptionKeySource Default
Configure double encryption for DBFS root
7/21/2022 • 2 minutes to read

NOTE
This feature is available only in the Premium Plan.

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is implemented as a storage account in your Azure Databricks
workspace’s managed resource group. The default storage location in DBFS is known as the DBFS root.
Azure Storage automatically encrypts all data in a storage account—including DBFS root storage—at the service
level using 256-bit AES encryption. This is one of the strongest block ciphers available and is FIPS 140-2
compliant. If you require higher levels of assurance that your data is secure, you can also enable 256-bit AES
encryption at the Azure Storage infrastructure level. When infrastructure encryption is enabled, data in a storage
account is encrypted twice, once at the service level and once at the infrastructure level, with two different
encryption algorithms and two different keys. Double encryption of Azure Storage data protects against a
scenario where one of the encryption algorithms or keys is compromised. In this scenario, the additional layer of
encryption continues to protect your data.
This article describes how to create a workspace that adds infrastructure encryption (and therefore double
encryption) for a workspace’s root storage. You must enable infrastructure encryption at workspace creation;
you cannot add infrastructure encryption to an existing workspace.

Requirements
Premium Plan

Create a workspace with double encryption using the Azure portal


Follow the instructions for creating a workspace using the Azure portal in Quickstart: Run a Spark job on Azure
Databricks Workspace using the Azure portal, adding these steps:
1. In PowerShell, run the following commands, which will allow you to enable infrastructure encryption in
the Azure portal.

Register-AzProviderFeature -ProviderNamespace Microsoft.Storage -FeatureName AllowRequireInfraStructureEncryption

Get-AzProviderFeature -ProviderNamespace Microsoft.Storage -FeatureName AllowRequireInfraStructureEncryption

2. On the Create an Azure Databricks workspace page (Create a resource > Analytics > Azure
Databricks ), click the Advanced tab.
3. Next to Enable Infrastructure Encryption , select Yes .
4. When you have finished your workspace configuration and created the workspace, verify that
infrastructure encryption is enabled.
In the resource page for the Azure Databricks workspace, go to the sidebar menu and select Settings >
Encryption . Confirm that Enable Infrastructure Encryption is selected.

Create a workspace with double encryption using PowerShell


Follow the instructions in Quickstart: Create an Azure Databricks workspace using PowerShell, adding the option
-RequireInfrastructureEncryption to the command you run in the Create an Azure Databricks workspace step:

For example,

New-AzDatabricksWorkspace -Name databricks-test -ResourceGroupName testgroup -Location eastus -ManagedResourceGroupName databricks-group -Sku premium -RequireInfrastructureEncryption

After your workspace is created, verify that infrastructure encryption is enabled by running:

Get-AzDatabricksWorkspace -Name <workspace-name> -ResourceGroupName <resource-group> | fl

RequireInfrastructureEncryption should be set to true .


For more information about PowerShell cmdlets for Azure Databricks workspaces, see the Az.Databricks module
reference.

Create a workspace with double encryption using the Azure CLI


When you create a workspace using the Azure CLI, include the option --require-infrastructure-encryption .
For example,

az databricks workspace create --name <workspace-name> --location <workspace-location> --resource-group <resource-group> --sku premium --require-infrastructure-encryption

After your workspace is created, verify that infrastructure encryption is enabled by running:

az databricks workspace show --name <workspace-name> --resource-group <resource-group>

The requireInfrastructureEncryption field should be present in the encryption property and set to true .
For more information about Azure CLI commands for Azure Databricks workspaces, see the az databricks
workspace command reference.
Data governance guide
7/21/2022 • 2 minutes to read

This guide shows how to manage access to your data in Azure Databricks.
The Databricks Security and Trust Center provides information about the ways in which security is built
into every layer of the Databricks Lakehouse Platform. The Security and Trust Center provides
information that enables you to meet your regulatory needs while taking advantage of the Databricks
Lakehouse Platform. Find the following types of information in the Security and Trust Center:
An overview and list of the security and governance features built into the platform.
Information about the compliance standards the platform meets on each cloud provider.
A due-diligence package to help you evaluate how Azure Databricks helps you meet your compliance
and regulatory needs.
An overview of Databricks’ privacy guidelines and how they are enforced.
The information in this article supplements the Security and Trust Center.
Unity Catalog (Preview) is a secure metastore developed by Databricks. Unity Catalog centralizes
metadata and governance of an organization’s data. With Unity Catalog, data governance rules scale with
your needs, regardless of the number of workspaces or the business intelligence tools your organization
uses. See Get started using Unity Catalog.
Table access control lets you apply data governance controls for your data.
Credential passthrough allows you to authenticate automatically to Azure Data Lake Storage from Azure
Databricks clusters using the identity that you use to log in to Azure Databricks.
Audit logs allow your enterprise to monitor details about usage patterns across your Databricks account
and workspaces.
For information about securing your account, workspaces, and compute resources, see Security guide.
In this guide:
Data governance overview
Unity Catalog (Preview)
Table access control
Credential passthrough
Data governance overview
7/21/2022 • 14 minutes to read

This article describes the need for data governance and shares best practices and strategies you can use to
implement these techniques across your organization. It demonstrates a typical deployment workflow you can
employ using Azure Databricks and cloud-native solutions to secure and monitor each layer from the
application down to storage.

Why is data governance important?


Data governance is an umbrella term that encapsulates the policies and practices implemented to securely
manage the data assets within an organization. As one of the key tenets of any successful data governance
practice, data security is likely to be top of mind at any large organization. Key to data security is the ability for
data teams to have superior visibility and auditability of user data access patterns across their organization.
Implementing an effective data governance solution helps companies protect their data from unauthorized
access and ensures that they have rules in place to comply with regulatory requirements.

Governance challenges
Whether you’re managing the data of a startup or a large corporation, security teams and platform owners have
the singular challenge of ensuring that this data is secure and is being managed according to the internal
controls of the organization. Regulatory bodies the world over are changing the way we think about how data is
both captured and stored. These compliance risks only add further complexity to an already tough problem.
How then, do you open your data to those who can drive the use cases of the future? Ultimately, you should be
adopting data policies and practices that help the business to realize value through the meaningful application
of what can often be vast stores of data, stores that are growing all the time. We get solutions to the world’s
toughest problems when data teams have access to many and disparate sources of data.
Typical challenges when considering the security and availability of your data in the cloud:
Do your current data and analytics tools support access controls on your data in the cloud? Do they provide
robust logging of actions taken on the data as it moves through the given tool?
Will the security and monitoring solution you put in place now scale as demand on the data in your data lake
grows? It can be easy enough to provision and monitor data access for a small number of users. What
happens when you want to open up your data lake to hundreds of users? To thousands?
Is there anything you can do to be proactive in ensuring that your data access policies are being observed? It
is not enough to simply monitor; that is just more data. If data availability is merely a challenge of data
security, you should have a solution in place to actively monitor and track access to this information across
the organization.
What steps can you take to identify gaps in your existing data governance solution?

How Azure Databricks addresses these challenges


Azure Databricks provides a number of features to help you meet your data governance needs.
Manage access to data and objects :
The Databricks Security and Trust Center provides information about the ways in which security is built
into every layer of the Databricks Lakehouse Platform. The Security and Trust Center provides
information that enables you to meet your regulatory needs while taking advantage of the Databricks
Lakehouse Platform. Find the following types of information in the Security and Trust Center:
An overview and list of the security and governance features built into the platform.
Information about the compliance standards the platform meets on each cloud provider.
A due-diligence package to help you evaluate how Azure Databricks helps you meet your compliance
and regulatory needs.
An overview of Databricks’ privacy guidelines and how they are enforced.
The information in this article supplements the Security and Trust Center.
Unity Catalog (Preview) is a secure metastore developed by Databricks. Unity Catalog centralizes storage,
metadata, and governance of an organization’s data. With Unity Catalog, data governance rules scale with
your needs, regardless of the number of workspaces or the business intelligence tools your organization
uses. When you store data in Unity Catalog, you selectively grant workspaces access to each metastore,
and you manage access to the data in one place, using account-level identities. See Get started using
Unity Catalog.
Delta Sharing (Preview) is an open protocol developed by Databricks for secure data sharing with other
organizations regardless of which computing platforms they use.
Table access control lets you apply data governance controls for data and objects that are local to a
workspace, using workspace-local users and groups. Admins can enable ACLs using the Admin Console
or Permissions API.
Azure Data Lake Storage credential passthrough lets you authenticate automatically to Accessing Azure
Data Lake Storage Gen1 from Azure Databricks and ADLS Gen2 from Azure Databricks clusters using the
same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you
enable a cluster for credential passthrough, commands that you run on that cluster can read and write data in Azure Data Lake
Storage without requiring you to configure service principal credentials for access to storage.
Credential passthrough applies only to Azure Data Lake Storage Gen1 or Gen2 storage accounts. Azure
Data Lake Storage Gen2 storage accounts must use the hierarchical namespace to work with credential passthrough.
Cluster policies enable administrators to control access to compute resources.
Manage cluster configurations :
Cluster policies enable administrators to control access to compute resources.
Audit data access :
Diagnostic logs provide visibility into actions and operations across your account and workspaces.
The following sections illustrate how to use these Azure Databricks features to implement a governance
solution.

Manage access to data and objects


This section provides a staged approach to managing access to data and objects. First, you secure access to
Azure Data Lake Storage and enable Azure Data Lake Storage credential passthrough. Next, you enable access
control and implement fine-grained control of individual tables and objects. You can also configure service
principals for long-running or frequent workloads and for connecting to business-intelligence (BI) tools.
Manage data across multiple workspaces using Unity Catalog (Preview)
Unity Catalog (Preview) is a secure metastore developed by Databricks. Unity Catalog centralizes storage,
metadata, and governance of an organization’s data. With Unity Catalog, data governance rules scale with your
needs, regardless of the number of workspaces or the business intelligence tools your organization uses. When
you store data in Unity Catalog, you selectively grant workspaces access to each metastore, and you manage
access to the data in one place, using account-level identities. To learn more, see Unity Catalog (Preview). To
learn how to create metastores, load data into them, migrate existing workspace-local data, and manage access
to data and objects in metastores, see Get started using Unity Catalog. To learn more about auditing Unity
Catalog events, see Audit access and activity for Unity Catalog resources.
Share and consume data with Delta Sharing (Preview)
Delta Sharing (Preview) is an open protocol developed by Databricks for secure data sharing with other
organizations regardless of which computing platforms they use. You can share any data in a Unity Catalog
metastore using Delta Sharing. To learn more about sharing data with Delta Sharing, see Share data using Delta
Sharing (Preview). To learn how to consume data shared with you by a Delta Sharing data provider, see Access
data shared with you using Delta Sharing.
Implement table access control
You can enable table access control in a workspace to programmatically grant, deny, and revoke access to your
data from the Spark SQL API. You can control access to securable objects like databases, tables, views and
functions. Consider a scenario where your company has a database to store financial data. You might want your
analysts to create financial reports using that data. However, there might be sensitive information in another
table in the database that analysts should not access. You can provide the user or group the privileges required
to read data from one table, but deny all privileges to access the second table.
In the following illustration, Alice is an admin who owns the shared_data and private_data tables in the
Finance database. Alice then provides Oscar, an analyst, with the privileges required to read from shared_data
but denies all privileges to private_data .

Alice grants SELECT privileges to Oscar to read from shared_data :
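For example, the GRANT statement might look like the following sketch, where `oscar@example.com` is a placeholder for Oscar's user name, which is not given in this scenario:

GRANT SELECT ON finance.shared_data TO `oscar@example.com`;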

Alice denies all privileges to Oscar to access private_data :
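Again as a sketch, using the same placeholder principal, the DENY statement might look like:

DENY ALL PRIVILEGES ON finance.private_data TO `oscar@example.com`;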

You can take this one step further by defining fine-grained access controls to a subset of a table or by setting
privileges on derived views of a table.
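As a sketch of the derived-view approach (the view name and column names here are invented for illustration), Alice could expose only a summary of the sensitive table and grant access to the view instead of the table:

CREATE VIEW finance.private_data_summary AS
SELECT region, SUM(amount) AS total_amount
FROM finance.private_data
GROUP BY region;

GRANT SELECT ON finance.private_data_summary TO `oscar@example.com`;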

Service principals
How do you grant access to users or service accounts for more long-running or frequent workloads? What if
you want to utilize a business intelligence tool, such as Power BI or Tableau, that needs access to the tables in
Azure Databricks via ODBC/JDBC? In these cases, you should use service principals and OAuth. Service
principals are identity accounts scoped to very specific Azure resources. When building a job in a notebook, you
can add the following lines to the job cluster’s Spark configuration or run directly in the notebook. This allows
you to access the corresponding file store within the scope of the job.

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "
<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")

Similarly, you can access said data by reading directly from an Azure Data Lake Storage Gen1 or Gen2 URI by
mounting your file store(s) with a service principal and an OAuth token. Once you’ve set the configuration
above, you can now access files directly in your Azure Data Lake Storage using the URI:

"abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>"

All users on a cluster with a file system registered in this way will have access to the data in the file system.
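For example, with the OAuth configuration above set on the cluster, and assuming the directory contains a Delta table, you could query the path directly from SQL:

SELECT * FROM delta.`abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>`;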

Azure Data Lake Storage management


You can access data in Azure Data Lake Storage from an Azure Databricks cluster in a couple of ways. The
methods discussed here mainly correspond to how the data being accessed will be used in the corresponding
workflow. That is, will you be accessing your data in a more interactive, ad-hoc way, perhaps developing an ML
model or building an operational dashboard? In that case, we recommend that you use Azure Active Directory
(Azure AD) credential passthrough. Will you be running automated, scheduled workloads that require one-off
access to the containers in your data lake? Then using service principals to access Azure Data Lake Storage is
preferred.
Azure Data Lake Storage credential passthrough
Credential passthrough provides user-scoped data access controls to any provisioned file stores based on the
user’s role-based access controls. When you configure a cluster, select and expand Advanced Options to enable
credential passthrough. Any users who attempt to access data on the cluster will be governed by the access
controls put in place on their corresponding file system resources, according to their Active Directory account.

This solution is suitable for many interactive use cases and offers a streamlined approach, requiring that you
manage permissions in just one place. In this way, you can allocate one cluster to multiple users without having
to worry about provisioning specific access controls for each of your users. Process isolation on Azure
Databricks clusters ensures that user credentials will not be leaked or otherwise shared. This approach also has
the added benefit of logging user-level entries in your Azure storage audit logs, which can help platform admins
to associate storage layer actions with specific users.
Some limitations to this method are:
Supports only Azure Data Lake Storage file systems.
Does not support access through the Databricks REST API.
Table access control: Azure Databricks does not recommend using credential passthrough with table access
control. For more details on the limitations of combining these two features, see Limitations. For more
information about using table access control, see Implement table access control.
Not suitable for long-running jobs or queries, because of the limited time-to-live on a user’s access token.
For these types of workloads, we recommend that you use service principals to access your data.
Securely mount Azure Data Lake Storage using credential passthrough
You can mount an Azure Data Lake Storage account or folder inside it to the Databricks File System (DBFS),
providing an easy and secure way to access data in your data lake. The mount is a pointer to a data lake store, so
the data is never synced locally. When you mount data using a cluster enabled with Azure Data Lake Storage
credential passthrough, any read or write to the mount point uses your Azure AD credentials. This mount point
will be visible to other users, but the only users that will have read and write access are those who:
Have access to the underlying Azure Data Lake Storage storage account
Are using a cluster enabled for Azure Data Lake Storage credential passthrough
To mount Azure Data Lake Storage using credential passthrough, follow the instructions in Mount Azure Data
Lake Storage to DBFS using credential passthrough.

Manage cluster configurations


You can use cluster policies to provision clusters automatically, manage their permissions, and control costs.
Cluster policies allow Azure Databricks administrators to define cluster attributes that are allowed on a cluster,
such as instance types, number of nodes, custom tags, and many more. When an admin creates a policy and
assigns it to a user or a group, those users can only create clusters based on the policy they have access to. This
gives administrators a much higher degree of control on what types of clusters can be created.
You define policies in a JSON policy definition and then create cluster policies using the cluster policies UI or
Cluster Policies API. A user can create a cluster only if they have the create_cluster permission or access to at
least one cluster policy. Extending the requirements for the new analytics project team described below,
administrators can now create a cluster policy and assign it to one or more users within the project team who
can now create clusters for the team limited to the rules specified in the cluster policy. The image below provides
an example of a user that has access to the Project Team Cluster Policy creating a cluster based on the policy
definition.
Automatically provision clusters and grant permissions
With the addition of endpoints for both clusters and permissions, the Databricks REST API 2.0 makes it easy to
both provision and grant permission to cluster resources for users and groups at any scale. You can use the
Clusters API 2.0 to create and configure clusters for your specific use case.
You can then use the Permissions API 2.0 to apply access controls to the cluster.
The following is an example of a configuration that might suit a new analytics project team.
The requirements are:
Support the interactive workloads of this team, who are mostly SQL and Python users.
Provision a data source in object storage with credentials that give the team access to the data tied to the
role.
Ensure that users get an equal share of the cluster’s resources.
Provision larger, memory optimized instance types.
Grant permissions to the cluster such that only this new project team has access to it.
Tag this cluster to make sure you can properly do chargebacks on any compute costs incurred.
Deployment script
You deploy this configuration by using the API endpoints in the Clusters and Permissions APIs.
Provision cluster
Endpoint - https://<databricks-instance>/api/2.0/clusters/create

{
"autoscale": {
"min_workers": 2,
"max_workers": 20
},
"cluster_name": "project team interactive cluster",
"spark_version": "latest-stable-scala2.11",
"spark_conf": {
"spark.Azure Databricks.cluster.profile": "serverless",
"spark.Azure Databricks.repl.allowedLanguages": "python,sql",
"spark.Azure Databricks.passthrough.enabled": "true",
"spark.Azure Databricks.pyspark.enableProcessIsolation": "true"
},
"node_type_id": "Standard_D14_v2",
"ssh_public_keys": [],
"custom_tags": {
"ResourceClass": "Serverless",
"team": "new-project-team"
},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"autotermination_minutes": 60,
"enable_elastic_disk": true,
"init_scripts": []
}

Grant cluster permission


Endpoint - https://<databricks-instance>/api/2.0/permissions/clusters/<cluster_id>

{
"access_control_list": [
{
"group_name": "project team",
"permission_level": "CAN_MANAGE"
}
]
}

Instantly you have a cluster that has been provisioned with secure access to critical data in the lake, locked down
to all but the corresponding team, tagged for chargebacks, and configured to meet the requirements of the
project. There are additional configuration steps within your host cloud provider account required to implement
this solution, though these, too, can be automated to meet the requirements of scale.

Audit access
Configuring access control in Azure Databricks and controlling data access in storage is the first step towards an
efficient data governance solution. However, a complete solution requires auditing access to data and providing
alerting and monitoring capabilities. Azure Databricks provides a comprehensive set of audit events to log activities
performed by Azure Databricks users, allowing enterprises to monitor detailed usage patterns on the platform. To
get a complete understanding of what users are doing on the platform and what data is being accessed, you should
use both native Azure Databricks and cloud provider audit logging capabilities.
Make sure you have diagnostic logging enabled in Azure Databricks. Once logging is enabled for your account,
Azure Databricks automatically starts sending diagnostic logs to the delivery location you specified. You also
have the option to Send to Log Analytics , which will forward diagnostic data to Azure Monitor, where you can
query the logs; for example, you can list all users who have logged into the Azure Databricks workspace and
their locations.

In a few steps, you can use Azure monitoring services or create real-time alerts. The Azure Activity Log provides
visibility into the actions taken on your storage accounts and the containers therein. Alert rules can be
configured here as well.

Learn more
Here are some resources to help you build a comprehensive data governance solution that meets your
organization’s needs:
Data governance on Databricks
Access control on Databricks
Data objects in the Databricks Lakehouse
Keep data secure with secrets
Unity Catalog (Preview)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. During the preview, some functionality is limited. See Unity Catalog public preview
limitations. To participate in the preview, contact your Azure Databricks representative.

Unity Catalog is a fine-grained governance solution for data and AI on the Lakehouse.
Unity Catalog helps simplify security and governance of your data with the following key features:
Define once, secure ever ywhere : Unity Catalog offers a single place to administer data access policies
that apply across all workspaces and personas.
Standards-compliant security model : Unity Catalog’s security model is based on standard ANSI SQL, and
allows administrators to grant permissions at the level of catalogs, databases (also called schemas), tables,
and views in their existing data lake using familiar syntax.
Built-in auditing : Unity Catalog automatically captures user-level audit logs that record access to your data.
Unity Catalog requires an Azure Databricks account on the Premium plan.
In this guide:
Get started using Unity Catalog
Key concepts
Data permissions
Create compute resources
Use Azure managed identities in Unity Catalog to access storage
Create a metastore
Create and manage catalogs
Create and manage schemas (databases)
Manage identities in Unity Catalog
Create tables
Create views
Manage access to data
Manage external locations and storage credentials
Query data
Train a machine-learning model with Python from data in Unity Catalog
Connect to BI tools
Audit access and activity for Unity Catalog resources
Upgrade tables and views to Unity Catalog
Automate Unity Catalog setup using Terraform
Unity Catalog public preview limitations
Get started using Unity Catalog
7/21/2022 • 12 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This guide helps you get started with Unity Catalog, the Azure Databricks data governance framework.

Requirements
You must be an Azure Databricks account admin.
The first Azure Databricks account admin must be an Azure Active Directory Global Administrator or a
member of the root management group, which is usually named Tenant root group . That user can
assign users with any level of Azure tenant permission as subsequent Azure Databricks account admins
(who can themselves assign more account admins).
Your Azure Databricks account must be on the Premium plan.
In your Azure tenant, you must have permission to create:
A storage account to use with Azure Data Lake Storage Gen2. See Create a storage account to use with
Azure Data Lake Storage Gen2.
A new resource to hold a system-assigned managed identity. This requires that you be a Contributor
or Owner of a resource group in any subscription in the tenant.

Configure and grant access to Azure storage for your metastore


In this step, you create a storage account and container for the metadata and tables that will be managed by the
Unity Catalog metastore, create an Azure connector that generates a system-assigned managed identity, and
give that managed identity access to the storage container.
1. Create a storage account for Azure Data Lake Storage Gen2.
This storage account will contain metadata related to Unity Catalog metastores and their objects, as well
as the data for managed tables in Unity Catalog. See Create a storage account to use with Azure Data
Lake Storage Gen2. Make a note of the region where you created the storage account.
2. Create a storage container that will hold your Unity Catalog metastore’s metadata and managed tables.
You can create no more than one metastore per region. It is recommended that you use the same region
for your metastore and storage container.
Make a note of the ADLSv2 URI for the container, which is in the following format:

abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<metastore-name>

In the steps that follow, replace <storage-container> with this URI.


3. In Azure, create an Azure Databricks access connector that holds a managed identity and give it access to
the storage container.
See Use Azure managed identities in Unity Catalog to access storage.

Create your first metastore and attach a workspace


A metastore is the top-level container for data in Unity Catalog. Each metastore exposes a 3-level namespace (
catalog . schema . table ) by which data can be organized.

A single metastore can be shared across multiple Azure Databricks workspaces in an account. Each linked
workspace has the same view of the data in the metastore, and data access control can be managed across
workspaces. Databricks allows one metastore per region. If you have a multi-region Databricks deployment, you
may want separate metastores for each region, but it is good practice to use a small number of metastores
unless your organization requires hard isolation boundaries between sets of data. Data cannot easily be joined
or queried across metastores.
To create a metastore:
1. Make sure that you have the path to the storage container and the resource ID of the Azure Databricks
access connector that you created in the previous task.
2. Log in to the Azure Databricks account console.

3. Click Data .
4. Click Create Metastore .
5. Enter values for the following fields
Name for the metastore.
Region where the metastore will be deployed.
For best performance, co-locate the access connector, workspaces, metastore and cloud storage
location in the same cloud region.
ADLS Gen 2 path : Enter the path to the storage container that you will use as root storage for the
metastore.
The abfss:// prefix is added automatically.
Access Connector ID : Enter the Azure Databricks access connector’s resource ID in the format:

/subscriptions/12f34567-8ace-9c10-111c-
aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/
<connector-name>

6. Click Create .
If the request fails, retry using a different metastore name.
7. When prompted, select workspaces to link to the metastore.
The account-level user who creates a metastore is its owner and metastore admin. Any account admin can
manage permissions for a metastore and its objects. To transfer ownership of a metastore to a different account-
level user or a group, see (Recommended) Transfer ownership of your metastore to a group.

Add users and groups


A Unity Catalog metastore can be shared across multiple Databricks workspaces. So that Databricks has a
consistent view of users and groups across all workspaces, you can now add Azure Active Directory users and
groups as account-level identities. Follow these steps to add account-level identities.

NOTE
Users and groups must be added as account-level identities before they can access Unity Catalog.

1. The initial account-level admin must be a Contributor in the Azure Active Directory root management
group, which is named Tenant root group by default. An Azure Active Directory Global Administrator
can add themselves to this group. Grant yourself this role, or ask an Azure Active Directory Global
Administrator to grant it to you.
The initial account-level admin can add users or groups to the account console, and can designate other
account-level admins by granting the Admin role to users.
2. All Azure Active Directory users who have been added to workspaces in your Azure tenant are
automatically added as account-level identities.
3. To designate additional account-level admins, you grant users the Admin role.

NOTE
It is not possible to grant the Admin role to a group.

a. Log in to the account console by clicking Settings , then clicking Manage account .
b. Click Users and Groups . A list of Azure Active Directory users appears. Only users and groups
who have been added to workspaces are shown.
c. Click the name of a user.
d. Click Roles .
e. Enable Admin .
To get started, create a group called data-consumers . This group is used later in this walk-through.

Create a compute resource


Tables defined in Unity Catalog are protected by fine-grained access controls. To ensure that access controls are
enforced, Unity Catalog requires clusters to conform to a secure configuration. Unity Catalog is secure by
default, meaning that non-conforming clusters cannot access tables in Unity Catalog.
To create a compute resource that can access data in Unity Catalog:
Create a cluster
To create a Data Science & Engineering cluster that can access Unity Catalog:
1. Log in to the workspace as a workspace-level admin.

2. Click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. Set Databricks runtime version to Runtime: 10.3 (Scala 2.12, Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features such as library installation, init scripts, and the DBFS Fuse mount are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features or languages or to run workloads using Python, Scala and R, set
the cluster mode to Single User. A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced. A user selecting from a view executes with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .
Create a SQL warehouse
To create a SQL warehouse that can access Unity Catalog data:
1. Log in to the workspace as a workspace-level admin.
2. From the persona switcher, select SQL .
3. Click Create , then select SQL Warehouse .
4. Under Advanced Settings set Channel to Preview .
SQL warehouses are automatically created with the correct security mode, with no configuration required.

Create your first table


In Unity Catalog, metastores contain catalogs that contain schemas (databases), and you always create a table in
a schema. You can refer to a table using three-level notation:

<catalog>.<schema>.<table>

A newly-created metastore contains a catalog named main with an empty schema named default . In this
example, you will create a table named department in the default schema in the main catalog.
To create a table, you must be an account admin, metastore admin, or a user with the CREATE permission on the
parent schema and the USAGE permission on the parent catalog and schema.
Follow these steps to create a table manually. You can also import an example notebook and run it to create a
catalog, schema, and table, along with managing permissions on each.
1. Create a notebook and attach it to the cluster you created in Create a compute resource.
For the notebook language, select SQL , Python , R , or Scala , depending on the language you want to
use.
2. Grant permission to create tables on the default schema.
To create tables, users require the CREATE and USAGE permissions on the schema in addition to the
USAGE permission on the catalog. All users receive the USAGE privilege on the main catalog and the
main.default schema when a metastore is created.

Account admins, metastore admins, and the owner of the schema main.default can use the following
command to GRANT the CREATE privilege to a user or group:
SQL

GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`;

Python
spark.sql("GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

R

library(SparkR)

sql("GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

Scala

spark.sql("GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")

For example, to allow members of the group data-consumers to create tables in main.default :
SQL

GRANT CREATE ON SCHEMA main.default to `data-consumers`;

Python

spark.sql("GRANT CREATE ON SCHEMA main.default to `data-consumers`")

R

library(SparkR)

sql("GRANT CREATE ON SCHEMA main.default TO `data-consumers`")

Scala

spark.sql("GRANT CREATE ON SCHEMA main.default to `data-consumers`")

Run the cell.


3. Create a new table called department .
Add a new cell to the notebook. Paste in the following code, which specifies the table name, its columns,
and inserts five rows into it.
SQL

CREATE TABLE main.default.department


(
deptcode INT,
deptname STRING,
location STRING
);

INSERT INTO main.default.department VALUES


(10, 'FINANCE', 'EDINBURGH'),
(20, 'SOFTWARE', 'PADDINGTON'),
(30, 'SALES', 'MAIDSTONE'),
(40, 'MARKETING', 'DARLINGTON'),
(50, 'ADMIN', 'BIRMINGHAM');
Python

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([ \
StructField("deptcode", IntegerType(), True),
StructField("deptname", StringType(), True),
StructField("location", StringType(), True)
])

spark.catalog.createTable(
tableName = "main.default.department",
schema = schema \
)

dfInsert = spark.createDataFrame(
data = [
(10, "FINANCE", "EDINBURGH"),
(20, "SOFTWARE", "PADDINGTON"),
(30, "SALES", "MAIDSTONE"),
(40, "MARKETING", "DARLINGTON"),
(50, "ADMIN", "BIRMINGHAM")
],
schema = schema
)

dfInsert.write.saveAsTable(
name = "main.default.department",
mode = "append"
)

R
library(SparkR)

schema = structType(
structField("deptcode", "integer", TRUE),
structField("deptname", "string", TRUE),
structField("location", "string", TRUE)
)

df = createDataFrame(
data = list(),
schema = schema
)

saveAsTable(
df = df,
tableName = "main.default.department"
)

data = list(
list("deptcode" = 10L, "deptname" = "FINANCE", "location" = "EDINBURGH"),
list("deptcode" = 20L, "deptname" = "SOFTWARE", "location" = "PADDINGTON"),
list("deptcode" = 30L, "deptname" = "SALES", "location" = "MAIDSTONE"),
list("deptcode" = 40L, "deptname" = "MARKETING", "location" = "DARLINGTON"),
list("deptcode" = 50L, "deptname" = "ADMIN", "location" = "BIRMINGHAM")
)

dfInsert = createDataFrame(
data = data,
schema = schema
)

insertInto(
x = dfInsert,
tableName = "main.default.department"
)

Scala

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val df = spark.createDataFrame(
new java.util.ArrayList[Row](),
new StructType()
.add("deptcode", "int")
.add("deptname", "string")
.add("location", "string")
)

df.write
.format("delta")
.saveAsTable("main.default.department")

val dfInsert = Seq(


(10, "FINANCE", "EDINBURGH"),
(20, "SOFTWARE", "PADDINGTON"),
(30, "SALES", "MAIDSTONE"),
(40, "MARKETING", "DARLINGTON"),
(50, "ADMIN", "BIRMINGHAM")
).toDF("deptcode", "deptname", "location")

dfInsert.write.insertInto("main.default.department")

Run the cell.


4. Query the table.
Add a new cell to the notebook. Paste in the following code, then run the cell.
SQL

SELECT * from main.default.department;

Python

display(spark.table("main.default.department"))

R

display(tableToDF("main.default.department"))

Scala

display(spark.table("main.default.department"))

5. Grant the ability to read and query the table to the data-consumers group that you created in Add users
and groups.
Add a new cell to the notebook and paste in the following code:
SQL

GRANT SELECT ON main.default.department TO `data-consumers`;

Python

spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")

sql("GRANT SELECT ON main.default.department TO `data-consumers`")

Scala

spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")

NOTE
To grant read access to all account-level users instead of only data-consumers , use the group name
account users instead.

Run the cell.

(Optional) Link the metastore to additional workspaces.


A key benefit of Unity Catalog is the ability to share a single metastore among multiple workspaces. You can
then run different types of workloads against the same data without the need to move or copy data amongst
workspaces. Each workspace can have a maximum of one Unity Catalog metastore assigned to it. To link the
metastore to additional workspaces:
1. Log in to the account console.
2. Click Data .
3. Click the name of a metastore to open its properties.
4. Click the Workspaces tab.
5. Click Assign to workspaces .
6. Select one or more workspaces. You can type part of the workspace name to filter the list.
7. Click Assign . When the assignment is complete, the workspace appears in the metastore’s Workspaces tab.
Users in each of the workspaces you selected can now access data in the metastore.

(Optional) Unlink the metastore from workspaces


A key benefit of Unity Catalog is the ability to share a single metastore among multiple workspaces. To remove a
workspace’s access to data in a metastore, you can unlink the metastore from the workspace.
1. Log in to the account console.
2. Click Data .
3. Click the name of a metastore to open its properties.
4. Click the Workspaces tab.
5. Deselect one or more workspaces. You can type part of the workspace name to filter the list.
6. Click Assign . When the assignment is complete, the workspace no longer appears in the metastore’s
Workspaces tab.
Users in each of the workspaces you selected can no longer access data in the metastore.

Example notebook
You can use the following example SQL notebook to create a catalog, schema, and table, as well as manage
permissions on each.
Create and manage a Unity Catalog table
Get notebook

(Recommended) Transfer ownership of your metastore to a group


When possible, Databricks recommends group ownership over single-user ownership. The user who creates a
metastore is its initial owner. A metastore owner can manage the privileges for all securable objects within a
metastore, as well as create catalogs, external locations, and storage credentials.
1. Log in to the account console.
2. Click Data .
3. Click the name of a metastore to open its properties.
4. Under Owner , click Edit .
5. Select a group from the drop-down. You can enter text in the field to search for options.
6. Click Save .

(Optional) Install the Unity Catalog CLI


The Unity Catalog CLI is part of the Databricks CLI. To use the Unity Catalog CLI, do the following:
1. Set up the CLI.
2. Set up authentication.
3. Optionally, create one or more connection profiles to use with the CLI.
4. Learn how to use the Databricks CLI in general.
5. Begin using the Unity Catalog CLI.

Next steps
Learn more about key concepts of Unity Catalog
Create tables
Create views
Key concepts
7/21/2022 • 9 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article explains the key concepts behind how Unity Catalog brings security and governance to your
Lakehouse.

Account-level identities
Unity Catalog uses the account-level identity system in Databricks to resolve users and groups and enforce
permissions. You configure users and groups directly in the Databricks account console. Refer to those account-
level users and groups when creating access-control policies in Unity Catalog.
Although Databricks also allows adding local groups to workspaces, those local groups cannot be used in Unity
Catalog. Commands that reference local groups return an error that the group was not found.
Unity Catalog users must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks
SQL query, the Databricks SQL Data Explorer, or a REST API command, and to join Unity Catalog data with data
that is local to a workspace.

Data permissions
In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Metastore
admins and object owners can manage object permissions using the Databricks SQL Data Explorer or SQL
commands.
To learn more, see Data permissions.
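For example, a metastore admin or object owner might grant and later revoke a privilege with SQL statements such as the following sketch (the schema and group names reuse the getting-started walkthrough; adjust them to your own objects):

GRANT USAGE ON SCHEMA main.default TO `data-consumers`;
REVOKE USAGE ON SCHEMA main.default FROM `data-consumers`;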

Object model
The following diagram illustrates the main securable objects in Unity Catalog:
NOTE
Some objects, such as external locations and storage credentials, are not shown in the diagram. These objects reside in the
metastore at the same level as catalogs.

Metastore
A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the
permissions that govern access to them. Databricks account admins can create metastores and assign them to
Databricks workspaces to control which workloads use each metastore.

NOTE
Unity Catalog offers a new metastore with built-in security and auditing. This is distinct from the metastore used in
previous versions of Databricks, which was based on the Hive Metastore.

Metastore admin
A metastore admin can manage the privileges for all securable objects within a metastore, such as who can
create catalogs or query a table.

NOTE
For more details about Unity Catalog’s data governance model, see Data Permissions.

If necessary, a metastore admin can delegate management of permissions for a metastore object to a different
user or group by changing the object’s ownership. The account-level admin who creates a metastore is its owner.
The owner or owners (if the object is owned by a group) of an object can grant privileges on that object and its
descendants to others.
The initial account-level admin, who must have the Contributor role in the root management group of the
Azure tenant, can enable the Admin role for other users using the Azure Databricks account console. These
account admins are metastore admins.
The account admin who created the metastore is its initial metastore admin. An account admin can assign other
users as metastore admins by changing the metastore’s owner. Account admins can always manage a
metastore, regardless of the metastore’s owner.
Workspace admin
A workspace admin can manage workspace objects like users, jobs, and notebooks, regardless of whether Unity
Catalog is configured for a workspace. In other words, if a workspace is configured to use Unity Catalog,
workspace admins retain the ability to manage workspace objects. Although workspace admins cannot manage
access to data stored in Unity Catalog in the same way a metastore admin can, they do have the ability to
perform workspace management tasks such as adding users and service principals to the workspace, and they
can view and modify workspace objects like jobs and notebooks. This may give access to data registered in
Unity Catalog. The workspace admin therefore remains a privileged role that should be distributed carefully.
Default storage location
Each metastore is configured with a default storage location in an Azure storage account. This is the default
storage location for data in managed tables. External tables store data in other storage paths.
To access the default storage location on the behalf of a user, Unity Catalog uses a root storage credential that is
configured during metastore creation. The root storage credential contains the client secret of a service principal
that has the Azure Blob Contributor role for the default storage location. You can create additional external
credentials that use separate service principals, but the metastore’s root storage credential is still used to write
metadata to the metastore. User code never receives full access to a storage credential. Instead, Unity Catalog
generates scoped access tokens that allow each user or application to access the requested data.
Catalog
A catalog is the first layer of Unity Catalog’s three-level namespace and is used to organize your data assets.
Users can see all catalogs on which they have been assigned the USAGE data permission.
Schema
A schema (also called a database) is the second layer of Unity Catalog’s three-level namespace and organizes
tables and views. To access or list a table or view in a schema, a user must have the USAGE data permission on
the schema and its parent catalog and the SELECT permission on the table or view.
Table
A table resides in the third layer of Unity Catalog’s three-level namespace and contains rows of data. To create a
table, a user must have CREATE and USAGE permissions on the schema and the USAGE permission on its parent
catalog. To query a table, the user must have the SELECT permission on the table and the USAGE permission on
its parent schema and catalog.
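For example, to let the data-consumers group query a table named main.default.department (names taken from the getting-started walkthrough), the full set of required privileges could be granted as in the following sketch; some of these privileges may already be granted by default:

GRANT USAGE ON CATALOG main TO `data-consumers`;
GRANT USAGE ON SCHEMA main.default TO `data-consumers`;
GRANT SELECT ON main.default.department TO `data-consumers`;
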
A table can be managed or external.
Managed table
Managed tables are the default way to create tables in Unity Catalog. These tables are stored in the managed
storage location you configured when you created each metastore.
To create a managed table, run a CREATE TABLE command without a LOCATION clause.
To delete a managed table, use the DROP TABLE statement.
When a managed table is dropped, its underlying data is deleted from your cloud tenant. The only supported
format for managed tables is Delta.
Example Syntax:

CREATE TABLE <example-table>(id STRING, value STRING)

External table
External tables are tables whose data is stored in a storage location outside of the managed storage location,
and are not fully managed by Unity Catalog. When you run DROP TABLE on an external table, Unity Catalog does
not delete the underlying data. You can manage privileges on external tables and use them in queries in the
same way as managed tables. To create an external table, specify a LOCATION path in your CREATE TABLE
statement. External tables can use the following file formats:
DELTA
CSV
JSON
AVRO
PARQUET
ORC
TEXT
To manage access to the underlying cloud storage for an external table, Unity Catalog introduces two new object
types: storage credentials and external locations.
A storage credential represents an authentication and authorization mechanism for accessing data stored on
your cloud tenant, such as the client secret for a service principal. Each storage credential is subject to Unity
Catalog access-control policies that control which users and groups can access the credential.
If a user attempts to use a storage credential on which they haven’t been granted the USAGE
permission, the request fails and Unity Catalog does not attempt to authenticate to the cloud tenant on behalf of
the user.
An external location is an object that contains a reference to a storage credential and a cloud storage
path. The external location grants access only to that path and its child directories and files. Each external
location is subject to Unity Catalog access-control policies that control which users and groups can access
the credential.
If a user attempts to use an external location on which they haven’t been granted the USAGE permission,
the request fails and Unity Catalog does not attempt to authenticate to the cloud tenant on behalf of the
user.
Only metastore admins can create and grant permissions on storage credentials and external locations.
Example Syntax:

CREATE TABLE <example-table>


(id STRING, value STRING)
USING delta
LOCATION "abfss://<your-storage-path>"

NOTE
Before a user can create an external table, the user must have the CREATE TABLE privilege on an external location or
storage credential that grants access to the LOCATION specified in the CREATE TABLE statement.
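For example, a metastore admin might grant that privilege with a statement along these lines (a sketch; the external location name sales_external and the group data-engineers are illustrative, and the exact securable syntax may differ in the preview):

GRANT CREATE TABLE ON EXTERNAL LOCATION `sales_external` TO `data-engineers`;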

View
A view resides in the third layer of Unity Catalog’s three-level namespace and is a read-only object composed
from one or more tables and views in a metastore. A view can be composed from tables and views in multiple
schemas or catalogs.
Example syntax:

CREATE VIEW main.default.experienced_employee


(id COMMENT 'Unique identification number', Name)
COMMENT 'View for experienced employees'
AS SELECT id, name
FROM all_employee
WHERE working_years > 5;

Cluster security mode


To ensure the integrity of access controls and enforce strong isolation guarantees, Unity Catalog imposes some
security requirements on compute resources. For this reason, Unity Catalog introduces the concept of a cluster’s
security mode. Unity Catalog is secure by default; if a cluster is not configured with an appropriate security
mode, the cluster can’t access data in Unity Catalog.

NOTE
If your workspace is assigned to a Unity Catalog metastore, you use security mode instead of High Concurrency clusters
to ensure the integrity of access controls and enforce strong isolation guarantees. High Concurrency cluster mode is not
available with Unity Catalog.
When you create a Data Science & Engineering or Databricks Machine Learning cluster, you can select from the
following cluster security modes:
None : No isolation. Does not enforce workspace-local table access control or credential passthrough. Cannot
access Unity Catalog data.
Single User : Can be used only by a single user (by default, the user who created the cluster). Other users
cannot attach to the cluster. When accessing a view from a cluster with Single User security mode, the view
is executed with the user’s permissions. Single-user clusters support workloads using Python, Scala, and R.
Init scripts, library installation, and DBFS FUSE mounts are supported on single-user clusters. Automated
jobs should use single-user clusters.
User Isolation : Can be shared by multiple users. Only SQL workloads are supported. Library installation,
init scripts, and DBFS FUSE mounts are disabled to enforce strict isolation among the cluster users.
Table ACL only (Legacy) : Enforces workspace-local table access control, but cannot access Unity Catalog
data.
Passthrough only (Legacy) : Enforces workspace-local credential passthrough, but cannot access Unity
Catalog data.
The only security modes supported for Unity Catalog workloads are Single User and User Isolation .
Databricks SQL endpoints automatically use User Isolation , with no configuration required.
You can upgrade an existing cluster to meet the requirements of Unity Catalog by setting its cluster security
mode to Single User or User Isolation security mode.
The following list summarizes the features that are enabled for each cluster security mode. A feature that is not
listed for a mode is disabled in that mode.
None (all languages): multiple users, RDD API, DBFS FUSE mounts, init scripts and library installation,
Databricks Runtime for Machine Learning.
Single User (all languages): Unity Catalog, RDD API, DBFS FUSE mounts, init scripts and library installation,
Databricks Runtime for Machine Learning.
User Isolation (SQL only): Unity Catalog, legacy table access control, multiple users, dynamic views.
Table ACL only (Legacy) (SQL and Python): legacy table access control, multiple users, dynamic views.
Passthrough only (Legacy) (SQL and Python): credential passthrough, multiple users, DBFS FUSE mounts,
dynamic views, Databricks Runtime for Machine Learning.

Data permissions

This article explains how data permissions control access to data and objects in Unity Catalog.
You can use data access control policies to grant and revoke access to Unity Catalog data and objects in the
Databricks SQL Data Explorer, with SQL statements in notebooks or Databricks SQL queries, or by using the Unity
Catalog REST API.
Initially, users have no access to data in a metastore. Only metastore admins can create schemas, tables, views,
and other Unity Catalog objects and grant or revoke access on them to account-level users or groups. Access
control policies are not inherited. The account-level admin who creates a metastore is its owner and metastore
admin.
Access control policies are applied by Unity Catalog before data can be read or written to your cloud tenant.

Ownership
Each securable object in Unity Catalog has an owner. The owner can be any account-level user or group, called a
principal. The principal that creates an object becomes its initial owner. An object’s owner has all privileges on
the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges to other
principals.
The object’s owner can transfer ownership to another user or group. A metastore admin can transfer ownership
of any object in the metastore to another user or group.
To see the owner of a securable object, use the following syntax. Replace the placeholder values:
<SECURABLE_TYPE> : The type of securable, such as CATALOG or TABLE .
<catalog> : The parent catalog for a table or view.
<schema> : The parent schema for a table or view.
<securable_name> : The name of the securable, such as a table or view.

SQL

DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>;

Python

display(spark.sql("DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>"))

R

library(SparkR)

display(sql("DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>"))

Scala

display(spark.sql("DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>"))


To transfer ownership of an object, use a SQL command with the following syntax. Replace the placeholder
values:
<SECURABLE_TYPE> : The type of securable, such as CATALOG or TABLE .
<SECURABLE_NAME> : The name of the securable.
<PRINCIPAL> : The email address of an account-level user or the name of an account-level group.

SQL

ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>;

Python

spark.sql("ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>")

R

library(SparkR)

sql("ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>")

Scala

spark.sql("ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>")

For example, to transfer ownership of a table to the accounting group:


SQL

ALTER TABLE orders OWNER TO `accounting`;

Python

spark.sql("ALTER TABLE orders OWNER TO `accounting`")

R

library(SparkR)

sql("ALTER TABLE orders OWNER TO `accounting`")

Scala

spark.sql("ALTER TABLE orders OWNER TO `accounting`")

Ownership of a metastore
The account-level admin who creates a metastore is its owner. To transfer ownership of a metastore to a
different account-level user or group, see (Recommended) Transfer ownership of your metastore to a group. For
added security, this command is not available using SQL syntax.

Privileges
In Unity Catalog, you can grant the following privileges on a securable object:
USAGE : This privilege does not grant access to the securable itself, but allows the grantee to traverse the
securable in order to access its child objects. For example, to select data from a table, users need to have the
SELECT privilege on that table and USAGE privileges on its parent schema and parent catalog. Thus, you can
use this privilege to restrict access to sections of your data namespace to specific groups.
SELECT : Allows a user to select from a table or view, if the user also has USAGE on its parent catalog and
schema.
MODIFY : Allows the grantee to add, update and delete data to or from the securable if the user also has
USAGE on its parent catalog and schema.
CREATE : Allows a user to create a schema if the user also has USAGE and CREATE permissions on its parent
catalog. Allows a user to create a table or view if the user also has USAGE on its parent catalog and schema
and the CREATE permission on the schema.
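
For example, because privileges are checked at each level of the namespace, giving a group read access to a single
table requires a grant at each level. A minimal sketch, using hypothetical catalog, schema, table, and group names:

GRANT USAGE ON CATALOG main TO `analysts`;
GRANT USAGE ON SCHEMA main.sales TO `analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;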

In addition, you can grant the following privileges on storage credentials and external locations.
CREATE TABLE : Allows a user to create external tables directly in your cloud tenant using a storage
credential.
READ FILES : When granted on an external location, allows a user to read files directly from your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to read files directly from your cloud tenant
using the storage credential.
WRITE FILES : When granted on an external location, allows a user to write files directly to your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to write files directly to your cloud tenant
using the storage credential.

NOTE
Although you can grant READ FILES and WRITE FILES privileges on a storage credential, Databricks recommends that
you instead grant these privileges on an external location. This allows you to manage permissions at a more granular level
and provides a simpler experience to end users.
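
For example, a minimal sketch of granting read access to files through an external location rather than directly
on a storage credential (the location and group names are placeholders):

GRANT READ FILES ON EXTERNAL LOCATION <location-name> TO `data-engineers`;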

In Unity Catalog, privileges are not inherited on child securable objects. For example, if you grant the CREATE
privilege on a catalog to a user, the user does not automatically have the CREATE privilege on all databases in
the catalog.
The following table summarizes the privileges that can be granted on each securable object:

SECURABLE                PRIVILEGES

Catalog                  CREATE , USAGE

Schema                   CREATE , USAGE

Table                    SELECT , MODIFY

View                     SELECT

External location        CREATE TABLE , READ FILES , WRITE FILES

Storage credential       CREATE TABLE , READ FILES , WRITE FILES

Manage privileges
You can manage privileges for metastore objects in the Databricks SQL data explorer or by using SQL
commands in the Databricks SQL editor or a Data Science & Engineering notebook.
To manage privileges, you use GRANT and REVOKE statements. Only an object’s owner or a metastore admin can
grant privileges on the object and its descendent objects. A built-in account-level group called account users
includes all account-level users.
This section contains examples of using SQL commands to manage privileges. To manage privileges using the
Databricks SQL Data Explorer, see Manage access to data.
Show grants on a securable object
To show grants on an object using SQL, use a command like the following. To use the Databricks SQL Data
Explorer, see Use the Databricks SQL Data Explorer.
Use the following syntax. Replace the placeholder values:
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.

SQL

SHOW GRANTS ON <securable_type> <securable_name>;

Python

display(spark.sql("SHOW GRANTS ON <securable_type> <securable_name>"))

R

library(SparkR)

display(sql("SHOW GRANTS ON <securable_type> <securable_name>"))

Scala

display(spark.sql("SHOW GRANTS ON <securable_type> <securable_name>"))

To show all grants for a given principal on an object, use the following syntax. Replace the placeholder values:
<principal> : The email address of an account-level user or the name of an account-level group.
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.

SQL

SHOW GRANTS <principal> ON <securable_type> <securable_name>;

Python

display(spark.sql("SHOW GRANTS <principal> ON <securable_type> <securable_name>"))


R

library(SparkR)

display(sql("SHOW GRANTS <principal> ON <securable_type> <securable_name>"))

Scala

display(spark.sql("SHOW GRANTS <principal> ON <securable_type> <securable_name>"))

Grant a privilege
To grant a privilege using SQL, use a command like the following. To use the Databricks SQL Data Explorer, see
Use the Databricks SQL Data Explorer.
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to grant, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.

SQL

GRANT <privilege> ON <securable_type> <securable_name> TO <principal>

Python

spark.sql("GRANT <privilege> ON <securable_type> <securable_name> TO <principal>")

R

library(SparkR)

sql("GRANT <privilege> ON <securable_type> <securable_name> TO <principal>")

Scala

spark.sql("GRANT <privilege> ON <securable_type> <securable_name> TO <principal>")

Revoke a privilege
To revoke a privilege using SQL, use a command like the following. To use the Databricks SQL Data Explorer, see
Use the Databricks SQL Data Explorer.
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to revoke, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.

SQL

REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>

Python
spark.sql("REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>")

R

library(SparkR)

sql("REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>")

Scala

spark.sql("REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>")

Dynamic views
In Unity Catalog, you can use dynamic views to configure fine-grained access control, including:
Security at the level of columns or rows.
Data masking.

NOTE
Fine-grained access control using dynamic views is not available on clusters with Single User security mode.

Unity Catalog introduces the following functions, which allow you to dynamically limit which users can access a
row, column, or record in a view:
current_user() : Returns the current user’s email address.
is_account_group_member() : Returns TRUE if the current user is a member of a specific account-level group.
Recommended for use in dynamic views against Unity Catalog data.
is_member() : Returns TRUE if the current user is a member of a specific workspace-level group. This function
is provided for compatibility with the existing Hive metastore. Avoid using it with views against Unity Catalog
data, because it does not evaluate account-level group membership.
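
As a quick sketch, you can check what these functions return for your current session before building a view on
them (the group name below is a placeholder):

SELECT current_user(), is_account_group_member('auditors');
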
The following examples illustrate how to create dynamic views in Unity Catalog.
Column-level permissions
With a dynamic view, you can limit the columns a specific user or group can access. In the following example,
only members of the auditors group can access email addresses from the sales_raw table. During query
analysis, Apache Spark replaces the CASE statement with either the literal string REDACTED or the actual
contents of the email address column. Other columns are returned as normal. This strategy has no negative
impact on query performance.
SQL
-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
user_id,
CASE WHEN
is_account_group_member('auditors') THEN email
ELSE 'REDACTED'
END AS email,
country,
product,
total
FROM sales_raw

Python

# Alias the field 'email' to itself (as 'email') to prevent the
# permission logic from showing up directly in the column name results.
spark.sql("CREATE VIEW sales_redacted AS "
"SELECT "
" user_id, "
" CASE WHEN "
" is_account_group_member('auditors') THEN email "
" ELSE 'REDACTED' "
" END AS email, "
" country, "
" product, "
" total "
"FROM sales_raw")

R

library(SparkR)

# Alias the field 'email' to itself (as 'email') to prevent the
# permission logic from showing up directly in the column name results.
sql(paste("CREATE VIEW sales_redacted AS ",
"SELECT ",
" user_id, ",
" CASE WHEN ",
" is_account_group_member('auditors') THEN email ",
" ELSE 'REDACTED' ",
" END AS email, ",
" country, ",
" product, ",
" total ",
"FROM sales_raw",
sep = ""))

Scala
// Alias the field 'email' to itself (as 'email') to prevent the
// permission logic from showing up directly in the column name results.
spark.sql("CREATE VIEW sales_redacted AS " +
"SELECT " +
" user_id, " +
" CASE WHEN " +
" is_account_group_member('auditors') THEN email " +
" ELSE 'REDACTED' " +
" END AS email, " +
" country, " +
" product, " +
" total " +
"FROM sales_raw")

Row-level permissions
With a dynamic view, you can specify permissions down to the row or field level. In the following example, only
members of the managers group can view transaction amounts when they exceed $1,000,000. Matching results
are filtered out for other users.
SQL

CREATE VIEW sales_redacted AS
SELECT
user_id,
country,
product,
total
FROM sales_raw
WHERE
CASE
WHEN is_account_group_member('managers') THEN TRUE
ELSE total <= 1000000
END;

Python

spark.sql("CREATE VIEW sales_redacted AS "


"SELECT "
" user_id, "
" country, "
" product, "
" total "
"FROM sales_raw "
"WHERE "
"CASE "
" WHEN is_account_group_member('managers') THEN TRUE "
" ELSE total <= 1000000 "
"END")

R
library(SparkR)

sql(paste("CREATE VIEW sales_redacted AS ",


"SELECT ",
" user_id, ",
" country, ",
" product, ",
" total ",
"FROM sales_raw ",
"WHERE ",
"CASE ",
" WHEN is_account_group_member('managers') THEN TRUE ",
" ELSE total <= 1000000 ",
"END",
sep = ""))

Scala

spark.sql("CREATE VIEW sales_redacted AS " +


"SELECT " +
" user_id, " +
" country, " +
" product, " +
" total " +
"FROM sales_raw " +
"WHERE " +
"CASE " +
" WHEN is_account_group_member('managers') THEN TRUE " +
" ELSE total <= 1000000 " +
"END")

Data masking
Because views in Unity Catalog use Spark SQL, you can implement advanced data masking by using more
complex SQL expressions and regular expressions. In the following example, all users can analyze email
domains, but only members of the auditors group can view a user’s entire email address.
SQL

-- The regexp_extract function takes an email address such as
-- user.x.lastname@example.com and extracts 'example', allowing
-- analysts to query the domain name.

CREATE VIEW sales_redacted AS
SELECT
user_id,
region,
CASE
WHEN is_account_group_member('auditors') THEN email
ELSE regexp_extract(email, '^.*@(.*)$', 1)
END
FROM sales_raw

Python
# The regexp_extract function takes an email address such as
# user.x.lastname@example.com and extracts 'example', allowing
# analysts to query the domain name.

spark.sql("CREATE VIEW sales_redacted AS "


"SELECT "
" user_id, "
" region, "
" CASE "
" WHEN is_account_group_member('auditors') THEN email "
" ELSE regexp_extract(email, '^.*@(.*)$', 1) "
" END "
" FROM sales_raw")

R

library(SparkR)

# The regexp_extract function takes an email address such as
# user.x.lastname@example.com and extracts 'example', allowing
# analysts to query the domain name.

sql(paste("CREATE VIEW sales_redacted AS ",


"SELECT ",
" user_id, ",
" region, ",
" CASE ",
" WHEN is_account_group_member('auditors') THEN email ",
" ELSE regexp_extract(email, '^.*@(.*)$', 1) ",
" END ",
" FROM sales_raw",
sep = ""))

Scala

// The regexp_extract function takes an email address such as
// user.x.lastname@example.com and extracts 'example', allowing
// analysts to query the domain name.

spark.sql("CREATE VIEW sales_redacted AS " +


"SELECT " +
" user_id, " +
" region, " +
" CASE " +
" WHEN is_account_group_member('auditors') THEN email " +
" ELSE regexp_extract(email, '^.*@(.*)$', 1) " +
" END " +
" FROM sales_raw")

Working with Unity Catalog and the legacy Hive metastore


The Unity Catalog metastore is additive, meaning it can be used with the per-workspace Hive metastore in
Databricks. The Hive metastore appears as a top-level catalog called hive_metastore in the three-level
namespace. For example, you can refer to a table called sales_raw in the sales schema in the legacy Hive
metastore by using the following notation:
SQL

SELECT * from hive_metastore.sales.sales_raw;


Python

display(spark.table("hive_metastore.sales.sales_raw"))

R

library(SparkR)

display(tableToDF("hive_metastore.sales.sales_raw"))

Scala

display(spark.table("hive_metastore.sales.sales_raw"))

You can also specify the catalog and schema with a USE statement:
SQL

USE hive_metastore.sales;
SELECT * from sales_raw;

Python

spark.sql("USE hive_metastore.sales")
display(spark.table("sales_raw"))

R

library(SparkR)

sql("USE hive_metastore.sales")
display(tableToDF("sales_raw"))

Scala

spark.sql("USE hive_metastore.sales")
display(spark.table("sales_raw"))

Access control in Unity Catalog and the Hive metastore


If you configured table access control on the Hive metastore, Databricks will continue enforcing those access
controls for data in the hive_metastore catalog for clusters running in the User Isolation , or Table ACL only
(Legacy) security mode. The Unity Catalog access model differs slightly from legacy access controls; for example,
Unity Catalog has no DENY statements. The Hive metastore is a workspace-level object. Permissions defined within the
hive_metastore catalog always refer to the local users and groups in the workspace. See Differences from table
access control.
If table access control is enabled on the Hive metastore, workloads must use clusters with User Isolation
enabled. SQL warehouses always use this security mode.
Differences from table access control
Unity Catalog has the following key differences from using table access controls in the legacy Hive metastore in
each workspace.
The access control model in Unity Catalog has the following differences from table access control:
Account-level identities : Access control policies in Unity Catalog are applied to account-level users and
groups, while access control policies for the Hive metastore are applied to workspace-level users and groups.
Simplified Privileges : Unity Catalog has only four privileges ( SELECT , MODIFY , CREATE , USAGE ) for
catalogs, schemas, tables, and views, and three privileges specific to external locations and storage
credentials ( READ_FILES , WRITE_FILES , CREATE TABLE ). Unity Catalog has no DENY statements. Instead, access
is always denied until it is explicitly granted. Legacy privileges like READ_METADATA and MODIFY_CLASSPATH are
not supported.
No inheritance of privileges : In Unity Catalog, privileges on a parent object are not inherited by its child
objects.
USAGE permission is required on the catalog for all operations on objects inside the catalog :
Regardless of a principal’s privileges on a table or database, the principal must also have the USAGE privilege
on its parent catalog to access the table or database. With workspace-level table access controls, granting
USAGE on the root catalog automatically grants USAGE on all databases, but USAGE on the root catalog is not
required.
Views : In Unity Catalog, the owner of a view does not need to be an owner of the view’s referenced tables
and views. Having the SELECT privilege is sufficient, along with USAGE on the views’ parent schema and
catalog. With workspace-level table access controls, a view’s owner needs to be an owner of all referenced
tables and views.
No support for ALL FILES or ANONYMOUS FUNCTION : In Unity Catalog, there is no concept of an ALL FILES
or ANONYMOUS FUNCTION permission. These permissions could be used to circumvent access control
restrictions by allowing an unprivileged user to run privileged code.
Joins
By using three-level namespace notation, you can join data in a Unity Catalog metastore with data in the legacy
Hive metastore.

NOTE
A join with data in the legacy Hive metastore will only work on the workspace where that data resides. Trying to run such
a join in another workspace results in an error. Azure Databricks recommends that you upgrade legacy tables and views to
Unity Catalog.

The following example joins results from the sales_current table in the legacy Hive metastore with the
sales_historical table in the Unity Catalog metastore when the order_id fields are equal.

SQL

SELECT * FROM hive_metastore.sales.sales_current
JOIN main.shared_sales.sales_historical
ON hive_metastore.sales.sales_current.order_id = main.shared_sales.sales_historical.order_id;

Python

dfCurrent = spark.table("hive_metastore.sales.sales_current")
dfHistorical = spark.table("main.shared_sales.sales_historical")

display(dfCurrent.join(
other = dfHistorical,
on = dfCurrent.order_id == dfHistorical.order_id
))

R
library(SparkR)

dfCurrent = tableToDF("hive_metastore.sales.sales_current")
dfHistorical = tableToDF("main.shared_sales.sales_historical")

display(join(
x = dfCurrent,
y = dfHistorical,
joinExpr = dfCurrent$order_id == dfHistorical$order_id))

Scala

val dfCurrent = spark.table("hive_metastore.sales.sales_current")


val dfHistorical = spark.table("main.shared_sales.sales_historical")

display(dfCurrent.join(
right = dfHistorical,
joinExprs = dfCurrent("order_id") === dfHistorical("order_id")
))

The following example expresses a more complex join. It returns the sum of current and historical sales by
customer, assuming that each table contains at least one row for each customer.
SQL

SELECT current.customer_id, current.customer_name,
  COALESCE(current.total, 0) + COALESCE(historical.total, 0) AS total
FROM hive_metastore.sales.sales_current AS current
FULL OUTER JOIN main.shared_sales.sales_historical AS historical
ON current.customer_id = historical.customer_id;

Python

from pyspark.sql.functions import coalesce, lit

dfCurrent = spark.table("hive_metastore.sales.sales_current")
dfHistorical = spark.table("main.shared_sales.sales_historical")

dfJoin = dfCurrent.join(
other = dfHistorical,
on = dfCurrent.customer_id == dfHistorical.customer_id,
how = "full_outer"
)

display(dfJoin.select(
dfCurrent.customer_id,
dfCurrent.customer_name,
(coalesce(dfCurrent.total) + coalesce(dfHistorical.total)).alias("total")))

R
library(SparkR)

dfCurrent = tableToDF("hive_metastore.sales.sales_current")
dfHistorical = tableToDF("main.shared_sales.sales_historical")

dfJoin = join(
x = dfCurrent,
y = dfHistorical,
joinExpr = dfCurrent$customer_id == dfHistorical$customer_id,
joinType = "fullouter")

display(
select(
x = dfJoin,
col = list(
dfCurrent$customer_id,
dfCurrent$customer_name,
alias(coalesce(dfCurrent$total) + coalesce(dfHistorical$total), "total")
)
)
)

Scala

import org.apache.spark.sql.functions.{coalesce, lit}

val dfCurrent = spark.table("hive_metastore.sales.sales_current")
val dfHistorical = spark.table("main.shared_sales.sales_historical")

val dfJoin = dfCurrent.join(
  right = dfHistorical,
  joinExprs = dfCurrent("customer_id") === dfHistorical("customer_id"),
  joinType = "full_outer"
)

display(dfJoin.select(
  dfCurrent("customer_id"),
  dfCurrent("customer_name"),
  (coalesce(dfCurrent("total"), lit(0)) + coalesce(dfHistorical("total"), lit(0))).alias("total")
))

Default catalog
If you omit the top-level catalog name and there is no USE CATALOG statement, the default catalog is assumed. To
configure the default catalog for a workspace, set the spark.databricks.sql.initial.catalog.name value.
Databricks recommends setting the default catalog value to hive_metastore so that your existing code can
operate on current Hive metastore data without any change.
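
For example, a minimal sketch of the cluster-level Spark configuration entry, assuming you want the legacy Hive
metastore as the default catalog per the recommendation above:

spark.databricks.sql.initial.catalog.name hive_metastore
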
Cluster instance profile
When using the Hive metastore alongside Unity Catalog, the instance profile on the cluster is used to access
Hive Metastore data but not the data in Unity Catalog. Unity Catalog does not rely on the instance profile
configured for a cluster.
Upgrade legacy tables to Unity Catalog
Tables in the Hive metastore do not benefit from the full set of security and governance features that Unity
Catalog introduces, such as built-in auditing and access control. Databricks recommends that you upgrade your
legacy tables by adding them to Unity Catalog.
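
As a minimal sketch, one way to copy a legacy managed table into Unity Catalog is a CREATE TABLE AS SELECT
statement; the catalog, schema, and table names below are hypothetical:

CREATE TABLE main.default.sales AS
SELECT * FROM hive_metastore.default.sales;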

Next steps
Get started using Unity Catalog
Learn more about key concepts of Unity Catalog
Use Azure managed identities in Unity Catalog to
access storage

IMPORTANT
This feature is in Public Preview.

This article describes how to use Azure managed identities for connecting to storage containers on behalf of
Unity Catalog users.

What are Azure managed identities?


Unity Catalog (Preview) can be configured to use an Azure managed identity to access storage containers on
behalf of Unity Catalog users. Managed identities provide an identity for applications to use when they connect
to resources that support Azure Active Directory (Azure AD) authentication.
You can use managed identities in Unity Catalog to support two primary use cases:
As an identity to connect to the metastore’s root storage account (where managed tables are stored).
As an identity to connect to other external storage accounts (either for file-based access or for external
tables).
Configuring Unity Catalog with a managed identity has the following benefits over configuring Unity Catalog
with a service principal:
You can connect to an Azure Data Lake Storage Gen2 account that is protected by a storage firewall.
Managed identities do not require you to maintain credentials or rotate secrets.

Configure a managed identity for Unity Catalog


To configure a managed identity for your Unity Catalog metastore and other external storage, first you must
create an Azure Databricks access connector, to which the system assigns a managed identity. Then you grant
the managed identity access to your Azure Data Lake Storage Gen2 account.

NOTE
Azure Databricks supports only system-assigned managed identities. You cannot use user-assigned managed identities.

Step 1: Create an Azure Databricks access connector


The Azure Databricks access connector is a first-party Azure resource that lets you connect managed identities
to an Azure Databricks account. Azure Databricks account admins can delegate the managed identity assigned
to the access connector to Azure Databricks resources like Unity Catalog metastores.
Each Azure Databricks access connector has one system-assigned managed identity, so you must create a
separate access connector for each managed identity.
1. Log in to the Azure Portal as a Contributor or Owner of a resource group.

NOTE
You cannot manage access connectors as a service principal.

The resource group should be in the same region as the storage account that you want to connect to.
2. Click + Create or Create a new resource .
3. Search for Template Deployment .
4. Click Create .
5. Click Build your own template in the editor .
6. Copy and paste this template into the editor:
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "connectorName": {
      "defaultValue": "testConnector",
      "type": "String",
      "metadata": {
        "description": "The name of the Azure Databricks Access Connector to create."
      }
    },
    "accessConnectorRegion": {
      "defaultValue": "[resourceGroup().location]",
      "type": "String",
      "metadata": {
        "description": "Location for the access connector resource."
      }
    },
    "enableSystemAssignedIdentity": {
      "defaultValue": true,
      "type": "bool",
      "metadata": {
        "description": "Whether the system assigned managed identity is enabled"
      }
    }
  },
  "resources": [
    {
      "type": "Microsoft.Databricks/accessConnectors",
      "apiVersion": "2022-04-01-preview",
      "name": "[parameters('connectorName')]",
      "location": "[parameters('accessConnectorRegion')]",
      "identity": {
        "type": "[if(parameters('enableSystemAssignedIdentity'), 'SystemAssigned', 'None')]"
      }
    }
  ]
}

7. Click Save .
8. On the Basics tab, accept, select, or enter values for the following fields:
Subscription : This is the Azure subscription that the Azure Databricks access connector will be
created in. The default is the Azure subscription you are currently using. It can be any subscription in
the tenant.
Resource group : This should be a resource group in the same region as the storage account that you
will connect to.
Region : This should be the same region as the storage account that you will connect to.
Connector Name : Enter a name that indicates the purpose of the connector resource.
Access Connector Region : Accept the default [resourceGroup().location] to have the connector
resource deploy in the same region as the resource group. You can enter a different region value, but
Databricks recommends that the connector region and resource group region be the same as the
storage account that you will connect to.
Enable System Assigned Identity : Accept the default value of true .
9. Click Next: Review + create > .
10. Click Create .
When the deployment succeeds, the Azure Databricks access connector is deployed with a system-
assigned managed identity.
11. When the deployment is complete, click Go to resource .
12. Make note of the Resource ID .
The resource ID is in the format:

/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>

Step 2: Grant the managed identity access to the storage account


1. Log in to your Azure Data Lake Storage Gen2 account as an Owner or a user with the User Access
Administrator Azure RBAC role on the storage account.
2. Go to Access Control (IAM) , click + Add , and select Add role assignment .
3. Select the Storage Blob Data Contributor role and click Next .
4. Under Assign access to , select Managed identity .
5. Click +Select Members , and select All system-assigned managed identities .
6. Search for your connector name, select it, and click Review and Assign .

Use a managed identity to access storage managed by a Unity


Catalog metastore
This section describes how to give the managed identity access to the root storage account, used for managed
storage, when you create a Unity Catalog metastore.
To learn how to upgrade an existing Unity Catalog metastore to use a managed identity, see Upgrade your
existing Unity Catalog metastore to use a managed identity to access its root storage.
1. As an Azure Databricks account admin, log in to the Azure Databricks account console.

2. Click Data .
3. Click Create Metastore .
4. Enter values for the following fields:
Name for the metastore.
Region where the metastore will be deployed.
For best performance, co-locate the access connector, workspaces, metastore and cloud storage
location in the same cloud region.
ADLS Gen 2 path : enter the path to the storage container that you will use as root storage for the
metastore.
The abfss:// prefix is added automatically.
Access Connector ID : enter the Azure Databricks access connector’s resource ID in the format:

/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
5. Click Create .
If the request fails, retry using a different metastore name.
6. When prompted, select workspaces to link to the metastore.

Use a managed identity to access external storage managed in Unity


Catalog
Unity Catalog gives you the ability to access data outside the metastore root bucket using storage credentials
and external locations. Storage credentials store the managed identity, and external locations define a path to
storage along with a reference to the storage credential. You can use this approach to grant and control access
to existing data in cloud storage and to register external tables in Unity Catalog.
A storage credential can hold a managed identity or service principal. Using a managed identity has the benefit
of allowing Unity Catalog to access storage accounts protected by network rules, which isn’t possible using
service principals, and it removes the need to manage and rotate secrets.
To create a storage credential using a managed identity and assign that storage credential to an external
location, follow the instructions in Manage external locations and storage credentials.

(Recommended) Configure trusted access to Azure Storage based on


your managed identity
Azure Data Lake Storage Gen2 provides a model to secure access to your storage account. When network rules
are configured, only applications requesting data over the specified set of networks or through the specified set
of Azure resources can access a storage account. You can enable a Unity Catalog metastore to access data in
your storage account by adding the system-assigned managed identity to the network rules. See Configure
Azure Storage firewalls and virtual networks.
Requirements
Your Azure Databricks workspace must be deployed in your own Azure virtual network (also known as VNet
injection). See Deploy Azure Databricks in your Azure virtual network (VNet injection).
You must have created an Azure Databricks access connector and given it access to your Azure Storage
account. See Configure a managed identity for Unity Catalog.
Step 1: Enable your managed identity to access Azure Storage
This step is necessary only if “Allow Azure services on the trusted services list to access this storage account” is
disabled for your Azure Storage account. If that configuration is enabled:
Any Azure Databricks access connector in the same tenant as the storage account can access the storage
account.
Any Azure trusted service can access the storage account. See Grant access to trusted Azure services.
The instructions below include a step in which you disable this configuration. You can use the Azure Portal or the
Azure CLI.
Use the Azure Portal
1. Log in to the Azure Portal, find and select the Azure Storage account, and go to the Networking tab.
2. Set Public Network Access to Enabled from selected virtual networks and IP addresses .
As an option, you can instead set Public Network Access to Disabled . The managed identity can be
used to bypass the check on public network access.
3. Under Resource instances , select a Resource type of Microsoft.Databricks/accessConnectors and
select your Azure Databricks access connector.
4. Under Exceptions , clear the Allow Azure services on the trusted services list to access this
storage account checkbox.
Use the Azure CLI
1. Install the Azure CLI and sign in.
2. Add a network rule to the storage account:

az storage account network-rule add \
  --subscription <subscription id of the resource group> \
  --resource-id <resource Id of the Azure Databricks access connector> \
  --tenant-id <tenant Id> \
  -g <name of the Azure Storage resource group> \
  --account-name <name of the Azure Storage resource>

Add the resource ID in the format:

/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>

3. After you create the network rule, go to your Azure Storage account in the Azure Portal and view the
managed identity in the Networking tab under Resource instances , resource type
Microsoft.Databricks/accessConnectors .

4. Under Exceptions , clear the Allow Azure services on the trusted services list to access this
storage account checkbox.
5. Optionally, set Public Network Access to Disabled . The managed identity can be used to bypass the
check on public network access.
The standard approach is to keep this value set to Enabled from selected virtual networks and IP
addresses .
Step 2. Enable your Azure Databricks workspace to access Azure Storage
Follow the instructions in Securely Accessing Azure Data Sources from Azure Databricks to secure connectivity
from your Azure Databricks workspace to Azure Storage.

Upgrade your existing Unity Catalog metastore to use a managed


identity to access its root storage
If you have a Unity Catalog metastore that was created using a service principal and you would like to upgrade
it to use a managed identity, you can update it using an API call.
1. Create an Azure Databricks access connector and assign it permissions to the storage container that is
being used for your Unity Catalog metastore root storage, using the instructions in Configure a managed
identity for Unity Catalog.
Make note of the access connector’s resource ID.
2. As an account admin, log in to an Azure Databricks workspace that is assigned to the metastore.
You do not have to be a workspace admin.
Make a note of the workspace URL, which is the first portion of the URL, after https:// and inclusive of
azuredatabricks.net .
3. Generate a personal access token.
4. Add the personal access token to the .netrc file in your home directory. This improves security by
preventing the personal access token from appearing in your shell’s command history. See Store tokens
in a .netrc file and use them in curl.
5. Run the following cURL command to recreate the storage credential.
Replace the placeholder values:
<workspace-url> : The URL of the workspace where the personal access token was generated.
<credential-name> : A name for the storage credential.
<access_connector_id> : Resource ID for the Azure Databricks access connector in the format
/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>

curl -n -X POST --header 'Content-Type: application/json' https://<workspace-url>/api/2.0/unity-catalog/storage-credentials --data "{
  \"name\": \"<credential-name>\",
  \"azure_managed_identity\": {
    \"access_connector_id\": \"<access_connector_id>\"
  }
}"

6. Make a note of the storage credential ID in the response.


7. Run the following cURL command to retrieve the metastore_id . Replace <workspace-url> with the URL of the
workspace where the personal access token was generated.

curl -n -X GET --header 'Content-Type: application/json' https://<workspace-url>/api/2.0/unity-catalog/metastore_summary

8. Run the following cURL command to update the metastore with the new root storage credential.
Replace the placeholder values:
<workspace-url> : The URL of the workspace where the personal access token was generated.
<metastore-id> : The metastore ID that you retrieved in the previous step.
<storage-credential-id> : The storage credential ID.

curl -n -X PATCH --header 'Content-Type: application/json' https://<workspace-url>/api/2.0/unity-catalog/metastores/<metastore-id> --data "{\"storage_root_credential_id\": \"<storage-credential-id>\"}"
Create a metastore

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to create a metastore in Unity Catalog and link it to workspaces.

Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium plan.
In your Azure tenant, you must have permission to create:
A storage account to use with Azure Data Lake Storage Gen2. See Create a storage account to use with
Azure Data Lake Storage Gen2.
A new resource to hold a system-assigned managed identity. This requires that you be a Contributor
or Owner of a resource group in any subscription in the tenant.

Create the metastore


To create a Unity Catalog metastore, you create a storage container where the metastore’s metadata and
managed tables will be stored, create an identity that Azure Databricks uses to give access to that storage
container, and then provide Azure Databricks with the storage container path and identity.
You can use either an Azure managed identity or a service principal as the identity that gives access to the
metastore’s storage container.
Using a managed identity has the following benefits over service principals:
You can connect to an Azure Data Lake Storage Gen2 account that is protected by a storage firewall.
Managed identities do not require you to maintain credentials or rotate secrets.
Create a metastore that is accessed using a managed identity (recommended)
To create a Unity Catalog metastore that is accessed by an Azure managed identity:
1. Create an Azure Databricks access connector and assign it permissions to the storage container where
you want the metastore’s metadata and managed tables to be stored, using the instructions in Configure
a managed identity for Unity Catalog.
An Azure Databricks access connector is a first-party Azure resource that lets you connect a system-
assigned managed identity to an Azure Databricks account.
Make note of the access connector’s resource ID.
2. Log in to the Azure Databricks account console.

3. Click Data .
4. Click Create Metastore .
5. Enter values for the following fields:
Name for the metastore.
Region where the metastore will be deployed.
For best performance, co-locate the access connector, workspaces, metastore and cloud storage
location in the same cloud region.
ADLS Gen 2 path : Enter the path to the storage container that you will use as root storage for the
metastore.
The abfss:// prefix is added automatically.
Access Connector ID : Enter the Azure Databricks access connector’s resource ID in the format:

/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>

6. Click Create .
If the request fails, retry using a different metastore name.
7. When prompted, select workspaces to link to the metastore.
The user who creates a metastore is its owner. To change ownership of a metastore after creating it, see
(Recommended) Transfer ownership of your metastore to a group.
Create a metastore that is accessed using a service principal
To create a Unity Catalog metastore that is accessed by a service principal:
1. Create a storage account for Azure Data Lake Storage Gen2.
This storage account will contain metadata related to Unity Catalog metastores and their objects, as well
as the data for managed tables in Unity Catalog. See Create a storage account to use with Azure Data
Lake Storage Gen2. Make a note of the region where you created the storage account.
2. Create a container in the new storage account.
Make a note of the ADLSv2 URI for the container, which is in the following format:

abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<metastore-name>

In the steps that follow, replace <storage-container> with this URI.


3. In Azure Active Directory, create a service principal.
Unity Catalog will use this service principal to access containers in the storage account on behalf of Unity
Catalog users. Generate a client secret for the service principal. See Provision a service principal in Azure
portal. Make a note of the client secret for the service principal, the client application ID, and directory ID
where you created this service principal. In the following steps, replace <client-secret> ,
<client-application-id> , and <directory-id> with these values.

4. In the storage account, go to Access Control (IAM) and grant the new service principal the Storage
blob data contributor role.
5. Make note of these properties, which you will use when you create a metastore:
<aad-application-id>
The storage account region
<storage-container>
The service principal’s <client-secret> , <client-application-id> , and <directory-id>
6. Log in to the account console.

7. Click Data .
8. Click Create Metastore .
a. Enter a name for the metastore.
b. Enter the region where the metastore will be deployed. For best performance, co-locate the
workspaces, metastore and cloud storage location in the same cloud region.
c. For ADLS Gen 2 path , enter the value of <storage-container> . The abfss:// prefix is added
automatically.
9. Click Create .
The user who creates a metastore is its owner. To change ownership of a metastore after creating it, see
(Recommended) Transfer ownership of your metastore to a group.
10. Make a note of the metastore’s ID. When you view the metastore’s properties, the metastore’s ID is the
portion of the URL after /data and before /configuration .
11. The metastore has been created, but Unity Catalog cannot yet write data to it. To finish setting up the
metastore:
a. In a separate browser, log in to a workspace that is assigned to the metastore as a workspace
admin.
b. Make a note of the workspace URL, which is the first portion of the URL, after https:// and
inclusive of azuredatabricks.net .
c. Generate a personal access token. See Generate a personal access token.
d. Add the personal access token to the .netrc file in your home directory. This improves security
by preventing the personal access token from appearing in your shell’s command history. See
Store tokens in a .netrc file and use them in curl.
e. Run the following cURL command to create the root storage credential for the metastore. Replace
the placeholder values:
<workspace-url> : The URL of the workspace where the personal access token was generated.
<credential-name> : A name for the storage credential.
<directory-id> : The directory ID for the service principal you created.
<application-id> : The application ID for the service principal you created.
<client-secret> : The value of the client secret you generated for the service principal (not the
client secret ID).

curl -n -X POST --header 'Content-Type: application/json' https://<workspace-url>/api/2.0/unity-catalog/storage-credentials --data "{
  \"name\": \"<credential-name>\",
  \"azure_service_principal\": {
    \"directory_id\": \"<directory-id>\",
    \"application_id\": \"<application-id>\",
    \"client_secret\": \"<client-secret>\"
  }
}"

Make a note of the storage credential ID, which is the value of id from the cURL command’s
response.
12. Run the following cURL command to update the metastore with the new root storage credential. Replace
the placeholder values:
<workspace-url> : The URL of the workspace where the personal access token was generated.
<metastore-id> : The metastore’s ID.
<storage-credential-id> : The storage credential’s ID from the previous command.

curl -n -X PATCH --header 'Content-Type: application/json' https://<workspace-url>/api/2.0/unity-catalog/metastores/<metastore-id> --data "{\"storage_root_credential_id\": \"<storage-credential-id>\"}"

You can now add workspaces to the metastore.

Next steps
Create and manage catalogs
Create and manage schemas (databases)
Create tables
Learn more about the key concepts of Unity Catalog
Create compute resources

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to create a Data Science & Engineering or Databricks Machine Learning cluster or a
Databricks SQL warehouse that can access data in Unity Catalog.

Requirements
Your Azure Databricks account must be on the Premium plan.
In a workspace, you must have permission to create compute resources.

Create a Data Science & Engineering cluster


A Data Science & Engineering cluster is designed for running general workloads, such as notebooks.
To create a Data Science & Engineering cluster that can access Unity Catalog:
1. Log in to the workspace as a workspace-level admin.

2. Click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. Set Databricks runtime version to Runtime: 10.3 (Scala 2.12, Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features, such as library installation, init scripts, and DBFS FUSE mounts, are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features or languages, or to run workloads using Python, Scala, and R, set
the security mode to Single User . A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default, the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced. A user selecting from a view executes with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .

Create a Databricks Machine Learning cluster


A Databricks Machine Learning cluster is purpose-built for machine-learning workloads. You can optionally
create a GPU-enabled Databricks Machine Learning cluster.
To create a Databricks Machine Learning cluster that can access Unity Catalog:
1. Log in to the workspace as a workspace-level admin.

2. In the Data Science & Engineering or Databricks Machine Learning persona, click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. For Databricks runtime version :
a. Click ML .
b. Select either 10.3 ML (Scala 2.12, Spark 3.2.1) or higher, or 10.3 ML (GPU, Scala 2.12,
Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User . To run Python code,
you must use Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features, such as library installation, init scripts, and DBFS FUSE mounts, are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features or languages, or to run workloads using Python, Scala, and R, set
the security mode to Single User . A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default, the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced. A user selecting from a view executes with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .

Create a Databricks SQL warehouse


A Databricks SQL warehouse is required to run workloads in Databricks SQL, such as queries, dashboards, and
visualizations.
To create a SQL warehouse that can access Unity Catalog data:
1. Log in to the workspace as a workspace-level admin.
2. From the persona switcher, select SQL .
3. Click Create , then select SQL Warehouse .
4. Under Advanced Settings set Channel to Preview .
SQL warehouses are automatically created with the correct security mode, with no configuration required.

Next steps
Create and manage catalogs
Create and manage schemas (databases)
Create tables
Create and manage catalogs

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to create and manage catalogs in Unity Catalog. A catalog contains schemas (databases),
and a schema contains tables and views.

Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium Plan.
You must have a Unity Catalog metastore linked to the workspace where you perform the catalog creation.
The compute resource that you use to run the notebook, Databricks SQL editor, or Data Explorer workflow to
create the catalog must be compliant with Unity Catalog security requirements.

Create a catalog
To create a catalog, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. Click the Create Catalog button.
5. Assign permissions for your catalog. See Manage privileges.
6. Click Save .
Sql
1. Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional.
Replace the placeholder values:
<catalog_name> : A name for the catalog.
<comment> : An optional comment.

CREATE CATALOG [ IF NOT EXISTS ] <catalog_name>
  [ COMMENT <comment> ];

For example, to create a catalog named example :

CREATE CATALOG IF NOT EXISTS example;

2. Assign privileges to the catalog. See Manage privileges.


Python
1. Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
values:
<catalog_name> : A name for the catalog.
<comment> : An optional comment.

spark.sql("CREATE CATALOG [ IF NOT EXISTS ] <catalog_name> [ COMMENT <comment> ]")

For example, to create a catalog named example :

spark.sql("CREATE CATALOG IF NOT EXISTS example")

2. Assign privileges to the catalog. See Manage privileges.


R
1. Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
values:
<catalog_name> : A name for the catalog.
<comment> : An optional comment.

library(SparkR)

sql("CREATE CATALOG [ IF NOT EXISTS ] <catalog_name> [ COMMENT <comment> ]")

For example, to create a catalog named example :

library(SparkR)

sql("CREATE CATALOG IF NOT EXISTS example")

2. Assign privileges to the catalog. See Manage privileges.


Scala
1. Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
values:
<catalog_name> : A name for the catalog.
<comment> : An optional comment.

spark.sql("CREATE CATALOG [ IF NOT EXISTS ] <catalog_name> [ COMMENT <comment> ]")

For example, to create a catalog named example :

spark.sql("CREATE CATALOG IF NOT EXISTS example")

2. Assign privileges to the catalog. See Manage privileges.


When you create a catalog, two schemas (databases) are automatically created: default and
information_schema .

Next steps
Now you can add schemas (databases) to your catalog.

Delete a catalog
To delete (or drop) a catalog, you can use the Data Explorer or a SQL command.
Data explorer
You must delete all schemas in the catalog except information_schema before you can delete a catalog. This
includes the auto-created default schema.
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. In the Data pane, on the left, click the catalog you want to delete.
5. In the detail pane, click the three-dot menu to the left of the Create database button and select Delete .
6. On the Delete catalog dialog, click Delete .
Sql
Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace
the placeholder <catalog_name> .
For parameter descriptions, see DROP CATALOG.
If you use DROP CATALOG without the CASCADE option, you must delete all schemas in the catalog except
information_schema before you can delete the catalog. This includes the auto-created default schema.

DROP CATALOG [ IF EXISTS ] <catalog_name> [ RESTRICT | CASCADE ]

For example, to delete a catalog named vaccine and its schemas:

DROP CATALOG vaccine CASCADE

Python
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<catalog_name> .

For parameter descriptions, see DROP CATALOG.


If you use DROP CATALOG without the CASCADE option, you must delete all schemas in the catalog except
information_schema before you can delete the catalog. This includes the auto-created default schema.

spark.sql("DROP CATALOG [ IF EXISTS ] <catalog_name> [ RESTRICT | CASCADE ]")

For example, to delete a catalog named vaccine and its schemas:

spark.sql("DROP CATALOG vaccine CASCADE")

R
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<catalog_name> .

For parameter descriptions, see DROP CATALOG.


If you use DROP CATALOG without the CASCADE option, you must delete all schemas in the catalog except
information_schema before you can delete the catalog. This includes the auto-created default schema.

library(SparkR)

rsql("DROP CATALOG [ IF EXISTS ] <catalog_name> [ RESTRICT | CASCADE ]")

For example, to delete a catalog named vaccine and its schemas:

library(SparkR)

sql("DROP CATALOG vaccine CASCADE")

Scala
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<catalog_name> .

For parameter descriptions, see DROP CATALOG.


If you use DROP CATALOG without the CASCADE option, you must delete all schemas in the catalog except
information_schema before you can delete the catalog. This includes the auto-created default schema.

spark.sql("DROP CATALOG [ IF EXISTS ] <catalog_name> [ RESTRICT | CASCADE ]")

For example, to delete a catalog named vaccine and its schemas:

spark.sql("DROP CATALOG vaccine CASCADE")


Create and manage schemas (databases)
7/21/2022 • 5 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to create and manage schemas (databases) in Unity Catalog. A schema contains tables
and views. You create schemas inside catalogs.

Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium Plan.
You must have a Unity Catalog metastore linked to the workspace where you perform the schema creation.
The compute resource that you use to run the notebook, Databricks SQL editor, or Data Explorer workflow to
create the schema must be compliant with Unity Catalog security requirements.
You must have the USAGE and CREATE data permissions on the schema’s parent catalog.

Create a schema
To create a schema (database), you can use the Data Explorer or SQL commands.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. In the Data pane on the left, click the catalog you want to create the schema in.
5. In the detail pane, click Create database .
6. Give the schema a name and add any comment that would help users understand the purpose of the
schema, then click Create .
7. Assign permissions for your catalog. See Manage privileges.
8. Click Save .
Sql
1. Run the following SQL commands in a notebook or Databricks SQL editor. Items in brackets are optional.
You can use either SCHEMA or DATABASE . Replace the placeholder values:
<catalog_name> : The name of the parent catalog for the schema.
<schema_name> : A name for the schema.
<comment> : An optional comment.
<property_name> = <property_value> [ , ... ] : The Spark SQL properties and values to set for the
schema.
For parameter descriptions, see CREATE SCHEMA.
USE CATALOG <catalog>;
CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <schema_name>
[ COMMENT <comment> ]
[ WITH DBPROPERTIES ( <property_name = property_value [ , ... ]> ) ];

You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
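For example, to create a schema named example_schema in a catalog named example (hypothetical names; this assumes the catalog already exists):

USE CATALOG example;
CREATE SCHEMA IF NOT EXISTS example_schema
COMMENT 'A schema for Unity Catalog examples';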
2. Assign privileges to the schema. See Manage privileges.
Python
1. Run the following SQL commands in a notebook. Items in brackets are optional. You can use either
SCHEMA or DATABASE . Replace the placeholder values:

<catalog_name> : The name of the parent catalog for the schema.


<schema_name> : A name for the schema.
<comment> : An optional comment.
<property_name> = <property_value> [ , ... ] : The Spark SQL properties and values to set for the
schema.
For parameter descriptions, see CREATE SCHEMA.

spark.sql("USE CATALOG <catalog>")

spark.sql("CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <schema_name> " \


"[ COMMENT <comment> ] " \
"[ WITH DBPROPERTIES ( <property_name = property_value [ , ... ]> ) ]")

You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
2. Assign privileges to the schema. See Manage privileges.
R
1. Run the following SQL commands in a notebook. Items in brackets are optional. You can use either
SCHEMA or DATABASE . Replace the placeholder values:

<catalog_name> : The name of the parent catalog for the schema.


<schema_name> : A name for the schema.
<comment> : An optional comment.
<property_name> = <property_value> [ , ... ] : The Spark SQL properties and values to set for the
schema.
For parameter descriptions, see CREATE SCHEMA.

library(SparkR)

sql("USE CATALOG <catalog>")

sql(paste("CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <schema_name> ",


"[ COMMENT <comment> ] ",
"[ WITH DBPROPERTIES ( <property_name = property_value [ , ... ]> ) ]",
sep = ""))

You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
2. Assign privileges to the schema. See Manage privileges.
Scala
1. Run the following SQL commands in a notebook. Items in brackets are optional. You can use either
SCHEMA or DATABASE . Replace the placeholder values:

<catalog_name> : The name of the parent catalog for the schema.


<schema_name> : A name for the schema.
<comment> : An optional comment.
<property_name> = <property_value> [ , ... ] : The Spark SQL properties and values to set for the
schema.
For parameter descriptions, see CREATE SCHEMA.

spark.sql("USE CATALOG <catalog>")

spark.sql("CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <schema_name> " +


"[ COMMENT <comment> ] " +
"[ WITH DBPROPERTIES ( <property_name = property_value [ , ... ]> ) ]")

You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
2. Assign privileges to the schema. See Manage privileges.
Next steps
Now you can add tables to your schema.

Delete a schema
To delete (or drop) a schema (database), you can use the Data Explorer or a SQL command.
Data explorer
You must delete all tables in the schema before you can delete it.
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. In the Data pane, on the left, click the schema (database) that you want to delete.
5. In the detail pane, click the three-dot menu in the upper right corner and select Delete .
6. On the Delete Database dialog, click Delete .
Sql
Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace
the placeholder <schema_name> .
For parameter descriptions, see DROP SCHEMA.
If you use DROP SCHEMA without the CASCADE option, you must delete all tables in the schema before you can
delete it.

DROP SCHEMA [ IF EXISTS ] <schema_name> [ RESTRICT | CASCADE ]

For example, to delete a schema named inventory_schema and its tables:


DROP SCHEMA inventory_schema CASCADE

Python
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<schema_name> .

For parameter descriptions, see DROP SCHEMA.


If you use DROP SCHEMA without the CASCADE option, you must delete all tables in the schema before you can
delete it.

spark.sql("DROP SCHEMA [ IF EXISTS ] <schema_name> [ RESTRICT | CASCADE ]")

For example, to delete a schema named inventory_schema and its tables:

spark.sql("DROP SCHEMA inventory_schema CASCADE")

R
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<schema_name> .

For parameter descriptions, see DROP SCHEMA.


If you use DROP SCHEMA without the CASCADE option, you must delete all tables in the schema before you can
delete it.

library(SparkR)

sql("DROP SCHEMA [ IF EXISTS ] <schema_name> [ RESTRICT | CASCADE ]")

For example, to delete a schema named inventory_schema and its tables:

library(SparkR)

sql("DROP SCHEMA inventory_schema CASCADE")

Scala
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<schema_name> .

For parameter descriptions, see DROP SCHEMA.


If you use DROP SCHEMA without the CASCADE option, you must delete all tables in the schema before you can
delete it.

spark.sql("(DROP SCHEMA [ IF EXISTS ] <schema_name> [ RESTRICT | CASCADE ]")

For example, to delete a schema named inventory_schema and its tables:

spark.sql("DROP SCHEMA inventory_schema CASCADE")


Manage identities in Unity Catalog
7/21/2022 • 8 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article describes how to manage users, groups, and service principals in Azure Databricks accounts that use
Unity Catalog.

What is the identity management model for Unity Catalog?


In Unity Catalog, you manage identities at the account level, not the workspace level. This section describes the
identity types available in Azure Databricks and explains how the account-centric identity model works in Unity
Catalog.
Account-level identities
There are three types of account-level identities:
Users : Account-level users can be either account admins or account users.
Account admins can manage account-level configurations like workspace creation, network and storage
configuration, audit logging, billing, and identity management. They can also manage Unity Catalog
metastores and data access.
Account admins can assign users to any workspaces that are assigned to a Unity Catalog metastore.
Service principals : Service principals are identities for use with automated tools. An account admin
can assign service principals to workspaces that are assigned to a Unity Catalog metastore, where they
can be used to run jobs and applications.
Groups : Groups simplify identity management, making it easier to assign access to workspaces, data,
and other securable objects.
How do account admins assign users to workspaces?
Unity Catalog uses a federated identity model, enabling you to manage all of your users, service principals, and
groups at the account level and assign workspace access to those account-level identities.
You can add users, service principals, and groups to your account using the account console UI, Azure Active
Directory sync, or the account-level SCIM API. These users can be account admins, who have the ability to
perform account-level administrative tasks (like adding users), or regular users, who don’t.
To enable an account-level identity to work in Azure Databricks, you need to assign them to one or more
workspaces. You can assign an account-level user to a workspace as a workspace user or as a workspace admin.
Workspace admins have the ability to perform many administrative functions for that workspace (like
controlling access to cluster creation or shared notebooks).
Conversely, if you add a user or service principal directly to a workspace that is assigned to a Unity Catalog
metastore, that user or service principal will automatically be synced to an account-level identity. Azure
Databricks does not recommend this upstream flow; it’s more straightforward to add your users at the account
level and then assign them to workspaces. Workspace groups are not synced automatically as account-level
groups. If you want to replicate a workspace-local group as an account-level group, you must do so manually.
In order to assign an account-level user to a workspace, that workspace must be enabled for Unity Catalog. In
other words, it has to be assigned to a Unity Catalog metastore. You don’t have to assign all of your workspaces
to Unity Catalog metastores. For those workspaces that aren’t assigned to a Unity Catalog metastore, you
manage your workspace users and workspace admins entirely within the scope of the workspace (the legacy
model). You can’t use the account console or APIs to assign account-level users to these workspaces.
What happens when I enable an existing workspace for Unity Catalog?
When you enable an existing workspace for Unity Catalog, all of the existing workspace users and admins are
synced automatically to the account as users, and you can start managing them (updating, removing, assigning
admin permissions, assigning them to other workspaces) at the account level. If the workspace user shares a
username (email address) with an account-level user or admin that already exists, those users are merged.
How does the Unity Catalog identity management model differ from accounts that don’t use Unity Catalog?
If you don’t use Unity Catalog, you cannot use account-level interfaces to assign users, service principals, and
groups to a workspace. You have to add and manage your workspace users and groups entirely from within the
workspace.
How does SCIM provisioning work differently for workspace - and account-level identities?
If you use account-level identities to manage users in workspaces (which requires that the workspace be
assigned to a Unity Catalog metastore), you should turn off any SCIM provisioning that you have set up to add,
update, and delete users, service principals, or groups in those workspaces. You should instead set up SCIM
provisioning at the account level. Workspace-level SCIM provisioning in this case may result in errors for some
operations (for example, provisioning new groups).
If your workspace is not identity-federated, you can continue to use workspace-level SCIM provisioning.
How do you manage SSO for workspace - and account-level identities?
SSO for account-level and workspace-level identities must be managed separately.

Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium plan.
You must have a Unity Catalog metastore.
To manage account-level identities, you must be an account-level admin.
The initial account-level admin must be a member of the root management group for your Azure tenant.
The initial account-level admin can grant others the Admin role in the account console, as long as those
users have been added to Azure Databricks workspaces in the same Azure tenant.
When a user is added to a workspace, that user also becomes available in the account console and can be
granted the account admin role.
Service principals are not supported as account-level identities.

Manage account-level identities manually


A Unity Catalog metastore can be shared across multiple Databricks workspaces. So that Databricks has a
consistent view of users and groups across all workspaces, you can now add Azure Active Directory users and
groups as account-level identities. Follow these steps to add account-level identities.

NOTE
Users and groups must be added as account-level identities before they can access Unity Catalog.

1. The initial account-level admin must be a Contributor in the Azure Active Directory root management
group, which is named Tenant root group by default. An Azure Active Directory Global Administrator
can add themselves to this group. Grant yourself this role, or ask an Azure Active Directory Global
Administrator to grant it to you.
The initial account-level admin can add users or groups to the account console, and can designate other
account-level admins by granting the Admin role to users.
2. All Azure Active Directory users who have been added to workspaces in your Azure tenant are
automatically added as account-level identities.
3. To designate additional account-level admins, you grant users the Admin role.

NOTE
It is not possible to grant the Admin role to a group.

a. Log in to the account console by clicking Settings , then clicking Manage account .
b. Click Users and Groups . A list of Azure Active Directory users appears. Only users and groups
who have been added to workspaces are shown.
c. Click the name of a user.
d. Click Roles .
e. Enable Admin .

Sync account-level identities from Azure Active Directory


You can sync account-level users and groups from your Azure Active Directory (Azure AD) tenant to Unity
Catalog, rather than manually adding users and groups.
Requirements
You must be a global administrator for the Azure Active Directory account.
Your Azure Active Directory account must be a Premium edition account to provision groups. Provisioning
users is available for any Azure Active Directory edition.
Configure Databricks
1. Log in to the Databricks account console.
2. Click Settings .
3. Click User Provisioning .
4. Click Enable user provisioning .
Make a note of the SCIM token and the account SCIM URL. You will use these to configure your Azure AD
application.
Configure the enterprise application
Follow these steps to enable Azure AD to sync users and groups to your Databricks account. This configuration
is separate from any configurations you have created to sync users and groups to workspaces.
1. In your Azure portal, go to Azure Active Directory > Enterprise Applications .
2. Click + New Application above the application list. Under Add from the gallery, search for and select
Azure Databricks SCIM Provisioning Connector .
3. Enter a Name for the application and click Add .
4. Under the Manage menu, click Provisioning .
5. Set Provisioning Mode to Automatic.
6. Set the SCIM API endpoint URL to the Account SCIM URL that you copied earlier.
7. Set Secret Token to the Azure Databricks SCIM token that you generated earlier.
8. Click Test Connection and wait for the message that confirms that the credentials are authorized to enable
provisioning.
9. Click Save .
Assign users and groups to the application
Users and groups assigned to the SCIM application will be provisioned to the Databricks account. If you have
existing Azure Databricks workspaces, Databricks recommends that you add all existing users and groups in
those workspaces to the SCIM application.
1. Go to Manage > Provisioning .
2. Under Settings , set Scope to Sync only assigned users and groups . Databricks recommends this
option, which syncs only users and groups assigned to the enterprise application.
3. To start synchronizing Azure Active Directory users and groups to Azure Databricks, click the Provisioning
Status toggle.
4. Click Save .
5. Go to Manage > Users and groups .
6. Add some users and groups. Click Add user, select the users and groups, and click the Assign button.
7. Wait a few minutes and check that the users and groups exist in your Azure Databricks account.
In the future, users and groups that you add and assign will automatically be provisioned to the Databricks
account when Azure Active Directory schedules the next sync.
Rotate the SCIM token
If the account-level SCIM token is compromised or if you have business requirements to rotate authentication
tokens periodically, you can rotate the SCIM token.
1. Log in to the Databricks account console.
2. Click Settings .
3. Click User Provisioning .
4. Click Regenerate token . Make a note of the new token. The previous token will continue to work for 24
hours.
5. Within 24 hours, update your SCIM application in Azure AD to use the new SCIM token.
Create tables
7/21/2022 • 15 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to create tables in Unity Catalog. A table can be managed or external.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
The metastore must be linked to the workspace where you run the commands to create the catalog or
schema.
You can create a catalog or schema from a Data Science & Engineering notebook or a Databricks SQL
endpoint that uses a compute resource that is compliant with Unity Catalog’s security requirements.
If necessary, create a catalog and schema in the metastore. The catalog and schema will contain the new
table. To create a schema, you must have the USAGE and CREATE data permissions on the schema’s parent
catalog. When you create a metastore, it contains a catalog called main with an empty schema called
default . See Create and manage catalogs and Create and manage schemas (databases).
If necessary, load the data into files on your cloud tenant, and create an external location in the metastore
that grants access to the files.
To create a table, you must have the USAGE permission on the parent catalog and the USAGE and CREATE
permissions on the parent schema.
For additional requirements to create an external table, see Create an external table.

Managed table
Managed tables are the default way to create tables in Unity Catalog. These tables are stored in the managed
storage location you configured when you created each metastore. To create a managed table with SQL, run a
CREATE TABLE command without a LOCATION clause. To delete a managed table with SQL, use the DROP TABLE
statement. When a managed table is dropped, its underlying data is deleted from your cloud tenant. The only
supported format for managed tables is Delta.
Example SQL syntax:

CREATE TABLE <example-table>


(id STRING, value STRING);

External table
External tables are tables whose data is stored in a storage location outside of the managed storage location,
and are not fully managed by Unity Catalog. When you run DROP TABLE on an external table, Unity Catalog does
not delete the underlying data. You can manage privileges on external tables and use them in queries in the
same way as managed tables. To create an external table with SQL, specify a LOCATION path in your
CREATE TABLE statement. External tables can use the following file formats:
DELTA
CSV
JSON
AVRO
PARQUET
ORC
TEXT
To manage access to the underlying cloud storage for an external table, Unity Catalog introduces two new object
types: storage credentials and external locations.
To learn more, see Create an external table.

Create a managed table


To create a managed table, run the following SQL command. You can also use the example notebook to create a
table. Items in brackets are optional. Replace the placeholder values:
<catalog_name> : The name of the catalog.
<schema_name> : The name of the schema.
<table_name> : A name for the table.
<column_specification> : The name and data type for each column.

SQL

CREATE TABLE <catalog_name>.<schema_name>.<table_name>


(
<column_specification>
);

Python

spark.sql("CREATE TABLE <catalog_name>.<schema_name>.<table_name> "


"("
" <column_specification>"
")")

R
library(SparkR)

sql(paste("CREATE TABLE <catalog_name>.<schema_name>.<table_name> ",


"(",
" <column_specification>",
")",
sep = ""))

Scala

spark.sql("CREATE TABLE <catalog_name>.<schema_name>.<table_name> " +


"(" +
" <column_specification>" +
")")
For example, to create the table main.default.department and insert five rows into it:
SQL

CREATE TABLE main.default.department


(
deptcode INT,
deptname STRING,
location STRING
);

INSERT INTO main.default.department VALUES


(10, 'FINANCE', 'EDINBURGH'),
(20, 'SOFTWARE', 'PADDINGTON'),
(30, 'SALES', 'MAIDSTONE'),
(40, 'MARKETING', 'DARLINGTON'),
(50, 'ADMIN', 'BIRMINGHAM');

Python

spark.sql("CREATE TABLE main.default.department "


"("
" deptcode INT,"
" deptname STRING,"
" location STRING"
")"
"INSERT INTO main.default.department VALUES "
" (10, 'FINANCE', 'EDINBURGH'),"
" (20, 'SOFTWARE', 'PADDINGTON'),"
" (30, 'SALES', 'MAIDSTONE'),"
" (40, 'MARKETING', 'DARLINGTON'),"
" (50, 'ADMIN', 'BIRMINGHAM')")

R
library(SparkR)

sql(paste("CREATE TABLE main.default.department ",
  "(",
  "  deptcode INT,",
  "  deptname STRING,",
  "  location STRING",
  ")",
  sep = ""))

sql(paste("INSERT INTO main.default.department VALUES ",
  "  (10, 'FINANCE', 'EDINBURGH'),",
  "  (20, 'SOFTWARE', 'PADDINGTON'),",
  "  (30, 'SALES', 'MAIDSTONE'),",
  "  (40, 'MARKETING', 'DARLINGTON'),",
  "  (50, 'ADMIN', 'BIRMINGHAM')",
  sep = ""))

Scala
spark.sql("CREATE TABLE main.default.department " +
"(" +
" deptcode INT," +
" deptname STRING," +
" location STRING" +
")" +
"INSERT INTO main.default.department VALUES " +
" (10, 'FINANCE', 'EDINBURGH')," +
" (20, 'SOFTWARE', 'PADDINGTON')," +
" (30, 'SALES', 'MAIDSTONE')," +
" (40, 'MARKETING', 'DARLINGTON')," +
" (50, 'ADMIN', 'BIRMINGHAM')")

Example notebook
You can use the following example SQL notebook to create a catalog, schema, and table, and to manage
permissions on them.
Create and manage a table in Unity Catalog
Get notebook

Create an external table


The data in an external table is stored in a path on your cloud tenant. To work with external tables, Unity Catalog
introduces two new objects to access and work with external cloud storage:
A storage credential contains an authentication method for accessing a cloud storage location. The storage
credential does not contain a mapping to the path to which it grants access. Storage credentials are access-
controlled to determine which users can use the credential. To use an external storage credential directly, add
WITH <credential_name> to your SQL command.
An external location maps a storage credential with a cloud storage path to which it grants access. The
external location grants access only to that cloud storage path and its contents. External locations are access-
controlled to determine which users can use them. An external location is used automatically when your SQL
command contains a LOCATION clause and no WITH <credential_name> clause.
Requirements
To create an external table, you must have:
The CREATE TABLE privilege on an external location or storage credential that grants access to the
LOCATION accessed by the external table.
The USAGE permission on the table’s parent catalog and schema.
The CREATE permission on the table’s parent schema.

External locations and storage credentials are stored in the top level of the metastore, rather than in a catalog. To
create a storage credential or an external location, you must be the metastore’s owner or an account-level
admin.
To create an external table, follow these high-level steps. You can also use an example notebook to create the
storage credential, external location, and external table and manage permissions for them.
Create a storage credential

NOTE
In the Databricks SQL Data Explorer, you can create storage credentials and view their details, but you cannot modify,
rotate, or delete them. For those operations, you can use the Databricks SQL editor or a notebook.
1. In the Azure Portal, create a service principal.
a. Create a client secret for the service principal and make a note of it.
b. Make a note of the directory ID and application ID for the service principal.
c. In your storage account, grant the service principal the Azure Blob Contributor role.
2. In Azure Databricks, log in to a workspace that is linked to the metastore.
3. From the persona switcher, select SQL
4. Click Data .
5. At the bottom of the screen, click Storage Credentials .
6. Click Create credential .
7. Enter a name for the credential.
8. Enter the directory ID, application ID, and client secret of a service principal that has been granted the Azure
Blob Contributor role on the relevant storage container.
9. Optionally, enter a comment for the storage credential.
10. Click Save .
You can grant permissions directly on the storage credential, but Databricks recommends that you reference it in
an external location and grant permissions to that instead. An external location combines a storage credential
with a specific path, and authorizes access only to that path and its contents.
Manage permissions on a storage credential
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
5. Click the name of a storage credential to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
Create an external location
1. From the persona switcher, select SQL .
2. Click Data .
3. Click the + menu at the upper right and select Add an external location .
4. Click Create location .
a. Enter a name for the location.
b. Optionally copy the storage container path from an existing mount point.
c. If you aren’t copying from an existing mount point, enter a storage container path.
d. Select the storage credential that grants access to the location.
e. Click Save .
Manage permissions on an external location
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click External Locations .
5. Click the name of an external location to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
Create an external table
You can create an external table using an external location (recommended) or using a storage credential directly.
In the following examples, replace the placeholder values:
<catalog> : The name of the catalog that will contain the table.
<schema> : The name of the schema that will contain the table.
<table_name> : A name for the table.
<column_specification> : The name and data type for each column.
<bucket_path> : The path on your cloud tenant where the table will be created.
<table_directory> : A directory where the table will be created. Use a unique directory for each table.

IMPORTANT
Once a table is created in a path, users can no longer directly access the files in that path even if they have been given
privileges on an external location or storage credential to do so. This is to ensure that users cannot circumvent access
controls applied to tables by reading files from your cloud tenant directly.

Create an external table using an external location


To create an empty external table using an external location:
SQL

CREATE TABLE <catalog>.<schema>.<table_name>


(
<column_specification>
)
LOCATION 'abfss://<bucket_path>/<table_directory>';
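For example, using hypothetical table and path names (a sketch only; substitute your own container, storage account, and directory):

CREATE TABLE main.default.department_ext
(
  deptcode INT,
  deptname STRING,
  location STRING
)
LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/tables/department_ext';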

Python

spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "


"("
" <column_specification>"
") "
"LOCATION 'abfss://<bucket_path>/<table_directory>'")

R
library(SparkR)

sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",


"(",
" <column_specification>",
") ",
"LOCATION 'abfss://<bucket_path>/<table_directory>'",
sep = ""))

Scala

spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +


"(" +
" <column_specification>" +
") " +
"LOCATION 'abfss://<bucket_path>/<table_directory>'")
Unity Catalog checks that you have the following permissions:
CREATE TABLE on the external location that references the cloud storage path you specify.
CREATE on the parent schema.
USAGE on the parent catalog.

If so, the external table is created. Otherwise, an error occurs and the external table is not created.

NOTE
You can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See
Upgrade an external table to Unity Catalog.

Create an external table using a storage credential directly

NOTE
Databricks recommends that you use external locations, rather than using storage credentials directly.

To create an external table using a storage credential, add a WITH (CREDENTIAL <credential_name>) clause to your
SQL statement:
SQL

CREATE TABLE <catalog>.<schema>.<table_name>


(
<column_specification>
)
LOCATION 'abfss://<bucket_path>/<table_directory>'
WITH (CREDENTIAL <storage_credential>);

Python

spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "


"( "
" <column_specification> "
") "
"LOCATION 'abfss://<bucket_path>/<table_directory>' "
"WITH (CREDENTIAL <storage_credential>)")

R
library(SparkR)

sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",


"( ",
" <column_specification> ",
") ",
"LOCATION 'abfss://<bucket_path>/<table_directory>' ",
"WITH (CREDENTIAL <storage_credential>)",
sep = ""))

Scala
spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +
"( " +
" <column_specification> " +
") " +
"LOCATION 'abfss://<bucket_path>/<table_directory>' " +
"WITH (CREDENTIAL <storage_credential>)")

Unity Catalog checks whether you have the CREATE TABLE permission on the storage credential you specify, and
whether the storage credential authorizes reading from and writing to the location you specified in the
LOCATION clause. If both of these things are true, the external table is created. Otherwise, an error occurs and the
external table is not created.
Example notebook
Create and manage an external table in Unity Catalog
Get notebook

Create a table from files stored in your cloud tenant


You can populate a managed or external table with records from files stored in your cloud tenant. Unity Catalog
reads the files at that location and inserts their contents into the table. In Unity Catalog, this is called path-based access.

NOTE
A storage path where you create an external table cannot also be used to read or write data files.

Explore the contents of the files


To explore data stored in an external location:
1. List the files in a cloud storage path:
SQL

LIST 'abfss://<path_to_files>';

Python

display(spark.sql("LIST 'abfss://<path_to_files>'"))

R
library(SparkR)

display(sql("LIST 'abfss://<path_to_files>'"))

Scala

display(spark.sql("LIST 'abfss://<path_to_files>'"))

If you have the READ FILES permission on the external location associated with the cloud storage path, a
list of data files in that location is returned.
2. Query the data in the files in a given path:
SQL
SELECT * FROM <format>.`abfss://<path_to_files>`;
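For example, to preview Parquet files at a hypothetical path (substitute your own container, storage account, and directory):

SELECT * FROM parquet.`abfss://container@storageaccount.dfs.core.windows.net/raw/events`;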

Python

display(spark.read.load("abfss:://<path_to_files>"))

library(SparkR)

display(loadDF("abfss:://<path_to_files>"))

Scala

display(spark.read.load("abfss:://<path_to_files>"))

To explore data using a storage credential directly:


SQL

SELECT * FROM <format>.`abfss://<path_to_files>`
WITH (CREDENTIAL <storage_credential>);

Python

display(spark.sql("SELECT * FROM <format>.`abfss://<path_to_files>` "


"WITH (CREDENTIAL <storage_credential)"))

R
library(SparkR)

display(sql(paste("SELECT * FROM <format>.`abfss://<path_to_files>` ",
  "WITH (CREDENTIAL <storage_credential>)",
  sep = "")))

Scala

display(spark.sql("SELECT * FROM <format>.`abfss://<path_to_files>` " +


"WITH (CREDENTIAL <storage_credential)"))

Create a table from the files

NOTE
You can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See
Upgrade an external table to Unity Catalog.

1. Create a new table and populate it with records from data files on your cloud tenant.
IMPORTANT
When you create a table using this method, the storage path is read only once, to prevent duplication of
records. If you want to re-read the contents of the directory, you must drop and re-create the table. For an
existing table, you can insert records from a storage path.
The bucket path where you create a table cannot also be used to read or write data files.
Only the files in the exact directory are read; the read is not recursive.
You must have the following permissions:
USAGE on the parent catalog and schema.
CREATE on the parent schema.
READ FILES on the external location associated with the bucket path where the files are located, or
directly on the storage credential if you are not using an external location.
If you are creating an external table, you need CREATE TABLE on the bucket path where the table will
be created.

To create a managed table and populate it with records from a bucket path:
SQL

CREATE TABLE <catalog>.<schema>.<table_name>


USING delta
(
<column_specification>
)
SELECT * from delta.`abfss://<path_to_files>`;

Python

spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "


"USING delta "
"( "
" <column_specification> "
") "
"SELECT * from delta.`abfss://<path_to_files>`")

R
library(SparkR)

sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",


"USING delta ",
"( ",
" <column_specification> ",
") ",
"SELECT * from delta.`abfss://<path_to_files>`",
sep = ""))

Scala

spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +


"USING delta " +
"( " +
" <column_specification> " +
") " +
"SELECT * from delta.`abfss://<path_to_files>`")

To use a storage credential, add WITH (CREDENTIAL <storage_credential>) to the command.


To create an external table and populate it with records from a bucket path, add a LOCATION clause:
SQL

CREATE TABLE <catalog>.<schema>.<table_name>


USING delta
(
<column_specification>
)
LOCATION `abfss://<table_location>`
SELECT * from <format>.`abfss://<path_to_files>`;

Python

spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "


"USING delta "
"( "
" <column_specification> "
") "
"LOCATION `abfss://<table_location>` "
"SELECT * from <format>.`abfss://<path_to_files>`")

R
library(SparkR)

sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",


"USING delta ",
"( ",
" <column_specification> ",
") ",
"LOCATION `abfss://<table_location>` ",
"SELECT * from <format>.`abfss://<path_to_files>`",
sep = ""))

Scala

spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +


"USING delta " +
"( " +
" <column_specification> " +
") " +
"LOCATION `abfss://<table_location>` " +
"SELECT * from <format>.`abfss://<path_to_files>`")

To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.

Insert records from a path into an existing table


To insert records from a bucket path into an existing table, use the COPY INTO command. In the following
examples, replace the placeholder values:
<catalog> : The name of the table’s parent catalog.
<schema> : The name of the table’s parent schema.
<path_to_files> : The bucket path that contains the data files.
<format> : The format of the files.
<table_location> : The bucket path where the table will be created.
<storage_credential> : If you are using a storage credential directly, the name of the storage credential that
authorizes reading from or writing to the bucket path.
IMPORTANT
When you insert records into a table using this method, the bucket path you provide is read only once, to prevent
duplication of records.
The bucket path where you create a table cannot also be used to read or write data files.
Only the files in the exact directory are read; the read is not recursive.
You must have the following permissions:
USAGE on the parent catalog and schema.
MODIFY on the table.
READ FILES on the external location associated with the bucket path where the files are located, or directly on
the storage credential if you are not using an external location.
To insert records into an external table, you need CREATE TABLE on the bucket path where the table is
located.

To insert records from files in a bucket path into a managed table, using an external location to read from
the bucket path:
SQL

COPY INTO <catalog>.<schema>.<table>


FROM (
SELECT *
FROM 'abfss://<path_to_files>'
)
FILEFORMAT = <format>;
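For example, to load Parquet files from a hypothetical path into the main.default.department table created earlier (a sketch only, assuming the files have the same column names and types as the table; substitute your own path and format):

COPY INTO main.default.department
FROM (
  SELECT *
  FROM 'abfss://container@storageaccount.dfs.core.windows.net/departments'
)
FILEFORMAT = PARQUET;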

Python

spark.sql("COPY INTO <catalog>.<schema>.<table> "


"FROM ( "
" SELECT * "
" FROM 'abfss://<path_to_files>' "
") "
"FILEFORMAT = <format>")

R
library(SparkR)

sql(paste("COPY INTO <catalog>.<schema>.<table> ",


"FROM ( ",
" SELECT * ",
" FROM 'abfss://<path_to_files>' ",
") ",
"FILEFORMAT = <format>",
sep = ""))

Scala

spark.sql("COPY INTO <catalog>.<schema>.<table> " +


"FROM ( " +
" SELECT * " +
" FROM 'abfss://<path_to_files>' " +
") " +
"FILEFORMAT = <format>")
To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.
To insert into an external table, add a LOCATION clause:
SQL

COPY INTO <catalog>.<schema>.<table>


LOCATION `abfss://<table_location>`
FROM (
SELECT *
FROM 'abfss://<path_to_files>'
)
FILEFORMAT = <format>;

Python

spark.sql("COPY INTO <catalog>.<schema>.<table> "


"LOCATION `abfss://<table_location>` "
"FROM ( "
" SELECT * "
" FROM 'abfss://<path_to_files>' "
") "
"FILEFORMAT = <format>")

R
library(SparkR)

sql(paste("COPY INTO <catalog>.<schema>.<table> ",


"LOCATION `abfss://<table_location>` ",
"FROM ( ",
" SELECT * ",
" FROM 'abfss://<path_to_files>' ",
") ",
"FILEFORMAT = <format>",
sep = ""))

Scala

spark.sql("COPY INTO <catalog>.<schema>.<table> " +


"LOCATION `abfss://<table_location>` " +
"FROM ( " +
" SELECT * " +
" FROM 'abfss://<path_to_files>' " +
") " +
"FILEFORMAT = <format>")

To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.

Next steps
Manage access to data
Create views
7/21/2022 • 7 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to create views in Unity Catalog. A view is a read-only object composed from one or
more tables and views.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
The metastore must be linked to the workspace where you run the commands to create the catalog or
schema.
If necessary, create a catalog and schema in the metastore. The catalog and schema will contain the new
table. To create a schema, you must have the USAGE and CREATE data permissions on the schema’s
parent catalog. See Create and manage catalogs and Create and manage schemas (databases).
If necessary, create tables that will compose the view.
You can create a view from a Data Science & Engineering notebook or a Databricks SQL endpoint that
uses a compute resource that is compliant with Unity Catalog’s security requirements.
The parent catalog and schema must exist. When you create a metastore, it contains a catalog called
main with an empty schema called default . See Create and manage catalogs and Create and manage
schemas (databases).
You must have the USAGE permission on the parent catalog and the USAGE and CREATE permissions on
the parent schema.
You must have SELECT access on all tables and views referenced in the view.

NOTE
To read from a view from a cluster with Single User cluster security mode, you must have SELECT on all
referenced tables and views.

Create a view
To create a view, run the following SQL command. Items in brackets are optional. Replace the placeholder values:
<catalog_name> : The name of the catalog.
<schema_name> : The name of the schema.
<view_name> : A name for the view.
<query> : The query, columns, and tables and views used to compose the view.
SQL

CREATE VIEW <catalog_name>.<schema_name>.<view_name> AS


SELECT <query>;

Python

spark.sql("CREATE VIEW <catalog_name>.<schema_name>.<view_name> AS "


"SELECT <query>")

R
library(SparkR)

sql(paste("CREATE VIEW <catalog_name>.<schema_name>.<view_name> AS ",


"SELECT <query>",
sep = ""))

Scala

spark.sql("CREATE VIEW <catalog_name>.<schema_name>.<view_name> AS " +


"SELECT <query>")

For example, to create a view named sales_redacted from columns in the sales_raw table:
SQL

CREATE VIEW sales_metastore.sales.sales_redacted AS


SELECT
user_id,
email,
country,
product,
total
FROM sales_metastore.sales.sales_raw;

Python

spark.sql("CREATE VIEW sales_metastore.sales.sales_redacted AS "


"SELECT "
" user_id, "
" email, "
" country, "
" product, "
" total "
"FROM sales_metastore.sales.sales_raw")

R
library(SparkR)

sql(paste("CREATE VIEW sales_metastore.sales.sales_redacted AS ",


"SELECT ",
" user_id, ",
" email, ",
" country, ",
" product, ",
" total ",
"FROM sales_metastore.sales.sales_raw",
sep = ""))

Scala

spark.sql("CREATE VIEW sales_metastore.sales.sales_redacted AS " +


"SELECT " +
" user_id, " +
" email, " +
" country, " +
" product, " +
" total " +
"FROM sales_metastore.sales.sales_raw")

Create a dynamic view


In Unity Catalog, you can use dynamic views to configure fine-grained access control, including:
Security at the level of columns or rows.
Data masking.

NOTE
Fine-grained access control using dynamic views is not available on clusters with Single User security mode.

Unity Catalog introduces the following functions, which allow you to dynamically limit which users can access a
row, column, or record in a view:
current_user(): Returns the current user’s email address.
is_account_group_member() : Returns TRUE if the current user is a member of a specific account-level group.
Recommended for use in dynamic views against Unity Catalog data.
is_member() : Returns TRUE if the current user is a member of a specific workspace-level group. This function
is provided for compatibility with the existing Hive metastore. Avoid using it with views against Unity Catalog
data, because it does not evaluate account-level group membership.
The following examples illustrate how to create dynamic views in Unity Catalog.
Column-level permissions
With a dynamic view, you can limit the columns a specific user or group can access. In the following example,
only members of the auditors group can access email addresses from the sales_raw table. During query
analysis, Apache Spark replaces the CASE statement with either the literal string REDACTED or the actual
contents of the email address column. Other columns are returned as normal. This strategy has no negative
impact on the query performance.
SQL
-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
user_id,
CASE WHEN
is_account_group_member('auditors') THEN email
ELSE 'REDACTED'
END AS email,
country,
product,
total
FROM sales_raw

Python

# Alias the field 'email' to itself (as 'email') to prevent the


# permission logic from showing up directly in the column name results.
spark.sql("CREATE VIEW sales_redacted AS "
"SELECT "
" user_id, "
" CASE WHEN "
" is_account_group_member('auditors') THEN email "
" ELSE 'REDACTED' "
" END AS email, "
" country, "
" product, "
" total "
"FROM sales_raw")

R
library(SparkR)

# Alias the field 'email' to itself (as 'email') to prevent the


# permission logic from showing up directly in the column name results.
sql(paste("CREATE VIEW sales_redacted AS ",
"SELECT ",
" user_id, ",
" CASE WHEN ",
" is_account_group_member('auditors') THEN email ",
" ELSE 'REDACTED' ",
" END AS email, ",
" country, ",
" product, ",
" total ",
"FROM sales_raw",
sep = ""))

Scala
// Alias the field 'email' to itself (as 'email') to prevent the
// permission logic from showing up directly in the column name results.
spark.sql("CREATE VIEW sales_redacted AS " +
"SELECT " +
" user_id, " +
" CASE WHEN " +
" is_account_group_member('auditors') THEN email " +
" ELSE 'REDACTED' " +
" END AS email, " +
" country, " +
" product, " +
" total " +
"FROM sales_raw")

Row-level permissions
With a dynamic view, you can specify permissions down to the row or field level. In the following example, only
members of the managers group can view transaction amounts when they exceed $1,000,000. Matching results
are filtered out for other users.
SQL

CREATE VIEW sales_redacted AS


SELECT
user_id,
country,
product,
total
FROM sales_raw
WHERE
CASE
WHEN is_account_group_member('managers') THEN TRUE
ELSE total <= 1000000
END;

Python

spark.sql("CREATE VIEW sales_redacted AS "


"SELECT "
" user_id, "
" country, "
" product, "
" total "
"FROM sales_raw "
"WHERE "
"CASE "
" WHEN is_account_group_member('managers') THEN TRUE "
" ELSE total <= 1000000 "
"END")

R
library(SparkR)

sql(paste("CREATE VIEW sales_redacted AS ",


"SELECT ",
" user_id, ",
" country, ",
" product, ",
" total ",
"FROM sales_raw ",
"WHERE ",
"CASE ",
" WHEN is_account_group_member('managers') THEN TRUE ",
" ELSE total <= 1000000 ",
"END",
sep = ""))

Scala

spark.sql("CREATE VIEW sales_redacted AS " +


"SELECT " +
" user_id, " +
" country, " +
" product, " +
" total " +
"FROM sales_raw " +
"WHERE " +
"CASE " +
" WHEN is_account_group_member('managers') THEN TRUE " +
" ELSE total <= 1000000 " +
"END")

Data masking
Because views in Unity Catalog use Spark SQL, you can implement advanced data masking by using more
complex SQL expressions and regular expressions. In the following example, all users can analyze email
domains, but only members of the auditors group can view a user’s entire email address.
SQL

-- The regexp_extract function takes an email address such as


-- user.x.lastname@example.com and extracts 'example', allowing
-- analysts to query the domain name.

CREATE VIEW sales_redacted AS


SELECT
user_id,
region,
CASE
WHEN is_account_group_member('auditors') THEN email
ELSE regexp_extract(email, '^.*@(.*)$', 1)
END
FROM sales_raw

Python
# The regexp_extract function takes an email address such as
# user.x.lastname@example.com and extracts 'example', allowing
# analysts to query the domain name.

spark.sql("CREATE VIEW sales_redacted AS "


"SELECT "
" user_id, "
" region, "
" CASE "
" WHEN is_account_group_member('auditors') THEN email "
" ELSE regexp_extract(email, '^.*@(.*)$', 1) "
" END "
" FROM sales_raw")

R
library(SparkR)

# The regexp_extract function takes an email address such as


# user.x.lastname@example.com and extracts 'example', allowing
# analysts to query the domain name.

sql(paste("CREATE VIEW sales_redacted AS ",


"SELECT ",
" user_id, ",
" region, ",
" CASE ",
" WHEN is_account_group_member('auditors') THEN email ",
" ELSE regexp_extract(email, '^.*@(.*)$', 1) ",
" END ",
" FROM sales_raw",
sep = ""))

Scala

// The regexp_extract function takes an email address such as


// user.x.lastname@example.com and extracts 'example', allowing
// analysts to query the domain name.

spark.sql("CREATE VIEW sales_redacted AS " +


"SELECT " +
" user_id, " +
" region, " +
" CASE " +
" WHEN is_account_group_member('auditors') THEN email " +
" ELSE regexp_extract(email, '^.*@(.*)$', 1) " +
" END " +
" FROM sales_raw")

Next steps
Manage access to data
Manage access to data
7/21/2022 • 8 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to manage access to objects and data in Unity Catalog. To learn more about data
permissions and object ownership, see Data permissions.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
To manage permissions for a table or view you must be a metastore admin, or you must be the table’s owner
and have the USAGE permission on its parent catalog and schema.
To manage permissions for a schema, you must be a metastore admin, or you must be the schema’s owner
and have the USAGE permission on its parent catalog.
To manage permissions for a catalog, you must be a metastore admin, or you must be the catalog’s owner.
To manage permissions for an external location or storage credential, you must be a metastore admin.

NOTE
Account-level admins can manage any object in any metastore in the account.

Show grants on an object


A securable object’s owner or a metastore admin can list all grants on the object. Optionally, you can list only the
grants for a given principal on the object.
To show all grants on an object, you can use the Databricks SQL Data Explorer or SQL commands.
Use the Databricks SQL Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. Select the object, such as a catalog, schema, table, or view.
5. Click Permissions .
Use SQL
Use the following syntax. Replace the placeholder values:
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.

SQL
SHOW GRANTS ON <securable_type> <securable_name>;
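For example, to list all grants on a table named main.default.department (a hypothetical name):

SHOW GRANTS ON TABLE main.default.department;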

Python

display(spark.sql("SHOW GRANTS ON <securable_type> <securable_name>"))

R
library(SparkR)

display(sql("SHOW GRANTS ON <securable_type> <securable_name>"))

Scala

display(spark.sql("SHOW GRANTS ON <securable_type> <securable_name>"))

To show all grants for a given principal on an object, use the following syntax. Replace the placeholder values:
<principal> : The email address of an account-level user or the name of an account-level group.
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.

SQL

SHOW GRANTS <principal> ON <securable_type> <securable_name>;

Python

display(spark.sql("SHOW GRANTS <principal> ON <securable_type> <securable_name>"))

R
library(SparkR)

display(sql("SHOW GRANTS <principal> ON <securable_type> <securable_name>"))

Scala

display(spark.sql("SHOW GRANTS <principal> ON <securable_type> <securable_name>"))

Manage privileges
To manage privileges, you use GRANT and REVOKE statements. Only an object’s owner or a metastore admin can
grant privileges on the object and its descendent objects. A built-in account-level group called account users
includes all account-level users.
In Unity Catalog, you can grant the following privileges on a securable object:
USAGE : This privilege does not grant access to the securable itself, but allows the grantee to traverse the
securable in order to access its child objects. For example, to select data from a table, users need to have the
SELECT privilege on that table and USAGE privileges on its parent schema and parent catalog. Thus, you can
use this privilege to restrict access to sections of your data namespace to specific groups.
SELECT : Allows a user to select from a table or view, if the user also has USAGE on its parent catalog and
schema.
MODIFY : Allows the grantee to add, update and delete data to or from the securable if the user also has
USAGE on its parent catalog and schema.
CREATE : Allows a user to create a schema if the user also has USAGE and CREATE permissions on its parent
catalog. Allows a user to create a table or view if the user also has USAGE on its parent catalog and schema
and the CREATE permission on the schema.

In addition, you can grant the following privileges on storage credentials and external locations.
CREATE TABLE : Allows a user to create external tables directly in your cloud tenant using a storage
credential.
READ FILES : When granted on an external location, allows a user to read files directly from your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to read files directly from your cloud tenant
using the storage credential.
WRITE FILES : When granted on an external location, allows a user to write files directly to your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to write files directly to your cloud tenant
using the storage credential.

NOTE
Although you can grant READ FILES and WRITE FILES privileges on a storage credential, Databricks recommends that
you instead grant these privileges on an external location. This allows you to manage permissions at a more granular level
and provides a simpler experience to end users.

In Unity Catalog, privileges are not inherited on child securable objects. For example, if you grant the CREATE
privilege on a catalog to a user, the user does not automatically have the CREATE privilege on all databases in
the catalog.
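The following sketch, using a hypothetical group named analysts and the table names from earlier examples, shows the typical pattern of granting USAGE on the parent objects together with SELECT on a table:

GRANT USAGE ON CATALOG main TO `analysts`;
GRANT USAGE ON SCHEMA main.default TO `analysts`;
GRANT SELECT ON TABLE main.default.department TO `analysts`;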
The following table summarizes the privileges that can be granted on each securable object:

SECURABLE            PRIVILEGES

Catalog              CREATE, USAGE

Schema               CREATE, USAGE

Table                SELECT, MODIFY

View                 SELECT

External location    CREATE TABLE, READ FILES, WRITE FILES

Storage credential   CREATE TABLE, READ FILES, WRITE FILES

Grant a privilege
To grant a privilege, you can use the Databricks SQL Data Explorer or SQL commands. Keep in mind that when
you grant a privilege on an object, you must also grant the USAGE privilege on its parent objects.
Use the Databricks SQL Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. Select the object, such as a catalog, schema, table, or view.
5. Click Permissions .
6. Click Grant .
7. Enter the email address for a user or the name of a group.
8. Select the permissions to grant.
9. Click OK .
Use SQL
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to grant, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.

SQL

GRANT <privilege> ON <securable_type> <securable_name> TO <principal>

Python

spark.sql("GRANT <privilege> ON <securable_type> <securable_name> TO <principal>")

R
library(SparkR)

sql("GRANT <privilege> ON <securable_type> <securable_name> TO <principal>")

Scala

spark.sql("GRANT <privilege> ON <securable_type> <securable_name> TO <principal>")

Revoke a privilege
To revoke a privilege, you can use the Databricks SQL Data Explorer or SQL commands.
Use the Databricks SQL Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. Select the object, such as a catalog, schema, table, or view.
5. Click Permissions .
6. Select a privilege that has been granted to a user or group.
7. Click Revoke .
8. To confirm, click Revoke .
Use SQL
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to revoke, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.

SQL

REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>

Python

spark.sql("REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>")

R
library(SparkR)

sql("REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>")

Scala

spark.sql("REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>")

Transfer ownership
To transfer ownership of an object within a metastore, you can use SQL.
Each securable object in Unity Catalog has an owner. The owner can be any account-level user or group, called a
principal. The principal that creates an object becomes its initial owner. An object’s owner has all privileges on
the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges to other
principals.
The object’s owner can transfer ownership to another user or group. A metastore admin can transfer ownership
of any object in the metastore to another user or group.
To see the owner of a securable object, use the following syntax. Replace the placeholder values:
<SECURABLE_TYPE> : The type of securable, such as CATALOG or TABLE .
<catalog> : The parent catalog for a table or view.
<schema> : The parent schema for a table or view.
<securable_name> : The name of the securable, such as a table or view.

SQL

DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>;

Python

display(spark.sql("DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>"))

R
library(SparkR)

display(sql("DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>"))

Scala

display(spark.sql("DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>"))

To transfer ownership of an object, use a SQL command with the following syntax. Replace the placeholder
values:
<SECURABLE_TYPE> : The type of securable, such as CATALOG or TABLE .
<SECURABLE_NAME> : The name of the securable.
<PRINCIPAL> : The email address of an account-level user or the name of an account-level group.

SQL

ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>;

Python

spark.sql("ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>")

R
library(SparkR)

sql("ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>")

Scala

spark.sql("ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO <PRINCIPAL>")

For example, to transfer ownership of a table to the accounting group:


SQL

ALTER TABLE orders OWNER TO `accounting`;

Python

spark.sql("ALTER TABLE orders OWNER TO `accounting`")

R
library(SparkR)

sql("ALTER TABLE orders OWNER TO `accounting`")

Scala
spark.sql("ALTER TABLE orders OWNER TO `accounting`")

Transfer ownership of a metastore


When possible, Databricks recommends group ownership over single-user ownership. The user who creates a
metastore is its initial owner. A metastore owner can manage the privileges for all securable objects within a
metastore, as well as create catalogs, external locations, and storage credentials.
1. Log in to the account console.
2. Click Data .
3. Click the name of a metastore to open its properties.
4. Under Owner , click Edit .
5. Select a group from the drop-down. You can enter text in the field to search for options.
6. Click Save .

Dynamic views
Dynamic views allow you to manage which users have access to a view’s rows, columns, or even specific records
by filtering or masking their values. See Create a dynamic view.
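
For example, the following sketch (catalog, schema, table, column, and group names are all hypothetical) creates a view that reveals the email column only to members of an auditors group and masks it for everyone else, using the is_account_group_member function:

-- All object and group names are illustrative.
CREATE VIEW main.sales.orders_redacted AS
SELECT
  order_id,
  CASE WHEN is_account_group_member('auditors') THEN email ELSE '***' END AS email
FROM main.sales.orders;
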
Manage external locations and storage credentials
7/21/2022 • 18 minutes to read

External locations and storage credentials allow Unity Catalog to read and write data on your cloud tenant on
behalf of users. These objects are used for:
Creating, reading from, and writing to external tables.
Creating a managed or external table from files stored on your cloud tenant.
Inserting records into tables from files stored on your cloud tenant.
Directly exploring data files stored on your cloud tenant.
A storage credential represents an authentication and authorization mechanism for accessing data stored on
your cloud tenant, using either an Azure managed identity (strongly recommended) or a service principal. Each
storage credential is subject to Unity Catalog access-control policies that control which users and groups can
access the credential. If a user does not have access to a storage credential in Unity Catalog, the request fails and
Unity Catalog does not attempt to authenticate to your cloud tenant on the user’s behalf.
An external location is an object that combines a cloud storage path with a storage credential that authorizes
access to the cloud storage path. Each external location is subject to Unity Catalog access-control policies that
control which users and groups can access the external location. If a user does not have access to an external
location in Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud
tenant on the user’s behalf.
Databricks recommends using external locations rather than using storage credentials directly.
To create or manage a storage credential or an external location, you must be a workspace admin for the Unity
Catalog-enabled workspace you want to access the storage from.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
You must have a Unity Catalog metastore.
To manage permissions for an external location or storage credential, you must be a metastore admin.
Unity Catalog only supports Azure Data Lake Storage Gen2 for external locations.

Manage storage credentials


The following sections illustrate how to create and manage storage credentials.
Create a storage credential
You can use either an Azure managed identity or a service principal as the identity that authorizes access to your
storage container. Managed identities are strongly recommended. They have the benefit of allowing Unity
Catalog to access storage accounts protected by network rules, which isn’t possible using service principals, and
they remove the need to manage and rotate secrets.
To create a storage credential using a managed identity:
1. Create an Azure Databricks access connector and assign it permissions to the storage container that you
would like to access, using the instructions in Configure a managed identity for Unity Catalog.
An Azure Databricks access connector is a first-party Azure resource that lets you connect managed
identities to an Azure Databricks account.
Make note of the access connector’s resource ID.
2. Log in to your Unity Catalog-enabled Azure Databricks workspace as a workspace admin.
3. From the persona switcher at the top of the sidebar, select SQL .

4. Click Data .
5. Click the + menu at the upper right and select Add a storage credential .
6. On the Create a new storage credential dialog, select Managed identity (recommended) .
7. Enter a name for the credential, and enter the access connector’s resource ID in the format:

/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>

8. Click Save .
9. Create an external location that references this storage credential.
To create a storage credential using a service principal:
1. In the Azure Portal, create a service principal and grant it access to your storage account.
a. Create a client secret for the service principal and make a note of it.
b. Make a note of the directory ID, and application ID for the service principal.
c. Go to your storage account and grant the service principal the Storage Blob Data Contributor role.
2. Log in to your Unity Catalog-enabled Azure Databricks workspace as a workspace admin.
3. From the persona switcher at the top of the sidebar, select SQL
4. Click Data .
5. Click the + menu at the upper right and select Add a storage credential .
6. On the Create a new storage credential dialog, select Service principal .
7. Enter a name for the credential, along with the directory ID, application ID, and client secret of the service
principal that has been granted the Storage Blob Data Contributor role on the storage container you want to
access.
8. Click Save .
9. Create an external location that references this storage credential.
List storage credentials
To view the list of all storage credentials in a metastore, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
Sql
Run the following command in a notebook or the Databricks SQL editor.

SHOW STORAGE CREDENTIALS;


Python
Run the following command in a notebook.

display(spark.sql("SHOW STORAGE CREDENTIALS"))

R
Run the following command in a notebook.

library(SparkR)

display(sql("SHOW STORAGE CREDENTIALS"))

Scala
Run the following command in a notebook.

display(spark.sql("SHOW STORAGE CREDENTIALS"))

View a storage credential


To view the properties of a storage credential, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
5. Click the name of a storage credential to see its properties.
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace <credential_name> with the
name of the credential.

DESCRIBE STORAGE CREDENTIAL <credential_name>;

Python
Run the following command in a notebook. Replace <credential_name> with the name of the credential.

display(spark.sql("DESCRIBE STORAGE CREDENTIAL <credential_name>"))

R
Run the following command in a notebook. Replace <credential_name> with the name of the credential.

library(SparkR)

display(sql("DESCRIBE STORAGE CREDENTIAL <credential_name>"))

Scala
Run the following command in a notebook. Replace <credential_name> with the name of the credential.

display(spark.sql("DESCRIBE STORAGE CREDENTIAL <credential_name>"))


Rename a storage credential
To rename a storage credential, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
5. Click the name of a storage credential to open the edit dialog.
6. Rename the storage credential and save it.
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<credential_name> : The name of the credential.
<new_credential_name> : A new name for the credential.

ALTER STORAGE CREDENTIAL <credential_name> RENAME TO <new_credential_name>;

Python
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<new_credential_name> : A new name for the credential.

spark.sql("ALTER STORAGE CREDENTIAL <credential_name> RENAME TO <new_credential_name>")

R
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<new_credential_name> : A new name for the credential.

library(SparkR)

sql("ALTER STORAGE CREDENTIAL <credential_name> RENAME TO <new_credential_name>")

Scala
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<new_credential_name> : A new name for the credential.

spark.sql("ALTER STORAGE CREDENTIAL <credential_name> RENAME TO <new_credential_name>")

Manage permissions for a storage credential


You can grant permissions directly on the storage credential, but Databricks recommends that you reference it in
an external location and grant permissions to that instead. An external location combines a storage credential
with a specific path, and authorizes access only to that path and its contents.
You can manage permissions for a storage credential using the Data Explorer or SQL commands. You can grant
and revoke the following permissions on a storage credential:
CREATE TABLE
READ FILES
WRITE FILES

Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
5. Click the name of a storage credential to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
SQL, Python, R, Scala
In the following examples, replace the placeholder values:
<principal> : The email address of the account-level user or the name of the account-level group to whom to
grant the permission.
<storage_credential_name> : The name of a storage credential.

To show grants on a storage credential, use a command like the following. You can optionally filter the results to
show only the grants for the specified principal.
SQL

SHOW GRANTS [<principal>] ON STORAGE CREDENTIAL <storage_credential_name>;

Python

display(spark.sql("SHOW GRANTS [<principal>] ON STORAGE CREDENTIAL <storage_credential_name>"))

R
library(SparkR)

display(sql("SHOW GRANTS [<principal>] ON STORAGE CREDENTIAL <storage_credential_name>"))

Scala

display(spark.sql("SHOW GRANTS [<principal>] ON STORAGE CREDENTIAL <storage_credential_name>"))

To grant permission to create an external table using a storage credential directly:


SQL

GRANT CREATE TABLE ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>;

Python

spark.sql("GRANT CREATE TABLE ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>")

R
library(SparkR)

sql("GRANT CREATE TABLE ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>")

Scala

spark.sql("GRANT CREATE TABLE ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>")

To grant permission to read files using a storage credential directly:
SQL

GRANT READ FILES ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>;

Python

spark.sql("GRANT READ FILES ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>")

R
library(SparkR)

sql("GRANT READ FILES ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>")

Scala

spark.sql("GRANT READ FILES ON STORAGE CREDENTIAL <storage_credential_name> TO <principal>")

NOTE
If a group name contains a space, use back-ticks around it (not apostrophes).

Change the owner of a storage credential


A storage credential’s creator is its initial owner. To change the owner to a different account-level user or group,
do the following:
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.

ALTER STORAGE CREDENTIAL <credential_name> OWNER TO <principal>;

Python
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.

spark.sql("ALTER STORAGE CREDENTIAL <credential_name> OWNER TO <principal>")

R
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.

library(SparkR)

sql("ALTER STORAGE CREDENTIAL <credential_name> OWNER TO <principal>")

Scala
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.

spark.sql("ALTER STORAGE CREDENTIAL <credential_name> OWNER TO <principal>")

Delete a storage credential


To delete a storage credential, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
5. Click the name of a storage credential to open the edit dialog.
6. Click the Delete button.
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace <credential_name> with the
name of the credential. Portions of the command that are in brackets are optional. By default, if the credential is
used by an external location, it is not deleted.
IF EXISTS does not return an error if the credential does not exist.
FORCE deletes the credential even if it is used by an external location. External locations that depend on this
credential can no longer be used.

DROP STORAGE CREDENTIAL [IF EXISTS] <credential_name> [FORCE];

Python
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
Portions of the command that are in brackets are optional. By default, if the credential is used by an external
location, it is not deleted.
IF EXISTS does not return an error if the credential does not exist.
FORCE deletes the credential even if it is used by an external location. External locations that depend on
this credential can no longer be used.

spark.sql("DROP STORAGE CREDENTIAL [IF EXISTS] <credential_name> [FORCE]")

R
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
Portions of the command that are in brackets are optional. By default, if the credential is used by an external
location, it is not deleted.
IF EXISTS does not return an error if the credential does not exist.
FORCE deletes the credential even if it is used by an external location. External locations that depend on
this credential can no longer be used.

library(SparkR)

sql("DROP STORAGE CREDENTIAL [IF EXISTS] <credential_name> [FORCE]")

Scala
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
Portions of the command that are in brackets are optional. By default, if the credential is used by an external
location, it is not deleted.
IF EXISTS does not return an error if the credential does not exist.
FORCE deletes the credential even if it is used by an external location. External locations that depend on
this credential can no longer be used.

spark.sql("DROP STORAGE CREDENTIAL [IF EXISTS] <credential_name> [FORCE]")

Manage external locations


The following sections illustrate how to create and manage external locations.
Create an external location
You can create an external location using the Data Explorer or a SQL command.
Data Explorer
1. From the persona switcher, select SQL .
2. Click Data .
3. Click the + menu at the upper right and select Add an external location .
4. Click Create location .
a. Enter a name for the location.
b. Optionally copy the storage container path from an existing mount point.
c. If you aren’t copying from an existing mount point, enter a storage container path.
d. Select the storage credential that grants access to the location.
e. Click Save .
SQL, Python, R, Scala
Run the following SQL command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<location_name> : A name for the external location.
<bucket_path> : The path in your cloud tenant that this external location grants access to.
<storage_credential_name> : The name of the storage credential that contains details about the managed
identity or service principal that is authorized to read from and write to the storage container path.

NOTE
Each cloud storage path can be associated with only one external location. If you attempt to create a second external
location that references the same path, the command fails.
External locations only support Azure Data Lake Storage Gen2 storage.

SQL

CREATE EXTERNAL LOCATION <location_name>
URL 'abfss://<container_name>@<storage_account>.dfs.core.windows.net/<path>'
WITH ([STORAGE] CREDENTIAL <storage_credential_name>);

Python

spark.sql("CREATE EXTERNAL LOCATION <location_name> "


"URL 'abfss://<container_name>@<storage_account>.dfs.core.windows.net/<path>' "
"WITH ([STORAGE] CREDENTIAL <storage_credential_name>)")

R
library(SparkR)

sql(paste("CREATE EXTERNAL LOCATION <location_name> ",
  "URL 'abfss://<container_name>@<storage_account>.dfs.core.windows.net/<path>' ",
  "WITH ([STORAGE] CREDENTIAL <storage_credential_name>)",
  sep = ""))

Scala

spark.sql("CREATE EXTERNAL LOCATION <location_name> " +
  "URL 'abfss://<container_name>@<storage_account>.dfs.core.windows.net/<path>' " +
  "WITH ([STORAGE] CREDENTIAL <storage_credential_name>)")

Describe an external location


To see the properties of an external location, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click External Locations .
5. Click the name of an external location to see its properties.
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace <location_name> with the
name of the external location.

DESCRIBE EXTERNAL LOCATION <location_name>;

Python
Run the following command in a notebook. Replace <location_name> with the name of the external location.
display(spark.sql("DESCRIBE EXTERNAL LOCATION <location_name>"))

R
Run the following command in a notebook. Replace <location_name> with the name of the external location.

library(SparkR)

display(sql("DESCRIBE EXTERNAL LOCATION <location_name>"))

Scala
Run the following command in a notebook. Replace <location_name> with the name of the external location.

display(spark.sql("DESCRIBE EXTERNAL LOCATION <location_name>"))

Modify an external location


To rename an external location, do the following:
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<location_name> : The name of the location.
<new_location_name> : A new name for the location.

ALTER EXTERNAL LOCATION <location_name> RENAME TO <new_location_name>;

Python
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the location.
<new_location_name> : A new name for the location.

spark.sql("ALTER EXTERNAL LOCATION <location_name> RENAME TO <new_location_name>")

R
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the location.
<new_location_name> : A new name for the location.

library(SparkR)

sql("ALTER EXTERNAL LOCATION <location_name> RENAME TO <new_location_name>")

Scala
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the location.
<new_location_name> : A new name for the location.

spark.sql("ALTER EXTERNAL LOCATION <location_name> RENAME TO <new_location_name>")


To change the URI that an external location points to in your cloud tenant, do the following:
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<location_name> : The name of the external location.
<url> : The new storage URL the location should authorize access to in your cloud tenant.

ALTER EXTERNAL LOCATION <location_name> SET URL '<url>' [FORCE];

Python
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<url> : The new storage URL the location should authorize access to in your cloud tenant.

spark.sql("ALTER EXTERNAL LOCATION location_name SET URL `<url>` [FORCE]")

R
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<url> : The new storage URL the location should authorize access to in your cloud tenant.

library(SparkR)

sql("ALTER EXTERNAL LOCATION location_name SET URL `<url>` [FORCE]")

Scala
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<url> : The new storage URL the location should authorize access to in your cloud tenant.

spark.sql("ALTER EXTERNAL LOCATION location_name SET URL `<url>` [FORCE]")

The FORCE option changes the URL even if external tables depend upon the external location.
To change the storage credential that an external location uses, do the following:
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.

ALTER EXTERNAL LOCATION <location_name> SET STORAGE CREDENTIAL <credential_name>;

Python
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.

spark.sql("ALTER EXTERNAL LOCATION <location_name> SET STORAGE CREDENTIAL <credential_name>")

R
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.

library(SparkR)

sql("ALTER EXTERNAL LOCATION <location_name> SET STORAGE CREDENTIAL <credential_name>")

Scala
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.

spark.sql("ALTER EXTERNAL LOCATION <location_name> SET STORAGE CREDENTIAL <credential_name>")

Manage permissions for an external location


You can grant and revoke the following permissions on an external location:
CREATE TABLE
READ FILES
WRITE FILES

Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click External Locations .
5. Click the name of an external location to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
SQL, Python, R, Scala
In the following examples, replace the placeholder values:
<principal> : The email address of the account-level user or the name of the account-level group to whom to
grant the permission.
<location_name> : The name of the external location that authorizes reading from and writing to the
storage container path in your cloud tenant.
To show grants on an external location, use a command like the following. You can optionally filter the results to
show only the grants for the specified principal.
SQL

SHOW GRANTS [<principal>] ON EXTERNAL LOCATION <location_name>;

Python

display(spark.sql("SHOW GRANTS [<principal>] ON EXTERNAL LOCATION <location_name>"))

R
library(SparkR)

display(sql("SHOW GRANTS [<principal>] ON EXTERNAL LOCATION <location_name>"))

Scala

display(spark.sql("SHOW GRANTS [<principal>] ON EXTERNAL LOCATION <location_name>"))

To grant permission to use an external location to create a table:


SQL

GRANT CREATE TABLE ON EXTERNAL LOCATION <location_name> TO <principal>;

Python

spark.sql("GRANT CREATE TABLE ON EXTERNAL LOCATION <location_name> TO <principal>")

R
library(SparkR)

sql("GRANT CREATE TABLE ON EXTERNAL LOCATION <location_name> TO <principal>")

Scala

spark.sql("GRANT CREATE TABLE ON EXTERNAL LOCATION <location_name> TO <principal>")

To grant permission to read files from an external location:


SQL

GRANT READ FILES ON EXTERNAL LOCATION <location_name> TO <principal>;

Python

spark.sql("GRANT READ FILES ON EXTERNAL LOCATION <location_name> TO <principal>")

R
library(SparkR)

sql("GRANT READ FILES ON EXTERNAL LOCATION <location_name> TO <principal>")


Scala

spark.sql("GRANT READ FILES ON EXTERNAL LOCATION <location_name> TO <principal>")

NOTE
If a group name contains a space, use back-ticks around it (not apostrophes).

Change the owner of an external location


An external location’s creator is its initial owner. To change the owner to a different account-level user or group,
run the following command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<location_name> : The name of the external location.
<principal> : The email address of an account-level user or the name of an account-level group.
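
A sketch of the command, following the same pattern as the storage credential example earlier in this article:

ALTER EXTERNAL LOCATION <location_name> OWNER TO <principal>;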

Delete an external location


To delete an external location, do the following:
Sql
Run the following command in a notebook or the Databricks SQL editor. Items in brackets are optional. Replace
<location_name> with the name of the external location.

The FORCE option deletes the external location even if external tables depend upon it.

DROP EXTERNAL LOCATION [IF EXISTS] <location_name> [FORCE];

Python
Run the following command in a notebook. Items in brackets are optional. Replace <location_name> with the
name of the external location.
The FORCE option deletes the external location even if external tables depend upon it.

spark.sql("DROP EXTERNAL LOCATION [IF EXISTS] <location_name> [FORCE]")

R
Run the following command in a notebook. Items in brackets are optional. Replace <location_name> with the
name of the external location.
The FORCE option deletes the external location even if external tables depend upon it.

library(SparkR)

sql("DROP EXTERNAL LOCATION [IF EXISTS] <location_name> [FORCE]")

Scala
Run the following command in a notebook. Items in brackets are optional. Replace <location_name> with the
name of the external location.
The FORCE option deletes the external location even if external tables depend upon it.

spark.sql("DROP EXTERNAL LOCATION [IF EXISTS] <location_name> [FORCE]")


Next steps
Create tables
Create views
Query data
7/21/2022 • 3 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to query data in Unity Catalog.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
To query data in a table or view, the user must have the USAGE permission on the parent catalog and
schema and the SELECT permission on the table or view.

NOTE
To read from a view on a cluster with single-user security mode, the user must have SELECT on all referenced
tables and views.

Three-level namespace notation


In Unity Catalog, a table or view is contained within a parent catalog and schema. You can refer to a table or view
using two different styles of notation. You can use USE CATALOG and USE statements to specify the catalog and
schema:
SQL

USE CATALOG <catalog_name>;


USE SCHEMA <schema_name>;
SELECT * from <table_name>;

Python

spark.sql("USE CATALOG <catalog_name>")


spark.sql("USE SCHEMA <schema_name>")

display(spark.table("<table_name>"))

R
library(SparkR)

sql("USE CATALOG <catalog_name>")


sql("USE SCHEMA <schema_name>")

display(tableToDF("<table_name>"))
Scala

spark.sql("USE CATALOG <catalog_name>")


spark.sql("USE SCHEMA <schema_name>")

display(spark.table("<table_name>"))

As an alternative, you can use three-level namespace notation:


SQL

SELECT * from <catalog_name>.<schema_name>.<table_name>;

Python

display(spark.table("<catalog_name>.<schema_name>.<table_name>"))

R
library(SparkR)

display(tableToDF("<catalog_name>.<schema_name>.<table_name>"))

Scala

display(spark.table("<catalog_name>.<schema_name>.<table_name>"))

Using three-level namespace simplifies querying data in multiple catalogs and schemas.
You can also use three-level namespace notation for data in the Hive metastore by setting <catalog_name> to
hive_metastore .
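
For example, to read a hypothetical table named my_table in the default database of the workspace-local Hive metastore:

-- The schema and table names are illustrative.
SELECT * FROM hive_metastore.default.my_table;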

Explore tables and views in Databricks SQL


You can quickly explore tables and views without the need to run a cluster by using the Data explorer in
Databricks SQL.
1. Log in to Databricks SQL.
If the Databricks Data Science & Engineering or Databricks Machine Learning environment appears, use
the sidebar to switch to Databricks SQL.

2. To open data explorer, click Data in the sidebar.


3. In the Data Explorer, select the catalog and schema to view its tables and views.
For objects in the Hive Metastore, you must be running a SQL warehouse to use the Data Explorer.

Select from tables and views


To select from a table or view from a notebook:
1. Use the sidebar to switch to Data Science & Engineering.
2. Attach the notebook to a Data Science & Engineering or Databricks Machine Learning cluster that is
configured for Unity Catalog.
3. In the notebook, create a query that references Unity Catalog tables and views. You can use three-level
namespace notation to easily select data in multiple catalogs and schemas, including the workspace-local
Hive metastore.

NOTE
To read from a view from a cluster with single-user security mode, the user must have SELECT on all referenced
tables and views.

To select from a table or view from Databricks SQL:


1. Use the sidebar to switch to Databricks SQL.
2. Click SQL Editor in the sidebar.
3. Select a SQL warehouse that is configured for Unity Catalog.
4. Compose a query. To insert a table or view into the query, select a catalog and schema, then click the name of
the table or view to insert.
5. Click Run .
Select from files
You can explore the contents of data files stored in your cloud tenant before creating tables from them.
Requirements
You must have the READ FILES permission on either an external location (recommended) associated with the
path to the files in your cloud tenant or directly on the storage credential that authorizes reading from that
path. If you have already defined a table on the path, you can access the data through the table if you have
SELECT permission on the table and USAGE permission on the table’s parent catalog and schema.
To explore data stored in an external location:
1. List the files in a cloud storage path:
SQL

LIST 'abfss://<path_to_files>';

Python

display(spark.sql("LIST 'abfss://<path_to_files>'"))

R
library(SparkR)

display(sql("LIST 'abfss://<path_to_files>'"))

Scala

display(spark.sql("LIST 'abfss://<path_to_files>'"))

If you have the READ FILES permission on the external location associated with the cloud storage path, a
list of data files in that location is returned.
2. Query the data in the files in a given path:
SQL

SELECT * FROM <format>.`abfss://<path_to_files>`;

Python

display(spark.read.load("abfss://<path_to_files>"))

R
library(SparkR)

display(loadDF("abfss://<path_to_files>"))

Scala

display(spark.read.load("abfss://<path_to_files>"))
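
For example, to query a hypothetical folder of Parquet files (the storage path and file format are illustrative):

-- The storage path and file format are illustrative.
SELECT * FROM parquet.`abfss://raw@examplestorage.dfs.core.windows.net/sales/2022/`;
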
To explore data using a storage credential directly:
SQL

SELECT * FROM <format>.`abfss://<path_to_files>`
WITH (CREDENTIAL <storage_credential>);

Python

display(spark.sql("SELECT * FROM <format>.`abfss://<path_to_files>` "
  "WITH (CREDENTIAL <storage_credential>)"))

R
library(SparkR)

display(sql(paste("SELECT * FROM <format>.`abfss://<path_to_files>` ",
  "WITH (CREDENTIAL <storage_credential>)",
  sep = "")))

Scala

display(spark.sql("SELECT * FROM <format>.`abfss://<path_to_files>` " +
  "WITH (CREDENTIAL <storage_credential>)"))

Next steps
Manage access to data
Train a machine-learning model with Python from
data in Unity Catalog
7/21/2022 • 3 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Unity Catalog allows you to apply fine-grained security to tables and to securely access them from any
language, all while interacting seamlessly with other machine-learning components in Azure Databricks. This
article shows how to use Python to train a machine-learning model using data in Unity Catalog.

Requirements
Your Azure Databricks account must be on the Premium plan.
You must be an account admin or the metastore admin for the metastore you use to train the model.

Create a Databricks Machine Learning cluster


Follow these steps to create a Single-User Databricks Machine Learning cluster that can access data in Unity
Catalog.
To create a Databricks Machine Learning cluster that can access Unity Catalog:
1. Log in to the workspace as a workspace-level admin.

2. In the Data Science & Engineering or Databricks Machine Learning persona, click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. For Databricks runtime version :
a. Click ML .
b. Select either 10.3 ML (Scala 2.12, Spark 3.2.1) or higher, or 10.3 ML (GPU, Scala 2.12,
Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User . To run Python code,
you must use Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features such as library installation, init scripts, and the DBFS Fuse mount are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features or languages, or to run workloads using Python, Scala, and R, set
the security mode to Single User. A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default, the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced: a user selecting from a view executes the query with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .

Create the catalog


Follow these steps to create a new catalog where your machine learning team can store their data assets.
1. As an account admin or the metastore admin, log in to a workspace with the metastore assigned.
2. Create a notebook or open the Databricks SQL editor.
3. Run the following command to create the ml catalog:

CREATE CATALOG ml;

When you create a catalog, a schema named default is automatically created within it.
4. Grant access to the ml catalog and the ml.default schema, and the ability to create tables and views, to
the ml_team group. To include all account level users, you could use the group account users .

GRANT USAGE ON CATALOG ml TO `ml_team`;


GRANT USAGE, CREATE ON SCHEMA ml.default TO `ml_team`;

Now, any user in the ml_team group can run the following example notebook.

Import the example notebook


To get started, import the following notebook.
Machine learning with Unity Catalog
Get notebook
To import the notebook:
1. Next to the notebook, click Copy link for import .
2. In your workspace, click Workspace .
3. Next to a folder, open its menu, then click Import .
4. Click URL , then paste in the link you copied.
5. The imported notebook appears in the folder you selected. Double-click the notebook name to open it.
6. At the top of the notebook, select your Databricks Machine Learning cluster to attach the notebook to it.
The notebook is divided into several high-level sections:
1. Setup.
2. Read data from CSV files and write it to Unity Catalog.
3. Load the data into pandas DataFrames and clean it up.
4. Train a basic classification model.
5. Tune hyperparameters and optimize the model.
6. Write the results to a new table and share it with other users.
To run a cell, click Run . To run the entire notebook, click Run All .
Connect to BI tools
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how to connect Unity Catalog to business intelligence (BI) tools.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.

JDBC
To use Unity Catalog over JDBC, download the Simba JDBC driver version 2.6.21 or above.

ODBC
To use Unity Catalog over ODBC, download the Simba ODBC driver version 2.6.19 or above.

Tableau
To use Unity Catalog data in Tableau, use Tableau Desktop 2021.4 with the Simba ODBC driver version 2.6.19 or
above. See Tableau.
For more information, see Databricks ODBC and JDBC drivers.
Audit access and activity for Unity Catalog
resources
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

This article shows how you can audit Unity Catalog access and activity.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
Link the metastore to the workspace in which you will process the diagnostic logs.

Configure diagnostic logs


Unity Catalog captures a diagnostic log of actions performed against the metastore. This enables fine-grained
details about who accessed a given dataset, and helps you meet your compliance and business requirements.
1. Enable diagnostic logs for each workspace in your account.
2. Create a Data Science & Engineering cluster with the Single User cluster security mode. See Create a
Data Science & Engineering cluster.
3. Import the following example notebook into your workspace and attach it to the cluster you just created.
See Import a notebook.
Unity Catalog diagnostic log analysis
Get notebook
4. Fill in the fields at the top of the notebook:
azure_resource_group : The ID of the Azure resource group that contains the Azure Databricks
workspace.
azure_subscription_id : The ID of the Azure subscription that contains the Azure Databricks
workspace.
log_category : Optionally filter by log category.
storage_account_access_key : The access key for the storage account where diagnostic logs are
delivered.
storage_account_name : The name of the Azure storage account where diagnostic logs are
delivered.
workspace_name : The name of the Azure Databricks workspace.
5. Run the notebook to create the audit report.
6. To modify the report or to return activities for a given user, see command 24 in the notebook.
Upgrade tables and views to Unity Catalog
7/21/2022 • 4 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

To take advantage of Unity Catalog’s access control and auditing mechanisms, and to share data to multiple
workspaces, you can upgrade tables and views to Unity Catalog.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
If necessary, create catalogs and schemas in the metastore. The catalogs and schemas will contain the new
tables and views.

Upgrade a table to Unity Catalog


To upgrade a table to Unity Catalog as a managed table:
1. If necessary, create a metastore. See Create a metastore.
2. Assign that metastore to the workspace that contains the table.
3. Create a new Unity Catalog table by querying the existing table. Replace the placeholder values:
<catalog> : The Unity Catalog catalog for the new table.
<new_schema> : The Unity Catalog schema for the new table.
<new_table> : A name for the Unity Catalog table.
<old_schema> : The schema for the old table, such as default .
<old_table> : The name of the old table.

SQL

CREATE TABLE <catalog>.<new_schema>.<new_table>


AS SELECT * FROM hive_metastore.<old_schema>.<old_table>;

Python

df = spark.table("hive_metastore.<old_schema>.<old_table>")

df.write.saveAsTable(
name = "<catalog>.<new_schema>.<new_table>"
)

R
%r
library(SparkR)

df = tableToDF("hive_metastore.<old_schema>.<old_table>")

saveAsTable(
df = df,
tableName = "<catalog>.<new_schema>.<new_table>"
)

Scala

val df = spark.table("hive_metastore.<old_schema>.<old_table>")

df.write.saveAsTable(
tableName = "<catalog>.<new_schema>.<new_table>"
)

If you want to migrate only some columns or rows, modify the SELECT statement.

NOTE
This command creates a managed table in which data is copied into the storage location that was nominated
when the metastore was set up. To create an external table, where a table is registered in Unity Catalog without
moving the data in cloud storage, see Upgrade an external table to Unity Catalog.

4. Grant account-level users or groups access to the new table. See Manage access to data.
5. After the table is migrated, users should update their existing queries and workloads to use the new table.
6. Before you drop the old table, test for dependencies by revoking access to it and re-running related
queries and workloads.

Upgrade an external table to Unity Catalog


You can copy an external table from your default Hive metastore to the Unity Catalog metastore using the Data
Explorer .
Requirements
Before you begin, you must have:
A storage credential that contains the information about a service principal authorized to access the table’s
location path.
An external location that references the storage credential you just created and the path to the data on your
cloud tenant.
CREATE TABLE permission on the external location of the table to be upgraded.

Upgrade process
To upgrade an external table:
1. If you are not already in Databricks SQL, use the persona switcher in the sidebar to select SQL .

2. Click Data in the sidebar to open the Data Explorer.


3. Select the database, then the table, that you want to upgrade.
4. Click Upgrade action in the top-right corner of the table detail view.
5. Select your destination catalog and database in Unity Catalog, then click Upgrade .
6. The table metadata has been copied to Unity Catalog, and a new table has been created. You can now
define fine-grained access control in the Permissions tab.
7. Modify workloads to use the new table.

NOTE
If you no longer need the old table, you can drop it from the Hive Metastore. Dropping an external table does not
modify the data files on your cloud tenant.

Upgrade a view to Unity Catalog


After you upgrade all of a view’s referenced tables to the same Unity Catalog metastore, you can create a new
view that references the new tables.

NOTE
If your view also references other views, upgrade those views first.

After you upgrade the view, grant access to it to account-level users and groups.
Before you drop the old view, test for dependencies by revoking access to it and re-running related queries and
workloads.

Upgrade a schema or multiple tables to Unity Catalog


You can copy complete schemas (databases) and multiple tables from your default Hive metastore to the Unity
Catalog metastore using the Data Explorer upgrade wizard.
Requirements
Before you begin, you must have:
A storage credential that contains the information about a service principal authorized to access the table’s
location path.
An external location that references the storage credential you just created and the path to the data on your
cloud tenant.
CREATE TABLE permission on the external locations of the tables to be upgraded.

Upgrade process
1. If you are not already in Databricks SQL, use the persona switcher in the sidebar to select SQL .

2. Click Data in the sidebar to open the Data Explorer.


3. Select hive_metastore as your catalog and select the schema (database) that you want to upgrade.
4. Click Upgrade at the top right of the schema detail view.
5. Select all of the tables that you want to upgrade and click Next .
Only external tables in formats supported by Unity Catalog can be upgraded using the upgrade wizard.
6. Set the destination catalog and schema for each table.
You will be able to access the newly created table in the context of that catalog and schema. Destination
catalog and schema can be set either for each table individually or in bulk. To set them in bulk, first select
some or all tables and then set the destination catalog and schema.
7. Review the table configurations. To modify them, click the Previous button.
8. Click Create Query for Upgrade .
A query editor appears with generated SQL statements.
9. Run the query.
When the query is done, each table’s metadata has been copied from Hive metastore to Unity Catalog.
These tables are marked as upgraded in the upgrade wizard.
10. Define fine-grained access control using the Permissions tab of each new table.
11. Modify your workloads to use the new table.

NOTE
If you no longer need the old tables, you can drop them from the Hive Metastore. Dropping an external table does not
modify the data files on your cloud tenant.
Automate Unity Catalog setup using Terraform
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

You can automate Unity Catalog setup using Terraform templates.

Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.

Create Terraform templates


For details and recipes, see https://github.com/databrickslabs/terraform-provider-databricks/docs/guides/unity-catalog.md.
Unity Catalog public preview limitations
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

During the public preview, Unity Catalog has the following limitations:
Python, Scala, and R workloads are supported only on clusters that use the Single User security mode.
Workloads in these languages do not support the use of dynamic views for row-level or column-level
security.
Unity Catalog can be used together with the built-in Hive metastore provided by Databricks. You can’t use
Unity Catalog with external Hive metastores that require configuration using init scripts.
Unity Catalog does not manage partitions like Hive does. Unity Catalog managed tables are, by definition,
Delta tables, for which partition metadata is tracked in the Delta log. External non-Delta tables may be
partitioned in storage; Unity Catalog does not manage the partitions for such tables, which means that you
cannot manage partitions for these tables through Unity Catalog.
Overwrite mode for DataFrame write operations into Unity Catalog is supported only for managed Delta
tables and not for other cases, such as external tables. In addition, the user must have the CREATE privilege
on the parent schema and must be the owner of the existing object.
Support for streaming workloads against Unity Catalog is in Private Preview, and there is no support for:
Using external locations as the source or destination for streams.
Storing streaming metadata in external locations. This metadata includes streaming checkpoints and
schema location for the cloud_files source.
Using Python or R to write streaming queries. Only Scala and Java are supported in Private Preview.
The following Delta Lake features aren’t supported:
Low shuffle merge
Sampling
Change data feed
Bloom filters
Table access control
7/21/2022 • 2 minutes to read

Table access control lets you programmatically grant and revoke access to your data from Python and SQL.
By default, all users have access to all data stored in a cluster’s managed tables unless table access control is
enabled for that cluster. Once table access control is enabled, users can set permissions for data objects on that
cluster.

Requirements
This feature requires the Premium Plan.
This feature requires a Data Science & Engineering cluster with an appropriate configuration or a Databricks
SQL endpoint.
This section covers:
Enable table access control for a cluster
Data object privileges
Enable table access control for a cluster
7/21/2022 • 3 minutes to read

NOTE
This article contains references to the term whitelist, a term that Azure Databricks does not use. When the term is
removed from the software, we’ll remove it from this article.

This article describes how to enable table access control for a cluster.
For information about how to set privileges on a data object once table access control has been enabled on a
cluster, see Data object privileges.

Enable table access control for a cluster


Table access control is available in two versions:
SQL-only table access control, which restricts users to SQL commands. You are restricted to the Apache
Spark SQL API, and therefore cannot use Python, Scala, R, RDD APIs, or clients that directly read the data
from cloud storage, such as DBUtils.
Python and SQL table access control, which allows users to run SQL, Python, and PySpark commands. You
are restricted to the Spark SQL API and DataFrame API, and therefore cannot use Scala, R, RDD APIs, or
clients that directly read the data from cloud storage, such as DBUtils.

IMPORTANT
Even if table access control is enabled for a cluster, Azure Databricks administrators have access to file-level data.

SQL -only table access control


This version of table access control restricts users to SQL commands only.
To enable SQL-only table access control on a cluster and restrict that cluster to use only SQL commands, set the
following flag in the cluster’s Spark conf:

spark.databricks.acl.sqlOnly true

NOTE
Access to SQL-only table access control is not affected by the Enable Table Access Control setting in the admin console.
That setting controls only the workspace-wide enablement of Python and SQL table access control.

Python and SQL table access control


This version of table access control lets users run Python commands that use the DataFrame API as well as SQL.
When it is enabled on a cluster, users on that cluster:
Can access Spark only using the Spark SQL API or DataFrame API. In both cases, access to tables and views is
restricted by administrators according to the Azure Databricks Data governance model.
Must run their commands on cluster nodes as a low-privilege user forbidden from accessing sensitive parts
of the filesystem or creating network connections to ports other than 80 and 443.
Only built-in Spark functions can create network connections on ports other than 80 and 443.
Only admin users or users with ANY FILE privilege can read data from external databases through the
PySpark JDBC connector.
If you want Python processes to be able to access additional outbound ports, you can set the Spark
config spark.databricks.pyspark.iptable.outbound.whitelisted.ports to the ports you want to allow
access. The supported format of the configuration value is [port[:port][,port[:port]]...] , for
example: 21,22,9000:9999 . The port must be within the valid range, that is, 0-65535 .
Attempts to get around these restrictions will fail with an exception. These restrictions are in place so that users
can never access unprivileged data through the cluster.
Requirements
Before users can configure Python and SQL table access control, an Azure Databricks admin must:
Enable table access control for the Azure Databricks workspace.
Deny users access to clusters that are not enabled for table access control. In practice, that means denying
most users permission to create clusters and denying users the Can Attach To permission for clusters that
are not enabled for table access control.
For information on both these requirements, see Enable table access control for your workspace.
Create a cluster enabled for table access control
When you create a cluster, click the Enable table access control and only allow Python and SQL
commands option. This option is available only for High Concurrency clusters.

To create the cluster using the REST API, see Create cluster enabled for table access control example.

Set privileges on a data object


See Data object privileges.
Data object privileges

The Azure Databricks data governance model lets you programmatically grant, deny, and revoke access to your
data from Spark SQL. This model lets you control access to securable objects like catalogs, schemas (databases),
tables, views, and functions. It also allows for fine-grained access control (to a particular subset of a table, for
example) by setting privileges on derived views created from arbitrary queries. The Azure Databricks SQL query
analyzer enforces these access control policies at runtime on Azure Databricks clusters with table access control
enabled and all SQL warehouses.
This article describes the privileges, objects, and ownership rules that make up the Azure Databricks data
governance model. It also describes how to grant, deny, and revoke object privileges.

Requirements
The requirements for managing object privileges depend on your environment:
Databricks Data Science & Engineering and Databricks Machine Learning
Databricks SQL
Databricks Data Science & Engineering and Databricks Machine Learning
An administrator must enable and enforce table access control for the workspace.
The cluster must be enabled for table access control.
Databricks SQL
See Admin quickstart requirements.

Data governance model


This section describes the Azure Databricks data governance model. Access to securable data objects is
governed by privileges.
Securable objects
The securable objects are:
CATALOG : controls access to the entire data catalog.
SCHEMA : controls access to a schema.
TABLE : controls access to a managed or external table.
VIEW : controls access to SQL views.
FUNCTION : controls access to a named function.

ANONYMOUS FUNCTION : controls access to anonymous or temporary functions.

NOTE
ANONYMOUS FUNCTION objects are not supported in Databricks SQL.

ANY FILE : controls access to the underlying filesystem.


WARNING
Users granted access to ANY FILE can bypass the restrictions put on the catalog, schemas, tables, and views by
reading from the filesystem directly.
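
As an illustrative sketch of how these securables appear in GRANT statements (the accounting schema and finance group mirror the example later in this article; accounting.invoices is a hypothetical table):

-- Illustrative only; full syntax is covered in the Examples section below.
GRANT USAGE ON CATALOG TO finance;
GRANT SELECT ON SCHEMA accounting TO finance;
GRANT SELECT ON TABLE accounting.invoices TO finance;
-- Grant ANY FILE sparingly; see the warning above.
GRANT SELECT ON ANY FILE TO finance;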

Privileges
SELECT : gives read access to an object.
CREATE : gives ability to create an object (for example, a table in a schema).
MODIFY : gives ability to add, delete, and modify data to or from an object.
USAGE : does not give any abilities, but is an additional requirement to perform any action on a schema
object.
READ_METADATA : gives ability to view an object and its metadata.
CREATE_NAMED_FUNCTION : gives ability to create a named UDF in an existing catalog or schema.
MODIFY_CLASSPATH : gives ability to add files to the Spark class path.
ALL PRIVILEGES : gives all privileges (is translated into all the above privileges).

NOTE
The MODIFY_CLASSPATH privilege is not supported in Databricks SQL.

USAGE privilege
To perform an action on a schema object, a user must have the USAGE privilege on that schema in addition to
the privilege to perform that action. Any one of the following satisfies the USAGE requirement:
Be an admin
Have the USAGE privilege on the schema or be in a group that has the USAGE privilege on the schema
Have the USAGE privilege on the CATALOG or be in a group that has the USAGE privilege
Be the owner of the schema or be in a group that owns the schema
Even the owner of an object inside a schema must have the USAGE privilege in order to use it.
As an example, an administrator could define a finance group and an accounting schema for them to use. To
set up a schema that only the finance team can use and share, an admin would do the following:

CREATE SCHEMA accounting;


GRANT USAGE ON SCHEMA accounting TO finance;
GRANT CREATE ON SCHEMA accounting TO finance;

With these privileges, members of the finance group can create tables and views in the accounting schema,
but can’t share those tables or views with any principal that does not have USAGE on the accounting schema.
Databricks Data Science & Engineering and Databricks Runtime version behavior

Clusters running Databricks Runtime 7.3 LTS and above enforce the USAGE privilege.
Clusters running Databricks Runtime 7.2 and below do not enforce the USAGE privilege.
To ensure that existing workloads function unchanged, workspaces that used table access control before the
USAGE privilege was introduced have had the USAGE privilege on CATALOG granted to the users group
automatically. If you want to take advantage of the USAGE privilege, you must run
REVOKE USAGE ON CATALOG FROM users and then GRANT USAGE ... as needed.

Privilege hierarchy
When table access control is enabled on the workspace and on all clusters, SQL objects in Azure Databricks are
hierarchical and privileges are inherited downward. This means that granting or denying a privilege on the
CATALOG automatically grants or denies the privilege to all schemas in the catalog. Similarly, privileges granted
on a schema object are inherited by all objects in that schema. This pattern is true for all securable objects.
If you deny a user privileges on a table, the user can’t see the table by attempting to list all tables in the schema.
If you deny a user privileges on a schema, the user can’t see that the schema exists by attempting to list all
schemas in the catalog.
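As a sketch of this inheritance (reusing the accounting schema and finance group from the example above, plus a hypothetical accounting.salaries table):

-- SELECT granted at the schema level is inherited by every table and view in the schema.
GRANT USAGE, SELECT ON SCHEMA accounting TO finance;

-- Denying SELECT on one table overrides the inherited grant for that table only,
-- and hides the table from the finance group.
DENY SELECT ON TABLE accounting.salaries TO finance;
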
Object ownership
When table access control is enabled on a cluster or SQL warehouse, a user who creates a schema, table, view,
or function becomes its owner. The owner is granted all privileges and can grant privileges to other users.
Groups may own objects, in which case all members of that group are considered owners.
Ownership determines whether or not you can grant privileges on derived objects to other users. For example,
suppose user A owns table T and grants user B SELECT privilege on table T. Even though user B can select from
table T, user B cannot grant SELECT privilege on table T to user C, because user A is still the owner of the
underlying table T. Furthermore, user B cannot circumvent this restriction simply by creating a view V on table T
and granting privileges on that view to user C. When Azure Databricks checks for privileges for user C to access
view V, it also checks that the owner of V and underlying table T are the same. If the owners are not the same,
user C must also have SELECT privileges on underlying table T.
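A minimal sketch of that scenario, using hypothetical object names T and V and placeholder user names:

-- Issued by user A (owner of table T): grant SELECT to user B.
GRANT SELECT ON TABLE T TO `userB@example.com`;

-- Issued by user B: create a view over T and grant SELECT on it to user C.
CREATE VIEW V AS SELECT * FROM T;
GRANT SELECT ON VIEW V TO `userC@example.com`;

-- Because the owner of V (user B) is not the owner of T (user A), user C can
-- query V only after also being granted SELECT on T by user A or an admin.
GRANT SELECT ON TABLE T TO `userC@example.com`;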
When table access control is disabled on a cluster, no owner is registered when a schema, table, view, or function
is created. To test if an object has an owner, run SHOW GRANTS ON <object-name> . If you do not see an entry with
ActionType OWN , the object does not have an owner.
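For example, for a hypothetical table default.sales_raw, the object has an owner only if the following command returns a row with ActionType OWN:

SHOW GRANTS ON TABLE default.sales_raw;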

Assign owner to object


Either the owner of an object or an administrator can transfer ownership of an object using the
ALTER <object> OWNER TO `<user-name>@<user-domain>.com` command:

ALTER SCHEMA <schema-name> OWNER TO `<user-name>@<user-domain>.com`


ALTER TABLE <table-name> OWNER TO `group_name`
ALTER VIEW <view-name> OWNER TO `<user-name>@<user-domain>.com`
ALTER FUNCTION <function-name> OWNER TO `<user-name>@<user-domain>.com`

Users and groups


Administrators and owners can grant privileges to users and groups. Each user is uniquely identified by their
username in Azure Databricks (which typically maps to their email address). All users are implicitly a part of the
“All Users” group, represented as users in SQL.

NOTE
You must enclose user specifications in backticks ( ` ` ), not single quotes ( ' ' ).

Operations and privileges


In Azure Databricks, admin users can manage all object privileges, effectively have all privileges granted on all
securables, and can change the owner of any object. Owners of an object can perform any action on that object,
can grant privileges on that object to other principals, and can transfer ownership of the object to another
principal. The only limit to an owner’s privileges is for objects within a schema; to interact with an object in a
schema the user must also have USAGE on that schema.
The following table maps SQL operations to the privileges required to perform that operation.
NOTE
Any place where a privilege on a table, view, or function is required, USAGE is also required on the schema it’s in.
In any place where a table is referenced in a command, a path could also be referenced. In those instances SELECT or
MODIFY is required on ANY FILE instead of USAGE on the schema and another privilege on the table.
Object ownership is represented here as the OWN privilege.

OPERATION | REQUIRED PRIVILEGES
CLONE | Ability to SELECT from the table being cloned, CREATE on the schema, and MODIFY if a table is being replaced.
COPY INTO | SELECT on ANY FILE if copying from a path, MODIFY on the table being copied into.
CREATE BLOOMFILTER INDEX | OWN on the table being indexed.
CREATE SCHEMA | CREATE on the CATALOG.
CREATE TABLE | Either OWN or both USAGE and CREATE on the schema.
CREATE VIEW | Either OWN or both USAGE and CREATE on the schema.
CREATE FUNCTION (External) | Either OWN or USAGE and CREATE_NAMED_FUNCTION on the schema. If a resource is specified, then MODIFY_CLASSPATH on CATALOG is also required.
CREATE FUNCTION (SQL) | Either OWN or USAGE on the schema.
ALTER SCHEMA | OWN on the schema.
ALTER TABLE | Usually OWN on the table. MODIFY if only adding or removing partitions.
ALTER VIEW | OWN on the view.
DROP BLOOMFILTER INDEX | OWN on the table.
DROP SCHEMA | OWN on the schema.
DROP TABLE | OWN on the table.
DROP VIEW | OWN on the view.
DROP FUNCTION | OWN on the function.
EXPLAIN | READ_METADATA on the tables and views.
DESCRIBE TABLE | READ_METADATA on the table.
DESCRIBE HISTORY | OWN on the table.
SELECT | SELECT on the table.
INSERT | MODIFY on the table.
RESTORE TABLE | MODIFY on the table.
UPDATE | MODIFY on the table.
MERGE INTO | MODIFY on the table.
DELETE FROM | MODIFY on the table.
TRUNCATE TABLE | MODIFY on the table.
OPTIMIZE | MODIFY on the table.
VACUUM | MODIFY on the table.
FSCK REPAIR TABLE | MODIFY on the table.
MSCK | OWN on the table.
GRANT | OWN on the object.
SHOW GRANTS | OWN on the object, or the user subject to the grant.
DENY | OWN on the object.
REVOKE | OWN on the object.

IMPORTANT
When you use table access control, DROP TABLE statements are case sensitive. If a table name is lower case and the
DROP TABLE references the table name using mixed or upper case, the DROP TABLE statement will fail.
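
A minimal sketch, assuming a hypothetical table created with a lower-case name:

CREATE TABLE sales_raw (id INT);

-- Fails under table access control: the mixed-case name does not match the stored name.
DROP TABLE Sales_Raw;

-- Succeeds.
DROP TABLE sales_raw;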

Manage object privileges


You use the GRANT , DENY , REVOKE , MSCK , and SHOW GRANTS operations to manage object privileges.
NOTE
An owner or an administrator of an object can perform GRANT , DENY , REVOKE , and SHOW GRANTS operations.
However, an administrator cannot deny privileges to or revoke privileges from an owner.
A principal that’s not an owner or administrator can perform an operation only if the required privilege has been
granted.
To grant, deny, or revoke a privilege for all users, specify the keyword users after TO . For example,

GRANT SELECT ON ANY FILE TO users

Examples

GRANT SELECT ON SCHEMA <schema-name> TO `<user>@<domain-name>`


GRANT SELECT ON ANONYMOUS FUNCTION TO `<user>@<domain-name>`
GRANT SELECT ON ANY FILE TO `<user>@<domain-name>`

SHOW GRANTS `<user>@<domain-name>` ON SCHEMA <schema-name>

DENY SELECT ON <table-name> TO `<user>@<domain-name>`

REVOKE ALL PRIVILEGES ON SCHEMA default FROM `<user>@<domain-name>`


REVOKE SELECT ON <table-name> FROM `<user>@<domain-name>`

GRANT SELECT ON ANY FILE TO users

Dynamic view functions


Azure Databricks includes two user functions that allow you to express column- and row-level permissions
dynamically in the body of a view definition.
current_user() : return the current user name.
is_member() : determine if the current user is a member of a specific Azure Databricks group.

NOTE
Available in Databricks Runtime 7.3 LTS and above. However, to use these functions in Databricks Runtime 7.3 LTS, you
must set the Spark config spark.databricks.userInfoFunctions.enabled true .

Consider the following example, which combines both functions to determine if a user has the appropriate
group membership:

-- Return: true if the user is a member and false if they are not
SELECT
  current_user as user,
  -- Check to see if the current user is a member of the "Managers" group.
  is_member("Managers") as admin

Allowing administrators to set fine granularity privileges for multiple users and groups within a single view is
both expressive and powerful, while saving on administration overhead.
Column-level permissions
Through dynamic views it’s easy to limit what columns a specific group or user can see. Consider the following
example, where only users who belong to the auditors group are able to see email addresses from the
sales_raw table. At analysis time, Spark replaces the CASE statement with either the literal 'REDACTED' or the
column email. This behavior allows for all the usual performance optimizations provided by Spark.

-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
  user_id,
  CASE WHEN
    is_member('auditors') THEN email
    ELSE 'REDACTED'
  END AS email,
  country,
  product,
  total
FROM sales_raw

Row-level permissions
Using dynamic views you can specify permissions down to the row or field level. Consider the following
example, where only users who belong to the managers group are able to see transaction amounts ( total
column) greater than $1,000,000.00:

CREATE VIEW sales_redacted AS
SELECT
  user_id,
  country,
  product,
  total
FROM sales_raw
WHERE
  CASE
    WHEN is_member('managers') THEN TRUE
    ELSE total <= 1000000
  END;

Data masking
As shown in the preceding examples, you can implement column-level masking to prevent users from seeing
specific column data unless they are in the correct group. Because these views are standard Spark SQL, you can
do more advanced types of masking with more complex SQL expressions. The following example lets all users
perform analysis on email domains, but lets members of the auditors group see users’ full email addresses.

-- The regexp_extract function takes an email address such as
-- user.x.lastname@example.com and extracts 'example', allowing
-- analysts to query the domain name

CREATE VIEW sales_redacted AS
SELECT
  user_id,
  region,
  CASE
    WHEN is_member('auditors') THEN email
    ELSE regexp_extract(email, '^.*@(.*)$', 1)
  END
FROM sales_raw

Frequently asked questions (FAQ)


How do I grant, deny, or revoke a privilege for all users?
Specify the keyword users after TO or FROM . For example:

GRANT SELECT ON TABLE <schema-name>.<table-name> TO users

I created an object but now I can’t query, drop, or modify it.


This error can occur because you created that object on a cluster or SQL warehouse without table access control
enabled. When table access control is disabled on a cluster or SQL warehouse, owners are not registered when a
schema, table, or view is created. An admin must assign an owner to the object using the following command:

ALTER [SCHEMA | TABLE | VIEW] <object-name> OWNER TO `<user-name>@<user-domain>.com`;

How do I grant privileges on global and local temporary views?


Privileges on global and local temporary views are not supported. Local temporary views are visible only within
the same session, and views created in the global_temp schema are visible to all users sharing a cluster or SQL
warehouse. However, privileges on the underlying tables and views referenced by any temporary views are
enforced.
How do I grant a user or group privileges on multiple tables at once?
A grant, deny, or revoke statement can be applied to only one object at a time. The recommended way to
organize and grant privileges on multiple tables to a principal is via schemas. Granting a principal SELECT
privilege on a schema implicitly grants that principal SELECT privileges on all tables and views in that schema.
For example, if a schema D has tables t1 and t2, and an admin issues the following GRANT command:

GRANT USAGE, SELECT ON SCHEMA D TO `<user>@<domain-name>`

The principal <user>@<domain-name> can select from tables t1 and t2, as well as any tables and views created in
schema D in the future.
How do I grant a user privileges on all tables except one?
You grant SELECT privilege to the schema and then deny SELECT privilege for the specific table you want to
restrict access to.

GRANT USAGE, SELECT ON SCHEMA D TO `<user>@<domain-name>`


DENY SELECT ON TABLE D.T TO `<user>@<domain-name>`

The principal <user>@<domain-name> can select from all tables in D except D.T.
A user has SELECT privileges on a view of table T, but when that user tries to SELECT from that view, they get
the error User does not have privilege SELECT on table .
This common error can occur for one of the following reasons:
Table T has no registered owner because it was created using a cluster or SQL warehouse for which table
access control is disabled.
The grantor of the SELECT privilege on a view of table T is not the owner of table T, or the user does not also
have the SELECT privilege on table T.

Suppose there is a table T owned by A. A owns view V1 on T and B owns view V2 on T.


A user can select on V1 when A has granted SELECT privileges on view V1.
A user can select on V2 when A has granted SELECT privileges on table T and B has granted SELECT
privileges on V2.
As described in the Object ownership section, these conditions ensure that only the owner of an object can grant
other users access to that object.
I tried to run sc.parallelize on a cluster with table access control enabled and it fails.
On clusters with table access control enabled you can use only the Spark SQL and Python DataFrame APIs. The
RDD API is disallowed for security reasons, since Azure Databricks does not have the ability to inspect and
authorize code within an RDD.
I want to manage permissions from infrastructure-as-code
You can manage table access control in a fully automated setup using Databricks Terraform provider and
databricks_sql_permissions:

resource "databricks_sql_permissions" "foo_table" {


table = "foo"

privilege_assignments {
principal = "serge@example.com"
privileges = ["SELECT", "MODIFY"]
}

privilege_assignments {
principal = "special group"
privileges = ["SELECT"]
}
}
Access Azure Data Lake Storage using Azure Active
Directory credential passthrough

NOTE
This article contains references to the term whitelisted, a term that Azure Databricks does not use. When the term is
removed from the software, we’ll remove it from this article.

You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage
Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity
that you use to log into Azure Databricks. When you enable Azure Data Lake Storage credential passthrough for
your cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without
requiring you to configure service principal credentials for access to storage.
Azure Data Lake Storage credential passthrough is supported with Azure Data Lake Storage Gen1 and Gen2
only. Azure Blob storage does not support credential passthrough.
This article covers:
Enabling credential passthrough for standard and high-concurrency clusters.
Configuring credential passthrough and initializing storage resources in ADLS accounts.
Accessing ADLS resources directly when credential passthrough is enabled.
Accessing ADLS resources through a mount point when credential passthrough is enabled.
Supported features and limitations when using credential passthrough.
Notebooks are included to provide examples of using credential passthrough with ADLS Gen1 and ADLS Gen2
storage accounts.

Requirements
Premium Plan. See Upgrade or Downgrade an Azure Databricks Workspace for details on upgrading a
standard plan to a premium plan.
An Azure Data Lake Storage Gen1 or Gen2 storage account. Azure Data Lake Storage Gen2 storage accounts
must use the hierarchical namespace to work with Azure Data Lake Storage credential passthrough. See
Create a storage account for instructions on creating a new ADLS Gen2 account, including how to enable the
hierarchical namespace.
Properly configured user permissions to Azure Data Lake Storage. An Azure Databricks administrator needs
to ensure that users have the correct roles, for example, Storage Blob Data Contributor, to read and write data
stored in Azure Data Lake Storage. See Use the Azure portal to assign an Azure role for access to blob and
queue data.
You cannot use a cluster configured with ADLS credentials, for example, service principal credentials, with
credential passthrough.

IMPORTANT
You cannot authenticate to Azure Data Lake Storage with your Azure Active Directory credentials if you are behind a
firewall that has not been configured to allow traffic to Azure Active Directory. Azure Firewall blocks Active Directory
access by default. To allow access, configure the AzureActiveDirectory service tag. You can find equivalent information for
network virtual appliances under the AzureActiveDirectory tag in the Azure IP Ranges and Service Tags JSON file. For
more information, see Azure Firewall service tags and Azure IP Addresses for Public Cloud.

Logging recommendations
You can log identities passed through to ADLS storage in the Azure storage diagnostic logs. Logging identities
allows ADLS requests to be tied to individual users from Azure Databricks clusters. Turn on diagnostic logging
on your storage account to start receiving these logs:
Azure Data Lake Storage Gen1: Follow the instructions in Enable diagnostic logging for your Data Lake
Storage Gen1 account.
Azure Data Lake Storage Gen2: Configure using PowerShell with the Set-AzStorageServiceLoggingProperty
command. Specify 2.0 as the version, because log entry format 2.0 includes the user principal name in the
request.

Enable Azure Data Lake Storage credential passthrough for a High Concurrency cluster
High concurrency clusters can be shared by multiple users. They support only Python and SQL with Azure Data
Lake Storage credential passthrough.

IMPORTANT
Enabling Azure Data Lake Storage credential passthrough for a High Concurrency cluster blocks all ports on the cluster
except for ports 44, 53, and 80.

1. When you create a cluster, set Cluster Mode to High Concurrency.


2. Under Advanced Options , select Enable credential passthrough for user-level data access and
only allow Python and SQL commands .

Enable Azure Data Lake Storage credential passthrough for a Standard cluster
Standard clusters with credential passthrough are limited to a single user. Standard clusters support Python,
SQL, Scala, and R. On Databricks Runtime 6.0 and above, SparkR is supported; on Databricks Runtime 10.1 and
above, sparklyr is supported.
You must assign a user at cluster creation, but the cluster can be edited by a user with Can Manage
permissions at any time to replace the original user.

IMPORTANT
The user assigned to the cluster must have at least Can Attach To permission for the cluster in order to run commands
on the cluster. Admins and the cluster creator have Can Manage permissions, but cannot run commands on the cluster
unless they are the designated cluster user.

1. When you create a cluster, set the Cluster Mode to Standard.


2. Under Advanced Options , select Enable credential passthrough for user-level data access and
select the user name from the Single User Access drop-down.

Create a container
Containers provide a way to organize objects in an Azure storage account.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can
access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2
using an abfss:// path.
Azure Data Lake Storage Gen1
Python

spark.read.format("csv").load("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()

# SparkR
library(SparkR)
sparkR.session()
collect(read.df("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv", source = "csv"))

# sparklyr
library(sparklyr)
sc <- spark_connect(method = "databricks")
sc %>% spark_read_csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv") %>% sdf_collect()

Replace <storage-account-name> with the ADLS Gen1 storage account name.


Azure Data Lake Storage Gen2
Python

spark.read.format("csv").load("abfss://<container-name>@<storage-account-
name>.dfs.core.windows.net/MyData.csv").collect()

# SparkR
library(SparkR)
sparkR.session()
collect(read.df("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv", source =
"csv"))

# sparklyr
library(sparklyr)
sc <- spark_connect(method = "databricks")
sc %>% spark_read_csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv") %>%
sdf_collect()

Replace <container-name> with the name of a container in the ADLS Gen2 storage account.
Replace <storage-account-name> with the ADLS Gen2 storage account name.

Mount Azure Data Lake Storage to DBFS using credential passthrough
You can mount an Azure Data Lake Storage account or a folder inside it to Databricks File System (DBFS). The
mount is a pointer to a data lake store, so the data is never synced locally.
When you mount data using a cluster enabled with Azure Data Lake Storage credential passthrough, any read or
write to the mount point uses your Azure AD credentials. This mount point will be visible to other users, but the
only users that will have read and write access are those who:
Have access to the underlying Azure Data Lake Storage storage account
Are using a cluster enabled for Azure Data Lake Storage credential passthrough
Azure Data Lake Storage Gen1
To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following commands:
Python
configs = {
  "fs.adl.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
  "fs.adl.oauth2.access.token.custom.provider":
    spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName")
}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "CustomAccessTokenProvider",
  "fs.adl.oauth2.access.token.custom.provider" ->
    spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName")
)

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

Replace <storage-account-name> with the ADLS Gen1 storage account name.


Replace <mount-name> with the name of the intended mount point in DBFS.
Azure Data Lake Storage Gen2
To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside it, use the following commands:
Python

configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.azure.account.auth.type" -> "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class" ->
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
)

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

Replace <container-name> with the name of a container in the ADLS Gen2 storage account.
Replace <storage-account-name> with the ADLS Gen2 storage account name.
Replace <mount-name> with the name of the intended mount point in DBFS.
WARNING
Do not provide your storage account access keys or service principal credentials to authenticate to the mount point. That
would give other users access to the filesystem using those credentials. The purpose of Azure Data Lake Storage
credential passthrough is to prevent you from having to use those credentials and to ensure that access to the filesystem
is restricted to users who have access to the underlying Azure Data Lake Storage account.

Security
It is safe to share Azure Data Lake Storage credential passthrough clusters with other users. You will be isolated
from each other and will not be able to read or use each other’s credentials.

Supported features
FEATURE | MINIMUM DATABRICKS RUNTIME VERSION | NOTES
Python and SQL | 5.5 |
Azure Data Lake Storage Gen1 | 5.5 |
%run | 5.5 |
DBFS | 5.5 | Credentials are passed through only if the DBFS path resolves to a location in Azure Data Lake Storage Gen1 or Gen2. For DBFS paths that resolve to other storage systems, use a different method to specify your credentials.
Azure Data Lake Storage Gen2 | 5.5 |
Delta caching | 5.5 |
PySpark ML API | 5.5 | The following ML classes are not supported: org/apache/spark/ml/classification/RandomForestClassifier, org/apache/spark/ml/clustering/BisectingKMeans, org/apache/spark/ml/clustering/GaussianMixture, org/apache/spark/ml/clustering/KMeans, org/apache/spark/ml/clustering/LDA, org/apache/spark/ml/evaluation/ClusteringEvaluator, org/apache/spark/ml/feature/HashingTF, org/apache/spark/ml/feature/OneHotEncoder, org/apache/spark/ml/feature/StopWordsRemover, org/apache/spark/ml/feature/VectorIndexer, org/apache/spark/ml/feature/VectorSizeHint, org/apache/spark/ml/regression/IsotonicRegression, org/apache/spark/ml/regression/RandomForestRegressor, org/apache/spark/ml/util/DatasetUtils
Broadcast variables | 5.5 | Within PySpark, there is a limit on the size of the Python UDFs you can construct, since large UDFs are sent as broadcast variables.
Notebook-scoped libraries | 5.5 |
Scala | 5.5 |
SparkR | 6.0 |
sparklyr | 10.1 |
Modularize or link code in notebooks | 6.1 |
PySpark ML API | 6.1 | All PySpark ML classes supported.
Ganglia UI | 6.1 |
Databricks Connect | 7.3 | Passthrough is supported on Standard clusters.

Limitations
The following features are not supported with Azure Data Lake Storage credential passthrough:
%fs (use the equivalent dbutils.fs command instead).
The Databricks REST API.
Table access control. The permissions granted by Azure Data Lake Storage credential passthrough could be
used to bypass the fine-grained permissions of table ACLs, while the extra restrictions of table ACLs will
constrain some of the benefits you get from credential passthrough. In particular:
If you have Azure AD permission to access the data files that underlie a particular table you will have
full permissions on that table via the RDD API, regardless of the restrictions placed on them via table
ACLs.
You will be constrained by table ACLs permissions only when using the DataFrame API. You will see
warnings about not having permission SELECT on any file if you try to read files directly with the
DataFrame API, even though you could read those files directly via the RDD API.
You will be unable to read from tables backed by filesystems other than Azure Data Lake Storage, even
if you have table ACL permission to read the tables.
The following methods on SparkContext ( sc ) and SparkSession ( spark ) objects:
Deprecated methods.
Methods such as addFile() and addJar() that would allow non-admin users to call Scala code.
Any method that accesses a filesystem other than Azure Data Lake Storage Gen1 or Gen2 (to access
other filesystems on a cluster with Azure Data Lake Storage credential passthrough enabled, use a
different method to specify your credentials and see the section on trusted filesystems under
Troubleshooting).
The old Hadoop APIs ( hadoopFile() and hadoopRDD() ).
Streaming APIs, since the passed-through credentials would expire while the stream was still running.
The FUSE mount ( /dbfs ) is available only in Databricks Runtime 7.3 LTS and above. Mount points with
credential passthrough configured are not supported through the FUSE mount.
Azure Data Factory.
MLflow on high concurrency clusters.
azureml-sdk Python package on high concurrency clusters.
You cannot extend the lifetime of Azure Active Directory passthrough tokens using Azure Active Directory
token lifetime policies. As a consequence, if you send a command to the cluster that takes longer than an
hour, it will fail if an Azure Data Lake Storage resource is accessed after the 1 hour mark.
When using Hive 2.3 and above you can’t add a partition on a cluster with credential passthrough enabled.
For more information, see the relevant troubleshooting section.

Example notebooks
The following notebooks demonstrate Azure Data Lake Storage credential passthrough for Azure Data Lake
Storage Gen1 and Gen2.
Azure Data Lake Storage Gen1 passthrough notebook
Get notebook
Azure Data Lake Storage Gen2 passthrough notebook
Get notebook

Troubleshooting
py4j.security.Py4JSecurityException: … is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as
safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method
could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s
credentials.
org.apache.spark.api.python.PythonSecurityException: Path … uses an untrusted filesystem
This exception is thrown when you have tried to access a filesystem that is not known by the Azure Data Lake
Storage credential passthrough cluster to be safe. Using an untrusted filesystem might allow a user on an Azure
Data Lake Storage credential passthrough cluster to access another user’s credentials, so we disallow all
filesystems that we are not confident are being used safely.
To configure the set of trusted filesystems on an Azure Data Lake Storage credential passthrough cluster, set the
Spark conf key spark.databricks.pyspark.trustedFilesystems on that cluster to be a comma-separated list of the
class names that are trusted implementations of org.apache.hadoop.fs.FileSystem .
Adding a partition fails with AzureCredentialNotFoundException when credential passthrough is enabled
When using Hive 2.3-3.1, if you try to add a partition on a cluster with credential passthrough enabled, the
following exception occurs:

org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could
not find ADLS Gen2 Token

To work around this issue, add partitions on a cluster without credential passthrough enabled.
Data sharing guide

This guide shows how you can use Delta Sharing to share data in Azure Databricks with recipients outside your
organization.
Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations
regardless of which computing platforms they use. Delta Sharing is available for data in a Unity Catalog
metastore.
Unity Catalog (Preview) is a secure metastore developed by Databricks. Unity Catalog centralizes metadata and
governance of an organization’s data. With Unity Catalog, data governance rules scale with your needs,
regardless of the number of workspaces or the business intelligence tools your organization uses. See Get
started using Unity Catalog.
To share data using Delta Sharing:
1. You load the data into a Unity Catalog metastore.
You can create new tables and insert records into them, or you can import existing tables into Unity
Catalog from a workspace’s local Hive metastore.
2. You enable Delta Sharing on the metastore.
3. You create shares and recipients. Shares and recipients are Delta Sharing objects.
A share is a read-only collection of tables and table partitions to be shared with one or more
recipients. A metastore can have multiple shares, and you can control which recipients have access to
each share. A single metastore can contain multiple shares, but each share can belong to only one
metastore. If you remove a share, all recipients of that share lose the ability to access it.
A recipient is an object that associates an organization with a credential that allows the organization to
access one or more shares. When you create a recipient, a downloadable credential is generated for
that recipient. Each metastore can have multiple recipients, but each recipient can belong to only one
metastore. A recipient can have access to multiple shares. If you remove a recipient, that recipient
loses access to all shares it could previously access.
4. After creating a recipient and granting the recipient access to shares, use a secure channel to
communicate with the recipient, and share with them the unique URL where they can download the
credential.
A credential can be downloaded only one time. Databricks recommends the use of a password manager
for storing and sharing a downloaded credential.
Also share with them the documentation for Delta Sharing data recipients. They can use this
documentation to access the data you share with them.
5. At any time, you can modify the contents of a share, modify the shares to which a recipient has access, or
drop a share or a recipient.
6. Data recipients have immediate read-only access to the live, up-to-date data you share with them.
7. A data provider can enable audit logs for Delta Sharing to understand who is creating shares and
recipients and which recipients are accessing which shares.
8. A data recipient who uses Azure Databricks to access Delta Sharing data can also enable audit logs to
understand who is accessing which Delta Sharing data.
In this guide:
Share data using Delta Sharing (Preview)
Access data shared with you using Delta Sharing
Delta Sharing IP access list guide
Share data using Delta Sharing (Preview)

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

This article shows how to share data in Unity Catalog with data recipients outside your organization. If you are a
data recipient, see Access data shared with you using Delta Sharing.

How does Delta Sharing work?


Delta Sharing (Preview) is an open protocol developed by Databricks for secure data sharing with other
organizations regardless of which computing platforms they use. A Databricks user, called a “data provider”, can
use Delta Sharing to share data with a person or group outside of their organization, called a “data recipient”.
Data recipients can immediately begin working with the latest version of the shared data. For a full list of
connectors and information about how to use them, see the Delta Sharing documentation. When Delta Sharing
is enabled on a metastore, Unity Catalog runs a Delta Sharing server.

The shared data is not provided by Databricks directly but by data providers running on Databricks.

NOTE
By accessing a data provider’s shared data as a data recipient, data recipient represents that it has been authorized to
access the data share(s) provided to it by the data provider and acknowledges that (1) Databricks has no liability for such
data or data recipient’s use of such shared data, and (2) Databricks may collect information about data recipient’s use of
and access to the shared data (including identifying any individual or company who accesses the data using the credential
file in connection with such information) and may share it with the applicable data provider.

Requirements
Unity Catalog must be enabled, and at least one metastore must exist.
Only an account admin can enable Delta Sharing for a metastore.
Only a metastore admin or account admin can share data using Delta Sharing. See Metastore admin.
To rotate a recipient’s credential, you must use the Unity Catalog CLI. See (Optional) Install the Unity Catalog
CLI.
To manage shares and recipients, you can use a Data Science & Engineering notebook or a Databricks SQL
query.
Key concepts
In Delta Sharing, a share is a read-only collection of tables and table partitions to be shared with one or more
recipients. A metastore can have multiple shares, and you can control which recipients have access to each
share. A single metastore can contain multiple shares, but each share can belong to only one metastore. If you
remove a share, all recipients of that share lose the ability to access it.
A recipient is an object that associates an organization with a credential that allows the organization to access
one or more shares. When you create a recipient, a downloadable credential is generated for that recipient. Each
metastore can have multiple recipients, but each recipient can belong to only one metastore. A recipient can
have access to multiple shares. If you remove a recipient, that recipient loses access to all shares it could
previously access.
Shares and recipients exist independent of each other.

To grant a recipient access to a share, you complete the following steps (a consolidated SQL sketch follows the list):


1. Join the Delta Sharing Public Preview by enabling the External Data Sharing feature group for your Azure
Databricks account.
2. Enable Delta Sharing for each Unity Catalog metastore.
3. Create a share associated with one or more tables in the metastore.
4. Create a recipient. A set of credentials is generated for that recipient.
5. Grant the recipient privileges on one or more shares.
6. Using a secure channel, send the recipient an activation link that allows them to download their credentials.
7. The recipient uses the credential to access the share.
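The SQL portion of this workflow (steps 3 through 5, plus retrieving the activation link for step 6) looks roughly like the following sketch; all names are placeholders, and each command is described in detail later in this article:

-- Step 3: create a share and add a table from the metastore to it.
CREATE SHARE IF NOT EXISTS <share_name>;
ALTER SHARE <share_name> ADD TABLE <catalog_name>.<schema_name>.<table_name>;

-- Step 4: create a recipient; this generates a credential and an activation link.
CREATE RECIPIENT IF NOT EXISTS <recipient_name>;

-- Step 5: grant the recipient SELECT on the share.
GRANT SELECT ON SHARE <share_name> TO RECIPIENT <recipient_name>;

-- Step 6: retrieve the activation link to send to the recipient over a secure channel.
DESCRIBE RECIPIENT <recipient_name>;
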
The following sections show how to enable Delta Sharing for an Azure Databricks account, enable Delta Sharing
on a metastore, manage shares and recipients, and audit Delta Sharing activity.
Enable the External Data Sharing feature group for your account
To participate in the Delta Sharing Public Preview, an account admin must enable the External Data Sharing
feature group for your Azure Databricks account.
1. As an Azure Databricks account admin, log in to the Account Console.

2. In the sidebar, click Settings .


3. Go to the Feature enablement tab.
4. On the External Data Sharing Feature Group row, click the Enable button.
Click the Terms link to review the applicable terms for Delta Sharing. Clicking Enable represents
acceptance of these terms.

Enable Delta Sharing on a metastore


Follow these steps for each metastore where you plan to share data using Delta Sharing.
1. Log into the account console.

a. In a workspace, click Settings .


b. Click Manage account .
2. In the sidebar, click Data .
3. Click the name of a metastore to open its details.
4. Click the checkbox next to Enable Delta Sharing and allow a Databricks user to share data
outside their organization .
5. Databricks recommends that you configure the default recipient token lifetime. This is the time after
which recipient tokens expire and must be regenerated. If you do not configure this setting, recipient
tokens do not expire.
When you change the default recipient lifetime, the recipient token lifetime for existing recipients is not
automatically updated. To update the recipient token lifetime for a given recipient, see Rotate a recipient’s
credential.
To configure the default recipient token lifetime:
a. Enable Set expiration .
b. Enter a number of seconds, minutes, hours, or days, and select the unit of measure.
c. Click Enable .
For more information, see Security recommendations for recipients.

(Optional) Install the Unity Catalog CLI


To manage shares and recipients, you can use SQL commands or the Unity Catalog CLI. The CLI runs in your
local environment and does not require Azure Databricks compute resources.
To install the CLI, see (Optional) Install the Unity Catalog CLI in the Unity Catalog documentation.

Modify the recipient token lifetime


When Delta Sharing is enabled, follow these steps to modify the default recipient token lifetime.
NOTE
The recipient token lifetime for existing recipients is not automatically updated when you change the default recipient
token lifetime for a metastore. To update the recipient token lifetime for a given recipient, see Rotate a recipient’s
credential.

1. Log into the account console.

a. In a workspace, click Settings .


b. Click Manage account .
2. In the sidebar, click Data .
3. Click the name of a metastore to open its details.
4. Enable Set expiration .
5. Enter a number of seconds, minutes, hours, or days, and select the unit of measure.
6. Click Enable .
If you disable Set expiration , recipient tokens do not expire. For more information, see Security
recommendations for recipients.

Manage shares
In Delta Sharing, a share is a named object that contains a collection of tables in a metastore that you want to
share as a group. A share can contain tables from only a single metastore. You can add or remove tables from a
share at any time.
The following sections show how to create, describe, update and delete shares.
Create a share
To create a share, run the following command in a notebook or the Databricks SQL editor. Replace the
placeholder values:
<share_name> : a descriptive name for the share.
<comment> : a comment describing the share.

CREATE SHARE [IF NOT EXISTS] <share_name> [COMMENT <comment>];

After you create a share, you can add tables to it and associate it with one or more recipients.
List shares
To list all shares, use the SHOW SHARES command.

SHOW SHARES;

Describe a share
To list a share’s metadata and all tables associated with a share, use the DESCRIBE SHARE command. Replace
<share_name> with the name of the share:

DESCRIBE SHARE <share_name>;


To list all tables in a share, use the SHOW ALL IN SHARE command.

SHOW ALL IN SHARE <share_name>;

Add or remove tables from a share


In the following commands, replace the placeholder values:
<share_name> : A name for the share.
<catalog_name> : The name of the catalog in the metastore.
<schema_name> : The name of the schema in the metastore.
<table_name> : The name of the table to add.
AS <new_schema_name>.<new_table_name> : Optionally share the table with a new name.
<comment> : A comment describing the share.

To associate a table with the share, use the ALTER SHARE ADD TABLE command.

ALTER SHARE <share_name>
  ADD TABLE <catalog_name>.<schema_name>.<table_name> [deltaSharingPartitionListSpec]
  [AS <new_schema_name>.<new_table_name>]
  [COMMENT <comment>];

For details about sharing a partition, see Partition specifications.


To remove a table from a share, use the ALTER SHARE REMOVE TABLE command.

ALTER SHARE <share_name> REMOVE TABLE <catalog_name>.<schema_name>.<table_name>;

NOTE
You can provide either the original schema and table name in the metastore or the names defined in the share.

When you add or remove tables from a share, the change takes effect the next time a recipient accesses the
share.
Partition specifications
To share only part of a table when adding the table to a share, you can provide a partition specification. The
following example shares part of the data in the inventory table, given that the table is partitioned by year ,
month , and date columns.

Data for the year 2021.


Data for December 2020.
Data for December 25, 2019.

ALTER SHARE share_name
  ADD TABLE inventory
  PARTITION (year = "2021"),
            (year = "2020", month = "Dec"),
            (year = "2019", month = "Dec", date = "2019-12-25");

Delete a share
To delete a share, use the DROP SHARE command. Recipients can no longer access data that was previously
shared. Replace <share_name> with the name of the share.
DROP SHARE [IF EXISTS] <share_name>;

Manage recipients
A recipient is a named set of credentials that represents an organization with whom to share data. This section
shows how to manage recipients in Delta Sharing.
Create a recipient
Use the CREATE RECIPIENT command to create a recipient. Replace the placeholder values:
<recipient_name> : A descriptive name for the recipient.
<comment> : A comment with more information.

CREATE RECIPIENT [IF NOT EXISTS] <recipient_name> COMMENT <comment>

The recipient’s token will expire after the recipient token lifetime has elapsed. For more information, see Security
recommendations for recipients.
After creating a recipient:
1. Use the DESCRIBE RECIPIENT command to get their activation link.
2. Use a secure channel to share the activation link with them, along with the article showing how to access
shared data. The activation link can be accessed only a single time. Recipients should treat the downloaded
credential as a secret and must not share it outside of their organization. If necessary, you can rotate a
recipient’s credential.
3. Grant them access to shares.
List recipients
The SHOW RECIPIENTS command lists all recipients. Optionally, replace <pattern> with a LIKE predicate.

SHOW RECIPIENTS [LIKE <pattern>];

Describe a recipient
To view details about a recipient, including its creator, creation timestamp, token lifetime, activation link, and
whether the credential has been downloaded, use the DESCRIBE RECIPIENT command. Replace <recipient_name>
with the name of the recipient.

DESCRIBE RECIPIENT <recipient_name>;

To show grants to a recipient, see Manage privileges for a recipient.


Delete a recipient
To delete a recipient, use the DROP RECIPIENT command. Replace <recipient_name> with the name of the
recipient to drop. When you drop a recipient, the credential is invalidated and they can no longer view shares.

DROP RECIPIENT [IF EXISTS] <recipient_name>

Rotate a recipient’s credential


You should rotate a recipient’s credential and generate a new activation link:
When the existing recipient token is about to expire.
If a recipient loses their activation URL or if it is compromised.
If the credential is corrupted, lost, or compromised after it is downloaded by a recipient.
To update the recipient’s token lifetime after you modify the recipient token lifetime for a metastore.
1. If you have not already done so, install the Unity Catalog CLI.
2. Run the following command using the Unity Catalog CLI. Arguments in brackets are optional. Replace the
placeholder values:
<recipient_name> : the name of the recipient.
<expiration_seconds> : Optional. The number of seconds until the existing recipient token should
expire. During this period, the existing token will continue to work. A value of 0 means the
existing token expires immediately.

databricks unity-catalog rotate-recipient-token \
  --name <recipient_name> \
  [--existing-token-expire-in-seconds <expiration_seconds>]

3. View the activation URL by using the DESCRIBE RECIPIENT <recipient_name> command, and share it with
the recipient over a secure channel.

Manage privileges for a recipient


After you have created a share and a recipient, use GRANT and REVOKE statements to grant the recipient access
to the share. In the following examples, replace the placeholder values:
<share_name> : the name of the share.
<recipient_name> : the name of the recipient

To show all grants on a share:

SHOW GRANTS ON SHARE <share_name>;

To view all grants to a recipient:

SHOW GRANTS TO RECIPIENT <recipient_name>;

To grant access:

GRANT SELECT
ON SHARE <share_name>
TO RECIPIENT <recipient_name>;

To revoke access:

REVOKE SELECT
ON SHARE <share_name>
FROM RECIPIENT <recipient_name>;
NOTE
SELECT is the only privilege you can grant on a share.

Security recommendations for recipients


When you enable Delta Sharing, you configure the token lifetime for recipient credentials. If you set the token
lifetime to 0 , recipient tokens never expire. Databricks recommends that you configure tokens to expire.
In the following situations, you should rotate a recipient’s credential:
When the existing recipient token is about to expire.
If you modify the recipient token lifetime and you want to apply the new lifetime to existing recipients.
If a recipient loses their activation URL or if it is compromised.
To update the recipient’s token lifetime after you modify the recipient token lifetime.
At any given time, a recipient can have at most two tokens: an active token and a rotated token. Until the rotated
token expires, attempting to rotate the token again results in an error.
When you rotate a recipient’s credential, you can optionally set --existing-token-expire-in-seconds to the
number of seconds before the existing recipient token expires. If you set the value to 0 , the existing recipient
token expires immediately.
Databricks recommends that you set --existing-token-expire-in-seconds to a relatively short period that gives
the recipient organization time to access the new activation URL while minimizing the amount of time that the
recipient has two active tokens. If you suspect that the recipient token is compromised, Databricks recommends
that you force the existing recipient token to expire immediately.
If a recipient’s existing activation URL has never been accessed and the recipient has not been rotated, rotating
the recipient invalidates the existing activation URL and replaces it with a new one.
If all recipient tokens have expired, rotating the recipient replaces the existing activation URL with a new one.
Databricks recommends that you promptly rotate or drop a recipient whose token has expired.
If a recipient activation link is inadvertently sent to the wrong person or is sent over an insecure channel,
Databricks recommends that you:
1. Revoke the recipient’s access to the share.
2. Rotate the recipient and set --existing-token-expire-in-seconds to 0 .
3. Share the new activation link with the intended recipient over a secure channel.
4. After the activation URL has been accessed, grant the recipient access to the share again.
In an extreme situation, Databricks recommends that you drop and re-create the recipient.

Audit access and activity for Delta Sharing resources


After you configure diagnostic logging, Delta Sharing saves audit logs for activities such as when someone
creates, modifies, updates, or deletes a share or a recipient, when a recipient accesses an activation link and
downloads the credential, or when a recipient’s credential is rotated or expires. Delta Sharing activity is logged at
the account level.
1. Enable diagnostic logs for your account.

IMPORTANT
Delta Sharing activity is logged at the level of the account. Do not enter a value into workspace_ids_filter .
Audit logs are delivered for each workspace in your account, as well as account-level activities. Logs are
delivered to the storage container you configure.
2. Events for Delta Sharing are logged with serviceName set to unityCatalog . The requestParams section of
each event includes a delta_sharing prefix.
For example, the following audit event shows an update to the recipient token lifetime. In this example,
redacted values are replaced with <redacted> .

{
"version":"2.0",
"auditLevel":"ACCOUNT_LEVEL",
"timestamp":1629775584891,
"orgId":"3049059095686970",
"shardName":"example-workspace",
"accountId":"<redacted>",
"sourceIPAddress":"<redacted>",
"userAgent":"curl/7.64.1",
"sessionId":"<redacted>",
"userIdentity":{
"email":"<redacted>",
"subjectName":null
},
"serviceName":"unityCatalog",
"actionName":"updateMetastore",
"requestId":"<redacted>",
"requestParams":{
"id":"<redacted>",
"delta_sharing_enabled":"true"
"delta_sharing_recipient_token_lifetime_in_seconds": 31536000
},
"response":{
"statusCode":200,
"errorMessage":null,
"result":null
},
"MAX_LOG_MESSAGE_LENGTH":16384
}

The following table lists audited events for Delta Sharing, from the point of view of the data provider.

NOTE
The following important fields are always present in the audit log:
userIdentity.email : The ID of the user who initiated the activity.
requestParams.id : The ID of the Unity Catalog metastore.

updateMetastore
  delta_sharing_enabled : If present, indicates that Delta Sharing was enabled.
  delta_sharing_recipient_token_lifetime_in_seconds : If present, indicates that the recipient token lifetime was updated.

createRecipient
  name : The name of the recipient.
  comment : The comment for the recipient.

deleteRecipient
  name : The name of the recipient.

getRecipient
  name : The name of the recipient.

listRecipients
  none

rotateRecipientToken
  name : The name of the recipient.
  comment : The comment given in the rotation command.

createShare
  name : The name of the share.
  comment : The comment for the share.

deleteShare
  name : The name of the share.

getShare
  name : The name of the share.
  include_shared_objects : Whether the share’s table names were included in the request.

updateShare
  name : The name of the share.
  updates : A JSON representation of tables that were added to or removed from the share. Each item includes action (add or remove), name (the actual name of the table), shared_as (the name the schema and table were shared as, if different from name), and partition_specification (if a partition specification was provided).

listShares
  none

getSharePermissions
  name : The name of the share.

updateSharePermissions
  name : The name of the share.
  changes : A JSON representation of the updated permissions. Each change includes principal (the user or group to whom permission is granted or revoked), add (the list of permissions that were granted), and remove (the list of permissions that were revoked).

getRecipientSharePermissions
  name : The name of the share.

getActivationUrlInfo
  recipient_name : The name of the recipient who opened the activation URL.

retrieveRecipientToken
  recipient_name : The name of the recipient who downloaded the token.

The following Delta Sharing errors are logged, from the point of view of the data provider. Items between <
and > characters represent placeholder text.
Delta Sharing is not enabled on the selected metastore.

DatabricksServiceException: FEATURE_DISABLED:
Delta Sharing is not enabled

An operation was attempted on a catalog that does not exist.

DatabricksServiceException: CATALOG_DOES_NOT_EXIST:
Catalog ‘xxx’ does not exist.

A user who is not an account admin or metastore admin attempted to perform a privileged operation.

DatabricksServiceException: PERMISSION_DENIED:
Only administrators can <operation_name> <operation_target>

An operation was attempted on a metastore from a workspace to which the metastore is not assigned.

DatabricksServiceException: INVALID_STATE:
Workspace <workspace_name> is no longer assigned to this metastore

A request was missing the recipient name or share name.

DatabricksServiceException: INVALID_PARAMETER_VALUE: CreateRecipient/CreateShare Missing required
field: <recipient_name>/<share_name>

A request included an invalid recipient name or share name.

DatabricksServiceException: INVALID_PARAMETER_VALUE: CreateRecipient/CreateShare
<recipient_name>/<share_name> is not a valid name

A user attempted to share a table that is not in a Unity Catalog metastore.

DatabricksServiceException: INVALID_PARAMETER_VALUE: Only managed or external table on Unity Catalog
can be added to a share.

A user attempted to rotate a recipient that was already in a rotated state and whose previous token had
not yet expired.

DatabricksServiceException: INVALID_PARAMETER_VALUE: There are already two active tokens for recipient
<recipient_name>.

A user attempted to create a new recipient or share with the same name as an existing one.

DatabricksServiceException: RECIPIENT_ALREADY_EXISTS/SHARE_ALREADY_EXISTS: Recipient/Share <name>
already exists

A user attempted to perform an operation on a recipient or share that does not exist.

DatabricksServiceException: RECIPIENT_DOES_NOT_EXIST/SHARE_DOES_NOT_EXIST: Recipient/Share '<name>'
does not exist.

A user attempted to add a table to a share, but the table had already been added.

DatabricksServiceException: RESOURCE_ALREADY_EXISTS: Shared Table '<name>' already exists.

A user attempted to perform an operation that referenced a table that does not exist.

DatabricksServiceException: TABLE_DOES_NOT_EXIST: Table '<name>' does not exist.

A user attempted to perform an operation that referenced a schema that does not exist.

DatabricksServiceException: SCHEMA_DOES_NOT_EXIST: Schema '<name>' does not exist.
For auditable events and errors for data recipients, see Audit access and activity for Delta Sharing resources.

Limitations
Only tables stored in a Unity Catalog metastore can be shared with Delta Sharing.
Only managed and external tables in Delta format are supported.
Sharing views is not supported in this preview.

Next steps
Learn how a recipient can access shares with Delta Sharing.
Learn more about Unity Catalog.
Access data shared with you using Delta Sharing
7/21/2022 • 11 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

This article shows data recipients how to access data shared from Delta Sharing using Databricks, Apache Spark,
and pandas.

Delta Sharing and data recipients


Delta Sharing is an open standard for secure data sharing. A Databricks user, called a “data provider”, can use
Delta Sharing to share data with a person or group outside of their organization, called a “data recipient”. For a
full list of connectors and information about how to use them, see the Delta Sharing documentation. This article
shows data recipients how to access data shared from Delta Sharing using Databricks, Apache Spark, and
pandas.
The shared data is not provided by Databricks directly but by data providers running on Databricks.

NOTE
By accessing a data provider’s shared data as a data recipient, data recipient represents that it has been authorized to
access the data share(s) provided to it by the data provider and acknowledges that (1) Databricks has no liability for such
data or data recipient’s use of such shared data, and (2) Databricks may collect information about data recipient’s use of
and access to the shared data (including identifying any individual or company who accesses the data using the credential
file in connection with such information) and may share it with the applicable data provider.

Download the credential file


To grant you access to shared data, the data provider sends you an activation URL, where you can download a
credential file. This article shows how you can use the credential file to access the shared data.

IMPORTANT
You can download a credential file only one time. If you visit the activation link again after you have already downloaded
the credential file, the Download Credential File button is disabled.
Don’t further share the activation link or share the credential file you have received with anyone outside of your
organization.
If you lose the activation link before using it, contact the data provider.

1. Click the activation link shared with you by the data provider. The activation page opens in your browser.
2. Click Download Credential File .
Store the credential file in a secure location. If you need to share it with someone in your organization,
Databricks recommends using a password manager.

Read the shared data


After you download the credential file, your code uses the credential file to authenticate to the data provider’s
Databricks account and read the data the data provider shared with you. The access will persist until the
provider stops sharing the data with you. Updates to the data are available to you in near real time. You can read
and make copies of the shared data, but you can’t modify the source data.
The following sections show how to access shared data using Databricks, Apache Spark, and pandas. If you run
into trouble accessing the shared data, contact the data provider.
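At a high level, the flow is: authenticate with the credential file, list what has been shared with you, and then load a table. The following Python sketch strings together the same delta-sharing connector calls that the sections below walk through step by step; the profile path, share, schema, and table names are placeholders that you replace with your own values.

import delta_sharing

# Point the client at the credential (profile) file you downloaded.
profile = "<profile_path>/config.share"
client = delta_sharing.SharingClient(profile)

# Discover the tables the provider has shared with you.
print(client.list_all_tables())

# Load one shared table into a pandas DataFrame.
df = delta_sharing.load_as_pandas(f"{profile}#<share_name>.<schema_name>.<table_name>")
print(df.head())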

NOTE
Partner integrations are, unless otherwise noted, provided by the third parties and you must have an account with the
appropriate provider for the use of their products and services. While Databricks does its best to keep this content up to
date, we make no representation regarding the integrations or the accuracy of the content on the partner integration
pages. Reach out to the appropriate providers regarding the integrations.

Use Databricks
Follow these steps to access shared data in an Azure Databricks workspace using notebook commands. You store
the credential file in DBFS, then use it to authenticate to the data provider’s Azure Databricks account and read
the data the data provider shared with you.
In this example, you create a notebook with multiple cells you can run independently. You could instead add the
notebook commands to the same cell and run them in a sequence.
1. In a text editor, open the credential file you downloaded.

2. Click Workspace .
3. Right-click a folder, then click Create > Notebook .
Enter a name.
Set the default language for the notebook to Python. This is the default.
Select a cluster to attach to the notebook. Select a cluster that runs Databricks Runtime 8.4 or above
or a cluster with the Apache Spark connector library installed. For more information about installing
cluster libraries, see Libraries.
Click Create .
The notebook opens in the notebook editor.
4. To use Python or pandas to access the shared data, install the delta-sharing Python connector. In the
notebook editor, paste the following command:

%sh pip install delta-sharing

5. In the cell actions menu at the far right, select Run Cell , or press Shift+Enter.
The delta-sharing Python library is installed in the cluster if it isn’t already installed.
6. In the notebook editor, select Add Cell Below . In the new cell, paste the following
command, which uploads the contents of the credential file to a folder in DBFS. Replace the variables as
follows:
<dbfs-path> : the path to the folder where you want to save the credential file
<credential-file-contents> : the contents of the credential file.
The credential file contains JSON which defines three fields: shareCredentialsVersion , endpoint ,
and bearerToken .

%scala
dbutils.fs.put("<dbfs-path>/config.share","""
<credential-file-contents>
""")

7. In the cell actions menu at the far right, select Run Cell , or press Shift+Enter.
After the credential file is uploaded, you can delete this cell. All workspace users can read the credential
file from DBFS, and the credential file is available in DBFS on all clusters and SQL warehouses. To delete
the cell, click x in the cell actions menu at the far right.
8. Using Python, list the tables in the share. In the notebook editor, select Add Cell Below .
9. In the new cell, paste the following command. Replace <dbfs-path> with the path from the previous
command.
When the code runs, Python reads the credential file from DBFS on the cluster. DBFS is mounted using
FUSE at /dbfs/ .

import delta_sharing

client = delta_sharing.SharingClient(f"/dbfs/<dbfs-path>/config.share")

client.list_all_tables()

10. In the cell actions menu at the far right, select Run Cell , or press Shift+Enter.
The result is an array of tables, along with metadata for each table. The following output shows two
tables:

Out[10]: [Table(name='example_table', share='example_share_0', schema='default'),
Table(name='other_example_table', share='example_share_0', schema='default')]

If the output is empty or doesn’t contain the tables you expect, contact the data provider.
11. Using Scala, query a shared table. In the notebook editor, select Add Cell Below . In the new
cell, paste the following command. When the code runs, the credential file is read from DBFS through the
JVM.
Replace the variables:
<profile_path> : the DBFS path of the credential file. For example, /<dbfs-path>/config.share .
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.
%scala
spark.read.format("deltaSharing")
.load("<profile_path>#<share_name>.<schema_name>.<table_name>").limit(10);

12. In the cell actions menu at the far right, select Run Cell , or press Shift+Enter.
Each time you load the shared table, you see fresh data from the source.
13. To query the shared data using SQL, you must first create a local table in the workspace from the shared
table, then query the local table.
The shared data is not stored or cached in the local table. Each time you query the local table, you see the
current state of the shared data.
In the notebook editor, select Add Cell Below . In the new cell, paste the following
command.
Replace the variables:
<local_table_name> : the name of the local table.
<profile_path> : the location of the credential file.
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.

%sql
DROP TABLE IF EXISTS <local_table_name>;

CREATE TABLE <local_table_name> USING deltaSharing LOCATION "<profile_path>#<share_name>.<schema_name>.<table_name>";

SELECT * FROM <local_table_name> LIMIT 10;

14. In the cell actions menu at the far right, select Run Cell , or press Shift+Enter.
When you run the command, the shared data is queried directly. As a test, the table is queried and the
first 10 results are returned.
If the output is empty or doesn’t contain the data you expect, contact the data provider.

Audit access and activity for Delta Sharing resources


After you configure diagnostic logging, Delta Sharing saves audit logs for activities such as when someone
creates, modifies, updates, or deletes a share or a recipient, when a recipient accesses an activation link and
downloads the credential, or when a recipient’s credential is rotated or expires. Delta Sharing activity is logged at
the account level.
1. Enable diagnostic logs for your account.

IMPORTANT
Delta Sharing activity is logged at the level of the account. Do not enter a value into workspace_ids_filter .
Audit logs are delivered for each workspace in your account, as well as account-level activities. Logs are
delivered to the storage container you configure.
2. Events for Delta Sharing are logged with serviceName set to unityCatalog . The requestParams section of
each event includes the following fields, which you can share with the data provider to help them
troubleshoot issues.
recipient_name: The name of the recipient in the data provider’s system.
metastore_id : The name of the metastore in the data provider’s system.
sourceIPAddress : The IP address where the request originated.
For example, the following audit event shows that a recipient successfully listed the shares that were
available to them. In this example, redacted values are replaced with <redacted> .

{
"version":"2.0",
"auditLevel":"ACCOUNT_LEVEL",
"timestamp":1635235341950,
"orgId":"0",
"shardName":"<redacted>",
"accountId":"<redacted>",
"sourceIPAddress":"<redacted>",
"userAgent":null,
"sessionId":null,
"userIdentity":null,
"serviceName":"unityCatalog",
"actionName":"deltaSharingListShares",
"requestId":"ServiceMain-cddd3114b1b40003",
"requestParams":{
"metastore_id":"<redacted>",
"options":"{}",
"recipient_name":"<redacted>"
},
"response":{
"statusCode":200,
"errorMessage":null,
"result":null
},
"MAX_LOG_MESSAGE_LENGTH":16384
}

The following table lists audited events for Delta Sharing, from the point of view of the data recipient.

deltaSharingListShares
  options : The pagination options provided with this request.

deltaSharingListSchemas
  share : The name of the share.
  options : The pagination options provided with this request.

deltaSharingListTables
  share : The name of the share.
  options : The pagination options provided with this request.

deltaSharingListAllTables
  share : The name of the share.

deltaSharingGetTableVersion
  share : The name of the share.
  schema : The name of the schema.
  name : The name of the table.

deltaSharingGetTableMetadata
  share : The name of the share.
  schema : The name of the schema.
  name : The name of the table.
  predicateHints : The predicates included in the query.
  limitHints : The maximum number of rows to return.

The following Delta Sharing errors are logged, from the point of view of the data recipient. Items between <
and > characters represent placeholder text.
The user attempted to access a share they do not have permission to access.

DatabricksServiceException: PERMISSION_DENIED:
User does not have SELECT on Share <share_name>

The user attempted to access a share that does not exist.

DatabricksServiceException: SHARE_DOES_NOT_EXIST: Share <share_name> does not exist.

The user attempted to access a table that does not exist in the share.

DatabricksServiceException: TABLE_DOES_NOT_EXIST: <table_name> does not exist.

For auditable events and errors for data providers, see Audit access and activity for Delta Sharing resources.
Access shared data outside Azure Databricks
If you don’t use Azure Databricks, follow these instructions to access shared data.
Use Apache Spark
Follow these steps to access shared data in Apache Spark 3.x or above.
1. To access metadata related to the shared data, such as the list of tables shared with you, install the delta-
sharing Python connector:

pip install delta-sharing

2. Install the Apache Spark connector.


3. List the tables in the share. In the following example, replace <profile_path> with the location of the
credential file.
import delta_sharing

client = delta_sharing.SharingClient(f"<profile_path>/config.share")

client.list_all_tables()

The result is an array of tables, along with metadata for each table. The following output shows two
tables:

Out[10]: [Table(name='example_table', share='example_share_0', schema='default'),
Table(name='other_example_table', share='example_share_0', schema='default')]

If the output is empty or doesn’t contain the tables you expect, contact the data provider.
4. Access shared data in Spark using Python:

delta_sharing.load_as_spark(f"<profile_path>#<share_name>.<schema_name>.<table_name>")

spark.read.format("deltaSharing")\
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")\
.limit(10)

Replace the variables as follows:


<profile_path> : the location of the credential file.
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.
5. Access shared data in Apache Spark using Scala:

spark.read.format("deltaSharing")
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")
.limit(10)

Replace the variables as follows:


<profile_path> : the location of the credential file.
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.

If the output is empty or doesn’t contain the data you expect, contact the data provider.
Use pandas
Follow these steps to access shared data in pandas 0.25.3 or above.
1. To access metadata related to the shared data, such as the list of tables shared with you, you must install
the delta-sharing Python connector:

pip install delta-sharing

2. List the tables in the share. In the following example, replace <profile_path> with the location of the
credential file.
import delta_sharing

client = delta_sharing.SharingClient(f"<profile_path>/config.share")

client.list_all_tables()

If the output is empty or doesn’t contain the tables you expect, contact the data provider.
3. Access shared data in pandas using Python. In the following example, replace the variables as follows:
<profile_path> : the location of the credential file.
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.

import delta_sharing
delta_sharing.load_as_pandas(f"<profile_path>#<share_name>.<schema_name>.<table_name>")

If the output is empty or doesn’t contain the data you expect, contact the data provider.

Next steps
Learn more about Databricks
Learn more about Delta Sharing
Learn more about Unity Catalog
Delta Sharing IP access list guide
7/21/2022 • 2 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

The Delta Sharing IP access list API enables the provider metastore admin to configure an IP access list for each
recipient. This list is independent of Workspace IP Access Lists. This API supports allowlists (inclusion) only.
The IP access list affects:
Delta Sharing OSS Protocol REST API access.
Delta Sharing Activation URL access.
Delta Sharing Credential File download.
Each recipient supports a maximum of 100 IP/CIDR values, where one CIDR counts as a single value. Only IPv4
addresses are supported.
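As a rough illustration of these limits, the following Python sketch uses the standard ipaddress module to pre-check a candidate allowlist before you pass it to the CLI commands shown below. The validate_allowlist helper is hypothetical and is not part of the Delta Sharing tooling.

import ipaddress

def validate_allowlist(entries):
    """Hypothetical pre-check for a recipient IP allowlist."""
    if len(entries) > 100:
        raise ValueError("A recipient supports at most 100 IP/CIDR values.")
    for entry in entries:
        # Accepts single IPv4 addresses and IPv4 CIDR ranges; rejects IPv6.
        network = ipaddress.ip_network(entry, strict=False)
        if network.version != 4:
            raise ValueError(f"Only IPv4 addresses are supported: {entry}")
    return entries

validate_allowlist(["8.8.8.8", "8.8.8.4/10"])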

Create an IP access list


Use the Databricks Unity Catalog CLI to create and attach an IP access list to a recipient. To do so while creating a
recipient:

databricks unity-catalog create-recipient \
  --name <recipient-name> \
  --allowed_ip_address=8.8.8.8 \
  --allowed_ip_address=8.8.8.4/10

Update an IP access list


To update an IP access list for a recipient, use the Databricks Unity Catalog CLI:

databricks unity-catalog update-recipient \
  --name <recipient-name> \
  --json='{"ip_access_list": {"allowed_ip_addresses": ["8.8.8.8", "8.8.8.4/10"]}}'

Delete an IP access list


To delete an IP access list for a recipient, use the Databricks Unity Catalog CLI to pass in an empty IP access list:

databricks unity-catalog update-recipient \
  --name <recipient-name> \
  --json='{"ip_access_list": {}}'
NOTE
This will remove the restrictions, and the recipient can access the shared data from anywhere.

Retrieve an IP access list


To retrieve an IP access list for a recipient, use the Databricks CLI:

databricks unity-catalog get-recipient \
  --name <recipient-name>

Audit Logging
The following operations have audit logs related to IP access lists:
Recipient management operations: create, update
Denial of access to any of the Delta Sharing OSS Protocol REST API calls
Denial of access to Delta Sharing Activation URL
Denial of access to Delta Sharing Credential File download
To learn more about how to enable and read audit logs for Delta Sharing, please refer to Audit access and
activity for Delta Sharing resources. The following table lists audited events related to IP access lists:

createRecipient
  requestParams: ip_access_list.allowed_ip_addresses (the configured allowlist of IP addresses).
  sourceIPAddress: N/A

updateRecipient
  requestParams: ip_access_list.allowed_ip_addresses (the configured allowlist of IP addresses).
  sourceIPAddress: N/A

getActivationUrlInfo
  requestParams: is_ip_access_denied (None if there is no IP access list configured; otherwise, true if the request was denied and false if the request was not denied).
  sourceIPAddress: The recipient IP address.

retrieveRecipientToken
  requestParams: is_ip_access_denied (None if there is no IP access list configured; otherwise, true if the request was denied and false if the request was not denied).
  sourceIPAddress: The recipient IP address.

deltaSharing* (all Delta Sharing actions have this audit log)
  requestParams: is_ip_access_denied (None if there is no IP access list configured; otherwise, true if the request was denied and false if the request was not denied).
  sourceIPAddress: The recipient IP address.
Developer tools and guidance
7/21/2022 • 3 minutes to read

Learn about tools and guidance you can use to work with Azure Databricks assets and data and to develop
Azure Databricks applications.

Use an IDE
You can connect many popular third-party IDEs to an Azure Databricks cluster. This allows you to write code on
your local development machine by using the Spark APIs and then run that code as jobs remotely on an Azure
Databricks cluster.
Databricks recommends that you use dbx by Databricks Labs for local development.
Databricks also provides a code sample that you can explore to use an IDE with dbx .

NOTE
Databricks also supports a tool named Databricks Connect. However, Databricks plans no new feature development for
Databricks Connect at this time. Also, Databricks Connect has several limitations.

Use a connector or driver


You can use connectors and drivers to connect your code to an Azure Databricks cluster or a Databricks SQL
warehouse. These connectors and drivers include:
The Databricks SQL Connector for Python
The Databricks SQL Driver for Go
The Databricks SQL Driver for Node.js
pyodbc
The Databricks ODBC driver
The Databricks JDBC driver
For additional information about connecting your code through JDBC or ODBC, see the JDBC and ODBC
configuration guidance.
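As an illustration, a minimal query using the Databricks SQL Connector for Python mentioned above might look like the following sketch. The hostname, HTTP path, and access token values are placeholders for your own workspace and SQL warehouse details.

from databricks import sql

# Placeholder connection details for a Databricks SQL warehouse.
with sql.connect(
    server_hostname="<workspace-hostname>",
    http_path="<warehouse-http-path>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Run a simple query and print the rows that come back.
        cursor.execute("SELECT 1 AS test_column")
        for row in cursor.fetchall():
            print(row)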

Use the command line or a notebook


Databricks provides additional developer tools.

Databricks CLI: Use the command line to work with Data Science & Engineering workspace assets such as cluster policies, clusters, file systems, groups, pools, jobs, libraries, runs, secrets, and tokens.

Databricks SQL CLI: Use the command line to run SQL commands and scripts on a Databricks SQL warehouse.

Databricks Utilities: Run Python, R, or Scala code in a notebook to work with credentials, file systems, libraries, and secrets from an Azure Databricks cluster.

Call Databricks REST APIs


You can use popular third-party utilities such as curl and tools such as Postman to work with Azure Databricks
resources directly through the Databricks REST APIs.
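You can also script these calls. The following sketch assumes your workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and uses Python with the requests library to call the Clusters API list endpoint as an example.

import os
import requests

# Placeholder workspace URL and token, read from environment variables.
host = os.environ["DATABRICKS_HOST"]    # for example: https://adb-1234567890123456.7.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

response = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()

# Print the ID and name of each cluster in the workspace.
for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"])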

REST API (latest): Data Science & Engineering workspace assets such as clusters, global init scripts, groups, pools, jobs, libraries, permissions, secrets, and tokens, by using the latest version of the Databricks REST API.

REST API 2.1: Data Science & Engineering workspace assets such as jobs, by using version 2.1 of the Databricks REST API.

REST API 2.0: Data Science & Engineering workspace assets such as clusters, global init scripts, groups, pools, jobs, libraries, permissions, secrets, and tokens, by using version 2.0 of the Databricks REST API.

REST API 1.2: Command executions and execution contexts by using version 1.2 of the Databricks REST API.

Provision infrastructure
You can use an infrastructure-as-code (IaC) approach to programmatically provision Azure Databricks
infrastructure and assets such as workspaces, clusters, cluster policies, pools, jobs, groups, permissions, secrets,
tokens, and users. For details, see Databricks Terraform provider.

Use CI/CD
To manage the lifecycle of Azure Databricks assets and data, you can use continuous integration and continuous
delivery (CI/CD) and data pipeline tools.

Continuous integration and delivery on Azure Databricks using Azure DevOps: Develop a CI/CD pipeline for Azure Databricks that uses Azure DevOps.

Continuous integration and delivery on Azure Databricks using GitHub Actions: Develop a CI/CD workflow on GitHub that uses GitHub Actions developed for Azure Databricks.

Continuous integration and delivery on Azure Databricks using Jenkins: Develop a CI/CD pipeline for Azure Databricks that uses Jenkins.

Managing dependencies in data pipelines: Manage and schedule a data pipeline that uses Apache Airflow.

Service principals for CI/CD: Use service principals, instead of users, with CI/CD systems.

Use a SQL database tool


You can use these tools to run SQL commands and scripts and to browse database objects in Azure Databricks.

Databricks SQL CLI: Use a command line to run SQL commands and scripts on a Databricks SQL warehouse.

DataGrip integration with Azure Databricks: Use a query console, schema navigation, smart code completion, and other features to run SQL commands and scripts and to browse database objects in Azure Databricks.

DBeaver integration with Azure Databricks: Run SQL commands and browse database objects in Azure Databricks by using this client software application and database administration tool.

SQL Workbench/J: Run SQL scripts (either interactively or as a batch) in Azure Databricks by using this SQL query tool.

Use other tools


You can connect many popular third-party tools to clusters and SQL warehouses to access data in Azure
Databricks. See the Databricks integrations.
To authenticate automated scripts, tools, apps, and systems with Azure Databricks workspaces and resources,
Databricks recommends that you use authentication credentials for service principals instead of Azure
Databricks workspace user credentials. See Service principals for Azure Databricks automation.
dbx by Databricks Labs
7/21/2022 • 31 minutes to read

IMPORTANT
dbx by Databricks Labs is provided as-is and is not officially supported by Databricks through customer technical support
channels. Support, questions, and feature requests can be communicated through the Issues page of the
databrickslabs/dbx repo on GitHub. Issues with the use of this code will not be answered or investigated by Databricks
Support.

dbx by Databricks Labs is an open source tool which is designed to extend the Databricks command-line
interface (Databricks CLI) and to provide functionality for rapid development lifecycle and continuous
integration and continuous delivery/deployment (CI/CD) on the Azure Databricks platform.
dbx simplifies job launch and deployment processes across multiple environments. It also helps to package
your project and deliver it to your Azure Databricks environment in a versioned fashion. Designed in a CLI-first
manner, it is built to be actively used both inside CI/CD pipelines and as a part of local tooling (such as local
IDEs, including Visual Studio Code and PyCharm).
The typical development workflow with dbx is:
1. Create a remote repository with a Git provider Databricks supports, if you do not have a remote repo
available already.
2. Clone your remote repo into your Azure Databricks workspace.
3. Create or move an Azure Databricks notebook into the cloned repo in your Azure Databricks workspace. Use
this notebook to begin prototyping the code that you want your Azure Databricks clusters to run.
4. To enhance and modularize your notebook code by adding separate helper classes and functions,
configuration files, and tests, switch over to using a local development machine with dbx , your preferred
IDE, and Git installed.
5. Clone your remote repo to your local development machine.
6. Move your code out of your notebook into one or more local code files.
7. As you code locally, push your work from your local repo to your remote repo. Also, sync your remote repo
with your Azure Databricks workspace.
8. Keep using the notebook in your Azure Databricks workspace for rapid prototyping, and keep moving
validated code from your notebook to your local machine. Keep using your local IDE for tasks such as code
modularization, code completion, linting, unit testing, and step-through debugging of code and objects that
do not require a live connection to Azure Databricks.
9. Use dbx to batch run your local code on your target clusters, as desired. (This is similar to running the spark-
submit script in Spark’s bin directory to launch applications on a Spark cluster.)
10. When you are ready for production, use a CI/CD platform such as GitHub Actions, Azure DevOps, or GitLab
to automate running your remote repo’s code on your clusters.

Requirements
To use dbx , you must have the following installed on your local development machine, regardless of whether
your code uses Python, Scala, or Java:
Python version 3.6 or above.
If your code uses Python, you should use a version of Python that matches the one that is installed on
your target clusters. To get the version of Python that is installed on an existing cluster, you can use the
cluster’s web terminal to run the python --version command. See also the “System environment”
section in the Databricks runtime releases for the Databricks Runtime version for your target clusters.
pip. ( dbx also supports conda, but conda is not covered in this article.)
If your code uses Python, a method to create Python virtual environments to ensure you are using the
correct versions of Python and package dependencies in your dbx projects. This article covers pipenv.
The dbx package from the Python Package Index (PyPI). You can install this by running pip install dbx .
The Databricks CLI, set up with authentication. The Databricks CLI is automatically installed when you
install dbx . This authentication can be set up on your local development machine in one or both of the
following locations:
Within the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (starting with Databricks
CLI version 0.8.0).
In a profile within your .databrickscfg file.
dbx looks for authentication credentials in these two locations, respectively. dbx uses only the first set
of matching credentials that it finds.

NOTE
dbx does not support the use of a .netrc file for authentication.

git for pushing and syncing local and remote code changes.
Continue with the instructions for one of the following IDEs:
Visual Studio Code
PyCharm
IntelliJ IDEA
Eclipse

NOTE
Databricks has validated usage of the preceding IDEs with dbx ; however, dbx should work with any IDE. You can also
use No IDE (terminal only).
dbx is optimized to work with single-file Python code files and compiled Scala and Java JAR files. dbx does not work
with single-file R code files or compiled R code packages. This is because dbx works with the Jobs API 2.0 and 2.1, and
these APIs cannot run single-file R code files or compiled R code packages as jobs.

Visual Studio Code


Complete the following instructions to begin using Visual Studio Code and Python with dbx .
On your local development machine, you must have the following installed in addition to the general
requirements:
Visual Studio Code.
The Python extension for Visual Studio Code. For more information, see Extension Marketplace on the
Visual Studio Code website.
For more information, see Getting Started with Python in VS Code in the Visual Studio Code
documentation.
Follow these steps to begin setting up your dbx project structure:
1. From your terminal, create a blank folder. These instructions use a folder named dbx-demo . You can give
your dbx project’s root folder any name you want. If you use a different name, replace the name
throughout these steps. After you create the folder, switch to it, and then start Visual Studio Code from
that folder.
For Linux and macOS:

mkdir dbx-demo
cd dbx-demo
code .

TIP
If command not found: code displays after you run code . , see Launching from the command line on the
Microsoft website.

For Windows:

md dbx-demo
cd dbx-demo
code .

2. In Visual Studio Code, create a Python virtual environment for this project:
a. On the menu bar, click View > Terminal .
b. From the root of the dbx-demo folder, run the pipenv command with the following option, where
<version> is the target version of Python that you already have installed locally (and, ideally, a
version that matches your target clusters’ version of Python), for example 3.7.5 .

pipenv --python <version>

Make a note of the Virtualenv location value in the output of the pipenv command, as you will
need it in the next step.
3. Select the target Python interpreter, and then activate the Python virtual environment:
a. On the menu bar, click View > Command Palette , type Python: Select , and then click Python:
Select Interpreter .
b. Select the Python interpreter within the path to the Python virtual environment that you just created.
(This path is listed as the Virtualenv location value in the output of the pipenv command.)
c. On the menu bar, click View > Command Palette , type Terminal: Create , and then click Terminal:
Create New Terminal .
For more information, see Using Python environments in VS Code in the Visual Studio Code
documentation.
4. Continue with Create a dbx project.
PyCharm
Complete the following instructions to begin using PyCharm and Python with dbx .
On your local development machine, you must have PyCharm installed in addition to the general requirements.
Follow these steps to begin setting up your dbx project structure:
1. In PyCharm, on the menu bar, click File > New Project .
2. In the Create Project dialog, choose a location for your new project.
3. Expand Python interpreter : New Pipenv environment .
4. Select New environment using , if it is not already selected, and then select Pipenv from the drop-down
list.
5. For Base interpreter , select the location that contains the Python interpreter for the target version of
Python that you already have installed locally (and, ideally, a version that matches your target clusters’
version of Python).
6. For Pipenv executable , select the location that contains your local installation of pipenv , if it is not already
auto-detected.
7. If you want to create a minimal dbx project, and you want to use the main.py file with that minimal dbx
project, then select the Create a main.py welcome script box. Otherwise, clear this box.
8. Click Create .
9. In the Project tool window, right-click the project’s root folder, and then click Open in > Terminal .
10. Continue with Create a dbx project.

IntelliJ IDEA
Complete the following instructions to begin using IntelliJ IDEA and Scala with dbx . These instructions create a
minimal sbt-based Scala project that you can use to start a dbx project.
On your local development machine, you must have the following installed in addition to the general
requirements:
IntelliJ IDEA.
The Scala plugin for IntelliJ IDEA. For more information, see Discover IntelliJ IDEA for Scala in the IntelliJ IDEA
documentation.
Java Runtime Environment (JRE) 8. While any edition of JRE 8 should work, Databricks has so far only
validated usage of dbx and IntelliJ IDEA with the OpenJDK 8 JRE. Databricks has not yet validated usage of
dbx with IntelliJ IDEA and Java 11. For more information, see Java Development Kit (JDK) in the IntelliJ IDEA
documentation.
Follow these steps to begin setting up your dbx project structure:
Step 1: Create an sbt-based Scala project
1. In IntelliJ IDEA, depending on your view, click Projects > New Project or File > New > Project .
2. In the New Project dialog, click Scala , click sbt , and then click Next .
3. Enter a project name and a location for the project.
4. For JDK , select your installation of the OpenJDK 8 JRE.
5. For sbt , choose the highest available version of sbt that is listed.
6. For Scala , ideally, choose the version of Scala that matches your target clusters’ version of Scala. See the
“System environment” section in the Databricks runtime releases for the Databricks Runtime version for your
target clusters.
7. Next to Scala , select the Sources box if it is not already selected.
8. Add a package prefix to Package Prefix . These steps use the package prefix com.example.demo . If you specify
a different package prefix, replace the package prefix throughout these steps.
9. Click Finish .
Step 2: Add an object to the package
You can add any required objects to your package. This package contains a single object named SampleApp .
1. In the Project tool window (View > Tool Windows > Project ), right-click the project-name > src >
main > scala folder, and then click New > Scala Class .
2. Choose Object , and type the object’s name and then press Enter. For example, type SampleApp . If you
enter a different object name here, be sure to replace the name throughout these steps.
3. Replace the contents of the SampleApp.scala file with the following code:

package com.example.demo

object SampleApp {
def main(args: Array[String]) {
}
}

Step 3: Build the project


Add any required project build settings and dependencies to your project. This step assumes that you are
building a project that was set up in the previous steps and it depends on only the following libraries.
1. Replace the contents of the project’s build.sbt file with the following content:

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "2.12.14"

val sparkVersion = "3.2.1"

lazy val root = (project in file("."))
  .settings(
name := "dbx-demo",
idePackagePrefix := Some("com.example.demo"),
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion withSources(),
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion withSources(),
libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion withSources()
)

In the preceding file, replace:


2.12.14 with the version of Scala that you chose earlier for this project.
3.2.1 with the version of Spark that you chose earlier for this project.
dbx-demo with the name of your project.
com.example.demo with the name of your package prefix.
2. On the menu bar, click View > Tool Windows > sbt .
3. In the sbt tool window, right-click the name of your project, and click Reload sbt Project . Wait until
sbt finishes downloading the project’s dependencies from an Internet artifact store such as Coursier or
Ivy by default, depending on your version of sbt . You can watch the download progress in the status bar.
If you add or change any more dependencies to this project, you must repeat this project reloading step
for each set of dependencies that you add or change.
4. On the menu bar, click IntelliJ IDEA > Preferences .
5. In the Preferences dialog, click Build, Execution, Deployment > Build Tools > sbt .
6. In JVM , for JRE , select your installation of the OpenJDK 8 JRE.
7. In sbt projects , select the name of your project.
8. In sbt shell , select builds .
9. Click OK .
10. On the menu bar, click Build > Build Project . The build’s results appear in the sbt shell tool window
(View > Tool Windows > sbt shell ).
Step 4: Add code to the project
Add any required code to your project. This step assumes that you only want to add code to the
SampleApp.scala file in the example package.

In the project’s src > main > scala > SampleApp.scala file, add the code that you want dbx to batch run on
your target clusters. For basic testing, use the example Scala code in the section Code example.
Step 5: Run the project
1. On the menu bar, click Run > Edit Configurations .
2. In the Run/Debug Configurations dialog, click the + (Add New Configuration ) icon, or Add new , or
Add new run configuration .
3. In the drop-down, click sbt Task .
4. For Name , enter a name for the configuration, for example, Run the program .
5. For Tasks , enter ~run .
6. Select Use sbt shell .
7. Click OK .
8. On the menu bar, click Run > Run ‘Run the program’ . The run’s results appear in the sbt shell tool
window.
Step 6: Build the project as a JAR
You can add any JAR build settings to your project that you want. This step assumes that you only want to build
a JAR that is based on the project that was set up in the previous steps.
1. On the menu bar, click File > Project Structure .
2. In the Project Structure dialog, click Project Settings > Artifacts .
3. Click the + (Add ) icon.
4. In the drop-down list, select JAR > From modules with dependencies .
5. In the Create JAR from Modules dialog, for Module , select the name of your project.
6. For Main Class , click the folder icon.
7. In the Select Main Class dialog, on the Search by Name tab, select SampleApp , and then click OK .
8. For JAR files from libraries , select copy to the output directory and link via manifest .
9. Click OK to close the Create JAR from Modules dialog.
10. Click OK to close the Project Structure dialog.
11. On the menu bar, click Build > Build Artifacts .
12. In the context menu that appears, select project-name :jar > Build . Wait while sbt builds your JAR. The
build’s results appear in the Build Output tool window (View > Tool Windows > Build ).
The JAR is built to the project’s out > artifacts > <project-name>_jar folder. The JAR’s name is
<project-name>.jar .
Step 7: Display the terminal in the IDE
With your dbx project structure now in place, you are ready to create your dbx project.
Display the IntelliJ IDEA terminal by clicking View > Tool Windows > Terminal on the menu bar, and then
continue with Create a dbx project.

Eclipse
Complete the following instructions to begin using Eclipse and Java with dbx . These instructions create a
minimal Maven-based Java project that you can use to start a dbx project.
On your local development machine, you must have the following installed in addition to the general
requirements:
A version of Eclipse. These instructions use the Eclipse IDE for Java Developers edition of the Eclipse IDE.
An edition of the Java Runtime Environment (JRE) or Java Development Kit (JDK) 11, depending on your local
machine’s operating system. While any edition of JRE or JDK 11 should work, Databricks has so far only
validated usage of dbx and the Eclipse IDE for Java Developers with Eclipse 2022-03 R, which includes
AdoptOpenJDK 11.
Follow these steps to begin setting up your dbx project structure:
Step 1: Create a Maven-based Java project
1. In Eclipse, click File > New > Project .
2. In the New Project dialog, expand Maven , select Maven Project , and click Next .
3. In the New Maven Project dialog, select Create a simple project (skip archetype selection) , and click
Next .
4. For Group Id , enter a group ID that conforms to Java’s package name rules. These steps use the package
name of com.example.demo . If you enter a different group ID, substitute it throughout these steps.
5. For Artifact Id , enter a name for the JAR file without the version number. These steps use the JAR name of
dbx-demo . If you enter a different name for the JAR file, substitute it throughout these steps.
6. Click Finish .
Step 2: Add a class to the package
You can add any classes to your package that you want. This package will contain a single class named
SampleApp .

1. In the Project Explorer view (Window > Show View > Project Explorer ), select the project-name
project icon, and then click File > New > Class .
2. In the New Java Class dialog, for Package , enter com.example.demo .
3. For Name , enter SampleApp .
4. For Modifiers , select public .
5. Leave Superclass blank.
6. For Which method stubs would you like to create , select public static void main(String[] args) .
7. Click Finish .
Step 3: Add dependencies to the project
1. In the Project Explorer view, double-click project-name > pom.xml .
2. Add the following dependencies as a child element of the <project> element, and then save the file:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.2.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>3.2.1</version>
<scope>provided</scope>
</dependency>
</dependencies>

Replace:
2.12 with your target clusters’ version of Scala.
3.2.1 with your target clusters’ version of Spark.
See the “System environment” section in the Databricks runtime releases for the Databricks Runtime
version for your target clusters.
Step 4: Compile the project
1. In the project’s pom.xml file, add the following Maven compiler properties as a child element of the
<project> element, and then save the file:

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.6</maven.compiler.source>
<maven.compiler.target>1.6</maven.compiler.target>
</properties>

2. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run
Configurations .
3. In the Run Configurations dialog, click Maven Build .
4. Click the New launch configuration icon.
5. Enter a name for this launch configuration, for example clean compile .
6. For Base directory , click Workspace , choose your project’s directory, and click OK .
7. For Goals , enter clean compile .
8. Click Run . The run’s output appears in the Console view (Window > Show View > Console ).
Step 5: Add code to the project
You can add any code to your project that you want. This step assumes that you only want to add code to a file
named SampleApp.java for a package named com.example.demo .
In the project’s src/main/java > com.example.demo > SampleApp.java file, add the code that you want dbx to
batch run on your target clusters. (If you do not have any code handy, you can use the Java code in the Code
example, listed toward the end of this article.)
Step 6: Run the project
1. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run
Configurations .
2. In the Run Configurations dialog, expand Java Application , and then click App .
3. Click Run . The run’s output appears in the Console view.
Step 7: Build the project as a JAR
1. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run
Configurations .
2. In the Run Configurations dialog, click Maven Build .
3. Click the New launch configuration icon.
4. Enter a name for this launch configuration, for example clean package .
5. For Base directory , click Workspace , choose your project’s directory, and click OK .
6. For Goals , enter clean package .
7. Click Run . The run’s output appears in the Console view.
The JAR is built to the <project-name> > target folder. The JAR’s name is <project-name>-0.0.1-SNAPSHOT.jar .

NOTE
If the JAR does not appear in the target folder in the Project Explorer window at first, you can try to display it by
right-clicking the project-name project icon, and then click Refresh .

Step 8: Display the terminal in the IDE


With your dbx project structure now in place, you are ready to create your dbx project. To start, set the Project
Explorer view to show the hidden files (files starting with a dot ( . )) that dbx generates, as follows:
1. In the Project Explorer view, click the ellipses (View Menu ) filter icon, and then click Filters and
Customization .
2. In the Filters and Customization dialog, on the Pre-set filters tab, clear the . resources* box.
3. Click OK .
Next, display the Eclipse terminal as follows:
1. Click Window > Show View > Terminal on the menu bar.
2. If the terminal’s command prompt does not appear, in the Terminal view, click the Open a Terminal icon.
3. Use the cd command to switch to your project’s root directory.
4. Continue with Create a dbx project.

No IDE (terminal only)


Complete the following instructions to begin using a terminal and Python with dbx .
Follow these steps to use a terminal to begin setting up your dbx project structure:
1. From your terminal, create a blank folder. These instructions use a folder named dbx-demo (but you can
give your dbx project’s root folder any name you want). After you create the folder, switch to it.
For Linux and macOS:

mkdir dbx-demo
cd dbx-demo
For Windows:

md dbx-demo
cd dbx-demo

2. Create a Python virtual environment for this project by running the pipenv command, with the following
option, from the root of the dbx-demo folder, where <version> is the target version of Python that you
already have installed locally, for example 3.7.5 .

pipenv --python <version>

3. Activate your Python virtual environment by running pipenv shell .

pipenv shell

4. Continue with Create a dbx project.

Create a dbx project


With your dbx project structure in place from one of the previous sections, you are now ready to create one of
the following types of projects:
Create a minimal dbx project for Python
Create a minimal dbx project for Scala or Java
Create a dbx templated project for Python with CI/CD support
Create a minimal dbx project for Python
The following minimal dbx project is the simplest and fastest approach to getting started with Python and dbx
. It demonstrates batch running of a single Python code file on an existing Azure Databricks all-purpose cluster
in your Azure Databricks workspace.

NOTE
To create a dbx templated project for Python that demonstrates batch running of code on all-purpose clusters and jobs
clusters, remote code artifact deployments, and CI/CD platform setup, skip ahead to Create a dbx templated project for
Python with CI/CD support.

To complete this procedure, you must have an existing all-purpose cluster in your workspace. (See Display
clusters or Create a cluster.) Ideally (but not required), the version of Python in your Python virtual environment
should match the version that is installed on this cluster. To get the version of Python on the cluster, you can use
the cluster’s web terminal to run the command python --version .

python --version

1. From your terminal, from your dbx project’s root folder, run the dbx configure command with the
following option. This command creates a hidden .dbx folder within your dbx project’s root folder. This
.dbx folder contains lock.json and project.json files.

dbx configure --profile DEFAULT


NOTE
The project.json file contains a reference to the DEFAULT profile within your Databricks CLI .databrickscfg
file. If you want dbx to use a different profile, replace DEFAULT with your target profile’s name.
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a
profile in your Databricks CLI .databrickscfg file, then leave DEFAULT in the project.json as is. dbx will
use this reference by default.

2. Create a folder named conf within your dbx project’s root folder.
For Linux and macOS:

mkdir conf

For Windows:

md conf

3. Add a file named deployment.yaml file to the conf directory, with the following file contents:

environments:
default:
jobs:
- name: "dbx-demo-job"
spark_python_task:
python_file: "dbx-demo-job.py"

NOTE
The deployment.yaml file contains the lower-cased word default , which is a reference to the upper-cased
DEFAULT profile within your Databricks CLI .databrickscfg file. If you want dbx to use a different profile,
replace default with your target profile’s name.
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a
profile in your Databricks CLI .databrickscfg file, then leave default in the deployment.yaml as is. dbx
will use this reference by default.

4. Add the code to run on the cluster to a file named dbx-demo-job.py and add the file to the root folder of
your dbx project. (If you do not have any code handy, you can use the Python code in the Code example,
listed toward the end of this article, or the short sketch shown after this list.)

NOTE
You do not have to name this file dbx-demo-job.py . If you choose a different file name, be sure to update the
python_file field in the conf/deployment.yaml file to match.

5. Run the dbx execute command with the following options. In this command, replace
<existing-cluster-id> with the ID of the target cluster in your workspace. (To get the ID, see Cluster URL
and ID.)
dbx execute --cluster-id=<existing-cluster-id> --job=dbx-demo-job --no-rebuild --no-package

6. To view the run’s results locally, see your terminal’s output. To view the run’s results on your cluster, go to
the Standard output pane in the Driver logs tab for your cluster. (See Cluster driver and worker logs.)
7. Continue with Next steps.
Create a minimal dbx project for Scala or Java
The following minimal dbx project is the simplest and fastest approach to getting started with dbx and Scala
or Java. It demonstrates deploying a single Scala or Java JAR to your Azure Databricks workspace and then
running that deployed JAR on an Azure Databricks jobs cluster in your Azure Databricks workspace.

NOTE
Azure Databricks limits how you can run Scala and Java code on clusters:
You cannot run a single Scala or Java file as a job on a cluster as you can with a single Python file. To run Scala or Java
code, you must first build it into a JAR.
You can run a JAR as a job on an existing all-purpose cluster. However, you cannot reinstall any updates to that JAR on
the same all-purpose cluster. In this case, you must use a job cluster instead. This section uses the job cluster
approach.
You must first deploy the JAR to your Azure Databricks workspace before you can run that deployed JAR on any all-
purpose cluster or jobs cluster in that workspace.

1. In your terminal, from your project’s root folder, run the dbx configure command with the following
option. This command creates a hidden .dbx folder within your project’s root folder. This .dbx folder
contains lock.json and project.json files.

dbx configure --profile DEFAULT

NOTE
The project.json file contains a reference to the DEFAULT profile within your Databricks CLI .databrickscfg
file. If you want dbx to use a different profile, replace DEFAULT with your target profile’s name.
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a
profile in your Databricks CLI .databrickscfg file, then leave DEFAULT in the project.json file as is. dbx
will use this reference by default.

2. Create a folder named conf within your project’s root folder.


For Linux and macOS:

mkdir conf

For Windows:

md conf

3. Add a file named deployment.yaml to the conf directory, with the following minimal contents:
environments:
  default:
    strict_path_adjustment_policy: true
    jobs:
      - name: "dbx-demo-job"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          node_type_id: "Standard_DS3_v2"
          num_workers: 2
          instance_pool_id: "my-instance-pool"
        libraries:
          - jar: "file://out/artifacts/dbx_demo_jar/dbx-demo.jar"
        spark_jar_task:
          main_class_name: "com.example.demo.SampleApp"

Replace:
The value of spark_version with the appropriate Databricks Runtime version string for your target jobs cluster.
The value of node_type_id with the appropriate Cluster node type for your target jobs cluster.
The value of instance_pool_id with the ID of an existing instance pool in your workspace, to enable
faster running of jobs. If you do not have an existing instance pool available or you do not want to use
an instance pool, remove this line altogether.
The value of jar with the path in the project to the JAR. For IntelliJ IDEA with Scala, it could be
file://out/artifacts/dbx_demo_jar/dbx-demo.jar . For the Eclipse IDE with Java, it could be
file://target/dbx-demo-0.0.1-SNAPSHOT.jar .
The value of main_class_name with the name of the main class in the JAR, for example
com.example.demo.SampleApp .

NOTE
The deployment.yaml file contains the word default , which is a reference to the default environment in the
.dbx/project.json file, which in turn is a reference to the DEFAULT profile within your Databricks CLI
.databrickscfg file. If you want dbx to use a different profile, replace default in this deployment.yaml file
with the corresponding reference in the .dbx/project.json file, which in turn references the corresponding
profile within your Databricks CLI .databrickscfg file.
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a
profile in your Databricks CLI .databrickscfg file, then leave default in the deployment.yaml as is. dbx
will use the default environment settings (except for the profile value) in the .dbx/project.json file by
default.

4. Run the dbx deploy command with the following options. dbx deploys the JAR to the location in the
.dbx/project.json file’s artifact_location path for the matching environment. dbx also deploys the
project’s files as part of an MLflow experiment, to the location listed in the .dbx/project.json file’s
workspace_dir path for the matching environment.

dbx deploy --no-rebuild --no-package

5. Run the dbx launch command with the following options. This command runs the job with the matching
name in conf/deployment.yaml . To find the deployed JAR to run as part of the job, dbx references the
location in the .dbx/project.json file’s artifact_location path for the matching environment. To
determine which specific JAR to run, dbx references the MLflow experiment in the location listed in the
.dbx/project.json file’s workspace_dir path for the matching environment.
dbx launch --job=dbx-demo-job

6. To view the job run’s results on your jobs cluster, see View jobs.
7. To view the experiment that the job referenced, see Experiments.
8. Continue with Next steps.
Create a dbx templated project for Python with CI/CD support
The following dbx templated project for Python demonstrates support for batch running of Python code on
Azure Databricks all-purpose clusters and jobs clusters in your Azure Databricks workspaces, remote code
artifact deployments, and CI/CD platform setup. (To create a minimal dbx project for Python that only
demonstrates batch running of a single Python code file on an existing all-purpose cluster, skip back to Create a
minimal dbx project for Python.)
1. From your terminal, in your dbx project’s root folder, run the dbx init command.

dbx init

2. For project_name , enter a name for your project, or press Enter to accept the default project name.
3. For version , enter a starting version number for your project, or press Enter to accept the default project
version.
4. For cloud , select the number that corresponds to the Azure Databricks cloud version that you want your
project to use, or press Enter to accept the default.
5. For cicd_tool , select the number that corresponds to the supported CI/CD tool that you want your
project to use, or press Enter to accept the default.
6. For project_slug , enter a prefix that you want to use for resources in your project, or press Enter to
accept the default.
7. For workspace_dir , enter the local path to the workspace directory for your project, or press Enter to
accept the default.
8. For artifact_location , enter the path in your Azure Databricks workspace to where your project’s
artifacts will be written, or press Enter to accept the default.
9. For profile , enter the name of the Databricks CLI authentication profile that you want your project to use,
or press Enter to accept the default.

TIP
You can skip the preceding steps by running dbx init with hard-coded template parameters, for example:

dbx init --template="python_basic" \
    -p "project_name=cicd-sample-project" \
    -p "cloud=Azure" \
    -p "cicd_tool='Azure DevOps'" \
    -p "profile=DEFAULT" \
    --no-input

dbx calculates the parameters project_slug , workspace_dir , and artifact_location automatically. These three
parameters are optional, and they are useful only for more advanced use cases.
See the init command in CLI Reference in the dbx documentation.
To use your new project, see Basic Python Template in the dbx documentation.
See also Next steps.

Code example
If you do not have any code readily available to batch run with dbx , you can experiment by having dbx batch
run the following code. This code creates a small table in your workspace, queries the table, and then deletes the
table.

TIP
If you want to leave the table in your workspace instead of deleting it, comment out the last line of code in this example
before you batch run it with dbx .

Python
# For testing and debugging of local objects, run
# "pip install pyspark=X.Y.Z", where "X.Y.Z"
# matches the version of PySpark
# on your target clusters.
from pyspark.sql import SparkSession

from pyspark.sql.types import *


from datetime import date

spark = SparkSession.builder.appName("dbx-demo").getOrCreate()

# Create a DataFrame consisting of high and low temperatures
# by airport code and date.
schema = StructType([
StructField('AirportCode', StringType(), False),
StructField('Date', DateType(), False),
StructField('TempHighF', IntegerType(), False),
StructField('TempLowF', IntegerType(), False)
])

data = [
[ 'BLI', date(2021, 4, 3), 52, 43],
[ 'BLI', date(2021, 4, 2), 50, 38],
[ 'BLI', date(2021, 4, 1), 52, 41],
[ 'PDX', date(2021, 4, 3), 64, 45],
[ 'PDX', date(2021, 4, 2), 61, 41],
[ 'PDX', date(2021, 4, 1), 66, 39],
[ 'SEA', date(2021, 4, 3), 57, 43],
[ 'SEA', date(2021, 4, 2), 54, 39],
[ 'SEA', date(2021, 4, 1), 56, 41]
]

temps = spark.createDataFrame(data, schema)

# Create a table on the cluster and then fill
# the table with the DataFrame's contents.
# If the table already exists from a previous run,
# delete it first.
spark.sql('USE default')
spark.sql('DROP TABLE IF EXISTS demo_temps_table')
temps.write.saveAsTable('demo_temps_table')

# Query the table on the cluster, returning rows
# where the airport code is not BLI and the date is later
# than 2021-04-01. Group the results and order by high
# temperature in descending order.
df_temps = spark.sql("SELECT * FROM demo_temps_table " \
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " \
"GROUP BY AirportCode, Date, TempHighF, TempLowF " \
"ORDER BY TempHighF DESC")
df_temps.show()

# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode| Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# | PDX|2021-04-03| 64| 45|
# | PDX|2021-04-02| 61| 41|
# | SEA|2021-04-03| 57| 43|
# | SEA|2021-04-02| 54| 39|
# +-----------+----------+---------+--------+

# Clean up by deleting the table from the cluster.
spark.sql('DROP TABLE demo_temps_table')
Scala
package com.example.demo

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.sql.Date

object SampleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder().master("local").getOrCreate()

val schema = StructType(Array(
StructField("AirportCode", StringType, false),
StructField("Date", DateType, false),
StructField("TempHighF", IntegerType, false),
StructField("TempLowF", IntegerType, false)
))

val data = List(
Row("BLI", Date.valueOf("2021-04-03"), 52, 43),
Row("BLI", Date.valueOf("2021-04-02"), 50, 38),
Row("BLI", Date.valueOf("2021-04-01"), 52, 41),
Row("PDX", Date.valueOf("2021-04-03"), 64, 45),
Row("PDX", Date.valueOf("2021-04-02"), 61, 41),
Row("PDX", Date.valueOf("2021-04-01"), 66, 39),
Row("SEA", Date.valueOf("2021-04-03"), 57, 43),
Row("SEA", Date.valueOf("2021-04-02"), 54, 39),
Row("SEA", Date.valueOf("2021-04-01"), 56, 41)
)

val rdd = spark.sparkContext.makeRDD(data)


val temps = spark.createDataFrame(rdd, schema)

// Create a table on the Databricks cluster and then fill
// the table with the DataFrame's contents.
// If the table already exists from a previous run,
// delete it first.
spark.sql("USE default")
spark.sql("DROP TABLE IF EXISTS demo_temps_table")
temps.write.saveAsTable("demo_temps_table")

// Query the table on the Databricks cluster, returning rows
// where the airport code is not BLI and the date is later
// than 2021-04-01. Group the results and order by high
// temperature in descending order.
val df_temps = spark.sql("SELECT * FROM demo_temps_table " +
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
"GROUP BY AirportCode, Date, TempHighF, TempLowF " +
"ORDER BY TempHighF DESC")
df_temps.show()

// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+

// Clean up by deleting the table from the Databricks cluster.
spark.sql("DROP TABLE demo_temps_table")
}
}
Java

package com.example.demo;

import java.util.ArrayList;
import java.util.List;
import java.sql.Date;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.Dataset;

public class SampleApp {


public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Temps Demo")
.config("spark.master", "local")
.getOrCreate();

// Create a Spark DataFrame consisting of high and low temperatures
// by airport code and date.
StructType schema = new StructType(new StructField[] {
new StructField("AirportCode", DataTypes.StringType, false, Metadata.empty()),
new StructField("Date", DataTypes.DateType, false, Metadata.empty()),
new StructField("TempHighF", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("TempLowF", DataTypes.IntegerType, false, Metadata.empty()),
});

List<Row> dataList = new ArrayList<Row>();


dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-03"), 52, 43));
dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-02"), 50, 38));
dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-01"), 52, 41));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-03"), 64, 45));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-02"), 61, 41));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-01"), 66, 39));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-03"), 57, 43));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-02"), 54, 39));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-01"), 56, 41));

Dataset<Row> temps = spark.createDataFrame(dataList, schema);

// Create a table on the Databricks cluster and then fill
// the table with the DataFrame's contents.
// If the table already exists from a previous run,
// delete it first.
spark.sql("USE default");
spark.sql("DROP TABLE IF EXISTS demo_temps_table");
temps.write().saveAsTable("demo_temps_table");

// Query the table on the Databricks cluster, returning rows
// where the airport code is not BLI and the date is later
// than 2021-04-01. Group the results and order by high
// temperature in descending order.
Dataset<Row> df_temps = spark.sql("SELECT * FROM demo_temps_table " +
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
"GROUP BY AirportCode, Date, TempHighF, TempLowF " +
"ORDER BY TempHighF DESC");
df_temps.show();

// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+

// Clean up by deleting the table from the Databricks cluster.
spark.sql("DROP TABLE demo_temps_table");
}
}

Next steps
Extend your conf/deployment.yaml file to support various types of all-purpose and jobs cluster definitions.
Declare multitask jobs in your conf/deployment.yaml file (see the sketch after this list).
Reference environment variables and named properties in your conf/deployment.yaml file.
Batch run code as new jobs on clusters with the dbx execute command.
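
For the multitask case, the following is a hedged sketch of how a multitask job might be declared in conf/deployment.yaml , assuming that dbx deployment files mirror the Jobs API 2.1 task format. The job, task, and file names are illustrative; check the dbx documentation linked above for the exact schema:

environments:
  default:
    jobs:
      - name: "dbx-demo-multitask-job"
        tasks:
          - task_key: "etl"
            new_cluster:
              spark_version: "10.4.x-scala2.12"
              node_type_id: "Standard_DS3_v2"
              num_workers: 1
            spark_python_task:
              python_file: "file://jobs/etl.py"
          - task_key: "report"
            depends_on:
              - task_key: "etl"
            new_cluster:
              spark_version: "10.4.x-scala2.12"
              node_type_id: "Standard_DS3_v2"
              num_workers: 1
            spark_python_task:
              python_file: "file://jobs/report.py"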

Limitations
Jobs that contain notebook task types are not supported.
Target all-purpose clusters or jobs clusters must use a Databricks Runtime for Machine Learning (Databricks
Runtime ML) version.
dbx execute cannot be used with multi-task jobs.
Only Python is fully supported.

Additional resources
Batch deploy code artifacts to Azure Databricks workspace storage with the dbx deploy command.
Batch run existing jobs on clusters with the dbx launch command.
Use dbx to do jobless deployments.
Integrate dbx with Azure Data Factory.
Integrate dbx with Apache Airflow.
Learn more about dbx and CI/CD.
dbx documentation
databrickslabs/dbx repository on GitHub
dbx limitations
Use an IDE with Azure Databricks
7/21/2022 • 15 minutes to read

You can use third-party integrated development environments (IDEs) for software development with Azure
Databricks. Some of these IDEs include the following:
Visual Studio Code
PyCharm
IntelliJ IDEA
Eclipse
You use these IDEs to do software development in programming languages that Azure Databricks supports,
including the following languages:
Python
R
Scala
Java
To demonstrate how this can work, this article describes a Python-based code sample that you can work with in
any Python-compatible IDE. Specifically, this article describes how to work with this code sample in Visual Studio
Code, which provides the following developer productivity features:
Code completion
Linting
Testing
Debugging code objects that do not require a real-time connection to remote Azure Databricks resources.
This article uses dbx by Databricks Labs along with Visual Studio Code to submit the code sample to a remote
Azure Databricks workspace. dbx instructs Azure Databricks Workflows to run the submitted code on an Azure
Databricks jobs cluster in that workspace.
You can use popular third-party Git providers for version control and continuous integration and continuous
delivery or continuous deployment (CI/CD) of your code. For version control, these Git providers include the
following:
GitHub
Bitbucket
GitLab
Azure DevOps (not available in Azure China regions)
AWS CodeCommit
GitHub AE
For CI/CD, dbx supports the following CI/CD platforms:
GitHub Actions
Azure Pipelines
GitLab CI/CD
To demonstrate how version control and CI/CD can work, this article describes how to use Visual Studio Code,
dbx , and this code sample, along with GitHub and GitHub Actions.
Code sample requirements
To use this code sample, you must have the following:
An Azure Databricks workspace in your Azure Databricks account. Create a workspace if you do not already
have one.
A GitHub account. Create a GitHub account, if you do not already have one.
Additionally, on your local development machine, you must have the following:
Python version 3.8 or above.
You should use a version of Python that matches the one that is installed on your target clusters. To get
the version of Python that is installed on an existing cluster, you can use the cluster’s web terminal to run
the python --version command. See also the “System environment” section in the Databricks runtime
releases for the Databricks Runtime version for your target clusters. In any case, the version of Python
must be 3.8 or above.
To get the version of Python that is currently referenced on your local machine, run python --version
from your local terminal. (Depending on how you set up Python on your local machine, you may need to
run python3 instead of python throughout this article.) See also Select a Python interpreter.
pip. pip is automatically installed with newer versions of Python. To check whether pip is already
installed, run pip --version from your local terminal. (Depending on how you set up Python or pip on
your local machine, you may need to run pip3 instead of pip throughout this article.)
The dbx package from the Python Package Index (PyPI). You can install this by running pip install dbx .

NOTE
You do not need to install dbx now. You can install it later in the code sample setup section.

A method to create Python virtual environments to ensure you are using the correct versions of Python
and package dependencies in your dbx projects. This article covers pipenv.
The Databricks CLI, set up with authentication.

NOTE
You do not need to install the Databricks CLI now. You can install it later in the code sample setup section. If you
want to install it later, you must remember to set up authentication at that time instead.

Visual Studio Code.


The Python extension for Visual Studio Code.
The GitHub Pull Requests and Issues extension for Visual Studio Code.
Git.

About the code sample


The Python code sample for this article, available in the databricks/ide-best-practices repo in GitHub, does the
following:
1. Gets data from the owid/covid-19-data repo in GitHub.
2. Filters the data for a specific ISO country code.
3. Creates a pivot table from the data.
4. Performs data cleansing on the data.
5. Modularizes the code logic into reusable functions.
6. Unit tests the functions.
7. Provides dbx project configurations and settings to enable the code to write the data to a Delta table in a
remote Azure Databricks workspace.

Set up the code sample


After you have the requirements in place for this code sample, complete the following steps to begin using the
code sample.

NOTE
These steps do not include setting up this code sample for CI/CD. You do not need to set up CI/CD to run this code
sample. If you want to set up CI/CD later, see Run with GitHub Actions.

Step 1: Create a Python virtual environment


1. From your terminal, create a blank folder to contain a virtual environment for this code sample. These
instructions use a parent folder named ide-demo . You can give this folder any name you want. If you use
a different name, replace the name throughout this article. After you create the folder, switch to it, and
then start Visual Studio Code from that folder. Be sure to include the dot ( . ) after the code command.
For Linux and macOS:

mkdir ide-demo
cd ide-demo
code .

TIP
If you get the error command not found: code , see Launching from the command line on the Microsoft website.

For Windows:

md ide-demo
cd ide-demo
code .

2. In Visual Studio Code, on the menu bar, click View > Terminal .
3. From the root of the ide-demo folder, run the pipenv command with the following option, where
<version> is the target version of Python that you already have installed locally (and, ideally, a version
that matches your target clusters’ version of Python), for example 3.8.10 .

pipenv --python <version>

Make a note of the Virtualenv location value in the output of the pipenv command, as you will need it
in the next step.
4. Select the target Python interpreter, and then activate the Python virtual environment:
a. On the menu bar, click View > Command Palette , type Python: Select , and then click Python:
Select Interpreter .
b. Select the Python interpreter within the path to the Python virtual environment that you just
created. (This path is listed as the Virtualenv location value in the output of the pipenv
command.)
c. On the menu bar, click View > Command Palette , type Terminal: Create , and then click
Terminal: Create New Terminal .
d. Make sure that the command prompt indicates that you are in the pipenv shell. To confirm, you
should see something like (<your-username>) before your command prompt. If you do not see it,
run the following command:

pipenv shell

To exit the pipenv shell, run the command exit , and the parentheses disappear.
For more information, see Using Python environments in VS Code in the Visual Studio Code
documentation.
Step 2: Clone the code sample from GitHub
1. In Visual Studio Code, open the ide-demo folder (File > Open Folder ), if it is not already open.
2. Click View > Command Palette , type Git: Clone , and then click Git: Clone .
3. For Provide repositor y URL or pick a repositor y source , enter
https://github.com/databricks/ide-best-practices
4. Browse to your ide-demo folder, and click Select Repositor y Location .
Step 3: Install the code sample’s dependencies
1. Install a version of dbx and the Databricks CLI that is compatible with your version of Python. To do this,
in Visual Studio Code from your terminal, from your ide-demo folder with a pipenv shell activated (
pipenv shell ), run the following command:

pip install dbx

2. Confirm that dbx is installed. To do this, run the following command:

dbx --version

If the version number is returned, dbx is installed.


3. When you install dbx , the Databricks CLI is also automatically installed. To confirm that the Databricks
CLI is installed, run the following command:

databricks --version

If the version number is returned, the Databricks CLI is installed.


4. If you have not set up the Databricks CLI with authentication, you must do it now. To confirm that
authentication is set up, run the following basic command to get some summary information about your
Azure Databricks workspace. Be sure to include the forward slash ( / ) after the ls subcommand:
databricks workspace ls /

If a list of root-level folder names for your workspace is returned, authentication is set up.
5. Install the Python packages that this code sample depends on. To do this, run the following command
from the ide-demo/ide-best-practices folder:

pip install -r unit-requirements.txt

6. Confirm that the code sample’s dependent packages are installed. To do this, run the following command:

pip list

If the packages that are listed in the requirements.txt and unit-requirements.txt files are somewhere in
this list, the dependent packages are installed.

NOTE
The packages listed in requirements.txt are pinned to specific versions. For better compatibility, you can cross-
reference these versions with the versions that are installed on the cluster that you want your Azure Databricks
workspace to use for running deployments later. See the "System environment" section for your cluster's Databricks
Runtime version in Databricks runtime releases.

Step 4: Customize the code sample for your Azure Databricks workspace
1. Customize the repo’s dbx project settings. To do this, in the .dbx/project.json file, change the value of
the profile object from DEFAULT to the name of the profile that matches the one that you set up for
authentication with the Databricks CLI. If you did not set up any non-default profile, leave DEFAULT as is.
For example:

{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "workspace_dir": "/Shared/dbx/covid_analysis",
      "artifact_location": "dbfs:/Shared/dbx/projects/covid_analysis"
    }
  }
}

2. Customize the dbx project’s deployment settings. To do this, in the conf/deployment.yml file, change the
value of the spark_version and node_type_id objects from 10.4.x-scala2.12 and m6gd.large to the
Azure Databricks runtime version string and cluster node type that you want your Azure Databricks
workspace to use for running deployments on.
For example, to specify Databricks Runtime 10.4 LTS and a Standard_DS3_v2 node type:
environments:
  default:
    strict_path_adjustment_policy: true
    jobs:
      - name: "covid_analysis_etl_integ"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          num_workers: 1
          node_type_id: "Standard_DS3_v2"
        spark_python_task:
          python_file: "file://jobs/covid_trends_job.py"
      - name: "covid_analysis_etl_prod"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          num_workers: 1
          node_type_id: "Standard_DS3_v2"
        spark_python_task:
          python_file: "file://jobs/covid_trends_job.py"
          parameters: ["--prod"]
      - name: "covid_analysis_etl_raw"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          num_workers: 1
          node_type_id: "Standard_DS3_v2"
        spark_python_task:
          python_file: "file://jobs/covid_trends_job_raw.py"

TIP
In this example, each of these three job definitions has the same spark_version and node_type_id value. You can use
different values for different job definitions. You can also create shared values and reuse them across job definitions, to
reduce typing errors and code maintenance. See the YAML example in the dbx documentation.
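
For example, the following is a hedged sketch of sharing cluster settings across the job definitions above by using standard YAML anchors. The custom top-level key and the anchor name are illustrative; confirm the pattern against the YAML example in the dbx documentation:

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    num_workers: 1

environments:
  default:
    jobs:
      - name: "covid_analysis_etl_integ"
        new_cluster:
          <<: *basic-cluster-props
        spark_python_task:
          python_file: "file://jobs/covid_trends_job.py"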

Explore the code sample


After you set up the code sample, use the following information to learn about how the various files in the
ide-demo/ide-best-practices folder work.

Code modularization
Unmodularized code
The jobs/covid_trends_job_raw.py file is an unmodularized version of the code logic. You can run this file by
itself.
Modularized code
The jobs/covid_trends_job.py file is a modularized version of the code logic. This file relies on the shared code
in the covid_analysis/transforms.py file. The covid_analysis/__init__.py file treats the covid_analysis folder
as a containing package.
Testing
Unit tests
The tests/testdata.csv file contains a small portion of the data in the covid-hospitalizations.csv file for
testing purposes. The tests/transforms_test.py file contains the unit tests for the covid_analysis/transforms.py
file.
Unit test runner
The pytest.ini file contains configuration options for running tests with pytest. See pytest.ini and
Configuration Options in the pytest documentation.
The .coveragerc file contains configuration options for Python code coverage measurements with coverage.py.
See Configuration reference in the coverage.py documentation.
The requirements.txt file, which is a subset of the unit-requirements.txt file that you ran earlier with pip ,
contains a list of packages that the unit tests also depend on.
Packaging
The setup.py file provides commands to be run at the console (console scripts), such as the pip command, for
packaging Python projects with setuptools. See Entry Points in the setuptools documentation.
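For illustration only, a minimal setuptools-based setup.py with a console script entry point might look like the following. The version, command, and module names here are hypothetical and are not taken from the repo:

from setuptools import find_packages, setup

setup(
    name="covid_analysis",
    version="0.0.1",
    packages=find_packages(exclude=["tests", "tests.*"]),
    entry_points={
        "console_scripts": [
            # Hypothetical console command mapped to a hypothetical entry function.
            "covid-etl=jobs.covid_trends_job:main"
        ]
    },
)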
Other files
There are other files in this code sample that have not been previously described:
The .github/workflows folder contains three files, databricks_pull_request_tests.yml , onpush.yml , and
onrelease.yml , that represent the GitHub Actions, which are covered later in the GitHub Actions section.
The .gitignore file contains a list of local folders and files that Git ignores for your repo.

Run the code sample


You can use dbx on your local machine to instruct Azure Databricks to run the code sample in your remote
workspace on-demand, as described in the next subsection. Or you can use GitHub Actions to have GitHub run
the code sample every time you push code changes to your GitHub repo.
Run with dbx
1. Install the contents of the covid_analysis folder as a package in Python setuptools development mode
by running the following command from the root of your dbx project (for example, the
ide-demo/ide-best-practices folder). Be sure to include the dot ( . ) at the end of this command:

pip install -e .

This command creates a covid_analysis.egg-info folder, which contains information about the compiled
version of the covid_analysis/__init__.py and covid_analysis/transforms.py files.
2. Run the tests by running the following command:

pytest tests/

The tests’ results are displayed in the terminal. All four tests should show as passing.
3. Optionally, get test coverage metrics for your tests by running the following command:

coverage run -m pytest tests/

NOTE
If a message displays that coverage cannot be found, run pip install coverage , and try again.

To view test coverage results, run the following command:

coverage report -m

4. If all four tests pass, send the dbx project’s contents to your Azure Databricks workspace, by running the
following command:

dbx deploy --environment=default

Information about the project and its runs are sent to the location specified in the workspace_dir object
in the .dbx/project.json file.
The project’s contents are sent to the location specified in the artifact_location object in the
.dbx/project.json file.

5. Run the pre-production version of the code in your workspace, by running the following command:

dbx launch --job=covid_analysis_etl_integ

A link to the run’s results is displayed in the terminal. It should look something like this:

https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/12345

Follow this link in your web browser to see the run’s results in your workspace.
6. Run the production version of the code in your workspace, by running the following command:

dbx launch --job=covid_analysis_etl_prod

A link to the run’s results is displayed in the terminal. It should look something like this:

https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/23456

Follow this link in your web browser to see the run’s results in your workspace.
Run with GitHub Actions
In the project’s .github/workflows folder, the onpush.yml and onrelease.yml GitHub Actions files do the
following:
On each push to a tag that begins with v , uses dbx to deploy the covid_analysis_etl_prod job.
On each push that is not to a tag that begins with v (see the sketch after this list):
1. Uses pytest to run the unit tests.
2. Uses dbx to deploy the file specified in the covid_analysis_etl_integ job to the remote workspace.
3. Uses dbx to launch the already-deployed file specified in the covid_analysis_etl_integ job on the
remote workspace, tracing this run until it finishes.
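
For orientation, the following is a hedged sketch of the general shape that such an on-push workflow can take. It is not the repo’s actual onpush.yml ; the action versions and step names are illustrative, while the dbx commands and secret names match the ones used elsewhere in this article:

name: CI pipeline
on:
  push:
    tags-ignore:
      - "v*"
jobs:
  ci-pipeline:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          pip install -r unit-requirements.txt
          pip install -e .
      - name: Run unit tests
        run: pytest tests/
      - name: Deploy the integration job
        run: dbx deploy --environment=default
      - name: Launch the integration job and trace it
        run: dbx launch --job=covid_analysis_etl_integ --trace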

NOTE
An additional GitHub Actions file, databricks_pull_request_tests.yml , is provided for you as a template to
experiment with, without impacting the onpush.yml and onrelease.yml GitHub Actions files. You can run this code
sample without the databricks_pull_request_tests.yml GitHub Actions file. Its usage is not covered in this article.

The following subsections describe how to set up and run the onpush.yml and onrelease.yml GitHub Actions
files.
Set up to use GitHub Actions
Set up your Azure Databricks workspace by following the instructions in Service principals for CI/CD. This
includes the following actions:
1. Create an Azure AD service principal.
2. Create an Azure AD token for the Azure AD service principal.
As a security best practice, Databricks recommends that you use an Azure AD token for an Azure AD service
principal, instead of the Azure Databricks personal access token for your workspace user, for enabling GitHub to
authenticate with your Azure Databricks workspace.
After you create the Azure AD service principal and its Azure AD token, stop and make a note of the Azure AD
token value, which you will you use in the next section.
Run GitHub Actions
Step 1: Publish your cloned repo

1. In Visual Studio Code, in the sidebar, click the GitHub icon. If the icon is not visible, enable the GitHub Pull
Requests and Issues extension through the Extensions view (View > Extensions ) first.
2. If the Sign In button is visible, click it, and follow the on-screen instructions to sign in to your GitHub
account.
3. On the menu bar, click View > Command Palette , type Publish to GitHub , and then click Publish to
GitHub .
4. Select an option to publish your cloned repo to your GitHub account.
Step 2: Add encrypted secrets to your repo

In the GitHub website for your published repo, follow the instructions in Creating encrypted secrets for a
repository, for the following encrypted secrets:
Create an encrypted secret named DATABRICKS_HOST , set to the value of your per-workspace URL, for example
https://adb-1234567890123456.7.azuredatabricks.net .
Create an encrypted secret named DATABRICKS_TOKEN , set to the value of the Azure AD token for the Azure AD
service principal.
Step 3: Create and publish a branch to your repo

1. In Visual Studio Code, in Source Control view (View > Source Control ), click the … (Views and More
Actions ) icon.
2. Click Branch > Create Branch From .
3. Enter a name for the branch, for example my-branch .
4. Select the branch to create the branch from, for example main .
5. Make a minor change to one of the files in your local repo, and then save the file. For example, make a minor
change to a code comment in the tests/transforms_test.py file.
6. In Source Control view, click the … (Views and More Actions ) icon again.
7. Click Changes > Stage All Changes .
8. Click the … (Views and More Actions ) icon again.
9. Click Commit > Commit Staged .
10. Enter a message for the commit.
11. Click the … (Views and More Actions ) icon again.
12. Click Branch > Publish Branch .
Step 4: Create a pull request and merge

1. Go to the GitHub website for your published repo,
https://github.com/<your-GitHub-username>/ide-best-practices .
2. On the Pull requests tab, next to my-branch had recent pushes , click Compare & pull request .
3. Click Create pull request .
4. On the pull request page, wait for the icon next to CI pipeline / ci-pipeline (push) to display a green
check mark. (It may take a few moments to several minutes for the icon to appear.) If there is a red X
instead of a green check mark, click Details to find out why. If the icon or Details are no longer showing,
click Show all checks .
5. If the green check mark appears, merge the pull request into the main branch by clicking Merge pull
request .
Databricks Connect
7/21/2022 • 27 minutes to read

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.

Databricks Connect allows you to connect your favorite IDE (Eclipse, IntelliJ, PyCharm, RStudio, Visual Studio
Code), notebook server (Jupyter Notebook, Zeppelin), and other custom applications to Azure Databricks
clusters.
This article explains how Databricks Connect works, walks you through the steps to get started with Databricks
Connect, explains how to troubleshoot issues that may arise when using Databricks Connect, and differences
between running using Databricks Connect versus running in an Azure Databricks notebook.

Overview
Databricks Connect is a client library for Databricks Runtime. It allows you to write jobs using Spark APIs and
run them remotely on an Azure Databricks cluster instead of in the local Spark session.
For example, when you run the DataFrame command
spark.read.format("parquet").load(...).groupBy(...).agg(...).show() using Databricks Connect, the parsing
and planning of the job runs on your local machine. Then, the logical representation of the job is sent to the
Spark server running in Azure Databricks for execution in the cluster.
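To make this concrete, the following is a minimal sketch of a local script run with Databricks Connect, assuming the client is already installed and configured as described later in this article:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The DataFrame below is parsed and planned locally; the aggregation
# executes on the remote Azure Databricks cluster.
df = spark.range(100).toDF("n")
df.selectExpr("sum(n) AS total").show()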
With Databricks Connect, you can:
Run large-scale Spark jobs from any Python, Java, Scala, or R application. Anywhere you can import pyspark ,
import org.apache.spark , or require(SparkR) , you can now run Spark jobs directly from your application,
without needing to install any IDE plugins or use Spark submission scripts.
Step through and debug code in your IDE even when working with a remote cluster.
Iterate quickly when developing libraries. You do not need to restart the cluster after changing Python or Java
library dependencies in Databricks Connect, because each client session is isolated from each other in the
cluster.
Shut down idle clusters without losing work. Because the client application is decoupled from the cluster, it is
unaffected by cluster restarts or upgrades, which would normally cause you to lose all the variables, RDDs,
and DataFrame objects defined in a notebook.

NOTE
For Python development with SQL queries, Databricks recommends that you use the Databricks SQL Connector for
Python instead of Databricks Connect. The Databricks SQL Connector for Python is easier to set up than Databricks
Connect. Also, Databricks Connect parses and plans job runs on your local machine, while jobs run on remote compute
resources. This can make it especially difficult to debug runtime errors. The Databricks SQL Connector for Python submits
SQL queries directly to remote compute resources and fetches results.
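For reference, a hedged sketch of a query with the Databricks SQL Connector for Python (install it with pip install databricks-sql-connector ; the connection values below are placeholders):

from databricks import sql

# Placeholders: your workspace hostname, the HTTP path of a cluster or
# SQL warehouse, and a personal access token.
connection = sql.connect(
    server_hostname="<server-hostname>",
    http_path="<http-path>",
    access_token="<access-token>"
)
cursor = connection.cursor()
cursor.execute("SELECT 1 AS test")
print(cursor.fetchall())
cursor.close()
connection.close()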

Requirements
Only the following Databricks Runtime versions are supported:
Databricks Runtime 10.4 LTS ML, Databricks Runtime 10.4 LTS
Databricks Runtime 9.1 LTS ML, Databricks Runtime 9.1 LTS
Databricks Runtime 7.3 LTS ML, Databricks Runtime 7.3 LTS
Databricks Runtime 6.4 ML, Databricks Runtime 6.4
The minor version of your client Python installation must be the same as the minor Python version of
your Azure Databricks cluster. The table shows the Python version installed with each Databricks Runtime.

DATABRICKS RUNTIME VERSION          PYTHON VERSION

10.4 LTS ML, 10.4 LTS 3.8

9.1 LTS ML, 9.1 LTS 3.8

7.3 LTS ML, 7.3 LTS 3.7

6.4 ML, 6.4 3.7

For example, if you’re using Conda on your local development environment and your cluster is running
Python 3.7, you must create an environment with that version, for example:

conda create --name dbconnect python=3.7



The Databricks Connect major and minor package version must always match your Databricks Runtime
version. Databricks recommends that you always use the most recent package of Databricks Connect that
matches your Databricks Runtime version. For example, when using a Databricks Runtime 7.3 LTS cluster,
use the databricks-connect==7.3.* package.

NOTE
See the Databricks Connect release notes for a list of available Databricks Connect releases and maintenance
updates.

Java Runtime Environment (JRE) 8. The client has been tested with the OpenJDK 8 JRE. The client does not
support Java 11.

NOTE
On Windows, if you see an error that Databricks Connect cannot find winutils.exe , see Cannot find winutils.exe on
Windows.

Set up the client


NOTE
Before you begin to set up the Databricks Connect client, you must meet the requirements for Databricks Connect.

Step 1: Install the client


1. Uninstall PySpark. This is required because the databricks-connect package conflicts with PySpark. For
details, see Conflicting PySpark installations.

pip uninstall pyspark

2. Install the Databricks Connect client.

pip install -U "databricks-connect==7.3.*" # or X.Y.* to match your cluster version.

NOTE
Always specify databricks-connect==X.Y.* instead of databricks-connect=X.Y , to make sure that the
newest package is installed.

Step 2: Configure connection properties


1. Collect the following configuration properties:
Azure Databricks workspace URL.
Azure Databricks personal access token or an Azure Active Directory token.
For Azure Data Lake Storage (ADLS) credential passthrough, you must use an Azure Active
Directory token. Azure Active Directory credential passthrough is supported only on Standard
clusters running Databricks Runtime 7.3 LTS and above, and is not compatible with service
principal authentication.
For more information about authentication with Azure Active Directory tokens, see
Authentication using Azure Active Directory tokens.
The ID of the cluster you created. You can obtain the cluster ID from the URL. Here the cluster ID is
1108-201635-xxxxxxxx .

The unique organization ID for your workspace. See Get workspace, cluster, notebook, folder,
model, and job identifiers.
The port that Databricks Connect connects to. The default port is 15001 . If your cluster is
configured to use a different port, such as 8787 which was given in previous instructions for
Azure Databricks, use the configured port number.
2. Configure the connection. You can use the CLI, SQL configs, or environment variables. The precedence of
configuration methods from highest to lowest is: SQL config keys, CLI, and environment variables.
CLI
a. Run databricks-connect configure .

databricks-connect configure

The license displays:


Copyright (2018) Databricks, Inc.

This library (the "Software") may not be used except in connection with the
Licensee's use of the Databricks Platform Services pursuant to an Agreement
...

b. Accept the license and supply configuration values. For Databricks Host and Databricks
Token , enter the workspace URL and the personal access token you noted in Step 1.

Do you accept the above agreement? [y/N] y


Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]: <databricks-url>
Databricks Token [no current value]: <databricks-token>
Cluster ID (e.g., 0921-001415-jelly628) [no current value]: <cluster-id>
Org ID (Azure-only, see ?o=orgId in URL) [0]: <org-id>
Port [15001]: <port>

If you get a message that the Azure Active Directory token is too long, you can leave the Databricks
Token field empty and manually enter the token in ~/.databricks-connect .
SQL configs or environment variables. The following table shows the SQL config keys and the
environment variables that correspond to the configuration properties you noted in Step 1. To set
a SQL config key, use sql("set config=value") . For example:
sql("set spark.databricks.service.clusterId=0304-201045-abcdefgh") .

PARAMETER          SQL CONFIG KEY          ENVIRONMENT VARIABLE NAME

Databricks Host spark.databricks.service.address DATABRICKS_ADDRESS

Databricks Token spark.databricks.service.token DATABRICKS_API_TOKEN

Cluster ID spark.databricks.service.clusterId DATABRICKS_CLUSTER_ID

Org ID spark.databricks.service.orgId DATABRICKS_ORG_ID

Port spark.databricks.service.port DATABRICKS_PORT

3. Test connectivity to Azure Databricks.

databricks-connect test

If the cluster you configured is not running, the test starts the cluster which will remain running until its
configured autotermination time. The output should be something like:
* PySpark is installed at /.../3.5.6/lib/python3.5/site-packages/pyspark
* Checking java version
java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)
* Testing scala command
18/12/10 16:38:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/12/10 16:38:50 WARN MetricsSystem: Using default name SparkStatusTracker for source because
neither spark.metrics.namespace nor spark.app.id is set.
18/12/10 16:39:53 WARN SparkServiceRPCClient: Now tracking server state for 5abb7c7e-df8e-4290-947c-
c9a38601024e, invalidating prev state
18/12/10 16:39:59 WARN SparkServiceRPCClient: Syncing 129 files (176036 bytes) took 3003 ms
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT
/_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at https://10.8.5.214:4040
Spark context available as 'sc' (master = local[*], app id = local-1544488730553).
Spark session available as 'spark'.
View job details at <databricks-url>/?o=0#/setting/clusters/<cluster-id>/sparkUi
View job details at <databricks-url>?o=0#/setting/clusters/<cluster-id>/sparkUi
res0: Long = 4950

scala> :quit

* Testing python command


18/12/10 16:40:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/12/10 16:40:17 WARN MetricsSystem: Using default name SparkStatusTracker for source because
neither spark.metrics.namespace nor spark.app.id is set.
18/12/10 16:40:28 WARN SparkServiceRPCClient: Now tracking server state for 5abb7c7e-df8e-4290-947c-
c9a38601024e, invalidating prev state
View job details at <databricks-url>/?o=0#/setting/clusters/<cluster-id>/sparkUi

Set up your IDE or notebook server


This section describes how to configure your preferred IDE or notebook server to use the Databricks Connect
client.
In this section:
Jupyter notebook
PyCharm
SparkR and RStudio Desktop
sparklyr and RStudio Desktop
IntelliJ (Scala or Java)
Eclipse
Visual Studio Code
SBT
Jupyter notebook

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.

The Databricks Connect configuration script automatically adds the package to your project configuration. To get
started in a Python kernel, run:

from pyspark.sql import SparkSession


spark = SparkSession.builder.getOrCreate()

To enable the %sql shorthand for running and visualizing SQL queries, use the following snippet:

from IPython.core.magic import line_magic, line_cell_magic, Magics, magics_class

@magics_class
class DatabricksConnectMagics(Magics):

@line_cell_magic
def sql(self, line, cell=None):
if cell and line:
raise ValueError("Line must be empty for cell magic", line)
try:
from autovizwidget.widget.utils import display_dataframe
except ImportError:
print("Please run `pip install autovizwidget` to enable the visualization widget.")
display_dataframe = lambda x: x
return display_dataframe(self.get_spark().sql(cell or line).toPandas())

def get_spark(self):
user_ns = get_ipython().user_ns
if "spark" in user_ns:
return user_ns["spark"]
else:
from pyspark.sql import SparkSession
user_ns["spark"] = SparkSession.builder.getOrCreate()
return user_ns["spark"]

ip = get_ipython()
ip.register_magics(DatabricksConnectMagics)
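
With the magic registered, you can then run and visualize any Spark SQL statement from a notebook cell, for example:

%sql SHOW DATABASES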

PyCharm

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
The Databricks Connect configuration script automatically adds the package to your project configuration.
Python 3 clusters
1. When you create a PyCharm project, select Existing Interpreter . From the drop-down menu, select the
Conda environment you created (see Requirements).

2. Go to Run > Edit Configurations .


3. Add PYSPARK_PYTHON=python3 as an environment variable.

SparkR and RStudio Desktop


NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.

1. Download and unpack the open source Spark onto your local machine. Choose the same version as in
your Azure Databricks cluster (Hadoop 2.7).
2. Run databricks-connect get-jar-dir . This command returns a path like
/usr/local/lib/python3.5/dist-packages/pyspark/jars . Copy the file path of one directory above the JAR
directory file path, for example, /usr/local/lib/python3.5/dist-packages/pyspark , which is the SPARK_HOME
directory.
3. Configure the Spark lib path and Spark home by adding them to the top of your R script. Set
<spark-lib-path> to the directory where you unpacked the open source Spark package in step 1. Set
<spark-home-path> to the Databricks Connect directory from step 2.

# Point to the OSS package path, e.g., /path/to/.../spark-2.4.0-bin-hadoop2.7
library(SparkR, lib.loc = .libPaths(c(file.path('<spark-lib-path>', 'R', 'lib'), .libPaths())))

# Point to the Databricks Connect PySpark installation, e.g., /path/to/.../pyspark
Sys.setenv(SPARK_HOME = "<spark-home-path>")

4. Initiate a Spark session and start running SparkR commands.

sparkR.session()

df <- as.DataFrame(faithful)
head(df)

df1 <- dapply(df, function(x) { x }, schema(df))


collect(df1)

sparklyr and RStudio Desktop

IMPORTANT
This feature is in Public Preview.

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
You can copy sparklyr-dependent code that you’ve developed locally using Databricks Connect and run it in an Azure
Databricks notebook or hosted RStudio Server in your Azure Databricks workspace with minimal or no code changes.

In this section:
Requirements
Install, configure, and use sparklyr
Resources
sparklyr and RStudio Desktop limitations
Requirements
sparklyr 1.2 or above.
Databricks Runtime 6.4 or above with matching Databricks Connect.
Install, configure, and use sparklyr
1. In RStudio Desktop, install sparklyr 1.2 or above from CRAN or install the latest master version from
GitHub.

# Install from CRAN
install.packages("sparklyr")

# Or install the latest master version from GitHub
install.packages("devtools")
devtools::install_github("sparklyr/sparklyr")

2. Activate the Python environment with Databricks Connect installed and run the following command in
the terminal to get the <spark-home-path> :

databricks-connect get-spark-home

3. Initiate a Spark session and start running sparklyr commands.

library(sparklyr)
sc <- spark_connect(method = "databricks", spark_home = "<spark-home-path>")

iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

library(dplyr)
src_tbls(sc)

iris_tbl %>% count

4. Close the connection.

spark_disconnect(sc)

Resources
For more information, see the sparklyr GitHub README.
For code examples, see sparklyr.
sparklyr and RStudio Desktop limitations
The following features are unsupported:
sparklyr streaming APIs
sparklyr ML APIs
broom APIs
csv_file serialization mode
spark submit
IntelliJ (Scala or Java)

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.

1. Run databricks-connect get-jar-dir .


2. Point the dependencies to the directory returned from the command. Go to File > Project Structure >
Modules > Dependencies > ‘+’ sign > JARs or Directories .

To avoid conflicts, we strongly recommend removing any other Spark installations from your classpath. If
this is not possible, make sure that the JARs you add are at the front of the classpath. In particular, they
must be ahead of any other installed version of Spark (otherwise you will either use one of those other
Spark versions and run locally or throw a ClassDefNotFoundError ).
3. Check the setting of the breakout option in IntelliJ. The default is All and will cause network timeouts if
you set breakpoints for debugging. Set it to Thread to avoid stopping the background network threads.
Eclipse

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.

1. Run databricks-connect get-jar-dir .


2. Point the external JARs configuration to the directory returned from the command. Go to Project menu
> Proper ties > Java Build Path > Libraries > Add External Jars .

To avoid conflicts, we strongly recommend removing any other Spark installations from your classpath. If
this is not possible, make sure that the JARs you add are at the front of the classpath. In particular, they
must be ahead of any other installed version of Spark (otherwise you will either use one of those other
Spark versions and run locally or throw a ClassDefNotFoundError ).
Visual Studio Code

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.

1. Verify that the Python extension is installed.


2. Open the Command Palette (Command+Shift+P on macOS and Ctrl+Shift+P on Windows/Linux).
3. Select a Python interpreter. Go to Code > Preferences > Settings , and choose python settings .
4. Run databricks-connect get-jar-dir .
5. Add the directory returned from the command to the User Settings JSON under python.venvPath . This
should be added to the Python Configuration.
6. Disable the linter. Click the … on the right side and edit json settings . The modified settings are as
follows:
7. If running with a virtual environment, which is the recommended way to develop for Python in VS Code,
in the Command Palette type select python interpreter and point to your environment that matches
your cluster Python version.

For example, if your cluster is Python 3.5, your local environment should be Python 3.5.

SBT

NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.

To use SBT, you must configure your build.sbt file to link against the Databricks Connect JARs instead of the
usual Spark library dependency. You do this with the unmanagedBase directive in the following example build file,
which assumes a Scala app that has a com.example.Test main object:
build.sbt
name := "hello-world"
version := "1.0"
scalaVersion := "2.11.6"
// this should be set to the path returned by ``databricks-connect get-jar-dir``
unmanagedBase := new java.io.File("/usr/local/lib/python2.7/dist-packages/pyspark/jars")
mainClass := Some("com.example.Test")

Run examples from your IDE


Java

import java.util.ArrayList;
import java.util.List;
import java.sql.Date;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.Dataset;

public class App {


public static void main(String[] args) throws Exception {
SparkSession spark = SparkSession
.builder()
.appName("Temps Demo")
.config("spark.master", "local")
.getOrCreate();

// Create a Spark DataFrame consisting of high and low temperatures
// by airport code and date.
StructType schema = new StructType(new StructField[] {
new StructField("AirportCode", DataTypes.StringType, false, Metadata.empty()),
new StructField("Date", DataTypes.DateType, false, Metadata.empty()),
new StructField("TempHighF", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("TempLowF", DataTypes.IntegerType, false, Metadata.empty()),
});

List<Row> dataList = new ArrayList<Row>();


dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-03"), 52, 43));
dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-02"), 50, 38));
dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-01"), 52, 41));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-03"), 64, 45));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-02"), 61, 41));
dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-01"), 66, 39));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-03"), 57, 43));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-02"), 54, 39));
dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-01"), 56, 41));

Dataset<Row> temps = spark.createDataFrame(dataList, schema);

// Create a table on the Databricks cluster and then fill
// the table with the DataFrame's contents.
// If the table already exists from a previous run,
// delete it first.
spark.sql("USE default");
spark.sql("DROP TABLE IF EXISTS demo_temps_table");
temps.write().saveAsTable("demo_temps_table");

// Query the table on the Databricks cluster, returning rows
// where the airport code is not BLI and the date is later
// than 2021-04-01. Group the results and order by high
// temperature in descending order.
Dataset<Row> df_temps = spark.sql("SELECT * FROM demo_temps_table " +
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
"GROUP BY AirportCode, Date, TempHighF, TempLowF " +
"ORDER BY TempHighF DESC");
df_temps.show();

// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+

// Clean up by deleting the table from the Databricks cluster.
spark.sql("DROP TABLE demo_temps_table");
}
}

Python
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from datetime import date

spark = SparkSession.builder.appName('temps-demo').getOrCreate()

# Create a Spark DataFrame consisting of high and low temperatures
# by airport code and date.
schema = StructType([
StructField('AirportCode', StringType(), False),
StructField('Date', DateType(), False),
StructField('TempHighF', IntegerType(), False),
StructField('TempLowF', IntegerType(), False)
])

data = [
[ 'BLI', date(2021, 4, 3), 52, 43],
[ 'BLI', date(2021, 4, 2), 50, 38],
[ 'BLI', date(2021, 4, 1), 52, 41],
[ 'PDX', date(2021, 4, 3), 64, 45],
[ 'PDX', date(2021, 4, 2), 61, 41],
[ 'PDX', date(2021, 4, 1), 66, 39],
[ 'SEA', date(2021, 4, 3), 57, 43],
[ 'SEA', date(2021, 4, 2), 54, 39],
[ 'SEA', date(2021, 4, 1), 56, 41]
]

temps = spark.createDataFrame(data, schema)

# Create a table on the Databricks cluster and then fill
# the table with the DataFrame's contents.
# If the table already exists from a previous run,
# delete it first.
spark.sql('USE default')
spark.sql('DROP TABLE IF EXISTS demo_temps_table')
temps.write.saveAsTable('demo_temps_table')

# Query the table on the Databricks cluster, returning rows
# where the airport code is not BLI and the date is later
# than 2021-04-01. Group the results and order by high
# temperature in descending order.
df_temps = spark.sql("SELECT * FROM demo_temps_table " \
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " \
"GROUP BY AirportCode, Date, TempHighF, TempLowF " \
"ORDER BY TempHighF DESC")
df_temps.show()

# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode| Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# | PDX|2021-04-03| 64| 45|
# | PDX|2021-04-02| 61| 41|
# | SEA|2021-04-03| 57| 43|
# | SEA|2021-04-02| 54| 39|
# +-----------+----------+---------+--------+

# Clean up by deleting the table from the Databricks cluster.
spark.sql('DROP TABLE demo_temps_table')

Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.sql.Date

object Demo {
def main(args: Array[String]) {
val spark = SparkSession.builder.master("local").getOrCreate()

// Create a Spark DataFrame consisting of high and low temperatures
// by airport code and date.
val schema = StructType(Array(
StructField("AirportCode", StringType, false),
StructField("Date", DateType, false),
StructField("TempHighF", IntegerType, false),
StructField("TempLowF", IntegerType, false)
))

val data = List(


Row("BLI", Date.valueOf("2021-04-03"), 52, 43),
Row("BLI", Date.valueOf("2021-04-02"), 50, 38),
Row("BLI", Date.valueOf("2021-04-01"), 52, 41),
Row("PDX", Date.valueOf("2021-04-03"), 64, 45),
Row("PDX", Date.valueOf("2021-04-02"), 61, 41),
Row("PDX", Date.valueOf("2021-04-01"), 66, 39),
Row("SEA", Date.valueOf("2021-04-03"), 57, 43),
Row("SEA", Date.valueOf("2021-04-02"), 54, 39),
Row("SEA", Date.valueOf("2021-04-01"), 56, 41)
)

val rdd = spark.sparkContext.makeRDD(data)


val temps = spark.createDataFrame(rdd, schema)

// Create a table on the Databricks cluster and then fill
// the table with the DataFrame's contents.
// If the table already exists from a previous run,
// delete it first.
spark.sql("USE default")
spark.sql("DROP TABLE IF EXISTS demo_temps_table")
temps.write.saveAsTable("demo_temps_table")

// Query the table on the Databricks cluster, returning rows
// where the airport code is not BLI and the date is later
// than 2021-04-01. Group the results and order by high
// temperature in descending order.
val df_temps = spark.sql("SELECT * FROM demo_temps_table " +
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
"GROUP BY AirportCode, Date, TempHighF, TempLowF " +
"ORDER BY TempHighF DESC")
df_temps.show()

// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+

// Clean up by deleting the table from the Databricks cluster.
spark.sql("DROP TABLE demo_temps_table")
}
}
Work with dependencies
Typically your main class or Python file will have other dependency JARs and files. You can add such dependency
JARs and files by calling sparkContext.addJar("path-to-the-jar") or
sparkContext.addPyFile("path-to-the-file") . You can also add Egg files and zip files with the addPyFile()
interface. Every time you run the code in your IDE, the dependency JARs and files are installed on the cluster.
Python

from lib import Foo


from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sc = spark.sparkContext
#sc.setLogLevel("INFO")

print("Testing simple count")


print(spark.range(100).count())

print("Testing addPyFile isolation")


sc.addPyFile("lib.py")
print(sc.parallelize(range(10)).map(lambda i: Foo(2)).collect())

class Foo(object):
def __init__(self, x):
self.x = x

Python + Java UDFs

from pyspark.sql import SparkSession


from pyspark.sql.column import _to_java_column, _to_seq, Column

## In this example, udf.jar contains compiled Java / Scala UDFs:


#package com.example
#
#import org.apache.spark.sql._
#import org.apache.spark.sql.expressions._
#import org.apache.spark.sql.functions.udf
#
#object Test {
# val plusOne: UserDefinedFunction = udf((i: Long) => i + 1)
#}

spark = SparkSession.builder \
.config("spark.jars", "/path/to/udf.jar") \
.getOrCreate()
sc = spark.sparkContext

def plus_one_udf(col):
f = sc._jvm.com.example.Test.plusOne()
return Column(f.apply(_to_seq(sc, [col], _to_java_column)))

sc._jsc.addJar("/path/to/udf.jar")
spark.range(100).withColumn("plusOne", plus_one_udf("id")).show()

Scala
package com.example

import org.apache.spark.sql.SparkSession

case class Foo(x: String)

object Test {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
...
.getOrCreate();
spark.sparkContext.setLogLevel("INFO")

println("Running simple show query...")


spark.read.format("parquet").load("/tmp/x").show()

println("Running simple UDF query...")


spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
spark.udf.register("f", (x: Int) => x + 1)
spark.range(10).selectExpr("f(id)").show()

println("Running custom objects query...")


val objs = spark.sparkContext.parallelize(Seq(Foo("bye"), Foo("hi"))).collect()
println(objs.toSeq)
}
}

Access DBUtils
You can use the dbutils.fs and dbutils.secrets utilities of the Databricks Utilities module. Supported commands
are dbutils.fs.cp , dbutils.fs.head , dbutils.fs.ls , dbutils.fs.mkdirs , dbutils.fs.mv , dbutils.fs.put ,
dbutils.fs.rm , dbutils.secrets.get , dbutils.secrets.getBytes , dbutils.secrets.list ,
dbutils.secrets.listScopes . See File system utility (dbutils.fs) or run dbutils.fs.help() and Secrets utility
(dbutils.secrets) or run dbutils.secrets.help() .
Python

from pyspark.sql import SparkSession


from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()

dbutils = DBUtils(spark)
print(dbutils.fs.ls("dbfs:/"))
print(dbutils.secrets.listScopes())

When using Databricks Runtime 7.3 LTS or above, to access the DBUtils module in a way that works both locally
and in Azure Databricks clusters, use the following get_dbutils() :

def get_dbutils(spark):
from pyspark.dbutils import DBUtils
return DBUtils(spark)

Otherwise, use the following get_dbutils() :


def get_dbutils(spark):
if spark.conf.get("spark.databricks.service.client.enabled") == "true":
from pyspark.dbutils import DBUtils
return DBUtils(spark)
else:
import IPython
return IPython.get_ipython().user_ns["dbutils"]

Scala

val dbutils = com.databricks.service.DBUtils


println(dbutils.fs.ls("dbfs:/"))
println(dbutils.secrets.listScopes())

Copying files between local and remote filesystems


You can use dbutils.fs to copy files between your client and remote filesystems. Scheme file:/ refers to the
local filesystem on the client.

from pyspark.dbutils import DBUtils


dbutils = DBUtils(spark)

dbutils.fs.cp('file:/home/user/data.csv', 'dbfs:/uploads')
dbutils.fs.cp('dbfs:/output/results.csv', 'file:/home/user/downloads/')

The maximum file size that can be transferred that way is 250 MB.
Enable dbutils.secrets.get

Because of security restrictions, the ability to call dbutils.secrets.get is disabled by default. Contact Azure
Databricks support to enable this feature for your workspace.

Access the Hadoop filesystem


You can also access DBFS directly using the standard Hadoop filesystem interface:

> import org.apache.hadoop.fs._

// get new DBFS connection


> val dbfs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
dbfs: org.apache.hadoop.fs.FileSystem = com.databricks.backend.daemon.data.client.DBFS@2d036335

// list files
> dbfs.listStatus(new Path("dbfs:/"))
res1: Array[org.apache.hadoop.fs.FileStatus] = Array(FileStatus{path=dbfs:/$; isDirectory=true; ...})

// open file
> val stream = dbfs.open(new Path("dbfs:/path/to/your_file"))
stream: org.apache.hadoop.fs.FSDataInputStream = org.apache.hadoop.fs.FSDataInputStream@7aa4ef24

// get file contents as string


> import org.apache.commons.io._
> println(new String(IOUtils.toByteArray(stream)))

Set Hadoop configurations


On the client you can set Hadoop configurations using the spark.conf.set API, which applies to SQL and
DataFrame operations. Hadoop configurations set on the sparkContext must be set in the cluster configuration
or using a notebook. This is because configurations set on sparkContext are not tied to user sessions but apply
to the entire cluster.
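
For example, a minimal sketch in Python; the configuration key and placeholder values shown here are
assumptions, not taken from this article:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Applies to SQL and DataFrame operations run through this session.
# Replace the key and value with the Hadoop configuration you actually need.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>")

spark.read.csv("abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data.csv").show()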

Troubleshooting
Run databricks-connect test to check for connectivity issues. This section describes some common issues you
may encounter and how to resolve them.
Python version mismatch
Check that the Python version you are using locally has at least the same minor release as the version on the cluster
(for example, 3.5.1 versus 3.5.2 is OK, 3.5 versus 3.6 is not).
If you have multiple Python versions installed locally, ensure that Databricks Connect is using the right one by
setting the PYSPARK_PYTHON environment variable (for example, PYSPARK_PYTHON=python3 ).
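For example, a quick way to print the local version from Python itself (a minimal sketch, not part of the
original article):

import sys

# Compare the major and minor values with the Python version of your cluster's Databricks Runtime.
print(sys.version_info[:3])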
Server not enabled
Ensure the cluster has the Spark server enabled with spark.databricks.service.server.enabled true . You should
see the following lines in the driver log if it is:

18/10/25 21:39:18 INFO SparkConfUtils$: Set spark config:


spark.databricks.service.server.enabled -> true
...
18/10/25 21:39:21 INFO SparkContext: Loading Spark Service RPC Server
18/10/25 21:39:21 INFO SparkServiceRPCServer:
Starting Spark Service RPC Server
18/10/25 21:39:21 INFO Server: jetty-9.3.20.v20170531
18/10/25 21:39:21 INFO AbstractConnector: Started ServerConnector@6a6c7f42
{HTTP/1.1,[http/1.1]}{0.0.0.0:15001}
18/10/25 21:39:21 INFO Server: Started @5879ms

Conflicting PySpark installations


The databricks-connect package conflicts with PySpark. Having both installed will cause errors when initializing
the Spark context in Python. This can manifest in several ways, including “stream corrupted” or “class not found”
errors. If you have PySpark installed in your Python environment, ensure it is uninstalled before installing
databricks-connect. After uninstalling PySpark, make sure to fully re-install the Databricks Connect package:

pip uninstall pyspark


pip uninstall databricks-connect
pip install -U "databricks-connect==9.1.*" # or X.Y.* to match your cluster version.

Conflicting SPARK_HOME

If you have previously used Spark on your machine, your IDE may be configured to use one of those other
versions of Spark rather than the Databricks Connect Spark. This can manifest in several ways, including “stream
corrupted” or “class not found” errors. You can see which version of Spark is being used by checking the value of
the SPARK_HOME environment variable:
Java

System.out.println(System.getenv("SPARK_HOME"));

Python

import os
print(os.environ['SPARK_HOME'])
Scala

println(sys.env.get("SPARK_HOME"))

Resolution
If SPARK_HOME is set to a version of Spark other than the one in the client, you should unset the SPARK_HOME
variable and try again.
Check your IDE environment variable settings, your .bashrc , .zshrc , or .bash_profile file, and anywhere else
environment variables might be set. You will most likely have to quit and restart your IDE to purge the old state,
and you may even need to create a new project if the problem persists.
You should not need to set SPARK_HOME to a new value; unsetting it should be sufficient.
Conflicting or Missing PATH entry for binaries
It is possible your PATH is configured so that commands like spark-shell will be running some other previously
installed binary instead of the one provided with Databricks Connect. This can cause databricks-connect test to
fail. You should make sure either the Databricks Connect binaries take precedence, or remove the previously
installed ones.
If you can’t run commands like spark-shell , it is also possible your PATH was not automatically set up by
pip install and you’ll need to add the installation bin dir to your PATH manually. It’s possible to use
Databricks Connect with IDEs even if this isn’t set up. However, the databricks-connect test command will not
work.
Conflicting serialization settings on the cluster
If you see “stream corrupted” errors when running databricks-connect test , this may be due to incompatible
cluster serialization configs. For example, setting the spark.io.compression.codec config can cause this issue. To
resolve this issue, consider removing these configs from the cluster settings, or setting the configuration in the
Databricks Connect client.
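For example, a minimal sketch of setting such a configuration on the Databricks Connect client so that it matches
the cluster; the codec value shown is an assumption, use whatever your cluster is configured with:

from pyspark.sql import SparkSession

# Mirror the cluster's spark.io.compression.codec setting on the client to avoid
# "stream corrupted" errors caused by mismatched serialization settings.
spark = SparkSession.builder \
    .config("spark.io.compression.codec", "lz4") \
    .getOrCreate()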
Cannot find winutils.exe on Windows
If you are using Databricks Connect on Windows and see:

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Follow the instructions to configure the Hadoop path on Windows.


The filename, directory name, or volume label syntax is incorrect on Windows
If you are using Databricks Connect on Windows and see:

The filename, directory name, or volume label syntax is incorrect.

Either Java or Databricks Connect was installed into a directory with a space in your path. You can work around
this by either installing into a directory path without spaces, or configuring your path using the short name
form.

Authentication using Azure Active Directory tokens


When you use Databricks Connect, you can authenticate by using an Azure Active Directory token instead of a
personal access token. Azure Active Directory tokens have a limited lifetime. When the Azure Active Directory
token expires, Databricks Connect fails with an Invalid Token error.
In Databricks Connect 7.3.5 and above, you can provide the Azure Active Directory token in your running
Databricks Connect application. Your application needs to obtain the new access token, and set it to the
spark.databricks.service.token SQL config key.

Python

spark.conf.set("spark.databricks.service.token", new_aad_token)

Scala

spark.conf.set("spark.databricks.service.token", newAADToken)

After you update the token, the application can continue to use the same SparkSession and any objects and
state that are created in the context of the session. To avoid intermittent errors, Databricks recommends that you
provide a new token before the old token expires.
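As an illustration, the following Python sketch refreshes the token with the azure-identity package; the package
choice and the resource ID shown are assumptions rather than part of this article:

from azure.identity import DefaultAzureCredential

# The scope below uses the well-known Azure Databricks application ID; verify it for your environment.
credential = DefaultAzureCredential()
new_aad_token = credential.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token

# Hand the refreshed token to the running Databricks Connect application.
spark.conf.set("spark.databricks.service.token", new_aad_token)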
You can extend the lifetime of the Azure Active Directory token to persist during the execution of your
application. To do that, attach a TokenLifetimePolicy with an appropriately long lifetime to the Azure Active
Directory authorization application that you used to acquire the access token.

NOTE
Azure Active Directory passthrough uses two tokens: the Azure Active Directory access token that was previously
described that you configure in Databricks Connect, and the ADLS passthrough token for the specific resource that
Databricks generates while Databricks processes the request. You cannot extend the lifetime of ADLS passthrough tokens
by using Azure Active Directory token lifetime policies. If you send a command to the cluster that takes longer than an
hour, it will fail if the command accesses an ADLS resource after the one hour mark.

Limitations
Databricks Connect does not support the following Azure Databricks features and third-party platforms:
Structured Streaming.
Running arbitrary code that is not a part of a Spark job on the remote cluster.
Native Scala, Python, and R APIs for Delta table operations (for example, DeltaTable.forPath ) are not
supported. However, the SQL API ( spark.sql(...) ) with Delta Lake operations and the Spark API (for
example, spark.read.load ) on Delta tables are both supported.
Copy into.
Apache Zeppelin 0.7.x and below.
Connecting to clusters with table access control.
Connecting to clusters with process isolation enabled (in other words, where
spark.databricks.pyspark.enableProcessIsolation is set to true ).

Delta CLONE SQL command.


Global temporary views.
Koalas.
Azure Active Directory credential passthrough is supported only on standard clusters running Databricks
Runtime 7.3 LTS and above, and is not compatible with service principal authentication.
The following Databricks Utilities:
library
notebook workflow
widgets
Databricks SQL Connector for Python
7/21/2022 • 14 minutes to read

The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL
commands on Azure Databricks clusters and Databricks SQL warehouses. The Databricks SQL Connector for
Python is easier to set up and use than similar Python libraries such as pyodbc. This library follows PEP 249 –
Python Database API Specification v2.0.

Requirements
A development machine running Python >=3.7, <3.10.
An existing cluster or SQL warehouse.

Get started
Gather the following information for the cluster or SQL warehouse that you want to use:
Cluster
The server hostname of the cluster. You can get this from the Server Hostname value in the
Advanced Options > JDBC/ODBC tab for your cluster.
The HTTP path of the cluster. You can get this from the HTTP Path value in the Advanced Options >
JDBC/ODBC tab for your cluster.
A valid access token. You can use an Azure Databricks personal access token for the workspace. You
can also use an Azure Active Directory access token.

NOTE
As a security best practice, you should not hard-code this information into your code. Instead, you should retrieve
this information from a secure location. For example, the code examples later in this article use environment
variables.

SQL warehouse
The server hostname of the SQL warehouse. You can get this from the Server Hostname value in the
Connection Details tab for your SQL warehouse.
The HTTP path of the SQL warehouse. You can get this from the HTTP Path value in the Connection
Details tab for your SQL warehouse.
A valid access token. You can use an Azure Databricks personal access token for the workspace. You
can also use an Azure Active Directory access token.

NOTE
As a security best practice, you should not hard-code this information into your code. Instead, you should retrieve
this information from a secure location. For example, the code examples later in this article use environment
variables.

Install the Databricks SQL Connector for Python library on your development machine by running
pip install databricks-sql-connector .
Examples
The following code examples demonstrate how to use the Databricks SQL Connector for Python to query and
insert data, query metadata, manage cursors and connections, and configure logging.
These code examples retrieve their server_hostname , http_path , and access_token connection variable values
from these environment variables:
DATABRICKS_SERVER_HOSTNAME , which represents the Server Hostname value from the requirements.
DATABRICKS_HTTP_PATH , which represents the HTTP Path value from the requirements.
DATABRICKS_TOKEN , which represents your access token from the requirements.

You can use other approaches to retrieving these connection variable values. Using environment variables is just
one approach among many.
Query data
Insert data
Query metadata
Manage cursors and connections
Configure logging
Query data
The following code example demonstrates how to call the Databricks SQL Connector for Python to run a basic
SQL command on a cluster or SQL warehouse. This command returns the first two rows from the diamonds
table.

NOTE
The diamonds table is included in the Sample datasets (databricks-datasets).

from databricks import sql


import os

with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),


http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN")) as connection:

with connection.cursor() as cursor:


cursor.execute("SELECT * FROM default.diamonds LIMIT 2")
result = cursor.fetchall()

for row in result:


print(row)

Insert data
The following example demonstrates how to insert small amounts of data (thousands of rows):
from databricks import sql
import os

with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),


http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN")) as connection:

with connection.cursor() as cursor:


cursor.execute("CREATE TABLE IF NOT EXISTS squares (x int, x_squared int)")

squares = [(i, i * i) for i in range(100)]


values = ",".join([f"({x}, {y})" for (x, y) in squares])

cursor.execute(f"INSERT INTO squares VALUES {values}")

cursor.execute("SELECT * FROM squares LIMIT 10")

result = cursor.fetchall()

for row in result:


print(row)

For large amounts of data, you should first upload the data to cloud storage and then execute the COPY INTO
command.
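For example, a minimal sketch of running COPY INTO through the same cursor; the cloud storage path is a
placeholder assumption:

# Assumes the CSV files have already been uploaded to the ABFSS path below and
# that the squares table from the previous example exists.
cursor.execute("""
    COPY INTO squares
    FROM 'abfss://<container>@<storage-account>.dfs.core.windows.net/uploads/squares/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")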
Query metadata
There are dedicated methods for retrieving metadata. The following example retrieves metadata about columns
in a sample table:

from databricks import sql


import os

with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),


http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN")) as connection:

with connection.cursor() as cursor:


cursor.columns(schema_name="default", table_name="squares")
print(cursor.fetchall())

Manage cursors and connections


It is best practice to close any connections and cursors when you are finished with them. This frees resources on
Azure Databricks clusters and Databricks SQL warehouses.
You can use a context manager (the with syntax used in previous examples) to manage the resources, or
explicitly call close :
from databricks import sql
import os

connection = sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),


http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN"))

cursor = connection.cursor()

cursor.execute("SELECT * from range(10)")


print(cursor.fetchall())

cursor.close()
connection.close()

Configure logging
The Databricks SQL Connector uses Python’s standard logging module. You can configure the logging level
similar to the following:

from databricks import sql


import os, logging

logging.getLogger("databricks.sql").setLevel(logging.DEBUG)
logging.basicConfig(filename = "results.log",
level = logging.DEBUG)

connection = sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),


http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN"))

cursor = connection.cursor()

cursor.execute("SELECT * from range(10)")

result = cursor.fetchall()

for row in result:


logging.debug(row)

cursor.close()
connection.close()

API reference
Package
Module
Methods
connect method
Classes
Connection class
Methods
close method
cursor method
Cursor class
Attributes
arraysize attribute
description attribute
Methods
cancel method
close method
execute method
executemany method
catalogs method
schemas method
tables method
columns method
fetchall method
fetchmany method
fetchone method
fetchall_arrow method
fetchmany_arrow method
Row class
Methods
asDict method
Type conversions
Package
databricks-sql-connector

Usage: pip install databricks-sql-connector

See also databricks-sql-connector in the Python Package Index (PyPI).


Module
databricks.sql

Usage: from databricks import sql

Methods
connect method

Creates a connection to a database.


Returns a Connection object.

PARAMETERS

server_hostname

Type: str

The server hostname for the cluster or SQL warehouse. To get the server hostname, see the instructions earlier in this article.

This parameter is required.

Example: adb-1234567890123456.7.azuredatabricks.net

http_path

Type: str

The HTTP path of the cluster or SQL warehouse. To get the HTTP path, see the instructions earlier in this article.

This parameter is required.

Example:
sql/protocolv1/o/1234567890123456/1234-567890-test123 for a cluster.
/sql/1.0/warehouses/a1b234c567d8e9fa for a SQL warehouse.

access_token

Type: str

Your Azure Databricks personal access token or Azure Active Directory token for the workspace for the cluster or SQL
warehouse. To create a token, see the instructions earlier in this article.

This parameter is required.

Example: dapi...<the-remaining-portion-of-your-token>

session_configuration

Type: dict[str, Any]

A dictionary of Spark session configuration parameters. Setting a configuration is equivalent to using the SET key=val SQL
command. Run the SQL command SET -v to get a full list of available configurations.

Defaults to None .

This parameter is optional.

Example: {"spark.sql.variable.substitute": True}

http_headers

Type: List[Tuple[str, str]]

Additional (key, value) pairs to set in HTTP headers on every RPC request the client makes. Typical usage will not set any extra
HTTP headers. Defaults to None .

This parameter is optional.

Since version 2.0

catalog

Type: str

Initial catalog to use for the connection. Defaults to None (in which case the default catalog, typically hive_metastore , will
be used).

This parameter is optional.

Since version 2.0



schema

Type: str

Initial schema to use for the connection. Defaults to None (in which case the default schema default will be used).

This parameter is optional.

Since version 2.0
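
Putting the optional parameters together, a minimal sketch of a connect call looks like the following; the catalog
and schema values shown are simply the documented defaults made explicit:

from databricks import sql
import os

connection = sql.connect(
    server_hostname       = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path             = os.getenv("DATABRICKS_HTTP_PATH"),
    access_token          = os.getenv("DATABRICKS_TOKEN"),
    session_configuration = {"spark.sql.variable.substitute": True},
    catalog               = "hive_metastore",
    schema                = "default")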

Classes
Connection class
Represents a connection to a database.
Methods
close method

Closes the connection to the database and releases all associated resources on the server. Any additional calls to
this connection will throw an Error .
No parameters.
No return value.
cursor method

Returns a mechanism that enables traversal over the records in a database.


No parameters.
Returns a Cursor object.
Cursor class
Attributes
arraysize attribute

Used with the fetchmany method, specifies the internal buffer size, which is also how many rows are actually
fetched from the server at a time. The default value is 10000 . For narrow results (results in which each row does
not contain a lot of data), you should increase this value for better performance.
Read-write access.
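For example, a minimal sketch; the value shown is an arbitrary assumption:

# Fetch more rows per round trip for a narrow result set.
cursor.arraysize = 100000
cursor.execute("SELECT id FROM range(1000000)")
rows = cursor.fetchmany()  # uses arraysize because no size argument is given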
description attribute

Contains a Python list of tuple objects. Each of these tuple objects contains 7 values, with the first 2 items
of each tuple object containing information describing a single result column as follows:
name : The name of the column.
type_code : A string representing the type of the column. For example, an integer column will have a type
code of int .

The remaining 5 items of each 7-item tuple object are not implemented, and their values are not defined. They
will typically be returned as 4 None values followed by a single True value.
Read-only access.
Methods
cancel method

Interrupts the running of any database query or command that the cursor has started. To release the associated
resources on the server, call the close method after calling the cancel method.
No parameters.
No return value.
close method

Closes the cursor and releases the associated resources on the server. Closing an already closed cursor might
throw an error.
No parameters.
No return value.
execute method

Prepares and then runs a database query or command.


No return value.

PARAMETERS

operation

Type: str

The query or command to prepare and then run.

This parameter is required.

Example without the parameters parameter:

cursor.execute(
'SELECT * FROM default.diamonds WHERE cut="Ideal" LIMIT 2'
)

Example with the parameters parameter:

cursor.execute(
'SELECT * FROM default.diamonds WHERE cut=%(cut_type)s LIMIT 2',
{ 'cut_type': 'Ideal' }
)

parameters

Type: dictionary

A sequence of parameters to use with the operation parameter.

This parameter is optional. The default is None .

executemany method

Prepares and then runs a database query or command using all parameter sequences in the seq_of_parameters
argument. Only the final result set is retained.
No return value.

PARAMETERS

operation

Type: str

The query or command to prepare and then run.

This parameter is required.



seq_of_parameters

Type: list of dict

A sequence of many sets of parameter values to use with the operation parameter.

This parameter is required.
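
For example, a minimal sketch that inserts several rows into the squares table from the earlier example, assuming
that table exists:

cursor.executemany(
    "INSERT INTO squares VALUES (%(x)s, %(x_squared)s)",
    [ { "x": 1, "x_squared": 1 },
      { "x": 2, "x_squared": 4 },
      { "x": 3, "x_squared": 9 } ])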

catalogs method

Execute a metadata query about the catalogs. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:

Field name: TABLE_CAT . Type: str . The name of the catalog.

No parameters.
No return value.
Since version 1.0
schemas method

Execute a metadata query about the schemas. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:

Field name: TABLE_SCHEM . Type: str . The name of the schema.


Field name: TABLE_CATALOG . Type: str . The catalog to which the schema belongs.

No return value.
Since version 1.0

PARAMETERS

catalog_name

Type: str

A catalog name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

schema_name

Type: str

A schema name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

tables method

Execute a metadata query about tables and views. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:

Field name: TABLE_CAT . Type: str . The catalog to which the table belongs.
Field name: TABLE_SCHEM . Type: str . The schema to which the table belongs.
Field name: TABLE_NAME . Type: str . The name of the table.
Field name: TABLE_TYPE . Type: str . The kind of relation, for example VIEW or TABLE (applies to Databricks
Runtime 10.2 and above as well as to Databricks SQL; prior versions of the Databricks Runtime return an
empty string).
No return value.
Since version 1.0

PARAMETERS

catalog_name

Type: str

A catalog name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

schema_name

Type: str

A schema name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

table_name

Type: str

A table name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

table_types

Type: List[str]

A list of table types to match, for example TABLE or VIEW .

This parameter is optional.
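
For example, a minimal sketch that lists tables in the default schema whose names start with demo, using the
cursor from the earlier examples:

cursor.tables(schema_name="default", table_name="demo%")
print(cursor.fetchall())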

columns method

Execute a metadata query about the columns. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:

Field name: TABLE_CAT . Type: str . The catalog to which the column belongs.
Field name: TABLE_SCHEM . Type: str . The schema to which the column belongs.
Field name: TABLE_NAME . Type: str . The name of the table to which the column belongs.
Field name: COLUMN_NAME . Type: str . The name of the column.

No return value.
Since version 1.0
PARAMETERS

catalog_name

Type: str

A catalog name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

schema_name

Type: str

A schema name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

table_name

Type: str

A table name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

column_name

Type: str

A column name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

fetchall method

Gets all (or all remaining) rows of a query.


No parameters.
Returns all (or all remaining) rows of the query as a Python list of Row objects.
Throws an Error if the previous call to the execute method did not return any data or no execute call has yet
been made.
fetchmany method

Gets the next rows of a query.


Returns up to size (or the arraysize attribute if size is not specified) of the next rows of a query as a Python
list of Row objects. If there are fewer than size rows left to be fetched, all remaining rows will be returned.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet
been made.
PARAMETERS

size

Type: int

The number of next rows to get.

This parameter is optional. If not specified, the value of the arraysize attribute is used.

Example: cursor.fetchmany(10)

fetchone method

Gets the next row of the dataset.


No parameters.
Returns the next row of the dataset as a single sequence as a Python tuple object, or returns None if there is
no more available data.
Throws an Error if the previous call to the execute method did not return any data or no execute call has yet
been made.
fetchall_arrow method

Gets all (or all remaining) rows of a query, as a PyArrow Table object. Queries returning very large amounts of
data should use fetchmany_arrow instead to reduce memory consumption.
No parameters.
Returns all (or all remaining) rows of the query as a PyArrow table.
Throws an Error if the previous call to the execute method did not return any data or no execute call has yet
been made.
Since version 2.0
fetchmany_arrow method

Gets the next rows of a query as a PyArrow Table object.


Returns up to the size argument (or the arraysize attribute if size is not specified) of the next rows of a query
as a Python PyArrow Table object.
Throws an Error if the previous call to the execute method did not return any data or no execute call has yet
been made.
Since version 2.0

PARAMETERS

size

Type: int

The number of next rows to get.

This parameter is optional. If not specified, the value of the arraysize attribute is used.

Example: cursor.fetchmany_arrow(10)
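
For example, a minimal sketch that fetches a result set as a PyArrow Table and converts it to a pandas DataFrame;
it assumes pyarrow and pandas are installed and reuses the cursor from the earlier examples:

cursor.execute("SELECT * FROM default.diamonds LIMIT 1000")
table = cursor.fetchall_arrow()  # a pyarrow.Table
df = table.to_pandas()
print(df.head())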

Row class
The row class is a tuple-like data structure that represents an individual result row. If the row contains a column
with the name "my_column" , you can access the "my_column" field of row via row.my_column . You can also use
numeric indices to access fields, for example row[0] . If the column name is not allowed as an attribute method
name (for example, it begins with a digit), then you can access the field as row["1_my_column"] .
Since version 1.0
Methods
asDict method

Return a dictionary representation of the row, which is indexed by field names. If there are duplicate field names,
one of the duplicate fields (but only one) will be returned in the dictionary. Which duplicate field is returned is
not defined.
No parameters.
Returns a dict of fields.
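For example, a minimal sketch of the different ways to read a fetched row, assuming the result set has a column
named my_column:

row = cursor.fetchone()
print(row.my_column)     # attribute access
print(row["my_column"])  # access by column name
print(row[0])            # access by numeric index
print(row.asDict())      # dictionary keyed by field names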
Type conversions
The following table maps Apache Spark SQL data types to their Python data type equivalents.

APACHE SPARK SQL DATA TYPE          PYTHON DATA TYPE

array str

bigint int

binary bytearray

boolean bool

date datetime.date

decimal decimal.Decimal

double float

int int

map str

null NoneType

smallint int

string str

struct str

timestamp datetime.datetime

tinyint int

Troubleshooting
tokenAuthWrapperInvalidAccessToken: Invalid access token message
Issue : When you run your code, you see a message similar to
Error during request to server: tokenAuthWrapperInvalidAccessToken: Invalid access token .
Possible cause : The value passed to access_token is not a valid Azure Databricks personal access token.
Recommended fix : Check that the value passed to access_token is correct and try again.
gaierror(8, 'nodename nor servname provided, or not known') message
Issue : When you run your code, you see a message similar to
Error during request to server: gaierror(8, 'nodename nor servname provided, or not known') .
Possible cause : The value passed to server_hostname is not the correct host name.
Recommended fix : Check that the value passed to server_hostname is correct and try again.
For more information on finding the server hostname, see Retrieve the connection details.
IpAclError message
Issue : When you run your code, you see the message Error during request to server: IpAclValidation when
you try to use the connector on an Azure Databricks notebook.
Possible cause : You may have IP allow listing enabled for the Azure Databricks workspace. With IP allow listing,
connections from Spark clusters back to the control plane are not allowed by default.
Recommended fix : Ask your administrator to add the data plane subnet to the IP allow list.

Additional resources
For more information, see:
Data types (Databricks SQL)
Built-in Types (for bool , bytearray , float , int , and str ) on the Python website
datetime (for datetime.date and datatime.datetime ) on the Python website
decimal (for decimal.Decimal ) on the Python website
Built-in Constants (for NoneType ) on the Python website
Databricks SQL Driver for Go
7/21/2022 • 2 minutes to read

IMPORTANT
The Databricks SQL Driver for Go is provided as-is and is not officially supported by Databricks through customer
technical support channels. Support, questions, and feature requests can be communicated through the Issues page of
the databricks/databricks-sql-go repo on GitHub. Issues with the use of this code will not be answered or investigated by
Databricks Support.

The Databricks SQL Driver for Go is a Go library that allows you to use Go code to run SQL commands on Azure
Databricks compute resources.

Requirements
A development machine running Go, version 1.18 or higher. To print the installed version of Go, run the
command go version . Download and install Go.
An existing cluster or SQL warehouse.
Display clusters.
Create a cluster.
View SQL warehouses.
Create a SQL warehouse.
The Server Hostname and HTTP Path values for the existing cluster or SQL warehouse.
Get these values for a cluster.
Get these values for a SQL warehouse.
An Azure Databricks personal access token.
Generate a personal access token.
Manage personal access tokens.

NOTE
The Databricks SQL Driver for Go does not support Azure Active Directory (Azure AD) tokens for authentication.

Specify the DSN connection string


The Databricks SQL Driver for Go uses a data source name (DSN) connection string to access clusters and SQL
warehouses. To specify the DSN connection string in the correct format, use the following syntax, where:
[your token] is your Azure Databricks personal access token from the requirements.
[Workspace hostname] is the Server Hostname value from the requirements.
[Endpoint HTTP Path] is the HTTP Path value from the requirements.

databricks://:[your token]@[Workspace hostname][Endpoint HTTP Path]

For example, for a cluster:


databricks://:dapi1ab2c34defabc567890123d4efa56789@adb-1234567890123456.7.azuredatabricks.net/sql/protocolv1/o/1234567890123456/1234-567890-abcdefgh

For example, for a SQL warehouse:

databricks://:dapi1ab2c34defabc567890123d4efa56789@adb-1234567890123456.7.azuredatabricks.net/sql/1.0/endpoints/a1b234c5678901d2

NOTE
As a security best practice, you should not hard-code this DSN connection string into your Go code. Instead, you should
retrieve this DSN connection string from a secure location. For example, the code example later in this article uses an
environment variable.

Query data
The following code example demonstrates how to call the Databricks SQL Driver for Go to run a basic SQL
query on an Azure Databricks compute resource. This command returns the first two rows from the diamonds
table.

NOTE
The diamonds table is included in the Sample datasets (databricks-datasets).

This code example retrieves the DSN connection string from an environment variable named DATABRICKS_DSN .

package main

import (
"database/sql"
"fmt"
"os"

_ "github.com/databricks/databricks-sql-go"
)

func main() {
dsn := os.Getenv("DATABRICKS_DSN")

if dsn == "" {
panic("No connection string found." +
"Set the DATABRICKS_DSN environment variable, and try again.")
}

db, err := sql.Open("databricks", dsn)

if err != nil {
panic(err)
}

var (
_c0 string
carat string
cut string
color string
clarity string
depth string
table string
price string
x string
y string
z string
)

rows, err := db.Query("SELECT * FROM default.diamonds LIMIT 2")

if err != nil {
panic(err)
}

defer rows.Close()

for rows.Next() {
err := rows.Scan(&_c0,
&carat,
&cut,
&color,
&clarity,
&depth,
&table,
&price,
&x,
&y,
&z)

if err != nil {
panic(err)
}

fmt.Print(_c0, ",",
carat, ",",
cut, ",",
color, ",",
clarity, ",",
depth, ",",
table, ",",
price, ",",
x, ",",
y, ",",
z, "\n")
}

err = rows.Err()

if err != nil {
panic(err)
}
}

Output:

1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31

For additional examples, see the examples folder in the databricks/databricks-sql-go repository on GitHub.

Additional resources
The Databricks SQL Driver for Go repository on GitHub
Go database/sql tutorial
The database/sql package home page
Databricks SQL Driver for Node.js
7/21/2022 • 3 minutes to read

IMPORTANT
The Databricks SQL Driver for Node.js is provided as-is and is not officially supported by Databricks through customer
technical support channels. Support, questions, and feature requests can be communicated through the Issues page of
the databricks/databricks-sql-nodejs repo on GitHub. Issues with the use of this code will not be answered or investigated
by Databricks Support.

The Databricks SQL Driver for Node.js is a Node.js library that allows you to use JavaScript code to run SQL
commands on Azure Databricks compute resources.

Requirements
A development machine running Node.js, version 14 or higher. To print the installed version of Node.js,
run the command node -v . To install and use different versions of Node.js, you can use tools such as
Node Version Manager (nvm).
Node Package Manager ( npm ). Later versions of Node.js already include npm . To check whether npm is
installed, run the command npm -v . To install npm if needed, you can follow instructions such as the
ones at Download and install npm.
The @databricks/sql package from npm. To install the @databricks/sql package in your Node.js project,
use npm to run the following command from within the same directory as your project:

npm i @databricks/sql

An existing cluster or SQL warehouse.


Display clusters.
Create a cluster.
View SQL warehouses.
Create a SQL warehouse.
The Server Hostname and HTTP Path values for the existing cluster or SQL warehouse.
Get these values for a cluster.
Get these values for a SQL warehouse.
An Azure Databricks personal access token.
Generate a personal access token.
Manage personal access tokens.

NOTE
The Databricks SQL Driver for Node.js does not support Azure Active Directory (Azure AD) tokens for
authentication.
Specify the connection variables
To access your cluster or SQL warehouse, the Databricks SQL Driver for Node.js uses connection variables
named token , server_hostname , and http_path , representing your Azure Databricks personal access token and
your cluster’s or SQL warehouse’s Server Hostname and HTTP Path values, respectively.
The Azure Databricks personal access token value for token is similar to the following:
dapi1ab2c34defabc567890123d4efa56789 .

The Server Hostname value for server_hostname is similar to the following:
adb-1234567890123456.7.azuredatabricks.net .
The HTTP Path value for http_path is similar to the following: for a cluster,
sql/protocolv1/o/1234567890123456/1234-567890-abcdefgh ; and for a SQL warehouse,
/sql/1.0/endpoints/a1b234c5678901d2 .

NOTE
As a security best practice, you should not hard code these connection variable values into your code. Instead, you should
retrieve these connection variable values from a secure location. For example, the code example later in this article uses
environment variables.

Query data
The following code example demonstrates how to call the Databricks SQL Driver for Node.js to run a basic SQL
query on an Azure Databricks compute resource. This command returns the first two rows from the diamonds
table.

NOTE
The diamonds table is included in the Sample datasets (databricks-datasets).

This code example retrieves the token , server_hostname and http_path connection variable values from a set
of environment variables. These environment variables have the following environment variable names:
DATABRICKS_TOKEN , which represents your Azure Databricks personal access token from the requirements.
DATABRICKS_SERVER_HOSTNAME , which represents the Server Hostname value from the requirements.
DATABRICKS_HTTP_PATH , which represents the HTTP Path value from the requirements.

You can use other approaches to retrieving these connection variable values. Using environment variables is just
one approach among many.
const { DBSQLClient } = require('@databricks/sql');

var token = process.env.DATABRICKS_TOKEN;


var server_hostname = process.env.DATABRICKS_SERVER_HOSTNAME;
var http_path = process.env.DATABRICKS_HTTP_PATH;

if (!token || !server_hostname || !http_path) {


throw new Error("Cannot find Server Hostname, HTTP Path, or personal access token. " +
"Check the environment variables DATABRICKS_TOKEN, " +
"DATABRICKS_SERVER_HOSTNAME, and DATABRICKS_HTTP_PATH.")
}

const client = new DBSQLClient();


const utils = DBSQLClient.utils;

client.connect(
options = {
token: token,
host: server_hostname,
path: http_path
}).then(
async client => {
const session = await client.openSession();

const queryOperation = await session.executeStatement(


statement = 'SELECT * FROM default.diamonds LIMIT 2',
options = { runAsync: true });

await utils.waitUntilReady(
operation = queryOperation,
progress = false,
callback = () => {});

await utils.fetchAll(
operation = queryOperation
);

await queryOperation.close();

const result = utils.getResult(


operation = queryOperation
).getValue();

console.table(result);

await session.close();
client.close();
}).catch(error => {
console.log(error);
});

Output:

┌─────────┬─────┬────────┬───────────┬───────┬─────────┬────────┬───────┬───────┬────────┬────────┬────────┐
│ (index) │ _c0 │ carat │ cut │ color │ clarity │ depth │ table │ price │ x │ y │ z │
├─────────┼─────┼────────┼───────────┼───────┼─────────┼────────┼───────┼───────┼────────┼────────┼────────┤
│ 0 │ '1' │ '0.23' │ 'Ideal' │ 'E' │ 'SI2' │ '61.5' │ '55' │ '326' │ '3.95' │ '3.98' │ '2.43' │
│ 1 │ '2' │ '0.21' │ 'Premium' │ 'E' │ 'SI1' │ '59.8' │ '61' │ '326' │ '3.89' │ '3.84' │ '2.31' │
└─────────┴─────┴────────┴───────────┴───────┴─────────┴────────┴───────┴───────┴────────┴────────┴────────┘

For additional examples, see the examples folder in the databricks/databricks-sql-nodejs repository on GitHub.

Additional resources
The Databricks SQL Driver for Node.js repository on GitHub
Getting started with the Databricks SQL Driver for Node.js
Troubleshooting the Databricks SQL Driver for Node.js
Connect Python and pyodbc to Azure Databricks
7/21/2022 • 12 minutes to read

You can connect from your local Python code through ODBC to data in a Databricks cluster or SQL warehouse.
To do this, you can use the open source Python code module pyodbc .
Follow these instructions to install, configure, and use pyodbc .
For more information about pyodbc , see the pyodbc Wiki.

NOTE
Databricks offers the Databricks SQL Connector for Python as an alternative to pyodbc . The Databricks SQL Connector
for Python is easier to set up and use, and has a more robust set of coding constructs, than pyodbc . However,
pyodbc may have better performance when fetching query results larger than 10 MB.

Requirements
A local development machine running one of the following:
macOS
Windows
A Unix or Linux distribution that supports .rpm or .deb files
pip.
For Unix, Linux, or macOS, Homebrew.
An Azure Databricks cluster, a Databricks SQL warehouse, or both. For more information, see Create a cluster
and Create a SQL warehouse.
Follow the instructions for Unix, Linux, or macOS or for Windows.

Unix, Linux, or macOS


If your local Python code is running on a Unix, Linux, or macOS machine, follow these instructions.
Step 1: Install software
In this step, you download and install the Databricks ODBC driver, the unixodbc package, and the pyodbc
module. (The pyodbc module requires the unixodbc package on Unix, Linux, and macOS.)
1. Download the Databricks ODBC driver.
2. To install the Databricks ODBC driver, open the SimbaSparkODBC.zip file that you downloaded.
3. Do one of the following:
macOS : Double-click the extracted Simba Spark.dmg file. Then double-click the SimbaSparkODBC.pkg
file that displays, and follow any on-screen directions.
Linux : Use your distribution’s package manager utility to install the extracted simbaspark.rpm or
simbaspark.deb file, and follow any on-screen directions.
4. Install the unixodbc package: from the terminal, run brew install unixodbc . For more information, see
unixodbc on the Homebrew website.
5. Install the pyodbc module: from the terminal, run pip install pyodbc . For more information, see pyodbc on
the PyPI website and Install in the pyodbc Wiki.
Step 2: Configure software
Specify connection details for the Databricks cluster and SQL warehouse for pyodbc to use.

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

1. Add the following content to the /etc/odbc.ini file on your machine:

TIP
If you do not want to or cannot use the /etc/odbc.ini file on your machine, you can specify connection details
directly in Python code. To do this, skip the rest of this step and proceed to Step 3: Test your configuration.

Cluster

[ODBC Data Sources]


Databricks_Cluster = Simba Spark ODBC Driver

[Databricks_Cluster]
Driver = <driver-path>
Description = Simba Spark ODBC Driver DSN
HOST = <server-hostname>
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = <personal-access-token>
ThriftTransport = 2
SSL = 1
HTTPPath = <http-path>

In the preceding configuration file, replace the following placeholders, and then save the file:
Replace <driver-path> with one of the following:
macOS : /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit : /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit : /opt/simba/spark/lib/32/libsparkodbc_sb32.so
Replace <server-hostname> with the Server Hostname value from the Advanced Options >
JDBC/ODBC tab for your cluster.
Replace <personal-access-token> with the value of your personal access token for your Azure
Databricks workspace.
Replace <http-path> with the HTTP Path value from the Advanced Options > JDBC/ODBC tab for
your cluster.

TIP
To allow pyodbc to switch connections to a different cluster, add an entry to the [ODBC Data Sources] section
and a matching entry below [Databricks_Cluster] with the specific connection details. Each entry must have a
unique name within this file.
SQL warehouse

[ODBC Data Sources]


SQL_Warehouse = Simba Spark ODBC Driver

[SQL_Warehouse]
Driver = <driver-path>
HOST = <server-hostname>
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = <personal-access-token>
ThriftTransport = 2
SSL = 1
HTTPPath = <http-path>

In the preceding configuration file, replace the following placeholders, and then save the file:
Replace <driver-path> with one of the following:
macOS : /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit : /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit : /opt/simba/spark/lib/32/libsparkodbc_sb32.so
Replace <server-hostname> with the Server Hostname value from the Connection Details tab for
your SQL warehouse.
Replace <personal-access-token> with the value of your personal access token for your SQL
warehouse.
Replace <http-path> with the HTTP Path value from the Connection Details tab for your SQL
warehouse.

TIP
To allow pyodbc to switch connections to a different SQL warehouse, add an entry to the
[ODBC Data Sources] section and a matching entry below [SQL_Warehouse] with the specific connection
details. Each entry must have a unique name within this file.
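
For example, a minimal sketch of what the two sections might look like after adding a second warehouse; the name SQL_Warehouse_2 and the <second-...> placeholders are hypothetical and used only for illustration:

[ODBC Data Sources]
SQL_Warehouse = Simba Spark ODBC Driver
SQL_Warehouse_2 = Simba Spark ODBC Driver

[SQL_Warehouse_2]
Driver = <driver-path>
HOST = <second-server-hostname>
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = <personal-access-token>
ThriftTransport = 2
SSL = 1
HTTPPath = <second-http-path>

In Python, you would then connect to the second warehouse with pyodbc.connect("DSN=SQL_Warehouse_2", autocommit=True).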

2. Add the information you just added to the /etc/odbc.ini file to the corresponding
/usr/local/etc/odbc.ini file on your machine as well.

3. Add the following content to the /etc/odbcinst.ini file on your machine:

[ODBC Drivers]
Simba Spark ODBC Driver = Installed

[Simba Spark ODBC Driver]
Driver = <driver-path>

In the preceding content, replace <driver-path> with one of the following values, and then save the file:
macOS : /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit : /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit : /opt/simba/spark/lib/32/libsparkodbc_sb32.so
4. Add the information you just added to the /etc/odbcinst.ini file to the corresponding
/usr/local/etc/odbcinst.ini file on your machine as well.
5. Add the following information at the end of the simba.sparkodbc.ini file on your machine, and then save
the file. For macOS, this file is in /Library/simba/spark/lib .

DriverManagerEncoding=UTF-16
ODBCInstLib=/usr/local/Cellar/unixodbc/2.3.9/lib/libodbcinst.dylib

Step 3: Test your configuration


In this step, you write and run Python code to use your Azure Databricks cluster or Databricks SQL warehouse to
query a database table and display the first two rows of query results.
To query by using a cluster:
1. Create a file named pyodbc-test-cluster.py with the following content. Replace <table-name> with the
name of the database table to query, and then save the file.

import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the Databricks cluster by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
  print(row)

NOTE
If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details
in the call to pyodbc.connect , for example:

conn = pyodbc.connect("Driver=<driver-path>;" +
"HOST=<server-hostname>;" +
"PORT=443;" +
"Schema=default;" +
"SparkServerType=3;" +
"AuthMech=3;" +
"UID=token;" +
"PWD=<personal-access-token>;" +
"ThriftTransport=2;" +
"SSL=1;" +
"HTTPPath=<http-path>",
autocommit=True)

Replace the placeholders with the values as described in Step 2: Configure software.

2. To speed up running the code, start the cluster that corresponds to the HTTPPath setting in your
odbc.ini file.

3. Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table
are displayed.
To query by using a SQL warehouse:
1. Create a file named pyodbc-test-warehouse.py . Replace <table-name> with the name of the database table
to query, and then save the file.

import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the SQL warehouse by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
  print(row)

NOTE
If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details
in the call to pyodbc.connect , for example:

conn = pyodbc.connect("Driver=<driver-path>;" +
"HOST=<server-hostname>;" +
"PORT=443;" +
"Schema=default;" +
"SparkServerType=3;" +
"AuthMech=3;" +
"UID=token;" +
"PWD=<personal-access-token>;" +
"ThriftTransport=2;" +
"SSL=1;" +
"HTTPPath=<http-path>",
autocommit=True)

Replace the placeholders with the values as described in Step 2: Configure software.

2. To speed up running the code, start the SQL warehouse that corresponds to the HTTPPath setting in your
odbc.ini file.

3. Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database
table are displayed.
Next steps
To run the Python test code against a different cluster or SQL warehouse, change the settings in the
preceding two odbc.ini files. Or add a new entry to the [ODBC Data Sources] section, along with matching
connection details, to the two odbc.ini files. Then change the DSN name in the test code to match the
related name in [ODBC Data Sources] .
To run the Python test code against a different database table, change the table_name value.
To run the Python test code with a different SQL query, change the execute command string.
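
For example, if you want to filter rows instead of editing the SQL string directly, the following sketch extends the preceding test script with pyodbc's ? parameter markers. The column name and filter value are hypothetical placeholders, and the sketch assumes the ODBC driver supports parameter binding:

import pyodbc

# Replace <table-name> and <column-name> with a real table and column,
# and <filter-value> with the value to match.
table_name = "<table-name>"
column_name = "<column-name>"
filter_value = "<filter-value>"

conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
cursor = conn.cursor()

# pyodbc replaces each ? marker with the corresponding parameter value.
cursor.execute(f"SELECT * FROM {table_name} WHERE {column_name} = ? LIMIT 2", filter_value)

for row in cursor.fetchall():
  print(row)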

Windows
If your local Python code is running on a Windows machine, follow these instructions.
Step 1: Install software
1. Download the Databricks ODBC driver.
2. To install the Databricks ODBC driver, open the SimbaSparkODBC.zip file that you downloaded.
3. Double-click the extracted Simba Spark.msi file, and follow any on-screen directions.
4. Install the pyodbc module: from an administrative command prompt, run pip install pyodbc . For more
information, see pyodbc on the PyPI website and Install in the pyodbc Wiki.
Step 2: Configure software
Specify connection details for the Azure Databricks cluster or Databricks SQL warehouse for pyodbc to use.
To specify connection details for a cluster:
1. Add a data source name (DSN) that contains information about your cluster: start the ODBC Data Sources
application: on the Start menu, begin typing ODBC , and then click ODBC Data Sources .
2. On the User DSN tab, click Add . In the Create New Data Source dialog box, click Simba Spark ODBC
Driver , and then click Finish .
3. In the Simba Spark ODBC Driver DSN Setup dialog box, change the following values:
Data Source Name : Databricks_Cluster
Description : My cluster
Spark Server Type : SparkThriftServer (Spark 1.1 and later)
Host(s) : The Server Hostname value from the Advanced Options, JDBC/ODBC tab for your cluster.
Port : 443
Database : default
Mechanism : User Name and Password
User Name : token
Password : The value of your personal access token for your Azure Databricks workspace.
Thrift Transport : HTTP
4. Click HTTP Options . In the HTTP Properties dialog box, for HTTP Path , enter the HTTP Path value from
the Advanced Options, JDBC/ODBC tab for your cluster, and then click OK .
5. Click SSL Options . In the SSL Options dialog box, check the Enable SSL box, and then click OK .
6. Click Test . If the test succeeds, click OK .

TIP
To allow pyodbc to switch connections to a different cluster, repeat this procedure with the specific connection details.
Each DSN must have a unique name.

To specify connection details for a SQL warehouse:


1. In the ODBC Data Sources application, on the User DSN tab, click Add . In the Create New Data Source
dialog box, click Simba Spark ODBC Driver , and then click Finish .
2. In the Simba Spark ODBC Driver dialog box, enter the following values:
Data Source Name : SQL_Warehouse
Description : My warehouse
Spark Server Type : SparkThriftServer (Spark 1.1 and later)
Host(s) : The Server Hostname value from the Connection Details tab for your SQL warehouse.
Port : 443
Database : default
Mechanism : User Name and Password
User Name : token
Password : The value of your personal access token for your SQL warehouse.
Thrift Transport : HTTP
3. Click HTTP Options . In the HTTP Properties dialog box, for HTTP Path , enter the HTTP Path value from
the Connection Details tab for your SQL warehouse, and then click OK .
4. Click SSL Options . In the SSL Options dialog box, check the Enable SSL box, and then click OK .
5. Click Test . If the test succeeds, click OK .

TIP
To allow pyodbc to switch connections to a different SQL warehouse, repeat this procedure with the specific
connection details. Each DSN must have a unique name.

Step 3: Test your configuration


In this step, you write and run Python code to use your Azure Databricks cluster or Databricks SQL warehouse to
query a database table and display the first two rows of query results.
To query by using a cluster:
1. Create a file named pyodbc-test-cluster.py with the following content. Replace <table-name> with the
name of the database table to query, and then save the file.

import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the Databricks cluster by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
  print(row)

2. To speed up running the code, start the cluster that corresponds to the Host(s) value in the Simba
Spark ODBC Driver DSN Setup dialog box for your Azure Databricks cluster.
3. Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table
are displayed.
To query by using a SQL warehouse:
1. Create a file named pyodbc-test-warehouse.py . Replace <table-name> with the name of the database table
to query, and then save the file.
import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the SQL warehouse by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
  print(row)

2. To speed up running the code, start the SQL warehouse that corresponds to the Host(s) value in the
Simba Spark ODBC Driver DSN Setup dialog box for your Databricks SQL warehouse.
3. Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database
table are displayed.
Next steps
To run the Python test code against a different cluster or SQL warehouse, change the Host(s) value in the
Simba Spark ODBC Driver DSN Setup dialog box for your Azure Databricks cluster or Databricks SQL
warehouse. Or create a new DSN. Then change the DSN name in the test code to match the related Data
Source Name .
To run the Python test code against a different database table, change the table_name value.
To run the Python test code with a different SQL query, change the execute command string.

Troubleshooting
This section addresses common issues when using pyodbc with Databricks.
Unicode decode error
Issue : You receive an error message similar to the following:

<class 'pyodbc.Error'> returned a result with an error set

Traceback (most recent call last):
  File "/Users/user/.pyenv/versions/3.7.5/lib/python3.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 2112-2113: illegal UTF-16 surrogate

Cause : An issue exists in pyodbc version 4.0.31 or below that could manifest with such symptoms when
running queries that return columns with long names or a long error message. The issue has been fixed by a
newer version of pyodbc .
Solution : Upgrade your installation of pyodbc to version 4.0.32 or above.
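
To check which version of pyodbc is currently installed before upgrading, a minimal sketch:

import pyodbc

# Prints the installed pyodbc version; if it reports 4.0.31 or below,
# upgrade with: pip install --upgrade pyodbc
print(pyodbc.version)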
General troubleshooting
See Issues in the mkleehammer/pyodbc repository on GitHub.
Databricks CLI
7/21/2022 • 8 minutes to read

The Databricks command-line interface (CLI) provides an easy-to-use interface to the Azure Databricks platform.
The open source project is hosted on GitHub. The CLI is built on top of the Databricks REST API 2.0 and is
organized into command groups based on the Cluster Policies API 2.0, Clusters API 2.0, DBFS API 2.0, Groups
API 2.0, Instance Pools API 2.0, Jobs API 2.1, Libraries API 2.0, Delta Live Tables API 2.0, Repos API 2.0, Secrets API
2.0, Token API 2.0, and Workspace API 2.0 through the cluster-policies , clusters , fs , groups ,
instance-pools , jobs and runs , libraries , repos , secrets , tokens , and workspace command groups,
respectively.
For example, you can use the Databricks CLI to do things such as:
Provision compute resources in Azure Databricks workspaces.
Run data processing and data analysis tasks.
List, import, and export notebooks and folders in workspaces.

IMPORTANT
This CLI is under active development and is released as an Experimental client. This means that interfaces are still subject
to change.

Set up the CLI


This section lists CLI requirements and describes how to install and configure your environment to run the CLI.
Requirements
Python 3 - 3.6 and above
Python 2 - 2.7.9 and above

IMPORTANT
On macOS, the default Python 2 installation does not implement the TLSv1_2 protocol and running the CLI with
this Python installation results in the error:
AttributeError: 'module' object has no attribute 'PROTOCOL_TLSv1_2' . Use Homebrew to install a version
of Python that has ssl.PROTOCOL_TLSv1_2 .

Limitations
Using the Databricks CLI with firewall enabled storage containers is not supported. Databricks recommends you
use Databricks Connect or az storage.
Install the CLI
Run pip install databricks-cli using the appropriate version of pip for your Python installation:

pip install databricks-cli

Update the CLI


Run pip install databricks-cli --upgrade using the appropriate version of pip for your Python installation:
pip install databricks-cli --upgrade

To list the version of the CLI that is currently installed, run databricks --version (or databricks -v ):

databricks --version

# Or...

databricks -v

Set up authentication

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

Before you can run CLI commands, you must set up authentication. To authenticate to the CLI you can use a
Databricks personal access token or an Azure Active Directory (Azure AD) token.
Set up authentication using an Azure AD token
To configure the CLI using an Azure AD token, generate the Azure AD token and store it in the environment
variable DATABRICKS_AAD_TOKEN .
Unix, Linux, macOS

export DATABRICKS_AAD_TOKEN=<Azure-AD-token>

Or, using jq:

export DATABRICKS_AAD_TOKEN=$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | jq .accessToken --raw-output)

Windows

setx DATABRICKS_AAD_TOKEN "<Azure-AD-token>" /M

Or, using Windows PowerShell and jq:

$databricks_aad_token = az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | jq .accessToken --raw-output
[System.Environment]::SetEnvironmentVariable('DATABRICKS_AAD_TOKEN', $databricks_aad_token, [System.EnvironmentVariableTarget]::Machine)

Run the following command:

databricks configure --aad-token

The command issues the prompt:


Databricks Host (should begin with https://):

Enter your per-workspace URL, with the format https://adb-<workspace-id>.<random-number>.azuredatabricks.net. To get the per-workspace URL, see Per-workspace URL.
After you complete the prompt, your access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or
macOS or %USERPROFILE%\.databrickscfg on Windows. The file contains a default profile entry:

[DEFAULT]
host = <workspace-URL>
token = <Azure-AD-token>

Set up authentication using a Databricks personal access token


To configure the CLI to use a personal access token, run the following command:

databricks configure --token

The command begins by issuing the prompt:

Databricks Host (should begin with https://):

Enter your per-workspace URL, with the format https://adb-<workspace-id>.<random-number>.azuredatabricks.net. To get the per-workspace URL, see Per-workspace URL.
The command continues by issuing the prompt to enter your personal access token:

Token:

After you complete the prompts, your access credentials are stored in the file ~/.databrickscfg on Unix, Linux,
or macOS, or %USERPROFILE%\.databrickscfg on Windows. The file contains a default profile entry:

[DEFAULT]
host = <workspace-URL>
token = <personal-access-token>

For CLI 0.8.1 and above, you can change the path of this file by setting the environment variable
DATABRICKS_CONFIG_FILE .

Unix, Linux, macOS

export DATABRICKS_CONFIG_FILE=<path-to-file>

Windows

setx DATABRICKS_CONFIG_FILE "<path-to-file>" /M

IMPORTANT
The CLI does not work with a .netrc file. You can have a .netrc file in your environment for other purposes, but the CLI
will not use that .netrc file.
CLI 0.8.0 and above supports the following environment variables:
DATABRICKS_HOST
DATABRICKS_TOKEN

An environment variable setting takes precedence over the setting in the configuration file.
Test your authentication setup
To check whether you set up authentication correctly, you can run a command such as the following, replacing
<someone@example.com> with your Azure Databricks workspace username:

databricks workspace ls /Users/<someone@example.com>

If successful, this command lists the objects in the specified workspace path.
Connection profiles
The Databricks CLI configuration supports multiple connection profiles. The same installation of Databricks CLI
can be used to make API calls on multiple Azure Databricks workspaces.
To add a connection profile, specify a unique name for the profile:

databricks configure [--token | --aad-token] --profile <profile-name>

The .databrickscfg file contains a corresponding profile entry:

[<profile-name>]
host = <workspace-URL>
token = <token>

To use the connection profile:

databricks <group> <command> --profile <profile-name>

If --profile <profile-name> is not specified, the default profile is used. If a default profile is not found, you are
prompted to configure the CLI with a default profile.
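
For example, a .databrickscfg file with a default profile and one additional profile might look like the following sketch; MYDEV is a hypothetical profile name used only for illustration:

[DEFAULT]
host = <workspace-URL>
token = <token>

[MYDEV]
host = <dev-workspace-URL>
token = <dev-token>

You would then select the second workspace by adding --profile MYDEV to a command.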
Test your connection profiles
To check whether you set up your connection profiles correctly, you can run a command such as the following,
replacing <someone@example.com> with your Azure Databricks workspace username and <DEFAULT> with one of
your connection profile names:

databricks workspace ls /Users/<someone@example.com> --profile <DEFAULT>

If successful, this command lists the objects in the specified workspace path in the workspace for the specified
connection profile. Run this command for each connection profile that you want to test.
Alias command groups
Sometimes it can be inconvenient to prefix each CLI invocation with the name of a command group, for example
databricks workspace ls . To make the CLI easier to use, you can alias command groups to shorter commands.
For example, to shorten databricks workspace ls to dw ls in the Bourne again shell, you can add
alias dw="databricks workspace" to the appropriate bash profile. Typically, this file is located at ~/.bash_profile.
TIP
Azure Databricks already aliases databricks fs to dbfs ; databricks fs ls and dbfs ls are equivalent.

Use the CLI


This section shows you how to get CLI help, parse CLI output, and invoke commands in each command group.
Display CLI command group help
You list the subcommands for any command group by running databricks <group> --help (or
databricks <group> -h ). For example, you list the DBFS CLI subcommands by running databricks fs -h :

databricks fs -h

Display CLI subcommand help


You list the help for a subcommand by running databricks <group> <subcommand> --help (or
databricks <group> <subcommand> -h ). For example, you list the help for the DBFS copy files subcommand by
running databricks fs cp -h :

databricks fs cp -h

Use jq to parse CLI output


Some Databricks CLI commands output the JSON response from the API endpoint. Sometimes it can be useful
to parse out parts of the JSON to pipe into other commands. For example, to copy a job definition, you must
take the settings field of a databricks jobs get command and use that as an argument to the
databricks jobs create command. In these cases, we recommend that you use the jq utility.

For example, the following command prints the settings of the job with the ID of 233.

databricks jobs list --output JSON | jq '.jobs[] | select(.job_id == 233) | .settings'

{
"name": "Quickstart",
"new_cluster": {
"spark_version": "7.5.x-scala2.12",
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"num_workers": 8,
...
},
"email_notifications": {},
"timeout_seconds": 0,
"notebook_task": {
"notebook_path": "/Quickstart"
},
"max_concurrent_runs": 1
}

As another example, the following command prints the names and IDs of all available clusters in the workspace:

databricks clusters list --output JSON | jq '[ .clusters[] | { name: .cluster_name, id: .cluster_id } ]'
[
{
"name": "My Cluster 1",
"id": "1234-567890-grip123"
},
{
"name": "My Cluster 2",
"id": "2345-678901-patch234"
}
]

You can install jq for example on macOS using Homebrew with brew install jq or on Windows using
Chocolatey with choco install jq . For more information on jq , see the jq Manual.
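
If you cannot install jq, you can also parse the CLI's JSON output with Python's standard library. The following minimal sketch, which assumes the databricks CLI is on your PATH and authentication is configured, reproduces the earlier example that extracts the settings of the job with the ID of 233:

import json
import subprocess

# Run the CLI and capture its JSON output.
output = subprocess.run(
  ["databricks", "jobs", "list", "--output", "JSON"],
  capture_output=True, text=True, check=True
).stdout

# Find the job with ID 233 and print its settings.
jobs = json.loads(output)["jobs"]
settings = next(job["settings"] for job in jobs if job["job_id"] == 233)
print(json.dumps(settings, indent=2))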
JSON string parameters
String parameters are handled differently depending on your operating system:
Unix, Linux, macOS
You must enclose JSON string parameters in single quotes. For example:

databricks jobs run-now --job-id 9 --jar-params '["20180505", "alantest"]'

Windows
You must enclose JSON string parameters in double quotes, and the quote characters inside the string must be
preceded by \ . For example:

databricks jobs run-now --job-id 9 --jar-params "[\"20180505\", \"alantest\"]"

Troubleshooting
The following sections provide tips for troubleshooting common issues with the Databricks CLI.
Using EOF with databricks configure does not work
For Databricks CLI 0.12.0 and above, using the end of file ( EOF ) sequence in a script to pass parameters to the
databricks configure command does not work. For example, the following script causes Databricks CLI to
ignore the parameters, and no error message is thrown:

# Do not do this.
databricksUrl=<per-workspace-url>
databricksToken=<personal-access-token-or-Azure-AD-token>

databricks configure --token << EOF
$databricksUrl
$databricksToken
EOF

To fix this issue, do one of the following:


Use one of the other programmatic configuration options as described in Set up authentication.
Manually add the host and token values to the .databrickscfg file as described in Set up authentication (a sketch follows this list).
Downgrade your installation of the Databricks CLI to 0.11.0 or below, and run your script again.
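
For the second option, a minimal sketch that writes the file from Python instead of piping values into databricks configure; note that it overwrites any existing ~/.databrickscfg, including other profiles:

import os

# Replace the placeholders with your per-workspace URL and token.
config = (
  "[DEFAULT]\n"
  "host = <per-workspace-url>\n"
  "token = <personal-access-token-or-Azure-AD-token>\n"
)

# Overwrites any existing ~/.databrickscfg file.
with open(os.path.expanduser("~/.databrickscfg"), "w") as f:
  f.write(config)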

CLI commands
Cluster Policies CLI
Clusters CLI
DBFS CLI
Delta Live Tables CLI
Groups CLI
Instance Pools CLI
Jobs CLI
Libraries CLI
Repos CLI
Runs CLI
Secrets CLI
Stack CLI
Tokens CLI
Unity Catalog CLI
Workspace CLI
Databricks SQL CLI
7/21/2022 • 7 minutes to read

IMPORTANT
The Databricks SQL CLI is provided as-is and is not officially supported by Databricks through customer technical support
channels. Support, questions, and feature requests can be communicated through the Issues page of the
databricks/databricks-sql-cli repo on GitHub. Issues with the use of this code will not be answered or investigated by
Databricks Support.

The Databricks SQL command line interface (Databricks SQL CLI) enables you to run SQL queries on your
existing Databricks SQL warehouses from your terminal or Windows Command Prompt instead of from
locations such as the Databricks SQL editor or an Azure Databricks notebook. From the command line, you get
productivity features such as suggestions and syntax highlighting.

Requirements
At least one Databricks SQL warehouse. View your available warehouses. Create a warehouse, if you do not
already have one.
Your warehouse’s connection details. Specifically, you need the Server hostname and HTTP path values.
An Azure Databricks personal access token. Create a personal access token, if you do not already have one.
Python 3.7 or higher. To check whether you have Python installed, run the command python --version from
your terminal or Command Prompt. (On some systems, you may need to enter python3 instead.) Install
Python, if you do not have it already installed.
pip, the package installer for Python. Newer versions of Python install pip by default. To check whether you
have pip installed, run the command pip --version from your terminal or Command Prompt. (On some
systems, you may need to enter pip3 instead.) Install pip, if you do not have it already installed.
The Databricks SQL CLI package from the Python Packaging Index (PyPI). You can use pip to install the
Databricks SQL CLI package from PyPI by running pip install databricks-sql-cli or
python -m pip install databricks-sql-cli .
(Optional) A utility for creating and managing Python virtual environments, such as venv, virtualenv, or
pipenv. Virtual environments help to ensure that you are using the correct versions of Python and the
Databricks SQL CLI together. Setting up and using virtual environments is outside of the scope of this article.
For more information, see Creating Virtual Environments.

Authentication
You must provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, so that
the target warehouse is called with the proper access credentials. You can provide this information in several
ways:
In the dbsqlclirc settings file in its default location (or by specifying an alternate settings file through the
--clirc option each time you run a command with the Databricks SQL CLI). See Settings file.
By setting the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
See Environment variables.
By specifying the --hostname , --http-path , and --access-token options each time you run a command with
the Databricks SQL CLI. See Command options.
Whenever you run the Databricks SQL CLI, it looks for authentication details in the following order, stopping
when it finds the first set of details:
1. The --hostname , --http-path , and --access-token options.
2. The DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
3. The dbsqlclirc settings file in its default location (or an alternate settings file specified by the --clirc
option).
Settings file
To use the dbsqlclirc settings file to provide the Databricks SQL CLI with authentication details for your
Databricks SQL warehouse, run the Databricks SQL CLI for the first time, as follows:

dbsqlcli

The Databricks SQL CLI creates a settings file for you, at ~/.dbsqlcli/dbsqlclirc on Unix, Linux, and macOS, and
at %HOMEDRIVE%%HOMEPATH%\.dbsqlcli\dbsqlclirc or %USERPROFILE%\.dbsqlcli\dbsqlclirc on Windows. To
customize this file:
1. Use a text editor to open and edit the dbsqlclirc file.
2. Scroll to the following section:

# [credentials]
# host_name = ""
# http_path = ""
# access_token = ""

3. Remove the four # characters, and:


a. Next to host_name , enter your warehouse’s Server hostname value from the requirements between
the "" characters.
b. Next to http_path , enter your warehouse’s HTTP path value from the requirements between the ""
characters.
c. Next to access_token , enter your personal access token value from the requirements between the ""
characters.
For example:

[credentials]
host_name = "adb-12345678901234567.8.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/1abc2d3456e7f890a"
access_token = "dapi1234567890b2cd34ef5a67bc8de90fa12b"

4. Save the dbsqlclirc file.

Alternatively, instead of using the dbsqlclirc file in its default location, you can specify a file in a different
location by adding the --clirc command option and the path to the alternate file. That alternate file’s contents
must conform to the preceding syntax.
Environment variables
To use the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH , and DBSQLCLI_ACCESS_TOKEN environment variables to
provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, do the following:
Unix, Linux, and macOS
To set the environment variables for only the current terminal session, run the following commands. To set the
environment variables for all terminal sessions, enter the following commands into your shell’s startup file and
then restart your terminal. In the following commands, replace the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.

export DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
export DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
export DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"

Windows
To set the environment variables for only the current Command Prompt session, run the following commands,
replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.

set DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
set DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
set DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"

To set the environment variables for all Command Prompt sessions, run the following commands and then
restart your Command Prompt, replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.

setx DBSQLCLI_HOST_NAME "adb-12345678901234567.8.azuredatabricks.net"
setx DBSQLCLI_HTTP_PATH "/sql/1.0/warehouses/1abc2d3456e7f890a"
setx DBSQLCLI_ACCESS_TOKEN "dapi1234567890b2cd34ef5a67bc8de90fa12b"

Command options
To use the --hostname , --http-path , and --access-token options to provide the Databricks SQL CLI with
authentication details for your Databricks SQL warehouse, do the following:
Every time you run a command with the Databricks SQL CLI:
Specify the --hostname option and your warehouse’s Server hostname value from the requirements.
Specify the --http-path option and your warehouse’s HTTP path value from the requirements.
Specify the --access-token option and your personal access token value from the requirements.

For example:

dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2" \


--hostname "adb-12345678901234567.8.azuredatabricks.net" \
--http-path "/sql/1.0/warehouses/1abc2d3456e7f890a" \
--access-token "dapi1234567890b2cd34ef5a67bc8de90fa12b"

Query sources
The Databricks SQL CLI enables you to run queries in the following ways:
From a query string.
From a file.
In a read-evaluate-print loop (REPL) approach. This approach provides suggestions as you type.
Query string
To run a query as a string, use the -e option followed by the query, represented as a string. For example:

dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2"

Output:

_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31

To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:

dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2" --table-format ascii

Output:

+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+

For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
File
To run a file that contains SQL, use the -e option followed by the path to a .sql file. For example:

dbsqlcli -e my-query.sql

Contents of the example my-query.sql file:

SELECT * FROM default.diamonds LIMIT 2;

Output:

_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31

To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:
dbsqlcli -e my-query.sql --table-format ascii

Output:

+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+

For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
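
Because the default output format is CSV, you can also call the Databricks SQL CLI from a script and parse its output with Python's standard library. The following minimal sketch assumes dbsqlcli is on your PATH, authentication is already configured (for example, through the DBSQLCLI_* environment variables), and the example my-query.sql file from above exists:

import csv
import io
import subprocess

# Run the query file and capture the CSV output.
output = subprocess.run(
  ["dbsqlcli", "-e", "my-query.sql"],
  capture_output=True, text=True, check=True
).stdout

# Parse the CSV rows into dictionaries keyed by column name.
for row in csv.DictReader(io.StringIO(output)):
  print(row["carat"], row["cut"], row["price"])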
REPL
To enter read-evaluate-print loop (REPL) mode scoped to the default database, run the following command:

dbsqlcli

You can also enter REPL mode scoped to a specific database, by running the following command:

dbsqlcli <database-name>

For example:

dbsqlcli default

To exit REPL mode, run the following command:

exit

In REPL mode, you can use the following characters and keys:
Use the semicolon ( ; ) to end a line.
Use F3 to toggle multiline mode.
Use the spacebar to show suggestions at the insertion point, if suggestions are not already displayed.
Use the up and down arrows to navigate suggestions.
Use the right arrow to complete the highlighted suggestion.
For example:
dbsqlcli default

hostname:default> SELECT * FROM diamonds LIMIT 2;

+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+

2 rows in set
Time: 0.703s

hostname:default> exit

Additional resources
Databricks SQL CLI README
Databricks Utilities
7/21/2022 • 29 minutes to read

Databricks Utilities ( dbutils ) make it easy to perform powerful combinations of tasks. You can use the utilities
to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. dbutils
are not supported outside of notebooks.

IMPORTANT
Calling dbutils inside of executors can produce unexpected results. To learn more about limitations of dbutils and
alternatives that could be used instead, see Limitations.

dbutils utilities are available in Python, R, and Scala notebooks.


How to : List utilities, list commands, display command help
Utilities : data, fs, jobs, library, notebook, secrets, widgets, Utilities API library

List available utilities


To list available utilities along with a short description for each utility, run dbutils.help() for Python or Scala.
This example lists available commands for the Databricks Utilities.
Python

dbutils.help()

Scala

dbutils.help()

This module provides various utilities for users to interact with the rest of Databricks.

fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks

List available commands for a utility


To list available commands for a utility along with a short description of each command, run .help() after the
programmatic name for the utility.
This example lists available commands for the Databricks File System (DBFS) utility.
Python
dbutils.fs.help()

R

dbutils.fs.help()

Scala

dbutils.fs.help()

dbutils.fs provides utilities for working with FileSystems. Most methods in this package can take either a
DBFS path (e.g., "/foo" or "dbfs:/foo"), or another FileSystem URI. For more info about a method, use
dbutils.fs.help("methodName"). In notebooks, you can also use the %fs shorthand to access DBFS. The %fs
shorthand maps straightforwardly onto dbutils calls. For example, "%fs head --maxBytes=10000 /file/path"
translates into "dbutils.fs.head("/file/path", maxBytes = 10000)".

fsutils

cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly
across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given
file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any
necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly
across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a
file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

mount

mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs:
Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount
point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they
receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
updateMount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null,
extraConfigs: Map = Map.empty[String, String]): boolean -> Similar to mount(), but updates an existing mount
point instead of creating a new one

Display help for a command


To display help for a command, run .help("<command-name>") after the command name.
This example displays help for the DBFS copy command.
Python

dbutils.fs.help("cp")

dbutils.fs.help("cp")
Scala

dbutils.fs.help("cp")

/**
* Copies a file or directory, possibly across FileSystems.
*
* Example: cp("/mnt/my-folder/a", "dbfs://a/b")
*
* @param from FileSystem URI of the source file or directory
* @param to FileSystem URI of the destination file or directory
* @param recurse if true, all files and directories will be recursively copied
* @return true if all files were successfully copied
*/
cp(from: java.lang.String, to: java.lang.String, recurse: boolean = false): boolean

Data utility (dbutils.data)


IMPORTANT
This feature is in Public Preview.

NOTE
Available in Databricks Runtime 9.0 and above.

Commands : summarize
The data utility allows you to understand and interpret datasets. To list the available commands, run
dbutils.data.help() .

dbutils.data provides utilities for understanding and interpreting datasets. This module is currently in
preview and may be unstable. For more info about a method, use dbutils.data.help("methodName").

summarize(df: Object, precise: boolean): void -> Summarize a Spark DataFrame and visualize the statistics to
get quick insights

summarize command (dbutils.data.summarize )


Calculates and displays summary statistics of an Apache Spark DataFrame or pandas DataFrame. This command
is available for Python, Scala and R.
To display help for this command, run dbutils.data.help("summarize") .
In Databricks Runtime 10.1 and above, you can use the additional precise parameter to adjust the precision of
the computed statistics.

NOTE
This feature is in Public Preview.

When precise is set to false (the default), some returned statistics include approximations to reduce run
time.
The number of distinct values for categorical columns may have ~5% relative error for high-
cardinality columns.
The frequent value counts may have an error of up to 0.01% when the number of distinct values is
greater than 10000.
The histograms and percentile estimates may have an error of up to 0.01% relative to the total
number of rows.
When precise is set to true, the statistics are computed with higher precision. All statistics except for the
histograms and percentiles for numeric columns are now exact.
The histograms and percentile estimates may have an error of up to 0.0001% relative to the total
number of rows.
The tooltip at the top of the data summary output indicates the mode of the current run.
This example displays summary statistics for an Apache Spark DataFrame with approximations enabled by
default. To see the results, run this command in a notebook. This example is based on Sample datasets
(databricks-datasets).
Python

df = spark.read.format('csv').load(
  '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv',
  header=True,
  inferSchema=True
)
dbutils.data.summarize(df)

R

df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv",
  header = "true", inferSchema = "true")
dbutils.data.summarize(df)

Scala

val df = spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
dbutils.data.summarize(df)

Note that the visualization uses SI notation to concisely render numerical values smaller than 0.01 or larger than
10000. As an example, the numerical value 1.25e-15 will be rendered as 1.25f . One exception: the
visualization uses “ B ” for 1.0e9 (giga) instead of “ G ”.

File system utility (dbutils.fs)


Commands : cp, head, ls, mkdirs, mount, mounts, mv, put, refreshMounts, rm, unmount, updateMount
The file system utility allows you to access Databricks File System (DBFS), making it easier to use Azure
Databricks as a file system. To list the available commands, run dbutils.fs.help() .
dbutils.fs provides utilities for working with FileSystems. Most methods in this package can take either a
DBFS path (e.g., "/foo" or "dbfs:/foo"), or another FileSystem URI. For more info about a method, use
dbutils.fs.help("methodName"). In notebooks, you can also use the %fs shorthand to access DBFS. The %fs
shorthand maps straightforwardly onto dbutils calls. For example, "%fs head --maxBytes=10000 /file/path"
translates into "dbutils.fs.head("/file/path", maxBytes = 10000)".

fsutils

cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly
across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given
file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any
necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly
across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a
file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

mount

mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs:
Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount
point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they
receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
updateMount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null,
extraConfigs: Map = Map.empty[String, String]): boolean -> Similar to mount(), but updates an existing mount
point instead of creating a new one

cp command (dbutils.fs.cp)
Copies a file or directory, possibly across filesystems.
To display help for this command, run dbutils.fs.help("cp") .
This example copies the file named old_file.txt from /FileStore to /tmp/new , renaming the copied file to
new_file.txt .

Python

dbutils.fs.cp("/FileStore/old_file.txt", "/tmp/new/new_file.txt")

# Out[4]: True

dbutils.fs.cp("/FileStore/old_file.txt", "/tmp/new/new_file.txt")

# [1] TRUE

Scala

dbutils.fs.cp("/FileStore/old_file.txt", "/tmp/new/new_file.txt")

// res3: Boolean = true

head command (dbutils.fs.head)


Returns up to the specified maximum number of bytes of the given file. The bytes are returned as a UTF-8 encoded
string.
To display help for this command, run dbutils.fs.help("head") .
This example displays the first 25 bytes of the file my_file.txt located in /tmp .
Python

dbutils.fs.head("/tmp/my_file.txt", 25)

# [Truncated to first 25 bytes]
# Out[12]: 'Apache Spark is awesome!\n'

R

dbutils.fs.head("/tmp/my_file.txt", 25)

# [1] "Apache Spark is awesome!\n"

Scala

dbutils.fs.head("/tmp/my_file.txt", 25)

// [Truncated to first 25 bytes]


// res4: String =
// "Apache Spark is awesome!
// "

ls command (dbutils.fs.ls)
Lists the contents of a directory.
To display help for this command, run dbutils.fs.help("ls") .
This example displays information about the contents of /tmp . The modificationTime field is available in
Databricks Runtime 10.2 and above. In R, modificationTime is returned as a string.
Python

dbutils.fs.ls("/tmp")

# Out[13]: [FileInfo(path='dbfs:/tmp/my_file.txt', name='my_file.txt', size=40,
#   modificationTime=1622054945000)]

R
dbutils.fs.ls("/tmp")

# For prettier results from dbutils.fs.ls(<dir>), please use `%fs ls <dir>`

# [[1]]
# [[1]]$path
# [1] "dbfs:/tmp/my_file.txt"

# [[1]]$name
# [1] "my_file.txt"

# [[1]]$size
# [1] 40

# [[1]]$isDir
# [1] FALSE

# [[1]]$isFile
# [1] TRUE

# [[1]]$modificationTime
# [1] "1622054945000"

Scala

dbutils.fs.ls("/tmp")

// res6: Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(FileInfo(dbfs:/tmp/my_file.txt,
//   my_file.txt, 40, 1622054945000))

mkdirs command (dbutils.fs.mkdirs)


Creates the given directory if it does not exist. Also creates any necessary parent directories.
To display help for this command, run dbutils.fs.help("mkdirs") .
This example creates the directory structure /parent/child/grandchild within /tmp .
Python

dbutils.fs.mkdirs("/tmp/parent/child/grandchild")

# Out[15]: True

dbutils.fs.mkdirs("/tmp/parent/child/grandchild")

# [1] TRUE

Scala

dbutils.fs.mkdirs("/tmp/parent/child/grandchild")

// res7: Boolean = true

mount command (dbutils.fs.mount)


Mounts the specified source directory into DBFS at the specified mount point.
To display help for this command, run dbutils.fs.help("mount") .
Python

dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})

Scala

dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))

For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
mounts command (dbutils.fs.mounts)
Displays information about what is currently mounted within DBFS.
To display help for this command, run dbutils.fs.help("mounts") .
Python

dbutils.fs.mounts()

Scala

dbutils.fs.mounts()

For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
mv command (dbutils.fs.mv)
Moves a file or directory, possibly across filesystems. A move is a copy followed by a delete, even for moves
within filesystems.
To display help for this command, run dbutils.fs.help("mv") .
This example moves the file my_file.txt from /FileStore to /tmp/parent/child/grandchild .
Python

dbutils.fs.mv("/FileStore/my_file.txt", "/tmp/parent/child/grandchild")

# Out[2]: True

dbutils.fs.mv("/FileStore/my_file.txt", "/tmp/parent/child/grandchild")

# [1] TRUE

Scala
dbutils.fs.mv("/FileStore/my_file.txt", "/tmp/parent/child/grandchild")

// res1: Boolean = true

put command (dbutils.fs.put)


Writes the specified string to a file. The string is UTF-8 encoded.
To display help for this command, run dbutils.fs.help("put") .
This example writes the string Hello, Databricks! to a file named hello_db.txt in /tmp . If the file exists, it will
be overwritten.
Python

dbutils.fs.put("/tmp/hello_db.txt", "Hello, Databricks!", True)

# Wrote 18 bytes.
# Out[6]: True

dbutils.fs.put("/tmp/hello_db.txt", "Hello, Databricks!", TRUE)

# [1] TRUE

Scala

dbutils.fs.put("/tmp/hello_db.txt", "Hello, Databricks!", true)

// Wrote 18 bytes.
// res2: Boolean = true

refreshMounts command (dbutils.fs.refreshMounts)


Forces all machines in the cluster to refresh their mount cache, ensuring they receive the most recent
information.
To display help for this command, run dbutils.fs.help("refreshMounts") .
Python

dbutils.fs.refreshMounts()

Scala

dbutils.fs.refreshMounts()

For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
rm command (dbutils.fs.rm)
Removes a file or directory.
To display help for this command, run dbutils.fs.help("rm") .
This example removes the file named hello_db.txt in /tmp .
Python
dbutils.fs.rm("/tmp/hello_db.txt")

# Out[8]: True

dbutils.fs.rm("/tmp/hello_db.txt")

# [1] TRUE

Scala

dbutils.fs.rm("/tmp/hello_db.txt")

// res6: Boolean = true

unmount command (dbutils.fs.unmount)


Deletes a DBFS mount point.
To display help for this command, run dbutils.fs.help("unmount") .

dbutils.fs.unmount("/mnt/<mount-name>")

For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
updateMount command (dbutils.fs.updateMount)
Similar to the dbutils.fs.mount command, but updates an existing mount point instead of creating a new one.
Returns an error if the mount point is not present.
To display help for this command, run dbutils.fs.help("updateMount") .
This command is available in Databricks Runtime 10.2 and above.
Python

dbutils.fs.updateMount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})

Scala

dbutils.fs.updateMount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))

Jobs utility (dbutils.jobs)


Subutilities : taskValues
NOTE
Available in Databricks Runtime 7.3 and above.
This utility is available only for Python.

The jobs utility allows you to leverage jobs features. To display help for this utility, run dbutils.jobs.help() .

Provides utilities for leveraging jobs features.

taskValues: TaskValuesUtils -> Provides utilities for leveraging job task values

taskValues subutility (dbutils.jobs.taskValues)


Commands : get, set

NOTE
Available in Databricks Runtime 7.3 and above.
This subutility is available only for Python.

Provides commands for leveraging job task values.


Use this subutility to set and get arbitrary values during a job run. These values are called task values. You can
access task values in downstream tasks in the same job run. For example, you can communicate identifiers or
metrics, such as information about the evaluation of a machine learning model, between different tasks within a
job run. Each task can set multiple task values, get them, or both. Each task value has a unique key within the
same task. This unique key is known as the task value’s key. A task value is accessed with the task name and the
task value’s key.
To display help for this subutility, run dbutils.jobs.taskValues.help() .
get command (dbutils.jobs.taskValues.get)

NOTE
Available in Databricks Runtime 7.3 and above.
This command is available only for Python.
On Databricks Runtime 10.4 and earlier, if get cannot find the task, a Py4JJavaError is raised instead of a ValueError .

Gets the contents of the specified task value for the specified task in the current job run.
To display help for this command, run dbutils.jobs.taskValues.help("get") .
For example:

dbutils.jobs.taskValues.get(taskKey = "my-task", \
key = "my-key", \
default = 7, \
debugValue = 42)

In the preceding example:


taskKey is the name of the task within the job. If the command cannot find this task, a ValueError is raised.
key is the name of the task value’s key that you set with the set command (dbutils.jobs.taskValues.set). If the
command cannot find this task value’s key, a ValueError is raised (unless default is specified).
default is an optional value that is returned if key cannot be found. default cannot be None .
debugValue is an optional value that is returned if you try to get the task value from within a notebook that is
running outside of a job. This can be useful during debugging when you want to run your notebook
manually and return some value instead of raising a TypeError by default. debugValue cannot be None .
If you try to get a task value from within a notebook that is running outside of a job, this command raises a
TypeError by default. However, if the debugValue argument is specified in the command, the value of
debugValue is returned instead of raising a TypeError .

set command (dbutils.jobs.taskValues.set)

NOTE
Available in Databricks Runtime 7.3 and above.
This command is available only for Python.

Sets or updates a task value. You can set up to 250 task values for a job run.
To display help for this command, run dbutils.jobs.taskValues.help("set") .
Some examples include:

dbutils.jobs.taskValues.set(key = "my-key", \
value = 5)

dbutils.jobs.taskValues.set(key = "my-other-key", \
value = "my other value")

In the preceding examples:


key is the name of this task value’s key. This name must be unique to the job.
value is the value for this task value’s key. This command must be able to represent the value internally in
JSON format. The size of the JSON representation of the value cannot exceed 48 KiB.
If you try to set a task value from within a notebook that is running outside of a job, this command does
nothing.
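
As a combined sketch of the two commands, the following example passes a value from one task to another in the same job run; "prepare-data" is a hypothetical task name used only for illustration:

# In the notebook run by the upstream "prepare-data" task:
dbutils.jobs.taskValues.set(key = "row-count", value = 1024)

# In the notebook run by a downstream task in the same job run:
row_count = dbutils.jobs.taskValues.get(taskKey = "prepare-data", \
  key = "row-count", \
  default = 0, \
  debugValue = 0)
print(row_count)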

Library utility (dbutils.library)


NOTE
The library utility is deprecated.

Commands : install, installPyPI, list, restartPython, updateCondaEnv


The library utility allows you to install Python libraries and create an environment scoped to a notebook session.
The libraries are available both on the driver and on the executors, so you can reference them in user defined
functions. This enables:
Library dependencies of a notebook to be organized within the notebook itself.
Notebook users with different library dependencies to share a cluster without interference.
Detaching a notebook destroys this environment. However, you can recreate it by re-running the library
install API commands in the notebook. See the restartPython API for how you can reset your notebook state
without losing your environment.

IMPORTANT
Library utilities are not available on Databricks Runtime ML or Databricks Runtime for Genomics. Instead, see Notebook-
scoped Python libraries.
For Databricks Runtime 7.2 and above, Databricks recommends using %pip magic commands to install notebook-
scoped libraries. See Notebook-scoped Python libraries.

Library utilities are enabled by default. Therefore, by default the Python environment for each notebook is
isolated by using a separate Python executable that is created when the notebook is attached to and inherits the
default Python environment on the cluster. Libraries installed through an init script into the Azure Databricks
Python environment are still available. You can disable this feature by setting
spark.databricks.libraryIsolation.enabled to false .
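
For example, a minimal sketch of the cluster Spark configuration entry that turns isolation off (Spark configuration is set at the cluster level, for example in the cluster UI):

spark.databricks.libraryIsolation.enabled false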

This API is compatible with the existing cluster-wide library installation through the UI and REST API. Libraries
installed through this API have higher priority than cluster-wide libraries.
To list the available commands, run dbutils.library.help() .

install(path: String): boolean -> Install the library within the current notebook session
installPyPI(pypiPackage: String, version: String = "", repo: String = "", extras: String = ""): boolean ->
Install the PyPI library within the current notebook session
list: List -> List the isolated libraries added for the current notebook session via dbutils
restartPython: void -> Restart python process for the current notebook session
updateCondaEnv(envYmlContent: String): boolean -> Update the current notebook's Conda environment based on the specification (content of environment.yml)

install command (dbutils.library.install)


Given a path to a library, installs that library within the current notebook session. Libraries installed by calling
this command are available only to the current notebook.
To display help for this command, run dbutils.library.help("install") .
This example installs a .egg or .whl library within a notebook.

IMPORTANT
dbutils.library.install is removed in Databricks Runtime 11.0 and above.
Databricks recommends that you put all your library install commands in the first cell of your notebook and call
restartPython at the end of that cell. The Python notebook state is reset after running restartPython ; the notebook
loses all state including but not limited to local variables, imported libraries, and other ephemeral states. Therefore, we
recommend that you install libraries and reset the notebook state in the first notebook cell.

The accepted library sources are dbfs , abfss , adl , and wasbs .

dbutils.library.install("abfss:/path/to/your/library.egg")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling
this command.
dbutils.library.install("abfss:/path/to/your/library.whl")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling
this command.

NOTE
You can directly install custom wheel files using %pip . In the following example we are assuming you have uploaded your
library wheel file to DBFS:

%pip install /dbfs/path/to/your/library.whl

Egg files are not supported by pip, and wheel is considered the standard for build and binary packaging for Python. See
Wheel vs Egg for more details. However, if you want to use an egg file in a way that’s compatible with %pip , you can use
the following workaround:

# This step is only needed if no %pip commands have been run yet.
# It will trigger setting up the isolated notebook environment
%pip install <any-lib> # This doesn't need to be a real library; for example "%pip install any-lib" would work

import sys
# Assuming the preceding step was completed, the following command
# adds the egg file to the current notebook environment
sys.path.append("/local/path/to/library.egg")

installPyPI command (dbutils.library.installPyPI )


Given a Python Package Index (PyPI) package, install that package within the current notebook session. Libraries
installed by calling this command are isolated among notebooks.
To display help for this command, run dbutils.library.help("installPyPI") .
This example installs a PyPI package in a notebook. version , repo , and extras are optional. Use the extras
argument to specify the Extras feature (extra requirements).

dbutils.library.installPyPI("pypipackage", version="version", repo="repo", extras="extras")


dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this command.

IMPORTANT
dbutils.library.installPyPI is removed in Databricks Runtime 11.0 and above.
The version and extras keys cannot be part of the PyPI package string. For example:
dbutils.library.installPyPI("azureml-sdk[databricks]==1.19.0") is not valid. Use the version and extras
arguments to specify the version and extras information as follows:

dbutils.library.installPyPI("azureml-sdk", version="1.19.0", extras="databricks")


dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this command.
NOTE
When replacing dbutils.library.installPyPI commands with %pip commands, the Python interpreter is
automatically restarted. You can run the install command as follows:

%pip install azureml-sdk[databricks]==1.19.0

This example specifies library requirements in one notebook and installs them by using %run in the other. To do
this, first define the libraries to install in a notebook. This example uses a notebook named InstallDependencies .

dbutils.library.installPyPI("torch")
dbutils.library.installPyPI("scikit-learn", version="1.19.1")
dbutils.library.installPyPI("azureml-sdk", extras="databricks")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this command.

Then install them in the notebook that needs those dependencies.

%run /path/to/InstallDependencies # Install the dependencies in the first cell.

import torch
from sklearn.linear_model import LinearRegression
import azureml
...

This example resets the Python notebook state while maintaining the environment. This technique is available only in Python notebooks. For example, you can use it to reload a library that Azure Databricks preinstalled, using a different version:

dbutils.library.installPyPI("numpy", version="1.15.4")
dbutils.library.restartPython()

# Make sure you start using the library in another cell.


import numpy

You can also use this technique to install libraries such as tensorflow that need to be loaded on process start up:

dbutils.library.installPyPI("tensorflow")
dbutils.library.restartPython()

# Use the library in another cell.


import tensorflow

list command (dbutils.library.list)


Lists the isolated libraries added for the current notebook session through the library utility. This does not
include libraries that are attached to the cluster.
To display help for this command, run dbutils.library.help("list") .
This example lists the libraries installed in a notebook.
dbutils.library.list()

NOTE
The equivalent of this command using %pip is:

%pip freeze

restartPython command (dbutils.library.restartPython)


Restarts the Python process for the current notebook session.
To display help for this command, run dbutils.library.help("restartPython") .
This example restarts the Python process for the current notebook session.

dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this command.

updateCondaEnv command (dbutils.library.updateCondaEnv)


Updates the current notebook’s Conda environment based on the contents of environment.yml . This method is
supported only for Databricks Runtime on Conda.
To display help for this command, run dbutils.library.help("updateCondaEnv") .
This example updates the current notebook’s Conda environment based on the contents of the provided
specification.

dbutils.library.updateCondaEnv(
"""
channels:
- anaconda
dependencies:
- gensim=3.4
- nltk=3.4
""")

Notebook utility (dbutils.notebook)


Commands : exit, run
The notebook utility allows you to chain together notebooks and act on their results. See Modularize or link
code in notebooks.
To list the available commands, run dbutils.notebook.help() .

exit(value: String): void -> This method lets you exit a notebook with a value
run(path: String, timeoutSeconds: int, arguments: Map): String -> This method runs a notebook and returns
its exit value.

exit command (dbutils.notebook.exit)


Exits a notebook with a value.
To display help for this command, run dbutils.notebook.help("exit") .
This example exits the notebook with the value Exiting from My Other Notebook .
Python

dbutils.notebook.exit("Exiting from My Other Notebook")

# Notebook exited: Exiting from My Other Notebook

R
dbutils.notebook.exit("Exiting from My Other Notebook")

# Notebook exited: Exiting from My Other Notebook

Scala

dbutils.notebook.exit("Exiting from My Other Notebook")

// Notebook exited: Exiting from My Other Notebook

run command (dbutils.notebook.run)


Runs a notebook and returns its exit value. The notebook will run in the current cluster by default.

NOTE
The maximum length of the string value returned from the run command is 5 MB. See Get the output for a single run (
GET /jobs/runs/get-output ).

To display help for this command, run dbutils.notebook.help("run") .


This example runs a notebook named My Other Notebook in the same location as the calling notebook. The
called notebook ends with the line of code dbutils.notebook.exit("Exiting from My Other Notebook") . If the
called notebook does not finish running within 60 seconds, an exception is thrown.
Python

dbutils.notebook.run("My Other Notebook", 60)

# Out[14]: 'Exiting from My Other Notebook'

Scala

dbutils.notebook.run("My Other Notebook", 60)

// res2: String = Exiting from My Other Notebook
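
The run command also accepts an optional map of arguments that the called notebook can read through widgets. A minimal sketch, assuming the called notebook defines a text widget named date (the notebook and widget names are illustrative):
Python

result = dbutils.notebook.run("My Other Notebook", 60, {"date": "2022-07-21"})
print(result)  # Prints whatever value the called notebook passes to dbutils.notebook.exit.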

Secrets utility (dbutils.secrets)


Commands : get, getBytes, list, listScopes
The secrets utility allows you to store and access sensitive credential information without making them visible in
notebooks. See Secret management and Use the secrets in a notebook. To list the available commands, run
dbutils.secrets.help() .
get(scope: String, key: String): String -> Gets the string representation of a secret value with scope and
key
getBytes(scope: String, key: String): byte[] -> Gets the bytes representation of a secret value with scope
and key
list(scope: String): Seq -> Lists secret metadata for secrets within a scope
listScopes: Seq -> Lists secret scopes

get command (dbutils.secrets.get)


Gets the string representation of a secret value for the specified secrets scope and key.

WARNING
Administrators, secret creators, and users granted permission can read Azure Databricks secrets. While Azure Databricks
makes an effort to redact secret values that might be displayed in notebooks, it is not possible to prevent such users from
reading secrets. For more information, see Secret redaction.

To display help for this command, run dbutils.secrets.help("get") .


This example gets the string representation of the secret value for the scope named my-scope and the key
named my-key .
Python

dbutils.secrets.get(scope="my-scope", key="my-key")

# Out[14]: '[REDACTED]'

R
dbutils.secrets.get(scope="my-scope", key="my-key")

# [1] "[REDACTED]"

Scala

dbutils.secrets.get(scope="my-scope", key="my-key")

// res0: String = [REDACTED]
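
Because secret values are redacted in notebook output, the typical pattern is to pass them directly to whatever needs the credential instead of printing them. The following sketch reads a database password from a secret and uses it for a JDBC read; the scope, key, and connection details are hypothetical.
Python

password = dbutils.secrets.get(scope="my-scope", key="jdbc-password")

df = (spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://my-server.database.windows.net:1433;database=my-db")
  .option("dbtable", "dbo.my_table")
  .option("user", "my-user")
  .option("password", password)
  .load())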

getBytes command (dbutils.secrets.getBytes)


Gets the bytes representation of a secret value for the specified scope and key.
To display help for this command, run dbutils.secrets.help("getBytes") .
This example gets the secret value ( a1!b2@c3# ) for the scope named my-scope and the key named my-key .
Python

my_secret = dbutils.secrets.getBytes(scope="my-scope", key="my-key")


my_secret.decode("utf-8")

# Out[1]: 'a1!b2@c3#'

R
my_secret = dbutils.secrets.getBytes(scope="my-scope", key="my-key")
print(rawToChar(my_secret))

# [1] "a1!b2@c3#"

Scala

val mySecret = dbutils.secrets.getBytes(scope="my-scope", key="my-key")


val convertedString = new String(mySecret)
println(convertedString)

// a1!b2@c3#
// mySecret: Array[Byte] = Array(97, 49, 33, 98, 50, 64, 99, 51, 35)
// convertedString: String = a1!b2@c3#

list command (dbutils.secrets.list)


Lists the metadata for secrets within the specified scope.
To display help for this command, run dbutils.secrets.help("list") .
This example lists the metadata for secrets within the scope named my-scope .
Python

dbutils.secrets.list("my-scope")

# Out[10]: [SecretMetadata(key='my-key')]

R
dbutils.secrets.list("my-scope")

# [[1]]
# [[1]]$key
# [1] "my-key"

Scala

dbutils.secrets.list("my-scope")

// res2: Seq[com.databricks.dbutils_v1.SecretMetadata] = ArrayBuffer(SecretMetadata(my-key))

listScopes command (dbutils.secrets.listScopes)


Lists the available scopes.
To display help for this command, run dbutils.secrets.help("listScopes") .
This example lists the available scopes.
Python

dbutils.secrets.listScopes()

# Out[14]: [SecretScope(name='my-scope')]

R
dbutils.secrets.listScopes()

# [[1]]
# [[1]]$name
# [1] "my-scope"

Scala

dbutils.secrets.listScopes()

// res3: Seq[com.databricks.dbutils_v1.SecretScope] = ArrayBuffer(SecretScope(my-scope))

Widgets utility (dbutils.widgets)


Commands : combobox, dropdown, get, getArgument, multiselect, remove, removeAll, text
The widgets utility allows you to parameterize notebooks. See Databricks widgets.
To list the available commands, run dbutils.widgets.help() .

combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input
widget with a given name, default value and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input
widget a with given name, default value and choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect
input widget with a given name, default value and choices
remove(name: String): void -> Removes an input widget from the notebook
removeAll: void -> Removes all widgets in the notebook
text(name: String, defaultValue: String, label: String): void -> Creates a text input widget with a given
name and default value

combobox command (dbutils.widgets.combobox)


Creates and displays a combobox widget with the specified programmatic name, default value, choices, and
optional label.
To display help for this command, run dbutils.widgets.help("combobox") .
This example creates and displays a combobox widget with the programmatic name fruits_combobox . It offers
the choices apple , banana , coconut , and dragon fruit and is set to the initial value of banana . This combobox
widget has an accompanying label Fruits . This example ends by printing the initial value of the combobox
widget, banana .
Python

dbutils.widgets.combobox(
name='fruits_combobox',
defaultValue='banana',
choices=['apple', 'banana', 'coconut', 'dragon fruit'],
label='Fruits'
)

print(dbutils.widgets.get("fruits_combobox"))

# banana

R
dbutils.widgets.combobox(
name='fruits_combobox',
defaultValue='banana',
choices=list('apple', 'banana', 'coconut', 'dragon fruit'),
label='Fruits'
)

print(dbutils.widgets.get("fruits_combobox"))

# [1] "banana"

Scala

dbutils.widgets.combobox(
"fruits_combobox",
"banana",
Array("apple", "banana", "coconut", "dragon fruit"),
"Fruits"
)

print(dbutils.widgets.get("fruits_combobox"))

// banana

dropdown command (dbutils.widgets.dropdown)


Creates and displays a dropdown widget with the specified programmatic name, default value, choices, and
optional label.
To display help for this command, run dbutils.widgets.help("dropdown") .
This example creates and displays a dropdown widget with the programmatic name toys_dropdown . It offers the
choices alphabet blocks , basketball , cape , and doll and is set to the initial value of basketball . This
dropdown widget has an accompanying label Toys . This example ends by printing the initial value of the
dropdown widget, basketball .
Python

dbutils.widgets.dropdown(
name='toys_dropdown',
defaultValue='basketball',
choices=['alphabet blocks', 'basketball', 'cape', 'doll'],
label='Toys'
)

print(dbutils.widgets.get("toys_dropdown"))

# basketball

R
dbutils.widgets.dropdown(
name='toys_dropdown',
defaultValue='basketball',
choices=list('alphabet blocks', 'basketball', 'cape', 'doll'),
label='Toys'
)

print(dbutils.widgets.get("toys_dropdown"))

# [1] "basketball"
Scala

dbutils.widgets.dropdown(
"toys_dropdown",
"basketball",
Array("alphabet blocks", "basketball", "cape", "doll"),
"Toys"
)

print(dbutils.widgets.get("toys_dropdown"))

// basketball

get command (dbutils.widgets.get)


Gets the current value of the widget with the specified programmatic name. This programmatic name can be
either:
The name of a custom widget in the notebook, for example fruits_combobox or toys_dropdown .
The name of a custom parameter passed to the notebook as part of a notebook task, for example name or
age . For more information, see the coverage of parameters for notebook tasks in the Create a job UI or the
notebook_params field in the Trigger a new job run ( POST /jobs/run-now ) operation in the Jobs API.

To display help for this command, run dbutils.widgets.help("get") .


This example gets the value of the widget that has the programmatic name fruits_combobox .
Python

dbutils.widgets.get('fruits_combobox')

# banana

R
dbutils.widgets.get('fruits_combobox')

# [1] "banana"

Scala

dbutils.widgets.get("fruits_combobox")

// res6: String = banana

This example gets the value of the notebook task parameter that has the programmatic name age . This
parameter was set to 35 when the related notebook task was run.
Python

dbutils.widgets.get('age')

# 35

R
dbutils.widgets.get('age')

# [1] "35"
Scala

dbutils.widgets.get("age")

// res6: String = 35

getArgument command (dbutils.widgets.getArgument)


Gets the current value of the widget with the specified programmatic name. If the widget does not exist, an
optional message can be returned.

NOTE
This command is deprecated. Use dbutils.widgets.get instead.

To display help for this command, run dbutils.widgets.help("getArgument") .


This example gets the value of the widget that has the programmatic name fruits_combobox . If this widget does
not exist, the message Error: Cannot find fruits combobox is returned.
Python

dbutils.widgets.getArgument('fruits_combobox', 'Error: Cannot find fruits combobox')

# Deprecation warning: Use dbutils.widgets.text() or dbutils.widgets.dropdown() to create a widget and dbutils.widgets.get() to get its bound value.
# Out[3]: 'banana'

R
dbutils.widgets.getArgument('fruits_combobox', 'Error: Cannot find fruits combobox')

# Deprecation warning: Use dbutils.widgets.text() or dbutils.widgets.dropdown() to create a widget and dbutils.widgets.get() to get its bound value.
# [1] "banana"

Scala

dbutils.widgets.getArgument("fruits_combobox", "Error: Cannot find fruits combobox")

// command-1234567890123456:1: warning: method getArgument in trait WidgetsUtils is deprecated: Use dbutils.widgets.text() or dbutils.widgets.dropdown() to create a widget and dbutils.widgets.get() to get its bound value.
// dbutils.widgets.getArgument("fruits_combobox", "Error: Cannot find fruits combobox")
//                 ^
// res7: String = banana

multiselect command (dbutils.widgets.multiselect)


Creates and displays a multiselect widget with the specified programmatic name, default value, choices, and
optional label.
To display help for this command, run dbutils.widgets.help("multiselect") .
This example creates and displays a multiselect widget with the programmatic name days_multiselect . It offers
the choices Monday through Sunday and is set to the initial value of Tuesday . This multiselect widget has an
accompanying label Days of the Week . This example ends by printing the initial value of the multiselect widget,
Tuesday .

Python
dbutils.widgets.multiselect(
name='days_multiselect',
defaultValue='Tuesday',
choices=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'],
label='Days of the Week'
)

print(dbutils.widgets.get("days_multiselect"))

# Tuesday

R
dbutils.widgets.multiselect(
name='days_multiselect',
defaultValue='Tuesday',
choices=list('Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'),
label='Days of the Week'
)

print(dbutils.widgets.get("days_multiselect"))

# [1] "Tuesday"

Scala

dbutils.widgets.multiselect(
"days_multiselect",
"Tuesday",
Array("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"),
"Days of the Week"
)

print(dbutils.widgets.get("days_multiselect"))

// Tuesday

remove command (dbutils.widgets.remove )


Removes the widget with the specified programmatic name.
To display help for this command, run dbutils.widgets.help("remove") .

IMPORTANT
If you add a command to remove a widget, you cannot add a subsequent command to create a widget in the same cell.
You must create the widget in another cell.

This example removes the widget with the programmatic name fruits_combobox .
Python

dbutils.widgets.remove('fruits_combobox')

R
dbutils.widgets.remove('fruits_combobox')

Scala

dbutils.widgets.remove("fruits_combobox")

removeAll command (dbutils.widgets.removeAll)


Removes all widgets from the notebook.
To display help for this command, run dbutils.widgets.help("removeAll") .

IMPORTANT
If you add a command to remove all widgets, you cannot add a subsequent command to create any widgets in the same
cell. You must create the widgets in another cell.

This example removes all widgets from the notebook.


Python

dbutils.widgets.removeAll()

R
dbutils.widgets.removeAll()

Scala

dbutils.widgets.removeAll()

text command (dbutils.widgets.text)


Creates and displays a text widget with the specified programmatic name, default value, and optional label.
To display help for this command, run dbutils.widgets.help("text") .
This example creates and displays a text widget with the programmatic name your_name_text . It is set to the
initial value of Enter your name . This text widget has an accompanying label Your name . This example ends by
printing the initial value of the text widget, Enter your name .
Python

dbutils.widgets.text(
name='your_name_text',
defaultValue='Enter your name',
label='Your name'
)

print(dbutils.widgets.get("your_name_text"))

# Enter your name

R
dbutils.widgets.text(
name='your_name_text',
defaultValue='Enter your name',
label='Your name'
)

print(dbutils.widgets.get("your_name_text"))

# [1] "Enter your name"

Scala

dbutils.widgets.text(
"your_name_text",
"Enter your name",
"Your name"
)

print(dbutils.widgets.get("your_name_text"))

// Enter your name

Databricks Utilities API library


To accelerate application development, it can be helpful to compile, build, and test applications before you
deploy them as production jobs. To enable you to compile against Databricks Utilities, Databricks provides the
dbutils-api library. You can download the dbutils-api library from the DBUtils API webpage on the Maven
Repository website or include the library by adding a dependency to your build file:
SBT

libraryDependencies += "com.databricks" % "dbutils-api_TARGET" % "VERSION"

Maven

<dependency>
<groupId>com.databricks</groupId>
<artifactId>dbutils-api_TARGET</artifactId>
<version>VERSION</version>
</dependency>

Gradle

compile 'com.databricks:dbutils-api_TARGET:VERSION'

Replace TARGET with the desired target (for example 2.12 ) and VERSION with the desired version (for example
0.0.5 ). For a list of available targets and versions, see the DBUtils API webpage on the Maven Repository
website.
Once you build your application against this library, you can deploy the application.
IMPORTANT
The dbutils-api library allows you to locally compile an application that uses dbutils , but not to run it. To run the
application, you must deploy it in Azure Databricks.
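
As a minimal sketch of code that compiles against dbutils-api (the object name is hypothetical; DBUtilsHolder supplies the dbutils handle, and the code only executes after you deploy it to Azure Databricks):

import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object MyDbutilsApp {
  def main(args: Array[String]): Unit = {
    // Compiles locally against the dbutils-api stubs; runs only on an Azure Databricks cluster.
    dbutils.fs.ls("/databricks-datasets").foreach(info => println(info.path))
  }
}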

Limitations
Calling dbutils inside of executors can produce unexpected results or potentially result in errors.
If you need to run file system operations on executors using dbutils , there are several faster and more scalable
alternatives available:
For file copy or move operations, a faster option is to run filesystem operations in parallel, as described in Parallelize filesystem operations.
For file system list and delete operations, refer to the parallel listing and delete methods that use Spark in How to list and delete files faster in Databricks.
For information about executors, see Cluster Mode Overview on the Apache Spark website.
REST API (latest)

The Databricks REST API allows for programmatic management of various Azure Databricks resources. This
article provides links to the latest version of each API.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

For general usage notes about the Databricks REST API, see Databricks REST API reference. You can also jump
directly to the REST API home pages for versions 2.1, 2.0, or 1.2.
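
The curl examples in this documentation authenticate with the --netrc option. As a sketch, a ~/.netrc entry that uses a personal access token looks like the following; the workspace URL and token are placeholders.

machine adb-1234567890123456.7.azuredatabricks.net
login token
password <personal-access-token>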
Clusters API 2.0
Cluster Policies API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
Databricks SQL Warehouses API 2.0
DBFS API 2.0
Databricks SQL API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.1
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM API 2.0
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
Clusters API 2.0

The Clusters API allows you to create, start, edit, list, terminate, and delete clusters. The maximum allowed size of
a request to the Clusters API is 10MB.
Cluster lifecycle methods require a cluster ID, which is returned from Create. To obtain a list of clusters, invoke
List.
Azure Databricks maps cluster node instance types to compute units known as DBUs. See the instance type
pricing page for a list of the supported instance types and their corresponding DBUs. For instance provider
information, see Azure instance type specifications and pricing.
Azure Databricks always provides one year’s deprecation notice before ceasing support for an instance type.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create
ENDPOINT | HTTP METHOD
2.0/clusters/create | POST

Create a new Apache Spark cluster. This method acquires new instances from the cloud provider if necessary.
This method is asynchronous; the returned cluster_id can be used to poll the cluster state. When this method
returns, the cluster is in a PENDING state. The cluster is usable once it enters a RUNNING state. See ClusterState.

NOTE
Azure Databricks may not be able to acquire some of the requested nodes, due to cloud provider limitations or transient
network issues. If it is unable to acquire a sufficient number of the requested nodes, cluster creation will terminate with an
informative error message.

Examples

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
  --data @create-cluster.json

create-cluster.json :
{
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"spark_conf": {
"spark.speculation": true
},
"num_workers": 25
}

{ "cluster_id": "1234-567890-undid123" }

Here is an example for an autoscaling cluster. This cluster will start with two nodes, the minimum.

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
  --data @create-cluster.json

create-cluster.json :

{
"cluster_name": "autoscaling-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"autoscale" : {
"min_workers": 2,
"max_workers": 50
}
}

{ "cluster_id": "1234-567890-hared123" }

This example creates a Single Node cluster. To create a Single Node cluster:
Set spark_conf and custom_tags to the exact values in the example.
Set num_workers to 0 .

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
  --data @create-cluster.json

create-cluster.json :
{
"cluster_name": "single-node-cluster",
"spark_version": "7.6.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 0,
"spark_conf": {
"spark.databricks.cluster.profile": "singleNode",
"spark.master": "local[*]"
},
"custom_tags": {
"ResourceClass": "SingleNode"
}
}

{ "cluster_id": "1234-567890-pouch123" }

To create a job or submit a run with a new cluster using a policy, set policy_id to the policy ID:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
  --data @create-cluster.json

create-cluster.json :

{
"num_workers": null,
"autoscale": {
"min_workers": 2,
"max_workers": 8
},
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"spark_conf": {},
"node_type_id": "Standard_D3_v2",
"custom_tags": {},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"autotermination_minutes": 120,
"init_scripts": [],
"policy_id": "C65B864F02000008"
}

To create a new cluster, define the cluster’s properties in new_cluster :

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/jobs/create \
  --data @create-job.json

create-job.json :
{
"run_name": "my spark task",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10,
"policy_id": "ABCD000000000000"
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}

Request structure of the cluster definition


FIELD NAME | TYPE | DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call. This field is required.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call. This field is required.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

custom_tags ClusterTag An object containing a set of tags for


cluster resources. Databricks tags all
cluster resources (such as VMs) with
these tags in addition to default_tags.

Note :

* Azure Databricks allows at most 43


custom tags.
* If the cluster is created on an
instance pool, the cluster tags are not
copied to the cluster resources. To tag
resources for an instance pool, see the
custom_tags field in the Instance
Pools API 2.0.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of scripts can be
specified. The scripts are executed
sequentially in the order provided. If
cluster_log_conf is specified, init
script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.

spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default databricks managed
environmental variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

driver_instance_pool_id STRING The ID of the instance pool to use for


drivers. You must also specify
instance_pool_id . Refer to Instance
Pools API 2.0 for details.

instance_pool_id STRING The optional ID of the instance pool to


use for cluster nodes. If
driver_instance_pool_id is present,
instance_pool_id is used for worker
nodes only. Otherwise, it is used for
both the driver and the worker nodes.
Refer to Instance Pools API 2.0 for
details.

idempotency_token STRING An optional token that can be used to


guarantee the idempotency of cluster
creation requests. If the idempotency
token is assigned to a cluster that is
not in the TERMINATED state, the
request does not create a new cluster
but instead returns the ID of the
existing cluster. Otherwise, a new
cluster is created. The idempotency
token is cleared when the cluster is
terminated

If you specify the idempotency token,


upon failure you can retry until the
request succeeds. Azure Databricks will
guarantee that exactly one cluster will
be launched with that idempotency
token.

This token should have at most 64


characters.

apply_policy_default_values BOOL Whether to use policy default values


for missing cluster attributes.

enable_local_disk_encryption BOOL Whether encryption of disks locally


attached to the cluster is enabled.

azure_attributes AzureAttributes Attributes related to clusters running


on Azure. If not specified at cluster
creation, a set of default values is used.

runtime_engine STRING The type of runtime engine to use. If


not specified, the runtime engine type
is inferred based on the
spark_version value. Allowed values
include:

* PHOTON : Use the Photon runtime


engine type.
* STANDARD : Use the standard
runtime engine type.

This field is optional.

Response structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | Canonical identifier for the cluster.

Edit
ENDPOINT | HTTP METHOD
2.0/clusters/edit | POST

Edit the configuration of a cluster to match the provided attributes and size.
You can edit a cluster if it is in a RUNNING or TERMINATED state. If you edit a cluster while it is in a RUNNING state,
it will be restarted so that the new attributes can take effect. If you edit a cluster while it is in a TERMINATED state,
it will remain TERMINATED . The next time it is started using the clusters/start API, the new attributes will take
effect. An attempt to edit a cluster in any other state will be rejected with an INVALID_STATE error code.
Clusters created by the Databricks Jobs service cannot be edited.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/edit \
  --data @edit-cluster.json

edit-cluster.json :

{
"cluster_id": "1202-211320-brick1",
"num_workers": 10,
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2"
}

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_id STRING Canonical identifier for the cluster. This


field is required.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call. This field is required.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call. This field is required.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.



spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Databricks managed
environmental variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

apply_policy_default_values BOOL Whether to use policy default values


for missing cluster attributes.

enable_local_disk_encryption BOOL Whether encryption of disks locally


attached to the cluster is enabled.

azure_attributes AzureAttributes Attributes related to clusters running


on Azure. If not specified at cluster
creation, a set of default values is used.

runtime_engine STRING The type of runtime engine to use. If


not specified, the runtime engine type
is inferred based on the
spark_version value. Allowed values
include:

* PHOTON : Use the Photon runtime


engine type.
* STANDARD : Use the standard
runtime engine type.

This field is optional.

Start
ENDPOINT | HTTP METHOD
2.0/clusters/start | POST

Start a terminated cluster given its ID. This is similar to createCluster , except:
The terminated cluster ID and attributes are preserved.
The cluster starts with the last specified cluster size. If the terminated cluster is an autoscaling cluster, the
cluster starts with the minimum number of nodes.
If the cluster is in the RESTARTING state, a 400 error is returned.
You cannot start a cluster launched to run a job.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/start \
  --data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | The cluster to be started. This field is required.

Restart
ENDPOINT | HTTP METHOD
2.0/clusters/restart | POST

Restart a cluster given its ID. The cluster must be in the RUNNING state.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/restart \
--data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | The cluster to be started. This field is required.

Resize
ENDPOINT | HTTP METHOD
2.0/clusters/resize | POST

Resize a cluster to have a desired number of workers. The cluster must be in the RUNNING state.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/resize \
  --data '{ "cluster_id": "1234-567890-reef123", "num_workers": 30 }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_id STRING The cluster to be resized. This field is


required.

Delete (terminate)
ENDPOINT | HTTP METHOD
2.0/clusters/delete | POST

Terminate a cluster given its ID. The cluster is removed asynchronously. Once the termination has completed, the
cluster will be in the TERMINATED state. If the cluster is already in a TERMINATING or TERMINATED state, nothing
will happen.
Unless a cluster is pinned, 30 days after the cluster is terminated, it is permanently deleted.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/delete \
  --data '{ "cluster_id": "1234-567890-frays123" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | The cluster to be terminated. This field is required.

Permanent delete
ENDPOINT | HTTP METHOD
2.0/clusters/permanent-delete | POST

Permanently delete a cluster. If the cluster is running, it is terminated and its resources are asynchronously
removed. If the cluster is terminated, then it is immediately removed.
You cannot perform any action, including retrieving the cluster’s permissions, on a permanently deleted cluster. A
permanently deleted cluster is also no longer returned in the cluster list.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/permanent-delete \
  --data '{ "cluster_id": "1234-567890-frays123" }'

{}
Request structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | The cluster to be permanently deleted. This field is required.

Get
ENDPOINT | HTTP METHOD
2.0/clusters/get | GET

Retrieve the information for a cluster given its identifier. Clusters can be described while they are running or up
to 30 days after they are terminated.
Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/get \
  --data '{ "cluster_id": "1234-567890-reef123" }' \
  | jq .
{
"cluster_id": "1234-567890-reef123",
"driver": {
"node_id": "dced0ce388954c38abef081f54c18afd",
"instance_id": "c69c0b119a2a499d8a2843c4d256136a",
"start_timestamp": 1619718438896,
"host_private_ip": "10.0.0.1",
"private_ip": "10.0.0.2"
},
"spark_context_id": 5631707659504820000,
"jdbc_port": 10000,
"cluster_name": "my-cluster",
"spark_version": "8.2.x-scala2.12",
"node_type_id": "Standard_L4s",
"driver_node_type_id": "Standard_L4s",
"custom_tags": {
"ResourceClass": "SingleNode"
},
"autotermination_minutes": 0,
"enable_elastic_disk": true,
"disk_spec": {},
"cluster_source": "UI",
"enable_local_disk_encryption": false,
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1
},
"instance_source": {
"node_type_id": "Standard_L4s"
},
"driver_instance_source": {
"node_type_id": "Standard_L4s"
},
"state": "RUNNING",
"state_message": "",
"start_time": 1610745129764,
"last_state_loss_time": 1619718513513,
"num_workers": 0,
"cluster_memory_mb": 32768,
"cluster_cores": 4,
"default_tags": {
"Vendor": "Databricks",
"Creator": "someone@example.com",
"ClusterName": "my-cluster",
"ClusterId": "1234-567890-reef123"
},
"creator_user_name": "someone@example.com",
"pinned_by_user_name": "3401478490056118",
"init_scripts_safe_mode": false
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | The cluster about which to retrieve information. This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_id STRING Canonical identifier for the cluster. This


ID is retained during cluster restarts
and resizes, while each new cluster has
a globally unique ID.

creator_user_name STRING Creator user name. The field won’t be


included in the response if the user has
already been deleted.

driver SparkNode Node on which the Spark driver


resides. The driver node contains the
Spark master and the Databricks
application that manages the per-
notebook Spark REPLs.

executors An array of SparkNode Nodes on which the Spark executors


reside.

spark_context_id INT64 A canonical SparkContext identifier.


This value does change when the
Spark driver restarts. The pair
(cluster_id, spark_context_id) is
a globally unique identifier over all
Spark contexts.

jdbc_port INT32 Port on which Spark JDBC server is


listening in the driver node. No service
will be listening on this port in
executor nodes.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call. This field is required.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

custom_tags ClusterTag An object containing a set of tags for


cluster resources. Databricks tags all
cluster resources with these tags in
addition to default_tags.

Note :

* Tags are not supported on legacy


node types such as compute-
optimized and memory-optimized.
* Databricks allows at most 45 custom
tags.
* If the cluster is created on an
instance pool, the cluster tags are not
copied to the cluster resources. To tag
resources for an instance pool, see the
custom_tags field in the Instance
Pools API 2.0.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.

spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Databricks managed
environmental variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, this cluster will dynamically
acquire additional disk space when its
Spark workers are running low on disk
space. See Autoscaling local storage for
details.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

state ClusterState State of the cluster.

state_message STRING A message associated with the most


recent state transition (for example,
the reason why the cluster entered the
TERMINATED state).

start_time INT64 Time (in epoch milliseconds) when the


cluster creation request was received
(when the cluster entered the
PENDING state).

terminated_time INT64 Time (in epoch milliseconds) when the


cluster was terminated, if applicable.

last_state_loss_time INT64 Time when the cluster driver last lost


its state (due to a restart or driver
failure).

last_activity_time INT64 Time (in epoch milliseconds) when the


cluster was last active. A cluster is
active if there is at least one command
that has not finished on the cluster.
This field is available after the cluster
has reached the RUNNING state.
Updates to this field are made as best-
effort attempts. Certain versions of
Spark do not support reporting of
cluster activity. Refer to Automatic
termination for details.

cluster_memory_mb INT64 Total amount of cluster memory, in


megabytes.

cluster_cores FLOAT Number of CPU cores available for this


cluster. This can be fractional since
certain node types are configured to
share cores between Spark nodes on
the same instance.

default_tags ClusterTag An object containing a set of tags that


are added by Azure Databricks
regardless of any custom_tags,
including:

* Vendor: Databricks
* Creator:
* ClusterName:
* ClusterId:
* Name: On job clusters:

* RunName:
* JobId: On resources used by
Databricks SQL:

* SqlWarehouseId:

cluster_log_status LogSyncStatus Cluster log delivery status.

termination_reason TerminationReason Information about why the cluster was


terminated. This field appears only
when the cluster is in the
TERMINATING or TERMINATED state.

Pin
NOTE
You must be an Azure Databricks administrator to invoke this API.

ENDPOINT | HTTP METHOD
2.0/clusters/pin | POST

Ensure that an all-purpose cluster configuration is retained even after a cluster has been terminated for more
than 30 days. Pinning ensures that the cluster is always returned by the List API. Pinning a cluster that is already
pinned has no effect.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/pin \
  --data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | The cluster to pin. This field is required.

Unpin
NOTE
You must be an Azure Databricks administrator to invoke this API.

ENDPOINT | HTTP METHOD
2.0/clusters/unpin | POST

Allows the cluster to eventually be removed from the list returned by the List API. Unpinning a cluster that is not
pinned has no effect.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/unpin \
  --data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
cluster_id | STRING | The cluster to unpin. This field is required.

List
ENDPOINT | HTTP METHOD
2.0/clusters/list | GET

Return information about all pinned clusters, active clusters, up to 200 of the most recently terminated all-
purpose clusters in the past 30 days, and up to 30 of the most recently terminated job clusters in the past 30
days. For example, if there is 1 pinned cluster, 4 active clusters, 45 terminated all-purpose clusters in the past 30
days, and 50 terminated job clusters in the past 30 days, then this API returns the 1 pinned cluster, 4 active
clusters, all 45 terminated all-purpose clusters, and the 30 most recently terminated job clusters.
Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list \
| jq .
{
"clusters": [
{
"cluster_id": "1234-567890-reef123",
"driver": {
"node_id": "dced0ce388954c38abef081f54c18afd",
"instance_id": "c69c0b119a2a499d8a2843c4d256136a",
"start_timestamp": 1619718438896,
"host_private_ip": "10.0.0.1",
"private_ip": "10.0.0.2"
},
"spark_context_id": 5631707659504820000,
"jdbc_port": 10000,
"cluster_name": "my-cluster",
"spark_version": "8.2.x-scala2.12",
"node_type_id": "Standard_L4s",
"driver_node_type_id": "Standard_L4s",
"custom_tags": {
"ResourceClass": "SingleNode"
},
"autotermination_minutes": 0,
"enable_elastic_disk": true,
"disk_spec": {},
"cluster_source": "UI",
"enable_local_disk_encryption": false,
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1
},
"instance_source": {
"node_type_id": "Standard_L4s"
},
"driver_instance_source": {
"node_type_id": "Standard_L4s"
},
"state": "RUNNING",
"state_message": "",
"start_time": 1610745129764,
"last_state_loss_time": 1619718513513,
"num_workers": 0,
"cluster_memory_mb": 32768,
"cluster_cores": 4,
"default_tags": {
"Vendor": "Databricks",
"Creator": "someone@example.com",
"ClusterName": "my-cluster",
"ClusterId": "1234-567890-reef123"
},
"creator_user_name": "someone@example.com",
"pinned_by_user_name": "3401478490056118",
"init_scripts_safe_mode": false
},
{
"..."
}
]
}

Response structure
FIELD NAME TYPE DESCRIPTION

clusters An array of ClusterInfo A list of clusters.
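
For a programmatic equivalent of the curl example above, the following is a minimal Python sketch using the requests library. The workspace URL and a personal access token are assumed to be available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (both names are illustrative); the sketch simply counts the returned clusters by state.

import os
from collections import Counter

import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-1234567890123456.7.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # personal access token (assumed)

resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

clusters = resp.json().get("clusters", [])
print(Counter(c["state"] for c in clusters))  # e.g. Counter({'TERMINATED': 45, 'RUNNING': 4, ...})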


List node types
ENDPOINT HTTP METHOD

2.0/clusters/list-node-types GET

Return a list of supported Spark node types. These node types can be used to launch a cluster.
Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list-node-types \
| jq .

{
"node_types": [
{
"node_type_id": "Standard_L80s_v2",
"memory_mb": 655360,
"num_cores": 80,
"description": "Standard_L80s_v2",
"instance_type_id": "Standard_L80s_v2",
"is_deprecated": false,
"category": "Storage Optimized",
"support_ebs_volumes": true,
"support_cluster_tags": true,
"num_gpus": 0,
"node_instance_type": {
"instance_type_id": "Standard_L80s_v2",
"local_disks": 1,
"local_disk_size_gb": 800,
"instance_family": "Standard LSv2 Family vCPUs",
"local_nvme_disk_size_gb": 1788,
"local_nvme_disks": 10,
"swap_size": "10g"
},
"is_hidden": false,
"support_port_forwarding": true,
"display_order": 0,
"is_io_cache_enabled": true,
"node_info": {
"available_core_quota": 350,
"total_core_quota": 350
}
},
{
"..."
}
]
}

Response structure
FIELD NAME TYPE DESCRIPTION

node_types An array of NodeType The list of available Spark node types.

Runtime versions
ENDPOINT HTTP METHOD

2.0/clusters/spark-versions GET

Return the list of available runtime versions. These versions can be used to launch a cluster.
Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/spark-versions \
| jq .

{
"versions": [
{
"key": "8.2.x-scala2.12",
"name": "8.2 (includes Apache Spark 3.1.1, Scala 2.12)"
},
{
"..."
}
]
}

Response structure
FIELD NAME TYPE DESCRIPTION

versions An array of SparkVersion All the available runtime versions.

Events
ENDPOINT HTTP METHOD

2.0/clusters/events POST

Retrieve a list of events about the activity of a cluster. You can retrieve events from active clusters (running,
pending, or reconfiguring) and terminated clusters within 30 days of their last termination. This API is paginated.
If there are more events to read, the response includes all the parameters necessary to request the next page of
events.
Example:

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/events \
--data @list-events.json \
| jq .

list-events.json :
{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 5,
"limit": 5,
"event_type": "RUNNING"
}

{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1619471498409,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5
},
"total_count": 25
}

Example request to retrieve the next page of events:

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/events \
--data @list-events.json \
| jq .

list-events.json :

{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5,
"event_type": "RUNNING"
}
{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1618330776302,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 15,
"limit": 5
},
"total_count": 25
}
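
The next_page object can be fed straight back into the request body to walk through every page. The loop below is a minimal Python sketch (requests library; workspace URL and personal access token assumed in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables), not an official client:

import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

body = {"cluster_id": "1234-567890-reef123", "order": "DESC", "limit": 50}
events = []

while body is not None:
    resp = requests.post(f"{host}/api/2.0/clusters/events", headers=headers, json=body)
    resp.raise_for_status()
    page = resp.json()
    events.extend(page.get("events", []))
    # next_page is omitted when there are no more events to read.
    body = page.get("next_page")

print(f"Retrieved {len(events)} events")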

Request structure
Retrieve events pertaining to a specific cluster.

FIELD NAME TYPE DESCRIPTION

cluster_id STRING The ID of the cluster to retrieve events about. This field is required.

start_time INT64 The start time in epoch milliseconds. If empty, returns events starting from the beginning of time.

end_time INT64 The end time in epoch milliseconds. If empty, returns events up to the current time.

order ListOrder The order to list events in; either ASC or DESC. Defaults to DESC.

event_types An array of ClusterEventType An optional set of event types to filter on. If empty, all event types are returned.

offset INT64 The offset in the result set. Defaults to 0 (no offset). When an offset is specified and the results are requested in descending order, the end_time field is required.

limit INT64 The maximum number of events to include in a page of events. Defaults to 50, and maximum allowed value is 500.

Response structure
FIELD NAME TYPE DESCRIPTION

events An array of ClusterEvent The list of matching events.

next_page Request structure The parameters required to retrieve the next page of events. Omitted if there are no more events to read.

total_count INT64 The total number of events filtered by the start_time, end_time, and event_types.

Data structures
In this section:
AutoScale
ClusterInfo
ClusterEvent
ClusterEventType
EventDetails
ClusterAttributes
ClusterSize
ListOrder
ResizeCause
ClusterLogConf
InitScriptInfo
ClusterTag
DbfsStorageInfo
FileStorageInfo
DockerImage
DockerBasicAuth
LogSyncStatus
NodeType
ClusterCloudProviderNodeInfo
ClusterCloudProviderNodeStatus
ParameterPair
SparkConfPair
SparkEnvPair
SparkNode
SparkVersion
TerminationReason
PoolClusterTerminationCode
ClusterSource
ClusterState
TerminationCode
TerminationType
TerminationParameter
AzureAttributes
AzureAvailability
AutoScale
Range defining the min and max number of cluster workers.

FIELD NAME TYPE DESCRIPTION

min_workers INT32 The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation.

max_workers INT32 The maximum number of workers to which the cluster can scale up when overloaded. max_workers must be strictly greater than min_workers.
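
As an illustration only, an autoscaling cluster that can shrink to 2 workers and grow to 8 would carry an autoscale object like the following (a hypothetical fragment of a create or edit request body, shown as a Python dictionary):

# Hypothetical fragment of a cluster spec; only the autoscale portion matters here.
cluster_spec = {
    "cluster_name": "autoscaling-example",  # illustrative name
    "autoscale": {
        "min_workers": 2,  # initial size and lower bound when underutilized
        "max_workers": 8,  # upper bound when overloaded; must be > min_workers
    },
}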

ClusterInfo
Metadata about a cluster.

FIELD NAME TYPE DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_id STRING Canonical identifier for the cluster. This


ID is retained during cluster restarts
and resizes, while each new cluster has
a globally unique ID.

creator_user_name STRING Creator user name. The field won’t be


included in the response if the user has
already been deleted.

driver SparkNode Node on which the Spark driver


resides. The driver node contains the
Spark master and the Databricks
application that manages the per-
notebook Spark REPLs.

executors An array of SparkNode Nodes on which the Spark executors


reside.

spark_context_id INT64 A canonical SparkContext identifier.


This value does change when the
Spark driver restarts. The pair
(cluster_id, spark_context_id) is
a globally unique identifier over all
Spark contexts.

jdbc_port INT32 Port on which the Spark JDBC server is listening in the driver node. No service will be listening on this port in executor nodes.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.

spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

To specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures that all default Databricks-managed environment variables are included as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, this cluster will dynamically
acquire additional disk space when its
Spark workers are running low on disk
space. See Autoscaling local storage for
details.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

state ClusterState State of the cluster.

state_message STRING A message associated with the most


recent state transition (for example,
the reason why the cluster entered a
TERMINATED state).

start_time INT64 Time (in epoch milliseconds) when the


cluster creation request was received
(when the cluster entered a PENDING
state).

terminated_time INT64 Time (in epoch milliseconds) when the


cluster was terminated, if applicable.

last_state_loss_time INT64 Time when the cluster driver last lost


its state (due to a restart or driver
failure).

last_activity_time INT64 Time (in epoch milliseconds) when the


cluster was last active. A cluster is
active if there is at least one command
that has not finished on the cluster.
This field is available after the cluster
has reached a RUNNING state.
Updates to this field are made as best-
effort attempts. Certain versions of
Spark do not support reporting of
cluster activity. Refer to Automatic
termination for details.

cluster_memory_mb INT64 Total amount of cluster memory, in


megabytes.

cluster_cores FLOAT Number of CPU cores available for this


cluster. This can be fractional since
certain node types are configured to
share cores between Spark nodes on
the same instance.

default_tags ClusterTag An object containing a set of tags that


are added by Azure Databricks
regardless of any custom_tags,
including:

* Vendor: Databricks
* Creator:
* ClusterName:
* ClusterId:
* Name:

On job clusters:

* RunName:
* JobId:

On resources used by Databricks SQL:

* SqlWarehouseId:

cluster_log_status LogSyncStatus Cluster log delivery status.

termination_reason TerminationReason Information about why the cluster was


terminated. This field only appears
when the cluster is in a TERMINATING
or TERMINATED state.

ClusterEvent
Cluster event information.

FIELD NAME TYPE DESCRIPTION

cluster_id STRING Canonical identifier for the cluster. This


field is required.

timestamp INT64 The timestamp when the event


occurred, stored as the number of
milliseconds since the unix epoch.
Assigned by the Timeline service.

type ClusterEventType The event type. This field is required.

details EventDetails The event details. This field is required.

ClusterEventType
Type of a cluster event.

EVENT TYPE DESCRIPTION

CREATING Indicates that the cluster is being created.



DID_NOT_EXPAND_DISK Indicates that a disk is low on space, but adding disks would
put it over the max capacity.

EXPANDED_DISK Indicates that a disk was low on space and the disks were
expanded.

FAILED_TO_EXPAND_DISK Indicates that a disk was low on space and disk space could
not be expanded.

INIT_SCRIPTS_STARTING Indicates that the cluster scoped init script has started.

INIT_SCRIPTS_FINISHED Indicates that the cluster scoped init script has finished.

STARTING Indicates that the cluster is being started.

RESTARTING Indicates that the cluster is being restarted.

TERMINATING Indicates that the cluster is being terminated.

EDITED Indicates that the cluster has been edited.

RUNNING Indicates the cluster has finished being created. Includes the
number of nodes in the cluster and a failure reason if some
nodes could not be acquired.

RESIZING Indicates a change in the target size of the cluster (upsize or


downsize).

UPSIZE_COMPLETED Indicates that nodes finished being added to the cluster.


Includes the number of nodes in the cluster and a failure
reason if some nodes could not be acquired.

NODES_LOST Indicates that some nodes were lost from the cluster.

DRIVER_HEALTHY Indicates that the driver is healthy and the cluster is ready
for use.

DRIVER_UNAVAILABLE Indicates that the driver is unavailable.

SPARK_EXCEPTION Indicates that a Spark exception was thrown from the driver.

DRIVER_NOT_RESPONDING Indicates that the driver is up but is not responsive, likely


due to GC.

DBFS_DOWN Indicates that the driver is up but DBFS is down.

METASTORE_DOWN Indicates that the driver is up but the metastore is down.

NODE_BLACKLISTED Indicates that a node is not allowed by Spark.

PINNED Indicates that the cluster was pinned.



UNPINNED Indicates that the cluster was unpinned.

EventDetails
Details about a cluster event.

FIELD NAME TYPE DESCRIPTION

current_num_workers INT32 The number of nodes in the cluster.

target_num_workers INT32 The targeted number of nodes in the


cluster.

previous_attributes ClusterAttributes The cluster attributes before a cluster


was edited.

attributes ClusterAttributes * For created clusters, the attributes of


the cluster.
* For edited clusters, the new
attributes of the cluster.

previous_cluster_size ClusterSize The size of the cluster before an edit or


resize.

cluster_size ClusterSize The cluster size that was set in the


cluster creation or edit.

cause ResizeCause The cause of a change in target size.

reason TerminationReason A termination reason:

* On a TERMINATED event, the reason


for the termination.
* On a RESIZE_COMPLETE event,
indicates the reason that we failed to
acquire some nodes.

user STRING The user that caused the event to


occur. (Empty if it was done by Azure
Databricks.)

ClusterAttributes
Common set of attributes set during cluster creation. These attributes cannot be changed over the lifetime of a
cluster.

FIELD NAME TYPE DESCRIPTION

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster, for


example “5.0.x-scala2.11”. You can
retrieve a list of available runtime
versions by using the Runtime versions
API call.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the List node types API call.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

ssh_public_keys An array of STRING SSH public key contents that will be


added to each Spark node in this
cluster. The corresponding private keys
can be used to login with the user
name ubuntu on port 2200 . Up to
10 keys can be specified.

custom_tags ClusterTag An object containing a set of tags for


cluster resources. Databricks tags all
cluster resources with these tags in
addition to default_tags.

Note :

* Tags are not supported on legacy


node types such as compute-
optimized and memory-optimized.
* Databricks allows at most 45 custom
tags.
* If the cluster is created on an
instance pool, the cluster tags are not
copied to the cluster resources. To tag
resources for an instance pool, see the
custom_tags field in the Instance
Pools API 2.0.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.



spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures that all default Databricks-managed environment variables are included as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, this cluster will dynamically
acquire additional disk space when its
Spark workers are running low on disk
space. See Autoscaling local storage for
details.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

cluster_source ClusterSource Determines whether the cluster was


created by a user through the UI,
created by the Databricks Jobs
scheduler, or through an API request.

policy_id STRING A cluster policy ID.

azure_attributes AzureAttributes Defines attributes such as the instance


availability type, node placement, and
max bid price. If not specified during
cluster creation, a set of default values
is used.

ClusterSize
Cluster size specification.

FIELD NAME TYPE DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

When reading the properties of a


cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field is updated to
reflect the target size of 10 workers,
whereas the workers listed in executors
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

ListOrder
Generic ordering enum for list-based queries.

ORDER DESCRIPTION

DESC Descending order.

ASC Ascending order.

ResizeCause
Reason why a cluster was resized.

CAUSE DESCRIPTION

AUTOSCALE Automatically resized based on load.

USER_REQUEST User requested a new size.

AUTORECOVERY Autorecovery monitor resized the cluster after it lost a node.

ClusterLogConf
Path to cluster log.

FIELD NAME TYPE DESCRIPTION

dbfs DbfsStorageInfo DBFS location of cluster log.


Destination must be provided. For
example,
{ "dbfs" : { "destination" :
"dbfs:/home/cluster_log" } }

InitScriptInfo
Path to an init script. For instructions on using init scripts with Databricks Container Services, see Use an init
script.

NOTE
The file storage type is only available for clusters set up using Databricks Container Services.

FIELD NAME TYPE DESCRIPTION

dbfs OR file DbfsStorageInfo OR FileStorageInfo DBFS location of init script. Destination must be provided. For example, { "dbfs" : { "destination" : "dbfs:/home/init_script" } }

File location of init script. Destination must be provided. For example, { "file" : { "destination" : "file:/my/local/file.sh" } }

ClusterTag
Cluster tag definition.

TYPE DESCRIPTION

STRING The key of the tag. The key must:

* Be between 1 and 512 characters long
* Not contain any of the characters <>%*&+?\\/
* Not begin with azure, microsoft, or windows

STRING The value of the tag. The value length must be less than or equal to 256 UTF-8 characters.
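
The key constraints above can be checked client-side before submitting a request. The helper below is an informal Python sketch of that validation and nothing more than what the table states; the case-insensitive prefix check is a conservative assumption.

FORBIDDEN = set('<>%*&+?\\/')
RESERVED_PREFIXES = ("azure", "microsoft", "windows")

def is_valid_tag_key(key: str) -> bool:
    """Mirror the documented constraints on custom tag keys."""
    if not 1 <= len(key) <= 512:
        return False
    if any(ch in FORBIDDEN for ch in key):
        return False
    # Conservative assumption: treat the reserved prefixes as case-insensitive.
    if key.lower().startswith(RESERVED_PREFIXES):
        return False
    return True

print(is_valid_tag_key("CostCenter"))  # True
print(is_valid_tag_key("azure-team"))  # False (reserved prefix)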

DbfsStorageInfo
DBFS storage information.

FIELD NAME TYPE DESCRIPTION

destination STRING DBFS destination. Example:


dbfs:/my/path

FileStorageInfo
File storage information.

NOTE
This location type is only available for clusters set up using Databricks Container Services.

FIELD NAME TYPE DESCRIPTION

destination STRING File destination. Example:


file:/my/file.sh

DockerImage
Docker image connection information.

FIELD TYPE DESCRIPTION

url string URL for the Docker image.

basic_auth DockerBasicAuth Basic authentication information for


Docker repository.

DockerBasicAuth
Docker repository basic authentication information.

FIELD DESCRIPTION

username User name for the Docker repository.

password Password for the Docker repository.

LogSyncStatus
Log delivery status.

FIELD NAME TYPE DESCRIPTION

last_attempted INT64 The timestamp of the last attempt. If the last attempt fails, last_exception contains the exception from the last attempt.

last_exception STRING The exception thrown in the last attempt; it is null (omitted in the response) if there was no exception in the last attempt.

NodeType
Description of a Spark node type including both the dimensions of the node and the instance type on which it
will be hosted.

FIELD NAME TYPE DESCRIPTION

node_type_id STRING Unique identifier for this node type.


This field is required.

memory_mb INT32 Memory (in MB) available for this node


type. This field is required.

num_cores FLOAT Number of CPU cores available for this


node type. This can be fractional if the
number of cores on a machine
instance is not divisible by the number
of Spark nodes on that machine. This
field is required.

description STRING A string description associated with


this node type. This field is required.

instance_type_id STRING An identifier for the type of hardware


that this node runs on. This field is
required.

is_deprecated BOOL Whether the node type is deprecated.


Non-deprecated node types offer
greater performance.

node_info ClusterCloudProviderNodeInfo Node type info reported by the cloud


provider.

ClusterCloudProviderNodeInfo
Information about an instance supplied by a cloud provider.

FIELD NAME TYPE DESCRIPTION

status ClusterCloudProviderNodeStatus Status as reported by the cloud


provider.

available_core_quota INT32 Available CPU core quota.

total_core_quota INT32 Total CPU core quota.

ClusterCloudProviderNodeStatus
Status of an instance supplied by a cloud provider.

STATUS DESCRIPTION

NotEnabledOnSubscription Node type not available for subscription.

NotAvailableInRegion Node type not available in region.

ParameterPair
Parameter that provides additional information about why a cluster was terminated.

TYPE DESCRIPTION

TerminationParameter Type of termination information.

STRING The termination information.

SparkConfPair
Spark configuration key-value pairs.

TYPE DESCRIPTION

STRING A configuration property name.

STRING The configuration property value.

SparkEnvPair
Spark environment variable key-value pairs.

IMPORTANT
When specifying environment variables in a job cluster, the fields in this data structure accept only Latin characters (ASCII
character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese,
Japanese kanjis, and emojis.

TYPE DESCRIPTION

STRING An environment variable name.

STRING The environment variable value.

SparkNode
Spark driver or executor configuration.

FIELD NAME TYPE DESCRIPTION

private_ip STRING Private IP address (typically a 10.x.x.x


address) of the Spark node. This is
different from the private IP address of
the host instance.

public_dns STRING Public DNS address of this node. This


address can be used to access the
Spark JDBC server on the driver node.

node_id STRING Globally unique identifier for this node.

instance_id STRING Globally unique identifier for the host


instance from the cloud provider.

start_timestamp INT64 The timestamp (in millisecond) when


the Spark node is launched.

host_private_ip STRING The private IP address of the host


instance.

SparkVersion
Databricks Runtime version of the cluster.

FIELD NAME TYPE DESCRIPTION

key STRING Databricks Runtime version key, for


example 7.3.x-scala2.12 . The value
that should be provided as the
spark_version when creating a new
cluster. The exact runtime version may
change over time for a “wildcard”
version (that is, 7.3.x-scala2.12 is a
“wildcard” version) with minor bug
fixes.

name STRING A descriptive name for the runtime


version, for example “Databricks
Runtime 7.3 LTS”.

TerminationReason
Reason why a cluster was terminated.

FIELD NAME TYPE DESCRIPTION

code TerminationCode Status code indicating why a cluster


was terminated.

type TerminationType Reason indicating why a cluster was


terminated.

parameters ParameterPair Object containing a set of parameters


that provide information about why a
cluster was terminated.

PoolClusterTerminationCode
Status code indicating why the cluster was terminated due to a pool failure.

CODE DESCRIPTION

INSTANCE_POOL_MAX_CAPACITY_FAILURE The pool max capacity has been reached.

INSTANCE_POOL_NOT_FOUND_FAILURE The pool specified by the cluster is no longer active or


doesn’t exist.

ClusterSource
Service that created the cluster.

SERVICE DESCRIPTION

UI Cluster created through the UI.

JOB Cluster created by the Databricks job scheduler.

API Cluster created through an API call.

ClusterState
State of a cluster. The allowable state transitions are as follows:
PENDING -> RUNNING
PENDING -> TERMINATING
RUNNING -> RESIZING
RUNNING -> RESTARTING
RUNNING -> TERMINATING
RESTARTING -> RUNNING
RESTARTING -> TERMINATING
RESIZING -> RUNNING
RESIZING -> TERMINATING
TERMINATING -> TERMINATED

STATE DESCRIPTION

PENDING Indicates that a cluster is in the process of being created.

RUNNING Indicates that a cluster has been started and is ready for use.

RESTARTING Indicates that a cluster is in the process of restarting.

RESIZING Indicates that a cluster is in the process of adding or


removing nodes.

TERMINATING Indicates that a cluster is in the process of being destroyed.

TERMINATED Indicates that a cluster has been successfully destroyed.

ERROR This state is no longer used. It was used to indicate a cluster


that failed to be created.
TERMINATING and TERMINATED are used instead.

UNKNOWN Indicates that a cluster is in an unknown state. A cluster


should never be in this state.

TerminationCode
Status code indicating why the cluster was terminated.

CODE DESCRIPTION

USER_REQUEST A user terminated the cluster directly. Parameters should


include a username field that indicates the specific user who
terminated the cluster.

JOB_FINISHED The cluster was launched by a job, and terminated when the
job completed.

INACTIVITY The cluster was terminated since it was idle.

CLOUD_PROVIDER_SHUTDOWN The instance that hosted the Spark driver was terminated by
the cloud provider.

COMMUNICATION_LOST Azure Databricks lost connection to services on the driver


instance. For example, this can happen when problems arise
in cloud networking infrastructure, or when the instance
itself becomes unhealthy.

CLOUD_PROVIDER_LAUNCH_FAILURE Azure Databricks experienced a cloud provider failure when


requesting instances to launch clusters.

SPARK_STARTUP_FAILURE The cluster failed to initialize. Possible reasons may include


failure to create the environment for Spark or issues
launching the Spark master and worker processes.

INVALID_ARGUMENT Cannot launch the cluster because the user specified an


invalid argument. For example, the user might specify an
invalid runtime version for the cluster.

UNEXPECTED_LAUNCH_FAILURE While launching this cluster, Azure Databricks failed to


complete critical setup steps, terminating the cluster.

INTERNAL_ERROR Azure Databricks encountered an unexpected error that


forced the running cluster to be terminated. Contact Azure
Databricks support for additional details.

SPARK_ERROR The Spark driver failed to start. Possible reasons may include
incompatible libraries and initialization scripts that corrupted
the Spark container.

METASTORE_COMPONENT_UNHEALTHY The cluster failed to start because the external metastore


could not be reached. Refer to Troubleshooting.

DBFS_COMPONENT_UNHEALTHY The cluster failed to start because Databricks File System


(DBFS) could not be reached.

AZURE_RESOURCE_PROVIDER_THROTTLING Azure Databricks reached the Azure Resource Provider request limit. Specifically, the API request rate to the specific resource type (compute, network, etc.) can't exceed the limit. Retrying might help to resolve the issue. For further information, see https://docs.microsoft.com/azure/virtual-machines/troubleshooting/troubleshooting-throttling-errors.

AZURE_RESOURCE_MANAGER_THROTTLING Azure Databricks reached the Azure Resource Manager request limit, which prevents the Azure SDK from issuing any read or write request to the Azure Resource Manager. The request limit is applied to each subscription every hour. Retrying after an hour or switching to a smaller cluster size might help to resolve the issue. For further information, see https://docs.microsoft.com/azure/azure-resource-manager/resource-manager-request-limits.

NETWORK_CONFIGURATION_FAILURE The cluster was terminated due to an error in the network


configuration. For example, a workspace with VNet injection
had incorrect DNS settings that blocked access to worker
artifacts.

DRIVER_UNREACHABLE Azure Databricks was not able to access the Spark driver,
because it was not reachable.

DRIVER_UNRESPONSIVE Azure Databricks was not able to access the Spark driver,
because it was unresponsive.

INSTANCE_UNREACHABLE Azure Databricks was not able to access instances in order to


start the cluster. This can be a transient networking issue. If
the problem persists, this usually indicates a networking
environment misconfiguration.

CONTAINER_LAUNCH_FAILURE Azure Databricks was unable to launch containers on worker


nodes for the cluster. Have your admin check your network
configuration.

INSTANCE_POOL_CLUSTER_FAILURE Pool backed cluster specific failure. See Pools for details.

REQUEST_REJECTED Azure Databricks cannot handle the request at this moment.


Try again later and contact Azure Databricks if the problem
persists.

INIT_SCRIPT_FAILURE Azure Databricks cannot load and run a cluster-scoped init


script on one of the cluster’s nodes, or the init script
terminates with a non-zero exit code. See Init script logs.

TRIAL_EXPIRED The Azure Databricks trial subscription expired.

BOOTSTRAP_TIMEOUT The cluster failed to start because of user network


configuration issues. Possible reasons include
misconfiguration of firewall settings, UDR entries, DNS, or
route tables.

TerminationType
Reason why the cluster was terminated.

TYPE DESCRIPTION

SUCCESS Termination succeeded.

CLIENT_ERROR Non-retriable. Client must fix parameters before


reattempting the cluster creation.

SERVICE_FAULT Azure Databricks service issue. Client can retry.

CLOUD_FAILURE Cloud provider infrastructure issue. Client can retry after the
underlying issue is resolved.

TerminationParameter
Key that provides additional information about why a cluster was terminated.

KEY DESCRIPTION

username The username of the user who terminated the cluster.

databricks_error_message Additional context that may explain the reason for cluster
termination.

inactivity_duration_min An idle cluster was shut down after being inactive for this
duration.

instance_id The ID of the instance that was hosting the Spark driver.

azure_error_code The Azure provided error code describing why cluster nodes
could not be provisioned. For reference, see:
https://docs.microsoft.com/azure/virtual-
machines/windows/error-messages.

azure_error_message Human-readable context of various failures from Azure. This


field is unstructured, and its exact format is subject to
change.

instance_pool_id The ID of the instance pool the cluster is using.

instance_pool_error_code The error code for cluster failures specific to a pool.

AzureAttributes
Attributes set during cluster creation related to Azure.

FIELD NAME TYPE DESCRIPTION

first_on_demand INT32 The first first_on_demand nodes of


the cluster will be placed on on-
demand instances. This value must be
greater than 0, or else cluster creation
validation fails. If this value is greater
than or equal to the current cluster
size, all nodes will be placed on on-
demand instances. If this value is less
than the current cluster size,
first_on_demand nodes will be
placed on on-demand instances and
the remainder will be placed on
availability instances. This value does
not affect cluster size and cannot be
mutated over the lifetime of a cluster.

availability AzureAvailability Availability type used for all


subsequent nodes past the
first_on_demand ones.

spot_bid_max_price DOUBLE The max bid price used for Azure spot
instances. You can set this to greater
than or equal to the current spot price.
You can also set this to -1 (the default),
which specifies that the instance
cannot be evicted on the basis of price.
The price for the instance will be the
current price for spot instances or the
price for a standard instance. You can
view historical pricing and eviction
rates in the Azure portal.

AzureAvailability
The Azure instance availability type behavior.

TYPE DESCRIPTION

SPOT_AZURE Use spot instances.

ON_DEMAND_AZURE Use on-demand instances.



SPOT_WITH_FALLBACK_AZURE Preferably use spot instances, but fall back to on-demand


instances if spot instances cannot be acquired (for example, if
Azure spot prices are too high or out of quota). Does not
apply to pool availability.
Cluster Policies API 2.0

IMPORTANT
This feature is in Public Preview.

A cluster policy limits the ability to create clusters based on a set of rules. The policy rules limit the attributes or
attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users and
groups.
Only admin users can create, edit, and delete policies. Admin users also have access to all policies.
For requirements and limitations on cluster policies, see Manage cluster policies.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Cluster Policies API


The Cluster Policies API allows you to create, list, and edit cluster policies. Creation and editing is available to
admins only. Listing can be performed by any user and is limited to policies accessible by that user.

IMPORTANT
The Cluster Policies API requires a policy JSON definition to be passed within a JSON request in stringified form. In most
cases this requires escaping of the quote characters.
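
One way to avoid hand-escaping the quotes is to build the policy as a native object and let a JSON library produce the stringified definition. The following is a minimal Python sketch (requests library; workspace URL and personal access token assumed in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables; the policy contents are purely illustrative):

import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# The policy definition as a normal dictionary (illustrative rule only).
definition = {
    "spark_conf.spark.databricks.cluster.profile": {"type": "forbidden", "hidden": True}
}

resp = requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    # json.dumps turns the dictionary into the required stringified (escaped) form.
    json={"name": "Test policy", "definition": json.dumps(definition)},
)
resp.raise_for_status()
print(resp.json())  # e.g. {"policy_id": "ABCD000000000000"}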

In this section:
Get
List
Create
Edit
Delete
Data structures
Get
ENDPOINT HTTP METHOD

2.0/policies/clusters/get GET

Return a policy specification given a policy ID.


Example
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/get \
--data '{ "policy_id": "ABCD000000000000" }' \
| jq .

{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":
{\"type\":\"forbidden\",\"hidden\":true}}",
"created_at_timestamp": 1600000000000
}

Request structure

FIELD NAME TYPE DESCRIPTION

policy_id STRING The policy ID about which to retrieve


information.

Response structure

FIELD NAME TYPE DESCRIPTION

policy_id STRING Canonical unique identifier for the


cluster policy.

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. The JSON
document must be passed as a string
and cannot be simply embedded in the
requests.

created_at_timestamp INT64 Creation time. The timestamp (in


millisecond) when this cluster policy
was created.

List
ENDPOINT HTTP METHOD

2.0/policies/clusters/list GET

Return a list of policies accessible by the requesting user.


Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/list \
--data '{ "sort_order": "DESC", "sort_column": "POLICY_CREATION_TIME" }' \
| jq .
{
"policies": [
{
"policy_id": "ABCD000000000001",
"name": "Empty",
"definition": "{}",
"created_at_timestamp": 1600000000002
},
{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":
{\"type\":\"forbidden\",\"hidden\":true}}",
"created_at_timestamp": 1600000000000
}
],
"total_count": 2
}

Request structure

FIELD NAME TYPE DESCRIPTION

sort_order ListOrder The order direction to list the policies


in; either ASC or DESC . Defaults to
DESC .

sort_column PolicySortColumn The ClusterPolicy attribute to sort


by. Defaults to
POLICY_CREATION_TIME .

Response structure

FIELD NAME TYPE DESCRIPTION

policies An array of Policy List of policies.

total_count INT64 The total number of policies.

Create
ENDPOINT HTTP METHOD

2.0/policies/clusters/create POST

Create a new policy with a given name and definition.


Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/create \
--data @create-cluster-policy.json

create-cluster-policy.json :
{
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}

{ "policy_id": "ABCD000000000000" }

Request structure

FIELD NAME TYPE DESCRIPTION

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. You must pass
the JSON document as a string; it
cannot be simply embedded in the
requests.

Response structure

FIELD NAME TYPE DESCRIPTION

policy_id STRING Canonical unique identifier for the


cluster policy.

Edit
ENDPOINT HTTP METHOD

2.0/policies/clusters/edit POST

Update an existing policy. This may make some clusters governed by this policy invalid. For such clusters the
next cluster edit must provide a confirming configuration, but otherwise they can continue to run.
Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/edit \
--data @edit-cluster-policy.json

edit-cluster-policy.json :

{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}

{}

Request structure
FIELD NAME TYPE DESCRIPTION

policy_id STRING The ID of the policy to update. This


field is required.

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. You must pass
the JSON document as a string; it
cannot be simply embedded in the
requests.

Delete
ENDPOINT HTTP METHOD

2.0/policies/clusters/delete POST

Delete a policy. Clusters governed by this policy can still run, but cannot be edited.
Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/delete \
--data '{ "policy_id": "ABCD000000000000" }'

{}

Request structure

FIELD NAME TYPE DESCRIPTION

policy_id STRING The ID of the policy to delete. This field


is required.

Data structures
In this section:
Policy
PolicySortColumn
Policy
A cluster policy entity.

FIELD NAME TYPE DESCRIPTION

policy_id STRING Canonical unique identifier for the


cluster policy.

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. You must pass
the JSON document as a string; it
cannot be simply embedded in the
requests.

creator_user_name STRING Creator user name. The field won’t be


included in the response if the user has
already been deleted.

created_at_timestamp INT64 Creation time. The timestamp (in


millisecond) when this cluster policy
was created.

PolicySortColumn
The sort order for the ListPolicies request.

NAME DESCRIPTION

POLICY_CREATION_TIME Sort result list by policy creation time.

POLICY_NAME Sort result list by policy name.

Cluster Policy Permissions API


The Cluster Policy Permissions API enables you to set permissions on a cluster policy. When you grant CAN_USE
permission on a policy to a user, the user will be able to create new clusters based on it. A user does not need
the cluster_create permission to create new clusters.
Only admin users can set permissions on cluster policies.
In this section:
Get permissions
Get permission levels
Add or modify permissions
Set or delete permissions
Data structures
Get permissions
ENDPOINT HTTP METHOD

2.0/preview/permissions/cluster-policies/<clusterPolicyId> GET

Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-
policies/ABCD000000000000 \
| jq .
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "someone@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/cluster-policies"
]
}
]
}
]
}

Request structure

FIELD NAME TYPE DESCRIPTION

clusterPolicyId STRING The policy about which to retrieve


permissions. This field is required.

Response structure
A Clusters ACL.
Get permission levels
ENDPOINT HTTP METHOD

2.0/preview/permissions/cluster-policies/<clusterPolicyId>/permissionLevels GET

Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-
policies/ABCD000000000000/permissionLevels \
| jq .

{
"permission_levels": [
{
"permission_level": "CAN_USE",
"description": "Can use the policy"
}
]
}
Request structure

FIELD NAME TYPE DESCRIPTION

clusterPolicyId STRING The policy about which to retrieve


permission levels. This field is required.

Response structure
An array of PermissionLevel with associated description.
Add or modify permissions
ENDPOINT HTTP METHOD

2.0/preview/permissions/cluster-policies/<clusterPolicyId> PATCH

Example

curl --netrc -X PATCH \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-
policies/ABCD000000000000 \
--data @add-cluster-policy-permissions.json \
| jq .

add-cluster-policy-permissions.json :

{
"access_control_list": [
{
"user_name": "someone-else@example.com",
"permission_level": "CAN_USE"
}
]
}
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "mary@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"user_name": "someone-else@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}

Request structure

FIELD NAME TYPE DESCRIPTION

clusterPolicyId STRING The policy about which to modify


permissions. This field is required.

Request body

FIELD NAME TYPE DESCRIPTION

access_control_list Array of AccessControl An array of access control lists.

Response body
A Clusters ACL.
Set or delete permissions
A PUT request replaces all direct permissions on the cluster policy object. You can make delete requests by
making a GET request to retrieve the current list of permissions followed by a PUT request removing entries to
be deleted.
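
For example, to remove one principal's direct permission you can read the current ACL, drop that entry, and PUT the remainder back. The following Python sketch assumes the requests library, a workspace URL and personal access token in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and a hypothetical policy ID and user name:

import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

policy_id = "ABCD000000000000"               # hypothetical policy ID
user_to_remove = "someone-else@example.com"  # hypothetical principal
url = f"{host}/api/2.0/preview/permissions/cluster-policies/{policy_id}"

# 1. Read the current access control list.
acl = requests.get(url, headers=headers).json()["access_control_list"]

# 2. Keep only the direct (non-inherited) grants, skipping the principal to remove.
keep = []
for entry in acl:
    if entry.get("user_name") == user_to_remove:
        continue
    for perm in entry.get("all_permissions", []):
        if not perm["inherited"]:
            keep.append({**{k: v for k, v in entry.items() if k != "all_permissions"},
                         "permission_level": perm["permission_level"]})

# 3. Replace all direct permissions with the filtered list.
resp = requests.put(url, headers=headers, json={"access_control_list": keep})
resp.raise_for_status()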
ENDPOINT HTTP METHOD

2.0/preview/permissions/cluster-policies/<clusterPolicyId> PUT

Example

curl --netrc -X PUT \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-
policies/ABCD000000000000 \
--data @set-cluster-policy-permissions.json \
| jq .

set-cluster-policy-permissions.json :

{
"access_control_list": [
{
"user_name": "someone@example.com",
"permission_level": "CAN_USE"
}
]
}

{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "someone@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}

Request structure

FIELD NAME TYPE DESCRIPTION

clusterPolicyId STRING The policy about which to set


permissions. This field is required.

Request body
FIELD NAME TYPE DESCRIPTION

access_control_list Array of AccessControlInput An array of access controls.

Response body
A Clusters ACL.
Data structures
In this section:
Clusters ACL
AccessControl
Permission
AccessControlInput
PermissionLevel
Clusters ACL

ATTRIBUTE NAME TYPE DESCRIPTION

object_id STRING The ID of the ACL object, for example,


../cluster-
policies/<clusterPolicyId>
.

object_type STRING The Databricks ACL object type, for


example, cluster-policy .

access_control_list Array of AccessControl The access controls set on the ACL


object.

AccessControl

ATTRIBUTE NAME TYPE DESCRIPTION

user_name, group_name, OR service_principal_name STRING Name of the user/group or application ID of the service principal that has permissions set on the ACL object.

all_permissions Array of Permission List of all permissions set on this ACL


object for a specific principal. Includes
both permissions directly set on this
ACL object and permissions inherited
from an ancestor ACL object.

Permission

ATTRIBUTE NAME TYPE DESCRIPTION

permission_level STRING The name of the permission level.

inherited BOOLEAN True when the ACL permission is not


set directly but inherited from an
ancestor ACL object. False if set directly
on the ACL object.

inherited_from_object List[STRING] The list of parent ACL object IDs that


contribute to inherited permission on
an ACL object. This is defined only if
inherited is true.

AccessControlInput
An item representing an ACL rule applied to the principal (user, group, or service principal).

ATTRIBUTE NAME TYPE DESCRIPTION

user_name, group_name, OR service_principal_name STRING Name of the user/group or application ID of the service principal that has permissions set on the ACL object.

permission_level STRING The name of the permission level.

PermissionLevel
Permission level that you can set on a cluster policy.

PERMISSION LEVEL DESCRIPTION

CAN_USE Allow user to create clusters based on the policy. The user
does not need the cluster create permission.
DBFS API 2.0

The DBFS API is a Databricks API that makes it simple to interact with various data sources without having to
include your credentials every time you read a file. See Databricks File System (DBFS) for more information. For
an easy-to-use command-line client for the DBFS API, see Databricks CLI.

NOTE
To ensure high quality of service under heavy load, Azure Databricks is now enforcing API rate limits for DBFS API calls.
Limits are set per workspace to ensure fair usage and high availability. Automatic retries are available using Databricks CLI
version 0.12.0 and above. We advise all customers to switch to the latest Databricks CLI version.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Limitations
Using the DBFS API with firewall enabled storage containers is not supported. Databricks recommends you use
Databricks Connect or az storage.

Add block
ENDPOINT HTTP METHOD

2.0/dbfs/add-block POST

Append a block of data to the stream specified by the input handle. If the handle does not exist, this call will
throw an exception with RESOURCE_DOES_NOT_EXIST . If the block of data exceeds 1 MB, this call will throw an
exception with MAX_BLOCK_SIZE_EXCEEDED . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/add-block \
--data '{ "data": "SGVsbG8sIFdvcmxkIQ==", "handle": 1234567890123456 }'

{}

Request structure
FIELD NAME TYPE DESCRIPTION

handle INT64 The handle on an open stream. This


field is required.

data BYTES The base64-encoded data to append


to the stream. This has a limit of 1 MB.
This field is required.
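
The create / add-block / close workflow described above can be scripted end to end. The following Python sketch (requests library; workspace URL and personal access token assumed in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables; the local file name is hypothetical) uploads a file in base64-encoded blocks:

import base64
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}
api = f"{host}/api/2.0/dbfs"

local_path = "HelloWorld.txt"       # hypothetical local file
dbfs_path = "/tmp/HelloWorld.txt"

# 1. Call create and get a handle.
handle = requests.post(f"{api}/create", headers=headers,
                       json={"path": dbfs_path, "overwrite": True}).json()["handle"]

# 2. Make one or more add-block calls with the handle.
#    Read in chunks small enough that the base64-encoded payload stays under the 1 MB limit.
with open(local_path, "rb") as f:
    while chunk := f.read(700_000):
        requests.post(f"{api}/add-block", headers=headers,
                      json={"handle": handle,
                            "data": base64.b64encode(chunk).decode("ascii")}).raise_for_status()

# 3. Call close with the handle.
requests.post(f"{api}/close", headers=headers, json={"handle": handle}).raise_for_status()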

Close
ENDPOINT HTTP METHOD

2.0/dbfs/close POST

Close the stream specified by the input handle. If the handle does not exist, this call throws an exception with
RESOURCE_DOES_NOT_EXIST . A typical workflow for file upload would be:

1. Call create and get a handle.


2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/close \
--data '{ "handle": 1234567890123456 }'

If the call succeeds, no output displays.


Request structure
FIELD NAME TYPE DESCRIPTION

handle INT64 The handle on an open stream. This


field is required.

Create
ENDPOINT HTTP METHOD

2.0/dbfs/create POST

Opens a stream to write to a file and returns a handle to this stream. There is a 10-minute idle timeout on this handle. If a file or directory already exists at the given path and overwrite is set to false, this call throws an exception with RESOURCE_ALREADY_EXISTS. A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/create \
--data '{ "path": "/tmp/HelloWorld.txt", "overwrite": true }'

{ "handle": 1234567890123456 }

Request structure
FIELD NAME TYPE DESCRIPTION

path STRING The path of the new file. The path


should be the absolute DBFS path (for
example
/mnt/my-file.txt ). This field is
required.

overwrite BOOL The flag that specifies whether to


overwrite existing file or files.

Response structure
FIELD NAME TYPE DESCRIPTION

handle INT64 Handle which should subsequently be


passed into the add-block and
close calls when writing to a file
through a stream.

Delete
ENDPOINT HTTP METHOD

2.0/dbfs/delete POST

Delete the file or directory (optionally recursively delete all files in the directory). This call throws an exception
with IO_ERROR if the path is a non-empty directory and recursive is set to false or on other similar errors.
When you delete a large number of files, the delete operation is done in increments. The call returns a response
after approximately 45 seconds with an error message (503 Service Unavailable) asking you to re-invoke the
delete operation until the directory structure is fully deleted. For example:

{
  "error_code": "PARTIAL_DELETE",
  "message": "The requested operation has deleted 324 files. There are more files remaining. You must make another request to delete more."
}

For operations that delete more than 10K files, we discourage using the DBFS REST API. Instead, perform such
operations in the context of a cluster, using the File system utility (dbutils.fs), which covers the functional scope
of the DBFS REST API from notebooks. Running such operations from notebooks provides better control and
manageability, such as selective deletes and the ability to automate periodic delete jobs. If you do use the REST
API, a minimal retry sketch follows the request structure below.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/delete \
--data '{ "path": "/tmp/HelloWorld.txt" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory to delete. The path should be the absolute DBFS path (for example, /mnt/foo/). This field is required.
recursive | BOOL | Whether or not to recursively delete the directory's contents. Deleting empty directories can be done without providing the recursive flag.
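
If you do call delete on a large directory through the REST API, a minimal retry sketch looks like the following; the loop simply re-invokes the call until the response is an empty JSON object. The path, sleep interval, and string comparison are illustrative assumptions.

#!/bin/bash
# Illustrative sketch: keep calling 2.0/dbfs/delete until the directory is fully removed.
HOST="https://adb-1234567890123456.7.azuredatabricks.net"
while : ; do
  response=$(curl --netrc -s -X POST "$HOST/api/2.0/dbfs/delete" \
    --data '{ "path": "/tmp/large-dir", "recursive": true }')
  if [ "$response" = "{}" ]; then
    echo "Delete completed."
    break
  fi
  echo "Partial delete or error, retrying: $response"
  sleep 5
done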

Get status
ENDPOINT | HTTP METHOD
2.0/dbfs/get-status | GET

Get the file information of a file or directory. If the file or directory does not exist, this call throws an exception
with RESOURCE_DOES_NOT_EXIST .
Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/get-status \
  --data '{ "path": "/tmp/HelloWorld.txt" }' \
  | jq .

{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory. The path should be the absolute DBFS path (for example, /mnt/my-folder/). This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory.
is_dir | BOOL | Whether the path is a directory.
file_size | INT64 | The length of the file in bytes, or zero if the path is a directory.
modification_time | INT64 | The last time, in epoch milliseconds, the file or directory was modified.

List
ENDPOINT | HTTP METHOD
2.0/dbfs/list | GET

List the contents of a directory, or details of the file. If the file or directory does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST .
When calling list on a large directory, the list operation will time out after approximately 60 seconds. We
strongly recommend using list only on directories containing less than 10K files and discourage using the
DBFS REST API for operations that list more than 10K files. Instead, we recommend that you perform such
operations in the context of a cluster, using the File system utility (dbutils.fs), which provides the same
functionality without timing out.
Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/list \
  --data '{ "path": "/tmp" }' \
  | jq .

{
"files": [
{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
},
{
"..."
}
]
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory. The path should be the absolute DBFS path (for example, /mnt/foo/). This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION
files | An array of FileInfo | A list of FileInfo that describe the contents of the directory or file.
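
A small illustrative sketch that pipes the list response through jq to print one size and path per line (the jq filter is an assumption, not part of the API):

curl --netrc -s -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/list \
  --data '{ "path": "/tmp" }' \
  | jq -r '.files[] | "\(.file_size)\t\(.path)"'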

Mkdirs
ENDPOINT | HTTP METHOD
2.0/dbfs/mkdirs | POST

Create the given directory and necessary parent directories if they do not exist. If there exists a file (not a
directory) at any prefix of the input path, this call throws an exception with RESOURCE_ALREADY_EXISTS . If this
operation fails it may have succeeded in creating some of the necessary parent directories.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/mkdirs \
  --data '{ "path": "/tmp/my-new-dir" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the new directory. The path should be the absolute DBFS path (for example, /mnt/my-folder/). This field is required.

Move
ENDPOINT | HTTP METHOD
2.0/dbfs/move | POST

Move a file from one location to another location within DBFS. If the source file does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST . If there already exists a file in the destination path, this call throws an
exception with RESOURCE_ALREADY_EXISTS . If the given source path is a directory, this call always recursively
moves all files.
When moving a large number of files, the API call will time out after approximately 60 seconds, potentially
resulting in partially moved data. Therefore, for operations that move more than 10K files, we strongly
discourage using the DBFS REST API. Instead, we recommend that you perform such operations in the context of
a cluster, using the File system utility (dbutils.fs) from a notebook, which provides the same functionality without
timing out.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/move \
  --data '{ "source_path": "/tmp/HelloWorld.txt", "destination_path": "/tmp/my-new-dir/HelloWorld.txt" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
source_path | STRING | The source path of the file or directory. The path should be the absolute DBFS path (for example, /mnt/my-source-folder/). This field is required.
destination_path | STRING | The destination path of the file or directory. The path should be the absolute DBFS path (for example, /mnt/my-destination-folder/). This field is required.

Put
ENDPOINT | HTTP METHOD
2.0/dbfs/put | POST

Uploads a file through a multipart form POST. It is mainly used for streaming uploads, but can also be used as a
convenient single call for data upload.
The amount of data that can be passed using the contents parameter is limited to 1 MB if specified as a string
(MAX_BLOCK_SIZE_EXCEEDED is thrown if exceeded) and 2 GB if posted as a file.

Example
To upload a local file named HelloWorld.txt in the current directory:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/put \
  --form contents=@HelloWorld.txt \
  --form path="/tmp/HelloWorld.txt" \
  --form overwrite=true

To upload the content Hello, World! as a base64-encoded string:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/put \
  --data '{ "path": "/tmp/HelloWorld.txt", "contents": "SGVsbG8sIFdvcmxkIQ==", "overwrite": true }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the new file. The path should be the absolute DBFS path (for example, /mnt/foo/). This field is required.
contents | BYTES | This parameter might be absent, and instead a posted file will be used.
overwrite | BOOL | The flag that specifies whether to overwrite existing files.

Read
ENDPOINT | HTTP METHOD
2.0/dbfs/read | GET

Return the contents of a file. If the file does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST .
If the path is a directory, the read length is negative, or if the offset is negative, this call throws an exception with
INVALID_PARAMETER_VALUE . If the read length exceeds 1 MB, this call throws an exception with
MAX_READ_SIZE_EXCEEDED . If offset + length exceeds the number of bytes in a file, reads contents until the end
of file.
Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/read \
  --data '{ "path": "/tmp/HelloWorld.txt", "offset": 1, "length": 8 }' \
  | jq .

{
"bytes_read": 8,
"data": "ZWxsbywgV28="
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file to read. The path should be the absolute DBFS path (for example, /mnt/foo/). This field is required.
offset | INT64 | The offset to read from in bytes.
length | INT64 | The number of bytes to read starting from the offset. This has a limit of 1 MB, and a default value of 0.5 MB.

Response structure
FIELD NAME | TYPE | DESCRIPTION
bytes_read | INT64 | The number of bytes read (could be less than length if we hit end of file). This refers to the number of bytes read in the unencoded version (the response data is base64-encoded).
data | BYTES | The base64-encoded contents of the file read.
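
A hedged sketch for downloading a file larger than the per-call limit: read in chunks, advance the offset by bytes_read, and stop when the server reports zero bytes read. The chunk size, output file name, and stop condition are assumptions, not documented behavior.

#!/bin/bash
# Illustrative sketch: download a DBFS file in chunks via 2.0/dbfs/read.
HOST="https://adb-1234567890123456.7.azuredatabricks.net"
DBFS_PATH="/tmp/HelloWorld.txt"
OUT="HelloWorld.txt"
offset=0
: > "$OUT"                    # truncate the local output file
while : ; do
  response=$(curl --netrc -s -X GET "$HOST/api/2.0/dbfs/read" \
    --data "{ \"path\": \"$DBFS_PATH\", \"offset\": $offset, \"length\": 1000000 }")
  bytes=$(echo "$response" | jq -r .bytes_read)
  if [ "$bytes" -eq 0 ]; then break; fi
  echo "$response" | jq -r .data | base64 -d >> "$OUT"
  offset=$((offset + bytes))
done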

Data structures
In this section:
FileInfo
FileInfo
The attributes of a file or directory.

FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory.
is_dir | BOOL | Whether the path is a directory.
file_size | INT64 | The length of the file in bytes, or zero if the path is a directory.
modification_time | INT64 | The last time, in epoch milliseconds, the file or directory was modified.
SQL Warehouses APIs 2.0

IMPORTANT
To access Databricks REST APIs, you must authenticate.

To configure individual SQL warehouses, use the SQL Warehouses API. To configure all SQL warehouses, use the
Global SQL Warehouses API.

Requirements
To create SQL warehouses you must have cluster create permission, which is enabled in the Data Science &
Engineering workspace.
To manage a SQL warehouse you must have Can Manage permission in Databricks SQL for the warehouse.

SQL Warehouses API


Use this API to create, edit, list, and get SQL warehouses.
In this section:
Create
Delete
Edit
Get
List
Start
Stop
Create
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/ | POST
2.0/sql/endpoints/ (deprecated) | POST

Create a SQL warehouse.

FIELD NAME | TYPE | DESCRIPTION
name | STRING | Name of the SQL warehouse. Must be unique. This field is required.
cluster_size | STRING | The size of the clusters allocated to the warehouse: "XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE". For the mapping from cluster to instance size, see Cluster size. This field is required.
min_num_clusters | INT32 | Minimum number of clusters available when a SQL warehouse is running. The default is 1.
max_num_clusters | INT32 | Maximum number of clusters available when a SQL warehouse is running. This field is required. If multi-cluster load balancing is not enabled, this is limited to 1.
auto_stop_mins | INT32 | Time in minutes until an idle SQL warehouse terminates all clusters and stops. This field is optional. Setting this to 0 disables auto stop. For Classic SQL warehouses, the default value is 15. For Serverless SQL warehouses, the default and recommended value is 10.
tags | WarehouseTags | Key-value pairs that describe the warehouse. Azure Databricks tags all warehouse resources with these tags. This field is optional.
enable_photon | BOOLEAN | Whether queries are executed on a native vectorized engine that speeds up query execution. This field is optional. The default is true.
channel | Channel | Whether to use the current SQL warehouse compute version or the preview version. Preview versions let you try out functionality before it becomes the Databricks SQL standard. Typically, preview versions are promoted to the current version two weeks after initial preview release, but some previews may last longer. You can learn about the features in the latest preview version by reviewing the release notes. Databricks does not recommend using preview versions for production workloads. This field is optional. The default is CHANNEL_NAME_CURRENT.
spot_instance_policy | WarehouseSpotInstancePolicy | The spot policy to use for allocating instances to clusters. This field is optional. This field is not used if the SQL warehouse is a Serverless SQL warehouse.

Example request

{
"name": "My SQL warehouse",
"cluster_size": "MEDIUM",
"min_num_clusters": 1,
"max_num_clusters": 10,
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"enable_photon": true,
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}

Example response

{
"id": "0123456789abcdef"
}
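
The table above documents the request body only; a hedged curl sketch for submitting it (the file name create-warehouse.json is an arbitrary choice):

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/ \
  --data @create-warehouse.json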

Delete
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id} | DELETE
2.0/sql/endpoints/{id} (deprecated) | DELETE

Delete a SQL warehouse.


Edit
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id}/edit | POST
2.0/sql/endpoints/{id}/edit (deprecated) | POST

Modify a SQL warehouse. All fields are optional. Missing fields default to the current values.

FIELD NAME | TYPE | DESCRIPTION
id | STRING | ID of the SQL warehouse.
name | STRING | Name of the SQL warehouse.
cluster_size | STRING | The size of the clusters allocated to the warehouse: "XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE". For the mapping from cluster to instance size, see Cluster size.
min_num_clusters | INT32 | Minimum number of clusters available when a SQL warehouse is running.
max_num_clusters | INT32 | Maximum number of clusters available when a SQL warehouse is running. This field is required. If multi-cluster load balancing is not enabled, limited to 1.
auto_stop_mins | INT32 | Time in minutes until an idle SQL warehouse terminates all clusters and stops. Setting this to 0 disables auto stop. For Classic SQL warehouses, the default value is 15. For Serverless SQL warehouses, the default and recommended value is 10.
tags | WarehouseTags | Key-value pairs that describe the warehouse.
spot_instance_policy | WarehouseSpotInstancePolicy | The spot policy to use for allocating instances to clusters.
enable_photon | BOOLEAN | Whether queries are executed on a native vectorized engine that speeds up query execution.
channel | Channel | Whether to use the current SQL warehouse compute version or the preview version. Preview versions let you try out functionality before it becomes the Databricks SQL standard. Typically, preview versions are promoted to the current version two weeks after initial preview release, but some previews may last longer. You can learn about the features in the latest preview version by reviewing the release notes. Databricks does not recommend using preview versions for production workloads. This field is optional. The default is CHANNEL_NAME_CURRENT.

Example request
{
"name": "My Edited SQL warehouse",
"cluster_size": "LARGE",
"auto_stop_mins": 60
}
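
A hedged curl sketch for submitting this edit request, reusing the warehouse ID returned by the create example above:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/0123456789abcdef/edit \
  --data '{ "name": "My Edited SQL warehouse", "cluster_size": "LARGE", "auto_stop_mins": 60 }'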

Get
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id} | GET
2.0/sql/endpoints/{id} (deprecated) | GET

Retrieve the info for a SQL warehouse.

FIELD NAME | TYPE | DESCRIPTION
id | STRING | SQL warehouse ID.
name | STRING | Name of the SQL warehouse.
cluster_size | STRING | The size of the clusters allocated to the warehouse: "XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE". For the mapping from cluster to instance size, see Cluster size.
spot_instance_policy | WarehouseSpotInstancePolicy | The spot policy to use for allocating instances to clusters.
auto_stop_mins | INT32 | Time until an idle SQL warehouse terminates all clusters and stops.
num_clusters | INT32 | Number of clusters allocated to the warehouse.
min_num_clusters | INT32 | Minimum number of clusters available when a SQL warehouse is running.
max_num_clusters | INT32 | Maximum number of clusters available when a SQL warehouse is running.
num_active_sessions | INT32 | Number of active JDBC and ODBC sessions running on the SQL warehouse.
state | WarehouseState | State of the SQL warehouse.
creator_name | STRING | Email address of the user that created the warehouse.
creator_id | STRING | Azure Databricks ID of the user that created the warehouse.
jdbc_url | STRING | The URL used to submit SQL commands to the SQL warehouse using JDBC.
odbc_params | ODBCParams | The host, path, protocol, and port information required to submit SQL commands to the SQL warehouse using ODBC.
tags | WarehouseTags | Key-value pairs that describe the warehouse.
health | WarehouseHealth | The health of the warehouse.
enable_photon | BOOLEAN | Whether queries are executed on a native vectorized engine that speeds up query execution.
channel | Channel | Whether the SQL warehouse uses the current SQL warehouse compute version or the preview version. Preview versions let you try out functionality before it becomes the Databricks SQL standard. Typically, preview versions are promoted to the current version two weeks after initial preview release, but some previews may last longer. You can learn about the features in the latest preview version by reviewing the release notes. Databricks does not recommend using preview versions for production workloads. This field is optional. The default is CHANNEL_NAME_CURRENT.

Example response
{
"id": "7f2629a529869126",
"name": "MyWarehouse",
"size": "SMALL",
"min_num_clusters": 1,
"max_num_clusters": 1,
"auto_stop_mins": 0,
"auto_resume": true,
"num_clusters": 0,
"num_active_sessions": 0,
"state": "STOPPED",
"creator_name": "user@example.com",
"jdbc_url": "jdbc:spark://hostname.staging.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/7f2629a529869126;",
"odbc_params": {
"hostname": "hostname.cloud.databricks.com",
"path": "/sql/1.0/warehouses/7f2629a529869126",
"protocol": "https",
"port": 443
},
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"spot_instance_policy": "COST_OPTIMIZED",
"enable_photon": true,
"cluster_size": "SMALL",
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}

List
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/ | GET
2.0/sql/endpoints/ (deprecated) | GET

List all SQL warehouses in the workspace.


Example response

{
"warehouses": [
{ "id": "123456790abcdef", "name": "My SQL warehouse", "cluster_size": "MEDIUM" },
{ "id": "098765321fedcba", "name": "Another SQL warehouse", "cluster_size": "LARGE" }
]
}

Note: If you use the deprecated 2.0/sql/endpoints/ API, the top-level response field would be “endpoints”
instead of “warehouses”.
Start
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id}/start | POST
2.0/sql/endpoints/{id}/start (deprecated) | POST

Start a SQL warehouse.


Stop
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id}/stop | POST
2.0/sql/endpoints/{id}/stop (deprecated) | POST

Stop a SQL warehouse.
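
Both calls take the warehouse ID in the URL; no request body is documented for them. A hedged sketch, reusing the ID from the earlier examples:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/0123456789abcdef/start

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/0123456789abcdef/stop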

Global SQL Warehouses API


Use this API to configure the security policy, data access properties, and configuration parameters for all SQL
warehouses.
In this section:
Get
Edit
Get
ENDPOINT | HTTP METHOD
/2.0/sql/config/warehouses | GET
/2.0/sql/config/endpoints (deprecated) | GET

Get the configuration for all SQL warehouses.

FIELD NAME | TYPE | DESCRIPTION
security_policy | WarehouseSecurityPolicy | The policy for controlling access to datasets.
data_access_config | Array of WarehouseConfPair | An array of key-value pairs containing properties for an external Hive metastore.
sql_configuration_parameters | RepeatedWarehouseConfPairs | SQL configuration parameters.

Example response
{
"security_policy": "DATA_ACCESS_CONTROL",
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}

Edit
Edit the configuration for all SQL warehouses.

IMPORTANT
All fields are required.
Invoking this method restarts all running SQL warehouses.

ENDPOINT | HTTP METHOD
/2.0/sql/config/warehouses | PUT
/2.0/sql/config/endpoints (deprecated) | PUT

FIELD NAME | TYPE | DESCRIPTION
security_policy | WarehouseSecurityPolicy | The policy for controlling access to datasets.
data_access_config | Array of WarehouseConfPair | An array of key-value pairs containing properties for an external Hive metastore.
sql_configuration_parameters | RepeatedWarehouseConfPairs | SQL configuration parameters.

Example request
{
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}
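
A hedged curl sketch for applying this configuration; remember that the call restarts all running SQL warehouses (the file name warehouse-config.json is an arbitrary choice):

curl --netrc -X PUT \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/config/warehouses \
  --data @warehouse-config.json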

Data structures
In this section:
WarehouseConfPair
WarehouseHealth
WarehouseSecurityPolicy
WarehouseSpotInstancePolicy
WarehouseState
WarehouseStatus
WarehouseTags
WarehouseTagPair
ODBCParams
RepeatedWarehouseConfPairs
Channel
ChannelName
WarehouseConfPair
FIELD NAME | TYPE | DESCRIPTION
key | STRING | Configuration key name.
value | STRING | Configuration key value.

WarehouseHealth
FIELD NAME | TYPE | DESCRIPTION
status | WarehouseStatus | Warehouse status.
message | STRING | A descriptive message about the health status. Includes information about errors contributing to the current health status.

WarehouseSecurityPolicy
OPTION | DESCRIPTION
DATA_ACCESS_CONTROL | Use data access control to control access to datasets.

WarehouseSpotInstancePolicy
OPTION | DESCRIPTION
COST_OPTIMIZED | Use an on-demand instance for the cluster driver and spot instances for cluster executors. The maximum spot price is 100% of the on-demand price. This is the default policy.
RELIABILITY_OPTIMIZED | Use on-demand instances for all cluster nodes.

WarehouseState
State of a SQL warehouse. The allowable state transitions are:
STARTING -> STARTING, RUNNING, STOPPING, DELETING
RUNNING -> STOPPING, DELETING
STOPPING -> STOPPED, STARTING
STOPPED -> STARTING, DELETING
DELETING -> DELETED

STATE | DESCRIPTION
STARTING | The warehouse is in the process of starting.
RUNNING | The starting process is done and the warehouse is ready to use.
STOPPING | The warehouse is in the process of being stopped.
STOPPED | The warehouse is stopped. Start it by calling start or by submitting a JDBC or ODBC request.
DELETING | The warehouse is in the process of being destroyed.
DELETED | The warehouse has been deleted and cannot be recovered.

WarehouseStatus
STATE | DESCRIPTION
HEALTHY | Warehouse is functioning normally and there are no known issues.
DEGRADED | Warehouse might be functional, but there are some known issues. Performance might be affected.
FAILED | Warehouse is severely affected and will not be able to serve queries.

WarehouseTags
FIELD NAME | TYPE | DESCRIPTION
custom_tags | Array of WarehouseTagPair | An object containing an array of key-value pairs.

WarehouseTagPair
FIELD NAME | TYPE | DESCRIPTION
key | STRING | Tag key name.
value | STRING | Tag key value.

ODBCParams
FIELD NAME | TYPE | DESCRIPTION
host | STRING | ODBC server hostname.
path | STRING | ODBC server path.
protocol | STRING | ODBC server protocol.
port | INT32 | ODBC server port.

RepeatedWarehouseConfPairs
FIELD NAME | TYPE | DESCRIPTION
configuration_pairs | Array of WarehouseConfPair | An object containing an array of key-value pairs.

Channel
FIELD NAME | TYPE | DESCRIPTION
name | ChannelName | Channel name.

ChannelName
NAME | DESCRIPTION
CHANNEL_NAME_PREVIEW | SQL warehouse is set to the preview channel and uses upcoming functionality.
CHANNEL_NAME_CURRENT | SQL warehouse is set to the current channel.


Queries and Dashboards API 2.0

The Queries and Dashboards API manages queries, results, and dashboards.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Query History API 2.0

The Query History API shows SQL queries performed using Databricks SQL warehouses. You can use this
information to help you debug issues with queries.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Delta Live Tables API guide

The Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create a pipeline
ENDPOINT | HTTP METHOD
2.0/pipelines | POST

Creates a new Delta Live Tables pipeline.


Example
This example creates a new triggered pipeline.
Request

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/pipelines \
  --data @pipeline-settings.json

pipeline-settings.json :

{
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"continuous": false
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file.
Response

{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
}

Request structure
See PipelineSettings.
Response structure
FIELD NAME | TYPE | DESCRIPTION
pipeline_id | STRING | The unique identifier for the newly created pipeline.

Edit a pipeline
ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id} | PUT

Updates the settings for an existing pipeline.


Example
This example adds a target parameter to the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request PUT \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 \
  --data @pipeline-settings.json

pipeline-settings.json :
{
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Request structure
See PipelineSettings.

Delete a pipeline
ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id} | DELETE

Deletes a pipeline from the Delta Live Tables system.


Example
This example deletes the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request DELETE \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.

Start a pipeline update


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/updates | POST

Starts an update for a pipeline.


Example
This example starts an update with full refresh for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates \
  --data '{ "full_refresh": true }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response

{
"update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
full_refresh | BOOLEAN | Whether to reprocess all data. If true, the Delta Live Tables system will reset all tables before running the pipeline. This field is optional. The default value is false.

Response structure
FIELD NAME | TYPE | DESCRIPTION
update_id | STRING | The unique identifier of the newly created update.

Stop any active pipeline update


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/stop | POST

Stops any active pipeline update. If no update is running, this request is a no-op.
Example
This example stops an update for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/stop

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.

List pipeline events


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/events | GET

Retrieves events for a pipeline.


Example
This example retrieves a maximum of 5 events for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 .
Request

curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
--data '{"max_results": 5}'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Request structure
FIELD NAME | TYPE | DESCRIPTION
page_token | STRING | Page token returned by previous call. This field is mutually exclusive with all fields in this request except max_results. An error is returned if any fields other than max_results are set when this field is set. This field is optional.
max_results | INT32 | The maximum number of entries to return in a single page. The system may return fewer than max_results events in a response, even if there are more events available. This field is optional. The default value is 25. The maximum value is 100. An error is returned if the value of max_results is greater than 100.
order_by | STRING | A string indicating a sort order by timestamp for the results, for example, ["timestamp asc"]. The sort order can be ascending or descending. By default, events are returned in descending order by timestamp. This field is optional.
filter | STRING | Criteria to select a subset of results, expressed using a SQL-like syntax. The supported filters are: level='INFO' (or WARN or ERROR); level in ('INFO', 'WARN'); id='[event-id]'; timestamp > 'TIMESTAMP' (or >=, <, <=, =). Composite expressions are supported, for example: level in ('ERROR', 'WARN') AND timestamp > '2021-07-22T06:37:33.083Z'. This field is optional.

Response structure
FIELD NAME | TYPE | DESCRIPTION
events | An array of pipeline events | The list of events matching the request criteria.
next_page_token | STRING | If present, a token to fetch the next page of events.
prev_page_token | STRING | If present, a token to fetch the previous page of events.
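
A hedged sketch combining the filter and max_results fields described above to retrieve only recent errors and warnings (the timestamp value is illustrative):

curl -n -X GET \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
  --data "{ \"filter\": \"level in ('ERROR', 'WARN') AND timestamp > '2021-07-22T06:37:33.083Z'\", \"max_results\": 25 }"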
Get pipeline details
ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id} | GET

Gets details about a pipeline, including the pipeline settings and recent updates.
Example
This example gets details for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"spec": {
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
},
"state": "IDLE",
"cluster_id": "1234-567891-abcde123",
"name": "Wikipedia pipeline (SQL)",
"creator_user_name": "username",
"latest_updates": [
{
"update_id": "8a0b6d02-fbd0-11eb-9a03-0242ac130003",
"state": "COMPLETED",
"creation_time": "2021-08-13T00:37:30.279Z"
},
{
"update_id": "a72c08ba-fbd0-11eb-9a03-0242ac130003",
"state": "CANCELED",
"creation_time": "2021-08-13T00:35:51.902Z"
},
{
"update_id": "ac37d924-fbd0-11eb-9a03-0242ac130003",
"state": "FAILED",
"creation_time": "2021-08-13T00:33:38.565Z"
}
],
"run_as_user_name": "username"
}

Response structure
FIELD NAME | TYPE | DESCRIPTION
pipeline_id | STRING | The unique identifier of the pipeline.
spec | PipelineSettings | The pipeline settings.
state | STRING | The state of the pipeline. One of IDLE or RUNNING. If state = RUNNING, then there is at least one active update.
cluster_id | STRING | The identifier of the cluster running the pipeline.
name | STRING | The user-friendly name for this pipeline.
creator_user_name | STRING | The username of the pipeline creator.
latest_updates | An array of UpdateStateInfo | Status of the most recent updates for the pipeline, ordered with the newest update first.
run_as_user_name | STRING | The username that the pipeline runs as.

Get update details


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/updates/{update_id} | GET

Gets details for a pipeline update.


Example
This example gets details for update 9a84f906-fc51-11eb-9a03-0242ac130003 for the pipeline with ID
a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl -n -X GET \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates/9a84f906-fc51-11eb-9a03-0242ac130003

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"update": {
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"update_id": "9a84f906-fc51-11eb-9a03-0242ac130003",
"config": {
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"configuration": {
"pipelines.numStreamRetryAttempts": "5"
},
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"filters": {},
"email_notifications": {},
"continuous": false,
"development": false
},
"cause": "API_CALL",
"state": "COMPLETED",
"creation_time": 1628815050279,
"full_refresh": true
}
}

Response structure
FIELD NAME | TYPE | DESCRIPTION
pipeline_id | STRING | The unique identifier of the pipeline.
update_id | STRING | The unique identifier of this update.
config | PipelineSettings | The pipeline settings.
cause | STRING | The trigger for the update. One of API_CALL, RETRY_ON_FAILURE, SERVICE_UPGRADE.
state | STRING | The state of the update. One of QUEUED, CREATED, WAITING_FOR_RESOURCES, INITIALIZING, RESETTING, SETTING_UP_TABLES, RUNNING, STOPPING, COMPLETED, FAILED, or CANCELED.
cluster_id | STRING | The identifier of the cluster running the pipeline.
creation_time | INT64 | The timestamp when the update was created.
full_refresh | BOOLEAN | Whether the update was triggered to perform a full refresh. If true, all pipeline tables were reset before running the update.

List pipelines
ENDPOINT | HTTP METHOD
2.0/pipelines/ | GET

Lists pipelines defined in the Delta Live Tables system.


Example
This example retrieves details for up to two pipelines, starting from a specified page_token :
Request

curl -n -X GET https://<databricks-instance>/api/2.0/pipelines \
  --data '{ "page_token": "eyJ...==", "max_results": 2 }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"statuses": [
{
"pipeline_id": "e0f01758-fc61-11eb-9a03-0242ac130003",
"state": "IDLE",
"name": "dlt-pipeline-python",
"latest_updates": [
{
"update_id": "ee9ae73e-fc61-11eb-9a03-0242ac130003",
"state": "COMPLETED",
"creation_time": "2021-08-13T00:34:21.871Z"
}
],
"creator_user_name": "username"
},
{
"pipeline_id": "f4c82f5e-fc61-11eb-9a03-0242ac130003",
"state": "IDLE",
"name": "dlt-pipeline-python",
"creator_user_name": "username"
}
],
"next_page_token": "eyJ...==",
"prev_page_token": "eyJ..x9"
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
page_token | STRING | Page token returned by previous call. This field is optional.
max_results | INT32 | The maximum number of entries to return in a single page. The system may return fewer than max_results events in a response, even if there are more events available. This field is optional. The default value is 25. The maximum value is 100. An error is returned if the value of max_results is greater than 100.
order_by | An array of STRING | A list of strings specifying the order of results, for example, ["name asc"]. Supported order_by fields are id and name. The default is id asc. This field is optional.
filter | STRING | Select a subset of results based on the specified criteria. The supported filters are: "notebook='<path>'" to select pipelines that reference the provided notebook path, and name LIKE '[pattern]' to select pipelines with a name that matches pattern (wildcards are supported, for example: name LIKE '%shopping%'). Composite filters are not supported. This field is optional.

Response structure
FIELD NAME | TYPE | DESCRIPTION
statuses | An array of PipelineStateInfo | The list of pipeline statuses matching the request criteria.
next_page_token | STRING | If present, a token to fetch the next page of events.
prev_page_token | STRING | If present, a token to fetch the previous page of events.
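
A hedged sketch that uses the name LIKE filter and order_by fields described above to find pipelines whose name matches a pattern:

curl -n -X GET https://<databricks-instance>/api/2.0/pipelines \
  --data "{ \"filter\": \"name LIKE '%quickstart%'\", \"order_by\": [\"name asc\"], \"max_results\": 10 }"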

Data structures
In this section:
KeyValue
NotebookLibrary
PipelineLibrary
PipelineSettings
PipelineStateInfo
PipelinesNewCluster
UpdateStateInfo
KeyValue
A key-value pair that specifies configuration parameters.

FIELD NAME | TYPE | DESCRIPTION
key | STRING | The configuration property name.
value | STRING | The configuration property value.

NotebookLibrary
A specification for a notebook containing pipeline code.

FIELD NAME | TYPE | DESCRIPTION
path | STRING | The absolute path to the notebook. This field is required.

PipelineLibrary
A specification for pipeline dependencies.

FIELD NAME | TYPE | DESCRIPTION
notebook | NotebookLibrary | The path to a notebook defining Delta Live Tables datasets. The path must be in the Databricks workspace, for example: { "notebook" : { "path" : "/my-pipeline-notebook-path" } }.

PipelineSettings
The settings for a pipeline deployment.

FIELD NAME | TYPE | DESCRIPTION
id | STRING | The unique identifier for this pipeline. The identifier is created by the Delta Live Tables system, and must not be provided when creating a pipeline.
name | STRING | A user-friendly name for this pipeline. This field is optional. By default, the pipeline name must be unique. To use a duplicate name, set allow_duplicate_names to true in the pipeline configuration.
storage | STRING | A path to a DBFS directory for storing checkpoints and tables created by the pipeline. This field is optional. The system uses a default location if this field is empty.
configuration | A map of STRING:STRING | A list of key-value pairs to add to the Spark configuration of the cluster that will run the pipeline. This field is optional. Elements must be formatted as key:value pairs.
clusters | An array of PipelinesNewCluster | An array of specifications for the clusters to run the pipeline. This field is optional. If this is not specified, the system will select a default cluster configuration for the pipeline.
libraries | An array of PipelineLibrary | The notebooks containing the pipeline code and any dependencies required to run the pipeline.
target | STRING | A database name for persisting pipeline output data. See Delta Live Tables data publishing for more information.
continuous | BOOLEAN | Whether this is a continuous pipeline. This field is optional. The default value is false.
development | BOOLEAN | Whether to run the pipeline in development mode. This field is optional. The default value is false.

PipelineStateInfo
The state of a pipeline, the status of the most recent updates, and information about associated resources.

FIELD NAME | TYPE | DESCRIPTION
state | STRING | The state of the pipeline. One of IDLE or RUNNING.
pipeline_id | STRING | The unique identifier of the pipeline.
cluster_id | STRING | The unique identifier of the cluster running the pipeline.
name | STRING | The user-friendly name of the pipeline.
latest_updates | An array of UpdateStateInfo | Status of the most recent updates for the pipeline, ordered with the newest update first.
creator_user_name | STRING | The username of the pipeline creator.
run_as_user_name | STRING | The username that the pipeline runs as. This is a read-only value derived from the pipeline owner.

PipelinesNewCluster
A pipeline cluster specification.
The Delta Live Tables system sets the following attributes. These attributes cannot be configured by users:
spark_version
init_scripts

FIELD NAME | TYPE | DESCRIPTION
label | STRING | A label for the cluster specification, either default to configure the default cluster, or maintenance to configure the maintenance cluster. This field is optional. The default value is default.
spark_conf | KeyValue | An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively. Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}
node_type_id | STRING | This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the List node types API call.
driver_node_type_id | STRING | The node type of the Spark driver. This field is optional; if unset, the driver node type will be set as the same value as node_type_id defined above.
ssh_public_keys | An array of STRING | SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to login with the user name ubuntu on port 2200. Up to 10 keys can be specified.
custom_tags | KeyValue | An object containing a set of tags for cluster resources. Databricks tags all cluster resources with these tags in addition to default_tags. Note: Tags are not supported on legacy node types such as compute-optimized and memory-optimized. Azure Databricks allows at most 45 custom tags.
cluster_log_conf | ClusterLogConf | The configuration for delivering Spark logs to a long-term storage destination. Only one destination can be specified for one cluster. If this configuration is provided, the logs will be delivered to the destination every 5 mins. The destination of driver logs is <destination>/<cluster-ID>/driver, while the destination of executor logs is <destination>/<cluster-ID>/executor.
spark_env_vars | KeyValue | An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (that is, export X='Y') while launching the driver and workers. In order to specify an additional set of SPARK_DAEMON_JAVA_OPTS, Databricks recommends appending them to $SPARK_DAEMON_JAVA_OPTS as shown in the following example. This ensures that all default Azure Databricks managed environmental variables are included as well. Example Spark environment variables: {"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"} or {"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}
init_scripts | An array of InitScriptInfo | The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If cluster_log_conf is specified, init script logs are sent to <destination>/<cluster-ID>/init_scripts.
instance_pool_id | STRING | The optional ID of the instance pool to which the cluster belongs. See Pools.
driver_instance_pool_id | STRING | The optional ID of the instance pool to use for the driver node. You must also specify instance_pool_id. See Instance Pools API 2.0.
policy_id | STRING | A cluster policy ID.
num_workers OR autoscale | INT32 OR AutoScale | If num_workers, number of worker nodes that this cluster should have. A cluster has one Spark driver and num_workers executors for a total of num_workers + 1 Spark nodes. When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field is updated to reflect the target size of 10 workers, whereas the workers listed in executors gradually increase from 5 to 10 as the new nodes are provisioned. If autoscale, parameters needed to automatically scale clusters up and down based on load. This field is optional.
apply_policy_default_values | BOOLEAN | Whether to use policy default values for missing cluster attributes.

UpdateStateInfo
The current state of a pipeline update.

FIELD NAME | TYPE | DESCRIPTION
update_id | STRING | The unique identifier for this update.
state | STRING | The state of the update. One of QUEUED, CREATED, WAITING_FOR_RESOURCES, INITIALIZING, RESETTING, SETTING_UP_TABLES, RUNNING, STOPPING, COMPLETED, FAILED, or CANCELED.
creation_time | STRING | Timestamp when this update was created.
Git Credentials API 2.0

The Git Credentials API allows users to manage their Git credentials to use Databricks Repos.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

The Git Credentials API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
Global Init Scripts API 2.0

Global init scripts are shell scripts that run during startup on each cluster node of every cluster in the workspace,
before the Apache Spark driver or worker JVM starts. They can help you to enforce consistent cluster
configurations across your workspace. Use them carefully because they can cause unanticipated impacts, like
library conflicts.
The Global Init Scripts API lets Azure Databricks administrators add global cluster initialization scripts in a secure
and controlled manner. To learn how to add them using the UI, see the global init scripts documentation.
The Global Init Scripts API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Groups API 2.0

The Groups API allows you to manage groups of users.

NOTE
You must be an Azure Databricks administrator to invoke this API.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Add member
ENDPOINT | HTTP METHOD
2.0/groups/add-member | POST

Add a user or group to a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the
given name does not exist, or if a group with the given parent name does not exist.
Examples
To add a user to a group:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/add-member \
  --data '{ "user_name": "someone@example.com", "parent_name": "reporting-department" }'

{}

To add a group to another group:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/add-member \
  --data '{ "group_name": "reporting-department", "parent_name": "data-ops-read-only" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
user_name OR group_name | STRING OR STRING | If user_name, the user name. If group_name, the group name.
parent_name | STRING | Name of the parent group to which the new member will be added. This field is required.

Create
ENDPOINT | HTTP METHOD
2.0/groups/create | POST

Create a new group with the given name. This call returns an error RESOURCE_ALREADY_EXISTS if a group with the
given name already exists.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/create \
  --data '{ "group_name": "reporting-department" }'

{ "group_name": "reporting-department" }

Request structure
FIELD NAME | TYPE | DESCRIPTION
group_name | STRING | Name for the group; must be unique among groups owned by this organization. This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION
group_name | STRING | The group name.

List members
ENDPOINT | HTTP METHOD
2.0/groups/list-members | GET

Return all of the members of a particular group. This call returns the error RESOURCE_DOES_NOT_EXIST if a group
with the given name does not exist. This method is non-recursive; it returns all groups that belong to the given
group but not the principals that belong to those child groups.
Example
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-members \
--data '{ "group_name": "reporting-department" }' \
| jq .

{
"members": [
{
"user_name": "someone@example.com"
}
]
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
group_name | STRING | The group whose members we want to retrieve. This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION
members | An array of PrincipalName | The users and groups that belong to the given group.

List
ENDPOINT | HTTP METHOD
2.0/groups/list | GET

Return all of the groups in an organization.


Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list \
  | jq .

{
"group_names": [
"reporting-department",
"data-ops-read-only",
"admins"
]
}

Response structure
FIELD NAME | TYPE | DESCRIPTION
group_names | An array of STRING | The groups in this organization.


List parents
ENDPOINT | HTTP METHOD
2.0/groups/list-parents | GET

Retrieve all groups in which a given user or group is a member. This method is non-recursive; it returns all
groups in which the given user or group is a member but not the groups in which those groups are members.
This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the given name does not exist.
Examples
To list groups for a user:

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-parents \
  --data '{ "user_name": "someone@example.com" }' \
  | jq .

{
"group_names": [
"reporting-department"
]
}

To list parent groups for a group:

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-parents \
  --data '{ "group_name": "reporting-department" }' \
  | jq .

{
"group_names": [
"data-ops-read-only"
]
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
user_name OR group_name | STRING OR STRING | If user_name, the user name. If group_name, the group name.

Response structure
FIELD NAME | TYPE | DESCRIPTION
group_names | An array of STRING | The groups in which the given user or group is a member.

Remove member
ENDPOINT | HTTP METHOD
2.0/groups/remove-member | POST

Remove a user or group from a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group
with the given name does not exist or if a group with the given parent name does not exist.
Examples
To remove a user from a group:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/remove-member \
  --data '{ "user_name": "someone@example.com", "parent_name": "reporting-department" }'

{}

To remove a group from another group:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/remove-member \
  --data '{ "group_name": "reporting-department", "parent_name": "data-ops-read-only" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
user_name OR group_name | STRING OR STRING | If user_name, the user name. If group_name, the group name.
parent_name | STRING | Name of the parent group from which the member will be removed. This field is required.

Delete
ENDPOINT | HTTP METHOD
2.0/groups/delete | POST

Remove a group from this organization. This call returns the error RESOURCE_DOES_NOT_EXIST if a group with the
given name does not exist.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/delete \
  --data '{ "group_name": "reporting-department" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
group_name | STRING | The group to remove. This field is required.

Data structures
In this section:
PrincipalName
PrincipalName
Container type for a name that is either a user name or a group name.

FIELD NAME | TYPE | DESCRIPTION
user_name OR group_name | STRING OR STRING | If user_name, the user name. If group_name, the group name.


Instance Pools API 2.0

The Instance Pools API allows you to create, edit, delete and list instance pools.
An instance pool reduces cluster start and auto-scaling times by maintaining a set of idle, ready-to-use cloud
instances. When a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s idle
instances. If the pool has no idle instances, it expands by allocating a new instance from the instance provider in
order to accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is
free for another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.
Azure Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply.
See pricing.

Requirements
You must have permission to attach to the pool; see Pool access control.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create
ENDPOINT | HTTP METHOD
2.0/instance-pools/create | POST

Create an instance pool. Use the returned instance_pool_id to query the status of the instance pool, which
includes the number of instances currently allocated by the instance pool. If you provide the min_idle_instances
parameter, instances are provisioned in the background and are ready to use once the idle_count in the
InstancePoolStats equals the requested minimum.
If your account has Databricks Container Services enabled and the instance pool is created with
preloaded_docker_images , you can use the instance pool to launch clusters with a Docker image. The Docker
image in the instance pool doesn’t have to match the Docker image in the cluster. However, the container
environment of the cluster created on the pool must align with the container environment of the instance pool:
you cannot use an instance pool created with preloaded_docker_images to launch a cluster without a Docker
image and you cannot use an instance pool created without preloaded_docker_images to launch a cluster with a
Docker image.

NOTE
Azure Databricks may not be able to acquire some of the requested idle instances due to instance provider limitations or
transient network issues. Clusters can still attach to the instance pool, but may not start as quickly.

Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/create \
--data @create-instance-pool.json

create-instance-pool.json :

{
"instance_pool_name": "my-pool",
"node_type_id": "Standard_D3_v2",
"min_idle_instances": 10,
"custom_tags": [
{
"key": "my-key",
"value": "my-value"
}
]
}

{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }

Request structure
FIELD NAME    TYPE    DESCRIPTION

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

node_type_id STRING The node type for the instances in the


pool. All clusters attached to the pool
inherit this node type and the pool’s
idle instances are allocated based on
this type. You can retrieve a list of
available node types by using the List
node types API call.

custom_tags An array of ClusterTag Additional tags for instance pool


resources. Azure Databricks tags all
pool resources (e.g. VM disk volumes)
with these tags in addition to
default_tags.

Azure Databricks allows up to 41


custom tags.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained by
the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, the instances in the pool
dynamically acquire additional disk
space when they are running low on
disk space.

disk_spec DiskSpec Defines the amount of initial remote


storage attached to each instance in
the pool.

preloaded_spark_versions An array of STRING A list with at most one runtime version


the pool installs on each instance. Pool
clusters that use a preloaded runtime
version start faster as they do not
have to wait for the image to
download. You can retrieve a list of
available runtime versions by using the
Runtime versions API call.

preloaded_docker_images An array of DockerImage A list with at most one Docker image


the pool installs on each instance. Pool
clusters that use a preloaded Docker
image start faster as they do not have
to wait for the image to download.
Available only if your account has
Databricks Container Services enabled.

azure_attributes InstancePoolAzureAttributes Defines the instance availability type


(such as spot or on-demand) and max
bid price.

Response structure
FIELD NAME    TYPE    DESCRIPTION
instance_pool_id    STRING    The ID of the created instance pool.


Edit
ENDPOINT    HTTP METHOD
2.0/instance-pools/edit    POST

Edit an instance pool. This modifies the configuration of an existing instance pool.

NOTE
You can edit only the following values: instance_pool_name , min_idle_instances , max_capacity , and
idle_instance_autotermination_minutes .
You must provide an instance_pool_name value.

Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/edit \
  --data @edit-instance-pool.json

edit-instance-pool.json :

{
"instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg",
"instance_pool_name": "my-edited-pool",
"min_idle_instances": 5,
"max_capacity": 200,
"idle_instance_autotermination_minutes": 30
}

{}

Request structure
FIELD NAME    TYPE    DESCRIPTION

instance_pool_id STRING The ID of the instance pool to edit.


This field is required.

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained
by the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

Delete
ENDPOINT    HTTP METHOD
2.0/instance-pools/delete    POST

Delete an instance pool. This permanently deletes the instance pool. The idle instances in the pool are
terminated asynchronously. New clusters cannot attach to the pool. Running clusters attached to the pool
continue to run but cannot autoscale up. Terminated clusters attached to the pool will fail to start until they are
edited to no longer use the pool.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/delete \
  --data '{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }'

{}

Request structure
FIELD NAME    TYPE    DESCRIPTION
instance_pool_id    STRING    The ID of the instance pool to delete.

Get
ENDPOINT    HTTP METHOD
2.0/instance-pools/get    GET

Retrieve the information for an instance pool given its identifier.


Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/get \
  --data '{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }'

{
"instance_pool_name": "mypool",
"node_type_id": "Standard_D3_v2",
"custom_tags": {
"my-key": "my-value"
},
"idle_instance_autotermination_minutes": 60,
"enable_elastic_disk": false,
"preloaded_spark_versions": [
"5.4.x-scala2.11"
],
"instance_pool_id": "101-120000-brick1-pool-ABCD1234",
"default_tags": {
"Vendor": "Databricks",
"DatabricksInstancePoolCreatorId": "100125",
"DatabricksInstancePoolId": "101-120000-brick1-pool-ABCD1234"
},
"state": "ACTIVE",
"stats": {
"used_count": 10,
"idle_count": 5,
"pending_used_count": 5,
"pending_idle_count": 5
},
"status": {}
}

Request structure
FIELD NAME    TYPE    DESCRIPTION
instance_pool_id    STRING    The instance pool about which to retrieve information.

Response structure
FIELD NAME    TYPE    DESCRIPTION

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

node_type_id STRING The node type for the instances in the


pool. All clusters attached to the pool
inherit this node type and the pool’s
idle instances are allocated based on
this type. You can retrieve a list of
available node types by using the List
node types API call.

custom_tags An array of ClusterTag Additional tags for instance pool


resources. Azure Databricks tags all
pool resources (e.g. VM disk volumes)
with these tags in addition to
default_tags.

Azure Databricks allows up to 41


custom tags.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained
by the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, the instances in the pool
dynamically acquire additional disk
space when they are running low on
disk space.

disk_spec DiskSpec Defines the amount of initial remote


storage attached to each instance in
the pool.

preloaded_spark_versions An array of STRING A list with the runtime version the pool
installs on each instance. Pool clusters
that use a preloaded runtime version
start faster as they do not have to wait
for the image to download. You can
retrieve a list of available runtime
versions by using the Runtime versions
API call.

instance_pool_id STRING The canonical unique identifier for the


instance pool.

default_tags An array of ClusterTag Tags that are added by Azure


Databricks regardless of any
custom_tags, including:

* Vendor: Databricks
* DatabricksInstancePoolCreatorId:
<create_user_id>
* DatabricksInstancePoolId:
<instance_pool_id>

state InstancePoolState Current state of the instance pool.

stats InstancePoolStats Statistics about the usage of the


instance pool.

status InstancePoolStatus Status about failed pending instances


in the pool.

List
ENDPOINT    HTTP METHOD
2.0/instance-pools/list    GET

List information for all instance pools.


Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/list

{
"instance_pools": [
{
"instance_pool_name": "mypool",
"node_type_id": "Standard_D3_v2",
"idle_instance_autotermination_minutes": 60,
"enable_elastic_disk": false,
"preloaded_spark_versions": [
"5.4.x-scala2.11"
],
"instance_pool_id": "101-120000-brick1-pool-ABCD1234",
"default_tags": {
"Vendor": "Databricks",
"DatabricksInstancePoolCreatorId": "100125",
"DatabricksInstancePoolId": "101-120000-brick1-pool-ABCD1234"
},
"state": "ACTIVE",
"stats": {
"used_count": 10,
"idle_count": 5,
"pending_used_count": 5,
"pending_idle_count": 5
},
"status": {}
},
{
"..."
}
]
}

Response structure
FIELD NAME    TYPE    DESCRIPTION
instance_pools    An array of InstancePoolAndStats    A list of instance pools with their statistics included.

Data structures
In this section:
InstancePoolState
InstancePoolStats
InstancePoolStatus
PendingInstanceError
DiskSpec
DiskType
InstancePoolAndStats
AzureDiskVolumeType
InstancePoolAzureAttributes
InstancePoolState
The state of an instance pool. The current allowable state transitions are:
ACTIVE -> DELETED
NAME    DESCRIPTION
ACTIVE    Indicates an instance pool is active. Clusters can attach to it.
DELETED    Indicates the instance pool has been deleted and is no longer accessible.

InstancePoolStats
Statistics about the usage of the instance pool.

FIELD NAME    TYPE    DESCRIPTION
used_count    INT32    Number of active instances that are in use by a cluster.
idle_count    INT32    Number of active instances that are not in use by a cluster.
pending_used_count    INT32    Number of pending instances that are assigned to a cluster.
pending_idle_count    INT32    Number of pending instances that are not assigned to a cluster.

InstancePoolStatus
Status about failed pending instances in the pool.

FIELD NAME    TYPE    DESCRIPTION
pending_instance_errors    An array of PendingInstanceError    List of error messages for the failed pending instances.

PendingInstanceError
Error message of a failed pending instance.

FIELD NAME    TYPE    DESCRIPTION
instance_id    STRING    ID of the failed instance.
message    STRING    Message describing the cause of the failure.

DiskSpec
Describes the initial set of disks to attach to each instance. For example, if there are 3 instances and each
instance is configured to start with 2 disks, 100 GiB each, then Azure Databricks creates a total of 6 disks, 100
GiB each, for these instances.

FIELD NAME    TYPE    DESCRIPTION
disk_type    DiskType    The type of disks to attach.
disk_count    INT32    The number of disks to attach to each instance:
* This feature is only enabled for supported node types.
* Users can choose up to the limit of the disks supported by the node type.
* For node types with no local disk, at least one disk needs to be specified.
disk_size    INT32    The size of each disk (in GiB) to attach. Values must fall into the supported range for a particular instance type:
* Premium LRS (SSD): 1 - 1023 GiB
* Standard LRS (HDD): 1 - 1023 GiB
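To illustrate the fields above, a disk_spec that attaches two 100 GiB premium disks to each instance in the pool might look like the following. This fragment is a minimal sketch assembled from the field definitions in this section, not an excerpt from an official example:

"disk_spec": {
  "disk_type": {
    "azure_disk_volume_type": "PREMIUM_LRS"
  },
  "disk_count": 2,
  "disk_size": 100
}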

DiskType
Describes the type of disk.

FIELD NAME    TYPE    DESCRIPTION
azure_disk_volume_type    AzureDiskVolumeType    The type of Azure disk to use.

InstancePoolAndStats
FIELD NAME    TYPE    DESCRIPTION

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

node_type_id STRING The node type for the instances in the


pool. All clusters attached to the pool
inherit this node type and the pool’s
idle instances are allocated based on
this type. You can retrieve a list of
available node types by using the List
node types API call.

custom_tags An array of ClusterTag Additional tags for instance pool


resources. Azure Databricks tags all
pool resources (e.g. VM disk volumes)
with these tags in addition to
default_tags.

Azure Databricks allows up to 41


custom tags.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained
by the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, the instances in the pool
dynamically acquire additional disk
space when they are running low on
disk space.

disk_spec DiskSpec Defines the amount of initial remote


storage attached to each instance in
the pool.

preloaded_spark_versions An array of STRING A list with the runtime version the pool
installs on each instance. Pool clusters
that use a preloaded runtime version
start faster as they do not have to wait
for the image to download. You can
retrieve a list of available runtime
versions by using the Runtime versions
API call.

instance_pool_id STRING The canonical unique identifier for the


instance pool.

default_tags An array of ClusterTag Tags that are added by Azure


Databricks regardless of any
custom_tags, including:

* Vendor: Databricks
* DatabricksInstancePoolCreatorId:
<create_user_id>
* DatabricksInstancePoolId:
<instance_pool_id>

state InstancePoolState Current state of the instance pool.

stats InstancePoolStats Statistics about the usage of the


instance pool.
AzureDiskVolumeType
All Azure Disk types that Azure Databricks supports. See https://docs.microsoft.com/azure/virtual-machines/linux/disks-types.

NAME    DESCRIPTION
PREMIUM_LRS    Premium storage tier, backed by SSDs.
STANDARD_LRS    Standard storage tier, backed by HDDs.

InstancePoolAzureAttributes
Azure-related attributes set during instance pool creation.

FIELD NAME    TYPE    DESCRIPTION
availability    AzureAvailability    Availability type used for all subsequent nodes.
spot_bid_max_price    DOUBLE    The max bid price used for Azure spot instances. You can set this to greater than or equal to the current spot price. You can also set this to -1 (the default), which specifies that the instance cannot be evicted on the basis of price. The price for the instance will be the current price for spot instances or the price for a standard instance. You can view historical pricing and eviction rates in the Azure portal.
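For example, a pool that requests spot instances with the default bid behavior might set azure_attributes as follows. This fragment is a minimal sketch: the SPOT_AZURE availability value is an assumption, because the AzureAvailability enumeration is not listed in this article, and -1 is the default spot_bid_max_price described above:

"azure_attributes": {
  "availability": "SPOT_AZURE",
  "spot_bid_max_price": -1
}
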
IP Access List API 2.0
7/21/2022 • 2 minutes to read

Azure Databricks workspaces can be configured so that employees connect to the service only through existing
corporate networks with a secure perimeter. Azure Databricks customers can use the IP access lists feature to
define a set of approved IP addresses. All incoming access to the web application and REST APIs requires users to connect from an authorized IP address.
For more details about this feature and examples of how to use this API, see IP access lists.
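For example, a request that creates an allow list might look like the following. This is a minimal sketch rather than an excerpt from this article: the 2.0/ip-access-lists endpoint and the label , list_type , and ip_addresses fields are assumptions based on the separately documented IP access lists feature, so confirm them against the OpenAPI specification below before use.

curl --netrc -X POST \
  https://<databricks-instance>/api/2.0/ip-access-lists \
  --data '{ "label": "office", "list_type": "ALLOW", "ip_addresses": ["192.0.2.0/24"] }'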

IMPORTANT
To access Databricks REST APIs, you must authenticate.

The IP Access List API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Jobs API 2.1
7/21/2022 • 2 minutes to read

The Jobs API allows you to programmatically manage Azure Databricks jobs. See Jobs.
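For example, listing jobs might look like the following minimal sketch. The 2.1/jobs/list endpoint shown here is an assumption based on the API version named above, so confirm it against the OpenAPI specification below before use.

curl --netrc -X GET \
  https://<databricks-instance>/api/2.1/jobs/list \
  | jq .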
The Jobs API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Libraries API 2.0
7/21/2022 • 7 minutes to read

The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

All cluster statuses


ENDPOINT    HTTP METHOD
2.0/libraries/all-cluster-statuses    GET

Get the status of all libraries on all clusters. A status will be available for all libraries installed on clusters via the
API or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been
set to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed
on this specific cluster.
Example
Request

curl --netrc --request GET \
  https://<databricks-instance>/api/2.0/libraries/all-cluster-statuses \
  | jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


Response
{
"statuses": [
{
"cluster_id": "11203-my-cluster",
"library_statuses": [
{
"library": {
"jar": "dbfs:/mnt/libraries/library.jar"
},
"status": "INSTALLING",
"messages": [],
"is_library_for_all_clusters": false
}
]
},
{
"cluster_id": "20131-my-other-cluster",
"library_statuses": [
{
"library": {
"egg": "dbfs:/mnt/libraries/library.egg"
},
"status": "ERROR",
"messages": ["Could not download library"],
"is_library_for_all_clusters": false
}
]
}
]
}

Response structure
FIELD NAME    TYPE    DESCRIPTION
statuses    An array of ClusterLibraryStatuses    A list of cluster statuses.

Cluster status
ENDPOINT    HTTP METHOD
2.0/libraries/cluster-status    GET

Get the status of libraries on a cluster. A status will be available for all libraries installed on the cluster via the API
or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been set
to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed on
the cluster.
Example
Request

curl --netrc --request GET \
  'https://<databricks-instance>/api/2.0/libraries/cluster-status?cluster_id=<cluster-id>' \
  | jq .

Or:
curl --netrc --get \
https://<databricks-instance>/api/2.0/libraries/cluster-status \
--data cluster_id=<cluster-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<cluster-id> with the Azure Databricks workspace ID of the cluster, for example 1234-567890-example123 .

This example uses a .netrc file and jq.


Response

{
"cluster_id": "11203-my-cluster",
"library_statuses": [
{
"library": {
"jar": "dbfs:/mnt/libraries/library.jar"
},
"status": "INSTALLED",
"messages": [],
"is_library_for_all_clusters": false
},
{
"library": {
"pypi": {
"package": "beautifulsoup4"
}
},
"status": "INSTALLING",
"messages": ["Successfully resolved package from PyPI"],
"is_library_for_all_clusters": false
},
{
"library": {
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
},
"status": "FAILED",
"messages": ["R package installation is not supported on this spark version.\nPlease upgrade to
Runtime 3.2 or higher"],
"is_library_for_all_clusters": false
}
]
}

Request structure
FIELD NAME    TYPE    DESCRIPTION
cluster_id    STRING    Unique identifier of the cluster whose status should be retrieved. This field is required.

Response structure
FIELD NAME    TYPE    DESCRIPTION
cluster_id    STRING    Unique identifier for the cluster.
library_statuses    An array of LibraryFullStatus    Status of all libraries on the cluster.

Install
ENDPOINT    HTTP METHOD
2.0/libraries/install    POST

Install libraries on a cluster. The installation is asynchronous - it completes in the background after the request.

IMPORTANT
This call will fail if the cluster is terminated.

Installing a wheel library on a cluster is like running the pip command against the wheel file directly on driver
and executors. All the dependencies specified in the library setup.py file are installed and this requires the
library name to satisfy the wheel file name convention.
The installation on the executors happens only when a new task is launched. With Databricks Runtime 7.1 and
below, the installation order of libraries is nondeterministic. For wheel libraries, you can ensure a deterministic
installation order by creating a zip file with suffix .wheelhouse.zip that includes all the wheel files.
Example

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/libraries/install \
  --data @install-libraries.json

install-libraries.json :
{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"whl": "dbfs:/mnt/libraries/mlflow-0.0.1.dev0-py2-none-any.whl"
},
{
"whl": "dbfs:/mnt/libraries/wheel-libraries.wheelhouse.zip"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": ["slf4j:slf4j"]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "https://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of install-libraries.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Request structure
FIELD NAME    TYPE    DESCRIPTION
cluster_id    STRING    Unique identifier for the cluster on which to install these libraries. This field is required.
libraries    An array of Library    The libraries to install.

Uninstall
ENDPOINT    HTTP METHOD
2.0/libraries/uninstall    POST

Set libraries to be uninstalled on a cluster. The libraries aren’t uninstalled until the cluster is restarted.
Uninstalling libraries that are not installed on the cluster has no impact but is not an error.
Example

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/libraries/uninstall \
  --data @uninstall-libraries.json

uninstall-libraries.json :

{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"cran": "ada"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of uninstall-libraries.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Request structure
FIELD NAME    TYPE    DESCRIPTION
cluster_id    STRING    Unique identifier for the cluster on which to uninstall these libraries. This field is required.
libraries    An array of Library    The libraries to uninstall.

Data structures
In this section:
ClusterLibraryStatuses
Library
LibraryFullStatus
MavenLibrary
PythonPyPiLibrary
RCranLibrary
LibraryInstallStatus
ClusterLibraryStatuses
FIELD NAME    TYPE    DESCRIPTION
cluster_id    STRING    Unique identifier for the cluster.
library_statuses    An array of LibraryFullStatus    Status of all libraries on the cluster.

Library
FIELD NAME    TYPE    DESCRIPTION

jar OR egg OR whl OR pypi OR maven STRING OR STRING OR STRING OR If jar, URI of the JAR to be installed.
OR cran PythonPyPiLibrary OR MavenLibrary DBFS and ADLS ( abfss ) URIs are
OR RCranLibrary supported. For example:
{ "jar":
"dbfs:/mnt/databricks/library.jar"
}
or
{ "jar": "abfss://my-
bucket/library.jar" }
. If ADLS is used, make sure the cluster
has read access on the library.

If egg, URI of the egg to be installed.


DBFS and ADLS URIs are supported.
For example:
{ "egg": "dbfs:/my/egg" } or
{ "egg": "abfss://my-bucket/egg"
}
.

If whl, URI of the wheel or zipped


wheels to be installed. DBFS and ADLS
URIs are supported. For example:
{ "whl": "dbfs:/my/whl" } or
{ "whl": "abfss://my-bucket/whl"
}
. If ADLS is used, make sure the cluster
has read access on the library. Also the
wheel file name needs to use the
correct convention. If zipped wheels
are to be installed, the file name suffix
should be .wheelhouse.zip .

If pypi, specification of a PyPI library to


be installed. Specifying the repo field
is optional and if not specified, the
default pip index is used. For example:
{ "package": "simplejson",
"repo": "https://my-repo.com" }

If maven, specification of a Maven


library to be installed. For example:
{ "coordinates":
"org.jsoup:jsoup:1.7.2" }

If cran, specification of a CRAN library


to be installed.

LibraryFullStatus
The status of the library on a specific cluster.
FIELD NAME    TYPE    DESCRIPTION

library Library Unique identifier for the library.

status LibraryInstallStatus Status of installing the library on the


cluster.

messages An array of STRING All the info and warning messages that
have occurred so far for this library.

is_library_for_all_clusters BOOL Whether the library was set to be


installed on all clusters via the libraries
UI.

MavenLibrary
FIELD NAME    TYPE    DESCRIPTION

coordinates STRING Gradle-style Maven coordinates. For


example: org.jsoup:jsoup:1.7.2 .
This field is required.

repo STRING Maven repo to install the Maven


package from. If omitted, both Maven
Central Repository and Spark Packages
are searched.

exclusions An array of STRING List of dependencies to exclude. For


example:
["slf4j:slf4j", "*:hadoop-
client"]
.

Maven dependency exclusions:


https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html.

PythonPyPiLibrary
FIELD NAME    TYPE    DESCRIPTION

package STRING The name of the PyPI package to


install. An optional exact version
specification is also supported.
Examples: simplejson and
simplejson==3.8.0 . This field is
required.

repo STRING The repository where the package can


be found. If not specified, the default
pip index is used.

RCranLibrary
FIELD NAME    TYPE    DESCRIPTION

package STRING The name of the CRAN package to


install. This field is required.

repo STRING The repository where the package can


be found. If not specified, the default
CRAN repo is used.

LibraryInstallStatus
The status of a library on a specific cluster.

STATUS    DESCRIPTION

PENDING No action has yet been taken to install the library. This state
should be very short lived.

RESOLVING Metadata necessary to install the library is being retrieved


from the provided repository.

For Jar, Egg, and Whl libraries, this step is a no-op.

INSTALLING The library is actively being installed, either by adding


resources to Spark or executing system commands inside
the Spark nodes.

INSTALLED The library has been successfully installed.

SKIPPED Installation on a Databricks Runtime 7.0 or above cluster


was skipped due to Scala version incompatibility.

FAILED Some step in installation failed. More information can be


found in the messages field.

UNINSTALL_ON_RESTART The library has been marked for removal. Libraries can be
removed only when clusters are restarted, so libraries that
enter this state will remain until the cluster is restarted.
MLflow API 2.0
7/21/2022 • 2 minutes to read

Azure Databricks provides a managed version of the MLflow tracking server and the Model Registry, which host
the MLflow REST API. You can invoke the MLflow REST API using URLs of the form
https://<databricks-instance>/api/2.0/mlflow/<api-endpoint>

replacing <databricks-instance> with the workspace URL of your Azure Databricks deployment.
MLflow compatibility matrix lists the MLflow release packaged in each Databricks Runtime version and a link to
the respective documentation.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Rate limits
The MLflow APIs are rate limited in four groups, based on their function and maximum throughput. The following list shows the API groups and their respective limits in QPS (queries per second):
Low throughput experiment management (list, update, delete, restore): 7 qps
Search runs: 7 qps
Log batch: 47 qps
All other APIs: 127 qps
In addition, there is a limit of 20 concurrent model versions in Pending status (in creation) per workspace.
If the rate limit is reached, subsequent API calls will return status code 429. All MLflow clients (including the UI)
automatically retry 429s with an exponential backoff.
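Scripts that call the MLflow REST API directly should therefore retry on status code 429 with a backoff, just as the MLflow clients do. The following shell sketch is an illustration only: the experiments/list endpoint and the backoff schedule are assumptions to adapt to your workload.

# Retry an MLflow REST call with exponential backoff when HTTP 429 is returned.
for delay in 1 2 4 8 16; do
  status=$(curl --netrc --silent --output response.json --write-out '%{http_code}' \
    https://<databricks-instance>/api/2.0/mlflow/experiments/list)
  if [ "$status" != "429" ]; then
    break
  fi
  sleep "$delay"
done
cat response.json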
API reference
The MLflow API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Permissions API 2.0
7/21/2022 • 2 minutes to read

IMPORTANT
This feature is in Public Preview.

The Permissions API lets you manage permissions for:


Tokens
Clusters
Pools
Jobs
Notebooks
Folders (directories)
MLflow registered models
The Permissions API is provided as an OpenAPI 3.0 specification that you can download and view as a structured
API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
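For example, reading the permissions set on a cluster might look like the following minimal sketch. The 2.0/permissions/clusters/<cluster-id> endpoint shown here is an assumption, so confirm it against the OpenAPI specification above before use.

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/permissions/clusters/<cluster-id> \
  | jq .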
Repos API 2.0
7/21/2022 • 2 minutes to read

The Repos API allows you to manage Databricks repos programmatically. See Git integration with Databricks
Repos.
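For example, listing the repos you have access to might look like the following minimal sketch. The 2.0/repos endpoint shown here is an assumption, so confirm it against the OpenAPI specification below before use.

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/repos \
  | jq .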
The Repos API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
SCIM API 2.0
7/21/2022 • 2 minutes to read

IMPORTANT
This feature is in Public Preview.

Azure Databricks supports SCIM, or System for Cross-domain Identity Management, an open standard that
allows you to automate user provisioning using a REST API and JSON. The Azure Databricks SCIM API follows
version 2.0 of the SCIM protocol.

Requirements
Your Azure Databricks account must have the Premium Plan.

SCIM 2.0 APIs


An Azure Databricks workspace administrator can invoke all SCIM API endpoints:
SCIM API 2.0 (Me)
SCIM API 2.0 (Users)
SCIM API 2.0 (ServicePrincipals)
SCIM API 2.0 (Groups)
Non-admin users and service principals can invoke the Me Get endpoint, the Users Get endpoint to display
names and IDs, and the Group Get endpoint to display group display names and IDs.
Call workspace SCIM APIs
In the examples for the workspace SCIM APIs, replace <databricks-instance> with the workspace URL of your Azure Databricks deployment.

https://<databricks-instance>/api/2.0/preview/scim/v2/<api-endpoint>

Header parameters
PARAMETER    TYPE    DESCRIPTION

Authorization (required) STRING Set to Bearer <access-token> .

Or: See Authentication using Azure


Databricks personal access tokens,
The .netrc file (if using curl ) Authenticate using Azure Active
Directory tokens, and Token API 2.0 to
learn how to generate tokens.

Important! The Azure Databricks


admin user who generates this token
should not be managed by your
identity provider (IdP). An Azure
Databricks admin user who is
managed by the IdP can be
deprovisioned using the IdP, which
would cause your SCIM provisioning
integration to be disabled.

Instead of an Authorization header,


you can use the .netrc file along with
the --netrc (or -n ) option. This file
stores machine names and tokens
separate from your code and reduces
the need to type credential strings
multiple times. The .netrc contains
one entry for each combination of
<databricks-instance> and token.
For example:

machine <databricks-instance>
login token password <access-
token>

Content-Type (required for write STRING Set to application/scim+json .


operations)

Accept (required for read operations) STRING Set to application/scim+json .

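For example, a SCIM request that passes these headers explicitly, instead of relying on a .netrc file, might look like the following minimal sketch, where <access-token> is your personal access token or Azure Active Directory token:

curl -X GET \
  --header 'Authorization: Bearer <access-token>' \
  --header 'Accept: application/scim+json' \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Users
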
Filter results
Use filters to return a subset of users or groups. All users can filter on the user userName and group displayName fields. Admin users can also filter users on the active attribute.

OPERATOR    DESCRIPTION    BEHAVIOR

eq equals Attribute and operator values must be


identical.

ne not equal to Attribute and operator values are not


identical.

co contains Operator value must be a substring of


attribute value.

sw starts with Attribute must start with and contain


operator value.

and logical AND Match when all expressions evaluate to


true.

or logical OR Match when any expression evaluates


to true.

Sort results
Sort results using the sortBy and sortOrder query parameters. The default is to sort by ID.
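For example, combining a filter with sorting might look like the following minimal sketch. The ascending and descending values for sortOrder follow the SCIM 2.0 protocol and are assumptions here rather than values taken from this article.

curl --netrc -X GET \
  "https://<databricks-instance>/api/2.0/preview/scim/v2/Users?filter=userName+co+example.com&sortBy=userName&sortOrder=descending" \
  | jq .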

List of all SCIM APIs


SCIM API 2.0 (Me)
SCIM API 2.0 (Users)
SCIM API 2.0 (Groups)
SCIM API 2.0 (ServicePrincipals)
For error codes, see SCIM API 2.0 Error Codes.
SCIM API 2.0 (Me)
7/21/2022 • 2 minutes to read

IMPORTANT
This feature is in Public Preview.

Requirements
Your Azure Databricks account must have the Premium Plan.

Get me
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Me    GET

Retrieve the same information about yourself as returned by Get user by ID.
Example

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Me \
  | jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


For error codes, see GET Requests.
SCIM API 2.0 (Users)
7/21/2022 • 6 minutes to read

IMPORTANT
This feature is in Public Preview.

An Azure Databricks administrator can invoke all SCIM API endpoints. Non-admin users can invoke the Get
users endpoint to read user display names and IDs.

NOTE
Each workspace can have a maximum of 10,000 users and 5,000 groups. Service principals count toward the user
maximum.

SCIM (Users) lets you create users in Azure Databricks and give them the proper level of access, temporarily lock
and unlock user accounts, and remove access for users (deprovision them) when they leave your organization
or no longer need access to Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.

Requirements
Your Azure Databricks account must have the Premium Plan.

Get users
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users    GET

Admin users: Retrieve a list of all users in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all users in the Azure Databricks workspace, returning username, user
display name, and object ID only.
Examples
This example gets information about all users.

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Users \
  | jq .

This example uses the eq (equals) filter query parameter with userName to get information about a specific
user.
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Users?filter=userName+eq+<username>" \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .

This example uses a .netrc file and jq.

Get user by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id}    GET

Admin users: Retrieve a single user resource from the Azure Databricks workspace, given their Azure Databricks
ID.
Example
Request

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
  | jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
Response

Create user
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users    POST

Admin users: Create a user in the Azure Databricks workspace.


Request parameters follow the standard SCIM 2.0 protocol.
Requests must include the following attributes:
schemas set to urn:ietf:params:scim:schemas:core:2.0:User
userName

Example
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users \
--header 'Content-type: application/scim+json' \
--data @create-user.json \
| jq .

create-user.json :

{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"groups": [
{
"value":"123456"
}
],
"entitlements":[
{
"value":"allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .

This example uses a .netrc file and jq.

Update user by ID ( PATCH )


ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id}    PATCH

Admin users: Update a user resource with operations on specific attributes, except those that are immutable (
userName and userId ). The PATCH method is recommended over the PUT method for setting or updating user
entitlements.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Example
This example adds the allow-cluster-create entitlement to the specified user.

curl --netrc -X PATCH \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
  --header 'Content-type: application/scim+json' \
  --data @update-user.json \
  | jq .

update-user.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.

Update user by ID ( PUT )


ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id}    PUT

Admin users: Overwrite the user resource across multiple attributes, except those that are immutable ( userName
and userId ).
Request must include the schemas attribute, set to urn:ietf:params:scim:schemas:core:2.0:User .

NOTE
The PATCH method is recommended over the PUT method for setting or updating user entitlements.

Example
This example changes the specified user’s previous entitlements to now have only the allow-cluster-create
entitlement.

curl --netrc -X PUT \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
  --header 'Content-type: application/scim+json' \
  --data @overwrite-user.json \
  | jq .

overwrite-user.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"entitlements": [
{
"value": "allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
<username> with the Azure Databricks workspace username of the user, for example someone@example.com . To
get the username, call Get users.
This example uses a .netrc file and jq.

Delete user by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id}    DELETE

Admin users: Remove a user resource. A user that does not own or belong to a workspace in Azure Databricks is
automatically purged after 30 days.
Deleting a user from a workspace also removes objects associated with the user. For example, notebooks are
archived, clusters are terminated, and jobs become ownerless.
The user’s home directory is not automatically deleted. Only an administrator can access or remove a deleted
user’s home directory.
The access control list (ACL) configuration of a user is preserved even after that user is removed from a
workspace.
Example request

curl --netrc -X DELETE \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id>

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file.

Activate and deactivate user by ID


IMPORTANT
This feature is in Public Preview.

ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id}    PATCH

Admin users: Activate or deactivate a user. Deactivating a user removes all access to a workspace for that user
but leaves permissions and objects associated with the user unchanged. Clusters associated with the user keep
running, and notebooks remain in their original locations. The user’s tokens are retained but cannot be used to
authenticate while the user is deactivated. Scheduled jobs, however, fail unless assigned to a new owner.
You can use the Get users and Get user by ID requests to view whether users are active or inactive.

NOTE
Allow at least five minutes for the cache to be cleared for deactivation to take effect.

IMPORTANT
An Azure Active Directory (Azure AD) user with the Contributor or Owner role on the Azure Databricks subscription
can reactivate themselves using the Azure AD login flow. If a user with one of these roles needs to be deactivated, you
should also revoke their privileges on the subscription.

Set the active value to false to deactivate a user and true to activate a user.
Example
Request

curl --netrc -X PATCH \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
  --header 'Content-type: application/scim+json' \
  --data @toggle-user-activation.json \
  | jq .

toggle-user-activation.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "replace",
"path": "active",
"value": [
{
"value": "false"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 .

This example uses a .netrc file and jq.


Response

{
"emails": [
{
"type": "work",
"value": "someone@example.com",
"primary": true
}
],
"displayName": "Someone User",
"schemas": [
"urn:ietf:params:scim:schemas:core:2.0:User",
"urn:ietf:params:scim:schemas:extension:workspace:2.0:User"
],
"name": {
"familyName": "User",
"givenName": "Someone"
},
"active": false,
"groups": [],
"id": "123456",
"userName": "someone@example.com"
}

Filter active and inactive users


IMPORTANT
This feature is in Public Preview.

ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users    GET

Admin users: Retrieve a list of active or inactive users.


Example

curl --netrc -X GET \
  "https://<databricks-instance>/api/2.0/preview/scim/v2/Users?filter=active+eq+false" \
  | jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.

Automatically deactivate users


IMPORTANT
This feature is in Public Preview.

Admin users: Deactivate users that have not logged in for a customizable period. Scheduled jobs owned by a
user are also considered activity.

ENDPOINT    HTTP METHOD
2.0/preview/workspace-conf    PATCH

The request body is a key-value pair where the value is the time limit for how long a user can be inactive before
being automatically deactivated.
Example

curl --netrc -X PATCH \
  https://<databricks-instance>/api/2.0/preview/workspace-conf \
  --data @deactivate-users.json \
  | jq .

deactivate-users.json :

{
"maxUserInactiveDays": "90"
}

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.

Get the maximum user inactivity period of a workspace


IMPORTANT
This feature is in Public Preview.

Admin users: Retrieve the user inactivity limit defined for a workspace.

ENDPOINT    HTTP METHOD
2.0/preview/workspace-conf    GET

Example request

curl --netrc -X GET \
  "https://<databricks-instance>/api/2.0/preview/workspace-conf?keys=maxUserInactiveDays" \
  | jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.
Example response

{
"maxUserInactiveDays": "90"
}
SCIM API 2.0 (Groups)
7/21/2022 • 3 minutes to read

IMPORTANT
This feature is in Public Preview.

Requirements
Your Azure Databricks account must have the Premium Plan.

NOTE
An Azure Databricks administrator can invoke all SCIM API endpoints.
Non-admin users can invoke the Get groups endpoint to read group display names and IDs.
You can have no more than 10,000 users and 5,000 groups in a workspace.

SCIM (Groups) lets you create users and groups in Azure Databricks and give them the proper level of access
and remove access for groups (deprovision them).
For error codes, see SCIM API 2.0 Error Codes.

Get groups
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups    GET

Admin users: Retrieve a list of all groups in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all groups in the Azure Databricks workspace, returning group display name
and object ID only.
Examples

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
  | jq .

You can use filters to specify subsets of groups. For example, you can apply the sw (starts with) filter parameter
to displayName to retrieve a specific group or set of groups. This example retrieves all groups with a
displayName field that start with my- .

curl --netrc -X GET \
  "https://<databricks-instance>/api/2.0/preview/scim/v2/Groups?filter=displayName+sw+my-" \
  | jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.

Get group by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups/{id}    GET

Admin users: Retrieve a single group resource.


Example request

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id> \
  | jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.

Create group
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups    POST

Admin users: Create a group in Azure Databricks.


Request parameters follow the standard SCIM 2.0 protocol.
Requests must include the following attributes:
schemas set to urn:ietf:params:scim:schemas:core:2.0:Group
displayName

Members list is optional and can include users and other groups. You can also add members to a group using
PATCH .
Example

curl --netrc -X POST \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
  --header 'Content-type: application/scim+json' \
  --data @create-group.json \
  | jq .

create-group.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:Group" ],
"displayName": "<group-name>",
"members": [
{
"value":"<user-id>"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-name> with the name of the group in the Azure Databricks workspace, for example my-group .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.

Update group
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups/{id}    PATCH

Admin users: Update a group in Azure Databricks by adding or removing members. Can add and remove
individual members or groups within the group.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.

NOTE
Azure Databricks does not support updating group names.

Example

curl --netrc -X PATCH \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id> \
  --header 'Content-type: application/scim+json' \
  --data @update-group.json \
  | jq .

Add to group
update-group.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op":"add",
"value": {
"members": [
{
"value":"<user-id>"
}
]
}
}
]
}

Remove from group


update-group.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<user-id>\"]"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.

Delete group
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups/{id}    DELETE

Admin users: Remove a group from Azure Databricks. Users in the group are not removed.
Example

curl --netrc -X DELETE \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id>

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file.
SCIM API 2.0 (ServicePrincipals)
7/21/2022 • 6 minutes to read

IMPORTANT
This feature is in Public Preview.

SCIM (ServicePrincipals) lets you manage Azure Active Directory service principals in Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.
For additional examples, see Service principals for Azure Databricks automation.

Requirements
Your Azure Databricks account must have the Premium Plan.

Get service principals


ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals    GET

Retrieve a list of all service principals in the Azure Databricks workspace.


When invoked by a non-admin user, only the username, user display name, and object ID are returned.
Examples

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals \
  | jq .

You can use filters to specify subsets of service principals. For example, you can apply the eq (equals) filter
parameter to applicationId to retrieve a specific service principal:

curl --netrc -X GET \
  "https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals?filter=applicationId+eq+<application-id>" \
  | jq .

In workspaces with a large number of service principals, you can exclude attributes from the request to improve
performance.

curl --netrc -X GET \
  "https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals?excludedAttributes=entitlements,groups" \
  | jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .

These examples use a .netrc file and jq.

Get service principal by ID


ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals/{id}    GET

Retrieve a single service principal resource from the Azure Databricks workspace, given a service principal ID.
Example

curl --netrc -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
  | jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.

Add service principal


ENDPOINT    HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals POST

Add an Azure Active Directory (Azure AD) service principal to the Azure Databricks workspace. To use a service principal in Azure Databricks, you must first create an application in Azure Active Directory and then add it to your Azure Databricks workspace. Service principals count toward the limit of 10,000 users per workspace.
Request parameters follow the standard SCIM 2.0 protocol.
Example

curl --netrc -X POST \


https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals \
--header 'Content-type: application/scim+json' \
--data @add-service-principal.json \
| jq .

add-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<azure-application-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<azure-application-id> with the application ID of the Azure Active Directory (Azure AD) application, for
example 12345a67-8b9c-0d1e-23fa-4567b89cde01
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.

Update service principal by ID (PATCH)


ENDPOINT    HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals/{id} PATCH

Update a service principal resource with operations on specific attributes, except for applicationId and id ,
which are immutable.
Use the PATCH method to add, update, or remove individual attributes. Use the PUT method to overwrite the
entire service principal in a single operation.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Add entitlements
Example

curl --netrc -X PATCH \


https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @change-service-principal.json \
| jq .

change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Remove entitlements
Example

curl --netrc -X PATCH \


https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @change-service-principal.json \
| jq .

change-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Add to a group
Example

curl --netrc -X PATCH \


https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @change-service-principal.json \
| jq .

change-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "groups",
"value": [
{
"value": "<group-id>"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove from a group
Example

curl --netrc -X PATCH \


https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id> \
--header 'Content-type: application/scim+json' \
--data @remove-from-group.json \
| jq .

remove-from-group.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<service-principal-id>\"]"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.

Update service principal by ID (PUT)


ENDPOINT    HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals/{id} PUT

Overwrite the entire service principal resource, except for applicationId and id , which are immutable.
Use the PATCH method to add, update, or remove individual attributes.

IMPORTANT
You must include the schemas attribute in the request, with the exact value urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal .

Examples
Add an entitlement

curl --netrc -X PUT \


https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @update-service-principal.json \
| jq .

update-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<appliation-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove all entitlements and groups
Removing all entitlements and groups is a reversible alternative to deactivating the service principal.
Use the PUT method to avoid the need to check the existing entitlements and group memberships first.

curl --netrc -X PUT \


https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @update-service-principal.json \
| jq .

update-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<application-id>",
"displayName": "<display-name>",
"groups": [],
"entitlements": []
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .

This example uses a .netrc file and jq.

Deactivate service principal by ID


ENDPOINT    HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals/{id} DELETE

Deactivate a service principal resource. This operation isn’t reversible.


Example

curl --netrc -X DELETE \


https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id>
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file.
As a reversible alternative, you can remove all of its entitlements and groups instead of deleting the service
principal.
A service principal that does not own or belong to an Azure Databricks workspace is automatically purged after
30 days.
Secrets API 2.0

The Secrets API allows you to manage secrets, secret scopes, and access permissions. To manage secrets, you
must:
1. Create a secret scope.
2. Add your secrets to the scope.
3. If you have the Premium Plan, assign access control to the secret scope.
To learn more about creating and managing secrets, see Secret management and Secret access control. You
access and reference secrets in notebooks and jobs by using Secrets utility (dbutils.secrets).
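For example, in a Python notebook cell attached to a cluster, a secret stored through this API can be read at runtime (the scope and key names below are placeholders):

# dbutils is provided automatically in Databricks notebooks; this does not run outside a cluster.
jdbc_password = dbutils.secrets.get(scope="my-databricks-scope", key="my-string-key")
# The value is redacted in notebook output, but it can be passed to libraries and connectors.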

IMPORTANT
To access Databricks REST APIs, you must authenticate. To use the Secrets API with Azure Key Vault secrets, you must
authenticate using an Azure Active Directory token.

Create secret scope


ENDPOINT    HTTP METHOD

2.0/secrets/scopes/create POST

You can either:


Create an Azure Key Vault-backed scope in which secrets are stored in Azure-managed storage and
encrypted with a cloud-based specific encryption key.
Create a Databricks-backed secret scope in which secrets are stored in Databricks-managed storage and
encrypted with a cloud-based specific encryption key.
Create an Azure Key Vault-backed scope
The scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace. By default, a workspace
is limited to a maximum of 100 secret scopes. To increase this maximum for a workspace, contact your
Databricks representative.
Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/secrets/scopes/create \
--header "Content-Type: application/json" \
--header "Authorization: Bearer <token>" \
--header "X-Databricks-Azure-SP-Management-Token: <management-token>" \
--data @create-scope.json
create-scope.json :

{
"scope": "my-simple-azure-keyvault-scope",
"scope_backend_type": "AZURE_KEYVAULT",
"backend_azure_keyvault":
{
"resource_id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/azure-
rg/providers/Microsoft.KeyVault/vaults/my-azure-kv",
"dns_name": "https://my-azure-kv.vault.azure.net/"
},
"initial_manage_principal": "users"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<token> with your Azure Databricks personal access token. For more information, see Authentication using
Azure Databricks personal access tokens.
<management-token> with your Azure Active Directory token. For more information, see Get Azure AD tokens
by using the Microsoft Authentication Library.
The contents of create-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
If initial_manage_principal is specified, the initial ACL applied to the scope grants MANAGE permission to the supplied principal (user, service principal, or group). The only supported principal for this option is the group users , which contains all users in the workspace. If initial_manage_principal is not specified, the initial ACL with MANAGE permission is assigned to the API request issuer’s user identity.
Throws RESOURCE_ALREADY_EXISTS if a scope with the given name already exists. Throws RESOURCE_LIMIT_EXCEEDED if the maximum number of scopes in the workspace is exceeded. Throws INVALID_PARAMETER_VALUE if the scope name is invalid.
For more information, see Create an Azure Key Vault-backed secret scope using the Databricks CLI.
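A rough Python equivalent of the curl call above, shown only as a sketch: it sends the same two tokens as headers using the requests library, and the environment variable names are assumptions made for this example.

import os
import requests

host = os.environ["DATABRICKS_HOST"]                 # workspace URL
pat = os.environ["DATABRICKS_TOKEN"]                 # Azure Databricks personal access token
management_token = os.environ["AZURE_MGMT_TOKEN"]    # Azure Active Directory token

response = requests.post(
    f"{host}/api/2.0/secrets/scopes/create",
    headers={
        "Authorization": f"Bearer {pat}",
        "X-Databricks-Azure-SP-Management-Token": management_token,
    },
    json={
        "scope": "my-simple-azure-keyvault-scope",
        "scope_backend_type": "AZURE_KEYVAULT",
        "backend_azure_keyvault": {
            "resource_id": "<azure-key-vault-resource-id>",
            "dns_name": "<azure-key-vault-dns-name>",
        },
        "initial_manage_principal": "users",
    },
)
response.raise_for_status()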
Create a Databricks-backed secret scope
The scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace. A workspace is limited
to a maximum of 100 secret scopes.
Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/secrets/scopes/create \
--data @create-scope.json

create-scope.json :
{
"scope": "my-simple-databricks-scope",
"initial_manage_principal": "users"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of create-scope.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Throws RESOURCE_ALREADY_EXISTS if a scope with the given name already exists. Throws RESOURCE_LIMIT_EXCEEDED if the maximum number of scopes in the workspace is exceeded. Throws INVALID_PARAMETER_VALUE if the scope name is invalid.
Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    Scope name requested by the user. Scope names are unique. This field is required.

initial_manage_principal    STRING    This field is optional. If not specified, only the API request issuer’s identity is granted MANAGE permissions on the new scope. If the string users is specified, all users in the workspace are granted MANAGE permissions.

Delete secret scope


ENDPOINT    HTTP METHOD

2.0/secrets/scopes/delete POST

Delete a secret scope.


Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/secrets/scopes/delete \
--data @delete-scope.json

delete-scope.json :

{
"scope": "my-secret-scope"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
Throws RESOURCE_DOES_NOT_EXIST if the scope does not exist. Throws PERMISSION_DENIED if the user does not
have permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    Name of the scope to delete. This field is required.

List secret scopes


ENDPOINT    HTTP METHOD

2.0/secrets/scopes/list GET

List all secret scopes available in the workspace.


Example
Request

curl --netrc --request GET \


https://<databricks-instance>/api/2.0/secrets/scopes/list \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


Response

{
"scopes": [
{
"name": "my-databricks-scope",
"backend_type": "DATABRICKS"
},
{
"name": "mount-points",
"backend_type": "DATABRICKS"
}
]
}

Throws PERMISSION_DENIED if you do not have permission to make this API call.
Response structure
FIELD NAME    TYPE    DESCRIPTION

scopes    An array of SecretScope    The available secret scopes.


Put secret
The method for creating or modifying a secret depends on the type of scope backend. To create or modify a
secret in a scope backed by Azure Key Vault, use the Azure SetSecret REST API. To create or modify a secret from
a Databricks-backed scope, use the following endpoint:

ENDPOINT    HTTP METHOD

2.0/secrets/put POST

Insert a secret under the provided scope with the given name. If a secret already exists with the same name, this
command overwrites the existing secret’s value. The server encrypts the secret using the secret scope’s
encryption settings before storing it. You must have WRITE or MANAGE permission on the secret scope.
The secret key must consist of alphanumeric characters, dashes, underscores, and periods, and cannot exceed
128 characters. The maximum allowed secret value size is 128 KB. The maximum number of secrets in a given
scope is 1000.
You can read a secret value only from within a command on a cluster (for example, through a notebook); there is
no API to read a secret value outside of a cluster. The permission applied is based on who is invoking the
command and you must have at least READ permission.
Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/secrets/put \
--data @put-secret.json

put-secret.json :

{
"scope": "my-databricks-scope",
"key": "my-string-key",
"string_value": "my-value"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret.json with fields that are appropriate for your solution.

This example uses a .netrc file.


The input fields “string_value” or “bytes_value” specify the type of the secret, which will determine the value
returned when the secret value is requested. Exactly one must be specified.
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws RESOURCE_LIMIT_EXCEEDED if the maximum number of secrets in the scope is exceeded. Throws INVALID_PARAMETER_VALUE if the key name or value length is invalid. Throws PERMISSION_DENIED if the user does not have permission to make this API call.
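The same call from Python, as a minimal sketch using the requests library (the environment variable names are assumptions made for this example):

import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

response = requests.post(
    f"{host}/api/2.0/secrets/put",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "scope": "my-databricks-scope",
        "key": "my-string-key",
        "string_value": "my-value",   # set bytes_value instead to store binary content; specify exactly one
    },
)
if response.status_code != 200:
    # Error responses include an error_code such as RESOURCE_DOES_NOT_EXIST or PERMISSION_DENIED.
    print(response.json())
response.raise_for_status()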
Request structure
FIELD NAME    TYPE    DESCRIPTION

string_value OR bytes_value    STRING OR BYTES    If string_value is specified, the value will be stored in UTF-8 (MB4) form. If bytes_value is specified, the value will be stored as bytes.

scope    STRING    The name of the scope that the secret will be associated with. This field is required.

key    STRING    A unique name to identify the secret. This field is required.

Delete secret
The method for deleting a secret depends on the type of scope backend. To delete a secret from a scope backed by Azure Key Vault, use the Azure Key Vault Delete Secret REST API. To delete a secret from a Databricks-backed scope, use the following endpoint:

ENDPOINT    HTTP METHOD

2.0/secrets/delete POST

Delete the secret stored in this secret scope. You must have WRITE or MANAGE permission on the secret scope.
Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/secrets/delete \
--data @delete-secret.json

delete-secret.json :

{
"scope": "my-secret-scope",
"key": "my-secret-key"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Throws RESOURCE_DOES_NOT_EXIST if no such secret scope or secret exists. Throws PERMISSION_DENIED if you do
not have permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    The name of the scope that contains the secret to delete. This field is required.

key    STRING    Name of the secret to delete. This field is required.

List secrets
ENDPOINT    HTTP METHOD

2.0/secrets/list GET

List the secret keys that are stored at this scope. This is a metadata-only operation; you cannot retrieve secret
data using this API. You must have READ permission to make this call.
Example
Request

curl --netrc --request GET \


'https://<databricks-instance>/api/2.0/secrets/list?scope=<scope-name>' \
| jq .

Or:

curl --netrc --get \


https://<databricks-instance>/api/2.0/secrets/list \
--data scope=<scope-name> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .

This example uses a .netrc file and jq.


Response

{
"secrets": [
{
"key": "my-string-key",
"last_updated_timestamp": 1520467595000
},
{
"key": "my-byte-key",
"last_updated_timestamp": 1520467595000
}
]
}

The last_updated_timestamp returned is in milliseconds since epoch.
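For example, to convert the millisecond timestamp to a readable UTC time in Python:

from datetime import datetime, timezone

last_updated_ms = 1520467595000
print(datetime.fromtimestamp(last_updated_ms / 1000, tz=timezone.utc))
# 2018-03-08 00:06:35+00:00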


Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    The name of the scope whose secrets you want to list. This field is required.

Response structure
FIELD NAME    TYPE    DESCRIPTION

secrets    An array of SecretMetadata    Metadata information of all secrets contained within the given scope.

Put secret ACL


ENDPOINT    HTTP METHOD

2.0/secrets/acls/put POST

Create or overwrite the ACL associated with the given principal (user, service principal, or group) on the
specified scope point. In general, a user, service principal, or group will use the most powerful permission
available to them, and permissions are ordered as follows:
MANAGE - Allowed to change ACLs, and read and write to this secret scope.
WRITE - Allowed to read and write to this secret scope.
READ - Allowed to read this secret scope and list what secrets are available.

You must have the MANAGE permission to invoke this API.


Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/secrets/acls/put \
--data @put-secret-acl.json

put-secret-acl.json :

{
"scope": "my-secret-scope",
"principal": "data-scientists",
"permission": "READ"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret-acl.json with fields that are appropriate for your solution.

This example uses a .netrc file.


The principal field specifies an existing Azure Databricks principal to be granted or revoked access using the
unique identifier of that principal. A user is specified with their email, a service principal with its applicationId
value, and a group with its group name.
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws RESOURCE_ALREADY_EXISTS if a permission
for the principal already exists. Throws INVALID_PARAMETER_VALUE if the permission is invalid. Throws
PERMISSION_DENIED if you do not have permission to make this API call.

Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    The name of the scope to apply permissions to. This field is required.

principal    STRING    The principal to which the permission is applied. This field is required.

permission    AclPermission    The permission level applied to the principal. This field is required.

Delete secret ACL


ENDPOINT    HTTP METHOD

2.0/secrets/acls/delete POST

Delete the given ACL on the given scope.


You must have the MANAGE permission to invoke this API.
Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/secrets/acls/delete \
--data @delete-secret-acl.json

delete-secret-acl.json :

{
"scope": "my-secret-scope",
"principal": "data-scientists"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret-acl.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Throws RESOURCE_DOES_NOT_EXIST if no such secret scope, principal, or ACL exists. Throws PERMISSION_DENIED if
you do not have permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    The name of the scope to remove permissions from. This field is required.

principal    STRING    The principal to remove an existing ACL from. This field is required.

Get secret ACL


ENDPOINT    HTTP METHOD

2.0/secrets/acls/get GET

Describe the details about the given ACL, such as the group and permission.
You must have the MANAGE permission to invoke this API.
Example
Request

curl --netrc --request GET \


'https://<databricks-instance>/api/2.0/secrets/acls/get?scope=<scope-name>&principal=<principal-name>' \
| jq .

Or:

curl --netrc --get \


https://<databricks-instance>/api/2.0/secrets/acls/get \
--data 'scope=<scope-name>&principal=<principal-name>' \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
<principal-name> with the name of the principal, for example users .

This example uses a .netrc file and jq.


Response

{
"principal": "data-scientists",
"permission": "READ"
}

Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    The name of the scope to fetch ACL information from. This field is required.

principal    STRING    The principal to fetch ACL information for. This field is required.

Response structure
FIELD NAME    TYPE    DESCRIPTION

principal    STRING    The principal to which the permission is applied. This field is required.

permission    AclPermission    The permission level applied to the principal. This field is required.

List secret ACLs


ENDPOINT    HTTP METHOD

2.0/secrets/acls/list GET

List the ACLs set on the given scope.


You must have the MANAGE permission to invoke this API.
Example
Request

curl --netrc --request GET \


'https://<databricks-instance>/api/2.0/secrets/acls/list?scope=<scope-name>' \
| jq .

Or:

curl --netrc --get \


https://<databricks-instance>/api/2.0/secrets/acls/list \
--data scope=<scope-name> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .

This example uses a .netrc file and jq.


Response
{
"items": [
{
"principal": "admins",
"permission": "MANAGE"
},
{
"principal": "data-scientists",
"permission": "READ"
}
]
}

Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION

scope    STRING    The name of the scope to fetch ACL information from. This field is required.

Response structure
FIELD NAME    TYPE    DESCRIPTION

items    An array of AclItem    The ACL rules applied to principals in the given scope.

Data structures
In this section:
AclItem
SecretMetadata
SecretScope
AclPermission
ScopeBackendType
AclItem
An item representing an ACL rule applied to the given principal (user, service principal, or group) on the
associated scope point.

FIELD NAME    TYPE    DESCRIPTION

principal    STRING    The principal to which the permission is applied. This field is required.

permission    AclPermission    The permission level applied to the principal. This field is required.

SecretMetadata
The metadata about a secret. Returned when listing secrets. Does not contain the actual secret value.
FIELD NAME    TYPE    DESCRIPTION

key    STRING    A unique name to identify the secret.

last_updated_timestamp    INT64    The last updated timestamp (in milliseconds) for the secret.

SecretScope
An organizational resource for storing secrets. Secret scopes can be different types, and ACLs can be applied to
control permissions for all secrets within a scope.

FIELD NAME    TYPE    DESCRIPTION

name    STRING    A unique name to identify the secret scope.

backend_type    ScopeBackendType    The type of secret scope backend.

AclPermission
The ACL permission levels for secret ACLs applied to secret scopes.

PERMISSION    DESCRIPTION

READ    Allowed to perform read operations (get, list) on secrets in this scope.

WRITE    Allowed to read and write secrets to this secret scope.

MANAGE    Allowed to read/write ACLs, and read/write secrets to this secret scope.

ScopeBackendType
The type of secret scope backend.

TYPE    DESCRIPTION

AZURE_KEYVAULT    A secret scope in which secrets are stored in an Azure Key Vault.

DATABRICKS    A secret scope in which secrets are stored in Databricks-managed storage and encrypted with a cloud-based specific encryption key.
Token API 2.0

The Token API allows you to create, list, and revoke tokens that can be used to authenticate and access Azure
Databricks REST APIs.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create
ENDPOINT    HTTP METHOD

2.0/token/create POST

Create and return a token. This call returns the error QUOTA_EXCEEDED if the current number of non-expired
tokens exceeds the token quota. The token quota for a user is 600.
Example
Request

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/token/create \
--data '{ "comment": "This is an example token", "lifetime_seconds": 7776000 }' \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This is an example token with a description to attach to the token.
7776000 with the lifetime of the token, in seconds. This example specifies 90 days.

This example uses a .netrc file and jq.


Response

{
"token_value": "dapi1a2b3c45d67890e1f234567a8bc9012d",
"token_info": {
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
}
}
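The same request from Python, computing the 90-day lifetime explicitly. This is a sketch using the requests library; the environment variable names are assumptions made for this example.

import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

lifetime_seconds = 90 * 24 * 60 * 60   # 90 days = 7776000 seconds

response = requests.post(
    f"{host}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {token}"},
    json={"comment": "This is an example token", "lifetime_seconds": lifetime_seconds},
)
response.raise_for_status()
new_token = response.json()["token_value"]   # store this securely; the value is returned only at creation time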

Request structure
FIELD NAME    TYPE    DESCRIPTION

lifetime_seconds    LONG    The lifetime of the token, in seconds. If no lifetime is specified, the token remains valid indefinitely.

comment    STRING    Optional description to attach to the token.

Response structure
FIELD NAME    TYPE    DESCRIPTION

token_value    STRING    The value of the newly created token.

token_info    Public token info    The public metadata of the newly created token.

List
ENDPOINT    HTTP METHOD

2.0/token/list GET

List all the valid tokens for a user-workspace pair.


Example
Request

curl --netrc --request GET \


https://<databricks-instance>/api/2.0/token/list \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


Response

{
"token_infos": [
{
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
},
{
"token_id": "2345678901a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c4",
"creation_time": 1626286906596,
"expiry_time": 1634062906596,
"comment": "This is another example token"
}
]
}
Response structure
FIELD NAME    TYPE    DESCRIPTION

token_infos    An array of Public token info    A list of token information for a user-workspace pair.

Revoke
ENDPOINT    HTTP METHOD

2.0/token/delete POST

Revoke an access token. This call returns the error RESOURCE_DOES_NOT_EXIST if a token with the specified ID is not
valid.
Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/token/delete \
--data '{ "token_id": "<token-id>" }'

This example uses a .netrc file.


Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<token-id> with the ID of the token, for example
1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3 .

Request structure
FIELD NAME    TYPE    DESCRIPTION

token_id    STRING    The ID of the token to be revoked.

Data structures
In this section:
Public token info
Public token info
A data structure that describes the public metadata of an access token.

FIELD NAME    TYPE    DESCRIPTION

token_id    STRING    The ID of the token.

creation_time    LONG    Server time (in epoch milliseconds) when the token was created.

expiry_time    LONG    Server time (in epoch milliseconds) when the token will expire, or -1 if not applicable.

comment    STRING    Comment the token was created with, if applicable.
Token Management API 2.0

The Token Management API lets Azure Databricks administrators manage their users’ Azure Databricks personal
access tokens. As an admin, you can:
Monitor and revoke users’ personal access tokens.
Control the lifetime of future tokens in your workspace.
You can also control which users can create and use tokens via the Permissions API 2.0 or in the Admin Console.
The Token Management API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Workspace API 2.0

The Workspace API allows you to list, import, export, and delete notebooks and folders. The maximum allowed size of a request to the Workspace API is 10 MB. See Cluster log delivery examples for a how-to guide on this API.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Delete
ENDPOINT    HTTP METHOD

2.0/workspace/delete POST

Delete an object or a directory (and, optionally, recursively delete all objects in the directory). If path does not exist, this call returns the error RESOURCE_DOES_NOT_EXIST . If path is a non-empty directory and recursive is set to false , this call returns the error DIRECTORY_NOT_EMPTY . Object deletion cannot be undone, and deleting a directory recursively is not atomic.
Example
Request:

curl --netrc --request POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/delete \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder", "recursive": true }'

If successful, this endpoint returns no response.


Request structure
FIELD NAME    TYPE    DESCRIPTION

path    STRING    The absolute path of the notebook or directory. This field is required.

recursive    BOOL    The flag that specifies whether to delete the object recursively. It is false by default. Note that deleting a directory recursively is not atomic; if the operation fails partway through, some objects under the directory may already have been deleted, and that cannot be undone.

Export
ENDPOINT    HTTP METHOD

2.0/workspace/export GET

Export a notebook or contents of an entire directory. You can also export a Databricks Repo, or a notebook or
directory from a Databricks Repo. You cannot export non-notebook files from a Databricks Repo. If path does
not exist, this call returns an error RESOURCE_DOES_NOT_EXIST . You can export a directory only in DBC format. If
the exported data exceeds the size limit, this call returns an error MAX_NOTEBOOK_SIZE_EXCEEDED . This API does not
support exporting a library.
Example
Request:

curl --netrc --request GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/export \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder/MyNotebook", "format": "SOURCE", "direct_download": true
}'

curl --netrc --request GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/export \
--header 'Accept: application/json' \
--data '{ "path": "/Repos/me@example.com/MyFolder/MyNotebook", "format": "SOURCE", "direct_download": true
}'

Response:
If the direct_download field was set to false or was omitted from the request, a base64-encoded version of the
content is returned, for example:

{
"content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKMSsx",
}

Otherwise, if direct_download was set to true in the request, the content is downloaded.
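If you call the endpoint from Python and leave direct_download unset, you can decode the content field yourself. The following is a sketch using the requests library; the environment variable names are assumptions made for this example.

import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

response = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users/me@example.com/MyFolder/MyNotebook", "format": "SOURCE"},
)
response.raise_for_status()
source = base64.b64decode(response.json()["content"])
# The appropriate file extension depends on the notebook language.
with open("MyNotebook.py", "wb") as f:
    f.write(source)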
Request structure
FIELD NAME    TYPE    DESCRIPTION

path    STRING    The absolute path of the notebook or directory. Exporting a directory is supported only for DBC . This field is required.

format    ExportFormat    This specifies the format of the exported file. By default, this is SOURCE . The value is case sensitive.

direct_download    BOOL    Flag to enable direct download. If it is true , the response will be the exported file itself. Otherwise, the response contains the content as a base64-encoded string. See Export a notebook or folder for more information about how to use it.

Response structure
FIELD NAME    TYPE    DESCRIPTION

content    BYTES    The base64-encoded content. If the limit (10 MB) is exceeded, an exception with error code MAX_NOTEBOOK_SIZE_EXCEEDED is thrown.

Get status
ENDPOINT    HTTP METHOD

2.0/workspace/get-status GET

Gets the status of an object or a directory. If path does not exist, this call returns an error
RESOURCE_DOES_NOT_EXIST .

Example
Request:

curl --netrc --request GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/get-status \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder/MyNotebook" }'

Response:

{
"object_type": "NOTEBOOK",
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"language": "PYTHON",
"object_id": 123456789012345
}

Request structure
FIELD NAME    TYPE    DESCRIPTION

path    STRING    The absolute path of the notebook or directory. This field is required.

Response structure
FIELD NAME    TYPE    DESCRIPTION

object_type    ObjectType    The type of the object.

object_id    INT64    Unique identifier for the object.

path    STRING    The absolute path of the object.

language    Language    The language of the object. This value is set only if the object type is NOTEBOOK .

Import
ENDPOINT    HTTP METHOD

2.0/workspace/import POST

Import a notebook or the contents of an entire directory. If path already exists and overwrite is set to false ,
this call returns an error RESOURCE_ALREADY_EXISTS . You can use only DBC format to import a directory.
Example
Import a base64-encoded string:

curl --netrc --request POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/import \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder/MyNotebook", "content":
"Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKMSsx", "language": "PYTHON", "overwrite": true, "format": "SOURCE"
}'

Import a local file:

curl --netrc --request POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/import \
--header 'Content-Type: multipart/form-data' \
--form path=/Users/me@example.com/MyFolder/MyNotebook \
--form content=@myCode.py.zip

If successful, this endpoint returns no response.
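A Python sketch of the base64 variant, using the requests library (environment variable names are assumptions made for this example):

import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

with open("MyNotebook.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Users/me@example.com/MyFolder/MyNotebook",
        "format": "SOURCE",
        "language": "PYTHON",
        "overwrite": True,
        "content": content,
    },
)
response.raise_for_status()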


Request structure
FIELD NAME    TYPE    DESCRIPTION

path    STRING    The absolute path of the notebook or directory. Importing a directory is supported only for the DBC format. This field is required.

format    ExportFormat    This specifies the format of the file to be imported. By default, this is SOURCE . The value is case sensitive.

language    Language    The language. If format is set to SOURCE , this field is required; otherwise, it will be ignored.

content    BYTES    The base64-encoded content. This has a limit of 10 MB. If the limit (10 MB) is exceeded, an exception with error code MAX_NOTEBOOK_SIZE_EXCEEDED is thrown. This parameter might be absent, and instead a posted file will be used. See Import a notebook or directory for more information about how to use it.

overwrite    BOOL    The flag that specifies whether to overwrite an existing object. It is false by default. For the DBC format, overwrite is not supported since it may contain a directory.

List
ENDPOINT    HTTP METHOD

2.0/workspace/list GET

List the contents of a directory, or the object if it is not a directory. If the input path does not exist, this call
returns an error RESOURCE_DOES_NOT_EXIST .
Example
List directories and their contents:
Request:

curl --netrc --request GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/list \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com" }'

Response:
{
"objects": [
{
"path": "/Users/me@example.com/MyFolder",
"object_type": "DIRECTORY",
"object_id": 234567890123456
},
{
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"object_type": "NOTEBOOK",
"language": "PYTHON",
"object_id": 123456789012345
},
{
"..."
}
]
}

List repos:

curl --netrc --request GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/list \
--header 'Accept: application/json' \
--data '{ "path": "/Repos/me@example.com" }'

Response:

{
"objects": [
{
"path": "/Repos/me@example.com/MyRepo1",
"object_type": "REPO",
"object_id": 234567890123456
},
{
"path": "/Repos/me@example.com/MyRepo2",
"object_type": "REPO",
"object_id": 123456789012345
},
{
"..."
}
]
}

Request structure
FIELD NAME    TYPE    DESCRIPTION

path    STRING    The absolute path of the notebook or directory. This field is required.

Response structure
FIELD NAME    TYPE    DESCRIPTION

objects    An array of ObjectInfo    List of objects.


Mkdirs
ENDPOINT    HTTP METHOD

2.0/workspace/mkdirs POST

Create the given directory and necessary parent directories if they do not exist. If there is an object (not a directory) at any prefix of the input path, this call returns the error RESOURCE_ALREADY_EXISTS . If this operation fails, it may have succeeded in creating some of the necessary parent directories.
Example
Request:

curl --netrc --request POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/mkdirs \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder" }'

If successful, this endpoint returns no response.


Request structure
FIELD NAME    TYPE    DESCRIPTION

path    STRING    The absolute path of the directory. If the parent directories do not exist, they will also be created. If the directory already exists, this command will do nothing and succeed. This field is required.

Data structures
In this section:
ObjectInfo
ExportFormat
Language
ObjectType
ObjectInfo
Information about an object in the workspace, as returned by list and get-status .

FIELD NAME    TYPE    DESCRIPTION

object_type    ObjectType    The type of the object.

object_id    INT64    Unique identifier for the object.

path    STRING    The absolute path of the object.

language    Language    The language of the object. This value is set only if the object type is NOTEBOOK .
ExportFormat
The format for notebook import and export.

FORMAT    DESCRIPTION

SOURCE    The notebook will be imported/exported as source code.

HTML    The notebook will be imported/exported as an HTML file.

JUPYTER    The notebook will be imported/exported as a Jupyter/IPython Notebook file.

DBC    The notebook will be imported/exported in Databricks archive format.

Language
The language of notebook.

LANGUAGE    DESCRIPTION

SCALA    Scala notebook.

PYTHON    Python notebook.

SQL    SQL notebook.

R    R notebook.

ObjectType
The type of the object in workspace.

TYPE    DESCRIPTION

NOTEBOOK    Notebook

DIRECTORY    Directory

LIBRARY    Library

REPO    Repository
REST API 1.2

The Databricks REST API allows you to programmatically access Azure Databricks instead of going through the
web UI.
This article covers REST API 1.2. The latest version of the REST API, as well as REST API 2.1 and 2.0, is also available.

IMPORTANT
Use the Clusters API 2.0 for managing clusters programmatically and the Libraries API 2.0 for managing libraries
programmatically.
The 1.2 Create an execution context and Run a command APIs continue to be supported.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

REST API use cases


Start Apache Spark jobs triggered from your existing production systems or from workflow systems.
Programmatically bring up a cluster of a certain size at a fixed time of day and then shut it down at night.

API categories
Execution context: create unique variable namespaces where Spark commands can be called.
Command execution: run commands within a specific execution context.

Details
This REST API runs over HTTPS.
For retrieving information, use HTTP GET.
For modifying state, use HTTP POST.
For file upload, use multipart/form-data . Otherwise use application/json .
The response content type is JSON.
Basic authentication is used to authenticate the user for every API call.
User credentials are base64 encoded and are in the HTTP header for every API call. For example,
Authorization: Basic YWRtaW46YWRtaW4= . If you use curl , alternatively you can store user credentials in a
.netrc file.
For more information about using the Databricks REST API, see the Databricks REST API reference.

Get started
To try out the examples in this article, replace <databricks-instance> with the workspace URL of your Azure
Databricks deployment.
The following examples use curl and a .netrc file. You can adapt these curl examples with an HTTP library in
your programming language of choice.
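For example, the first call below can be adapted to Python with the requests library. This is only a sketch: the environment variable names are assumptions made for this example, and it authenticates with a personal access token sent as a Bearer header (if you omit the header, requests can also pick up credentials from your .netrc file).

import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

response = requests.get(
    f"{host}/api/1.2/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
for cluster in response.json():
    print(cluster["id"], cluster["name"], cluster["status"])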
API reference
Get the list of clusters
Get information about a cluster
Restart a cluster
Create an execution context
Get information about an execution context
Delete an execution context
Run a command
Get information about a command
Cancel a command
Get the list of libraries for a cluster
Upload a library to a cluster
Get the list of clusters
Method and path:
GET /api/1.2/clusters/list

Example
Request:

curl --netrc --request GET \


https://<databricks-instance>/api/1.2/clusters/list

Response:

[
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers":0
},
{
"..."
}
]

Request schema
None.
Response schema
An array of objects, with each object representing information about a cluster as follows:

FIELD

id

Type: string

The ID of the cluster.



name

Type: string

The name of the cluster.

status

Type: string

The status of the cluster. One of:

* Error
* Pending
* Reconfiguring
* Restarting
* Running
* Terminated
* Terminating
* Unknown

driverIp

Type: string

The IP address of the driver.

jdbcPort

Type: number

The JDBC port number.

numWorkers

Type: number

The number of workers for the cluster.

Get information about a cluster


Method and path:
GET /api/1.2/clusters/status

Example
Request:

curl --netrc --get \


https://<databricks-instance>/api/1.2/clusters/status \
--data clusterId=1234-567890-span123

Response:
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers": 0
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster.

Response schema
An object that represents information about the cluster.

FIELD

id

Type: string

The ID of the cluster.

name

Type: string

The name of the cluster.

status

Type: string

The status of the cluster. One of:

* Error
* Pending
* Reconfiguring
* Restarting
* Running
* Terminated
* Terminating
* Unknown

driverIp

Type: string

The IP address of the driver.



jdbcPort

Type: number

The JDBC port number.

numWorkers

Type: number

The number of workers for the cluster.

Restart a cluster
Method and path:
POST /api/1.2/clusters/restart

Example
Request:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/clusters/restart \
--data clusterId=1234-567890-span123

Response:

{
"id": "1234-567890-span123"
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster to restart.

Response schema

FIELD

id

Type: string

The ID of the cluster.

Create an execution context


Method and path:
POST /api/1.2/contexts/create

Example
Request:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/contexts/create \
--data clusterId=1234-567890-span123 \
--data language=sql

Response:

{
"id": "1234567890123456789"
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster to create the context for.

language

Type: string

The language for the context. One of:

* python
* scala
* sql

Response schema

FIELD

id

Type: string

The ID of the execution context.

Get information about an execution context


Method and path:
GET /api/1.2/contexts/status

Example
Request:

curl --netrc 'https://<databricks-instance>/api/1.2/contexts/status?clusterId=1234-567890-span123&contextId=1234567890123456789'

Response:
{
"id": "1234567890123456789",
"status": "Running"
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster to get execution context information about.

contextId

Type: string

The ID of the execution context.

Response schema

FIELD

id

Type: string

The ID of the execution context.

status

Type: string

The status of the execution context. One of:

* Error
* Pending
* Running

Delete an execution context


Method and path:
POST /api/1.2/contexts/destroy

Example
Request:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/contexts/destroy \
--data clusterId=1234-567890-span123 \
--data contextId=1234567890123456789

Response:
{
"id": "1234567890123456789"
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster to destroy the execution context for.

contextId

Type: string

The ID of the execution context to destroy.

Response schema

FIELD

id

Type: string

The ID of the execution context.

Run a command
Method and path:
POST /api/1.2/commands/execute

Example
Request:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/commands/execute \
--header 'Content-Type: application/json' \
--data @execute-command.json

execute-command.json :

{
"clusterId": "1234-567890-span123",
"contextId": "1234567890123456789",
"language": "python",
"command": "print('Hello, World!')"
}

Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster to run the command on.

contextId

Type: string

The ID of the execution context to run the command within.

language

Type: string

The language of the command.

command

Type: string

The command string to run.

Specify either command or


commandFile .

commandFile

Type: string

The path to a file containing the command to run.

Specify either commandFile or


command .

options

Type: string

An optional map of values used downstream. For example, a displayRowLimit override (used in testing).

Response schema

FIELD

id

Type: string

The ID of the command.


Get information about a command
Method and path:
GET /api/1.2/commands/status

Example
Request:

curl --netrc --get \


https://<databricks-instance>/api/1.2/commands/status \
--data clusterId=1234-567890-span123 \
--data contextId=1234567890123456789 \
--data commandId=1234ab56-7890-1cde-234f-5abcdef67890

Response:

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"status": "Finished",
"results": {
"resultType": "text",
"data": "Hello, World!"
}
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster to get the command information about.

contextId

Type: string

The ID of the execution context that is associated with the command.

commandId

Type: string

The ID of the command to get information about.

Response schema

FIELD

id

Type: string

The ID of the command.



status

Type: string

The status of the command. One of:

* Cancelled
* Cancelling
* Error
* Finished
* Queued
* Running

results

Type: object

The results of the command.

* resultType : The type of result. Type: string One of:

* error
* image
* images
* table
* text

For error :

* cause : The cause of the error. Type: string

For image :

* fileName : The image filename. Type: string

For images :

* fileNames : The images’ filenames. Type: array of string

For table :

* data : The table data. Type: array of array of any

* schema : The table schema. Type: array of array of (string, any)

* truncated : true if partial results are returned. Type: true / false

* isJsonSchema : true if a JSON schema is returned instead of a string representation of the Hive type. Type: true /
false

For text :

* data : The text. Type: string

Cancel a command
Method and path:
POST /api/1.2/commands/cancel
Example
Request:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/commands/cancel \
--data clusterId=1234-567890-span123 \
--data contextId=1234567890123456789 \
--data commandId=1234ab56-7890-1cde-234f-5abcdef67890

Response:

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}

Request schema

FIELD

clusterId

Type: string

The ID of the cluster that is associated with the command to cancel.

contextId

Type: string

The ID of the execution context that is associated with the command to cancel.

commandId

Type: string

The ID of the command to cancel.

Response schema

FIELD

id

Type: string

The ID of the command.

Get the list of libraries for a cluster

IMPORTANT
This operation is deprecated. Use the Cluster status operation in the Libraries API instead.

Method and path:


GET /api/1.2/libraries/list

Example
Request:

curl --netrc --get \


https://<databricks-instance>/api/1.2/libraries/list \
--data clusterId=1234-567890-span123

Request schema

FIELD

clusterId

Type: string

The ID of the cluster.

Response schema
An array of objects, with each object representing information about a library as follows:

FIELD

name

Type: string

The name of the library.

status

Type: string

The status of the library. One of:

* LibraryError
* LibraryLoaded
* LibraryPending

Upload a library to a cluster

IMPORTANT
This operation is deprecated. Use the Install operation in the Libraries API instead.

Method and path:


POST /api/1.2/libraries/upload

Request schema

FIELD

clusterId

Type: string

The ID of the cluster to upload the library to.



name

Type: string

The name of the library.

language

Type: string

The language of the library.

uri

Type: string

The URI of the library.

The scheme can be file , http , or


https .

Response schema
Information about the uploaded library.

FIELD

language

Type: string

The language of the library.

uri

Type: string

The URI of the library.

Additional examples
The following additional examples provide commands that you can use with curl or adapt with an HTTP
library in your programming language of choice.
Create an execution context
Run a command
Upload and run a Spark JAR
Create an execution context
Create an execution context on a specified cluster for a given programming language:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/contexts/create \
--header 'Content-Type: application/json' \
--data '{ "language": "scala", "clusterId": "1234-567890-span123" }'
Get information about the execution context:

curl --netrc --get \


https://<databricks-instance>/api/1.2/contexts/status \
--data 'clusterId=1234-567890-span123&contextId=1234567890123456789'

Delete the execution context:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/contexts/destroy \
--header 'Content-Type: application/json' \
--data '{ "contextId": "1234567890123456789", "clusterId": "1234-567890-span123" }'

Run a command
Known limitations: command execution does not support %run .
Run a command string:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/commands/execute \
--header 'Content-Type: application/json' \
--data '{ "language": "scala", "clusterId": "1234-567890-span123", "contextId": "1234567890123456789",
"command": "sc.parallelize(1 to 10).collect" }'

Run a file:

curl --netrc --request POST \


https://<databricks-instance>/api/1.2/commands/execute \
--header 'Content-Type: multipart/form-data' \
--form language=python \
--form clusterId=1234-567890-span123 \
--form contextId=1234567890123456789 \
--form command=@myfile.py

Show the command’s status and result:

curl --netrc --get \


https://<databricks-instance>/api/1.2/commands/status \
--data 'clusterId=1234-567890-span123&contextId=1234567890123456789&commandId=1234ab56-7890-1cde-234f-5abcdef67890'

Cancel the command:

curl --netrc --request POST \
https://<databricks-instance>/api/1.2/commands/cancel \
--data 'clusterId=1234-567890-span123&contextId=1234567890123456789&commandId=1234ab56-7890-1cde-234f-5abcdef67890'

Upload and run a Spark JAR


Upload a JAR
Use the REST API (latest) to upload a JAR and attach it to a cluster.
Run a JAR
1. Create an execution context.
curl --netrc --request POST \
https://<databricks-instance>/api/1.2/contexts/create \
--data "language=scala&clusterId=1234-567890-span123"

{
"id": "1234567890123456789"
}

2. Execute a command that uses your JAR.

curl --netrc --request POST \
https://<databricks-instance>/api/1.2/commands/execute \
--data 'language=scala&clusterId=1234-567890-span123&contextId=1234567890123456789&command=println(com.databricks.apps.logs.chapter1.LogAnalyzer.processLogFile(sc,null,"dbfs:/somefile.log"))'

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}

3. Check on the status of your command. It may not return immediately if you are running a lengthy Spark
job.

curl --netrc 'https://<databricks-instance>/api/1.2/commands/status?clusterId=1234-567890-span123&contextId=1234567890123456789&commandId=1234ab56-7890-1cde-234f-5abcdef67890'

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"results": {
"data": "Content Size Avg: 1234, Min: 1234, Max: 1234",
"resultType": "text"
},
"status": "Finished"
}

Allowed values for resultType include:

* error
* image
* images
* table
* text
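
If you adapt these commands with an HTTP library, you typically poll the command status until it reaches a terminal state rather than calling it once. The following is a minimal Python sketch using the requests library; the IDs are the placeholder values from the examples above, and the terminal statuses other than Finished are assumptions.

import time
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
params = {
    'clusterId': '1234-567890-span123',
    'contextId': '1234567890123456789',
    'commandId': '1234ab56-7890-1cde-234f-5abcdef67890'
}

# Poll the command status every few seconds until it reaches a terminal state.
while True:
    response = requests.get(
        'https://%s/api/1.2/commands/status' % DOMAIN,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        params=params
    )
    result = response.json()
    if result.get('status') in ('Finished', 'Cancelled', 'Error'):
        print(result)
        break
    time.sleep(5)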
Authentication using Azure Databricks personal
access tokens
7/21/2022 • 3 minutes to read

To authenticate to and access Databricks REST APIs, you can use Azure Databricks personal access tokens or
Azure Active Directory (Azure AD) tokens.
This article discusses how to use Azure Databricks personal access tokens. For Azure AD tokens, see
Authenticate using Azure Active Directory tokens.

IMPORTANT
Tokens replace passwords in an authentication flow and should be protected like passwords. To protect tokens, Databricks
recommends that you store tokens in:
Secret management and retrieve tokens in notebooks using the Secrets utility (dbutils.secrets).
A local key store and use the Python keyring package to retrieve tokens at runtime.

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

Requirements
Token-based authentication is enabled by default for all Azure Databricks accounts launched after January 2018.
If token-based authentication is disabled, your administrator must enable it before you can perform the tasks
described in Manage personal access tokens.

Generate a personal access token


This section describes how to generate a personal access token in the Azure Databricks UI. You can also generate
and revoke tokens using the Token API 2.0.
The number of personal access tokens per user is limited to 600 per workspace.

1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click the Generate New Token button.
5. Optionally enter a description (comment) and expiration period.
6. Click the Generate button.
7. Copy the generated token and store in a secure location.

Revoke a personal access token


This section describes how to revoke personal access tokens using the Azure Databricks UI. You can also
generate and revoke access tokens using the Token API 2.0.

1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click x for the token you want to revoke.
5. On the Revoke Token dialog, click the Revoke Token button.

Use a personal access token to access the Databricks REST API


You can store a personal access token in a .netrc file and use it in curl or pass it to the
Authorization: Bearer header.

Store tokens in a .netrc file and use them in curl

Create a .netrc file with machine , login , and password properties:

machine <databricks-instance>
login token
password <token-value>

where:
<databricks-instance> is the instance ID portion of the workspace URL for your Azure Databricks
deployment. For example, if the workspace URL is https://adb-1234567890123456.7.azuredatabricks.net then
<databricks-instance> is adb-1234567890123456.7.azuredatabricks.net .
token is the literal string token .
<token-value> is the value of your token, for example dapi1234567890ab1cde2f3ab456c7d89efa .

The result looks like this:


machine adb-1234567890123456.7.azuredatabricks.net
login token
password dapi1234567890ab1cde2f3ab456c7d89efa

For multiple machine/token entries, add one line per entry, with the machine , login and password properties
for each machine/token matching pair on the same line. The result looks like this:

machine adb-1234567890123456.7.azuredatabricks.net login token password dapi1234567890ab1cde2f3ab456c7d89efa


machine adb-2345678901234567.8.azuredatabricks.net login token password dapi2345678901cd2efa3b4cd567e8f90abc
machine adb-3456789012345678.9.azuredatabricks.net login token password dapi3456789012de3fab4c5de678f9a01bcd

This example invokes the .netrc file by using --netrc (you can also use -n ) in the curl command. It uses
the specified workspace URL to find the matching machine entry in the .netrc file.

curl --netrc -X GET https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list

Pass token to Bearer authentication


You can include the token in the header using Bearer authentication. You can use this approach with curl or
any client that you build. For the latter, see Upload a big file into DBFS.
This example uses Bearer authentication to list all available clusters in the specified workspace.

export DATABRICKS_TOKEN=dapi1234567890ab1cde2f3ab456c7d89efa

curl -X GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list
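
The same request can be made from Python with the requests library. A minimal sketch using the placeholder token and workspace URL above:

import requests

DOMAIN = 'adb-1234567890123456.7.azuredatabricks.net'
TOKEN = 'dapi1234567890ab1cde2f3ab456c7d89efa'

# List all available clusters in the workspace using Bearer authentication.
response = requests.get(
    'https://%s/api/2.0/clusters/list' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN}
)
print(response.json())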
API examples
7/21/2022 • 10 minutes to read

This article contains examples that demonstrate how to use the Azure Databricks REST API.
In the following examples, replace <databricks-instance> with the workspace URL of your Azure Databricks
deployment. <databricks-instance> should start with adb- . Do not use the deprecated regional URL starting
with <azure-region-name> . It may not work for new workspaces, will be less reliable, and will exhibit lower
performance than per-workspace URLs.

Authentication
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

To learn how to authenticate to the REST API, review Authentication using Azure Databricks personal access
tokens and Authenticate using Azure Active Directory tokens.
The examples in this article assume you are using Azure Databricks personal access tokens. In the following
examples, replace <your-token> with your personal access token. The curl examples assume that you store
Azure Databricks API credentials under .netrc. The Python examples use Bearer authentication. Although the
examples show storing the token in the code, for leveraging credentials safely in Azure Databricks, we
recommend that you follow the Secret management user guide.
For examples that use Authenticate using Azure Active Directory tokens, see the articles in that section.

Get a gzipped list of clusters


This example uses Databricks REST API version 2.0.

curl -n -H "Accept-Encoding: gzip" https://<databricks-instance>/api/2.0/clusters/list > clusters.gz

Upload a big file into DBFS


The amount of data uploaded by a single API call cannot exceed 1MB. To upload a file that is larger than 1MB to
DBFS, use the streaming API, which is a combination of create , addBlock , and close .
Here is an example of how to perform this action using Python. This example uses Databricks REST API version
2.0.
import json
import requests
import base64

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)

def dbfs_rpc(action, body):
    """A helper function to make the DBFS API request, request/response is encoded/decoded as JSON"""
    response = requests.post(
        BASE_URL + action,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        json=body
    )
    return response.json()

# Create a handle that will be used to add blocks
handle = dbfs_rpc("create", {"path": "/temp/upload_large_file", "overwrite": "true"})['handle']
# Open the local file in binary mode so the blocks can be base64 encoded
with open('/a/local/file', 'rb') as f:
    while True:
        # A block can be at most 1MB
        block = f.read(1 << 20)
        if not block:
            break
        data = base64.standard_b64encode(block).decode()
        dbfs_rpc("add-block", {"handle": handle, "data": data})
# close the handle to finish uploading
dbfs_rpc("close", {"handle": handle})

Create a Python 3 cluster (Databricks Runtime 5.5 LTS and higher)


NOTE
Python 3 is the default version of Python in Databricks Runtime 6.0 and above.

The following example shows how to launch a Python 3 cluster using the Databricks REST API and the requests
Python HTTP library. This example uses Databricks REST API version 2.0.

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

response = requests.post(
'https://%s/api/2.0/clusters/create' % (DOMAIN),
headers={'Authorization': 'Bearer %s' % TOKEN},
json={
"cluster_name": "my-cluster",
"spark_version": "5.5.x-scala2.11",
"node_type_id": "Standard_D3_v2",
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3",
}
}
)

if response.status_code == 200:
print(response.json()['cluster_id'])
else:
print("Error launching cluster: %s: %s" % (response.json()["error_code"], response.json()["message"]))
Create a High Concurrency cluster
The following example shows how to launch a High Concurrency mode cluster using the Databricks REST API.
This example uses Databricks REST API version 2.0.

curl -n -X POST -H 'Content-Type: application/json' -d '{
"cluster_name": "high-concurrency-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"spark_conf":{
"spark.databricks.cluster.profile":"serverless",
"spark.databricks.repl.allowedLanguages":"sql,python,r"
},
"custom_tags":{
"ResourceClass":"Serverless"
},
"autoscale":{
"min_workers":1,
"max_workers":2
},
"autotermination_minutes":10
}' https://<databricks-instance>/api/2.0/clusters/create

Jobs API examples


This section shows how to create Python, spark-submit, and JAR jobs, and how to run the JAR job and view its output.
Create a Python job
This example shows how to create a Python job. It uses the Apache Spark Python Spark Pi estimation. This
example uses Databricks REST API version 2.0.
1. Download the Python file containing the example and upload it to Databricks File System (DBFS) using
the Databricks CLI.

dbfs cp pi.py dbfs:/docs/pi.py

2. Create the job.


The following examples demonstrate how to create a job using Databricks Runtime and Databricks Light.
Databricks Runtime

curl -n -X POST -H 'Content-Type: application/json' -d \
'{
"name": "SparkPi Python job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 2
},
"spark_python_task": {
"python_file": "dbfs:/docs/pi.py",
"parameters": [
"10"
]
}
}' https://<databricks-instance>/api/2.0/jobs/create

Databricks Light
curl -n -X POST -H 'Content-Type: application/json' -d \
'{
"name": "SparkPi Python job",
"new_cluster": {
"spark_version": "apache-spark-2.4.x-scala2.11",
"node_type_id": "Standard_D3_v2",
"num_workers": 2
},
"spark_python_task": {
"python_file": "dbfs:/docs/pi.py",
"parameters": [
"10"
]
}
}' https://<databricks-instance>/api/2.0/jobs/create

Create a spark-submit job


This example shows how to create a spark-submit job. It uses the Apache Spark SparkPi example and Databricks
REST API version 2.0.
1. Download the JAR containing the example and upload the JAR to Databricks File System (DBFS) using the
Databricks CLI.

dbfs cp SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar

2. Create the job.

curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{
"name": "SparkPi spark-submit job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 2
},
"spark_submit_task": {
"parameters": [
"--class",
"org.apache.spark.examples.SparkPi",
"dbfs:/docs/sparkpi.jar",
"10"
]
}
}' https://<databricks-instance>/api/2.0/jobs/create

Create and run a spark-submit job for R scripts


This example shows how to create a spark-submit job to run R scripts. This example uses Databricks REST API
version 2.0.
1. Upload the R file to Databricks File System (DBFS) using the Databricks CLI.

dbfs cp your_code.R dbfs:/path/to/your_code.R

If the code uses SparkR, you must first install the package. Databricks Runtime contains the SparkR source
code. Install the SparkR package from its local directory as shown in the following example:
install.packages("/databricks/spark/R/pkg", repos = NULL)
library(SparkR)

sparkR.session()
n <- nrow(createDataFrame(iris))
write.csv(n, "/dbfs/path/to/num_rows.csv")

Databricks Runtime installs the latest version of sparklyr from CRAN. If the code uses sparklyr, you must
specify the Spark master URL in spark_connect . To form the Spark master URL, use the SPARK_LOCAL_IP
environment variable to get the IP, and use the default port 7077. For example:

library(sparklyr)

master <- paste("spark://", Sys.getenv("SPARK_LOCAL_IP"), ":7077", sep="")


sc <- spark_connect(master)
iris_tbl <- copy_to(sc, iris)
write.csv(iris_tbl, "/dbfs/path/to/sparklyr_iris.csv")

2. Create the job.

curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{
"name": "R script spark-submit job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 2
},
"spark_submit_task": {
"parameters": [ "dbfs:/path/to/your_code.R" ]
}
}' https://<databricks-instance>/api/2.0/jobs/create

This returns a job-id that you can then use to run the job.
3. Run the job using the job-id .

curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{ "job_id": <job-id> }' https://<databricks-instance>/api/2.0/jobs/run-now

Create and run a JAR job


This example shows how to create and run a JAR job. It uses the Apache Spark SparkPi example and Databricks
REST API version 2.0.
1. Download the JAR containing the example.
2. Upload the JAR to your Azure Databricks instance using the API:

curl -n \
-F filedata=@"SparkPi-assembly-0.1.jar" \
-F path="/docs/sparkpi.jar" \
-F overwrite=true \
https://<databricks-instance>/api/2.0/dbfs/put

A successful call returns {} . Otherwise you will see an error message.


3. Get a list of all Spark versions prior to creating your job.

curl -n https://<databricks-instance>/api/2.0/clusters/spark-versions

This example uses 7.3.x-scala2.12 . See Runtime version strings for more information about Spark
cluster versions.
4. Create the job. The JAR is specified as a library and the main class name is referenced in the Spark JAR
task.

curl -n -X POST -H 'Content-Type: application/json' \
-d '{
"name": "SparkPi JAR job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 2
},
"libraries": [{"jar": "dbfs:/docs/sparkpi.jar"}],
"spark_jar_task": {
"main_class_name":"org.apache.spark.examples.SparkPi",
"parameters": "10"
}
}' https://<databricks-instance>/api/2.0/jobs/create

This returns a job-id that you can then use to run the job.
5. Run the job using run now :

curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{ "job_id": <job-id> }' https://<databricks-instance>/api/2.0/jobs/run-now

6. Navigate to https://<databricks-instance>/#job/<job-id> and you’ll be able to see your job running.


7. You can also check on it from the API using the information returned from the previous request.

curl -n https://<databricks-instance>/api/2.0/jobs/runs/get?run_id=<run-id> | jq

Which should return something like:


{
"job_id": 35,
"run_id": 30,
"number_in_job": 1,
"original_attempt_run_id": 30,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"task": {
"spark_jar_task": {
"jar_uri": "",
"main_class_name": "org.apache.spark.examples.SparkPi",
"parameters": [
"10"
],
"run_as_repl": true
}
},
"cluster_spec": {
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "<node-type>",
"enable_elastic_disk": false,
"num_workers": 1
},
"libraries": [
{
"jar": "dbfs:/docs/sparkpi.jar"
}
]
},
"cluster_instance": {
"cluster_id": "0412-165350-type465",
"spark_context_id": "5998195893958609953"
},
"start_time": 1523552029282,
"setup_duration": 211000,
"execution_duration": 33000,
"cleanup_duration": 2000,
"trigger": "ONE_TIME",
"creator_user_name": "...",
"run_name": "SparkPi JAR job",
"run_page_url": "<databricks-instance>/?o=3901135158661429#job/35/run/1",
"run_type": "JOB_RUN"
}

8. To view the job output, visit the job run details page.

Executing command, time = 1523552263909.


Pi is roughly 3.13973913973914
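
For a long-running job, you can poll the runs/get endpoint until the run reaches a terminal life cycle state instead of checking it once. The following is a minimal Python sketch, assuming the run ID returned by the run-now call above:

import time
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
RUN_ID = 30  # run_id returned by jobs/run-now

# Poll the run until it reaches a terminal life cycle state.
while True:
    response = requests.get(
        'https://%s/api/2.0/jobs/runs/get' % DOMAIN,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        params={'run_id': RUN_ID}
    )
    run = response.json()
    state = run['state']['life_cycle_state']
    if state in ('TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'):
        print(state, run['state'].get('result_state'))
        break
    time.sleep(30)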

Create cluster enabled for table access control example


To create a cluster enabled for table access control, specify the following spark_conf property in your request
body. This example uses Databricks REST API version 2.0.
curl -X POST https://<databricks-instance>/api/2.0/clusters/create -d'
{
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"spark_conf": {
"spark.databricks.acl.dfAclsEnabled":true,
"spark.databricks.repl.allowedLanguages": "python,sql"
},
"num_workers": 1,
"custom_tags":{
"costcenter":"Tags",
"applicationname":"Tags1"
}
}'

Cluster log delivery examples


While you can view the Spark driver and executor logs in the Spark UI, Azure Databricks can also deliver the logs
to DBFS destinations. See the following examples.
Create a cluster with logs delivered to a DBFS location
The following cURL command creates a cluster named cluster_log_dbfs and requests Azure Databricks to
send its logs to dbfs:/logs with the cluster ID as the path prefix. This example uses Databricks REST API
version 2.0.

curl -n -X POST -H 'Content-Type: application/json' -d \
'{
"cluster_name": "cluster_log_dbfs",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 1,
"cluster_log_conf": {
"dbfs": {
"destination": "dbfs:/logs"
}
}
}' https://<databricks-instance>/api/2.0/clusters/create

The response should contain the cluster ID:

{"cluster_id":"1111-223344-abc55"}

After cluster creation, Azure Databricks syncs log files to the destination every 5 minutes. It uploads driver logs
to dbfs:/logs/1111-223344-abc55/driver and executor logs to dbfs:/logs/1111-223344-abc55/executor .
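
To confirm that log files are arriving, you can list the destination with the DBFS API. The following is a minimal Python sketch, assuming the cluster ID and dbfs:/logs destination from the example above:

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

# List the driver log files delivered for cluster 1111-223344-abc55.
response = requests.get(
    'https://%s/api/2.0/dbfs/list' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    params={'path': '/logs/1111-223344-abc55/driver'}
)
for f in response.json().get('files', []):
    print(f['path'], f['file_size'])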
Check log delivery status
You can retrieve cluster information with log delivery status via API. This example uses Databricks REST API
version 2.0.

curl -n -H 'Content-Type: application/json' -d \
'{
"cluster_id": "1111-223344-abc55"
}' https://<databricks-instance>/api/2.0/clusters/get

If the latest batch of log upload was successful, the response should contain only the timestamp of the last
attempt:
{
"cluster_log_status": {
"last_attempted": 1479338561
}
}

In case of errors, the error message would appear in the response:

{
"cluster_log_status": {
"last_attempted": 1479338561,
"last_exception": "Exception: Access Denied ..."
}
}

Workspace examples
Here are some examples for using the Workspace API to list, get info about, create, delete, export, and import
workspace objects.
List a notebook or a folder
The following cURL command lists a path in the workspace. This example uses Databricks REST API version 2.0.

curl -n -X GET -H 'Content-Type: application/json' -d \
'{
"path": "/Users/user@example.com/"
}' https://<databricks-instance>/api/2.0/workspace/list

The response should contain a list of statuses:

{
"objects": [
{
"object_type": "DIRECTORY",
"path": "/Users/user@example.com/folder"
},
{
"object_type": "NOTEBOOK",
"language": "PYTHON",
"path": "/Users/user@example.com/notebook1"
},
{
"object_type": "NOTEBOOK",
"language": "SCALA",
"path": "/Users/user@example.com/notebook2"
}
]
}

If the path is a notebook, the response contains an array containing the status of the input notebook.
Get information about a notebook or a folder
The following cURL command gets the status of a path in the workspace. This example uses Databricks REST API
version 2.0.
curl -n -X GET -H 'Content-Type: application/json' -d \
'{
"path": "/Users/user@example.com/"
}' https://<databricks-instance>/api/2.0/workspace/get-status

The response should contain the status of the input path:

{
"object_type": "DIRECTORY",
"path": "/Users/user@example.com"
}

Create a folder
The following cURL command creates a folder. It creates the folder recursively like mkdir -p . If the folder
already exists, it will do nothing and succeed. This example uses Databricks REST API version 2.0.

curl -n -X POST -H 'Content-Type: application/json' -d \
'{
"path": "/Users/user@example.com/new/folder"
}' https://<databricks-instance>/api/2.0/workspace/mkdirs

If the request succeeds, an empty JSON string will be returned.


Delete a notebook or folder
The following cURL command deletes a notebook or folder. You can enable recursive to recursively delete a
non-empty folder. This example uses Databricks REST API version 2.0.

curl -n -X POST -H 'Content-Type: application/json' -d \
'{
"path": "/Users/user@example.com/new/folder",
"recursive": "false"
}' https://<databricks-instance>/api/2.0/workspace/delete

If the request succeeds, an empty JSON string is returned.


Export a notebook or folder
The following cURL command exports a notebook. Notebooks can be exported in the following formats:
SOURCE , HTML , JUPYTER , DBC . A folder can be exported only as DBC . This example uses Databricks REST API
version 2.0.

curl -n -X GET \
-d '{ "path": "/Users/user@example.com/notebook", "format": "SOURCE" }' \
https://<databricks-instance>/api/2.0/workspace/export

The response contains base64 encoded notebook content.

{
"content":
"Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg=="
}
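
The content field is base64 encoded, so decode it to recover the notebook source. The following is a minimal Python sketch, assuming the JSON body shown above has been captured as a string; the local file name is a placeholder:

import base64
import json

# response_text holds the JSON body returned by the export call above.
response_text = '{"content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg=="}'
content = json.loads(response_text)['content']

# Decode the base64 payload and write the notebook source to a local file.
with open('notebook.scala', 'wb') as f:
    f.write(base64.b64decode(content))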

Alternatively, you can download the exported notebook directly.


curl -n -X GET "https://<databricks-instance>/api/2.0/workspace/export?
format=SOURCE&direct_download=true&path=/Users/user@example.com/notebook"

The response will be the exported notebook content.


Import a notebook or directory
The following cURL command imports a notebook in the workspace. Multiple formats ( SOURCE , HTML , JUPYTER ,
DBC ) are supported. If the format is SOURCE , you must specify language . The content parameter contains
base64 encoded notebook content. You can enable overwrite to overwrite the existing notebook. This example
uses Databricks REST API version 2.0.

curl -n -X POST -H 'Content-Type: application/json' -d \
'{
"path": "/Users/user@example.com/new-notebook",
"format": "SOURCE",
"language": "SCALA",
"content":
"Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg==",
"overwrite": "false"
}' https://<databricks-instance>/api/2.0/workspace/import

If the request succeeds, an empty JSON string is returned.


Alternatively, you can import a notebook via multipart form post.

curl -n -X POST https://<databricks-instance>/api/2.0/workspace/import \
-F path="/Users/user@example.com/new-notebook" -F format=SOURCE -F language=SCALA -F overwrite=true \
-F content=@notebook.scala
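
If you build the JSON request programmatically, base64 encode the local notebook file yourself and pass it as content. The following is a minimal Python sketch using the requests library; the local file name is a placeholder and the target path mirrors the example above:

import base64
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

# Read a local notebook file and base64 encode it for the content field.
with open('notebook.scala', 'rb') as f:
    content = base64.standard_b64encode(f.read()).decode()

response = requests.post(
    'https://%s/api/2.0/workspace/import' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    json={
        'path': '/Users/user@example.com/new-notebook',
        'format': 'SOURCE',
        'language': 'SCALA',
        'content': content,
        'overwrite': 'false'
    }
)
print(response.json())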
REST API 2.1
7/21/2022 • 2 minutes to read

The Databricks REST API allows for programmatic management of various Azure Databricks resources. This
article provides links to version 2.1 of each API.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

For general usage notes about the Databricks REST API, see Databricks REST API reference.
The REST API latest version, as well as REST API 2.0 and 1.2, are also available.
Jobs API 2.1
Authentication using Azure Databricks personal
access tokens
7/21/2022 • 3 minutes to read

To authenticate to and access Databricks REST APIs, you can use Azure Databricks personal access tokens or
Azure Active Directory (Azure AD) tokens.
This article discusses how to use Azure Databricks personal access tokens. For Azure AD tokens, see
Authenticate using Azure Active Directory tokens.

IMPORTANT
Tokens replace passwords in an authentication flow and should be protected like passwords. To protect tokens, Databricks
recommends that you store tokens in:
Secret management and retrieve tokens in notebooks using the Secrets utility (dbutils.secrets).
A local key store and use the Python keyring package to retrieve tokens at runtime.

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

Requirements
Token-based authentication is enabled by default for all Azure Databricks accounts launched after January 2018.
If token-based authentication is disabled, your administrator must enable it before you can perform the tasks
described in Manage personal access tokens.

Generate a personal access token


This section describes how to generate a personal access token in the Azure Databricks UI. You can also generate
and revoke tokens using the Token API 2.0.
The number of personal access tokens per user is limited to 600 per workspace.

1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click the Generate New Token button.
5. Optionally enter a description (comment) and expiration period.
6. Click the Generate button.
7. Copy the generated token and store in a secure location.

Revoke a personal access token


This section describes how to revoke personal access tokens using the Azure Databricks UI. You can also
generate and revoke access tokens using the Token API 2.0.

1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click x for the token you want to revoke.
5. On the Revoke Token dialog, click the Revoke Token button.

Use a personal access token to access the Databricks REST API


You can store a personal access token in a .netrc file and use it in curl or pass it to the
Authorization: Bearer header.

Store tokens in a .netrc file and use them in curl

Create a .netrc file with machine , login , and password properties:

machine <databricks-instance>
login token
password <token-value>

where:
<databricks-instance> is the instance ID portion of the workspace URL for your Azure Databricks
deployment. For example, if the workspace URL is https://adb-1234567890123456.7.azuredatabricks.net then
<databricks-instance> is adb-1234567890123456.7.azuredatabricks.net .
token is the literal string token .
<token-value> is the value of your token, for example dapi1234567890ab1cde2f3ab456c7d89efa .

The result looks like this:


machine adb-1234567890123456.7.azuredatabricks.net
login token
password dapi1234567890ab1cde2f3ab456c7d89efa

For multiple machine/token entries, add one line per entry, with the machine , login and password properties
for each machine/token matching pair on the same line. The result looks like this:

machine adb-1234567890123456.7.azuredatabricks.net login token password dapi1234567890ab1cde2f3ab456c7d89efa


machine adb-2345678901234567.8.azuredatabricks.net login token password dapi2345678901cd2efa3b4cd567e8f90abc
machine adb-3456789012345678.9.azuredatabricks.net login token password dapi3456789012de3fab4c5de678f9a01bcd

This example invokes the .netrc file by using --netrc (you can also use -n ) in the curl command. It uses
the specified workspace URL to find the matching machine entry in the .netrc file.

curl --netrc -X GET https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list

Pass token to Bearer authentication


You can include the token in the header using Bearer authentication. You can use this approach with curl or
any client that you build. For the latter, see Upload a big file into DBFS.
This example uses Bearer authentication to list all available clusters in the specified workspace.

export DATABRICKS_TOKEN=dapi1234567890ab1cde2f3ab456c7d89efa

curl -X GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list
Jobs API 2.1
7/21/2022 • 2 minutes to read

The Jobs API allows you to programmatically manage Azure Databricks jobs. See Jobs.
The Jobs API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
REST API 2.0
7/21/2022 • 2 minutes to read

The Databricks REST API allows for programmatic management of various Azure Databricks resources. This
article provides links to version 2.0 of each API.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

For general usage notes about the Databricks REST API, see Databricks REST API reference.
The REST API latest version, as well as REST API 2.1 and 1.2, are also available.
Clusters API 2.0
Cluster Policies API 2.0
Databricks SQL Warehouses API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
DBFS API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.0
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM API 2.0
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
Authentication using Azure Databricks personal
access tokens
7/21/2022 • 3 minutes to read

To authenticate to and access Databricks REST APIs, you can use Azure Databricks personal access tokens or
Azure Active Directory (Azure AD) tokens.
This article discusses how to use Azure Databricks personal access tokens. For Azure AD tokens, see
Authenticate using Azure Active Directory tokens.

IMPORTANT
Tokens replace passwords in an authentication flow and should be protected like passwords. To protect tokens, Databricks
recommends that you store tokens in:
Secret management and retrieve tokens in notebooks using the Secrets utility (dbutils.secrets).
A local key store and use the Python keyring package to retrieve tokens at runtime.

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

Requirements
Token-based authentication is enabled by default for all Azure Databricks accounts launched after January 2018.
If token-based authentication is disabled, your administrator must enable it before you can perform the tasks
described in Manage personal access tokens.

Generate a personal access token


This section describes how to generate a personal access token in the Azure Databricks UI. You can also generate
and revoke tokens using the Token API 2.0.
The number of personal access tokens per user is limited to 600 per workspace.

1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click the Generate New Token button.
5. Optionally enter a description (comment) and expiration period.
6. Click the Generate button.
7. Copy the generated token and store in a secure location.

Revoke a personal access token


This section describes how to revoke personal access tokens using the Azure Databricks UI. You can also
generate and revoke access tokens using the Token API 2.0.

1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click x for the token you want to revoke.
5. On the Revoke Token dialog, click the Revoke Token button.

Use a personal access token to access the Databricks REST API


You can store a personal access token in a .netrc file and use it in curl or pass it to the
Authorization: Bearer header.

Store tokens in a .netrc file and use them in curl

Create a .netrc file with machine , login , and password properties:

machine <databricks-instance>
login token
password <token-value>

where:
<databricks-instance> is the instance ID portion of the workspace URL for your Azure Databricks
deployment. For example, if the workspace URL is https://adb-1234567890123456.7.azuredatabricks.net then
<databricks-instance> is adb-1234567890123456.7.azuredatabricks.net .
token is the literal string token .
<token-value> is the value of your token, for example dapi1234567890ab1cde2f3ab456c7d89efa .

The result looks like this:


machine adb-1234567890123456.7.azuredatabricks.net
login token
password dapi1234567890ab1cde2f3ab456c7d89efa

For multiple machine/token entries, add one line per entry, with the machine , login and password properties
for each machine/token matching pair on the same line. The result looks like this:

machine adb-1234567890123456.7.azuredatabricks.net login token password dapi1234567890ab1cde2f3ab456c7d89efa


machine adb-2345678901234567.8.azuredatabricks.net login token password dapi2345678901cd2efa3b4cd567e8f90abc
machine adb-3456789012345678.9.azuredatabricks.net login token password dapi3456789012de3fab4c5de678f9a01bcd

This example invokes the .netrc file by using --netrc (you can also use -n ) in the curl command. It uses
the specified workspace URL to find the matching machine entry in the .netrc file.

curl --netrc -X GET https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list

Pass token to Bearer authentication


You can include the token in the header using Bearer authentication. You can use this approach with curl or
any client that you build. For the latter, see Upload a big file into DBFS.
This example uses Bearer authentication to list all available clusters in the specified workspace.

export DATABRICKS_TOKEN=dapi1234567890ab1cde2f3ab456c7d89efa

curl -X GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list
Clusters API 2.0
7/21/2022 • 46 minutes to read

The Clusters API allows you to create, start, edit, list, terminate, and delete clusters. The maximum allowed size of
a request to the Clusters API is 10MB.
Cluster lifecycle methods require a cluster ID, which is returned from Create. To obtain a list of clusters, invoke
List.
Azure Databricks maps cluster node instance types to compute units known as DBUs. See the instance type
pricing page for a list of the supported instance types and their corresponding DBUs. For instance provider
information, see Azure instance type specifications and pricing.
Azure Databricks always provides one year’s deprecation notice before ceasing support for an instance type.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create
ENDPOINT  HTTP METHOD

2.0/clusters/create POST

Create a new Apache Spark cluster. This method acquires new instances from the cloud provider if necessary.
This method is asynchronous; the returned cluster_id can be used to poll the cluster state. When this method
returns, the cluster is in a PENDING state. The cluster is usable once it enters a RUNNING state. See ClusterState.

NOTE
Azure Databricks may not be able to acquire some of the requested nodes, due to cloud provider limitations or transient
network issues. If it is unable to acquire a sufficient number of the requested nodes, cluster creation will terminate with an
informative error message.
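
Because the call is asynchronous, a client typically polls the Get operation with the returned cluster_id until the cluster leaves the PENDING state. The following is a minimal Python sketch; the workspace URL, token, and cluster ID are placeholders:

import time
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
CLUSTER_ID = '1234-567890-undid123'  # cluster_id returned by clusters/create

# Poll clusters/get until the cluster is usable or creation fails.
while True:
    response = requests.get(
        'https://%s/api/2.0/clusters/get' % DOMAIN,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        params={'cluster_id': CLUSTER_ID}
    )
    cluster = response.json()
    if cluster['state'] == 'RUNNING':
        print('Cluster is running')
        break
    if cluster['state'] in ('TERMINATED', 'ERROR'):
        print('Cluster failed to start:', cluster.get('state_message'))
        break
    time.sleep(30)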

Examples

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
--data @create-cluster.json

create-cluster.json :
{
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"spark_conf": {
"spark.speculation": true
},
"num_workers": 25
}

{ "cluster_id": "1234-567890-undid123" }

Here is an example for an autoscaling cluster. This cluster will start with two nodes, the minimum.

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
--data @create-cluster.json

create-cluster.json :

{
"cluster_name": "autoscaling-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"autoscale" : {
"min_workers": 2,
"max_workers": 50
}
}

{ "cluster_id": "1234-567890-hared123" }

This example creates a Single Node cluster. To create a Single Node cluster:
Set spark_conf and custom_tags to the exact values in the example.
Set num_workers to 0 .

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
--data @create-cluster.json

create-cluster.json :
{
"cluster_name": "single-node-cluster",
"spark_version": "7.6.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 0,
"spark_conf": {
"spark.databricks.cluster.profile": "singleNode",
"spark.master": "local[*]"
},
"custom_tags": {
"ResourceClass": "SingleNode"
}
}

{ "cluster_id": "1234-567890-pouch123" }

To create a job or submit a run with a new cluster using a policy, set policy_id to the policy ID:

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
--data @create-cluster.json

create-cluster.json :

{
"num_workers": null,
"autoscale": {
"min_workers": 2,
"max_workers": 8
},
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"spark_conf": {},
"node_type_id": "Standard_D3_v2",
"custom_tags": {},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"autotermination_minutes": 120,
"init_scripts": [],
"policy_id": "C65B864F02000008"
}

To create a new cluster, define the cluster’s properties in new_cluster :

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/job/create \
--data @create-job.json

create-job.json :
{
"run_name": "my spark task",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10,
"policy_id": "ABCD000000000000"
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}

Request structure of the cluster definition


FIELD NAME  TYPE  DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call. This field is required.
FIELD NAME  TYPE  DESCRIPTION

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute-intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call. This field is required.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

custom_tags ClusterTag An object containing a set of tags for


cluster resources. Databricks tags all
cluster resources (such as VMs) with
these tags in addition to default_tags.

Note :

* Azure Databricks allows at most 43


custom tags.
* If the cluster is created on an
instance pool, the cluster tags are not
copied to the cluster resources. To tag
resources for an instance pool, see the
custom_tags field in the Instance
Pools API 2.0.
FIELD NAME  TYPE  DESCRIPTION

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of scripts can be
specified. The scripts are executed
sequentially in the order provided. If
cluster_log_conf is specified, init
script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.

spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Databricks-managed
environment variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}
FIELD NAME  TYPE  DESCRIPTION

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

driver_instance_pool_id STRING The ID of the instance pool to use for


drivers. You must also specify
instance_pool_id . Refer to Instance
Pools API 2.0 for details.

instance_pool_id STRING The optional ID of the instance pool to


use for cluster nodes. If
driver_instance_pool_id is present,
instance_pool_id is used for worker
nodes only. Otherwise, it is used for
both the driver and the worker nodes.
Refer to Instance Pools API 2.0 for
details.

idempotency_token STRING An optional token that can be used to


guarantee the idempotency of cluster
creation requests. If the idempotency
token is assigned to a cluster that is
not in the TERMINATED state, the
request does not create a new cluster
but instead returns the ID of the
existing cluster. Otherwise, a new
cluster is created. The idempotency
token is cleared when the cluster is
terminated.

If you specify the idempotency token,


upon failure you can retry until the
request succeeds. Azure Databricks will
guarantee that exactly one cluster will
be launched with that idempotency
token.

This token should have at most 64


characters.

apply_policy_default_values BOOL Whether to use policy default values


for missing cluster attributes.

enable_local_disk_encryption BOOL Whether encryption of disks locally


attached to the cluster is enabled.

azure_attributes AzureAttributes Attributes related to clusters running


on Azure. If not specified at cluster
creation, a set of default values is used.
FIELD NAME  TYPE  DESCRIPTION

runtime_engine STRING The type of runtime engine to use. If


not specified, the runtime engine type
is inferred based on the
spark_version value. Allowed values
include:

* PHOTON : Use the Photon runtime


engine type.
* STANDARD : Use the standard
runtime engine type.

This field is optional.

Response structure
FIELD NAME  TYPE  DESCRIPTION

cluster_id STRING Canonical identifier for the cluster.

Edit
ENDPOINT  HTTP METHOD

2.0/clusters/edit POST

Edit the configuration of a cluster to match the provided attributes and size.
You can edit a cluster if it is in a RUNNING or TERMINATED state. If you edit a cluster while it is in a RUNNING state,
it will be restarted so that the new attributes can take effect. If you edit a cluster while it is in a TERMINATED state,
it will remain TERMINATED . The next time it is started using the clusters/start API, the new attributes will take
effect. An attempt to edit a cluster in any other state will be rejected with an INVALID_STATE error code.
Clusters created by the Databricks Jobs service cannot be edited.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/edit \
--data @edit-cluster.json

edit-cluster.json :

{
"cluster_id": "1202-211320-brick1",
"num_workers": 10,
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2"
}

{}

Request structure
FIELD NAME  TYPE  DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_id STRING Canonical identifier for the cluster. This


field is required.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call. This field is required.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}
FIELD NAME  TYPE  DESCRIPTION

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads A list
of available node types can be
retrieved by using the List node types
API call. This field is required.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.


FIELD NAME  TYPE  DESCRIPTION

spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Databricks-managed
environment variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

apply_policy_default_values BOOL Whether to use policy default values


for missing cluster attributes.

enable_local_disk_encryption BOOL Whether encryption of disks locally


attached to the cluster is enabled.

azure_attributes AzureAttributes Attributes related to clusters running


on Azure. If not specified at cluster
creation, a set of default values is used.
FIELD NAME  TYPE  DESCRIPTION

runtime_engine STRING The type of runtime engine to use. If


not specified, the runtime engine type
is inferred based on the
spark_version value. Allowed values
include:

* PHOTON : Use the Photon runtime


engine type.
* STANDARD : Use the standard
runtime engine type.

This field is optional.

Start
ENDPOINT  HTTP METHOD

2.0/clusters/start POST

Start a terminated cluster given its ID. This is similar to createCluster , except:
The terminated cluster ID and attributes are preserved.
The cluster starts with the last specified cluster size. If the terminated cluster is an autoscaling cluster, the
cluster starts with the minimum number of nodes.
If the cluster is in the RESTARTING state, a 400 error is returned.
You cannot start a cluster launched to run a job.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/start \
--data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME  TYPE  DESCRIPTION

cluster_id STRING The cluster to be started. This field is


required.

Restart
ENDPOINT  HTTP METHOD

2.0/clusters/restart POST

Restart a cluster given its ID. The cluster must be in the RUNNING state.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/restart \
--data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME  TYPE  DESCRIPTION

cluster_id STRING The cluster to be restarted. This field is


required.

Resize
ENDPOINT  HTTP METHOD

2.0/clusters/resize POST

Resize a cluster to have a desired number of workers. The cluster must be in the RUNNING state.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/resize \
--data '{ "cluster_id": "1234-567890-reef123", "num_workers": 30 }'

{}

Request structure
FIELD NAME  TYPE  DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.
FIELD NAME  TYPE  DESCRIPTION

cluster_id STRING The cluster to be resized. This field is


required.

Delete (terminate)
ENDPOINT  HTTP METHOD

2.0/clusters/delete POST

Terminate a cluster given its ID. The cluster is removed asynchronously. Once the termination has completed, the
cluster will be in the TERMINATED state. If the cluster is already in a TERMINATING or TERMINATED state, nothing
will happen.
Unless a cluster is pinned, 30 days after the cluster is terminated, it is permanently deleted.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/delete \
--data '{ "cluster_id": "1234-567890-frays123" }'

{}

Request structure
FIELD NAME  TYPE  DESCRIPTION

cluster_id STRING The cluster to be terminated. This field


is required.

Permanent delete
ENDPOINT  HTTP METHOD

2.0/clusters/permanent-delete POST

Permanently delete a cluster. If the cluster is running, it is terminated and its resources are asynchronously
removed. If the cluster is terminated, then it is immediately removed.
You cannot perform any action, including retrieving the cluster’s permissions, on a permanently deleted cluster. A
permanently deleted cluster is also no longer returned in the cluster list.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/permanent-delete \
--data '{ "cluster_id": "1234-567890-frays123" }'

{}
Request structure
FIELD NAME  TYPE  DESCRIPTION

cluster_id STRING The cluster to be permanently deleted.


This field is required.

Get
ENDPOINT  HTTP METHOD

2.0/clusters/get GET

Retrieve the information for a cluster given its identifier. Clusters can be described while they are running or up
to 30 days after they are terminated.
Example

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/get \
--data '{ "cluster_id": "1234-567890-reef123" }' \
| jq .
{
"cluster_id": "1234-567890-reef123",
"driver": {
"node_id": "dced0ce388954c38abef081f54c18afd",
"instance_id": "c69c0b119a2a499d8a2843c4d256136a",
"start_timestamp": 1619718438896,
"host_private_ip": "10.0.0.1",
"private_ip": "10.0.0.2"
},
"spark_context_id": 5631707659504820000,
"jdbc_port": 10000,
"cluster_name": "my-cluster",
"spark_version": "8.2.x-scala2.12",
"node_type_id": "Standard_L4s",
"driver_node_type_id": "Standard_L4s",
"custom_tags": {
"ResourceClass": "SingleNode"
},
"autotermination_minutes": 0,
"enable_elastic_disk": true,
"disk_spec": {},
"cluster_source": "UI",
"enable_local_disk_encryption": false,
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1
},
"instance_source": {
"node_type_id": "Standard_L4s"
},
"driver_instance_source": {
"node_type_id": "Standard_L4s"
},
"state": "RUNNING",
"state_message": "",
"start_time": 1610745129764,
"last_state_loss_time": 1619718513513,
"num_workers": 0,
"cluster_memory_mb": 32768,
"cluster_cores": 4,
"default_tags": {
"Vendor": "Databricks",
"Creator": "someone@example.com",
"ClusterName": "my-cluster",
"ClusterId": "1234-567890-reef123"
},
"creator_user_name": "someone@example.com",
"pinned_by_user_name": "3401478490056118",
"init_scripts_safe_mode": false
}

Request structure
FIELD NAME    TYPE    DESCRIPTION

cluster_id    STRING    The cluster about which to retrieve information. This field is required.

Response structure
FIELD NAME    TYPE    DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_id STRING Canonical identifier for the cluster. This


ID is retained during cluster restarts
and resizes, while each new cluster has
a globally unique ID.

creator_user_name STRING Creator user name. The field won’t be


included in the response if the user has
already been deleted.

driver SparkNode Node on which the Spark driver


resides. The driver node contains the
Spark master and the Databricks
application that manages the per-
notebook Spark REPLs.

executors An array of SparkNode Nodes on which the Spark executors


reside.

spark_context_id INT64 A canonical SparkContext identifier.


This value does change when the
Spark driver restarts. The pair
(cluster_id, spark_context_id) is
a globally unique identifier over all
Spark contexts.

jdbc_port INT32 Port on which Spark JDBC server is


listening in the driver node. No service
will be listening on this port in
executor nodes.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call. This field is required.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

custom_tags ClusterTag An object containing a set of tags for


cluster resources. Databricks tags all
cluster resources with these tags in
addition to default_tags.

Note :

* Tags are not supported on legacy


node types such as compute-
optimized and memory-optimized.
* Databricks allows at most 45 custom
tags.
* If the cluster is created on an
instance pool, the cluster tags are not
copied to the cluster resources. To tag
resources for an instance pool, see the
custom_tags field in the Instance
Pools API 2.0.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.

spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Databricks-managed
environment variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, this cluster will dynamically
acquire additional disk space when its
Spark workers are running low on disk
space. See Autoscaling local storage for
details.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

state ClusterState State of the cluster.

state_message STRING A message associated with the most


recent state transition (for example,
the reason why the cluster entered the
TERMINATED state).

start_time INT64 Time (in epoch milliseconds) when the


cluster creation request was received
(when the cluster entered the
PENDING state).

terminated_time INT64 Time (in epoch milliseconds) when the


cluster was terminated, if applicable.

last_state_loss_time INT64 Time when the cluster driver last lost


its state (due to a restart or driver
failure).

last_activity_time INT64 Time (in epoch milliseconds) when the


cluster was last active. A cluster is
active if there is at least one command
that has not finished on the cluster.
This field is available after the cluster
has reached the RUNNING state.
Updates to this field are made as best-
effort attempts. Certain versions of
Spark do not support reporting of
cluster activity. Refer to Automatic
termination for details.

cluster_memory_mb INT64 Total amount of cluster memory, in


megabytes.

cluster_cores FLOAT Number of CPU cores available for this


cluster. This can be fractional since
certain node types are configured to
share cores between Spark nodes on
the same instance.

default_tags ClusterTag An object containing a set of tags that


are added by Azure Databricks
regardless of any custom_tags,
including:

* Vendor: Databricks
* Creator:
* ClusterName:
* ClusterId:
* Name: On job clusters:

* RunName:
* JobId: On resources used by
Databricks SQL:

* SqlWarehouseId:

cluster_log_status LogSyncStatus Cluster log delivery status.

termination_reason TerminationReason Information about why the cluster was


terminated. This field appears only
when the cluster is in the
TERMINATING or TERMINATED state.

Pin
NOTE
You must be an Azure Databricks administrator to invoke this API.

ENDPOINT    HTTP METHOD

2.0/clusters/pin POST

Ensure that an all-purpose cluster configuration is retained even after a cluster has been terminated for more
than 30 days. Pinning ensures that the cluster is always returned by the List API. Pinning a cluster that is already
pinned has no effect.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/pin \
--data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME    TYPE    DESCRIPTION

cluster_id STRING The cluster to pin. This field is required.

Unpin
NOTE
You must be an Azure Databricks administrator to invoke this API.

ENDPOINT    HTTP METHOD

2.0/clusters/unpin POST

Allows the cluster to eventually be removed from the list returned by the List API. Unpinning a cluster that is not
pinned has no effect.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/unpin \
--data '{ "cluster_id": "1234-567890-reef123" }'

{}

Request structure
FIELD NAME    TYPE    DESCRIPTION

cluster_id    STRING    The cluster to unpin. This field is required.

List
ENDPOINT    HTTP METHOD

2.0/clusters/list GET

Return information about all pinned clusters, active clusters, up to 200 of the most recently terminated all-
purpose clusters in the past 30 days, and up to 30 of the most recently terminated job clusters in the past 30
days. For example, if there is 1 pinned cluster, 4 active clusters, 45 terminated all-purpose clusters in the past 30
days, and 50 terminated job clusters in the past 30 days, then this API returns the 1 pinned cluster, 4 active
clusters, all 45 terminated all-purpose clusters, and the 30 most recently terminated job clusters.
Example

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list \
| jq .
{
"clusters": [
{
"cluster_id": "1234-567890-reef123",
"driver": {
"node_id": "dced0ce388954c38abef081f54c18afd",
"instance_id": "c69c0b119a2a499d8a2843c4d256136a",
"start_timestamp": 1619718438896,
"host_private_ip": "10.0.0.1",
"private_ip": "10.0.0.2"
},
"spark_context_id": 5631707659504820000,
"jdbc_port": 10000,
"cluster_name": "my-cluster",
"spark_version": "8.2.x-scala2.12",
"node_type_id": "Standard_L4s",
"driver_node_type_id": "Standard_L4s",
"custom_tags": {
"ResourceClass": "SingleNode"
},
"autotermination_minutes": 0,
"enable_elastic_disk": true,
"disk_spec": {},
"cluster_source": "UI",
"enable_local_disk_encryption": false,
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1
},
"instance_source": {
"node_type_id": "Standard_L4s"
},
"driver_instance_source": {
"node_type_id": "Standard_L4s"
},
"state": "RUNNING",
"state_message": "",
"start_time": 1610745129764,
"last_state_loss_time": 1619718513513,
"num_workers": 0,
"cluster_memory_mb": 32768,
"cluster_cores": 4,
"default_tags": {
"Vendor": "Databricks",
"Creator": "someone@example.com",
"ClusterName": "my-cluster",
"ClusterId": "1234-567890-reef123"
},
"creator_user_name": "someone@example.com",
"pinned_by_user_name": "3401478490056118",
"init_scripts_safe_mode": false
},
{
"..."
}
]
}

Response structure
FIELD NAME    TYPE    DESCRIPTION

clusters An array of ClusterInfo A list of clusters.


List node types
ENDPOINT    HTTP METHOD

2.0/clusters/list-node-types GET

Return a list of supported Spark node types. These node types can be used to launch a cluster.
Example

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list-node-types \
| jq .

{
"node_types": [
{
"node_type_id": "Standard_L80s_v2",
"memory_mb": 655360,
"num_cores": 80,
"description": "Standard_L80s_v2",
"instance_type_id": "Standard_L80s_v2",
"is_deprecated": false,
"category": "Storage Optimized",
"support_ebs_volumes": true,
"support_cluster_tags": true,
"num_gpus": 0,
"node_instance_type": {
"instance_type_id": "Standard_L80s_v2",
"local_disks": 1,
"local_disk_size_gb": 800,
"instance_family": "Standard LSv2 Family vCPUs",
"local_nvme_disk_size_gb": 1788,
"local_nvme_disks": 10,
"swap_size": "10g"
},
"is_hidden": false,
"support_port_forwarding": true,
"display_order": 0,
"is_io_cache_enabled": true,
"node_info": {
"available_core_quota": 350,
"total_core_quota": 350
}
},
{
"..."
}
]
}

Response structure
FIELD NAME    TYPE    DESCRIPTION

node_types An array of NodeType The list of available Spark node types.

Runtime versions
ENDPOINT    HTTP METHOD

2.0/clusters/spark-versions GET

Return the list of available runtime versions. These versions can be used to launch a cluster.
Example

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/spark-versions \
| jq .

{
"versions": [
{
"key": "8.2.x-scala2.12",
"name": "8.2 (includes Apache Spark 3.1.1, Scala 2.12)"
},
{
"..."
}
]
}

Response structure
FIELD NAME    TYPE    DESCRIPTION

versions An array of SparkVersion All the available runtime versions.

Events
ENDPOINT    HTTP METHOD

2.0/clusters/events POST

Retrieve a list of events about the activity of a cluster. You can retrieve events from active clusters (running,
pending, or reconfiguring) and terminated clusters within 30 days of their last termination. This API is paginated.
If there are more events to read, the response includes all the parameters necessary to request the next page of
events.
Example:

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/events \
--data @list-events.json \
| jq .

list-events.json :
{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 5,
"limit": 5,
"event_type": "RUNNING"
}

{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1619471498409,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5
},
"total_count": 25
}

Example request to retrieve the next page of events:

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/events \
--data @list-events.json \
| jq .

list-events.json :

{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5,
"event_type": "RUNNING"
}
{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1618330776302,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 15,
"limit": 5
},
"total_count": 25
}

Request structure
Retrieve events pertaining to a specific cluster.

FIELD NAME    TYPE    DESCRIPTION

cluster_id    STRING    The ID of the cluster to retrieve events about. This field is required.

start_time    INT64    The start time in epoch milliseconds. If empty, returns events starting from the beginning of time.

end_time    INT64    The end time in epoch milliseconds. If empty, returns events up to the current time.

order    ListOrder    The order to list events in; either ASC or DESC. Defaults to DESC.

event_types    An array of ClusterEventType    An optional set of event types to filter on. If empty, all event types are returned.

offset    INT64    The offset in the result set. Defaults to 0 (no offset). When an offset is specified and the results are requested in descending order, the end_time field is required.

limit    INT64    The maximum number of events to include in a page of events. Defaults to 50, and maximum allowed value is 500.
Response structure
FIELD NAME    TYPE    DESCRIPTION

events    An array of ClusterEvent    This list of matching events.

next_page    Request structure    The parameters required to retrieve the next page of events. Omitted if there are no more events to read.

total_count    INT64    The total number of events filtered by the start_time, end_time, and event_types.

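Rather than issuing each follow-up request by hand, a client can loop until next_page is omitted. The following is a minimal sketch, assuming the same DATABRICKS_HOST / DATABRICKS_TOKEN environment variables and Python requests setup used in the earlier sketches.

import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Start from an initial request body; each response's next_page is the body
# for the following request, and is omitted on the last page.
body = {"cluster_id": "1234-567890-reef123", "order": "DESC", "limit": 50}
events = []
while body is not None:
    page = requests.post(f"{host}/api/2.0/clusters/events",
                         headers=headers, json=body).json()
    events.extend(page.get("events", []))
    body = page.get("next_page")

print(f"retrieved {len(events)} events")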
Data structures
In this section:
AutoScale
ClusterInfo
ClusterEvent
ClusterEventType
EventDetails
ClusterAttributes
ClusterSize
ListOrder
ResizeCause
ClusterLogConf
InitScriptInfo
ClusterTag
DbfsStorageInfo
FileStorageInfo
DockerImage
DockerBasicAuth
LogSyncStatus
NodeType
ClusterCloudProviderNodeInfo
ClusterCloudProviderNodeStatus
ParameterPair
SparkConfPair
SparkEnvPair
SparkNode
SparkVersion
TerminationReason
PoolClusterTerminationCode
ClusterSource
ClusterState
TerminationCode
TerminationType
TerminationParameter
AzureAttributes
AzureAvailability
AutoScale
Range defining the min and max number of cluster workers.

FIELD NAME    TYPE    DESCRIPTION

min_workers    INT32    The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation.

max_workers    INT32    The maximum number of workers to which the cluster can scale up when overloaded. max_workers must be strictly greater than min_workers.

ClusterInfo
Metadata about a cluster.

FIELD NAME    TYPE    DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

Note: When reading the properties of


a cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field will
immediately be updated to reflect the
target size of 10 workers, whereas the
workers listed in executors will
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

cluster_id STRING Canonical identifier for the cluster. This


ID is retained during cluster restarts
and resizes, while each new cluster has
a globally unique ID.

creator_user_name STRING Creator user name. The field won’t be


included in the response if the user has
already been deleted.

driver SparkNode Node on which the Spark driver


resides. The driver node contains the
Spark master and the Databricks
application that manages the per-
notebook Spark REPLs.

executors An array of SparkNode Nodes on which the Spark executors


reside.

spark_context_id INT64 A canonical SparkContext identifier.


This value does change when the
Spark driver restarts. The pair
(cluster_id, spark_context_id) is
a globally unique identifier over all
Spark contexts.

jdbc_port INT32 Port on which Spark JDBC server is


listening in the driver node. No service
will be listening on this port in
executor nodes.

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster. You


can retrieve a list of available runtime
versions by using the Runtime versions
API call.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.

spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

To specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Databricks-managed
environment variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, this cluster will dynamically
acquire additional disk space when its
Spark workers are running low on disk
space. See Autoscaling local storage for
details.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

state ClusterState State of the cluster.

state_message STRING A message associated with the most


recent state transition (for example,
the reason why the cluster entered a
TERMINATED state).

start_time INT64 Time (in epoch milliseconds) when the


cluster creation request was received
(when the cluster entered a PENDING
state).

terminated_time INT64 Time (in epoch milliseconds) when the


cluster was terminated, if applicable.

last_state_loss_time INT64 Time when the cluster driver last lost


its state (due to a restart or driver
failure).

last_activity_time INT64 Time (in epoch milliseconds) when the


cluster was last active. A cluster is
active if there is at least one command
that has not finished on the cluster.
This field is available after the cluster
has reached a RUNNING state.
Updates to this field are made as best-
effort attempts. Certain versions of
Spark do not support reporting of
cluster activity. Refer to Automatic
termination for details.

cluster_memory_mb INT64 Total amount of cluster memory, in


megabytes.

cluster_cores FLOAT Number of CPU cores available for this


cluster. This can be fractional since
certain node types are configured to
share cores between Spark nodes on
the same instance.

default_tags ClusterTag An object containing a set of tags that


are added by Azure Databricks
regardless of any custom_tags,
including:

* Vendor: Databricks
* Creator:
* ClusterName:
* ClusterId:
* Name: On job clusters:

* RunName:
* JobId: On resources used by
Databricks SQL:

* SqlWarehouseId:

cluster_log_status LogSyncStatus Cluster log delivery status.

termination_reason TerminationReason Information about why the cluster was


terminated. This field only appears
when the cluster is in a TERMINATING
or TERMINATED state.

ClusterEvent
Cluster event information.

FIELD NAME    TYPE    DESCRIPTION

cluster_id    STRING    Canonical identifier for the cluster. This field is required.

timestamp    INT64    The timestamp when the event occurred, stored as the number of milliseconds since the unix epoch. Assigned by the Timeline service.

type    ClusterEventType    The event type. This field is required.

details    EventDetails    The event details. This field is required.

ClusterEventType
Type of a cluster event.

EVENT TYPE    DESCRIPTION

CREATING Indicates that the cluster is being created.



DID_NOT_EXPAND_DISK Indicates that a disk is low on space, but adding disks would
put it over the max capacity.

EXPANDED_DISK Indicates that a disk was low on space and the disks were
expanded.

FAILED_TO_EXPAND_DISK Indicates that a disk was low on space and disk space could
not be expanded.

INIT_SCRIPTS_STARTING Indicates that the cluster scoped init script has started.

INIT_SCRIPTS_FINISHED Indicates that the cluster scoped init script has finished.

STARTING Indicates that the cluster is being started.

RESTARTING Indicates that the cluster is being restarted.

TERMINATING Indicates that the cluster is being terminated.

EDITED Indicates that the cluster has been edited.

RUNNING Indicates the cluster has finished being created. Includes the
number of nodes in the cluster and a failure reason if some
nodes could not be acquired.

RESIZING Indicates a change in the target size of the cluster (upsize or


downsize).

UPSIZE_COMPLETED Indicates that nodes finished being added to the cluster.


Includes the number of nodes in the cluster and a failure
reason if some nodes could not be acquired.

NODES_LOST Indicates that some nodes were lost from the cluster.

DRIVER_HEALTHY Indicates that the driver is healthy and the cluster is ready
for use.

DRIVER_UNAVAILABLE Indicates that the driver is unavailable.

SPARK_EXCEPTION Indicates that a Spark exception was thrown from the driver.

DRIVER_NOT_RESPONDING Indicates that the driver is up but is not responsive, likely


due to GC.

DBFS_DOWN Indicates that the driver is up but DBFS is down.

METASTORE_DOWN Indicates that the driver is up but the metastore is down.

NODE_BLACKLISTED Indicates that a node is not allowed by Spark.

PINNED Indicates that the cluster was pinned.



UNPINNED Indicates that the cluster was unpinned.

EventDetails
Details about a cluster event.

FIELD NAME    TYPE    DESCRIPTION

current_num_workers INT32 The number of nodes in the cluster.

target_num_workers INT32 The targeted number of nodes in the


cluster.

previous_attributes ClusterAttributes The cluster attributes before a cluster


was edited.

attributes ClusterAttributes * For created clusters, the attributes of


the cluster.
* For edited clusters, the new
attributes of the cluster.

previous_cluster_size ClusterSize The size of the cluster before an edit or


resize.

cluster_size ClusterSize The cluster size that was set in the


cluster creation or edit.

cause ResizeCause The cause of a change in target size.

reason TerminationReason A termination reason:

* On a TERMINATED event, the reason


for the termination.
* On a RESIZE_COMPLETE event,
indicates the reason that we failed to
acquire some nodes.

user STRING The user that caused the event to


occur. (Empty if it was done by Azure
Databricks.)

ClusterAttributes
Common set of attributes set during cluster creation. These attributes cannot be changed over the lifetime of a
cluster.

FIELD NAME    TYPE    DESCRIPTION

cluster_name STRING Cluster name requested by the user.


This doesn’t have to be unique. If not
specified at creation, the cluster name
will be an empty string.

spark_version STRING The runtime version of the cluster, for


example “5.0.x-scala2.11”. You can
retrieve a list of available runtime
versions by using the Runtime versions
API call.

spark_conf SparkConfPair An object containing a set of optional,


user-specified Spark configuration key-
value pairs. You can also pass in a
string of extra JVM options to the
driver and the executors via
spark.driver.extraJavaOptions
and
spark.executor.extraJavaOptions
respectively.

Example Spark confs:


{"spark.speculation": true,
"spark.streaming.ui.retainedBatches":
5}
or
{"spark.driver.extraJavaOptions":
"-verbose:gc -
XX:+PrintGCDetails"}

node_type_id STRING This field encodes, through a single


value, the resources available to each
of the Spark nodes in this cluster. For
example, the Spark nodes can be
provisioned and optimized for memory
or compute intensive workloads. A list
of available node types can be
retrieved by using the List node types
API call.

driver_node_type_id STRING The node type of the Spark driver. This


field is optional; if unset, the driver
node type will be set as the same value
as node_type_id defined above.

ssh_public_keys An array of STRING SSH public key contents that will be


added to each Spark node in this
cluster. The corresponding private keys
can be used to login with the user
name ubuntu on port 2200 . Up to
10 keys can be specified.

custom_tags ClusterTag An object containing a set of tags for


cluster resources. Databricks tags all
cluster resources with these tags in
addition to default_tags.

Note :

* Tags are not supported on legacy


node types such as compute-
optimized and memory-optimized.
* Databricks allows at most 45 custom
tags.
* If the cluster is created on an
instance pool, the cluster tags are not
copied to the cluster resources. To tag
resources for an instance pool, see the
custom_tags field in the Instance
Pools API 2.0.

cluster_log_conf ClusterLogConf The configuration for delivering Spark


logs to a long-term storage
destination. Only one destination can
be specified for one cluster. If the conf
is given, the logs will be delivered to
the destination every
5 mins . The destination of driver logs
is
<destination>/<cluster-
ID>/driver
, while the destination of executor logs
is
<destination>/<cluster-
ID>/executor
.

init_scripts An array of InitScriptInfo The configuration for storing init


scripts. Any number of destinations
can be specified. The scripts are
executed sequentially in the order
provided. If cluster_log_conf is
specified, init script logs are sent to
<destination>/<cluster-
ID>/init_scripts
.

docker_image DockerImage Docker image for a custom container.



spark_env_vars SparkEnvPair An object containing a set of optional,


user-specified environment variable
key-value pairs. Key-value pairs of the
form (X,Y) are exported as is (that is,
export X='Y' ) while launching the
driver and workers.

In order to specify an additional set of


SPARK_DAEMON_JAVA_OPTS , we
recommend appending them to
$SPARK_DAEMON_JAVA_OPTS as shown
in the following example. This ensures
that all default Databricks-managed
environment variables are included
as well.

Example Spark environment variables:


{"SPARK_WORKER_MEMORY":
"28000m", "SPARK_LOCAL_DIRS":
"/local_disk0"}
or
{"SPARK_DAEMON_JAVA_OPTS":
"$SPARK_DAEMON_JAVA_OPTS -
Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster


after it is inactive for this time in
minutes. If not set, this cluster will not
be automatically terminated. If
specified, the threshold must be
between 10 and 10000 minutes. You
can also set this value to 0 to explicitly
disable automatic termination.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, this cluster will dynamically
acquire additional disk space when its
Spark workers are running low on disk
space. See Autoscaling local storage for
details.

instance_pool_id STRING The optional ID of the instance pool to


which the cluster belongs. Refer to
Pools for details.

cluster_source ClusterSource Determines whether the cluster was


created by a user through the UI,
created by the Databricks Jobs
scheduler, or through an API request.

policy_id STRING A cluster policy ID.

azure_attributes AzureAttributes Defines attributes such as the instance


availability type, node placement, and
max bid price. If not specified during
cluster creation, a set of default values
is used.

ClusterSize
Cluster size specification.

FIELD NAME    TYPE    DESCRIPTION

num_workers OR autoscale INT32 OR AutoScale If num_workers, number of worker


nodes that this cluster should have. A
cluster has one Spark driver and
num_workers executors for a total of
num_workers + 1 Spark nodes.

When reading the properties of a


cluster, this field reflects the desired
number of workers rather than the
actual number of workers. For
instance, if a cluster is resized from 5
to 10 workers, this field is updated to
reflect the target size of 10 workers,
whereas the workers listed in executors
gradually increase from 5 to 10 as the
new nodes are provisioned.

If autoscale, parameters needed in


order to automatically scale clusters up
and down based on load.

ListOrder
Generic ordering enum for list-based queries.

ORDER    DESCRIPTION

DESC Descending order.

ASC Ascending order.

ResizeCause
Reason why a cluster was resized.

CAUSE    DESCRIPTION

AUTOSCALE Automatically resized based on load.

USER_REQUEST User requested a new size.

AUTORECOVERY Autorecovery monitor resized the cluster after it lost a node.

ClusterLogConf
Path to cluster log.

FIELD NAME    TYPE    DESCRIPTION

dbfs    DbfsStorageInfo    DBFS location of cluster log. Destination must be provided. For example, { "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }

InitScriptInfo
Path to an init script. For instructions on using init scripts with Databricks Container Services, see Use an init
script.

NOTE
The file storage type is only available for clusters set up using Databricks Container Services.

FIELD NAME    TYPE    DESCRIPTION

dbfs OR file    DbfsStorageInfo OR FileStorageInfo    DBFS location of init script. Destination must be provided. For example, { "dbfs" : { "destination" : "dbfs:/home/init_script" } }
                                                      File location of init script. Destination must be provided. For example, { "file" : { "destination" : "file:/my/local/file.sh" } }

ClusterTag
Cluster tag definition.

TYPE    DESCRIPTION

STRING    The key of the tag. The key must:
          * Be between 1 and 512 characters long
          * Not contain any of the characters <>%*&+?\\/
          * Not begin with azure, microsoft, or windows

STRING    The value of the tag. The value length must be less than or equal to 256 UTF-8 characters.

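As an illustration of those constraints, the following sketch performs a client-side check of a custom tag key before sending it. It is not part of the API; the service applies its own validation regardless.

FORBIDDEN_CHARS = set("<>%*&+?\\/")
RESERVED_PREFIXES = ("azure", "microsoft", "windows")

def is_valid_tag_key(key: str) -> bool:
    """Check a custom tag key against the documented ClusterTag constraints."""
    return (
        1 <= len(key) <= 512
        and not any(c in FORBIDDEN_CHARS for c in key)
        and not key.lower().startswith(RESERVED_PREFIXES)
    )

assert is_valid_tag_key("CostCenter")
assert not is_valid_tag_key("azure-team")   # reserved prefix
assert not is_valid_tag_key("env?prod")     # forbidden character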
DbfsStorageInfo
DBFS storage information.

FIELD NAME    TYPE    DESCRIPTION

destination    STRING    DBFS destination. Example: dbfs:/my/path

FileStorageInfo
File storage information.

NOTE
This location type is only available for clusters set up using Databricks Container Services.

FIELD NAME    TYPE    DESCRIPTION

destination    STRING    File destination. Example: file:/my/file.sh

DockerImage
Docker image connection information.

FIELD    TYPE    DESCRIPTION

url    string    URL for the Docker image.

basic_auth    DockerBasicAuth    Basic authentication information for Docker repository.

DockerBasicAuth
Docker repository basic authentication information.

FIELD    DESCRIPTION

username User name for the Docker repository.

password Password for the Docker repository.

LogSyncStatus
Log delivery status.

FIELD NAME    TYPE    DESCRIPTION

last_attempted    INT64    The timestamp of the last attempt. If the last attempt fails, last_exception contains the exception from that attempt.

last_exception    STRING    The exception thrown in the last attempt. It is null (omitted in the response) if there was no exception in the last attempt.

NodeType
Description of a Spark node type including both the dimensions of the node and the instance type on which it
will be hosted.

FIELD NAME    TYPE    DESCRIPTION

node_type_id STRING Unique identifier for this node type.


This field is required.

memory_mb INT32 Memory (in MB) available for this node


type. This field is required.

num_cores FLOAT Number of CPU cores available for this


node type. This can be fractional if the
number of cores on a machine
instance is not divisible by the number
of Spark nodes on that machine. This
field is required.

description STRING A string description associated with


this node type. This field is required.

instance_type_id STRING An identifier for the type of hardware


that this node runs on. This field is
required.

is_deprecated BOOL Whether the node type is deprecated.


Non-deprecated node types offer
greater performance.

node_info ClusterCloudProviderNodeInfo Node type info reported by the cloud


provider.

ClusterCloudProviderNodeInfo
Information about an instance supplied by a cloud provider.

FIELD NAME    TYPE    DESCRIPTION

status    ClusterCloudProviderNodeStatus    Status as reported by the cloud provider.

available_core_quota    INT32    Available CPU core quota.

total_core_quota    INT32    Total CPU core quota.

ClusterCloudProviderNodeStatus
Status of an instance supplied by a cloud provider.

STATUS    DESCRIPTION

NotEnabledOnSubscription Node type not available for subscription.

NotAvailableInRegion Node type not available in region.

ParameterPair
Parameter that provides additional information about why a cluster was terminated.

TYPE    DESCRIPTION

TerminationParameter Type of termination information.

STRING The termination information.

SparkConfPair
Spark configuration key-value pairs.

TYPE    DESCRIPTION

STRING A configuration property name.

STRING The configuration property value.

SparkEnvPair
Spark environment variable key-value pairs.

IMPORTANT
When specifying environment variables in a job cluster, the fields in this data structure accept only Latin characters (ASCII
character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese,
Japanese kanjis, and emojis.

TYPE    DESCRIPTION

STRING An environment variable name.

STRING The environment variable value.

SparkNode
Spark driver or executor configuration.

FIELD NAME    TYPE    DESCRIPTION

private_ip STRING Private IP address (typically a 10.x.x.x


address) of the Spark node. This is
different from the private IP address of
the host instance.

public_dns STRING Public DNS address of this node. This


address can be used to access the
Spark JDBC server on the driver node.

node_id STRING Globally unique identifier for this node.

instance_id STRING Globally unique identifier for the host


instance from the cloud provider.

start_timestamp INT64 The timestamp (in milliseconds) when


the Spark node is launched.

host_private_ip STRING The private IP address of the host


instance.

SparkVersion
Databricks Runtime version of the cluster.

FIELD NAME    TYPE    DESCRIPTION

key STRING Databricks Runtime version key, for


example 7.3.x-scala2.12 . The value
that should be provided as the
spark_version when creating a new
cluster. The exact runtime version may
change over time for a “wildcard”
version (that is, 7.3.x-scala2.12 is a
“wildcard” version) with minor bug
fixes.

name STRING A descriptive name for the runtime


version, for example “Databricks
Runtime 7.3 LTS”.

TerminationReason
Reason why a cluster was terminated.

FIELD NAME    TYPE    DESCRIPTION

code    TerminationCode    Status code indicating why a cluster was terminated.

type    TerminationType    Reason indicating why a cluster was terminated.

parameters    ParameterPair    Object containing a set of parameters that provide information about why a cluster was terminated.

PoolClusterTerminationCode
Status code indicating why the cluster was terminated due to a pool failure.

CODE    DESCRIPTION

INSTANCE_POOL_MAX_CAPACITY_FAILURE The pool max capacity has been reached.

INSTANCE_POOL_NOT_FOUND_FAILURE The pool specified by the cluster is no longer active or


doesn’t exist.

ClusterSource
Service that created the cluster.

SERVICE    DESCRIPTION

UI Cluster created through the UI.

JOB Cluster created by the Databricks job scheduler.

API Cluster created through an API call.

ClusterState
State of a cluster. The allowable state transitions are as follows:
PENDING -> RUNNING
PENDING -> TERMINATING
RUNNING -> RESIZING
RUNNING -> RESTARTING
RUNNING -> TERMINATING
RESTARTING -> RUNNING
RESTARTING -> TERMINATING
RESIZING -> RUNNING
RESIZING -> TERMINATING
TERMINATING -> TERMINATED

STATE    DESCRIPTION

PENDING Indicates that a cluster is in the process of being created.

RUNNING Indicates that a cluster has been started and is ready for use.

RESTARTING Indicates that a cluster is in the process of restarting.

RESIZING Indicates that a cluster is in the process of adding or


removing nodes.

TERMINATING Indicates that a cluster is in the process of being destroyed.

TERMINATED Indicates that a cluster has been successfully destroyed.

ERROR This state is no longer used. It was used to indicate a cluster


that failed to be created.
TERMINATING and TERMINATED are used instead.

UNKNOWN Indicates that a cluster is in an unknown state. A cluster


should never be in this state.

TerminationCode
Status code indicating why the cluster was terminated.

CODE    DESCRIPTION

USER_REQUEST A user terminated the cluster directly. Parameters should


include a username field that indicates the specific user who
terminated the cluster.

JOB_FINISHED The cluster was launched by a job, and terminated when the
job completed.

INACTIVITY The cluster was terminated since it was idle.

CLOUD_PROVIDER_SHUTDOWN The instance that hosted the Spark driver was terminated by
the cloud provider.

COMMUNICATION_LOST Azure Databricks lost connection to services on the driver


instance. For example, this can happen when problems arise
in cloud networking infrastructure, or when the instance
itself becomes unhealthy.

CLOUD_PROVIDER_LAUNCH_FAILURE Azure Databricks experienced a cloud provider failure when


requesting instances to launch clusters.

SPARK_STARTUP_FAILURE The cluster failed to initialize. Possible reasons may include


failure to create the environment for Spark or issues
launching the Spark master and worker processes.

INVALID_ARGUMENT Cannot launch the cluster because the user specified an


invalid argument. For example, the user might specify an
invalid runtime version for the cluster.

UNEXPECTED_LAUNCH_FAILURE While launching this cluster, Azure Databricks failed to


complete critical setup steps, terminating the cluster.

INTERNAL_ERROR Azure Databricks encountered an unexpected error that


forced the running cluster to be terminated. Contact Azure
Databricks support for additional details.

SPARK_ERROR The Spark driver failed to start. Possible reasons may include
incompatible libraries and initialization scripts that corrupted
the Spark container.

METASTORE_COMPONENT_UNHEALTHY The cluster failed to start because the external metastore


could not be reached. Refer to Troubleshooting.

DBFS_COMPONENT_UNHEALTHY The cluster failed to start because Databricks File System


(DBFS) could not be reached.

AZURE_RESOURCE_PROVIDER_THROTTLING Azure Databricks reached the Azure Resource Provider


request limit. Specifically, the API request rate to the specific
resource type (compute, network, etc.) can’t exceed the limit.
Retry might help to resolve the issue. For further
information, see https://docs.microsoft.com/azure/virtual-
machines/troubleshooting/troubleshooting-throttling-errors.

AZURE_RESOURCE_MANAGER_THROTTLING Azure Databricks reached the Azure Resource Manager


request limit which will prevent the Azure SDK from issuing
any read or write request to the Azure Resource Manager.
The request limit is applied to each subscription every hour.
Retry after an hour or changing to a smaller cluster size
might help to resolve the issue. For further information,
see https://docs.microsoft.com/azure/azure-resource-
manager/resource-manager-request-limits.

NETWORK_CONFIGURATION_FAILURE The cluster was terminated due to an error in the network


configuration. For example, a workspace with VNet injection
had incorrect DNS settings that blocked access to worker
artifacts.

DRIVER_UNREACHABLE Azure Databricks was not able to access the Spark driver,
because it was not reachable.

DRIVER_UNRESPONSIVE Azure Databricks was not able to access the Spark driver,
because it was unresponsive.

INSTANCE_UNREACHABLE Azure Databricks was not able to access instances in order to


start the cluster. This can be a transient networking issue. If
the problem persists, this usually indicates a networking
environment misconfiguration.

CONTAINER_LAUNCH_FAILURE Azure Databricks was unable to launch containers on worker


nodes for the cluster. Have your admin check your network
configuration.

INSTANCE_POOL_CLUSTER_FAILURE Pool backed cluster specific failure. See Pools for details.

REQUEST_REJECTED Azure Databricks cannot handle the request at this moment.


Try again later and contact Azure Databricks if the problem
persists.

INIT_SCRIPT_FAILURE Azure Databricks cannot load and run a cluster-scoped init


script on one of the cluster’s nodes, or the init script
terminates with a non-zero exit code. See Init script logs.

TRIAL_EXPIRED The Azure Databricks trial subscription expired.

BOOTSTRAP_TIMEOUT The cluster failed to start because of user network


configuration issues. Possible reasons include
misconfiguration of firewall settings, UDR entries, DNS, or
route tables.

TerminationType
Reason why the cluster was terminated.

TYPE    DESCRIPTION

SUCCESS Termination succeeded.

CLIENT_ERROR Non-retriable. Client must fix parameters before


reattempting the cluster creation.

SERVICE_FAULT Azure Databricks service issue. Client can retry.

CLOUD_FAILURE Cloud provider infrastructure issue. Client can retry after the
underlying issue is resolved.

TerminationParameter
Key that provides additional information about why a cluster was terminated.

KEY    DESCRIPTION

username The username of the user who terminated the cluster.

databricks_error_message Additional context that may explain the reason for cluster
termination.

inactivity_duration_min An idle cluster was shut down after being inactive for this
duration.

instance_id The ID of the instance that was hosting the Spark driver.

azure_error_code The Azure provided error code describing why cluster nodes
could not be provisioned. For reference, see:
https://docs.microsoft.com/azure/virtual-
machines/windows/error-messages.

azure_error_message Human-readable context of various failures from Azure. This


field is unstructured, and its exact format is subject to
change.

instance_pool_id The ID of the instance pool the cluster is using.

instance_pool_error_code The error code for cluster failures specific to a pool.

AzureAttributes
Attributes set during cluster creation related to Azure.

FIELD NAME    TYPE    DESCRIPTION

first_on_demand INT32 The first first_on_demand nodes of


the cluster will be placed on on-
demand instances. This value must be
greater than 0, or else cluster creation
validation fails. If this value is greater
than or equal to the current cluster
size, all nodes will be placed on on-
demand instances. If this value is less
than the current cluster size,
first_on_demand nodes will be
placed on on-demand instances and
the remainder will be placed on
availability instances. This value does
not affect cluster size and cannot be
mutated over the lifetime of a cluster.

availability AzureAvailability Availability type used for all


subsequent nodes past the
first_on_demand ones.

spot_bid_max_price DOUBLE The max bid price used for Azure spot
instances. You can set this to greater
than or equal to the current spot price.
You can also set this to -1 (the default),
which specifies that the instance
cannot be evicted on the basis of price.
The price for the instance will be the
current price for spot instances or the
price for a standard instance. You can
view historical pricing and eviction
rates in the Azure portal.

AzureAvailability
The Azure instance availability type behavior.

TYPE    DESCRIPTION

SPOT_AZURE Use spot instances.

ON_DEMAND_AZURE Use on-demand instances.



SPOT_WITH_FALLBACK_AZURE Preferably use spot instances, but fall back to on-demand


instances if spot instances cannot be acquired (for example, if
Azure spot prices are too high or out of quota). Does not
apply to pool availability.
Cluster Policies API 2.0

IMPORTANT
This feature is in Public Preview.

A cluster policy limits the ability to create clusters based on a set of rules. The policy rules limit the attributes or
attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users and
groups.
Only admin users can create, edit, and delete policies. Admin users also have access to all policies.
For requirements and limitations on cluster policies, see Manage cluster policies.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Cluster Policies API


The Cluster Policies API allows you to create, list, and edit cluster policies. Creation and editing are available to
admins only. Listing can be performed by any user and is limited to policies accessible by that user.

IMPORTANT
The Cluster Policies API requires a policy JSON definition to be passed within a JSON request in stringified form. In most
cases this requires escaping of the quote characters.

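One way to avoid hand-escaping the quotes is to build the definition as a native object and serialize it with a JSON library. The following is a minimal sketch in Python, assuming the requests library and DATABRICKS_HOST / DATABRICKS_TOKEN environment variables; the policy content mirrors the Create example later in this article.

import json
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

definition = {
    "spark_conf.spark.databricks.cluster.profile": {"type": "forbidden", "hidden": True}
}

# json.dumps produces the stringified, quote-escaped definition the API expects.
resp = requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers=headers,
    json={"name": "Test policy", "definition": json.dumps(definition)},
)
print(resp.json()["policy_id"])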
In this section:
Get
List
Create
Edit
Delete
Data structures
Get
ENDPOINT    HTTP METHOD

2.0/policies/clusters/get GET

Return a policy specification given a policy ID.


Example
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/get \
--data '{ "policy_id": "ABCD000000000000" }' \
| jq .

{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":
{\"type\":\"forbidden\",\"hidden\":true}}",
"created_at_timestamp": 1600000000000
}

Request structure

FIELD NAME    TYPE    DESCRIPTION

policy_id    STRING    The policy ID about which to retrieve information.

Response structure

FIELD NAME    TYPE    DESCRIPTION

policy_id STRING Canonical unique identifier for the


cluster policy.

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. The JSON
document must be passed as a string
and cannot be simply embedded in the
requests.

created_at_timestamp INT64 Creation time. The timestamp (in


millisecond) when this cluster policy
was created.

List
ENDPOINT    HTTP METHOD

2.0/policies/clusters/list GET

Return a list of policies accessible by the requesting user.


Example

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/list \
--data '{ "sort_order": "DESC", "sort_column": "POLICY_CREATION_TIME" }' \
| jq .
{
"policies": [
{
"policy_id": "ABCD000000000001",
"name": "Empty",
"definition": "{}",
"created_at_timestamp": 1600000000002
},
{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":
{\"type\":\"forbidden\",\"hidden\":true}}",
"created_at_timestamp": 1600000000000
}
],
"total_count": 2
}

Request structure

FIELD NAME    TYPE    DESCRIPTION

sort_order    ListOrder    The order direction to list the policies in; either ASC or DESC. Defaults to DESC.

sort_column    PolicySortColumn    The ClusterPolicy attribute to sort by. Defaults to POLICY_CREATION_TIME.

Response structure

FIELD NAME    TYPE    DESCRIPTION

policies An array of Policy List of policies.

total_count INT64 The total number of policies.

Create
ENDPOINT    HTTP METHOD

2.0/policies/clusters/create POST

Create a new policy with a given name and definition.


Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/create \
--data @create-cluster-policy.json

create-cluster-policy.json :
{
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}

{ "policy_id": "ABCD000000000000" }

Request structure

FIELD NAME    TYPE    DESCRIPTION

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. You must pass
the JSON document as a string; it
cannot be simply embedded in the
requests.

Response structure

FIELD NAME    TYPE    DESCRIPTION

policy_id    STRING    Canonical unique identifier for the cluster policy.

Edit
ENDPOINT    HTTP METHOD

2.0/policies/clusters/edit POST

Update an existing policy. This may make some clusters governed by this policy invalid. For such clusters the
next cluster edit must provide a confirming configuration, but otherwise they can continue to run.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/edit \
--data @edit-cluster-policy.json

edit-cluster-policy.json :

{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}

{}

Request structure
FIELD NAME    TYPE    DESCRIPTION

policy_id STRING The ID of the policy to update. This


field is required.

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. You must pass
the JSON document as a string; it
cannot be simply embedded in the
requests.

Delete
ENDPOINT    HTTP METHOD

2.0/policies/clusters/delete POST

Delete a policy. Clusters governed by this policy can still run, but cannot be edited.
Example

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/policies/clusters/delete \
--data '{ "policy_id": "ABCD000000000000" }'

{}

Request structure

FIELD NAME    TYPE    DESCRIPTION

policy_id    STRING    The ID of the policy to delete. This field is required.

Data structures
In this section:
Policy
PolicySortColumn
Policy
A cluster policy entity.

FIELD NAME    TYPE    DESCRIPTION

policy_id STRING Canonical unique identifier for the


cluster policy.

name STRING Cluster policy name. This must be


unique. Length must be between 1
and 100 characters.

definition STRING Policy definition JSON document


expressed in Databricks Policy
Definition Language. You must pass
the JSON document as a string; it
cannot be simply embedded in the
requests.

creator_user_name STRING Creator user name. The field won’t be


included in the response if the user has
already been deleted.

created_at_timestamp INT64 Creation time. The timestamp (in


millisecond) when this cluster policy
was created.

PolicySortColumn
The sort order for the ListPolicies request.

NAME    DESCRIPTION

POLICY_CREATION_TIME Sort result list by policy creation time.

POLICY_NAME Sort result list by policy name.

Cluster Policy Permissions API


The Cluster Policy Permissions API enables you to set permissions on a cluster policy. When you grant CAN_USE
permission on a policy to a user, the user will be able to create new clusters based on it. A user does not need
the cluster_create permission to create new clusters.
Only admin users can set permissions on cluster policies.
In this section:
Get permissions
Get permission levels
Add or modify permissions
Set or delete permissions
Data structures
Get permissions
ENDPOINT                                                       HTTP METHOD

2.0/preview/permissions/cluster-policies/<clusterPolicyId>     GET

Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-policies/ABCD000000000000 \
  | jq .
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "someone@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/cluster-policies"
]
}
]
}
]
}

Request structure

FIELD NAME | TYPE | DESCRIPTION
clusterPolicyId | STRING | The policy about which to retrieve permissions. This field is required.

Response structure
A Clusters ACL.
Get permission levels
ENDPOINT | HTTP METHOD
2.0/preview/permissions/cluster-policies/<clusterPolicyId>/permissionLevels | GET

Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-policies/ABCD000000000000/permissionLevels \
  | jq .

{
"permission_levels": [
{
"permission_level": "CAN_USE",
"description": "Can use the policy"
}
]
}
Request structure

FIELD NAME | TYPE | DESCRIPTION
clusterPolicyId | STRING | The policy about which to retrieve permission levels. This field is required.

Response structure
An array of PermissionLevel with associated description.
Add or modify permissions
ENDPOINT | HTTP METHOD
2.0/preview/permissions/cluster-policies/<clusterPolicyId> | PATCH

Example

curl --netrc -X PATCH \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-policies/ABCD000000000000 \
  --data @add-cluster-policy-permissions.json \
  | jq .

add-cluster-policy-permissions.json :

{
"access_control_list": [
{
"user_name": "someone-else@example.com",
"permission_level": "CAN_USE"
}
]
}
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "mary@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"user_name": "someone-else@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}

Request structure

FIELD NAME | TYPE | DESCRIPTION
clusterPolicyId | STRING | The policy about which to modify permissions. This field is required.

Request body

FIELD NAME | TYPE | DESCRIPTION
access_control_list | Array of AccessControl | An array of access control lists.

Response body
A Clusters ACL.
Set or delete permissions
A PUT request replaces all direct permissions on the cluster policy object. To delete permissions, first make a GET request to retrieve the current list of permissions, then make a PUT request that omits the entries you want to remove.
ENDPOINT | HTTP METHOD
2.0/preview/permissions/cluster-policies/<clusterPolicyId> | PUT

Example

curl --netrc -X PUT \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/preview/permissions/cluster-policies/ABCD000000000000 \
  --data @set-cluster-policy-permissions.json \
  | jq .

set-cluster-policy-permissions.json :

{
"access_control_list": [
{
"user_name": "someone@example.com",
"permission_level": "CAN_USE"
}
]
}

{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "someone@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}

Request structure

FIELD NAME | TYPE | DESCRIPTION
clusterPolicyId | STRING | The policy about which to set permissions. This field is required.

Request body
FIELD NAME | TYPE | DESCRIPTION
access_control_list | Array of AccessControlInput | An array of access controls.

Response body
A Clusters ACL.
Data structures
In this section:
Clusters ACL
AccessControl
Permission
AccessControlInput
PermissionLevel
Clusters ACL

ATTRIBUTE NAME | TYPE | DESCRIPTION
object_id | STRING | The ID of the ACL object, for example, ../cluster-policies/<clusterPolicyId>.
object_type | STRING | The Databricks ACL object type, for example, cluster-policy.
access_control_list | Array of AccessControl | The access controls set on the ACL object.

AccessControl

ATTRIBUTE NAME | TYPE | DESCRIPTION
user_name, group_name, OR service_principal_name | STRING | Name of the user/group or application ID of the service principal that has permissions set on the ACL object.
all_permissions | Array of Permission | List of all permissions set on this ACL object for a specific principal. Includes both permissions directly set on this ACL object and permissions inherited from an ancestor ACL object.

Permission

ATTRIBUTE NAME | TYPE | DESCRIPTION
permission_level | STRING | The name of the permission level.
inherited | BOOLEAN | True when the ACL permission is not set directly but inherited from an ancestor ACL object. False if set directly on the ACL object.
inherited_from_object | List[STRING] | The list of parent ACL object IDs that contribute to inherited permission on an ACL object. This is defined only if inherited is true.

AccessControlInput
An item representing an ACL rule applied to the principal (user, group, or service principal).

ATTRIBUTE NAME | TYPE | DESCRIPTION
user_name, group_name, OR service_principal_name | STRING | Name of the user/group or application ID of the service principal that has permissions set on the ACL object.
permission_level | STRING | The name of the permission level.

PermissionLevel
Permission level that you can set on a cluster policy.

PERMISSION LEVEL | DESCRIPTION
CAN_USE | Allow the user to create clusters based on the policy. The user does not need the cluster create permission.
SQL Warehouses APIs 2.0

IMPORTANT
To access Databricks REST APIs, you must authenticate.

To configure individual SQL warehouses, use the SQL Warehouses API. To configure all SQL warehouses, use the
Global SQL Warehouses API.

Requirements
To create SQL warehouses, you must have cluster create permission, which is enabled in the Data Science & Engineering workspace.
To manage a SQL warehouse, you must have Can Manage permission in Databricks SQL for the warehouse.

SQL Warehouses API


Use this API to create, edit, list, and get SQL warehouses.
In this section:
Create
Delete
Edit
Get
List
Start
Stop
Create
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/ | POST
2.0/sql/endpoints/ (deprecated) | POST

Create a SQL warehouse.

FIELD NAME | TYPE | DESCRIPTION
name | STRING | Name of the SQL warehouse. Must be unique. This field is required.
cluster_size | STRING | The size of the clusters allocated to the warehouse: "XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE". For the mapping from cluster to instance size, see Cluster size. This field is required.
min_num_clusters | INT32 | Minimum number of clusters available when a SQL warehouse is running. The default is 1.
max_num_clusters | INT32 | Maximum number of clusters available when a SQL warehouse is running. This field is required. If multi-cluster load balancing is not enabled, this is limited to 1.
auto_stop_mins | INT32 | Time in minutes until an idle SQL warehouse terminates all clusters and stops. This field is optional. Setting this to 0 disables auto stop. For Classic SQL warehouses, the default value is 15. For Serverless SQL warehouses, the default and recommended value is 10.
tags | WarehouseTags | Key-value pairs that describe the warehouse. Azure Databricks tags all warehouse resources with these tags. This field is optional.
enable_photon | BOOLEAN | Whether queries are executed on a native vectorized engine that speeds up query execution. This field is optional. The default is true.
channel | Channel | Whether to use the current SQL warehouse compute version or the preview version. Preview versions let you try out functionality before it becomes the Databricks SQL standard. Typically, preview versions are promoted to the current version two weeks after initial preview release, but some previews may last longer. You can learn about the features in the latest preview version by reviewing the release notes. Databricks does not recommend using preview versions for production workloads. This field is optional. The default is CHANNEL_NAME_CURRENT.
spot_instance_policy | WarehouseSpotInstancePolicy | The spot policy to use for allocating instances to clusters. This field is optional. This field is not used if the SQL warehouse is a Serverless SQL warehouse.

Example request

{
"name": "My SQL warehouse",
"cluster_size": "MEDIUM",
"min_num_clusters": 1,
"max_num_clusters": 10,
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"enable_photon": "true",
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}

Example response

{
"id": "0123456789abcdef"
}
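
The examples above show only the request and response bodies. A minimal invocation sketch, assuming the request body is saved in an illustrative file named create-sql-warehouse.json and .netrc authentication as in the other examples in this article:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/ \
  --data @create-sql-warehouse.json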

Delete
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id} | DELETE
2.0/sql/endpoints/{id} (deprecated) | DELETE

Delete a SQL warehouse.
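
A minimal invocation sketch, assuming the illustrative warehouse ID 0123456789abcdef returned by the create example:

curl --netrc -X DELETE \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/0123456789abcdef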


Edit
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id}/edit | POST
2.0/sql/endpoints/{id}/edit (deprecated) | POST

Modify a SQL warehouse. All fields are optional. Missing fields default to the current values.

FIELD NAME | TYPE | DESCRIPTION
id | STRING | ID of the SQL warehouse.
name | STRING | Name of the SQL warehouse.
cluster_size | STRING | The size of the clusters allocated to the warehouse: "XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE". For the mapping from cluster to instance size, see Cluster size.
min_num_clusters | INT32 | Minimum number of clusters available when a SQL warehouse is running.
max_num_clusters | INT32 | Maximum number of clusters available when a SQL warehouse is running. This field is required. If multi-cluster load balancing is not enabled, limited to 1.
auto_stop_mins | INT32 | Time in minutes until an idle SQL warehouse terminates all clusters and stops. Setting this to 0 disables auto stop. For Classic SQL warehouses, the default value is 15. For Serverless SQL warehouses, the default and recommended value is 10.
tags | WarehouseTags | Key-value pairs that describe the warehouse.
spot_instance_policy | WarehouseSpotInstancePolicy | The spot policy to use for allocating instances to clusters.
enable_photon | BOOLEAN | Whether queries are executed on a native vectorized engine that speeds up query execution.
channel | Channel | Whether to use the current SQL warehouse compute version or the preview version. Preview versions let you try out functionality before it becomes the Databricks SQL standard. Typically, preview versions are promoted to the current version two weeks after initial preview release, but some previews may last longer. You can learn about the features in the latest preview version by reviewing the release notes. Databricks does not recommend using preview versions for production workloads. This field is optional. The default is CHANNEL_NAME_CURRENT.

Example request
{
"name": "My Edited SQL warehouse",
"cluster_size": "LARGE",
"auto_stop_mins": 60
}
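
A minimal invocation sketch, assuming the request body above is saved in an illustrative file named edit-sql-warehouse.json and the warehouse ID is 0123456789abcdef:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/0123456789abcdef/edit \
  --data @edit-sql-warehouse.json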

Get
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id} | GET
2.0/sql/endpoints/{id} (deprecated) | GET

Retrieve the info for a SQL warehouse.

FIELD NAME | TYPE | DESCRIPTION
id | STRING | SQL warehouse ID.
name | STRING | Name of the SQL warehouse.
cluster_size | STRING | The size of the clusters allocated to the warehouse: "XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE". For the mapping from cluster to instance size, see Cluster size.
spot_instance_policy | WarehouseSpotInstancePolicy | The spot policy to use for allocating instances to clusters.
auto_stop_mins | INT32 | Time until an idle SQL warehouse terminates all clusters and stops.
num_clusters | INT32 | Number of clusters allocated to the warehouse.
min_num_clusters | INT32 | Minimum number of clusters available when a SQL warehouse is running.
max_num_clusters | INT32 | Maximum number of clusters available when a SQL warehouse is running.
num_active_sessions | INT32 | Number of active JDBC and ODBC sessions running on the SQL warehouse.
state | WarehouseState | State of the SQL warehouse.
creator_name | STRING | Email address of the user that created the warehouse.
creator_id | STRING | Azure Databricks ID of the user that created the warehouse.
jdbc_url | STRING | The URL used to submit SQL commands to the SQL warehouse using JDBC.
odbc_params | ODBCParams | The host, path, protocol, and port information required to submit SQL commands to the SQL warehouse using ODBC.
tags | WarehouseTags | Key-value pairs that describe the warehouse.
health | WarehouseHealth | The health of the warehouse.
enable_photon | BOOLEAN | Whether queries are executed on a native vectorized engine that speeds up query execution.
channel | Channel | Whether the SQL warehouse uses the current SQL warehouse compute version or the preview version. Preview versions let you try out functionality before it becomes the Databricks SQL standard. Typically, preview versions are promoted to the current version two weeks after initial preview release, but some previews may last longer. You can learn about the features in the latest preview version by reviewing the release notes. Databricks does not recommend using preview versions for production workloads. This field is optional. The default is CHANNEL_NAME_CURRENT.

Example response
{
"id": "7f2629a529869126",
"name": "MyWarehouse",
"size": "SMALL",
"min_num_clusters": 1,
"max_num_clusters": 1,
"auto_stop_mins": 0,
"auto_resume": true,
"num_clusters": 0,
"num_active_sessions": 0,
"state": "STOPPED",
"creator_name": "user@example.com",
"jdbc_url":
"jdbc:spark://hostname.staging.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath
=/sql/1.0/warehouses/7f2629a529869126;",
"odbc_params": {
"hostname": "hostname.cloud.databricks.com",
"path": "/sql/1.0/warehouses/7f2629a529869126",
"protocol": "https",
"port": 443
},
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"spot_instance_policy": "COST_OPTIMIZED",
"enable_photon": true,
"cluster_size": "SMALL",
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}
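
A minimal invocation sketch, assuming the warehouse ID 7f2629a529869126 shown in the response above:

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/7f2629a529869126 \
  | jq .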

List
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/ | GET
2.0/sql/endpoints/ (deprecated) | GET

List all SQL warehouses in the workspace.


Example response

{
"warehouses": [
{ "id": "123456790abcdef", "name": "My SQL warehouse", "cluster_size": "MEDIUM" },
{ "id": "098765321fedcba", "name": "Another SQL warehouse", "cluster_size": "LARGE" }
]
}

Note: If you use the deprecated 2.0/sql/endpoints/ API, the top-level response field is "endpoints" instead of "warehouses".
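
A minimal invocation sketch for listing warehouses, using .netrc authentication as in the other examples in this article:

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/ \
  | jq .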
Start
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id}/start | POST
2.0/sql/endpoints/{id}/start (deprecated) | POST

Start a SQL warehouse.


Stop
ENDPOINT | HTTP METHOD
2.0/sql/warehouses/{id}/stop | POST
2.0/sql/endpoints/{id}/stop (deprecated) | POST

Stop a SQL warehouse.
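
Neither call takes a request body. A minimal invocation sketch for both, assuming the illustrative warehouse ID 7f2629a529869126:

# Start the warehouse
curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/7f2629a529869126/start

# Stop the warehouse
curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/warehouses/7f2629a529869126/stop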

Global SQL Warehouses API


Use this API to configure the security policy, data access properties, and configuration parameters for all SQL
warehouses.
In this section:
Get
Edit
Get
ENDPOINT | HTTP METHOD
/2.0/sql/config/warehouses | GET
/2.0/sql/config/endpoints (deprecated) | GET

Get the configuration for all SQL warehouses.

FIELD NAME | TYPE | DESCRIPTION
security_policy | WarehouseSecurityPolicy | The policy for controlling access to datasets.
data_access_config | Array of WarehouseConfPair | An array of key-value pairs containing properties for an external Hive metastore.
sql_configuration_parameters | RepeatedWarehouseConfPairs | SQL configuration parameters.

Example response
{
"security_policy": "DATA_ACCESS_CONTROL",
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}
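
A minimal invocation sketch for retrieving the global configuration:

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/config/warehouses \
  | jq .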

Edit
Edit the configuration for all SQL warehouses.

IMPORTANT
All fields are required.
Invoking this method restarts all running SQL warehouses.

ENDPOINT | HTTP METHOD
/2.0/sql/config/warehouses | PUT
/2.0/sql/config/endpoints (deprecated) | PUT

FIELD NAME | TYPE | DESCRIPTION
security_policy | WarehouseSecurityPolicy | The policy for controlling access to datasets.
data_access_config | Array of WarehouseConfPair | An array of key-value pairs containing properties for an external Hive metastore.
sql_configuration_parameters | RepeatedWarehouseConfPairs | SQL configuration parameters.

Example request
{
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}
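
A minimal invocation sketch, assuming the request body above is saved in an illustrative file named sql-warehouse-config.json. Remember that this call restarts all running SQL warehouses:

curl --netrc -X PUT \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/sql/config/warehouses \
  --data @sql-warehouse-config.json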

Data structures
In this section:
WarehouseConfPair
WarehouseHealth
WarehouseSecurityPolicy
WarehouseSpotInstancePolicy
WarehouseState
WarehouseStatus
WarehouseTags
WarehouseTagPair
ODBCParams
RepeatedWarehouseConfPairs
Channel
ChannelName
WarehouseConfPair
FIELD NAME | TYPE | DESCRIPTION
key | STRING | Configuration key name.
value | STRING | Configuration key value.

WarehouseHealth
FIELD NAME | TYPE | DESCRIPTION
status | WarehouseStatus | Warehouse status.
message | STRING | A descriptive message about the health status. Includes information about errors contributing to the current health status.

WarehouseSecurityPolicy
OPTION | DESCRIPTION
DATA_ACCESS_CONTROL | Use data access control to control access to datasets.

WarehouseSpotInstancePolicy
OPTION | DESCRIPTION
COST_OPTIMIZED | Use an on-demand instance for the cluster driver and spot instances for cluster executors. The maximum spot price is 100% of the on-demand price. This is the default policy.
RELIABILITY_OPTIMIZED | Use on-demand instances for all cluster nodes.

WarehouseState
State of a SQL warehouse. The allowable state transitions are:
STARTING -> STARTING , RUNNING , STOPPING , DELETING
RUNNING -> STOPPING , DELETING
STOPPING -> STOPPED , STARTING
STOPPED -> STARTING , DELETING
DELETING -> DELETED

STATE | DESCRIPTION
STARTING | The warehouse is in the process of starting.
RUNNING | The starting process is done and the warehouse is ready to use.
STOPPING | The warehouse is in the process of being stopped.
STOPPED | The warehouse is stopped. Start it by calling start or by submitting a JDBC or ODBC request.
DELETING | The warehouse is in the process of being destroyed.
DELETED | The warehouse has been deleted and cannot be recovered.

WarehouseStatus
STATE | DESCRIPTION
HEALTHY | Warehouse is functioning normally and there are no known issues.
DEGRADED | Warehouse might be functional, but there are some known issues. Performance might be affected.
FAILED | Warehouse is severely affected and will not be able to serve queries.

WarehouseTags
FIELD NAME | TYPE | DESCRIPTION
custom_tags | Array of WarehouseTagPair | An object containing an array of key-value pairs.

WarehouseTagPair
FIELD NAME | TYPE | DESCRIPTION
key | STRING | Tag key name.
value | STRING | Tag key value.

ODBCParams
FIELD NAME | TYPE | DESCRIPTION
host | STRING | ODBC server hostname.
path | STRING | ODBC server path.
protocol | STRING | ODBC server protocol.
port | INT32 | ODBC server port.

RepeatedWarehouseConfPairs
FIELD NAME | TYPE | DESCRIPTION
configuration_pairs | Array of WarehouseConfPair | An object containing an array of key-value pairs.

Channel
FIELD NAME | TYPE | DESCRIPTION
name | ChannelName | Channel name.

ChannelName
NAME | DESCRIPTION
CHANNEL_NAME_PREVIEW | SQL warehouse is set to the preview channel and uses upcoming functionality.
CHANNEL_NAME_CURRENT | SQL warehouse is set to the current channel.


Queries and Dashboards API 2.0

The Queries and Dashboards API manages queries, results, and dashboards.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Query History API 2.0

The Query History API shows SQL queries performed using Databricks SQL warehouses. You can use this
information to help you debug issues with queries.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
DBFS API 2.0

The DBFS API is a Databricks API that makes it simple to interact with various data sources without having to include your credentials every time you read a file. See Databricks File System (DBFS) for more information. For an easy-to-use command-line client of the DBFS API, see Databricks CLI.

NOTE
To ensure high quality of service under heavy load, Azure Databricks is now enforcing API rate limits for DBFS API calls.
Limits are set per workspace to ensure fair usage and high availability. Automatic retries are available using Databricks CLI
version 0.12.0 and above. We advise all customers to switch to the latest Databricks CLI version.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Limitations
Using the DBFS API with firewall-enabled storage containers is not supported. Databricks recommends you use Databricks Connect or az storage.

Add block
ENDPOINT | HTTP METHOD
2.0/dbfs/add-block | POST

Append a block of data to the stream specified by the input handle. If the handle does not exist, this call will
throw an exception with RESOURCE_DOES_NOT_EXIST . If the block of data exceeds 1 MB, this call will throw an
exception with MAX_BLOCK_SIZE_EXCEEDED . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/add-block \
  --data '{ "data": "SGVsbG8sIFdvcmxkIQ==", "handle": 1234567890123456 }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
handle | INT64 | The handle on an open stream. This field is required.
data | BYTES | The base64-encoded data to append to the stream. This has a limit of 1 MB. This field is required.

Close
ENDPOINT | HTTP METHOD
2.0/dbfs/close | POST

Close the stream specified by the input handle. If the handle does not exist, this call throws an exception with
RESOURCE_DOES_NOT_EXIST . A typical workflow for file upload would be:

1. Call create and get a handle.


2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/close \
  --data '{ "handle": 1234567890123456 }'

If the call succeeds, no output displays.


Request structure
FIELD NAME | TYPE | DESCRIPTION
handle | INT64 | The handle on an open stream. This field is required.

Create
ENDPOINT | HTTP METHOD
2.0/dbfs/create | POST

Open a stream to write to a file and return a handle to this stream. There is a 10-minute idle timeout on this handle. If a file or directory already exists on the given path and overwrite is set to false, this call throws an exception with RESOURCE_ALREADY_EXISTS . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/create \
--data '{ "path": "/tmp/HelloWorld.txt", "overwrite": true }'

{ "handle": 1234567890123456 }

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the new file. The path should be the absolute DBFS path (for example /mnt/my-file.txt). This field is required.
overwrite | BOOL | The flag that specifies whether to overwrite existing file or files.

Response structure
FIELD NAME | TYPE | DESCRIPTION
handle | INT64 | Handle which should subsequently be passed into the add-block and close calls when writing to a file through a stream.
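
The create, add-block, and close calls fit together into one streaming upload. A minimal end-to-end sketch, assuming an illustrative local file named data.bin that is small enough to send as a single block, and jq and base64 available on the client (larger files would be split across multiple add-block calls):

# 1. Open a stream and capture the returned handle
HANDLE=$(curl --netrc -s -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/create \
  --data '{ "path": "/tmp/data.bin", "overwrite": true }' | jq .handle)

# 2. Append one base64-encoded block (at most 1 MB per call)
curl --netrc -s -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/add-block \
  --data "{ \"handle\": $HANDLE, \"data\": \"$(base64 < data.bin | tr -d '\n')\" }"

# 3. Close the stream
curl --netrc -s -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/close \
  --data "{ \"handle\": $HANDLE }"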

Delete
ENDPOINT | HTTP METHOD
2.0/dbfs/delete | POST

Delete the file or directory (optionally recursively delete all files in the directory). This call throws an exception
with IO_ERROR if the path is a non-empty directory and recursive is set to false or on other similar errors.
When you delete a large number of files, the delete operation is done in increments. The call returns a response
after approximately 45 seconds with an error message (503 Service Unavailable) asking you to re-invoke the
delete operation until the directory structure is fully deleted. For example:

{
"error_code": "PARTIAL_DELETE",
"message": "The requested operation has deleted 324 files. There are more files remaining. You must make
another request to delete more."
}

For operations that delete more than 10K files, we discourage using the DBFS REST API, but advise you to
perform such operations in the context of a cluster, using the File system utility (dbutils.fs). dbutils.fs covers
the functional scope of the DBFS REST API, but from notebooks. Running such operations using notebooks
provides better control and manageability, such as selective deletes, and the possibility to automate periodic
delete jobs.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/delete \
--data '{ "path": "/tmp/HelloWorld.txt" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory to delete. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.
recursive | BOOL | Whether or not to recursively delete the directory's contents. Deleting empty directories can be done without providing the recursive flag.

Get status
ENDPOINT | HTTP METHOD
2.0/dbfs/get-status | GET

Get the file information of a file or directory. If the file or directory does not exist, this call throws an exception
with RESOURCE_DOES_NOT_EXIST .
Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/get-status \
  --data '{ "path": "/tmp/HelloWorld.txt" }' \
  | jq .

{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory. The path should be the absolute DBFS path (for example, /mnt/my-folder/). This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory.
is_dir | BOOL | Whether the path is a directory.
file_size | INT64 | The length of the file in bytes or zero if the path is a directory.
modification_time | INT64 | The last time, in epoch milliseconds, the file or directory was modified.

List
ENDPOINT | HTTP METHOD
2.0/dbfs/list | GET

List the contents of a directory, or details of the file. If the file or directory does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST .
When calling list on a large directory, the list operation will time out after approximately 60 seconds. We
strongly recommend using list only on directories containing less than 10K files and discourage using the
DBFS REST API for operations that list more than 10K files. Instead, we recommend that you perform such
operations in the context of a cluster, using the File system utility (dbutils.fs), which provides the same
functionality without timing out.
Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/list \
  --data '{ "path": "/tmp" }' \
  | jq .

{
"files": [
{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
},
{
"..."
}
]
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION
files | An array of FileInfo | A list of FileInfo that describe the contents of the directory or file.

Mkdirs
ENDPOINT | HTTP METHOD
2.0/dbfs/mkdirs | POST

Create the given directory and necessary parent directories if they do not exist. If there exists a file (not a
directory) at any prefix of the input path, this call throws an exception with RESOURCE_ALREADY_EXISTS . If this
operation fails it may have succeeded in creating some of the necessary parent directories.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/mkdirs \
  --data '{ "path": "/tmp/my-new-dir" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the new directory. The path should be the absolute DBFS path (for example, /mnt/my-folder/). This field is required.

Move
ENDPOINT | HTTP METHOD
2.0/dbfs/move | POST

Move a file from one location to another location within DBFS. If the source file does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST . If there already exists a file in the destination path, this call throws an
exception with RESOURCE_ALREADY_EXISTS . If the given source path is a directory, this call always recursively
moves all files.
When moving a large number of files, the API call will time out after approximately 60 seconds, potentially
resulting in partially moved data. Therefore, for operations that move more than 10K files, we strongly
discourage using the DBFS REST API. Instead, we recommend that you perform such operations in the context of
a cluster, using the File system utility (dbutils.fs) from a notebook, which provides the same functionality without
timing out.
Example

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/move \
  --data '{ "source_path": "/tmp/HelloWorld.txt", "destination_path": "/tmp/my-new-dir/HelloWorld.txt" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
source_path | STRING | The source path of the file or directory. The path should be the absolute DBFS path (for example, /mnt/my-source-folder/). This field is required.
destination_path | STRING | The destination path of the file or directory. The path should be the absolute DBFS path (for example, /mnt/my-destination-folder/). This field is required.

Put
ENDPOINT | HTTP METHOD
2.0/dbfs/put | POST

Upload a file through the use of multipart form post. It is mainly used for streaming uploads, but can also be used as a convenient single call for data upload.
The amount of data that can be passed using the contents parameter is limited to 1 MB if specified as a string (MAX_BLOCK_SIZE_EXCEEDED is thrown if exceeded) and to 2 GB if posted as a file.

Example
To upload a local file named HelloWorld.txt in the current directory:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/put \
  --form contents=@HelloWorld.txt \
  --form path="/tmp/HelloWorld.txt" \
  --form overwrite=true

To upload the content Hello, World! as a base64-encoded string:


curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/put \
--data '{ "path": "/tmp/HelloWorld.txt", "contents": "SGVsbG8sIFdvcmxkIQ==", "overwrite": true }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the new file. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.
contents | BYTES | This parameter might be absent, and instead a posted file will be used.
overwrite | BOOL | The flag that specifies whether to overwrite existing files.

Read
ENDPOINT | HTTP METHOD
2.0/dbfs/read | GET

Return the contents of a file. If the file does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST . If the path is a directory, the read length is negative, or the offset is negative, this call throws an exception with INVALID_PARAMETER_VALUE . If the read length exceeds 1 MB, this call throws an exception with MAX_READ_SIZE_EXCEEDED . If offset + length exceeds the number of bytes in a file, the call reads contents until the end of the file.
Example

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/read \
  --data '{ "path": "/tmp/HelloWorld.txt", "offset": 1, "length": 8 }' \
  | jq .

{
"bytes_read": 8,
"data": "ZWxsbywgV28="
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file to read. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.
offset | INT64 | The offset to read from in bytes.
length | INT64 | The number of bytes to read starting from the offset. This has a limit of 1 MB, and a default value of 0.5 MB.

Response structure
FIELD NAME | TYPE | DESCRIPTION
bytes_read | INT64 | The number of bytes read (could be less than length if we hit end of file). This refers to the number of bytes read in the unencoded version (response data is base64-encoded).
data | BYTES | The base64-encoded contents of the file read.
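
Because data is returned base64-encoded, a minimal sketch of reading a file and decoding the payload on the client, assuming jq and base64 are available:

curl --netrc -s -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/read \
  --data '{ "path": "/tmp/HelloWorld.txt", "offset": 0, "length": 1048576 }' \
  | jq -r .data | base64 --decode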

Data structures
In this section:
FileInfo
FileInfo
The attributes of a file or directory.

FIELD NAME | TYPE | DESCRIPTION
path | STRING | The path of the file or directory.
is_dir | BOOL | Whether the path is a directory.
file_size | INT64 | The length of the file in bytes or zero if the path is a directory.
modification_time | INT64 | The last time, in epoch milliseconds, the file or directory was modified.
Delta Live Tables API guide

The Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create a pipeline
ENDPOINT | HTTP METHOD
2.0/pipelines | POST

Creates a new Delta Live Tables pipeline.


Example
This example creates a new triggered pipeline.
Request

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/pipelines \
  --data @pipeline-settings.json

pipeline-settings.json :

{
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"continuous": false
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file.
Response

{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
}

Request structure
See PipelineSettings.
Response structure
FIELD NAME | TYPE | DESCRIPTION
pipeline_id | STRING | The unique identifier for the newly created pipeline.

Edit a pipeline
ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id} | PUT

Updates the settings for an existing pipeline.


Example
This example adds a target parameter to the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request PUT \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 \
  --data @pipeline-settings.json

pipeline-settings.json
{
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Request structure
See PipelineSettings.

Delete a pipeline
ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id} | DELETE

Deletes a pipeline from the Delta Live Tables system.


Example
This example deletes the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request DELETE \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.

Start a pipeline update


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/updates | POST

Starts an update for a pipeline.


Example
This example starts an update with full refresh for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates \
  --data '{ "full_refresh": "true" }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response

{
"update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
full_refresh | BOOLEAN | Whether to reprocess all data. If true, the Delta Live Tables system will reset all tables before running the pipeline. This field is optional. The default value is false.

Response structure
FIELD NAME | TYPE | DESCRIPTION
update_id | STRING | The unique identifier of the newly created update.

Stop any active pipeline update


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/stop | POST

Stops any active pipeline update. If no update is running, this request is a no-op.
Example
This example stops an update for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl --netrc --request POST \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/stop

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.

List pipeline events


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/events | GET

Retrieves events for a pipeline.


Example
This example retrieves a maximum of 5 events for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 .
Request

curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
--data '{"max_results": 5}'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Request structure
FIELD NAME | TYPE | DESCRIPTION
page_token | STRING | Page token returned by previous call. This field is mutually exclusive with all fields in this request except max_results. An error is returned if any fields other than max_results are set when this field is set. This field is optional.
max_results | INT32 | The maximum number of entries to return in a single page. The system may return fewer than max_results events in a response, even if there are more events available. This field is optional. The default value is 25. The maximum value is 100. An error is returned if the value of max_results is greater than 100.
order_by | STRING | A string indicating a sort order by timestamp for the results, for example, ["timestamp asc"]. The sort order can be ascending or descending. By default, events are returned in descending order by timestamp. This field is optional.
filter | STRING | Criteria to select a subset of results, expressed using a SQL-like syntax. The supported filters are: level='INFO' (or WARN or ERROR), level in ('INFO', 'WARN'), id='[event-id]', and timestamp > 'TIMESTAMP' (or >=, <, <=, =). Composite expressions are supported, for example: level in ('ERROR', 'WARN') AND timestamp > '2021-07-22T06:37:33.083Z'. This field is optional.
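
A minimal sketch combining max_results with a composite filter, using the same illustrative pipeline ID and timestamp as above:

curl -n -X GET \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
  --data "{ \"max_results\": 25, \"filter\": \"level in ('ERROR', 'WARN') AND timestamp > '2021-07-22T06:37:33.083Z'\" }"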

Response structure
FIELD NAME | TYPE | DESCRIPTION
events | An array of pipeline events | The list of events matching the request criteria.
next_page_token | STRING | If present, a token to fetch the next page of events.
prev_page_token | STRING | If present, a token to fetch the previous page of events.
Get pipeline details
ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id} | GET

Gets details about a pipeline, including the pipeline settings and recent updates.
Example
This example gets details for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"spec": {
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
},
"state": "IDLE",
"cluster_id": "1234-567891-abcde123",
"name": "Wikipedia pipeline (SQL)",
"creator_user_name": "username",
"latest_updates": [
{
"update_id": "8a0b6d02-fbd0-11eb-9a03-0242ac130003",
"state": "COMPLETED",
"creation_time": "2021-08-13T00:37:30.279Z"
},
{
"update_id": "a72c08ba-fbd0-11eb-9a03-0242ac130003",
"state": "CANCELED",
"creation_time": "2021-08-13T00:35:51.902Z"
},
{
"update_id": "ac37d924-fbd0-11eb-9a03-0242ac130003",
"state": "FAILED",
"creation_time": "2021-08-13T00:33:38.565Z"
}
],
"run_as_user_name": "username"
}

Response structure
FIELD NAME | TYPE | DESCRIPTION
pipeline_id | STRING | The unique identifier of the pipeline.
spec | PipelineSettings | The pipeline settings.
state | STRING | The state of the pipeline. One of IDLE or RUNNING. If state = RUNNING, then there is at least one active update.
cluster_id | STRING | The identifier of the cluster running the pipeline.
name | STRING | The user-friendly name for this pipeline.
creator_user_name | STRING | The username of the pipeline creator.
latest_updates | An array of UpdateStateInfo | Status of the most recent updates for the pipeline, ordered with the newest update first.
run_as_user_name | STRING | The username that the pipeline runs as.

Get update details


ENDPOINT | HTTP METHOD
2.0/pipelines/{pipeline_id}/updates/{update_id} | GET

Gets details for a pipeline update.


Example
This example gets details for update 9a84f906-fc51-11eb-9a03-0242ac130003 for the pipeline with ID
a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request

curl -n -X GET \
  https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates/9a84f906-fc51-11eb-9a03-0242ac130003

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"update": {
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"update_id": "9a84f906-fc51-11eb-9a03-0242ac130003",
"config": {
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"configuration": {
"pipelines.numStreamRetryAttempts": "5"
},
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"filters": {},
"email_notifications": {},
"continuous": false,
"development": false
},
"cause": "API_CALL",
"state": "COMPLETED",
"creation_time": 1628815050279,
"full_refresh": true
}
}

Response structure
FIELD NAME | TYPE | DESCRIPTION
pipeline_id | STRING | The unique identifier of the pipeline.
update_id | STRING | The unique identifier of this update.
config | PipelineSettings | The pipeline settings.
cause | STRING | The trigger for the update. One of API_CALL, RETRY_ON_FAILURE, SERVICE_UPGRADE.
state | STRING | The state of the update. One of QUEUED, CREATED, WAITING_FOR_RESOURCES, INITIALIZING, RESETTING, SETTING_UP_TABLES, RUNNING, STOPPING, COMPLETED, FAILED, or CANCELED.
cluster_id | STRING | The identifier of the cluster running the pipeline.
creation_time | INT64 | The timestamp when the update was created.
full_refresh | BOOLEAN | Whether the update was triggered to perform a full refresh. If true, all pipeline tables were reset before running the update.

List pipelines
ENDPOINT | HTTP METHOD
2.0/pipelines/ | GET

Lists pipelines defined in the Delta Live Tables system.


Example
This example retrieves details for up to two pipelines, starting from a specified page_token :
Request

curl -n -X GET https://<databricks-instance>/api/2.0/pipelines \
  --data '{ "page_token": "eyJ...==", "max_results": 2 }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file.


Response
{
"statuses": [
{
"pipeline_id": "e0f01758-fc61-11eb-9a03-0242ac130003",
"state": "IDLE",
"name": "dlt-pipeline-python",
"latest_updates": [
{
"update_id": "ee9ae73e-fc61-11eb-9a03-0242ac130003",
"state": "COMPLETED",
"creation_time": "2021-08-13T00:34:21.871Z"
}
],
"creator_user_name": "username"
},
{
"pipeline_id": "f4c82f5e-fc61-11eb-9a03-0242ac130003",
"state": "IDLE",
"name": "dlt-pipeline-python",
"creator_user_name": "username"
}
],
"next_page_token": "eyJ...==",
"prev_page_token": "eyJ..x9"
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
page_token | STRING | Page token returned by previous call. This field is optional.
max_results | INT32 | The maximum number of entries to return in a single page. The system may return fewer than max_results events in a response, even if there are more events available. This field is optional. The default value is 25. The maximum value is 100. An error is returned if the value of max_results is greater than 100.
order_by | An array of STRING | A list of strings specifying the order of results, for example, ["name asc"]. Supported order_by fields are id and name. The default is id asc. This field is optional.
filter | STRING | Select a subset of results based on the specified criteria. The supported filters are: "notebook='<path>'" to select pipelines that reference the provided notebook path, and name LIKE '[pattern]' to select pipelines with a name that matches pattern. Wildcards are supported, for example: name LIKE '%shopping%'. Composite filters are not supported. This field is optional.

Response structure
FIELD NAME | TYPE | DESCRIPTION
statuses | An array of PipelineStateInfo | The list of pipeline statuses matching the request criteria.
next_page_token | STRING | If present, a token to fetch the next page of events.
prev_page_token | STRING | If present, a token to fetch the previous page of events.

Data structures
In this section:
KeyValue
NotebookLibrary
PipelineLibrary
PipelineSettings
PipelineStateInfo
PipelinesNewCluster
UpdateStateInfo
KeyValue
A key-value pair that specifies configuration parameters.

FIELD NAME | TYPE | DESCRIPTION
key | STRING | The configuration property name.
value | STRING | The configuration property value.

NotebookLibrary
A specification for a notebook containing pipeline code.

FIELD NAME | TYPE | DESCRIPTION
path | STRING | The absolute path to the notebook. This field is required.

PipelineLibrary
A specification for pipeline dependencies.

FIELD NAME | TYPE | DESCRIPTION
notebook | NotebookLibrary | The path to a notebook defining Delta Live Tables datasets. The path must be in the Databricks workspace, for example: { "notebook" : { "path" : "/my-pipeline-notebook-path" } }.

PipelineSettings
The settings for a pipeline deployment.

FIELD NAME | TYPE | DESCRIPTION
id | STRING | The unique identifier for this pipeline. The identifier is created by the Delta Live Tables system, and must not be provided when creating a pipeline.
name | STRING | A user-friendly name for this pipeline. This field is optional. By default, the pipeline name must be unique. To use a duplicate name, set allow_duplicate_names to true in the pipeline configuration.
storage | STRING | A path to a DBFS directory for storing checkpoints and tables created by the pipeline. This field is optional. The system uses a default location if this field is empty.
configuration | A map of STRING:STRING | A list of key-value pairs to add to the Spark configuration of the cluster that will run the pipeline. This field is optional. Elements must be formatted as key:value pairs.
clusters | An array of PipelinesNewCluster | An array of specifications for the clusters to run the pipeline. This field is optional. If this is not specified, the system will select a default cluster configuration for the pipeline.
libraries | An array of PipelineLibrary | The notebooks containing the pipeline code and any dependencies required to run the pipeline.
target | STRING | A database name for persisting pipeline output data. See Delta Live Tables data publishing for more information.
continuous | BOOLEAN | Whether this is a continuous pipeline. This field is optional. The default value is false.
development | BOOLEAN | Whether to run the pipeline in development mode. This field is optional. The default value is false.

PipelineStateInfo
The state of a pipeline, the status of the most recent updates, and information about associated resources.

FIELD NAME | TYPE | DESCRIPTION
state | STRING | The state of the pipeline. One of IDLE or RUNNING.
pipeline_id | STRING | The unique identifier of the pipeline.
cluster_id | STRING | The unique identifier of the cluster running the pipeline.
name | STRING | The user-friendly name of the pipeline.
latest_updates | An array of UpdateStateInfo | Status of the most recent updates for the pipeline, ordered with the newest update first.
creator_user_name | STRING | The username of the pipeline creator.
run_as_user_name | STRING | The username that the pipeline runs as. This is a read-only value derived from the pipeline owner.

PipelinesNewCluster
A pipeline cluster specification.
The Delta Live Tables system sets the following attributes. These attributes cannot be configured by users:
spark_version
init_scripts

FIELD NAME | TYPE | DESCRIPTION
label | STRING | A label for the cluster specification, either default to configure the default cluster, or maintenance to configure the maintenance cluster. This field is optional. The default value is default.
spark_conf | KeyValue | An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively. Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}
node_type_id | STRING | This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory- or compute-intensive workloads. A list of available node types can be retrieved by using the List node types API call.
driver_node_type_id | STRING | The node type of the Spark driver. This field is optional; if unset, the driver node type will be set as the same value as node_type_id defined above.
ssh_public_keys | An array of STRING | SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to login with the user name ubuntu on port 2200. Up to 10 keys can be specified.
custom_tags | KeyValue | An object containing a set of tags for cluster resources. Databricks tags all cluster resources with these tags in addition to default_tags. Note: Tags are not supported on legacy node types such as compute-optimized and memory-optimized. Azure Databricks allows at most 45 custom tags.
cluster_log_conf | ClusterLogConf | The configuration for delivering Spark logs to a long-term storage destination. Only one destination can be specified for one cluster. If this configuration is provided, the logs will be delivered to the destination every 5 mins. The destination of driver logs is <destination>/<cluster-ID>/driver, while the destination of executor logs is <destination>/<cluster-ID>/executor.
spark_env_vars | KeyValue | An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (that is, export X='Y') while launching the driver and workers. In order to specify an additional set of SPARK_DAEMON_JAVA_OPTS, Databricks recommends appending them to $SPARK_DAEMON_JAVA_OPTS as shown in the following example. This ensures that all default Azure Databricks managed environmental variables are included as well. Example Spark environment variables: {"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"} or {"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}
init_scripts | An array of InitScriptInfo | The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If cluster_log_conf is specified, init script logs are sent to <destination>/<cluster-ID>/init_scripts.
instance_pool_id | STRING | The optional ID of the instance pool to which the cluster belongs. See Pools.
driver_instance_pool_id | STRING | The optional ID of the instance pool to use for the driver node. You must also specify instance_pool_id. See Instance Pools API 2.0.
policy_id | STRING | A cluster policy ID.
num_workers OR autoscale | INT32 OR AutoScale | If num_workers, number of worker nodes that this cluster should have. A cluster has one Spark driver and num_workers executors for a total of num_workers + 1 Spark nodes. When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field is updated to reflect the target size of 10 workers, whereas the workers listed in executors gradually increase from 5 to 10 as the new nodes are provisioned. If autoscale, parameters needed to automatically scale clusters up and down based on load. This field is optional.
apply_policy_default_values | BOOLEAN | Whether to use policy default values for missing cluster attributes.

UpdateStateInfo
The current state of a pipeline update.

FIELD NAME | TYPE | DESCRIPTION
update_id | STRING | The unique identifier for this update.
state | STRING | The state of the update. One of QUEUED, CREATED, WAITING_FOR_RESOURCES, INITIALIZING, RESETTING, SETTING_UP_TABLES, RUNNING, STOPPING, COMPLETED, FAILED, or CANCELED.
creation_time | STRING | Timestamp when this update was created.
Git Credentials API 2.0

The Git Credentials API allows users to manage their Git credentials to use Databricks Repos.
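
For example, assuming the endpoints documented in the OpenAPI specification referenced below, listing your existing Git credentials is a single GET request (an illustrative sketch; consult the specification for the authoritative request and response shapes):

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/git-credentials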

IMPORTANT
To access Databricks REST APIs, you must authenticate.

The Git Credentials API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
Global Init Scripts API 2.0

Global init scripts are shell scripts that run during startup on each cluster node of every cluster in the workspace, before the Apache Spark driver or worker JVM starts. They can help you enforce consistent cluster configurations across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts.
The Global Init Scripts API lets Azure Databricks administrators add global cluster initialization scripts in a secure and controlled manner. To learn how to add them using the UI, see Global init scripts.
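
For example, assuming the endpoints documented in the OpenAPI specification referenced below, an administrator could list the global init scripts currently configured for the workspace with a single GET request (an illustrative sketch):

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/global-init-scripts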
The Global Init Scripts API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Groups API 2.0

The Groups API allows you to manage groups of users.

NOTE
You must be an Azure Databricks administrator to invoke this API.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Add member
ENDPOINT | HTTP METHOD
2.0/groups/add-member | POST

Add a user or group to a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the
given name does not exist, or if a group with the given parent name does not exist.
Examples
To add a user to a group:

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/add-member \
--data '{ "user_name": "someone@example.com", "parent_name": "reporting-department" }'

{}

To add a group to another group:

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/add-member \
--data '{ "group_name": "reporting-department", "parent_name": "data-ops-read-only" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION

user_name OR group_name STRING OR STRING If user_name, the user name.

If group_name, the group name.



parent_name STRING Name of the parent group to which


the new member will be added. This
field is required.

Create
ENDPOINT | HTTP METHOD
2.0/groups/create | POST

Create a new group with the given name. This call returns an error RESOURCE_ALREADY_EXISTS if a group with the
given name already exists.
Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/create \
--data '{ "group_name": "reporting-department" }'

{ "group_name": "reporting-department" }

Request structure
FIELD NAME | TYPE | DESCRIPTION

group_name STRING Name for the group; must be unique


among groups owned by this
organization. This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION

group_name STRING The group name.

List members
ENDPOINT | HTTP METHOD
2.0/groups/list-members | GET

Return all of the members of a particular group. This call returns the error RESOURCE_DOES_NOT_EXIST if a group
with the given name does not exist. This method is non-recursive; it returns all groups that belong to the given
group but not the principals that belong to those child groups.
Example
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-members \
--data '{ "group_name": "reporting-department" }' \
| jq .

{
"members": [
{
"user_name": "someone@example.com"
}
]
}

Request structure
FIELD NAME | TYPE | DESCRIPTION

group_name STRING The group whose members we want to


retrieve. This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION

members An array of PrincipalName The users and groups that belong to


the given group.

List
ENDPOINT | HTTP METHOD
2.0/groups/list | GET

Return all of the groups in an organization.


Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list \
| jq .

{
"group_names": [
"reporting-department",
"data-ops-read-only",
"admins"
]
}

Response structure
FIELD NAME | TYPE | DESCRIPTION

group_names An array of STRING The groups in this organization.


List parents
ENDPOINT | HTTP METHOD
2.0/groups/list-parents | GET

Retrieve all groups in which a given user or group is a member. This method is non-recursive; it returns all
groups in which the given user or group is a member but not the groups in which those groups are members.
This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the given name does not exist.
Examples
To list groups for a user:

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-parents \
--data '{ "user_name": "someone@example.com" }' \
| jq .

{
"group_names": [
"reporting-department"
]
}

To list parent groups for a group:

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-parents \
--data '{ "group_name": "reporting-department" }' \
| jq .

{
"group_names": [
"data-ops-read-only"
]
}

Request structure
FIELD NAME | TYPE | DESCRIPTION

user_name OR group_name STRING OR STRING If user_name, the user name.

If group_name, the group name.

Response structure
FIELD NAME | TYPE | DESCRIPTION

group_names An array of STRING The groups in which the given user or


group is a member.

Remove member
ENDPOINT | HTTP METHOD
2.0/groups/remove-member | POST

Remove a user or group from a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group
with the given name does not exist or if a group with the given parent name does not exist.
Examples
To remove a user from a group:

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/remove-member \
--data '{ "user_name": "someone@example.com", "parent_name": "reporting-department" }'

{}

To remove a group from another group:

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/remove-member \
--data '{ "group_name": "reporting-department", "parent_name": "data-ops-read-only" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION

user_name OR group_name STRING OR STRING If user_name, the user name.

If group_name, the group name.

parent_name STRING Name of the parent group from which


the member will be removed. This field
is required.

Delete
ENDPOINT | HTTP METHOD
2.0/groups/delete | POST

Remove a group from this organization. This call returns the error RESOURCE_DOES_NOT_EXIST if a group with the
given name does not exist.
Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/delete \
--data '{ "group_name": "reporting-department" }'
{}

Request structure
FIELD NAME | TYPE | DESCRIPTION

group_name STRING The group to remove. This field is


required.

Data structures
In this section:
PrincipalName
PrincipalName
Container type for a name that is either a user name or a group name.

FIELD NAME | TYPE | DESCRIPTION

user_name OR group_name STRING OR STRING If user_name, the user name.

If group_name, the group name.


Instance Pools API 2.0

The Instance Pools API allows you to create, edit, delete and list instance pools.
An instance pool reduces cluster start and auto-scaling times by maintaining a set of idle, ready-to-use cloud
instances. When a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s idle
instances. If the pool has no idle instances, it expands by allocating a new instance from the instance provider in
order to accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is
free for another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.
Azure Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply.
See pricing.
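
To attach a cluster to a pool, set the pool ID as the instance_pool_id in the cluster specification. The following is a minimal sketch using the Clusters API; the pool ID shown is the example value used elsewhere on this page, and when instance_pool_id is set you omit node_type_id because the cluster inherits the pool's node type:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
  --data '{ "cluster_name": "pool-backed-cluster", "spark_version": "7.3.x-scala2.12", "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg", "num_workers": 2 }'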

Requirements
You must have permission to attach to the pool; see Pool access control.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create
ENDPOINT | HTTP METHOD
2.0/instance-pools/create | POST

Create an instance pool. Use the returned instance_pool_id to query the status of the instance pool, which
includes the number of instances currently allocated by the instance pool. If you provide the min_idle_instances
parameter, instances are provisioned in the background and are ready to use once the idle_count in the
InstancePoolStats equals the requested minimum.
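
For example, after creating the pool you can poll the Get endpoint described below and compare stats.idle_count with the requested minimum. A short sketch using jq:

curl --netrc -X GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/get \
  --data '{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }' \
  | jq '.stats.idle_count'
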
If your account has Databricks Container Services enabled and the instance pool is created with
preloaded_docker_images , you can use the instance pool to launch clusters with a Docker image. The Docker
image in the instance pool doesn’t have to match the Docker image in the cluster. However, the container
environment of the cluster created on the pool must align with the container environment of the instance pool:
you cannot use an instance pool created with preloaded_docker_images to launch a cluster without a Docker
image, and you cannot use an instance pool created without preloaded_docker_images to launch a cluster with a
Docker image.
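
For instance, a pool intended for Databricks Container Services clusters might be created with a preloaded image. The payload below is a hedged sketch: the image URL is a placeholder, and a basic_auth block (username and password) would be added only for a private registry.

{
  "instance_pool_name": "docker-pool",
  "node_type_id": "Standard_D3_v2",
  "min_idle_instances": 2,
  "preloaded_docker_images": [
    { "url": "databricksruntime/standard:latest" }
  ]
}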

NOTE
Azure Databricks may not be able to acquire some of the requested idle instances due to instance provider limitations or
transient network issues. Clusters can still attach to the instance pool, but may not start as quickly.

Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/create \
--data @create-instance-pool.json

create-instance-pool.json :

{
"instance_pool_name": "my-pool",
"node_type_id": "Standard_D3_v2",
"min_idle_instances": 10,
"custom_tags": [
{
"key": "my-key",
"value": "my-value"
}
]
}

{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }

Request structure
FIELD NAME | TYPE | DESCRIPTION

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

node_type_id STRING The node type for the instances in the


pool. All clusters attached to the pool
inherit this node type and the pool’s
idle instances are allocated based on
this type. You can retrieve a list of
available node types by using the List
node types API call.

custom_tags An array of ClusterTag Additional tags for instance pool


resources. Azure Databricks tags all
pool resources (e.g. VM disk volumes)
with these tags in addition to
default_tags.

Azure Databricks allows up to 41


custom tags.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained by
the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, the instances in the pool
dynamically acquire additional disk
space when they are running low on
disk space.

disk_spec DiskSpec Defines the amount of initial remote


storage attached to each instance in
the pool.

preloaded_spark_versions An array of STRING A list with at most one runtime version


the pool installs on each instance. Pool
clusters that use a preloaded runtime
version start faster as they do not
have to wait for the image to
download. You can retrieve a list of
available runtime versions by using the
Runtime versions API call.

preloaded_docker_images An array of DockerImage A list with at most one Docker image


the pool installs on each instance. Pool
clusters that use a preloaded Docker
image start faster as they do not have
to wait for the image to download.
Available only if your account has
Databricks Container Services enabled.

azure_attributes InstancePoolAzureAttributes Defines the instance availability type


(such as spot or on-demand) and max
bid price.

Response structure
FIELD NAME | TYPE | DESCRIPTION

instance_pool_id STRING The ID of the created instance pool.


Edit
ENDPOINT | HTTP METHOD
2.0/instance-pools/edit | POST

Edit an instance pool. This modifies the configuration of an existing instance pool.

NOTE
You can edit only the following values: instance_pool_name , min_idle_instances , max_capacity , and
idle_instance_autotermination_minutes .
You must provide an instance_pool_name value.

Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/edit \
--data @edit-instance-pool.json

edit-instance-pool.json :

{
"instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg",
"instance_pool_name": "my-edited-pool",
"min_idle_instances": 5,
"max_capacity": 200,
"idle_instance_autotermination_minutes": 30
}

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION

instance_pool_id STRING The ID of the instance pool to edit.


This field is required.

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained
by the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

Delete
ENDPOINT | HTTP METHOD
2.0/instance-pools/delete | POST

Delete an instance pool. This permanently deletes the instance pool. The idle instances in the pool are
terminated asynchronously. New clusters cannot attach to the pool. Running clusters attached to the pool
continue to run but cannot autoscale up. Terminated clusters attached to the pool will fail to start until they are
edited to no longer use the pool.
Example

curl --netrc -X POST \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/delete \
--data '{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }'

{}

Request structure
FIELD NAME | TYPE | DESCRIPTION

instance_pool_id STRING The ID of the instance pool to delete.

Get
ENDPOINT | HTTP METHOD
2.0/instance-pools/get | GET

Retrieve the information for an instance pool given its identifier.


Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/get \
--data '{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }'

{
"instance_pool_name": "mypool",
"node_type_id": "Standard_D3_v2",
"custom_tags": {
"my-key": "my-value"
},
"idle_instance_autotermination_minutes": 60,
"enable_elastic_disk": false,
"preloaded_spark_versions": [
"5.4.x-scala2.11"
],
"instance_pool_id": "101-120000-brick1-pool-ABCD1234",
"default_tags": {
"Vendor": "Databricks",
"DatabricksInstancePoolCreatorId": "100125",
"DatabricksInstancePoolId": "101-120000-brick1-pool-ABCD1234"
},
"state": "ACTIVE",
"stats": {
"used_count": 10,
"idle_count": 5,
"pending_used_count": 5,
"pending_idle_count": 5
},
"status": {}
}

Request structure
FIELD NAME | TYPE | DESCRIPTION

instance_pool_id STRING The instance pool about which to


retrieve information.

Response structure
FIELD NAME | TYPE | DESCRIPTION

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

node_type_id STRING The node type for the instances in the


pool. All clusters attached to the pool
inherit this node type and the pool’s
idle instances are allocated based on
this type. You can retrieve a list of
available node types by using the List
node types API call.

custom_tags An array of ClusterTag Additional tags for instance pool


resources. Azure Databricks tags all
pool resources (e.g. VM disk volumes)
with these tags in addition to
default_tags.

Azure Databricks allows up to 41


custom tags.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained
by the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, the instances in the pool
dynamically acquire additional disk
space when they are running low on
disk space.

disk_spec DiskSpec Defines the amount of initial remote


storage attached to each instance in
the pool.

preloaded_spark_versions An array of STRING A list with the runtime version the pool
installs on each instance. Pool clusters
that use a preloaded runtime version
start faster as they do not have to wait
for the image to download. You can
retrieve a list of available runtime
versions by using the Runtime versions
API call.

instance_pool_id STRING The canonical unique identifier for the


instance pool.

default_tags An array of ClusterTag Tags that are added by Azure


Databricks regardless of any
custom_tags, including:

* Vendor: Databricks
* DatabricksInstancePoolCreatorId:
<create_user_id>
* DatabricksInstancePoolId:
<instance_pool_id>

state InstancePoolState Current state of the instance pool.

stats InstancePoolStats Statistics about the usage of the


instance pool.

status InstancePoolStatus Status about failed pending instances


in the pool.

List
ENDPOINT | HTTP METHOD
2.0/instance-pools/list | GET

List information for all instance pools.


Example

curl --netrc -X GET \


https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/list
{
"instance_pools": [
{
"instance_pool_name": "mypool",
"node_type_id": "Standard_D3_v2",
"idle_instance_autotermination_minutes": 60,
"enable_elastic_disk": false,
"preloaded_spark_versions": [
"5.4.x-scala2.11"
],
"instance_pool_id": "101-120000-brick1-pool-ABCD1234",
"default_tags": {
"Vendor": "Databricks",
"DatabricksInstancePoolCreatorId": "100125",
"DatabricksInstancePoolId": "101-120000-brick1-pool-ABCD1234"
},
"state": "ACTIVE",
"stats": {
"used_count": 10,
"idle_count": 5,
"pending_used_count": 5,
"pending_idle_count": 5
},
"status": {}
},
{
"..."
}
]
}

Response structure
FIELD NAME | TYPE | DESCRIPTION

instance_pools An array of InstancePoolStatus A list of instance pools with their


statistics included.

Data structures
In this section:
InstancePoolState
InstancePoolStats
InstancePoolStatus
PendingInstanceError
DiskSpec
DiskType
InstancePoolAndStats
AzureDiskVolumeType
InstancePoolAzureAttributes
InstancePoolState
The state of an instance pool. The current allowable state transitions are:
ACTIVE -> DELETED
NAME | DESCRIPTION

ACTIVE Indicates an instance pool is active. Clusters can attach to it.

DELETED Indicates the instance pool has been deleted and is no


longer accessible.

InstancePoolStats
Statistics about the usage of the instance pool.

FIELD NAME | TYPE | DESCRIPTION

used_count INT32 Number of active instances that are in


use by a cluster.

idle_count INT32 Number of active instances that are


not in use by a cluster.

pending_used_count INT32 Number of pending instances that are


assigned to a cluster.

pending_idle_count INT32 Number of pending instances that are


not assigned to a cluster.

InstancePoolStatus
Status about failed pending instances in the pool.

FIELD NAME | TYPE | DESCRIPTION

pending_instance_errors An array of PendingInstanceError List of error messages for the failed


pending instances.

PendingInstanceError
Error message of a failed pending instance.

FIELD NAME | TYPE | DESCRIPTION

instance_id STRING ID of the failed instance.

message STRING Message describing the cause of the


failure.

DiskSpec
Describes the initial set of disks to attach to each instance. For example, if there are 3 instances and each
instance is configured to start with 2 disks, 100 GiB each, then Azure Databricks creates a total of 6 disks, 100
GiB each, for these instances.

FIELD NAME | TYPE | DESCRIPTION

disk_type DiskType The type of disks to attach.



disk_count INT32 The number of disks to attach to each


instance:

* This feature is only enabled for


supported node types.
* Users can choose up to the limit of
the disks supported by the node type.
* For node types with no local disk, at
least one disk needs to be specified.

disk_size INT32 The size of each disk (in GiB) to attach.


Values must fall into the supported
range for a particular instance type:

* Premium LRS (SSD): 1 - 1023 GiB


* Standard LRS (HDD): 1 - 1023 GiB
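
Put together, a disk_spec that attaches two 100 GiB premium disks to each instance would look roughly like the following sketch (values are illustrative):

"disk_spec": {
  "disk_type": { "azure_disk_volume_type": "PREMIUM_LRS" },
  "disk_count": 2,
  "disk_size": 100
}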

DiskType
Describes the type of disk.

FIELD NAME | TYPE | DESCRIPTION

azure_disk_volume_type AzureDiskVolumeType The type of Azure disk to use.

InstancePoolAndStats
FIELD NAME | TYPE | DESCRIPTION

instance_pool_name STRING The name of the instance pool. This is


required for create and edit operations.
It must be unique, non-empty, and
less than 100 characters.

min_idle_instances INT32 The minimum number of idle instances


maintained by the pool. This is in
addition to any instances in use by
active clusters.

max_capacity INT32 The maximum number of instances the


pool can contain, including both idle
instances and ones in use by clusters.
Once the maximum capacity is
reached, you cannot create new
clusters from the pool and existing
clusters cannot autoscale up until
some instances are made idle in the
pool via cluster termination or down-
scaling.

node_type_id STRING The node type for the instances in the


pool. All clusters attached to the pool
inherit this node type and the pool’s
idle instances are allocated based on
this type. You can retrieve a list of
available node types by using the List
node types API call.

custom_tags An array of ClusterTag Additional tags for instance pool


resources. Azure Databricks tags all
pool resources (e.g. VM disk volumes)
with these tags in addition to
default_tags.

Azure Databricks allows up to 41


custom tags.

idle_instance_autotermination_minutes INT32 The number of minutes that idle


instances in excess of the
min_idle_instances are maintained
by the pool before being terminated. If
not specified, excess idle instances are
terminated automatically after a
default timeout period. If specified, the
time must be between 0 and 10000
minutes. If 0 is supplied, excess idle
instances are removed as soon as
possible.

enable_elastic_disk BOOL Autoscaling Local Storage: when


enabled, the instances in the pool
dynamically acquire additional disk
space when they are running low on
disk space.

disk_spec DiskSpec Defines the amount of initial remote


storage attached to each instance in
the pool.

preloaded_spark_versions An array of STRING A list with the runtime version the pool
installs on each instance. Pool clusters
that use a preloaded runtime version
start faster as they do not have to wait
for the image to download. You can
retrieve a list of available runtime
versions by using the Runtime versions
API call.

instance_pool_id STRING The canonical unique identifier for the


instance pool.

default_tags An array of ClusterTag Tags that are added by Azure


Databricks regardless of any
custom_tags, including:

* Vendor: Databricks
* DatabricksInstancePoolCreatorId:
<create_user_id>
* DatabricksInstancePoolId:
<instance_pool_id>

state InstancePoolState Current state of the instance pool.

stats InstancePoolStats Statistics about the usage of the


instance pool.
AzureDiskVolumeType
All Azure Disk types that Azure Databricks supports. See https://docs.microsoft.com/azure/virtual-machines/linux/disks-types

NAME | DESCRIPTION

PREMIUM_LRS Premium storage tier, backed by SSDs.

STANDARD_LRS Standard storage tier, backed by HDDs.

InstancePoolAzureAttributes
Attributes set during instance pool creation related to Azure.

FIELD NAME | TYPE | DESCRIPTION

availability AzureAvailability Availability type used for all


subsequent nodes.

spot_bid_max_price DOUBLE The max bid price used for Azure spot
instances. You can set this to greater
than or equal to the current spot price.
You can also set this to -1 (the default),
which specifies that the instance
cannot be evicted on the basis of price.
The price for the instance will be the
current price for spot instances or the
price for a standard instance. You can
view historical pricing and eviction
rates in the Azure portal.
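
For example, an azure_attributes block that requests spot instances with the default bid behavior might look like the following sketch; the availability value SPOT_AZURE is an assumption based on the AzureAvailability type referenced above.

"azure_attributes": {
  "availability": "SPOT_AZURE",
  "spot_bid_max_price": -1
}
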
IP Access List API 2.0

Azure Databricks workspaces can be configured so that employees connect to the service only through existing
corporate networks with a secure perimeter. Azure Databricks customers can use the IP access lists feature to
define a set of approved IP addresses. All incoming access to the web application and REST APIs requires that
the user connect from an authorized IP address.
For more details about this feature and examples of how to use this API, see IP access lists.
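
As a quick illustration (a hedged sketch; see the linked article and the OpenAPI specification below for the authoritative request shapes), adding an allow list for a documentation CIDR range is a single POST:

curl --netrc -X POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/ip-access-lists \
  --data '{ "label": "office-vpn", "list_type": "ALLOW", "ip_addresses": ["203.0.113.0/24"] }'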

IMPORTANT
To access Databricks REST APIs, you must authenticate.

The IP Access List API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Jobs API 2.0

The Jobs API allows you to create, edit, and delete jobs. The maximum allowed size of a request to the Jobs API is
10MB. See Create a High Concurrency cluster for a how-to guide on this API.
For details about updates to the Jobs API that support orchestration of multiple tasks with Azure Databricks
jobs, see Jobs API updates.

NOTE
If you receive a 500-level error when making Jobs API requests, Databricks recommends retrying requests for up to 10
min (with a minimum 30 second interval between retries).
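
A minimal sketch of such a retry loop in a shell, using the same .netrc convention as the examples below (the file names are placeholders):

attempt=0
until [ $attempt -ge 20 ]; do
  # Capture only the HTTP status code; the response body goes to response.json.
  code=$(curl --netrc --silent --output response.json --write-out '%{http_code}' \
    --request POST https://<databricks-instance>/api/2.0/jobs/create \
    --data @create-job.json)
  # Stop on any non-500-level response; otherwise wait at least 30 seconds and retry.
  [ "$code" -lt 500 ] && break
  attempt=$((attempt + 1))
  sleep 30
done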

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create
ENDPOINT | HTTP METHOD
2.0/jobs/create | POST

Create a new job.


Example
This example creates a job that runs a JAR task at 10:15pm each night.
Request

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/jobs/create \
--data @create-job.json \
| jq .

create-job.json :
{
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"timeout_seconds": 3600,
"max_retries": 1,
"schedule": {
"quartz_cron_expression": "0 15 22 * * ?",
"timezone_id": "America/Los_Angeles"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of create-job.json with fields that are appropriate for your solution.

This example uses a .netrc file and jq.


Response

{
"job_id": 1
}

Request structure

IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.

FIELD NAME | TYPE | DESCRIPTION

existing_cluster_id OR new_cluster STRING OR NewCluster If existing_cluster_id, the ID of an


existing cluster that will be used for all
runs of this job. When running jobs on
an existing cluster, you may need to
manually restart the cluster if it stops
responding. We suggest running jobs
on new clusters for greater reliability.

If new_cluster, a description of a cluster


that will be created for each run.

If specifying a PipelineTask, this field


can be empty.

notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task OR pipeline_task | NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask OR PipelineTask |
If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task.

If spark_jar_task, indicates that this job should run a JAR.

If spark_python_task, indicates that this job should run a Python file.

If spark_submit_task, indicates that this job should be launched by the spark submit script.

If pipeline_task, indicates that this job should run a Delta Live Tables pipeline.

name STRING An optional name for the job. The


default value is Untitled .

libraries An array of Library An optional list of libraries to be


installed on the cluster that will
execute the job. The default value is an
empty list.

email_notifications JobEmailNotifications An optional set of email addresses


notified when runs of this job begin
and complete and when this job is
deleted. The default behavior is to not
send any emails.

timeout_seconds INT32 An optional timeout applied to each


run of this job. The default behavior is
to have no timeout.

max_retries INT32 An optional maximum number of


times to retry an unsuccessful run. A
run is considered to be unsuccessful if
it completes with the FAILED
result_state or
INTERNAL_ERROR
life_cycle_state . The value -1
means to retry indefinitely and the
value 0 means to never retry. The
default behavior is to never retry.

min_retry_interval_millis INT32 An optional minimal interval in


milliseconds between the start of the
failed run and the subsequent retry
run. The default behavior is that
unsuccessful runs are immediately
retried.

retry_on_timeout BOOL An optional policy to specify whether


to retry a job when it times out. The
default behavior is to not retry on
timeout.

schedule CronSchedule An optional periodic schedule for this


job. The default behavior is that the
job runs when triggered by clicking
Run Now in the Jobs UI or sending an
API request to runNow .

max_concurrent_runs INT32 An optional maximum allowed number


of concurrent runs of the job.

Set this value if you want to be able to


execute multiple runs of the same job
concurrently. This is useful for example
if you trigger your job on a frequent
schedule and want to allow
consecutive runs to overlap with each
other, or if you want to trigger multiple
runs which differ by their input
parameters.

This setting affects only new runs. For


example, suppose the job’s
concurrency is 4 and there are 4
concurrent active runs. Then setting
the concurrency to 3 won’t kill any of
the active runs. However, from then
on, new runs are skipped unless there
are fewer than 3 active runs.

This value cannot exceed 1000. Setting


this value to 0 causes all new runs to
be skipped. The default behavior is to
allow only 1 concurrent run.

Response structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64 The canonical identifier for the newly


created job.

List
ENDPOINT | HTTP METHOD
2.0/jobs/list | GET

List all jobs.


Example
Request

curl --netrc --request GET \


https://<databricks-instance>/api/2.0/jobs/list \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


Response

{
"jobs": [
{
"job_id": 1,
"settings": {
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"timeout_seconds": 100000000,
"max_retries": 1,
"schedule": {
"quartz_cron_expression": "0 15 22 * * ?",
"timezone_id": "America/Los_Angeles",
"pause_status": "UNPAUSED"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
},
"created_time": 1457570074236
}
]
}

Response structure
FIELD NAME | TYPE | DESCRIPTION

jobs An array of Job The list of jobs.

Delete
ENDPOINT | HTTP METHOD
2.0/jobs/delete | POST

Delete a job and send an email to the addresses specified in JobSettings.email_notifications . No action occurs
if the job has already been removed. After the job is removed, neither its details nor its run history is visible in
the Jobs UI or API. The job is guaranteed to be removed upon completion of this request. However, runs that
were active before the receipt of this request may still be active. They will be terminated asynchronously.
Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/jobs/delete \
--data '{ "job_id": <job-id> }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .

This example uses a .netrc file.


Request structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64 The canonical identifier of the job to


delete. This field is required.

Get
ENDPOINT | HTTP METHOD
2.0/jobs/get | GET

Retrieve information about a single job.


Example
Request

curl --netrc --request GET \


'https://<databricks-instance>/api/2.0/jobs/get?job_id=<job-id>' \
| jq .

Or:

curl --netrc --get \


https://<databricks-instance>/api/2.0/jobs/get \
--data job_id=<job-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .

This example uses a .netrc file and jq.


Response
{
"job_id": 1,
"settings": {
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"timeout_seconds": 100000000,
"max_retries": 1,
"schedule": {
"quartz_cron_expression": "0 15 22 * * ?",
"timezone_id": "America/Los_Angeles",
"pause_status": "UNPAUSED"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
},
"created_time": 1457570074236
}

Request structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64 The canonical identifier of the job to


retrieve information about. This field is
required.

Response structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64 The canonical identifier for this job.

creator_user_name STRING The creator user name. This field won’t


be included in the response if the user
has been deleted.

settings JobSettings Settings for this job and all of its runs.
These settings can be updated using
the Reset or Update endpoints.

created_time INT64 The time at which this job was created


in epoch milliseconds (milliseconds
since 1/1/1970 UTC).

Reset
ENDPOINT | HTTP METHOD
2.0/jobs/reset | POST

Overwrite all settings for a specific job. Use the Update endpoint to update job settings partially.
Example
This example request makes job 2 identical to job 1 in the create example.

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/jobs/reset \
--data @reset-job.json \
| jq .

reset-job.json :

{
"job_id": 2,
"new_settings": {
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"timeout_seconds": 100000000,
"max_retries": 1,
"schedule": {
"quartz_cron_expression": "0 15 22 * * ?",
"timezone_id": "America/Los_Angeles",
"pause_status": "UNPAUSED"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of reset-job.json with fields that are appropriate for your solution.

This example uses a .netrc file and jq.


Request structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64 The canonical identifier of the job to


reset. This field is required.

new_settings JobSettings The new settings of the job. These


settings completely replace the old
settings.

Changes to the field


JobSettings.timeout_seconds are
applied to active runs. Changes to
other fields are applied to future runs
only.

Update
ENDPOINT | HTTP METHOD
2.0/jobs/update | POST

Add, change, or remove specific settings of an existing job. Use the Reset endpoint to overwrite all job settings.
Example
This example request removes libraries and adds email notification settings to job 1 defined in the create
example.

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/jobs/update \
--data @update-job.json \
| jq .

update-job.json :

{
"job_id": 1,
"new_settings": {
"existing_cluster_id": "1201-my-cluster",
"email_notifications": {
"on_start": [ "someone@example.com" ],
"on_success": [],
"on_failure": []
}
},
"fields_to_remove": ["libraries"]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of update-job.json with fields that are appropriate for your solution.

This example uses a .netrc file and jq.


Request structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64 The canonical identifier of the job to


update. This field is required.

new_settings JobSettings The new settings for the job. Any top-
level fields specified in new_settings
are completely replaced. Partially
updating nested fields is not
supported.

Changes to the field


JobSettings.timeout_seconds are
applied to active runs. Changes to
other fields are applied to future runs
only.

fields_to_remove An array of STRING Remove top-level fields in the job


settings. Removing nested fields is not
supported. This field is optional.

Run now
IMPORTANT
You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you
request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This
limit also affects jobs created by the REST API and notebook workflows.

ENDPOINT | HTTP METHOD
2.0/jobs/run-now | POST

Run a job now and return the run_id of the triggered run.

TIP
If you invoke Create together with Run now, you can use the Runs submit endpoint instead, which allows you to submit
your workload directly without having to create a job.

Example

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/jobs/run-now \
--data @run-job.json \
| jq .

run-job.json :
An example request for a notebook job:
{
"job_id": 1,
"notebook_params": {
"name": "john doe",
"age": "35"
}
}

An example request for a JAR job:

{
"job_id": 2,
"jar_params": [ "john doe", "35" ]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of run-job.json with fields that are appropriate for your solution.

This example uses a .netrc file and jq.


Request structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64

jar_params An array of STRING A list of parameters for jobs with JAR


tasks, e.g.
"jar_params": ["john doe", "35"] .
The parameters will be used to invoke
the main function of the main class
specified in the Spark JAR task. If not
specified upon run-now , it will default
to an empty list. jar_params cannot be
specified in conjunction with
notebook_params. The JSON
representation of this field (i.e.
{"jar_params":["john doe","35"]}
) cannot exceed 10,000 bytes.

notebook_params A map of ParamPair A map from keys to values for jobs


with notebook task, e.g.
"notebook_params": {"name":
"john doe", "age": "35"}
. The map is passed to the notebook
and is accessible through the
dbutils.widgets.get function.

If not specified upon run-now , the


triggered run uses the job’s base
parameters.

You cannot specify notebook_params


in conjunction with jar_params.

The JSON representation of this field


(i.e.
{"notebook_params":{"name":"john
doe","age":"35"}}
) cannot exceed 10,000 bytes.

python_params An array of STRING A list of parameters for jobs with


Python tasks, e.g.
"python_params": ["john doe",
"35"]
. The parameters will be passed to
Python file as command-line
parameters. If specified upon
run-now , it would overwrite the
parameters specified in job setting. The
JSON representation of this field (i.e.
{"python_params":["john
doe","35"]}
) cannot exceed 10,000 bytes.

spark_submit_params An array of STRING A list of parameters for jobs with spark


submit task, e.g.
"spark_submit_params": ["--class",
"org.apache.spark.examples.SparkPi"]
. The parameters will be passed to
spark-submit script as command-line
parameters. If specified upon
run-now , it would overwrite the
parameters specified in job setting. The
JSON representation of this field
cannot exceed 10,000 bytes.

idempotency_token STRING An optional token to guarantee the


idempotency of job run requests. If a
run with the provided token already
exists, the request does not create a
new run but returns the ID of the
existing run instead. If a run with the
provided token is deleted, an error is
returned.

If you specify the idempotency token,


upon failure you can retry until the
request succeeds. Azure Databricks
guarantees that exactly one run is
launched with that idempotency token.

This token must have at most 64


characters.

For more information, see How to


ensure idempotency for jobs.
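
For example, a run-now request that can be retried safely carries a caller-chosen token; re-posting the same body returns the run_id of the existing run instead of starting a second one (values are illustrative):

{
  "job_id": 1,
  "idempotency_token": "nightly-2022-07-21"
}
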

Response structure
FIELD NAME | TYPE | DESCRIPTION

run_id INT64 The globally unique ID of the newly


triggered run.

number_in_job INT64 The sequence number of this run


among all runs of the job.

Runs submit
IMPORTANT
You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you
request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This
limit also affects jobs created by the REST API and notebook workflows.

ENDPOINT | HTTP METHOD
2.0/jobs/runs/submit | POST

Submit a one-time run. This endpoint allows you to submit a workload directly without creating a job. Runs
submitted using this endpoint don’t display in the UI. Use the jobs/runs/get API to check the run state after the
job is submitted.
Example
Request

curl --netrc --request POST \


https://<databricks-instance>/api/2.0/jobs/runs/submit \
--data @submit-job.json \
| jq .
submit-job.json :

{
"run_name": "my spark task",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of submit-job.json with fields that are appropriate for your solution.

This example uses a .netrc file and jq.


Response

{
"run_id": 123
}

Request structure

IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.

FIELD NAME | TYPE | DESCRIPTION

existing_cluster_id OR new_cluster STRING OR NewCluster If existing_cluster_id, the ID of an


existing cluster that will be used for all
runs of this job. When running jobs on
an existing cluster, you may need to
manually restart the cluster if it stops
responding. We suggest running jobs
on new clusters for greater reliability.

If new_cluster, a description of a cluster


that will be created for each run.

If specifying a PipelineTask, then this


field can be empty.

notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task OR pipeline_task | NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask OR PipelineTask |
If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task.

If spark_jar_task, indicates that this job should run a JAR.

If spark_python_task, indicates that this job should run a Python file.

If spark_submit_task, indicates that this job should be launched by the spark submit script.

If pipeline_task, indicates that this job should run a Delta Live Tables pipeline.

run_name STRING An optional name for the run. The


default value is Untitled .

libraries An array of Library An optional list of libraries to be


installed on the cluster that will
execute the job. The default value is an
empty list.

timeout_seconds INT32 An optional timeout applied to each


run of this job. The default behavior is
to have no timeout.

idempotency_token STRING An optional token to guarantee the


idempotency of job run requests. If a
run with the provided token already
exists, the request does not create a
new run but returns the ID of the
existing run instead. If a run with the
provided token is deleted, an error is
returned.

If you specify the idempotency token,


upon failure you can retry until the
request succeeds. Azure Databricks
guarantees that exactly one run is
launched with that idempotency token.

This token must have at most 64


characters.

For more information, see How to


ensure idempotency for jobs.

Response structure
FIELD NAME | TYPE | DESCRIPTION

run_id INT64 The canonical identifier for the newly


submitted run.

Runs list
ENDPOINT | HTTP METHOD
2.0/jobs/runs/list | GET

List runs in descending order by start time.

NOTE
Runs are automatically removed after 60 days. If you want to reference them beyond 60 days, you should save old run
results before they expire. To export using the UI, see Export job run results. To export using the Jobs API, see Runs
export.

Example
Request

curl --netrc --request GET \


'https://<databricks-instance>/api/2.0/jobs/runs/list?job_id=<job-id>&active_only=<true-false>&offset=
<offset>&limit=<limit>&run_type=<run-type>' \
| jq .

Or:

curl --netrc --get \


https://<databricks-instance>/api/2.0/jobs/runs/list \
--data 'job_id=<job-id>&active_only=<true-false>&offset=<offset>&limit=<limit>&run_type=<run-type>' \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .
<true-false> with true or false .
<offset> with the offset value.
<limit> with the limit value.
<run-type> with the run_type value.

This example uses a .netrc file and jq.


Response
{
"runs": [
{
"job_id": 1,
"run_id": 452,
"number_in_job": 5,
"state": {
"life_cycle_state": "RUNNING",
"state_message": "Performing action"
},
"task": {
"notebook_task": {
"notebook_path": "/Users/donald@duck.com/my-notebook"
}
},
"cluster_spec": {
"existing_cluster_id": "1201-my-cluster"
},
"cluster_instance": {
"cluster_id": "1201-my-cluster",
"spark_context_id": "1102398-spark-context-id"
},
"overriding_parameters": {
"jar_params": ["param1", "param2"]
},
"start_time": 1457570074236,
"end_time": 1457570075149,
"setup_duration": 259754,
"execution_duration": 3589020,
"cleanup_duration": 31038,
"trigger": "PERIODIC"
}
],
"has_more": true
}

Request structure
FIELD NAME | TYPE | DESCRIPTION

active_only OR completed_only BOOL OR BOOL If active_only is true , only active runs


are included in the results; otherwise,
lists both active and completed runs.
An active run is a run in the PENDING ,
RUNNING , or TERMINATING
RunLifecycleState. This field cannot be
true when completed_only is true .

If completed_only is true , only


completed runs are included in the
results; otherwise, lists both active and
completed runs. This field cannot be
true when active_only is true .

job_id INT64 The job for which to list runs. If


omitted, the Jobs service will list runs
from all jobs.

offset INT32 The offset of the first run to return,


relative to the most recent run.

limit INT32 The number of runs to return. This


value should be greater than 0 and
less than 1000. The default value is 20.
If a request specifies a limit of 0, the
service will instead use the maximum
limit.

run_type STRING The type of runs to return. For a


description of run types, see Run.

Response structure
FIELD NAME | TYPE | DESCRIPTION

runs An array of Run A list of runs, from most recently


started to least.

has_more BOOL If true, additional runs matching the


provided filter are available for listing.
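
Because has_more signals that additional pages exist, you can walk the full run history by advancing offset until it returns false. A hedged shell sketch:

offset=0
while : ; do
  page=$(curl --netrc --silent --get \
    https://<databricks-instance>/api/2.0/jobs/runs/list \
    --data "job_id=<job-id>&offset=${offset}&limit=25")
  # Print the run IDs on this page; .runs[]? tolerates an empty final page.
  echo "$page" | jq -r '.runs[]?.run_id'
  [ "$(echo "$page" | jq -r '.has_more')" = "true" ] || break
  offset=$((offset + 25))
done
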

Runs get
ENDPOINT | HTTP METHOD
2.0/jobs/runs/get | GET

Retrieve the metadata of a run.

NOTE
Runs are automatically removed after 60 days. If you want to reference them beyond 60 days, you should save old run
results before they expire. To export using the UI, see Export job run results. To export using the Jobs API, see Runs
export.

Example
Request

curl --netrc --request GET \


'https://<databricks-instance>/api/2.0/jobs/runs/get?run_id=<run-id>' \
| jq .

Or:

curl --netrc --get \


https://<databricks-instance>/api/2.0/jobs/runs/get \
--data run_id=<run-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .

This example uses a .netrc file and jq.


Response

{
"job_id": 1,
"run_id": 452,
"number_in_job": 5,
"state": {
"life_cycle_state": "RUNNING",
"state_message": "Performing action"
},
"task": {
"notebook_task": {
"notebook_path": "/Users/someone@example.com/my-notebook"
}
},
"cluster_spec": {
"existing_cluster_id": "1201-my-cluster"
},
"cluster_instance": {
"cluster_id": "1201-my-cluster",
"spark_context_id": "1102398-spark-context-id"
},
"overriding_parameters": {
"jar_params": ["param1", "param2"]
},
"start_time": 1457570074236,
"end_time": 1457570075149,
"setup_duration": 259754,
"execution_duration": 3589020,
"cleanup_duration": 31038,
"trigger": "PERIODIC"
}

Request structure
FIELD NAME | TYPE | DESCRIPTION

run_id INT64 The canonical identifier of the run for


which to retrieve the metadata. This
field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION

job_id INT64 The canonical identifier of the job that


contains this run.

run_id INT64 The canonical identifier of the run. This


ID is unique across all runs of all jobs.

number_in_job INT64 The sequence number of this run


among all runs of the job. This value
starts at 1.

original_attempt_run_id INT64 If this run is a retry of a prior run


attempt, this field contains the run_id
of the original attempt; otherwise, it is
the same as the run_id.

state RunState The result and lifecycle states of the


run.

schedule CronSchedule The cron schedule that triggered this


run if it was triggered by the periodic
scheduler.

task JobTask The task performed by the run, if any.

cluster_spec ClusterSpec A snapshot of the job’s cluster


specification when this run was
created.

cluster_instance ClusterInstance The cluster used for this run. If the run
is specified to use a new cluster, this
field will be set once the Jobs service
has requested a cluster for the run.

overriding_parameters RunParameters The parameters used for this run.

start_time INT64 The time at which this run was started


in epoch milliseconds (milliseconds
since 1/1/1970 UTC). This may not be
the time when the job task starts
executing, for example, if the job is
scheduled to run on a new cluster, this
is the time the cluster creation call is
issued.

end_time INT64 The time at which this run ended in


epoch milliseconds (milliseconds since
1/1/1970 UTC). This field will be set to
0 if the job is still running.

setup_duration INT64 The time it took to set up the cluster in


milliseconds. For runs that run on new
clusters this is the cluster creation
time, for runs that run on existing
clusters this time should be very short.

execution_duration INT64 The time in milliseconds it took to


execute the commands in the JAR or
notebook until they completed, failed,
timed out, were cancelled, or
encountered an unexpected error.

cleanup_duration INT64 The time in milliseconds it took to


terminate the cluster and clean up any
associated artifacts. The total duration
of the run is the sum of the
setup_duration, the
execution_duration, and the
cleanup_duration.

trigger TriggerType The type of trigger that fired this run.

creator_user_name STRING The creator user name. This field won’t


be included in the response if the user
has been deleted

run_page_url STRING The URL to the detail page of the run.


Runs export
Endpoint: 2.0/jobs/runs/export
HTTP method: GET

Export and retrieve the job run task.

NOTE
Only notebook runs can be exported in HTML format. Exporting runs of other types will fail.

Example
Request

curl --netrc --request GET \
'https://<databricks-instance>/api/2.0/jobs/runs/export?run_id=<run-id>' \
| jq .

Or:

curl --netrc --get \
https://<databricks-instance>/api/2.0/jobs/runs/export \
--data run_id=<run-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .

This example uses a .netrc file and jq.


Response

{
"views": [ {
"content": "<!DOCTYPE html><html><head>Head</head><body>Body</body></html>",
"name": "my-notebook",
"type": "NOTEBOOK"
} ]
}

To extract the HTML notebook from the JSON response, download and run this Python script.

NOTE
The notebook body in the __DATABRICKS_NOTEBOOK_MODEL object is encoded.
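
As an alternative to the referenced script, the following is a minimal Python sketch (not that script) that writes each exported view to its own HTML file, assuming you saved the runs/export response to a file named run-export.json. The embedded __DATABRICKS_NOTEBOOK_MODEL object remains encoded inside the written HTML.

# Minimal sketch: write each exported view's HTML content to its own file.
# Assumes the runs/export JSON response was saved to run-export.json.
import json

with open("run-export.json") as f:
    export = json.load(f)

for view in export.get("views", []):
    filename = f"{view['name']}.html"
    with open(filename, "w") as out:
        out.write(view["content"])
    print(f"Wrote {view['type']} view to {filename}")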

Request structure

run_id (INT64): The canonical identifier for the run. This field is required.
views_to_export (ViewsToExport): Which views to export (CODE, DASHBOARDS, or ALL). Defaults to CODE.

Response structure

views (An array of ViewItem): The exported content in HTML format (one for every view item).

Runs cancel

Endpoint: 2.0/jobs/runs/cancel
HTTP method: POST

Cancel a job run. Because the run is canceled asynchronously, the run may still be running when this request
completes. The run will be terminated shortly. If the run is already in a terminal life_cycle_state, this method
is a no-op.
This endpoint validates that the run_id parameter is valid and returns HTTP status code 400 for invalid parameters.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/jobs/runs/cancel \
--data '{ "run_id": <run-id> }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .

This example uses a .netrc file.
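
Because cancellation is asynchronous, a script that needs to wait for the run to actually stop can cancel and then poll runs/get. The following is a minimal Python sketch of that pattern; DATABRICKS_HOST, DATABRICKS_TOKEN, and the run ID are assumptions you replace with your own values.

# Minimal sketch: cancel a run, then poll runs/get until it reaches a terminal state.
# Cancellation is asynchronous, so the run may still be RUNNING right after the request.
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}
run_id = 123  # replace with the ID of the run to cancel

requests.post(f"{host}/api/2.0/jobs/runs/cancel", headers=headers,
              json={"run_id": run_id}).raise_for_status()

while True:
    run = requests.get(f"{host}/api/2.0/jobs/runs/get", headers=headers,
                       params={"run_id": run_id}).json()
    state = run["state"]["life_cycle_state"]
    if state in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print(f"Run {run_id} reached terminal state {state}")
        break
    time.sleep(5)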


Request structure

run_id (INT64): The canonical identifier of the run to cancel. This field is required.

Runs cancel all

Endpoint: 2.0/jobs/runs/cancel-all
HTTP method: POST

Cancel all active runs of a job. Because runs are canceled asynchronously, this doesn't prevent new runs from
being started.
This endpoint validates that the job_id parameter is valid and returns HTTP status code 400 for invalid parameters.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/jobs/runs/cancel-all \
--data '{ "job_id": <job-id> }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .

This example uses a .netrc file.


Request structure

job_id (INT64): The canonical identifier of the job to cancel all runs of. This field is required.

Runs get output

Endpoint: 2.0/jobs/runs/get-output
HTTP method: GET

Retrieve the output and metadata of a run. When a notebook task returns a value through the
dbutils.notebook.exit() call, you can use this endpoint to retrieve that value. Azure Databricks restricts this API to
return the first 5 MB of the output. To return a larger result, you can store job results in a cloud storage
service.
This endpoint validates that the run_id parameter is valid and returns HTTP status code 400 for invalid parameters.
Runs are automatically removed after 60 days. If you want to reference them beyond 60 days, save old
run results before they expire. To export using the UI, see Export job run results. To export using the
Jobs API, see Runs export.
Example
Request

curl --netrc --request GET \
'https://<databricks-instance>/api/2.0/jobs/runs/get-output?run_id=<run-id>' \
| jq .

Or:

curl --netrc --get \
https://<databricks-instance>/api/2.0/jobs/runs/get-output \
--data run_id=<run-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .
This example uses a .netrc file and jq.
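
The same call from Python, as a minimal sketch that prints the value a notebook passed to dbutils.notebook.exit(). DATABRICKS_HOST, DATABRICKS_TOKEN, and the run ID are assumptions you replace with your own values.

# Minimal sketch: fetch the value a notebook passed to dbutils.notebook.exit()
# via the runs/get-output endpoint.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

response = requests.get(
    f"{host}/api/2.0/jobs/runs/get-output",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": 123},  # replace 123 with the ID of the run
)
response.raise_for_status()
output = response.json()
if "notebook_output" in output:
    print(output["notebook_output"].get("result"))
else:
    print("No output available:", output.get("error"))
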
Response

{
"metadata": {
"job_id": 1,
"run_id": 452,
"number_in_job": 5,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"task": {
"notebook_task": {
"notebook_path": "/Users/someone@example.com/my-notebook"
}
},
"cluster_spec": {
"existing_cluster_id": "1201-my-cluster"
},
"cluster_instance": {
"cluster_id": "1201-my-cluster",
"spark_context_id": "1102398-spark-context-id"
},
"overriding_parameters": {
"jar_params": ["param1", "param2"]
},
"start_time": 1457570074236,
"setup_duration": 259754,
"execution_duration": 3589020,
"cleanup_duration": 31038,
"trigger": "PERIODIC"
},
"notebook_output": {
"result": "the maybe truncated string passed to dbutils.notebook.exit()"
}
}

Request structure

run_id (INT64): The canonical identifier for the run. This field is required.

Response structure

notebook_output OR error (NotebookOutput OR STRING): If notebook_output, the output of a notebook task, if available. A notebook task that terminates (either successfully or with a failure) without calling dbutils.notebook.exit() is considered to have an empty output. This field will be set but its result value will be empty. If error, an error message indicating why output is not available. The message is unstructured, and its exact format is subject to change.
metadata (Run): All details of the run except for its output.

Runs delete

Endpoint: 2.0/jobs/runs/delete
HTTP method: POST

Delete a non-active run. Returns an error if the run is active.


Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/jobs/runs/delete \
--data '{ "run_id": <run-id> }'

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .

This example uses a .netrc file.


Request structure

run_id (INT64): The canonical identifier of the run for which to retrieve the metadata.

Data structures
In this section:
ClusterInstance
ClusterSpec
CronSchedule
Job
JobEmailNotifications
JobSettings
JobTask
NewCluster
NotebookOutput
NotebookTask
ParamPair
PipelineTask
Run
RunLifeCycleState
RunParameters
RunResultState
RunState
SparkJarTask
SparkPythonTask
SparkSubmitTask
TriggerType
ViewItem
ViewType
ViewsToExport
ClusterInstance
Identifiers for the cluster and Spark context used by a run. These two values together identify an execution
context across all time.

cluster_id (STRING): The canonical identifier for the cluster used by a run. This field is always available for runs on existing clusters. For runs on new clusters, it becomes available once the cluster is created. This value can be used to view logs by browsing to /#setting/sparkui/$cluster_id/driver-logs. The logs will continue to be available after the run completes. The response won't include this field if the identifier is not available yet.
spark_context_id (STRING): The canonical identifier for the Spark context used by a run. This field will be filled in once the run begins execution. This value can be used to view the Spark UI by browsing to /#setting/sparkui/$cluster_id/$spark_context_id. The Spark UI will continue to be available after the run has completed. The response won't include this field if the identifier is not available yet.

ClusterSpec
IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.

existing_cluster_id OR new_cluster (STRING OR NewCluster): If existing_cluster_id, the ID of an existing cluster that will be used for all runs of this job. When running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability. If new_cluster, a description of a cluster that will be created for each run. If specifying a PipelineTask, then this field can be empty.
libraries (An array of Library): An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list.

CronSchedule

quartz_cron_expression (STRING): A Cron expression using Quartz syntax that describes the schedule for a job. See Cron Trigger for details. This field is required.
timezone_id (STRING): A Java timezone ID. The schedule for a job will be resolved with respect to this timezone. See Java TimeZone for details. This field is required.
pause_status (STRING): Indicate whether this schedule is paused or not. Either "PAUSED" or "UNPAUSED".

Job

job_id (INT64): The canonical identifier for this job.
creator_user_name (STRING): The creator user name. This field won't be included in the response if the user has already been deleted.
run_as (STRING): The user name that the job will run as. run_as is based on the current job settings, and is set to the creator of the job if job access control is disabled, or the is_owner permission if job access control is enabled.
settings (JobSettings): Settings for this job and all of its runs. These settings can be updated using the resetJob method.
created_time (INT64): The time at which this job was created in epoch milliseconds (milliseconds since 1/1/1970 UTC).

JobEmailNotifications

IMPORTANT
The on_start, on_success, and on_failure fields accept only Latin characters (ASCII character set). Using non-ASCII
characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.

on_start (An array of STRING): A list of email addresses to be notified when a run begins. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent.
on_success (An array of STRING): A list of email addresses to be notified when a run successfully completes. A run is considered to have completed successfully if it ends with a TERMINATED life_cycle_state and a SUCCESSFUL result_state. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent.
on_failure (An array of STRING): A list of email addresses to be notified when a run unsuccessfully completes. A run is considered to have completed unsuccessfully if it ends with an INTERNAL_ERROR life_cycle_state or a SKIPPED, FAILED, or TIMED_OUT result_state. If this is not specified on job creation, reset, or update, the list will be empty, and notifications are not sent.
no_alert_for_skipped_runs (BOOL): If true, do not send email to recipients specified in on_failure if the run is skipped.

JobSettings
IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.

Settings for a job. These settings can be updated using the resetJob method. An illustrative payload appears after the field list below.

existing_cluster_id OR new_cluster (STRING OR NewCluster): If existing_cluster_id, the ID of an existing cluster that will be used for all runs of this job. When running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability. If new_cluster, a description of a cluster that will be created for each run. If specifying a PipelineTask, then this field can be empty.
notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task OR pipeline_task (NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask OR PipelineTask): If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task. If spark_jar_task, indicates that this job should run a JAR. If spark_python_task, indicates that this job should run a Python file. If spark_submit_task, indicates that this job should be launched by the spark submit script. If pipeline_task, indicates that this job should run a Delta Live Tables pipeline.
name (STRING): An optional name for the job. The default value is Untitled.
libraries (An array of Library): An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list.
email_notifications (JobEmailNotifications): An optional set of email addresses that will be notified when runs of this job begin or complete as well as when this job is deleted. The default behavior is to not send any emails.
timeout_seconds (INT32): An optional timeout applied to each run of this job. The default behavior is to have no timeout.
max_retries (INT32): An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with the FAILED result_state or INTERNAL_ERROR life_cycle_state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry.
min_retry_interval_millis (INT32): An optional minimal interval in milliseconds between attempts. The default behavior is that unsuccessful runs are immediately retried.
retry_on_timeout (BOOL): An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
schedule (CronSchedule): An optional periodic schedule for this job. The default behavior is that the job will only run when triggered by clicking "Run Now" in the Jobs UI or sending an API request to runNow.
max_concurrent_runs (INT32): An optional maximum allowed number of concurrent runs of the job. Set this value if you want to be able to execute multiple runs of the same job concurrently. This is useful for example if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs which differ by their input parameters. This setting affects only new runs. For example, suppose the job's concurrency is 4 and there are 4 concurrent active runs. Then setting the concurrency to 3 won't kill any of the active runs. However, from then on, new runs will be skipped unless there are fewer than 3 active runs. This value cannot exceed 1000. Setting this value to 0 causes all new runs to be skipped. The default behavior is to allow only 1 concurrent run.
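
Taken together, these fields form the JSON body used when creating or resetting a job. The following is an illustrative Python sketch of a JobSettings payload for a scheduled notebook job; the name, notebook path, Spark version, node type, schedule, and email address are placeholder values, not recommendations.

# Illustrative sketch of a JobSettings payload for a scheduled notebook job.
# All field values below are placeholders; substitute your own.
job_settings = {
    "name": "Nightly model training",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_D3_v2",
        "num_workers": 2,
    },
    "notebook_task": {
        "notebook_path": "/Users/someone@example.com/my-notebook",
        "base_parameters": {"env": "prod"},
    },
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "email_notifications": {"on_failure": ["someone@example.com"]},
    "timeout_seconds": 3600,
    "max_retries": 1,
    "max_concurrent_runs": 1,
}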

JobTask

notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task OR pipeline_task (NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask OR PipelineTask): If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task. If spark_jar_task, indicates that this job should run a JAR. If spark_python_task, indicates that this job should run a Python file. If spark_submit_task, indicates that this job should be launched by the spark submit script. If pipeline_task, indicates that this job should run a Delta Live Tables pipeline.

NewCluster

num_workers OR autoscale (INT32 OR AutoScale): If num_workers, number of worker nodes that this cluster should have. A cluster has one Spark driver and num_workers executors for a total of num_workers + 1 Spark nodes. Note: when reading the properties of a cluster, this field reflects the desired number of workers rather than the actual current number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field will immediately be updated to reflect the target size of 10 workers, whereas the workers listed in spark_info gradually increase from 5 to 10 as the new nodes are provisioned. If autoscale, parameters needed in order to automatically scale clusters up and down based on load.
spark_version (STRING): The Spark version of the cluster. A list of available Spark versions can be retrieved by using the Runtime versions API call. This field is required.
spark_conf (SparkConfPair): An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively. Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}.
node_type_id (STRING): This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the List node types API call. This field is required.
driver_node_type_id (STRING): The node type of the Spark driver. This field is optional; if unset, the driver node type is set as the same value as node_type_id defined above.
custom_tags (ClusterTag): An object containing a set of tags for cluster resources. Databricks tags all cluster resources (such as VMs) with these tags in addition to default_tags. Note: tags are not supported on legacy node types such as compute-optimized and memory-optimized, and Databricks allows at most 45 custom tags.
cluster_log_conf (ClusterLogConf): The configuration for delivering Spark logs to a long-term storage destination. Only one destination can be specified for one cluster. If the conf is given, the logs will be delivered to the destination every 5 mins. The destination of driver logs is <destination>/<cluster-id>/driver, while the destination of executor logs is <destination>/<cluster-id>/executor.
init_scripts (An array of InitScriptInfo): The configuration for storing init scripts. Any number of scripts can be specified. The scripts are executed sequentially in the order provided. If cluster_log_conf is specified, init script logs are sent to <destination>/<cluster-id>/init_scripts.
spark_env_vars (SparkEnvPair): An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (that is, export X='Y') while launching the driver and workers. In order to specify an additional set of SPARK_DAEMON_JAVA_OPTS, we recommend appending them to $SPARK_DAEMON_JAVA_OPTS as shown in the following example. This ensures that all default Databricks managed environment variables are included as well. Example Spark environment variables: {"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"} or {"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}.
enable_elastic_disk (BOOL): Autoscaling local storage: when enabled, this cluster dynamically acquires additional disk space when its Spark workers are running low on disk space. Refer to Autoscaling local storage for details.
driver_instance_pool_id (STRING): The optional ID of the instance pool to use for the driver node. You must also specify instance_pool_id. Refer to Instance Pools API 2.0 for details.
instance_pool_id (STRING): The optional ID of the instance pool to use for cluster nodes. If driver_instance_pool_id is present, instance_pool_id is used for worker nodes only. Otherwise, it is used for both the driver node and worker nodes. Refer to Instance Pools API 2.0 for details.

NotebookOutput

result (STRING): The value passed to dbutils.notebook.exit(). Azure Databricks restricts this API to return the first 1 MB of the value. For a larger result, your job can store the results in a cloud storage service. This field will be absent if dbutils.notebook.exit() was never called.
truncated (BOOLEAN): Whether or not the result was truncated.

NotebookTask
All of the output cells are subject to an 8 MB size limit. If the output of a cell is larger, the rest of the run
is cancelled and the run is marked as failed. In that case, some of the content output from other cells may
also be missing.
If you need help finding the cell that is beyond the limit, run the notebook against an all-purpose cluster and use
this notebook autosave technique.

notebook_path (STRING): The absolute path of the notebook to be run in the Azure Databricks workspace. This path must begin with a slash. This field is required.
revision_timestamp (LONG): The timestamp of the revision of the notebook.
base_parameters (A map of ParamPair): Base parameters to be used for each run of this job. If the run is initiated by a call to run-now with parameters specified, the two parameters maps will be merged. If the same key is specified in base_parameters and in run-now, the value from run-now will be used. Use Task parameter variables to set parameters containing information about job runs. If the notebook takes a parameter that is not specified in the job's base_parameters or the run-now override parameters, the default value from the notebook will be used. Retrieve these parameters in a notebook using dbutils.widgets.get.

ParamPair
Name-based parameters for jobs running notebook tasks.
IMPORTANT
The fields in this data structure accept only Latin characters (ASCII character set). Using non-ASCII characters will return
an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.

Key (STRING): Parameter name. Pass to dbutils.widgets.get to retrieve the value.
Value (STRING): Parameter value.

PipelineTask

pipeline_id (STRING): The full name of the Delta Live Tables pipeline task to execute.

Run
All the information about a run except for its output. The output can be retrieved separately with the
getRunOutput method.

job_id (INT64): The canonical identifier of the job that contains this run.
run_id (INT64): The canonical identifier of the run. This ID is unique across all runs of all jobs.
creator_user_name (STRING): The creator user name. This field won't be included in the response if the user has already been deleted.
number_in_job (INT64): The sequence number of this run among all runs of the job. This value starts at 1.
original_attempt_run_id (INT64): If this run is a retry of a prior run attempt, this field contains the run_id of the original attempt; otherwise, it is the same as the run_id.
state (RunState): The result and lifecycle states of the run.
schedule (CronSchedule): The cron schedule that triggered this run if it was triggered by the periodic scheduler.
task (JobTask): The task performed by the run, if any.
cluster_spec (ClusterSpec): A snapshot of the job's cluster specification when this run was created.
cluster_instance (ClusterInstance): The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run.
overriding_parameters (RunParameters): The parameters used for this run.
start_time (INT64): The time at which this run was started in epoch milliseconds (milliseconds since 1/1/1970 UTC). This may not be the time when the job task starts executing, for example, if the job is scheduled to run on a new cluster, this is the time the cluster creation call is issued.
setup_duration (INT64): The time it took to set up the cluster in milliseconds. For runs that run on new clusters this is the cluster creation time; for runs that run on existing clusters this time should be very short.
execution_duration (INT64): The time in milliseconds it took to execute the commands in the JAR or notebook until they completed, failed, timed out, were cancelled, or encountered an unexpected error.
cleanup_duration (INT64): The time in milliseconds it took to terminate the cluster and clean up any associated artifacts. The total duration of the run is the sum of the setup_duration, the execution_duration, and the cleanup_duration.
end_time (INT64): The time at which this run ended in epoch milliseconds (milliseconds since 1/1/1970 UTC). This field will be set to 0 if the job is still running.
trigger (TriggerType): The type of trigger that fired this run.
run_name (STRING): An optional name for the run. The default value is Untitled. The maximum allowed length is 4096 bytes in UTF-8 encoding.
run_page_url (STRING): The URL to the detail page of the run.
run_type (STRING): The type of the run. JOB_RUN - Normal job run. A run created with Run now. WORKFLOW_RUN - Workflow run. A run created with dbutils.notebook.run. SUBMIT_RUN - Submit run. A run created with Run now.
attempt_number (INT32): The sequence number of this run attempt for a triggered job run. The initial attempt of a run has an attempt_number of 0. If the initial run attempt fails, and the job has a retry policy (max_retries > 0), subsequent runs are created with an original_attempt_run_id of the original attempt's ID and an incrementing attempt_number. Runs are retried only until they succeed, and the maximum attempt_number is the same as the max_retries value for the job.

RunLifeCycleState
The life cycle state of a run. Allowed state transitions are:
PENDING -> RUNNING -> TERMINATING -> TERMINATED
PENDING -> SKIPPED
PENDING -> INTERNAL_ERROR
RUNNING -> INTERNAL_ERROR
TERMINATING -> INTERNAL_ERROR

PENDING: The run has been triggered. If there is not already an active run of the same job, the cluster and execution context are being prepared. If there is already an active run of the same job, the run will immediately transition into the SKIPPED state without preparing any resources.
RUNNING: The task of this run is being executed.
TERMINATING: The task of this run has completed, and the cluster and execution context are being cleaned up.
TERMINATED: The task of this run has completed, and the cluster and execution context have been cleaned up. This state is terminal.
SKIPPED: This run was aborted because a previous run of the same job was already active. This state is terminal.
INTERNAL_ERROR: An exceptional state that indicates a failure in the Jobs service, such as network failure over a long period. If a run on a new cluster ends in the INTERNAL_ERROR state, the Jobs service terminates the cluster as soon as possible. This state is terminal.

RunParameters
Parameters for this run. Only one of jar_params, python_params, or notebook_params should be specified in the
run-now request, depending on the type of job task. Jobs with a Spark JAR task or Python task take a list of
position-based parameters, and jobs with notebook tasks take a key-value map.

jar_params (An array of STRING): A list of parameters for jobs with Spark JAR tasks, e.g. "jar_params": ["john doe", "35"]. The parameters will be used to invoke the main function of the main class specified in the Spark JAR task. If not specified upon run-now, it will default to an empty list. jar_params cannot be specified in conjunction with notebook_params. The JSON representation of this field (i.e. {"jar_params":["john doe","35"]}) cannot exceed 10,000 bytes. Use Task parameter variables to set parameters containing information about job runs.
notebook_params (A map of ParamPair): A map from keys to values for jobs with notebook task, e.g. "notebook_params": {"name": "john doe", "age": "35"}. The map is passed to the notebook and is accessible through the dbutils.widgets.get function. If not specified upon run-now, the triggered run uses the job's base parameters. notebook_params cannot be specified in conjunction with jar_params. Use Task parameter variables to set parameters containing information about job runs. The JSON representation of this field (i.e. {"notebook_params":{"name":"john doe","age":"35"}}) cannot exceed 10,000 bytes.
python_params (An array of STRING): A list of parameters for jobs with Python tasks, e.g. "python_params": ["john doe", "35"]. The parameters are passed to the Python file as command-line parameters. If specified upon run-now, it would overwrite the parameters specified in the job setting. The JSON representation of this field (i.e. {"python_params":["john doe","35"]}) cannot exceed 10,000 bytes. Use Task parameter variables to set parameters containing information about job runs. Important: these parameters accept only Latin characters (ASCII character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.
spark_submit_params (An array of STRING): A list of parameters for jobs with spark submit task, e.g. "spark_submit_params": ["--class", "org.apache.spark.examples.SparkPi"]. The parameters are passed to the spark-submit script as command-line parameters. If specified upon run-now, it would overwrite the parameters specified in the job setting. The JSON representation of this field cannot exceed 10,000 bytes. Use Task parameter variables to set parameters containing information about job runs. Important: these parameters accept only Latin characters (ASCII character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.

RunResultState
The result state of the run.
If life_cycle_state = TERMINATED: if the run had a task, the result is guaranteed to be available, and it
indicates the result of the task.
If life_cycle_state = PENDING, RUNNING, or SKIPPED, the result state is not available.
If life_cycle_state = TERMINATING or life_cycle_state = INTERNAL_ERROR: the result state is available if the run
had a task and managed to start it.
Once available, the result state never changes.

SUCCESS: The task completed successfully.
FAILED: The task completed with an error.
TIMEDOUT: The run was stopped after reaching the timeout.
CANCELED: The run was canceled at user request.

RunState

life_cycle_state (RunLifeCycleState): A description of a run's current location in the run lifecycle. This field is always available in the response.
result_state (RunResultState): The result state of a run. If it is not available, the response won't include this field. See RunResultState for details about the availability of result_state.
user_cancelled_or_timedout (BOOLEAN): Whether a run was canceled manually by a user or by the scheduler because the run timed out.
state_message (STRING): A descriptive message for the current state. This field is unstructured, and its exact format is subject to change.

SparkJarTask

jar_uri (STRING): Deprecated since 04/2016. Provide a jar through the libraries field instead. For an example, see Create.
main_class_name (STRING): The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. The code should use SparkContext.getOrCreate to obtain a Spark context; otherwise, runs of the job will fail.
parameters (An array of STRING): Parameters passed to the main method. Use Task parameter variables to set parameters containing information about job runs.

SparkPythonTask

python_file (STRING): The URI of the Python file to be executed. DBFS paths are supported. This field is required.
parameters (An array of STRING): Command line parameters passed to the Python file. Use Task parameter variables to set parameters containing information about job runs.

SparkSubmitTask

IMPORTANT
You can invoke Spark submit tasks only on new clusters.
In the new_cluster specification, libraries and spark_conf are not supported. Instead, use --jars and
--py-files to add Java and Python libraries and --conf to set the Spark configuration.
master , deploy-mode , and executor-cores are automatically configured by Azure Databricks; you cannot specify
them in parameters.
By default, the Spark submit job uses all available memory (excluding reserved memory for Azure Databricks services).
You can set --driver-memory , and --executor-memory to a smaller value to leave some room for off-heap usage.
The --jars , --py-files , --files arguments support DBFS paths.

For example, assuming the JAR is uploaded to DBFS, you can run SparkPi by setting the following parameters.

{
"parameters": [
"--class",
"org.apache.spark.examples.SparkPi",
"dbfs:/path/to/examples.jar",
"10"
]
}

parameters (An array of STRING): Command-line parameters passed to spark submit. Use Task parameter variables to set parameters containing information about job runs.

TriggerType
These are the types of triggers that can fire a run.

PERIODIC: Schedules that periodically trigger runs, such as a cron scheduler.
ONE_TIME: One time triggers that fire a single run. This occurs when you trigger a single run on demand through the UI or the API.
RETRY: Indicates a run that is triggered as a retry of a previously failed run. This occurs when you request to re-run the job in case of failures.

ViewItem
The exported content is in HTML format. For example, if the view to export is dashboards, one HTML string is
returned for every dashboard.

content (STRING): Content of the view.
name (STRING): Name of the view item. In the case of code view, the notebook's name. In the case of dashboard view, the dashboard's name.
type (ViewType): Type of the view item.

ViewType

NOTEBOOK: Notebook view item.
DASHBOARD: Dashboard view item.

ViewsToExport
View to export: either code, all dashboards, or all.

CODE: Code view of the notebook.
DASHBOARDS: All dashboard views of the notebook.
ALL: All views of the notebook.


Libraries API 2.0

The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

All cluster statuses

Endpoint: 2.0/libraries/all-cluster-statuses
HTTP method: GET

Get the status of all libraries on all clusters. A status will be available for all libraries installed on clusters via the
API or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been
set to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed
on this specific cluster.
Example
Request

curl --netrc --request GET \
https://<databricks-instance>/api/2.0/libraries/all-cluster-statuses \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


Response
{
"statuses": [
{
"cluster_id": "11203-my-cluster",
"library_statuses": [
{
"library": {
"jar": "dbfs:/mnt/libraries/library.jar"
},
"status": "INSTALLING",
"messages": [],
"is_library_for_all_clusters": false
}
]
},
{
"cluster_id": "20131-my-other-cluster",
"library_statuses": [
{
"library": {
"egg": "dbfs:/mnt/libraries/library.egg"
},
"status": "ERROR",
"messages": ["Could not download library"],
"is_library_for_all_clusters": false
}
]
}
]
}

Response structure

statuses (An array of ClusterLibraryStatuses): A list of cluster statuses.

Cluster status

Endpoint: 2.0/libraries/cluster-status
HTTP method: GET

Get the status of libraries on a cluster. A status will be available for all libraries installed on the cluster via the API
or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been set
to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed on
the cluster.
Example
Request

curl --netrc --request GET \
'https://<databricks-instance>/api/2.0/libraries/cluster-status?cluster_id=<cluster-id>' \
| jq .

Or:

curl --netrc --get \
https://<databricks-instance>/api/2.0/libraries/cluster-status \
--data cluster_id=<cluster-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<cluster-id> with the Azure Databricks workspace ID of the cluster, for example 1234-567890-example123 .

This example uses a .netrc file and jq.


Response

{
"cluster_id": "11203-my-cluster",
"library_statuses": [
{
"library": {
"jar": "dbfs:/mnt/libraries/library.jar"
},
"status": "INSTALLED",
"messages": [],
"is_library_for_all_clusters": false
},
{
"library": {
"pypi": {
"package": "beautifulsoup4"
},
},
"status": "INSTALLING",
"messages": ["Successfully resolved package from PyPI"],
"is_library_for_all_clusters": false
},
{
"library": {
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
},
},
"status": "FAILED",
"messages": ["R package installation is not supported on this spark version.\nPlease upgrade to
Runtime 3.2 or higher"],
"is_library_for_all_clusters": false
}
]
}

Request structure

cluster_id (STRING): Unique identifier of the cluster whose status should be retrieved. This field is required.

Response structure

cluster_id (STRING): Unique identifier for the cluster.
library_statuses (An array of LibraryFullStatus): Status of all libraries on the cluster.

Install

Endpoint: 2.0/libraries/install
HTTP method: POST

Install libraries on a cluster. The installation is asynchronous - it completes in the background after the request.

IMPORTANT
This call will fail if the cluster is terminated.

Installing a wheel library on a cluster is like running the pip command against the wheel file directly on driver
and executors. All the dependencies specified in the library setup.py file are installed and this requires the
library name to satisfy the wheel file name convention.
The installation on the executors happens only when a new task is launched. With Databricks Runtime 7.1 and
below, the installation order of libraries is nondeterministic. For wheel libraries, you can ensure a deterministic
installation order by creating a zip file with suffix .wheelhouse.zip that includes all the wheel files.
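
The following is a minimal Python sketch of building such a wheelhouse archive with the standard zipfile module; the source directory and archive name are placeholders.

# Minimal sketch: bundle a set of wheel files into a single .wheelhouse.zip so
# they install in a deterministic order. The source directory is a placeholder.
import zipfile
from pathlib import Path

wheel_dir = Path("dist")  # directory containing the .whl files to bundle
archive = Path("my-libraries.wheelhouse.zip")

wheels = sorted(wheel_dir.glob("*.whl"))
with zipfile.ZipFile(archive, "w") as zf:
    for wheel in wheels:
        zf.write(wheel, arcname=wheel.name)
print(f"Wrote {archive} with {len(wheels)} wheels")
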
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/libraries/install \
--data @install-libraries.json

install-libraries.json :
{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"whl": "dbfs:/mnt/libraries/mlflow-0.0.1.dev0-py2-none-any.whl"
},
{
"whl": "dbfs:/mnt/libraries/wheel-libraries.wheelhouse.zip"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": ["slf4j:slf4j"]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "https://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of install-libraries.json with fields that are appropriate for your solution.

This example uses a .netrc file.
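
The same install request from Python, as a minimal sketch for a single PyPI library. DATABRICKS_HOST, DATABRICKS_TOKEN, the cluster ID, and the package name are assumptions you replace with your own values.

# Minimal sketch: install a PyPI library on a cluster with the Libraries API 2.0.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

response = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "1234-567890-example123",
        "libraries": [{"pypi": {"package": "simplejson"}}],
    },
)
response.raise_for_status()
# Installation is asynchronous; poll 2.0/libraries/cluster-status to track progress.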


Request structure

cluster_id (STRING): Unique identifier for the cluster on which to install these libraries. This field is required.
libraries (An array of Library): The libraries to install.

Uninstall

Endpoint: 2.0/libraries/uninstall
HTTP method: POST

Set libraries to be uninstalled on a cluster. The libraries aren’t uninstalled until the cluster is restarted.
Uninstalling libraries that are not installed on the cluster has no impact but is not an error.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/libraries/uninstall \
--data @uninstall-libraries.json

uninstall-libraries.json :

{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"cran": "ada"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of uninstall-libraries.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Request structure

cluster_id (STRING): Unique identifier for the cluster on which to uninstall these libraries. This field is required.
libraries (An array of Library): The libraries to uninstall.

Data structures
In this section:
ClusterLibraryStatuses
Library
LibraryFullStatus
MavenLibrary
PythonPyPiLibrary
RCranLibrary
LibraryInstallStatus
ClusterLibraryStatuses

cluster_id (STRING): Unique identifier for the cluster.
library_statuses (An array of LibraryFullStatus): Status of all libraries on the cluster.

Library

jar OR egg OR whl OR pypi OR maven OR cran (STRING OR STRING OR STRING OR PythonPyPiLibrary OR MavenLibrary OR RCranLibrary): If jar, URI of the JAR to be installed. DBFS and ADLS (abfss) URIs are supported. For example: { "jar": "dbfs:/mnt/databricks/library.jar" } or { "jar": "abfss://my-bucket/library.jar" }. If ADLS is used, make sure the cluster has read access on the library. If egg, URI of the egg to be installed. DBFS and ADLS URIs are supported. For example: { "egg": "dbfs:/my/egg" } or { "egg": "abfss://my-bucket/egg" }. If whl, URI of the wheel or zipped wheels to be installed. DBFS and ADLS URIs are supported. For example: { "whl": "dbfs:/my/whl" } or { "whl": "abfss://my-bucket/whl" }. If ADLS is used, make sure the cluster has read access on the library. Also the wheel file name needs to use the correct convention. If zipped wheels are to be installed, the file name suffix should be .wheelhouse.zip. If pypi, specification of a PyPI library to be installed. Specifying the repo field is optional and if not specified, the default pip index is used. For example: { "package": "simplejson", "repo": "https://my-repo.com" }. If maven, specification of a Maven library to be installed. For example: { "coordinates": "org.jsoup:jsoup:1.7.2" }. If cran, specification of a CRAN library to be installed.

LibraryFullStatus
The status of the library on a specific cluster.

library (Library): Unique identifier for the library.
status (LibraryInstallStatus): Status of installing the library on the cluster.
messages (An array of STRING): All the info and warning messages that have occurred so far for this library.
is_library_for_all_clusters (BOOL): Whether the library was set to be installed on all clusters via the libraries UI.

MavenLibrary

coordinates (STRING): Gradle-style Maven coordinates. For example: org.jsoup:jsoup:1.7.2. This field is required.
repo (STRING): Maven repo to install the Maven package from. If omitted, both Maven Central Repository and Spark Packages are searched.
exclusions (An array of STRING): List of dependencies to exclude. For example: ["slf4j:slf4j", "*:hadoop-client"]. Maven dependency exclusions: https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html.

PythonPyPiLibrary

package (STRING): The name of the PyPI package to install. An optional exact version specification is also supported. Examples: simplejson and simplejson==3.8.0. This field is required.
repo (STRING): The repository where the package can be found. If not specified, the default pip index is used.

RCranLibrary

package (STRING): The name of the CRAN package to install. This field is required.
repo (STRING): The repository where the package can be found. If not specified, the default CRAN repo is used.

LibraryInstallStatus
The status of a library on a specific cluster.

PENDING: No action has yet been taken to install the library. This state should be very short lived.
RESOLVING: Metadata necessary to install the library is being retrieved from the provided repository. For Jar, Egg, and Whl libraries, this step is a no-op.
INSTALLING: The library is actively being installed, either by adding resources to Spark or executing system commands inside the Spark nodes.
INSTALLED: The library has been successfully installed.
SKIPPED: Installation on a Databricks Runtime 7.0 or above cluster was skipped due to Scala version incompatibility.
FAILED: Some step in installation failed. More information can be found in the messages field.
UNINSTALL_ON_RESTART: The library has been marked for removal. Libraries can be removed only when clusters are restarted, so libraries that enter this state will remain until the cluster is restarted.
MLflow API 2.0

Azure Databricks provides a managed version of the MLflow tracking server and the Model Registry, which host
the MLflow REST API. You can invoke the MLflow REST API using URLs of the form
https://<databricks-instance>/api/2.0/mlflow/<api-endpoint>

replacing <databricks-instance> with the workspace URL of your Azure Databricks deployment.
MLflow compatibility matrix lists the MLflow release packaged in each Databricks Runtime version and a link to
the respective documentation.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Rate limits
The MLflow APIs are rate limited as four groups, based on their function and maximum throughput. The
following is the list of API groups and their respective limits in qps (queries per second):
Low throughput experiment management (list, update, delete, restore): 7 qps
Search runs: 7 qps
Log batch: 47 qps
All other APIs: 127 qps
In addition, there is a limit of 20 concurrent model versions in Pending status (in creation) per workspace.
If the rate limit is reached, subsequent API calls will return status code 429. All MLflow clients (including the UI)
automatically retry 429s with an exponential backoff.
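
If you call the REST endpoints directly rather than through an MLflow client, you need to handle 429 responses yourself. The following is a minimal Python sketch of a retry with exponential backoff; DATABRICKS_HOST, DATABRICKS_TOKEN, the attempt limit, and the search payload are assumptions you replace with your own values.

# Minimal sketch: retry an MLflow REST call with exponential backoff when the
# rate limit (HTTP 429) is hit.
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

def post_with_backoff(path, payload, max_attempts=6):
    for attempt in range(max_attempts):
        response = requests.post(
            f"{host}/api/2.0/mlflow/{path}",
            headers={"Authorization": f"Bearer {token}"},
            json=payload,
        )
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        time.sleep(2 ** attempt)  # back off: 1, 2, 4, 8, ... seconds
    raise RuntimeError(f"Still rate limited after {max_attempts} attempts")

# Example: search runs in an experiment (the experiment ID is a placeholder).
result = post_with_backoff("runs/search", {"experiment_ids": ["1234567890"]})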
API reference
The MLflow API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Impor t file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Permissions API 2.0

IMPORTANT
This feature is in Public Preview.

The Permissions API lets you manage permissions for:


Tokens
Clusters
Pools
Jobs
Notebooks
Folders (directories)
MLflow registered models
The Permissions API is provided as an OpenAPI 3.0 specification that you can download and view as a structured
API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Impor t file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Repos API 2.0

The Repos API allows you to manage Databricks repos programmatically. See Git integration with Databricks
Repos.
The Repos API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Impor t file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
SCIM API 2.0

IMPORTANT
This feature is in Public Preview.

Azure Databricks supports SCIM, or System for Cross-domain Identity Management, an open standard that
allows you to automate user provisioning using a REST API and JSON. The Azure Databricks SCIM API follows
version 2.0 of the SCIM protocol.

Requirements
Your Azure Databricks account must have the Premium Plan.

SCIM 2.0 APIs


An Azure Databricks workspace administrator can invoke all SCIM API endpoints:
SCIM API 2.0 (Me)
SCIM API 2.0 (Users)
SCIM API 2.0 (ServicePrincipals)
SCIM API 2.0 (Groups)
Non-admin users and service principals can invoke the Me Get endpoint, the Users Get endpoint to display
names and IDs, and the Group Get endpoint to display group display names and IDs.
Call workspace SCIM APIs
For the workspace SCIM API examples, replace <databricks-instance> with the workspace URL
of your Azure Databricks deployment.

https://<databricks-instance>/api/2.0/preview/scim/v2/<api-endpoint>

Header parameters

Authorization (required) (STRING): Set to Bearer <access-token>. See Authentication using Azure Databricks personal access tokens, Authenticate using Azure Active Directory tokens, and Token API 2.0 to learn how to generate tokens. Important: the Azure Databricks admin user who generates this token should not be managed by your identity provider (IdP). An Azure Databricks admin user who is managed by the IdP can be deprovisioned using the IdP, which would cause your SCIM provisioning integration to be disabled.
Or: The .netrc file (if using curl): Instead of an Authorization header, you can use the .netrc file along with the --netrc (or -n) option. This file stores machine names and tokens separate from your code and reduces the need to type credential strings multiple times. The .netrc contains one entry for each combination of <databricks-instance> and token. For example: machine <databricks-instance> login token password <access-token>
Content-Type (required for write operations) (STRING): Set to application/scim+json.
Accept (required for read operations) (STRING): Set to application/scim+json.

Filter results
Use filters to return a subset of users or groups. For all users, the user userName and group displayName fields
are supported. Admin users can filter users on the active attribute.

eq (equals): Attribute and operator values must be identical.
ne (not equal to): Attribute and operator values are not identical.
co (contains): Operator value must be a substring of the attribute value.
sw (starts with): Attribute must start with and contain the operator value.
and (logical AND): Match when all expressions evaluate to true.
or (logical OR): Match when any expression evaluates to true.

Sort results
Sort results using the sortBy and sortOrder query parameters. The default is to sort by ID.
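
The following is a minimal Python sketch that combines a filter with the sortBy and sortOrder parameters on the Users endpoint. DATABRICKS_HOST, DATABRICKS_TOKEN, and the filter value are assumptions you replace with your own values.

# Minimal sketch: list SCIM users whose userName starts with "someone",
# sorted by userName in ascending order.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

response = requests.get(
    f"{host}/api/2.0/preview/scim/v2/Users",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/scim+json",
    },
    params={
        "filter": 'userName sw "someone"',
        "sortBy": "userName",
        "sortOrder": "ascending",
    },
)
response.raise_for_status()
for user in response.json().get("Resources", []):
    print(user.get("userName"))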

List of all SCIM APIs


SCIM API 2.0 (Me)
SCIM API 2.0 (Users)
SCIM API 2.0 (Groups)
SCIM API 2.0 (ServicePrincipals)
For error codes, see SCIM API 2.0 Error Codes.
SCIM API 2.0 (Me)

IMPORTANT
This feature is in Public Preview.

Requirements
Your Azure Databricks account must have the Premium Plan.

Get me

Endpoint: 2.0/preview/scim/v2/Me
HTTP method: GET

Retrieve the same information about yourself as returned by Get user by ID.
Example

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/Me \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


For error codes, see GET Requests.
SCIM API 2.0 (Users)

IMPORTANT
This feature is in Public Preview.

An Azure Databricks administrator can invoke all SCIM API endpoints. Non-admin users can invoke the Get
users endpoint to read user display names and IDs.

NOTE
Each workspace can have a maximum of 10,000 users and 5,000 groups. Service principals count toward the user
maximum.

SCIM (Users) lets you create users in Azure Databricks and give them the proper level of access, temporarily lock
and unlock user accounts, and remove access for users (deprovision them) when they leave your organization
or no longer need access to Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.

Requirements
Your Azure Databricks account must have the Premium Plan.

Get users

Endpoint: 2.0/preview/scim/v2/Users
HTTP method: GET

Admin users: Retrieve a list of all users in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all users in the Azure Databricks workspace, returning username, user
display name, and object ID only.
Examples
This example gets information about all users.

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users \
| jq .

This example uses the eq (equals) filter query parameter with userName to get information about a specific
user.
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Users?filter=userName+eq+<username>" \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .

This example uses a .netrc file and jq.

Get user by ID

Endpoint: 2.0/preview/scim/v2/Users/{id}
HTTP method: GET

Admin users: Retrieve a single user resource from the Azure Databricks workspace, given their Azure Databricks
ID.
Example
Request

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
Response

Create user

Endpoint: 2.0/preview/scim/v2/Users
HTTP method: POST

Admin users: Create a user in the Azure Databricks workspace.


Request parameters follow the standard SCIM 2.0 protocol.
Requests must include the following attributes:
schemas set to urn:ietf:params:scim:schemas:core:2.0:User
userName

Example
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users \
--header 'Content-type: application/scim+json' \
--data @create-user.json \
| jq .

create-user.json :

{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"groups": [
{
"value":"123456"
}
],
"entitlements":[
{
"value":"allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .

This example uses a .netrc file and jq.
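The equivalent request can be sent from Python. This is a minimal sketch with the requests library; the instance name, token, and userName value are placeholders, and the payload mirrors create-user.json above (without the groups element).

# Minimal sketch: create a workspace user via the SCIM Users endpoint.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

payload = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
    "userName": "someone@example.com",
    "entitlements": [{"value": "allow-cluster-create"}],
}

resp = requests.post(
    f"https://{INSTANCE}/api/2.0/preview/scim/v2/Users",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/scim+json",
    },
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json()["id"])   # ID of the newly created user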

Update user by ID (PATCH)


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Users/{id} PATCH

Admin users: Update a user resource with operations on specific attributes, except those that are immutable (
userName and userId ). The PATCH method is recommended over the PUT method for setting or updating user
entitlements.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Example
This example adds the allow-cluster-create entitlement to the specified user.

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
--header 'Content-type: application/scim+json' \
--data @update-user.json \
| jq .

update-user.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.

Update user by ID (PUT)


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Users/{id} PUT

Admin users: Overwrite the user resource across multiple attributes, except those that are immutable ( userName
and userId ).
Request must include the schemas attribute, set to urn:ietf:params:scim:schemas:core:2.0:User .

NOTE
The PATCH method is recommended over the PUT method for setting or updating user entitlements.

Example
This example replaces the specified user’s existing entitlements so that the user has only the allow-cluster-create
entitlement.

curl --netrc -X PUT \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
--header 'Content-type: application/scim+json' \
--data @overwrite-user.json \
| jq .

overwrite-user.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"entitlements": [
{
"value": "allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
<username> with the Azure Databricks workspace username of the user, for example someone@example.com . To
get the username, call Get users.
This example uses a .netrc file and jq.

Delete user by ID
ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Users/{id} DELETE

Admin users: Remove a user resource. A user that does not own or belong to a workspace in Azure Databricks is
automatically purged after 30 days.
Deleting a user from a workspace also removes objects associated with the user. For example, notebooks are
archived, clusters are terminated, and jobs become ownerless.
The user’s home directory is not automatically deleted. Only an administrator can access or remove a deleted
user’s home directory.
The access control list (ACL) configuration of a user is preserved even after that user is removed from a
workspace.
Example request

curl --netrc -X DELETE \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id>

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file.

Activate and deactivate user by ID


IMPORTANT
This feature is in Public Preview.

ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Users/{id} PATCH

Admin users: Activate or deactivate a user. Deactivating a user removes all access to a workspace for that user
but leaves permissions and objects associated with the user unchanged. Clusters associated with the user keep
running, and notebooks remain in their original locations. The user’s tokens are retained but cannot be used to
authenticate while the user is deactivated. Scheduled jobs, however, fail unless assigned to a new owner.
You can use the Get users and Get user by ID requests to view whether users are active or inactive.

NOTE
Allow at least five minutes for the cache to be cleared for deactivation to take effect.

IMPORTANT
An Azure Active Directory (Azure AD) user with the Contributor or Owner role on the Azure Databricks subscription
can reactivate themselves using the Azure AD login flow. If a user with one of these roles needs to be deactivated, you
should also revoke their privileges on the subscription.

Set the active value to false to deactivate a user and true to activate a user.
Example
Request

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
--header 'Content-type: application/scim+json' \
--data @toggle-user-activation.json \
| jq .

toggle-user-activation.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "replace",
"path": "active",
"value": [
{
"value": "false"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 .

This example uses a .netrc file and jq.


Response

{
"emails": [
{
"type": "work",
"value": "someone@example.com",
"primary": true
}
],
"displayName": "Someone User",
"schemas": [
"urn:ietf:params:scim:schemas:core:2.0:User",
"urn:ietf:params:scim:schemas:extension:workspace:2.0:User"
],
"name": {
"familyName": "User",
"givenName": "Someone"
},
"active": false,
"groups": [],
"id": "123456",
"userName": "someone@example.com"
}
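A minimal Python sketch of the same deactivation request follows; the instance name, token, and user ID are placeholders, and the payload mirrors toggle-user-activation.json above. Set the inner value to "true" to reactivate the user.

# Minimal sketch: deactivate a user by setting active to false via a SCIM PatchOp.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
USER_ID = "2345678901234567"

payload = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [
        {"op": "replace", "path": "active", "value": [{"value": "false"}]}
    ],
}

resp = requests.patch(
    f"https://{INSTANCE}/api/2.0/preview/scim/v2/Users/{USER_ID}",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/scim+json",
    },
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json()["active"])   # False after deactivation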

Filter active and inactive users


IMPORTANT
This feature is in Public Preview.

ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Users GET

Admin users: Retrieve a list of active or inactive users.


Example

curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Users?filter=active+eq+false" \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.

Automatically deactivate users


IMPORTANT
This feature is in Public Preview.

Admin users: Deactivate users that have not logged in for a customizable period. Scheduled jobs owned by a
user are also considered activity.

ENDPOINT                                          HTTP METHOD

2.0/preview/workspace-conf PATCH

The request body is a key-value pair where the value is the time limit for how long a user can be inactive before
being automatically deactivated.
Example

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/workspace-conf \
--data @deactivate-users.json \
| jq .

deactivate-users.json :

{
"maxUserInactiveDays": "90"
}

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.

Get the maximum user inactivity period of a workspace


IMPORTANT
This feature is in Public Preview.

Admin users: Retrieve the user inactivity limit defined for a workspace.

ENDPOINT                                          HTTP METHOD

2.0/preview/workspace-conf GET

Example request

curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/workspace-conf?keys=maxUserInactiveDays" \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.
Example response

{
"maxUserInactiveDays": "90"
}
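If you manage this setting from Python, a minimal sketch with the requests library follows; the instance name and token are placeholders, and the 90-day value matches the example above.

# Minimal sketch: set the automatic-deactivation threshold, then read it back.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
URL = f"https://{INSTANCE}/api/2.0/preview/workspace-conf"

# PATCH the workspace configuration key.
requests.patch(URL, headers=HEADERS, data=json.dumps({"maxUserInactiveDays": "90"})).raise_for_status()

# GET it back to confirm.
resp = requests.get(URL, headers=HEADERS, params={"keys": "maxUserInactiveDays"})
resp.raise_for_status()
print(resp.json())   # {"maxUserInactiveDays": "90"}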
SCIM API 2.0 (Groups)
7/21/2022 • 3 minutes to read

IMPORTANT
This feature is in Public Preview.

Requirements
Your Azure Databricks account must have the Premium Plan.

NOTE
An Azure Databricks administrator can invoke all SCIM API endpoints.
Non-admin users can invoke the Get groups endpoint to read group display names and IDs.
You can have no more than 10,000 users and 5,000 groups in a workspace.

SCIM (Groups) lets you create users and groups in Azure Databricks, give them the proper level of access, and
remove access for groups (deprovision them).
For error codes, see SCIM API 2.0 Error Codes.

Get groups
ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Groups GET

Admin users: Retrieve a list of all groups in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all groups in the Azure Databricks workspace, returning group display name
and object ID only.
Examples

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
| jq .

You can use filters to specify subsets of groups. For example, you can apply the sw (starts with) filter parameter
to displayName to retrieve a specific group or set of groups. This example retrieves all groups with a
displayName field that starts with my- .

curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Groups?filter=displayName+sw+my-" \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.

Get group by ID
ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Groups/{id} GET

Admin users: Retrieve a single group resource.


Example request

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.

Create group
ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Groups POST

Admin users: Create a group in Azure Databricks.


Request parameters follow the standard SCIM 2.0 protocol.
Requests must include the following attributes:
schemas set to urn:ietf:params:scim:schemas:core:2.0:Group
displayName

Members list is optional and can include users and other groups. You can also add members to a group using
PATCH .
Example

curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
--header 'Content-type: application/scim+json' \
--data @create-group.json \
| jq .

create-group.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:Group" ],
"displayName": "<group-name>",
"members": [
{
"value":"<user-id>"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-name> with the name of the group in the Azure Databricks workspace, for example my-group .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
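A minimal Python sketch of the same request follows; the instance name, token, group name, and member ID are placeholders, and the payload mirrors create-group.json above.

# Minimal sketch: create a group with one initial member via the SCIM Groups endpoint.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
GROUP_NAME = "my-group"
USER_ID = "2345678901234567"

payload = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:Group"],
    "displayName": GROUP_NAME,
    "members": [{"value": USER_ID}],
}

resp = requests.post(
    f"https://{INSTANCE}/api/2.0/preview/scim/v2/Groups",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/scim+json",
    },
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json()["id"])   # ID of the new group, as returned in the SCIM response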

Update group
ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Groups/{id} PATCH

Admin users: Update a group in Azure Databricks by adding or removing members. You can add and remove
individual members or nested groups within the group.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.

NOTE
Azure Databricks does not support updating group names.

Example

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id> \
--header 'Content-type: application/scim+json' \
--data @update-group.json \
| jq .

Add to group
update-group.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op":"add",
"value": {
"members": [
{
"value":"<user-id>"
}
]
}
}
]
}

Remove from group


update-group.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<user-id>\"]"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
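The two PatchOp payloads above can also be sent from Python. The following is a minimal sketch with the requests library; the instance name, token, group ID, and user ID are placeholders.

# Minimal sketch: add a member to a group, then remove the same member,
# using the add and remove PatchOp payloads shown above.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
GROUP_ID = "2345678901234567"
USER_ID = "3456789012345678"

HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/scim+json",
}
URL = f"https://{INSTANCE}/api/2.0/preview/scim/v2/Groups/{GROUP_ID}"

add_member = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [{"op": "add", "value": {"members": [{"value": USER_ID}]}}],
}
remove_member = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [{"op": "remove", "path": f'members[value eq "{USER_ID}"]'}],
}

requests.patch(URL, headers=HEADERS, data=json.dumps(add_member)).raise_for_status()
requests.patch(URL, headers=HEADERS, data=json.dumps(remove_member)).raise_for_status()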

Delete group
ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/Groups/{id} DELETE

Admin users: Remove a group from Azure Databricks. Users in the group are not removed.
Example

curl --netrc -X DELETE \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id>

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file.
SCIM API 2.0 (ServicePrincipals)
7/21/2022 • 6 minutes to read

IMPORTANT
This feature is in Public Preview.

SCIM (ServicePrincipals) lets you manage Azure Active Directory service principals in Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.
For additional examples, see Service principals for Azure Databricks automation.

Requirements
Your Azure Databricks account must have the Premium Plan.

Get service principals


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals GET

Retrieve a list of all service principals in the Azure Databricks workspace.


When invoked by a non-admin user, only the username, user display name, and object ID are returned.
Examples

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals \
| jq .

You can use filters to specify subsets of service principals. For example, you can apply the eq (equals) filter
parameter to applicationId to retrieve a specific service principal:

curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals?filter=applicationId+eq+<application-id>" \
| jq .

In workspaces with a large number of service principals, you can exclude attributes from the request to improve
performance.

curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals?excludedAttributes=entitlements,groups" \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .

These examples use a .netrc file and jq.

Get service principal by ID


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals/{id} GET

Retrieve a single service principal resource from the Azure Databricks workspace, given a service principal ID.
Example

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.

Add service principal


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals POST

Add an Azure Active Directory (Azure AD) service principal to the Azure Databricks workspace. To use a service
principal in Azure Databricks, you must first create an application in Azure Active Directory and then add it to your
Azure Databricks workspace. Service principals count toward the limit of 10,000 users per workspace.
Request parameters follow the standard SCIM 2.0 protocol.
Example

curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals \
--header 'Content-type: application/scim+json' \
--data @add-service-principal.json \
| jq .

add-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<azure-application-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<azure-application-id> with the application ID of the Azure Active Directory (Azure AD) application, for
example 12345a67-8b9c-0d1e-23fa-4567b89cde01
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
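A minimal Python sketch of the same request follows; the instance name, token, Azure AD application ID, and display name are placeholders, and the payload mirrors add-service-principal.json above (the groups element is omitted here).

# Minimal sketch: register an existing Azure AD application as a workspace service principal.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
AZURE_APPLICATION_ID = "12345a67-8b9c-0d1e-23fa-4567b89cde01"

payload = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
    "applicationId": AZURE_APPLICATION_ID,
    "displayName": "my-automation-sp",
    "entitlements": [{"value": "allow-cluster-create"}],
}

resp = requests.post(
    f"https://{INSTANCE}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/scim+json",
    },
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json()["id"])   # workspace ID of the service principal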

Update service principal by ID (PATCH)


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals/{id} PATCH

Update a service principal resource with operations on specific attributes, except for applicationId and id ,
which are immutable.
Use the PATCH method to add, update, or remove individual attributes. Use the PUT method to overwrite the
entire service principal in a single operation.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Add entitlements
Example

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @change-service-principal.json \
| jq .

change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Remove entitlements
Example

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @change-service-principal.json \
| jq .

change-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Add to a group
Example

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @change-service-principal.json \
| jq .

change-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "groups",
"value": [
{
"value": "<group-id>"
}
]
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove from a group
Example

curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id> \
--header 'Content-type: application/scim+json' \
--data @remove-from-group.json \
| jq .

remove-from-group.json :

{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<service-principal-id>\"]"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.

Update service principal by ID (PUT)


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals/{id} PUT

Overwrite the entire service principal resource, except for applicationId and id , which are immutable.
Use the PATCH method to add, update, or remove individual attributes.

IMPORTANT
You must include the schemas attribute in the request, with the exact value
urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal .

Examples
Add an entitlement

curl --netrc -X PUT \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @update-service-principal.json \
| jq .

update-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<appliation-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove all entitlements and groups
Removing all entitlements and groups is a reversible alternative to deactivating the service principal.
Use the PUT method to avoid the need to check the existing entitlements and group memberships first.

curl --netrc -X PUT \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id> \
--header 'Content-type: application/scim+json' \
--data @update-service-principal.json \
| jq .

update-service-principal.json :

{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<application-id>",
"displayName": "<display-name>",
"groups": [],
"entitlements": []
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .

This example uses a .netrc file and jq.

Deactivate service principal by ID


ENDPOINT                                          HTTP METHOD

2.0/preview/scim/v2/ServicePrincipals/{id} DELETE

Deactivate a service principal resource. This operation isn’t reversible.


Example

curl --netrc -X DELETE \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals/<service-principal-id>
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file.
As a reversible alternative, you can remove all of its entitlements and groups instead of deleting the service
principal.
A service principal that does not own or belong to an Azure Databricks workspace is automatically purged after
30 days.
Secrets API 2.0
7/21/2022 • 12 minutes to read

The Secrets API allows you to manage secrets, secret scopes, and access permissions. To manage secrets, you
must:
1. Create a secret scope.
2. Add your secrets to the scope.
3. If you have the Premium Plan, assign access control to the secret scope.
To learn more about creating and managing secrets, see Secret management and Secret access control. You
access and reference secrets in notebooks and jobs by using Secrets utility (dbutils.secrets).
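For example, after you store a secret with this API, you can read it from a notebook with the Secrets utility. This is a minimal sketch that runs only inside an Azure Databricks notebook, where dbutils is predefined; the scope and key names are placeholders.

# Minimal sketch: read a secret inside a notebook with the Secrets utility.
# The scope and key names are placeholders for values you created with this API.
secret_value = dbutils.secrets.get(scope="my-databricks-scope", key="my-string-key")

# The value is redacted if displayed in notebook output, but you can use it,
# for example, as a password or connection string in your code.
print(len(secret_value))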

IMPORTANT
To access Databricks REST APIs, you must authenticate. To use the Secrets API with Azure Key Vault secrets, you must
authenticate using an Azure Active Directory token.

Create secret scope


ENDPOINT                                          HTTP METHOD

2.0/secrets/scopes/create POST

You can either:


Create an Azure Key Vault-backed scope in which secrets are stored in Azure-managed storage and
encrypted with a cloud-based specific encryption key.
Create a Databricks-backed secret scope in which secrets are stored in Databricks-managed storage and
encrypted with a cloud-based specific encryption key.
Create an Azure Key Vault-backed scope
The scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace. By default, a workspace
is limited to a maximum of 100 secret scopes. To increase this maximum for a workspace, contact your
Databricks representative.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/secrets/scopes/create \
--header "Content-Type: application/json" \
--header "Authorization: Bearer <token>" \
--header "X-Databricks-Azure-SP-Management-Token: <management-token>" \
--data @create-scope.json
create-scope.json :

{
"scope": "my-simple-azure-keyvault-scope",
"scope_backend_type": "AZURE_KEYVAULT",
"backend_azure_keyvault":
{
"resource_id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/azure-
rg/providers/Microsoft.KeyVault/vaults/my-azure-kv",
"dns_name": "https://my-azure-kv.vault.azure.net/"
},
"initial_manage_principal": "users"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<token> with your Azure Databricks personal access token. For more information, see Authentication using
Azure Databricks personal access tokens.
<management-token> with your Azure Active Directory token. For more information, see Get Azure AD tokens
by using the Microsoft Authentication Library.
The contents of create-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
If initial_manage_principal is specified, the initial ACL applied to the scope is applied to the supplied principal
(user, service principal, or group) with MANAGE permissions. The only supported principal for this option is the
group users , which contains all users in the workspace. If initial_manage_principal is not specified, the initial
ACL with MANAGE permission applied to the scope is assigned to the API request issuer’s user identity.
Throws RESOURCE_ALREADY_EXISTS if a scope with the given name already exists. Throws RESOURCE_LIMIT_EXCEEDED
if maximum number of scopes in the workspace is exceeded. Throws INVALID_PARAMETER_VALUE if the scope
name is invalid.
For more information, see Create an Azure Key Vault-backed secret scope using the Databricks CLI.
Create a Databricks-backed secret scope
The scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace. A workspace is limited
to a maximum of 100 secret scopes.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/secrets/scopes/create \
--data @create-scope.json

create-scope.json :
{
"scope": "my-simple-databricks-scope",
"initial_manage_principal": "users"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of create-scope.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Throws RESOURCE_ALREADY_EXISTS if a scope with the given name already exists. Throws RESOURCE_LIMIT_EXCEEDED
if maximum number of scopes in the workspace is exceeded. Throws INVALID_PARAMETER_VALUE if the scope
name is invalid.
Request structure
FIELD NAME                  TYPE      DESCRIPTION

scope                       STRING    Scope name requested by the user. Scope names are unique. This field is required.
initial_manage_principal    STRING    This field is optional. If not specified, only the API request issuer’s identity is granted MANAGE permissions on the new scope. If the string users is specified, all users in the workspace are granted MANAGE permissions.

Delete secret scope


ENDPOINT                                          HTTP METHOD

2.0/secrets/scopes/delete POST

Delete a secret scope.


Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/secrets/scopes/delete \
--data @delete-scope.json

delete-scope.json :

{
"scope": "my-secret-scope"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
Throws RESOURCE_DOES_NOT_EXIST if the scope does not exist. Throws PERMISSION_DENIED if the user does not
have permission to make this API call.
Request structure
FIELD NAME    TYPE      DESCRIPTION

scope         STRING    Name of the scope to delete. This field is required.

List secret scopes


ENDPOINT                                          HTTP METHOD

2.0/secrets/scopes/list GET

List all secret scopes available in the workspace.


Example
Request

curl --netrc --request GET \
https://<databricks-instance>/api/2.0/secrets/scopes/list \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


Response

{
"scopes": [
{
"name": "my-databricks-scope",
"backend_type": "DATABRICKS"
},
{
"name": "mount-points",
"backend_type": "DATABRICKS"
}
]
}

Throws PERMISSION_DENIED if you do not have permission to make this API call.
Response structure
FIELD NAME    TYPE                      DESCRIPTION

scopes        An array of SecretScope   The available secret scopes.


Put secret
The method for creating or modifying a secret depends on the type of scope backend. To create or modify a
secret in a scope backed by Azure Key Vault, use the Azure SetSecret REST API. To create or modify a secret from
a Databricks-backed scope, use the following endpoint:

ENDPOINT                                          HTTP METHOD

2.0/secrets/put POST

Insert a secret under the provided scope with the given name. If a secret already exists with the same name, this
command overwrites the existing secret’s value. The server encrypts the secret using the secret scope’s
encryption settings before storing it. You must have WRITE or MANAGE permission on the secret scope.
The secret key must consist of alphanumeric characters, dashes, underscores, and periods, and cannot exceed
128 characters. The maximum allowed secret value size is 128 KB. The maximum number of secrets in a given
scope is 1000.
You can read a secret value only from within a command on a cluster (for example, through a notebook); there is
no API to read a secret value outside of a cluster. The permission applied is based on who is invoking the
command and you must have at least READ permission.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/secrets/put \
--data @put-secret.json

put-secret.json :

{
"scope": "my-databricks-scope",
"key": "my-string-key",
"string_value": "my-value"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret.json with fields that are appropriate for your solution.

This example uses a .netrc file.


The input fields “string_value” or “bytes_value” specify the type of the secret, which will determine the value
returned when the secret value is requested. Exactly one must be specified.
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws RESOURCE_LIMIT_EXCEEDED if maximum
number of secrets in scope is exceeded. Throws INVALID_PARAMETER_VALUE if the key name or value length is
invalid. Throws PERMISSION_DENIED if the user does not have permission to make this API call.
Request structure
FIELD NAME                     TYPE               DESCRIPTION

string_value OR bytes_value    STRING OR BYTES    If string_value is specified, the value is stored in UTF-8 (MB4) form. If bytes_value is specified, the value is stored as bytes.
scope                          STRING             The name of the scope that the secret is associated with. This field is required.
key                            STRING             A unique name to identify the secret. This field is required.
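To script the same workflow from Python, a minimal sketch with the requests library follows; the instance name and token are placeholders, the scope and secret values mirror the examples above, and the scope-creation call throws RESOURCE_ALREADY_EXISTS if the scope already exists.

# Minimal sketch: create a Databricks-backed scope, then store one secret in it.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
BASE = f"https://{INSTANCE}/api/2.0/secrets"

# Create the scope (fails if it already exists).
requests.post(
    f"{BASE}/scopes/create",
    headers=HEADERS,
    data=json.dumps({"scope": "my-databricks-scope", "initial_manage_principal": "users"}),
).raise_for_status()

# Insert or overwrite a secret in the scope.
requests.post(
    f"{BASE}/put",
    headers=HEADERS,
    data=json.dumps({
        "scope": "my-databricks-scope",
        "key": "my-string-key",
        "string_value": "my-value",
    }),
).raise_for_status()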

Delete secret
The method for deleting a secret depends on the type of scope backend. To delete a secret from a scope backed
by Azure Key Vault, use the Azure SetSecret REST API. To delete a secret from a Databricks-backed scope, use the
following endpoint:

ENDPOINT                                          HTTP METHOD

2.0/secrets/delete POST

Delete the secret stored in this secret scope. You must have WRITE or MANAGE permission on the secret scope.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/secrets/delete \
--data @delete-secret.json

delete-secret.json :

{
"scope": "my-secret-scope",
"key": "my-secret-key"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Throws RESOURCE_DOES_NOT_EXIST if no such secret scope or secret exists. Throws PERMISSION_DENIED if you do
not have permission to make this API call.
Request structure
FIELD NAME    TYPE      DESCRIPTION

scope         STRING    The name of the scope that contains the secret to delete. This field is required.
key           STRING    Name of the secret to delete. This field is required.

List secrets
ENDPOINT                                          HTTP METHOD

2.0/secrets/list GET

List the secret keys that are stored at this scope. This is a metadata-only operation; you cannot retrieve secret
data using this API. You must have READ permission to make this call.
Example
Request

curl --netrc --request GET \
'https://<databricks-instance>/api/2.0/secrets/list?scope=<scope-name>' \
| jq .

Or:

curl --netrc --get \
https://<databricks-instance>/api/2.0/secrets/list \
--data scope=<scope-name> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .

This example uses a .netrc file and jq.


Response

{
"secrets": [
{
"key": "my-string-key",
"last_updated_timestamp": 1520467595000
},
{
"key": "my-byte-key",
"last_updated_timestamp": 1520467595000
}
]
}

The last_updated_timestamp returned is in milliseconds since epoch.


Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE      DESCRIPTION

scope         STRING    The name of the scope whose secrets you want to list. This field is required.

Response structure
FIELD NAME    TYPE                         DESCRIPTION

secrets       An array of SecretMetadata   Metadata information of all secrets contained within the given scope.

Put secret ACL


ENDPOINT                                          HTTP METHOD

2.0/secrets/acls/put POST

Create or overwrite the ACL associated with the given principal (user, service principal, or group) on the
specified scope point. In general, a user, service principal, or group will use the most powerful permission
available to them, and permissions are ordered as follows:
MANAGE - Allowed to change ACLs, and read and write to this secret scope.
WRITE - Allowed to read and write to this secret scope.
READ - Allowed to read this secret scope and list what secrets are available.

You must have the MANAGE permission to invoke this API.


Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/secrets/acls/put \
--data @put-secret-acl.json

put-secret-acl.json :

{
"scope": "my-secret-scope",
"principal": "data-scientists",
"permission": "READ"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret-acl.json with fields that are appropriate for your solution.

This example uses a .netrc file.


The principal field specifies an existing Azure Databricks principal to be granted or revoked access using the
unique identifier of that principal. A user is specified with their email, a service principal with its applicationId
value, and a group with its group name.
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws RESOURCE_ALREADY_EXISTS if a permission
for the principal already exists. Throws INVALID_PARAMETER_VALUE if the permission is invalid. Throws
PERMISSION_DENIED if you do not have permission to make this API call.

Request structure
FIELD NAME    TYPE             DESCRIPTION

scope         STRING           The name of the scope to apply permissions to. This field is required.
principal     STRING           The principal to which the permission is applied. This field is required.
permission    AclPermission    The permission level applied to the principal. This field is required.

Delete secret ACL


ENDPOINT                                          HTTP METHOD

2.0/secrets/acls/delete POST

Delete the given ACL on the given scope.


You must have the MANAGE permission to invoke this API.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/secrets/acls/delete \
--data @delete-secret-acl.json

delete-secret-acl.json :

{
"scope": "my-secret-scope",
"principal": "data-scientists"
}

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret-acl.json with fields that are appropriate for your solution.

This example uses a .netrc file.


Throws RESOURCE_DOES_NOT_EXIST if no such secret scope, principal, or ACL exists. Throws PERMISSION_DENIED if
you do not have permission to make this API call.
Request structure
FIELD NAME    TYPE      DESCRIPTION

scope         STRING    The name of the scope to remove permissions from. This field is required.
principal     STRING    The principal to remove an existing ACL from. This field is required.

Get secret ACL


ENDPOINT                                          HTTP METHOD

2.0/secrets/acls/get GET

Describe the details about the given ACL, such as the group and permission.
You must have the MANAGE permission to invoke this API.
Example
Request

curl --netrc --request GET \
'https://<databricks-instance>/api/2.0/secrets/acls/get?scope=<scope-name>&principal=<principal-name>' \
| jq .

Or:

curl --netrc --get \
https://<databricks-instance>/api/2.0/secrets/acls/get \
--data 'scope=<scope-name>&principal=<principal-name>' \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
<principal-name> with the name of the principal, for example users .

This example uses a .netrc file and jq.


Response

{
"principal": "data-scientists",
"permission": "READ"
}

Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE      DESCRIPTION

scope         STRING    The name of the scope to fetch ACL information from. This field is required.
principal     STRING    The principal to fetch ACL information for. This field is required.

Response structure
FIELD NAME    TYPE             DESCRIPTION

principal     STRING           The principal to which the permission is applied. This field is required.
permission    AclPermission    The permission level applied to the principal. This field is required.

List secret ACLs


ENDPOINT                                          HTTP METHOD

2.0/secrets/acls/list GET

List the ACLs set on the given scope.


You must have the MANAGE permission to invoke this API.
Example
Request

curl --netrc --request GET \
'https://<databricks-instance>/api/2.0/secrets/acls/list?scope=<scope-name>' \
| jq .

Or:

curl --netrc --get \
https://<databricks-instance>/api/2.0/secrets/acls/list \
--data scope=<scope-name> \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .

This example uses a .netrc file and jq.


Response
{
"items": [
{
"principal": "admins",
"permission": "MANAGE"
},
{
"principal": "data-scientists",
"permission": "READ"
}
]
}

Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE      DESCRIPTION

scope         STRING    The name of the scope to fetch ACL information from. This field is required.

Response structure
FIELD NAME    TYPE                   DESCRIPTION

items         An array of AclItem    The ACL rules applied to principals in the given scope.

Data structures
In this section:
AclItem
SecretMetadata
SecretScope
AclPermission
ScopeBackendType
AclItem
An item representing an ACL rule applied to the given principal (user, service principal, or group) on the
associated scope point.

FIELD NAME    TYPE             DESCRIPTION

principal     STRING           The principal to which the permission is applied. This field is required.
permission    AclPermission    The permission level applied to the principal. This field is required.

SecretMetadata
The metadata about a secret. Returned when listing secrets. Does not contain the actual secret value.
FIELD NAME                TYPE      DESCRIPTION

key                       STRING    A unique name to identify the secret.
last_updated_timestamp    INT64     The last updated timestamp (in milliseconds) for the secret.

SecretScope
An organizational resource for storing secrets. Secret scopes can be different types, and ACLs can be applied to
control permissions for all secrets within a scope.

FIELD NAME      TYPE                DESCRIPTION

name            STRING              A unique name to identify the secret scope.
backend_type    ScopeBackendType    The type of secret scope backend.

AclPermission
The ACL permission levels for secret ACLs applied to secret scopes.

PERMISSION    DESCRIPTION

READ          Allowed to perform read operations (get, list) on secrets in this scope.
WRITE         Allowed to read and write secrets to this secret scope.
MANAGE        Allowed to read/write ACLs, and read/write secrets to this secret scope.

ScopeBackendType
The type of secret scope backend.

TYPE              DESCRIPTION

AZURE_KEYVAULT    A secret scope in which secrets are stored in an Azure Key Vault.
DATABRICKS        A secret scope in which secrets are stored in Databricks-managed storage and encrypted with a cloud-based specific encryption key.
Token API 2.0
7/21/2022 • 2 minutes to read

The Token API allows you to create, list, and revoke tokens that can be used to authenticate and access Azure
Databricks REST APIs.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Create
ENDPOINT                                          HTTP METHOD

2.0/token/create POST

Create and return a token. This call returns the error QUOTA_EXCEEDED if the current number of non-expired
tokens exceeds the token quota. The token quota for a user is 600.
Example
Request

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/token/create \
--data '{ "comment": "This is an example token", "lifetime_seconds": 7776000 }' \
| jq .

Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This is an example token with a description to attach to the token.
7776000 with the lifetime of the token, in seconds. This example specifies 90 days.

This example uses a .netrc file and jq.


Response

{
"token_value": "dapi1a2b3c45d67890e1f234567a8bc9012d",
"token_info": {
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
}
}

Request structure
FIELD NAME          TYPE      DESCRIPTION

lifetime_seconds    LONG      The lifetime of the token, in seconds. If no lifetime is specified, the token remains valid indefinitely.
comment             STRING    Optional description to attach to the token.

Response structure
FIELD NAME     TYPE                 DESCRIPTION

token_value    STRING               The value of the newly-created token.
token_info     Public token info    The public metadata of the newly-created token.
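A minimal Python sketch of the same request follows; the instance name and the existing personal access token used for authentication are placeholders, and the 90-day lifetime matches the example above.

# Minimal sketch: create a new 90-day personal access token.
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<existing-personal-access-token>"

resp = requests.post(
    f"https://{INSTANCE}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=json.dumps({"comment": "This is an example token", "lifetime_seconds": 7776000}),
)
resp.raise_for_status()
new_token = resp.json()["token_value"]   # store securely; the value is not shown again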

List
ENDPOINT                                          HTTP METHOD

2.0/token/list GET

List all the valid tokens for a user-workspace pair.


Example
Request

curl --netrc --request GET \
https://<databricks-instance>/api/2.0/token/list \
| jq .

Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

This example uses a .netrc file and jq.


Response

{
"token_infos": [
{
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
},
{
"token_id": "2345678901a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c4",
"creation_time": 1626286906596,
"expiry_time": 1634062906596,
"comment": "This is another example token"
}
]
}
Response structure
FIELD NAME     TYPE                             DESCRIPTION

token_infos    An array of Public token info    A list of token information for a user-workspace pair.

Revoke
ENDPOINT                                          HTTP METHOD

2.0/token/delete POST

Revoke an access token. This call returns the error RESOURCE_DOES_NOT_EXIST if a token with the specified ID is not
valid.
Example

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/token/delete \
--data '{ "token_id": "<token-id>" }'

This example uses a .netrc file.


Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<token-id> with the ID of the token, for example
1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3 .

Request structure
FIELD NAME    TYPE      DESCRIPTION

token_id      STRING    The ID of the token to be revoked.

Data structures
In this section:
Public token info
Public token info
A data structure that describes the public metadata of an access token.

FIELD NAME       TYPE      DESCRIPTION

token_id         STRING    The ID of the token.
creation_time    LONG      Server time (in epoch milliseconds) when the token was created.
expiry_time      LONG      Server time (in epoch milliseconds) when the token will expire, or -1 if not applicable.
comment          STRING    Comment the token was created with, if applicable.
Token Management API 2.0
7/21/2022 • 2 minutes to read

The Token Management API lets Azure Databricks administrators manage their users’ Azure Databricks personal
access tokens. As an admin, you can:
Monitor and revoke users’ personal access tokens.
Control the lifetime of future tokens in your workspace.
You can also control which users can create and use tokens via the Permissions API 2.0 or in the Admin Console.
The Token Management API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.

IMPORTANT
To access Databricks REST APIs, you must authenticate.
Workspace API 2.0
7/21/2022 • 6 minutes to read

The Workspace API allows you to list, import, export, and delete notebooks and folders. The maximum allowed
size of a request to the Workspace API is 10MB. See Cluster log delivery examples for a how-to guide on this
API.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

Delete
ENDPOINT                                          HTTP METHOD

2.0/workspace/delete POST

Delete an object or a directory (and, optionally, recursively delete all objects in the directory). If path does not
exist, this call returns an error RESOURCE_DOES_NOT_EXIST . If path is a non-empty directory and recursive is set
to false , this call returns an error DIRECTORY_NOT_EMPTY . Object deletion cannot be undone, and deleting a
directory recursively is not atomic.
Example
Request:

curl --netrc --request POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/delete \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder", "recursive": true }'

If successful, this endpoint returns no response.


Request structure
FIELD NAME    TYPE      DESCRIPTION

path          STRING    The absolute path of the notebook or directory. This field is required.
recursive     BOOL      The flag that specifies whether to delete the object recursively. It is false by default. Note that recursively deleting a directory is not atomic; if it fails partway through, some objects under the directory may already be deleted, and the deletion cannot be undone.

Export
ENDPOINT                                          HTTP METHOD

2.0/workspace/export GET

Export a notebook or contents of an entire directory. You can also export a Databricks Repo, or a notebook or
directory from a Databricks Repo. You cannot export non-notebook files from a Databricks Repo. If path does
not exist, this call returns an error RESOURCE_DOES_NOT_EXIST . You can export a directory only in DBC format. If
the exported data exceeds the size limit, this call returns an error MAX_NOTEBOOK_SIZE_EXCEEDED . This API does not
support exporting a library.
Example
Request:

curl --netrc --request GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/export \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder/MyNotebook", "format": "SOURCE", "direct_download": true }'

curl --netrc --request GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/export \
--header 'Accept: application/json' \
--data '{ "path": "/Repos/me@example.com/MyFolder/MyNotebook", "format": "SOURCE", "direct_download": true }'

Response:
If the direct_download field was set to false or was omitted from the request, a base64-encoded version of the
content is returned, for example:

{
"content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKMSsx",
}

Otherwise, if direct_download was set to true in the request, the content is downloaded.
Request structure
FIELD NAME         TYPE            DESCRIPTION

path               STRING          The absolute path of the notebook or directory. Exporting a directory is supported only for DBC. This field is required.
format             ExportFormat    This specifies the format of the exported file. By default, this is SOURCE. The value is case sensitive.
direct_download    BOOL            Flag to enable direct download. If it is true, the response is the exported file itself. Otherwise, the response contains content as a base64-encoded string. See Export a notebook or folder for more information about how to use it.

Response structure
FIELD NAME    TYPE     DESCRIPTION

content       BYTES    The base64-encoded content. If the limit (10MB) is exceeded, an exception with error code MAX_NOTEBOOK_SIZE_EXCEEDED is thrown.
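To export and read a notebook from Python, a minimal sketch with the requests library follows; the instance name, token, and notebook path are placeholders, and the base64-encoded content field is decoded locally.

# Minimal sketch: export a notebook in SOURCE format and decode its content.
import base64
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

resp = requests.get(
    f"https://{INSTANCE}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": "/Users/me@example.com/MyFolder/MyNotebook", "format": "SOURCE"},
)
resp.raise_for_status()
source = base64.b64decode(resp.json()["content"]).decode("utf-8")
print(source)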

Get status
ENDPOINT                                          HTTP METHOD

2.0/workspace/get-status GET

Gets the status of an object or a directory. If path does not exist, this call returns an error
RESOURCE_DOES_NOT_EXIST .

Example
Request:

curl --netrc --request GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/get-status \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder/MyNotebook" }'

Response:

{
"object_type": "NOTEBOOK",
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"language": "PYTHON",
"object_id": 123456789012345
}

Request structure
FIELD NAME    TYPE      DESCRIPTION

path          STRING    The absolute path of the notebook or directory. This field is required.

Response structure
FIELD NAME     TYPE          DESCRIPTION

object_type    ObjectType    The type of the object.
object_id      INT64         Unique identifier for the object.
path           STRING        The absolute path of the object.
language       Language      The language of the object. This value is set only if the object type is NOTEBOOK.

Import
ENDPOINT                                          HTTP METHOD

2.0/workspace/import POST

Import a notebook or the contents of an entire directory. If path already exists and overwrite is set to false ,
this call returns an error RESOURCE_ALREADY_EXISTS . You can use only DBC format to import a directory.
Example
Import a base64-encoded string:

curl --netrc --request POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/import \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com/MyFolder/MyNotebook", "content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKMSsx", "language": "PYTHON", "overwrite": true, "format": "SOURCE" }'

Import a local file:

curl --netrc --request POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/import \
--header 'Content-Type: multipart/form-data' \
--form path=/Users/me@example.com/MyFolder/MyNotebook \
--form content=@myCode.py.zip

If successful, this endpoint returns no response.


Request structure
FIELD NAME    TYPE            DESCRIPTION

path          STRING          The absolute path of the notebook or directory. Importing a directory is supported only for the DBC format. This field is required.
format        ExportFormat    This specifies the format of the file to be imported. By default, this is SOURCE. The value is case sensitive.
language      Language        The language. If format is set to SOURCE, this field is required; otherwise, it is ignored.
content       BYTES           The base64-encoded content. This has a limit of 10 MB. If the limit (10MB) is exceeded, an exception with error code MAX_NOTEBOOK_SIZE_EXCEEDED is thrown. This parameter might be absent, and instead a posted file is used. See Import a notebook or directory for more information about how to use it.
overwrite     BOOL            The flag that specifies whether to overwrite an existing object. It is false by default. For the DBC format, overwrite is not supported because a DBC archive may contain a directory.
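To import a local source file from Python, a minimal sketch with the requests library follows; the instance name, token, local file name, and target path are placeholders, and the file contents are base64-encoded before being sent.

# Minimal sketch: base64-encode a local Python file and import it as a notebook.
import base64
import json
import requests

INSTANCE = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

with open("my_code.py", "rb") as f:   # placeholder local file
    content = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"https://{INSTANCE}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=json.dumps({
        "path": "/Users/me@example.com/MyFolder/MyNotebook",
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    }),
)
resp.raise_for_status()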

List
ENDPOINT                                          HTTP METHOD

2.0/workspace/list GET

List the contents of a directory, or the object if it is not a directory. If the input path does not exist, this call
returns an error RESOURCE_DOES_NOT_EXIST .
Example
List directories and their contents:
Request:

curl --netrc --request GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/list \
--header 'Accept: application/json' \
--data '{ "path": "/Users/me@example.com" }'

Response:
{
"objects": [
{
"path": "/Users/me@example.com/MyFolder",
"object_type": "DIRECTORY",
"object_id": 234567890123456
},
{
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"object_type": "NOTEBOOK",
"language": "PYTHON",
"object_id": 123456789012345
},
{
"..."
}
]
}

List repos:

curl --netrc --request GET \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/list \
  --header 'Accept: application/json' \
  --data '{ "path": "/Repos/me@example.com" }'

Response:

{
"objects": [
{
"path": "/Repos/me@example.com/MyRepo1",
"object_type": "REPO",
"object_id": 234567890123456
},
{
"path": "/Repos/me@example.com/MyRepo2",
"object_type": "REPO",
"object_id": 123456789012345
},
{
"..."
}
]
}

Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The absolute path of the notebook or directory. This field is required.

Response structure
FIELD NAME | TYPE | DESCRIPTION
objects | An array of ObjectInfo | List of objects.
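
As a sketch only (not one of the official examples), the list endpoint can be called recursively to walk a workspace tree; the host and token values are placeholders, and the JSON body mirrors the curl --data usage shown above:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

def walk(path):
    # List the objects at this path, then recurse into any directories.
    resp = requests.get(
        f"{host}/api/2.0/workspace/list",
        auth=("token", token),
        json={"path": path},
    )
    resp.raise_for_status()
    for obj in resp.json().get("objects", []):
        print(obj["object_type"], obj["path"])
        if obj["object_type"] == "DIRECTORY":
            walk(obj["path"])

walk("/Users/me@example.com")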


Mkdirs
ENDPOINT | HTTP METHOD
2.0/workspace/mkdirs | POST

Create the given directory and necessary parent directories if they do not exist. If an object (not a directory) exists at any prefix of the input path, this call returns an error RESOURCE_ALREADY_EXISTS. If this operation fails, it may have succeeded in creating some of the necessary parent directories.
Example
Request:

curl --netrc --request POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/mkdirs \
  --header 'Accept: application/json' \
  --data '{ "path": "/Users/me@example.com/MyFolder" }'

If successful, this endpoint returns no response.


Request structure
FIELD NAME | TYPE | DESCRIPTION
path | STRING | The absolute path of the directory. If the parent directories do not exist, they will also be created. If the directory already exists, this command will do nothing and succeed. This field is required.

Data structures
In this section:
ObjectInfo
ExportFormat
Language
ObjectType
ObjectInfo
Information about an object in the workspace, returned by list and get-status.

FIELD NAME | TYPE | DESCRIPTION
object_type | ObjectType | The type of the object.
object_id | INT64 | Unique identifier for the object.
path | STRING | The absolute path of the object.
language | Language | The language of the object. This value is set only if the object type is NOTEBOOK.

ExportFormat
The format for notebook import and export.

FORMAT | DESCRIPTION
SOURCE | The notebook will be imported/exported as source code.
HTML | The notebook will be imported/exported as an HTML file.
JUPYTER | The notebook will be imported/exported as a Jupyter/IPython Notebook file.
DBC | The notebook will be imported/exported as Databricks archive format.

Language
The language of the notebook.

LANGUAGE | DESCRIPTION
SCALA | Scala notebook.
PYTHON | Python notebook.
SQL | SQL notebook.
R | R notebook.

ObjectType
The type of the object in workspace.

TYPE | DESCRIPTION
NOTEBOOK | Notebook
DIRECTORY | Directory
LIBRARY | Library
REPO | Repository

REST API 1.2
7/21/2022 • 9 minutes to read

The Databricks REST API allows you to programmatically access Azure Databricks instead of going through the
web UI.
This article covers REST API 1.2. The REST API latest version, as well as REST API 2.1 and 2.0, are also available.

IMPORTANT
Use the Clusters API 2.0 for managing clusters programmatically and the Libraries API 2.0 for managing libraries
programmatically.
The 1.2 Create an execution context and Run a command APIs continue to be supported.

IMPORTANT
To access Databricks REST APIs, you must authenticate.

REST API use cases


Start Apache Spark jobs triggered from your existing production systems or from workflow systems.
Programmatically bring up a cluster of a certain size at a fixed time of day and then shut it down at night.

API categories
Execution context: create unique variable namespaces where Spark commands can be called.
Command execution: run commands within a specific execution context.

Details
This REST API runs over HTTPS.
For retrieving information, use HTTP GET.
For modifying state, use HTTP POST.
For file upload, use multipart/form-data . Otherwise use application/json .
The response content type is JSON.
Basic authentication is used to authenticate the user for every API call.
User credentials are base64 encoded and are in the HTTP header for every API call. For example,
Authorization: Basic YWRtaW46YWRtaW4= . If you use curl , alternatively you can store user credentials in a
.netrc file.
For more information about using the Databricks REST API, see the Databricks REST API reference.

Get started
To try out the examples in this article, replace <databricks-instance> with the workspace URL of your Azure
Databricks deployment.
The following examples use curl and a .netrc file. You can adapt these curl examples with an HTTP library in
your programming language of choice.
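
For example, a minimal Python sketch (an illustration only, reading placeholder DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a .netrc file) that calls the first endpoint below might look like this:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<databricks-instance>
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token
auth = ("token", token)                 # the same credentials curl reads from .netrc

# Equivalent to: curl --netrc https://<databricks-instance>/api/1.2/clusters/list
resp = requests.get(f"{host}/api/1.2/clusters/list", auth=auth)
resp.raise_for_status()
for cluster in resp.json():
    print(cluster["id"], cluster["name"], cluster["status"])
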
API reference
Get the list of clusters
Get information about a cluster
Restart a cluster
Create an execution context
Get information about an execution context
Delete an execution context
Run a command
Get information about a command
Cancel a command
Get the list of libraries for a cluster
Upload a library to a cluster
Get the list of clusters
Method and path:
GET /api/1.2/clusters/list

Example
Request:

curl --netrc --request GET \
  https://<databricks-instance>/api/1.2/clusters/list

Response:

[
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers":0
},
{
"..."
}
]

Request schema
None.
Response schema
An array of objects, with each object representing information about a cluster as follows:

FIELD | TYPE | DESCRIPTION
id | string | The ID of the cluster.
name | string | The name of the cluster.
status | string | The status of the cluster. One of: Error, Pending, Reconfiguring, Restarting, Running, Terminated, Terminating, Unknown.
driverIp | string | The IP address of the driver.
jdbcPort | number | The JDBC port number.
numWorkers | number | The number of workers for the cluster.

Get information about a cluster


Method and path:
GET /api/1.2/clusters/status

Example
Request:

curl --netrc --get \
  https://<databricks-instance>/api/1.2/clusters/status \
  --data clusterId=1234-567890-span123

Response:
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers": 0
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster.

Response schema
An object that represents information about the cluster.

FIELD | TYPE | DESCRIPTION
id | string | The ID of the cluster.
name | string | The name of the cluster.
status | string | The status of the cluster. One of: Error, Pending, Reconfiguring, Restarting, Running, Terminated, Terminating, Unknown.
driverIp | string | The IP address of the driver.
jdbcPort | number | The JDBC port number.
numWorkers | number | The number of workers for the cluster.

Restart a cluster
Method and path:
POST /api/1.2/clusters/restart

Example
Request:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/clusters/restart \
  --data clusterId=1234-567890-span123

Response:

{
"id": "1234-567890-span123"
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster to restart.

Response schema

FIELD | TYPE | DESCRIPTION
id | string | The ID of the cluster.

Create an execution context


Method and path:
POST /api/1.2/contexts/create

Example
Request:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/contexts/create \
  --data clusterId=1234-567890-span123 \
  --data language=sql

Response:

{
"id": "1234567890123456789"
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster to create the context for.
language | string | The language for the context. One of: python, scala, sql.

Response schema

FIELD | TYPE | DESCRIPTION
id | string | The ID of the execution context.

Get information about an execution context


Method and path:
GET /api/1.2/contexts/status

Example
Request:

curl --netrc 'https://<databricks-instance>/api/1.2/contexts/status?clusterId=1234-567890-span123&contextId=1234567890123456789'

Response:
{
"id": "1234567890123456789",
"status": "Running"
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster to get execution context information about.
contextId | string | The ID of the execution context.

Response schema

FIELD | TYPE | DESCRIPTION
id | string | The ID of the execution context.
status | string | The status of the execution context. One of: Error, Pending, Running.

Delete an execution context


Method and path:
POST /api/1.2/contexts/destroy

Example
Request:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/contexts/destroy \
  --data clusterId=1234-567890-span123 \
  --data contextId=1234567890123456789

Response:
{
"id": "1234567890123456789"
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster to destroy the execution context for.
contextId | string | The ID of the execution context to destroy.

Response schema

FIELD | TYPE | DESCRIPTION
id | string | The ID of the execution context.

Run a command
Method and path:
POST /api/1.2/commands/execute

Example
Request:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/commands/execute \
  --header 'Content-Type: application/json' \
  --data @execute-command.json

execute-command.json :

{
"clusterId": "1234-567890-span123",
"contextId": "1234567890123456789",
"language": "python",
"command": "print('Hello, World!')"
}

Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster to run the command on.
contextId | string | The ID of the execution context to run the command within.
language | string | The language of the command.
command | string | The command string to run. Specify either command or commandFile.
commandFile | string | The path to a file containing the command to run. Specify either commandFile or command.
options | string | An optional map of values used downstream. For example, a displayRowLimit override (used in testing).

Response schema

FIELD | TYPE | DESCRIPTION
id | string | The ID of the command.


Get information about a command
Method and path:
GET /api/1.2/commands/status

Example
Request:

curl --netrc --get \
  https://<databricks-instance>/api/1.2/commands/status \
  --data clusterId=1234-567890-span123 \
  --data contextId=1234567890123456789 \
  --data commandId=1234ab56-7890-1cde-234f-5abcdef67890

Response:

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"status": "Finished",
"results": {
"resultType": "text",
"data": "Hello, World!"
}
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster to get the command information about.
contextId | string | The ID of the execution context that is associated with the command.
commandId | string | The ID of the command to get information about.

Response schema

FIELD | TYPE | DESCRIPTION
id | string | The ID of the command.
status | string | The status of the command. One of: Cancelled, Cancelling, Error, Finished, Queued, Running.
results | object | The results of the command. The results object contains a resultType field (string) that is one of error, image, images, table, or text, plus the type-specific fields listed below.

For error:
* cause : The cause of the error. Type: string

For image:
* fileName : The image filename. Type: string

For images:
* fileNames : The images' filenames. Type: array of string

For table:
* data : The table data. Type: array of array of any
* schema : The table schema. Type: array of array of (string, any)
* truncated : true if partial results are returned. Type: true / false
* isJsonSchema : true if a JSON schema is returned instead of a string representation of the Hive type. Type: true / false

For text:
* data : The text. Type: string
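
For illustration only, a short Python sketch that branches on resultType after parsing this response (the response variable is assumed to hold the JSON returned by the status call):

def summarize_results(response):
    # Reduce the results object to a one-line summary, by result type.
    results = response.get("results", {})
    result_type = results.get("resultType")
    if result_type == "error":
        return "error: " + results.get("cause", "")
    if result_type == "text":
        return results["data"]
    if result_type == "table":
        return f"table with {len(results['data'])} rows, truncated={results.get('truncated')}"
    if result_type == "image":
        return "image: " + results["fileName"]
    if result_type == "images":
        return "images: " + ", ".join(results["fileNames"])
    return f"unhandled result type: {result_type}"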

Cancel a command
Method and path:
POST /api/1.2/commands/cancel
Example
Request:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/commands/cancel \
  --data clusterId=1234-567890-span123 \
  --data contextId=1234567890123456789 \
  --data commandId=1234ab56-7890-1cde-234f-5abcdef67890

Response:

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster that is associated with the command to cancel.
contextId | string | The ID of the execution context that is associated with the command to cancel.
commandId | string | The ID of the command to cancel.

Response schema

FIELD | TYPE | DESCRIPTION
id | string | The ID of the command.

Get the list of libraries for a cluster

IMPORTANT
This operation is deprecated. Use the Cluster status operation in the Libraries API instead.

Method and path:


GET /api/1.2/libraries/list

Example
Request:

curl --netrc --get \
  https://<databricks-instance>/api/1.2/libraries/list \
  --data clusterId=1234-567890-span123

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster.

Response schema
An array of objects, with each object representing information about a library as follows:

FIELD | TYPE | DESCRIPTION
name | string | The name of the library.
status | string | The status of the library. One of: LibraryError, LibraryLoaded, LibraryPending.

Upload a library to a cluster

IMPORTANT
This operation is deprecated. Use the Install operation in the Libraries API instead.

Method and path:


POST /api/1.2/libraries/upload

Request schema

FIELD | TYPE | DESCRIPTION
clusterId | string | The ID of the cluster to upload the library to.
name | string | The name of the library.
language | string | The language of the library.
uri | string | The URI of the library. The scheme can be file, http, or https.

Response schema
Information about the uploaded library.

FIELD | TYPE | DESCRIPTION
language | string | The language of the library.
uri | string | The URI of the library.

Additional examples
The following additional examples provide commands that you can use with curl or adapt with an HTTP
library in your programming language of choice.
Create an execution context
Run a command
Upload and run a Spark JAR
Create an execution context
Create an execution context on a specified cluster for a given programming language:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/contexts/create \
  --header 'Content-Type: application/json' \
  --data '{ "language": "scala", "clusterId": "1234-567890-span123" }'

Get information about the execution context:

curl --netrc --get \
  https://<databricks-instance>/api/1.2/contexts/status \
  --data 'clusterId=1234-567890-span123&contextId=1234567890123456789'

Delete the execution context:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/contexts/destroy \
  --header 'Content-Type: application/json' \
  --data '{ "contextId": "1234567890123456789", "clusterId": "1234-567890-span123" }'

Run a command
Known limitations: command execution does not support %run .
Run a command string:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/commands/execute \
  --header 'Content-Type: application/json' \
  --data '{ "language": "scala", "clusterId": "1234-567890-span123", "contextId": "1234567890123456789", "command": "sc.parallelize(1 to 10).collect" }'

Run a file:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/commands/execute \
  --header 'Content-Type: multipart/form-data' \
  --form language=python \
  --form clusterId=1234-567890-span123 \
  --form contextId=1234567890123456789 \
  --form command=@myfile.py

Show the command’s status and result:

curl --netrc --get \
  https://<databricks-instance>/api/1.2/commands/status \
  --data 'clusterId=1234-567890-span123&contextId=1234567890123456789&commandId=1234ab56-7890-1cde-234f-5abcdef67890'

Cancel the command:

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/commands/cancel \
  --data 'clusterId=1234-567890-span123&contextId=1234567890123456789&commandId=1234ab56-7890-1cde-234f-5abcdef67890'

Upload and run a Spark JAR


Upload a JAR
Use the REST API (latest) to upload a JAR and attach it to a cluster.
Run a JAR
1. Create an execution context.
curl --netrc --request POST \
https://<databricks-instance>/api/1.2/contexts/create \
--data "language=scala&clusterId=1234-567890-span123"

{
"id": "1234567890123456789"
}

2. Execute a command that uses your JAR.

curl --netrc --request POST \
  https://<databricks-instance>/api/1.2/commands/execute \
  --data 'language=scala&clusterId=1234-567890-span123&contextId=1234567890123456789&command=println(com.databricks.apps.logs.chapter1.LogAnalyzer.processLogFile(sc,null,"dbfs:/somefile.log"))'

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}

3. Check on the status of your command. It may not return immediately if you are running a lengthy Spark
job.

curl --netrc 'https://<databricks-instance>/api/1.2/commands/status?clusterId=1234-567890-span123&contextId=1234567890123456789&commandId=1234ab56-7890-1cde-234f-5abcdef67890'

{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"results": {
"data": "Content Size Avg: 1234, Min: 1234, Max: 1234",
"resultType": "text"
},
"status": "Finished"
}

Allowed values for resultType include:


error
image
images
table
text
Databricks Terraform provider
7/21/2022 • 7 minutes to read

HashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across
several cloud providers. You can use the Databricks Terraform provider to manage your Azure Databricks
workspaces and the associated cloud infrastructure using a flexible, powerful tool. The goal of the Databricks
Terraform provider is to support all Databricks REST APIs, supporting automation of the most complicated
aspects of deploying and managing your data platforms. Databricks customers are using the Databricks
Terraform provider to deploy and manage clusters and jobs, provision Databricks workspaces, and configure
data access.

Getting started
Complete the following steps to install and configure the command line tools that Terraform needs to operate.
These tools include the Terraform CLI and the Azure CLI. After setting up these tools, complete the steps to
create a base Terraform configuration that you can use later to manage your Azure Databricks workspaces and
the associated Azure cloud infrastructure.

NOTE
This procedure assumes that you have access to a deployed Azure Databricks workspace as a Databricks admin, access to
the corresponding Azure subscription, and the appropriate permissions for the actions you want Terraform to perform in
that Azure subscription. For more information, see the following:
Manage users, groups, and service principals
Assign Azure roles using the Azure portal on the Azure website

1. Install the Terraform CLI. For details, see Download Terraform on the Terraform website.
2. Install the Azure CLI, and then use the Azure CLI to login to Azure by running the az login command.
For details, see Install the Azure CLI on the Microsoft Azure website and Azure Provider: Authenticating
using the Azure CLI on the Terraform website.

az login

TIP
To have Terraform run within the context of a different login, run the az login command again. You can switch
to have Terraform use an Azure subscription other than the one listed as "isDefault": true in the output of
running az login . To do this, run the command az account set --subscription="<subscription ID>" ,
replacing <subscription ID> with the value of the id property of the desired subscription in the output of
running az login .
This procedure uses the Azure CLI, along with the default subscription, to authenticate. For alternative
authentication options, see Authenticating to Azure on the Terraform website.

3. In your terminal, create an empty directory and then switch to it. (Each separate set of Terraform
configuration files must be in its own directory.) For example: mkdir terraform_demo && cd terraform_demo .
mkdir terraform_demo && cd terraform_demo

4. In this empty directory, create a file named main.tf . Add the following content to this file, and then save
the file.

TIP
If you use Visual Studio Code, the HashiCorp Terraform extension for Visual Studio Code adds editing features for
Terraform files such as syntax highlighting, IntelliSense, code navigation, code formatting, a module explorer, and
much more.

terraform {
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = ">= 2.26"
}

databricks = {
source = "databricks/databricks"
}
}
}

provider "azurerm" {
features {}
}

provider "databricks" {}

5. Initialize the working directory containing the main.tf file by running the terraform init command. For
more information, see Command: init on the Terraform website.

terraform init

Terraform downloads the azurerm and databricks providers and installs them in a hidden subdirectory
of your current working directory, named .terraform . The terraform init command prints out which
version of the providers were installed. Terraform also creates a lock file named .terraform.lock.hcl
which specifies the exact provider versions used, so that you can control when you want to update the
providers used for your project.
6. Apply the changes required to reach the desired state of the configuration by running the
terraform apply command. For more information, see Command: apply on the Terraform website.

terraform apply

Because no resources have yet been specified in the main.tf file, the output is
Apply complete! Resources: 0 added, 0 changed, 0 destroyed. Also, Terraform writes data into a file called
terraform.tfstate . To create resources, continue with Sample configuration, Next steps, or both to
specify the desired resources to create, and then run the terraform apply command again. Terraform
stores the IDs and properties of the resources it manages in this terraform.tfstate file, so that it can
update or destroy those resources going forward.
Sample configuration
Complete the following procedure to create a sample Terraform configuration that creates a notebook and a job
to run that notebook, in an existing Azure Databricks workspace.
1. In the main.tf file that you created in Getting started, change the databricks provider to reference an
existing Azure Databricks workspace:

provider "databricks" {
host = var.databricks_workspace_url
}

2. At the end of the main.tf file, add the following code:

variable "databricks_workspace_url" {
description = "The URL to the Azure Databricks workspace (must start with https://)"
type = string
default = "<Azure Databricks workspace URL>"
}

variable "resource_prefix" {
description = "The prefix to use when naming the notebook and job"
type = string
default = "terraform-demo"
}

variable "email_notifier" {
description = "The email address to send job status to"
type = list(string)
default = ["<Your email address>"]
}

// Get information about the Databricks user that is calling


// the Databricks API (the one associated with "databricks_connection_profile").
data "databricks_current_user" "me" {}

// Create a simple, sample notebook. Store it in a subfolder within


// the Databricks current user's folder. The notebook contains the
// following basic Spark code in Python.
resource "databricks_notebook" "this" {
path = "${data.databricks_current_user.me.home}/Terraform/${var.resource_prefix}-notebook.ipynb"
language = "PYTHON"
content_base64 = base64encode(<<-EOT
# created from ${abspath(path.module)}
display(spark.range(10))
EOT
)
}

// Create a job to run the sample notebook. The job will create
// a cluster to run on. The cluster will use the smallest available
// node type and run the latest version of Spark.

// Get the smallest available node type to use for the cluster. Choose
// only from among available node types with local storage.
data "databricks_node_type" "smallest" {
local_disk = true
}

// Get the latest Spark version to use for the cluster.


data "databricks_spark_version" "latest" {}

// Create the job, emailing notifiers about job success or failure.


resource "databricks_job" "this" {
name = "${var.resource_prefix}-job-${data.databricks_current_user.me.alphanumeric}"
new_cluster {
num_workers = 1
spark_version = data.databricks_spark_version.latest.id
node_type_id = data.databricks_node_type.smallest.id
}
notebook_task {
notebook_path = databricks_notebook.this.path
}
email_notifications {
on_success = var.email_notifier
on_failure = var.email_notifier
}
}

// Print the URL to the notebook.


output "notebook_url" {
value = databricks_notebook.this.url
}

// Print the URL to the job.


output "job_url" {
value = databricks_job.this.url
}

3. Replace the following values, and then save the file:


Replace <Azure Databricks workspace URL> with the URL to the Azure Databricks workspace.
Replace <Your email address> with your email address.
4. Run terraform apply .
5. Verify that the notebook and job were created: in the output of the terraform apply command, find the
URLs for notebook_url and job_url and go to them.
6. Run the job: on the Jobs page, click Run Now . After the job finishes, check your email inbox.
7. When you are done with this sample, delete the notebook and job from the Azure Databricks workspace
by running terraform destroy .
8. Verify that the notebook and job were deleted: refresh the notebook and Jobs pages to display a
message that the resources cannot be found.

Next steps
1. Create an Azure Databricks workspace.
2. Manage workspace resources for an Azure Databricks workspace.

Troubleshooting
NOTE
For Terraform-specific support, see the Latest Terraform topics on the HashiCorp Discuss website. For issues specific to the
Databricks Terraform Provider, see Issues in the databrickslabs/terraform-provider-databricks GitHub repository.

Error: Failed to install provider


Issue : If you did not check in a terraform.lock.hcl file to your version control system, and you run the
terraform init command, the following message appears: Failed to install provider . Additional output may
include a message similar to the following:
Error while installing databrickslabs/databricks: v1.0.0: checksum list has no SHA-256 hash for
"https://github.com/databricks/terraform-provider-databricks/releases/download/v1.0.0/terraform-provider-
databricks_1.0.0_darwin_amd64.zip"

Cause : Your Terraform configurations reference outdated Databricks Terraform providers.


Solution :
1. Replace databrickslabs/databricks with databricks/databricks in all of your .tf files.
To automate these replacements, run the following Python command from the parent folder that contains
the .tf files to update:

python3 -c "$(curl -Ls https://dbricks.co/updtfns)"

2. Run the following Terraform command and then approve the changes when prompted:

terraform state replace-provider databrickslabs/databricks databricks/databricks

For information about this command, see Command: state replace-provider in the Terraform
documentation.
3. Verify the changes by running the following Terraform command:

terraform init

Error: Failed to query available provider packages


Issue : If you did not check in a terraform.lock.hcl file to your version control system, and you run the
terraform init command, the following message appears: Failed to query available provider packages .

Cause : Your Terraform configurations reference outdated Databricks Terraform providers.


Solution : Follow the solution instructions in Error: Failed to install provider.

Additional examples
Deploy an Azure Databricks workspace using Terraform
Manage a workspace end-to-end using Terraform
Create clusters
Control access to clusters: see Enable cluster access control for your workspace and Cluster access control
Control access to jobs: see Enable jobs access control for a workspace and Jobs access control
Control access to pools: see Enable instance pool access control for a workspace and Pool access control
Control access to personal access tokens
Control access to notebooks
Configure Databricks Repos
Control access to secrets
Configure usage log delivery
Control access to Databricks SQL tables
Implement CI/CD pipelines to deploy Databricks resources using the Databricks Terraform provider

Additional resources
Databricks Provider Documentation on the Terraform Registry website
Terraform Documentation on the Terraform website
Continuous integration and delivery on Azure
Databricks using Azure DevOps
7/21/2022 • 20 minutes to read

Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering
software in short, frequent cycles through the use of automation pipelines. While this is by no means a new
process, having been ubiquitous in traditional software engineering for decades, it is becoming an increasingly
necessary process for data engineering and data science teams. In order for data products to be valuable, they
must be delivered in a timely manner. Additionally, consumers must have confidence in the validity of outcomes
within these products. By automating the building, testing, and deployment of code, development teams are
able to deliver releases more frequently and reliably than the more manual processes that are still prevalent
across many data engineering and data science teams.
Continuous integration begins with the practice of having you commit your code with some frequency to a
branch within a source code repository. Each commit is then merged with the commits from other developers to
ensure that no conflicts were introduced. Changes are further validated by creating a build and running
automated tests against that build. This process ultimately results in an artifact, or deployment bundle, that will
eventually be deployed to a target environment, in this case an Azure Databricks workspace.

Overview of a typical Azure Databricks CI/CD pipeline


Though it can vary based on your needs, a typical configuration for an Azure Databricks pipeline includes the
following steps:
Continuous integration:
1. Code
a. Develop code and unit tests in an Azure Databricks notebook or using an external IDE.
b. Manually run tests.
c. Commit code and tests to a git branch.
2. Build
a. Gather new and updated code and tests.
b. Run automated tests.
c. Build libraries and non-notebook Apache Spark code.
3. Release: Generate a release artifact.
Continuous deliver y:
1. Deploy
a. Deploy notebooks.
b. Deploy libraries.
2. Test: Run automated tests and report results.
3. Operate: Programmatically schedule data engineering, analytics, and machine learning workflows.

Develop and commit your code


One of the first steps in designing a CI/CD pipeline is deciding on a code commit and branching strategy to
manage the development and integration of new and updated code without adversely affecting the code
currently in production. Part of this decision involves choosing a version control system to contain your code
and facilitate the promotion of that code. Azure Databricks supports integrations with GitHub and Bitbucket,
which allow you to commit notebooks to a git repository.
If your version control system is not among those supported through direct notebook integration, or if you want
more flexibility and control than the self-service git integration, you can use the Databricks CLI to export
notebooks and commit them from your local machine. This script should be run from within a local git
repository that is set up to sync with the appropriate remote repository. When executed, this script should:
1. Check out the desired branch.
2. Pull new changes from the remote branch.
3. Export notebooks from the Azure Databricks workspace using the Azure Databricks workspace CLI.
4. Prompt the user for a commit message or use the default if one is not provided.
5. Commit the updated notebooks to the local branch.
6. Push the changes to the remote branch.
The following script performs these steps:

git checkout <branch>
git pull
databricks workspace export_dir --profile <profile> -o <path> ./Workspace

dt=`date '+%Y-%m-%d %H:%M:%S'`
msg_default="DB export on $dt"
read -p "Enter the commit comment [$msg_default]: " msg
msg=${msg:-$msg_default}
echo $msg

git add .
git commit -m "$msg"
git push

If you prefer to develop in an IDE rather than in Azure Databricks notebooks, you can use the VCS integration
features built into modern IDEs or the git CLI to commit your code.
Azure Databricks provides Databricks Connect, an SDK that connects IDEs to Azure Databricks clusters. This is
especially useful when developing libraries, as it allows you to run and unit test your code on Azure Databricks
clusters without having to deploy that code. See Databricks Connect limitations to determine whether your use
case is supported.

NOTE
Databricks now recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.

Depending on your branching strategy and promotion process, the point at which a CI/CD pipeline will initiate a
build will vary. However, committed code from various contributors will eventually be merged into a designated
branch to be built and deployed. Branch management steps run outside of Azure Databricks, using the interfaces
provided by the version control system.
There are numerous CI/CD tools you can use to manage and execute your pipeline. This article illustrates how to
use the Azure DevOps automation server. CI/CD is a design pattern, so the steps and stages outlined in this
article should transfer with a few changes to the pipeline definition language in each tool. Furthermore, much of
the code in this example pipeline runs standard Python code, which you can invoke in other tools.
For information about using Jenkins with Azure Databricks, see Continuous integration and delivery on Azure
Databricks using Jenkins.
Define your build pipeline
Azure DevOps provides a cloud hosted interface for defining the stages of your CI/CD pipeline using YAML. You
define the build pipeline, which runs unit tests and builds a deployment artifact, in the Pipelines interface. Then,
to deploy the code to an Azure Databricks workspace, you specify this deployment artifact in a release pipeline.
In your Azure DevOps project, open the Pipelines menu and click Pipelines .

Click the New Pipeline button to open the Pipeline editor, where you define your build in the
azure-pipelines.yml file.

You can use the Git branch selector to customize the build process for each branch in your
Git repository.
The azure-pipelines.yml file is stored by default in the root directory of the git repository for the pipeline.
Environment variables referenced by the pipeline are configured using the Variables button.

For more information on Azure DevOps and build pipelines, see the Azure DevOps documentation.
Configure your build agent
To execute the pipeline, Azure DevOps provides cloud-hosted, on-demand execution agents that support
deployments to Kubernetes, VMs, Azure Functions, Azure Web Apps, and many more targets. In this example,
you use an on-demand agent to automate the deployment of code to the target Azure Databricks workspace.
Tools or packages required by the pipeline must be defined in the pipeline script and installed on the agent at
execution time.
This example requires the following dependencies:
Conda - Conda is an open source environment management system
Python v3.7.3 - Python will be used to run tests, build a deployment wheel, and execute deployment scripts.
The version of Python is important as tests require that the version of Python running on the agent should
match that of the Azure Databricks cluster. This example uses Databricks Runtime 6.4, which includes Python
3.7.
Python libraries: requests , databricks-connect , databricks-cli , pytest

Here is an example pipeline ( azure-pipelines.yml ). The complete script follows. This article steps through each
section of the script.

# Azure Databricks Build Pipeline


# azure-pipelines.yml

trigger:
- release

pool:
name: Hosted Ubuntu 1604

steps:
- task: UsePythonVersion@0
displayName: 'Use Python 3.7'
inputs:
versionSpec: 3.7

- script: |
pip install pytest requests setuptools wheel
pip install -U databricks-connect==6.4.*
displayName: 'Load Python Dependencies'

- script: |
echo "y
$(WORKSPACE-REGION-URL)
$(CSE-DEVELOP-PAT)
$(EXISTING-CLUSTER-ID)
$(WORKSPACE-ORG-ID)
15001" | databricks-connect configure
displayName: 'Configure DBConnect'

- checkout: self
persistCredentials: true
clean: true

- script: git checkout release


displayName: 'Get Latest Branch'

- script: |
python -m pytest --junit-xml=$(Build.Repository.LocalPath)/logs/TEST-LOCAL.xml $(Build.Repository.LocalPath)/libraries/python/dbxdemo/test*.py || true

displayName: 'Run Python Unit Tests for library code'

- task: PublishTestResults@2
inputs:
testResultsFiles: '**/TEST-*.xml'
failTaskOnFailedTests: true
publishRunAttachments: true

- script: |
cd $(Build.Repository.LocalPath)/libraries/python/dbxdemo
python3 setup.py sdist bdist_wheel
ls dist/
displayName: 'Build Python Wheel for Libs'

- script: |
git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp --parents -r '{}' $(Build.BinariesDirectory)

mkdir -p $(Build.BinariesDirectory)/libraries/python/libs
cp $(Build.Repository.LocalPath)/libraries/python/dbxdemo/dist/*.* $(Build.BinariesDirectory)/libraries/python/libs

mkdir -p $(Build.BinariesDirectory)/cicd-scripts
cp $(Build.Repository.LocalPath)/cicd-scripts/*.* $(Build.BinariesDirectory)/cicd-scripts

displayName: 'Get Changes'

- task: ArchiveFiles@2
inputs:
rootFolderOrFile: '$(Build.BinariesDirectory)'
includeRootFolder: false
archiveType: 'zip'
archiveFile: '$(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip'
replaceExistingArchive: true

- task: PublishBuildArtifacts@1
inputs:
ArtifactName: 'DatabricksBuild'

Set up the pipeline


In the setup stage you configure the build agent, Databricks CLI, and Databricks Connect with connection
information.

# Specify the trigger event to start the build pipeline.


# In this case, new code merged into the release branch initiates a new build.
trigger:
- release

# Specify the OS for the agent


pool:
name: Hosted Ubuntu 1604

# Install Python. The version must match the version on the Databricks cluster.
steps:
- task: UsePythonVersion@0
displayName: 'Use Python 3.7'
inputs:
versionSpec: 3.7

# Install required Python modules, including databricks-connect, required to execute a unit test
# on a cluster.
- script: |
pip install pytest requests setuptools wheel
pip install -U databricks-connect==6.4.*
displayName: 'Load Python Dependencies'

# Use environment variables to pass Databricks login information to the Databricks Connect
# configuration function
- script: |
echo "y
$(WORKSPACE-REGION-URL)
$(CSE-DEVELOP-PAT)
$(EXISTING-CLUSTER-ID)
$(WORKSPACE-ORG-ID)
15001" | databricks-connect configure
displayName: 'Configure DBConnect'

Get the latest changes


This stage downloads code from the designated branch to the agent execution agent.
- checkout: self
persistCredentials: true
clean: true

- script: git checkout release


displayName: 'Get Latest Branch'

Unit tests in Azure Databricks notebooks


For library code developed outside an Azure Databricks notebook, the process is like traditional software
development practices. You write a unit test using a testing framework, like the Python pytest module, and use
JUnit-formatted XML files to store the test results.
Azure Databricks code is Apache Spark code intended to be executed on Azure Databricks clusters. To unit test
this code, you can use the Databricks Connect SDK configured in Set up the pipeline.
Test library code using Databricks Connect
This stage of the pipeline invokes the unit tests, specifying the name and location for both the tests and the
output files.

- script: |
python -m pytest --junit-xml=$(Build.Repository.LocalPath)/logs/TEST-LOCAL.xml $(Build.Repository.LocalPath)/libraries/python/dbxdemo/test*.py || true
ls logs
displayName: 'Run Python Unit Tests for library code'

The following snippet ( addcol.py ) is a library function that might be installed on an Azure Databricks cluster.
This simple function adds a new column, populated by a literal, to an Apache Spark DataFrame.

# addcol.py
import pyspark.sql.functions as F

def with_status(df):
return df.withColumn("status", F.lit("checked"))

The following test, test-addcol.py , passes a mock DataFrame object to the with_status function, defined in
addcol.py. The result is then compared to a DataFrame object containing the expected values. If the values
match, the test passes.
# test-addcol.py
import pytest

from dbxdemo.spark import get_spark


from dbxdemo.appendcol import with_status

class TestAppendCol(object):

def test_with_status(self):
source_data = [
("pete", "pan", "peter.pan@databricks.com"),
("jason", "argonaut", "jason.argonaut@databricks.com")
]
source_df = get_spark().createDataFrame(
source_data,
["first_name", "last_name", "email"]
)

actual_df = with_status(source_df)

expected_data = [
("pete", "pan", "peter.pan@databricks.com", "checked"),
("jason", "argonaut", "jason.argonaut@databricks.com", "checked")
]
expected_df = get_spark().createDataFrame(
expected_data,
["first_name", "last_name", "email", "status"]
)

assert(expected_df.collect() == actual_df.collect())
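
The test imports a get_spark helper from dbxdemo.spark, which this article does not show. A minimal sketch, assuming Databricks Connect (configured in Set up the pipeline) supplies the remote Spark session, might look like the following; the module path is inferred from the import above:

# dbxdemo/spark.py (hypothetical helper, not shown in this article)
from pyspark.sql import SparkSession

def get_spark():
    # With databricks-connect installed and configured, this session runs
    # its commands on the configured Azure Databricks cluster.
    return SparkSession.builder.getOrCreate()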

Package library code


This stage of the pipeline packages the library code into a Python wheel.

- script: |
cd $(Build.Repository.LocalPath)/libraries/python/dbxdemo
python3 setup.py sdist bdist_wheel
ls dist/
displayName: 'Build Python Wheel for Libs'

Publish test results


After all unit tests have been executed, publish the results to Azure DevOps. This lets you visualize reports and
dashboards related to the status of the build process.

- task: PublishTestResults@2
inputs:
testResultsFiles: '**/TEST-*.xml'
failTaskOnFailedTests: true
publishRunAttachments: true

Generate and store a deployment artifact


The final step of the build pipeline is to generate the deployment artifact. To do this, you gather all the new or
updated code to be deployed to the Azure Databricks environment, including the notebook code to be deployed
to the workspace, .whl libraries that were generated by the build process, and the result summaries for the
tests, for archiving purposes.
# Use git diff to flag files added in the most recent git merge
- script: |
git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp --parents -r '{}' $(Build.BinariesDirectory)

# Add the wheel file you just created along with utility scripts used by the Release pipeline
# The implementation in your Pipeline may be different.
# The objective is to add all files intended for the current release.
mkdir -p $(Build.BinariesDirectory)/libraries/python/libs
cp $(Build.Repository.LocalPath)/libraries/python/dbxdemo/dist/*.* $(Build.BinariesDirectory)/libraries/python/libs

mkdir -p $(Build.BinariesDirectory)/cicd-scripts
cp $(Build.Repository.LocalPath)/cicd-scripts/*.* $(Build.BinariesDirectory)/cicd-scripts
displayName: 'Get Changes'

# Create the deployment artifact and publish it to the artifact repository


- task: ArchiveFiles@2
inputs:
rootFolderOrFile: '$(Build.BinariesDirectory)'
includeRootFolder: false
archiveType: 'zip'
archiveFile: '$(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip'
replaceExistingArchive: true

- task: PublishBuildArtifacts@1
inputs:
ArtifactName: 'DatabricksBuild'

Define your release pipeline


The release pipeline deploys the artifact to an Azure Databricks environment. Separating the release pipeline
from the build pipeline allows you to create a build without deploying it, or to deploy artifacts from multiple
builds at one time.
1. In your Azure DevOps project, go to the Pipelines menu and click Releases .
2. On the right side of the screen is a list of featured templates for common deployment patterns. For this
pipeline, click .

3. In the Artifacts box on the left side of the screen, click and select the build pipeline created earlier.

You can configure how the pipeline is triggered by clicking , which displays triggering options on the right
side of the screen. If you want a release to be initiated automatically based on build artifact availability or after a
pull request workflow, enable the appropriate trigger.
To add steps or tasks for the deployment, click the link within the stage object.

Add tasks
To add tasks, click the plus sign in the Agent job section, indicated by the red arrow in the following figure. A
searchable list of available tasks appears. There is also a Marketplace for third-party plug-ins that can be used to
supplement the standard Azure DevOps tasks.

Set Python version


The first task you add is Use Python version . As with the build pipeline, you want to make sure that the
Python version is compatible with the scripts called in subsequent tasks.
In this case, set the Python version to 3.7.

Unpackage the build artifact


Extract the archive using the Extract files task. Set the Archive file patterns to *.zip , and set the
Destination folder to the system variable, “$(agent.builddirectory)”. You can optionally set the Display name ;
this is the name that appears on the screen under Agent job .

Deploy the notebooks to the workspace


To deploy the notebooks, this example uses the third-party task Databricks Deploy Notebooks developed by
Data Thirst.
Enter environment variables to set the values for Azure Region and Databricks bearer token .
Set the Source files path to the path of the extracted directory containing your notebooks.
Set the Target files path to the desired path within the Azure Databricks workspace directory structure.

Deploy the library to DBFS


To deploy the Python *.whl file, use the third-party task Databricks files to DBFS , also developed by Data Thirst.
Enter environment variables to set the values for Azure Region and Databricks bearer token .
Set the Local Root Folder to the path of the extracted directory containing the Python libraries.
Set the Target folder in DBFS to the desired DBFS path.
Install the library on a cluster
The final code deployment task installs the library onto a specific cluster. To do this, you create a Python script
task. The Python script, installWhlLibrary.py , is in the artifact created by our build pipeline.

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

Set the Script path to $(agent.builddirectory)/cicd-scripts/installWhlLibrary.py . The


installWhlLibrary.py script takes five arguments:
shard - the URL for the target workspace (for example, https://<region>.azuredatabricks.net )
token - a personal access token for workspace
clusterid - the ID for the cluster on which to install the library
libs - the extracted directory containing the libraries
dbfspath - the path within the DBFS file system to retrieve the libraries
Before installing a new version of a library on an Azure Databricks cluster, you must uninstall the existing library.
To do this, invoke the Databricks REST API in a Python script to perform the following steps:
1. Check if the library is installed.
2. Uninstall the library.
3. Restart the cluster if any uninstalls were performed.
4. Wait until the cluster is running again before proceeding.
5. Install the library.

# installWhlLibrary.py
#!/usr/bin/python3
import json
import requests
import sys
import getopt
import time
import os

def main():
shard = ''
token = ''
clusterid = ''
libspath = ''
dbfspath = ''

try:
opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:d:',
['shard=', 'token=', 'clusterid=', 'libs=', 'dbfspath='])
except getopt.GetoptError:
print(
'installWhlLibrary.py -s <shard> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
sys.exit(2)

for opt, arg in opts:


if opt == '-h':
print(
'installWhlLibrary.py -s <shard> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
sys.exit()
elif opt in ('-s', '--shard'):
shard = arg
elif opt in ('-t', '--token'):
token = arg
elif opt in ('-c', '--clusterid'):
clusterid = arg
elif opt in ('-l', '--libs'):
libspath=arg
elif opt in ('-d', '--dbfspath'):
dbfspath=arg

print('-s is ' + shard)


print('-t is ' + token)
print('-c is ' + clusterid)
print('-l is ' + libspath)
print('-d is ' + dbfspath)

# Uninstall library if exists on cluster


i=0

# Generate array from walking local path


libslist = []
for path, subdirs, files in os.walk(libspath):
for name in files:

name, file_extension = os.path.splitext(name)


if file_extension.lower() in ['.whl']:
libslist.append(name + file_extension.lower())

for lib in libslist:


dbfslib = dbfspath + '/' + lib

if getLibStatus(shard, token, clusterid, dbfslib) is not None:


print(dbfslib + ' before:' + getLibStatus(shard, token, clusterid, dbfslib))
print(dbfslib + " exists. Uninstalling.")
i = i + 1
values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}

resp = requests.post(shard + '/api/2.0/libraries/uninstall', data=json.dumps(values), auth=("token", token))
runjson = resp.text
d = json.loads(runjson)
print(dbfslib + ' after:' + getLibStatus(shard, token, clusterid, dbfslib))

# Restart if libraries uninstalled


if i > 0:
values = {'cluster_id': clusterid}
print("Restarting cluster:" + clusterid)
resp = requests.post(shard + '/api/2.0/clusters/restart', data=json.dumps(values), auth=
("token", token))
restartjson = resp.text
print(restartjson)

p = 0
waiting = True
while waiting:
time.sleep(30)
clusterresp = requests.get(shard + '/api/2.0/clusters/get?cluster_id=' + clusterid,
auth=("token", token))
clusterjson = clusterresp.text
jsonout = json.loads(clusterjson)
current_state = jsonout['state']
print(clusterid + " state:" + current_state)
if current_state in ['TERMINATED', 'RUNNING','INTERNAL_ERROR', 'SKIPPED'] or p >= 10:
break
p = p + 1

print("Installing " + dbfslib)


values = {'cluster_id': clusterid, 'libraries': [{'whl': 'dbfs:' + dbfslib}]}

resp = requests.post(shard + '/api/2.0/libraries/install', data=json.dumps(values), auth=("token", token))
runjson = resp.text
d = json.loads(runjson)
print(dbfslib + ' after:' + getLibStatus(shard, token, clusterid, dbfslib))

def getLibStatus(shard, token, clusterid, dbfslib):

resp = requests.get(shard + '/api/2.0/libraries/cluster-status?cluster_id=' + clusterid, auth=("token", token))
libjson = resp.text
d = json.loads(libjson)
if (d.get('library_statuses')):
statuses = d['library_statuses']

for status in statuses:


if (status['library'].get('whl')):
if (status['library']['whl'] == 'dbfs:' + dbfslib):
return status['status']
else:
# No libraries found
return "not found"

if __name__ == '__main__':
main()

Run integration tests from an Azure Databricks notebook


You can also run tests directly from notebooks containing asserts. In this case, you use the same test you used in
the unit test, but now it imports the installed appendcol library from the whl that you just installed on the
cluster.
To automate this test and include it in the CI/CD pipeline, use the Databricks REST API to execute the notebook
from the CI/CD server. This allows you to check whether the notebook execution passed or failed using pytest .
Any assert failures appear in the JSON output returned by the REST API and in the JUnit test results.
Step 1: Configure the test environment
Create a Command line task as shown. This task includes commands that create directories for the notebook
execution logs and the test summaries. It also includes a pip command to install the required pytest and
requests modules.

Step 2: Run the notebook


NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

Create a Python script task and configure it as follows:


Set the Script path to $(agent.builddirectory)/cicd-scripts/executeNotebook.py . This script takes six
arguments:
shard - the URL for the target workspace (for example, https://eastus.azuredatabricks.net)
token - a personal access token for workspace
clusterid - the ID for the cluster on which to execute the test
localpath - the extracted directory containing the test notebooks
workspacepath - the path within the workspace to which the test notebooks were deployed
outfilepath - the path you created to store the JSON output returned by the REST API

The executenotebook.py script runs the notebook using the jobs runs submit endpoint, which submits an anonymous job. Because this endpoint is asynchronous, the script uses the run ID initially returned by the REST call to poll for the status of the job. After the job completes, the JSON output is saved to the path specified by the function arguments passed at invocation.

# executenotebook.py
#!/usr/bin/python3
import json
import requests
import os
import sys
import getopt
import time

def main():
shard = ''
token = ''
clusterid = ''
localpath = ''
workspacepath = ''
outfilepath = ''

try:
opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:w:o:',
['shard=', 'token=', 'clusterid=', 'localpath=', 'workspacepath=',
'outfilepath='])
except getopt.GetoptError:
print(
'executenotebook.py -s <shard> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
sys.exit(2)

for opt, arg in opts:


if opt == '-h':
print(
'executenotebook.py -s <shard> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
sys.exit()
elif opt in ('-s', '--shard'):
shard = arg
elif opt in ('-t', '--token'):
token = arg
elif opt in ('-c', '--clusterid'):
clusterid = arg
elif opt in ('-l', '--localpath'):
localpath = arg
elif opt in ('-w', '--workspacepath'):
workspacepath = arg
elif opt in ('-o', '--outfilepath'):
outfilepath = arg

print('-s is ' + shard)


print('-t is ' + token)
print('-c is ' + clusterid)
print('-l is ' + localpath)
print('-w is ' + workspacepath)
print('-o is ' + outfilepath)
# Generate array from walking local path

notebooks = []
for path, subdirs, files in os.walk(localpath):
for name in files:
fullpath = path + '/' + name
# removes localpath to repo but keeps workspace path
fullworkspacepath = workspacepath + path.replace(localpath, '')

name, file_extension = os.path.splitext(fullpath)


if file_extension.lower() in ['.scala', '.sql', '.r', '.py']:
row = [fullpath, fullworkspacepath, 1]
notebooks.append(row)

# run each element in array


for notebook in notebooks:
nameonly = os.path.basename(notebook[0])
workspacepath = notebook[1]

name, file_extension = os.path.splitext(nameonly)

# workpath removes extension


fullworkspacepath = workspacepath + '/' + name

print('Running job for:' + fullworkspacepath)


values = {'run_name': name, 'existing_cluster_id': clusterid, 'timeout_seconds': 3600,
'notebook_task': {'notebook_path': fullworkspacepath}}

resp = requests.post(shard + '/api/2.0/jobs/runs/submit', data=json.dumps(values), auth=("token", token))
runjson = resp.text
print("runjson:" + runjson)
d = json.loads(runjson)
runid = d['run_id']

i=0
waiting = True
while waiting:
time.sleep(10)
jobresp = requests.get(shard + '/api/2.0/jobs/runs/get?run_id='+str(runid),
data=json.dumps(values), auth=("token", token))
jobjson = jobresp.text
print("jobjson:" + jobjson)
j = json.loads(jobjson)
current_state = j['state']['life_cycle_state']
runid = j['run_id']
if current_state in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
break
i=i+1

if outfilepath != '':
file = open(outfilepath + '/' + str(runid) + '.json', 'w')
file.write(json.dumps(j))
file.close()

if __name__ == '__main__':
main()

Step 3: Generate and evaluate test results


This task executes a Python script using pytest to determine if the asserts in the test notebooks passed or
failed.
Create a Command line task. The Script should be:

python -m pytest --junit-xml=$(agent.builddirectory)\logs\xml\TEST-notebookout.xml --jsonpath=$(agent.builddirectory)\logs\json\ $(agent.builddirectory)\cicd-scripts\evaluatenotebookruns.py || true

The arguments are:


junit-xml - the path in which to generate the JUnit test summary logs
jsonpath - the path you created to store the JSON output returned by the REST API
The script evaluatenotebookruns.py defines the test_job_run function, which parses and evaluates the JSON
generated by the previous task. Another test, test_performance , looks for tests that run longer than expected.
# evaluatenotebookruns.py
import unittest
import json
import glob
import os

class TestJobOutput(unittest.TestCase):

    test_output_path = '#ENV#'

    def test_performance(self):
        path = self.test_output_path
        statuses = []

        for filename in glob.glob(os.path.join(path, '*.json')):
            print('Evaluating: ' + filename)
            data = json.load(open(filename))
            duration = data['execution_duration']
            if duration > 100000:
                status = 'FAILED'
            else:
                status = 'SUCCESS'

            statuses.append(status)

        self.assertFalse('FAILED' in statuses)

    def test_job_run(self):
        path = self.test_output_path
        statuses = []

        for filename in glob.glob(os.path.join(path, '*.json')):
            print('Evaluating: ' + filename)
            data = json.load(open(filename))
            status = data['state']['result_state']
            statuses.append(status)

        self.assertFalse('FAILED' in statuses)

if __name__ == '__main__':
    unittest.main()

Publish test results


Use the Publish Test Results task to archive the JSON results and publish the test results to Azure DevOps Test
Hub. This enables you to visualize reports and dashboards related to the status of the test runs.
At this point, you have completed an integration and deployment cycle using the CI/CD pipeline. By automating
this process, you ensure that your code is tested and deployed by an efficient, consistent, and repeatable
process.
Continuous integration and delivery on Azure
Databricks using GitHub Actions
7/21/2022 • 2 minutes to read

IMPORTANT
This feature is in Public Preview.

Below is a list of GitHub Actions developed for Azure Databricks that you can use in your CI/CD workflows on
GitHub.

GitHub Action: databricks/run-notebook
Action description: Executes an Azure Databricks notebook as a one-time Azure Databricks job run, awaits its
completion, and returns the notebook’s output.

GitHub Action: databricks/upload-dbfs-temp
Action description: Uploads a file to a temporary DBFS path for the duration of the current GitHub Workflow
job. Returns the path of the DBFS tempfile.
Service principals for CI/CD
7/21/2022 • 7 minutes to read

A service principal is an identity created for use with automated tools and applications, including CI/CD
platforms such as GitHub Actions, Airflow in data pipelines, and Jenkins.
As a security best practice, Databricks recommends using an Azure AD service principal and its Azure AD token
instead of your Azure Databricks user or your Azure Databricks personal access token for your workspace user
to give CI/CD platforms access to Azure Databricks resources. Some benefits to this approach include the
following:
You can grant and restrict access to Azure Databricks resources for an Azure AD service principal
independently of a user. For instance, this allows you to prohibit an Azure AD service principal from acting as
an admin in your Azure Databricks workspace while still allowing other specific users in your workspace to
continue to act as admins.
Users can safeguard their access tokens from being accessed by CI/CD platforms.
You can temporarily disable or permanently delete an Azure AD service principal without impacting other
users. For instance, this allows you to pause or remove access from an Azure AD service principal that you
suspect is being used in a malicious way.
If a user leaves your organization, you can remove that user without impacting any Azure AD service
principal.
To give a CI/CD platform access to your Azure Databricks workspace, you provide the CI/CD platform with
information about your Azure AD service principal. For instance, this may involve generating an Azure AD token
for the Azure AD service principal and then giving this Azure AD token to the CI/CD platform.
To give your Azure Databricks workspace access to a Git provider (for example, when you use Azure Databricks
Git integration with Databricks Repos), you must add your Git provider credentials to your workspace,
depending on the Git provider’s requirements.
To complete these two access connections, you do the following:
1. Create an Azure AD service principal.
2. Create an Azure AD token for an Azure AD service principal.
3. Add the Azure AD service principal to your Azure Databricks workspace.
4. Add your Git provider credentials to your workspace with your Azure AD token and the Git Credentials API
2.0.
The first three steps are covered in Service principals for Azure Databricks automation.
To complete the last step, you can use tools such as curl and Postman. You cannot use the Azure Databricks
user interface.
This article describes how to:
1. Provide information to the CI/CD platform, for example the Azure AD token for the Azure AD service
principal, depending on the CI/CD platform’s requirements.
2. If you use Azure Databricks Git integration with Databricks Repos, add your Git provider credentials to your
Azure Databricks workspace.

Requirements
The Azure AD token for an Azure AD service principal. To create an Azure AD service principal and its Azure
AD token, see Service principals for Azure Databricks automation.
A tool to call the Azure Databricks APIs, such as curl or Postman.
If you use a Git provider, an account with your Git provider.

Set up GitHub Actions


GitHub Actions must be able to access your Azure Databricks workspace. If you use Azure Databricks Git
integration with Databricks Repos, your workspace must also be able to access GitHub.
To enable GitHub Actions to access your Azure Databricks workspace, you must provide information about your
Azure AD service principal to GitHub Actions. This can include information such as the Application (client) ID ,
the Directory (tenant) ID , the client secret’s Value , or the access_token value representing the Azure AD
token for your Azure AD service principal, depending on the GitHub Action’s requirements. For more
information, see Service principals for Azure Databricks automation and the GitHub Action’s documentation.
To enable your Azure Databricks workspace to access GitHub (for example, when you use Azure Databricks Git
integration with Databricks Repos), you must add the GitHub personal access token for a GitHub machine user
to your workspace.
This section describes how to set up both of these access connections.
Provide information about your Azure AD service principal to GitHub Actions
This section describes how to enable GitHub Actions to access your Azure Databricks workspace.
As a security best practice, Databricks recommends that you do not enter information about your Azure AD
service principal directly into the body of a GitHub Actions file. You should provide this information to GitHub
Actions by using GitHub encrypted secrets instead.
GitHub Actions, such as the ones that Databricks lists in Continuous integration and delivery on Azure
Databricks using GitHub Actions, rely on various GitHub encrypted secrets such as:
DATABRICKS_HOST , which is the value https:// followed by your workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
AZURE_CREDENTIALS , which is a JSON document that represents the output of running the Azure CLI to get
information about an Azure service principal. For more information, see the documentation for the GitHub
Action.
AZURE_SP_APPLICATION_ID , which is the value of the Application (client) ID for your Azure service principal.
AZURE_SP_TENANT_ID , which is the value of the Directory (tenant) ID for your Azure service principal.
AZURE_SP_CLIENT_SECRET , which is the value of the client secret’s Value for your Azure service principal.

For more information about which GitHub encrypted secrets are required for a GitHub Action, see Service
principals for Azure Databricks automation and the documentation for that GitHub Action.
To add these GitHub encrypted secrets to your GitHub repository, see Creating encrypted secrets for a
repository in the GitHub documentation. For other approaches to add these GitHub repository secrets, see
Encrypted secrets in the GitHub documentation.
Add the GitHub personal access token for a GitHub machine user to your Azure Databricks workspace
This section describes how to enable your Azure Databricks workspace to access GitHub (for example, when you
use Azure Databricks Git integration with Databricks Repos).
As a security best practice, Databricks recommends that you use GitHub machine users instead of GitHub
personal accounts, for many of the same reasons that you should use an Azure AD service principal instead of an
Azure Databricks user. To add the GitHub personal access token for a GitHub machine user to your Azure
Databricks workspace, do the following:
1. Create a GitHub machine user, if you do not already have one available. A GitHub machine user is a
GitHub personal account, separate from your own GitHub personal account, that you can use to automate
activity on GitHub. Create a new separate GitHub account to use as a GitHub machine user, if you do not
already have one available.

NOTE
When you create a new separate GitHub account as a GitHub machine user, you cannot associate it with the email
address for your own GitHub personal account. Instead, see your organization’s email administrator about getting
a separate email address that you can associate with this new separate GitHub account as a GitHub machine user.
See your organization’s account administrator about managing the separate email address and its associated
GitHub machine user and its GitHub personal access tokens within your organization.

2. Give the GitHub machine user access to your GitHub repository. See Inviting a team or person in the
GitHub documentation. To accept the invitation, you may first need to sign out of your GitHub personal
account, and then sign back in as the GitHub machine user.
3. Sign in to GitHub as the machine user, and then create a GitHub personal access token for that machine
user. See Create a personal access token in the GitHub documentation. Be sure to give the GitHub
personal access token repo access.
4. Use a tool such as curl or Postman to call the “create a Git credential entry” ( POST /git-credentials )
operation in the Git Credentials API 2.0. (You cannot use the Azure Databricks user interface for this.) In
the following instructions, replace:
the following instructions, replace:
<service-principal-access-token> with the Azure AD token for your Azure AD service principal. (Do not use
the Azure Databricks personal access token for your workspace user.)

TIP
To confirm that you are using the correct token, you can first use the Azure AD token for your Azure AD
service principal to call the SCIM API 2.0 (Me) API, and review the output of the call.

<machine-user-access-token> with the GitHub personal access token for the GitHub machine user.
<machine-user-name> with the GitHub username of the GitHub machine user.
Curl
Run the following command. Make sure the set-git-credentials.json file is in the same directory where
you run this command. This command uses the environment variable DATABRICKS_HOST , representing
your Azure Databricks per-workspace URL, for example
https://adb-1234567890123456.7.azuredatabricks.net .

curl -X POST \
${DATABRICKS_HOST}/api/2.0/git-credentials \
--header 'Authorization: Bearer <service-principal-access-token>' \
--data @set-git-credentials.json \
| jq .

set-git-credentials.json :
{
"personal_access_token": "<machine-user-access-token>",
"git_username": "<machine-user-name>",
"git_provider": "gitHub"
}

Postman
a. Create a new HTTP request (File > New > HTTP Request ).
b. In the HTTP verb drop-down list, select POST .
c. For Enter request URL , enter http://<databricks-instance-name>/api/2.0/git-credentials , where
<databricks-instance-name> is your Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

d. On the Authorization tab, in the Type list, select Bearer Token .


e. For Token , enter the Azure AD token for the Azure AD service principal (the
<service-principal-access-token> ).

f. On the Headers tab, add the Key and Value pair of Content-Type and application/scim+json

g. On the Body tab, select raw and JSON .


h. Enter the following body payload:

{
"personal_access_token": "<machine-user-access-token>",
"git_username": "<machine-user-name>",
"git_provider": "gitHub"
}

i. Click Send .

TIP
To confirm that the call was successful, you can use the Azure AD token for your Azure AD service principal to call the “get
Git credentials” ( GET /git-credentials ) operation in the Git Credentials API 2.0, and review the output of the call.
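If you prefer to script this call instead of using curl or Postman, the following minimal Python sketch issues the
same request with the requests library. The endpoint and JSON payload match the examples above; the
placeholder workspace URL and token values are assumptions that you must replace with your own.

# set_git_credentials.py (illustrative sketch; replace the placeholder values)
import json
import requests

databricks_host = "https://<databricks-instance-name>"       # per-workspace URL
azure_ad_token = "<service-principal-access-token>"          # Azure AD token for the service principal

payload = {
    "personal_access_token": "<machine-user-access-token>",
    "git_username": "<machine-user-name>",
    "git_provider": "gitHub"
}

# Calls the "create a Git credential entry" (POST /git-credentials) operation.
response = requests.post(
    databricks_host + "/api/2.0/git-credentials",
    headers={"Authorization": "Bearer " + azure_ad_token},
    data=json.dumps(payload),
)

print(response.status_code)
print(response.text)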
Continuous integration and delivery on Azure
Databricks using Jenkins
7/21/2022 • 19 minutes to read

Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering
software in short, frequent cycles through the use of automation pipelines. While this is by no means a new
process, having been ubiquitous in traditional software engineering for decades, it is becoming an increasingly
necessary process for data engineering and data science teams. In order for data products to be valuable, they
must be delivered in a timely manner. Additionally, consumers must have confidence in the validity of outcomes
within these products. By automating the building, testing, and deployment of code, development teams are
able to deliver releases more frequently and reliably than the more manual processes that are still prevalent
across many data engineering and data science teams.
Continuous integration begins with the practice of having you commit your code with some frequency to a
branch within a source code repository. Each commit is then merged with the commits from other developers to
ensure that no conflicts were introduced. Changes are further validated by creating a build and running
automated tests against that build. This process ultimately results in an artifact, or deployment bundle, that will
eventually be deployed to a target environment, in this case an Azure Databricks workspace.

Overview of a typical Azure Databricks CI/CD pipeline


Though it can vary based on your needs, a typical configuration for an Azure Databricks pipeline includes the
following steps:
Continuous integration:
1. Code
a. Develop code and unit tests in an Azure Databricks notebook or using an external IDE.
b. Manually run tests.
c. Commit code and tests to a git branch.
2. Build
a. Gather new and updated code and tests.
b. Run automated tests.
c. Build libraries and non-notebook Apache Spark code.
3. Release: Generate a release artifact.
Continuous deliver y:
1. Deploy
a. Deploy notebooks.
b. Deploy libraries.
2. Test: Run automated tests and report results.
3. Operate: Programmatically schedule data engineering, analytics, and machine learning workflows.

Develop and commit your code


One of the first steps in designing a CI/CD pipeline is deciding on a code commit and branching strategy to
manage the development and integration of new and updated code without adversely affecting the code
currently in production. Part of this decision involves choosing a version control system to contain your code
and facilitate the promotion of that code. Azure Databricks supports integrations with GitHub and Bitbucket,
which allow you to commit notebooks to a git repository.
If your version control system is not among those supported through direct notebook integration, or if you want
more flexibility and control than the self-service git integration, you can use the Databricks CLI to export
notebooks and commit them from your local machine. This script should be run from within a local git
repository that is set up to sync with the appropriate remote repository. When executed, this script should:
1. Check out the desired branch.
2. Pull new changes from the remote branch.
3. Export notebooks from the Azure Databricks workspace using the Databricks CLI.
4. Prompt the user for a commit message or use the default if one is not provided.
5. Commit the updated notebooks to the local branch.
6. Push the changes to the remote branch.
The following script performs these steps:

git checkout <branch>
git pull
databricks workspace export_dir --profile <profile> -o <path> ./Workspace

dt=`date '+%Y-%m-%d %H:%M:%S'`
msg_default="DB export on $dt"
read -p "Enter the commit comment [$msg_default]: " msg
msg=${msg:-$msg_default}
echo $msg

git add .
git commit -m "$msg"
git push

If you prefer to develop in an IDE rather than in Azure Databricks notebooks, you can use the VCS integration
features built into modern IDEs or the git CLI to commit your code.
Azure Databricks provides Databricks Connect, an SDK that connects IDEs to Azure Databricks clusters. This is
especially useful when developing libraries, as it allows you to run and unit test your code on Azure Databricks
clusters without having to deploy that code. See Databricks Connect limitations to determine whether your use
case is supported.

NOTE
Databricks now recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.

Depending on your branching strategy and promotion process, the point at which a CI/CD pipeline will initiate a
build will vary. However, committed code from various contributors will eventually be merged into a designated
branch to be built and deployed. Branch management steps run outside of Azure Databricks, using the interfaces
provided by the version control system.
There are numerous CI/CD tools you can use to manage and execute your pipeline. This article illustrates how to
use the Jenkins automation server. CI/CD is a design pattern, so the steps and stages outlined in this article
should transfer with a few changes to the pipeline definition language in each tool. Furthermore, much of the
code in this example pipeline runs standard Python code, which you can invoke in other tools.
For information about using Azure DevOps with Azure Databricks, see Continuous integration and delivery on
Azure Databricks using Azure DevOps.
Configure your agent
Jenkins uses a master service for coordination and one or more execution agents. In this example you use the
default permanent agent node included with the Jenkins server. You must manually install the following tools
and packages required by the pipeline on the agent, in this case the Jenkins server:
Conda: an open source Python environment management system.
Python 3.7.3: used to run tests, build a deployment wheel, and execute deployment scripts. The version of
Python is important as tests require that the version of Python running on the agent should match that of the
Azure Databricks cluster. This example uses Databricks Runtime 6.4, which includes Python 3.7.
Python libraries: requests , databricks-connect , databricks-cli , and pytest .

Design the pipeline


Jenkins provides a few different project types to create CI/CD pipelines. This example implements a Jenkins
Pipeline. Jenkins Pipelines provide an interface to define stages in a Pipeline using Groovy code to call and
configure Jenkins plugins.

You write a Pipeline definition in a text file (called a Jenkinsfile) which in turn is checked into a project’s source
control repository. For more information, see Jenkins Pipeline. Here is an example Pipeline:

// Jenkinsfile
node {
def GITREPO = "/var/lib/jenkins/workspace/${env.JOB_NAME}"
def GITREPOREMOTE = "https://github.com/<repo>"
def GITHUBCREDID = "<github-token>"
def CURRENTRELEASE = "<release>"
def DBTOKEN = "<databricks-token>"
def DBURL = "https://<databricks-instance>"
def SCRIPTPATH = "${GITREPO}/Automation/Deployments"
def NOTEBOOKPATH = "${GITREPO}/Workspace"
def LIBRARYPATH = "${GITREPO}/Libraries"
def BUILDPATH = "${GITREPO}/Builds/${env.JOB_NAME}-${env.BUILD_NUMBER}"
def OUTFILEPATH = "${BUILDPATH}/Validation/Output"
def TESTRESULTPATH = "${BUILDPATH}/Validation/reports/junit"
def WORKSPACEPATH = "/Shared/<path>"
def DBFSPATH = "dbfs:<dbfs-path>"
def CLUSTERID = "<cluster-id>"
def CONDAPATH = "<conda-path>"
def CONDAENV = "<conda-env>"

stage('Setup') {
withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
sh """#!/bin/bash
# Configure Conda environment for deployment & testing
source ${CONDAPATH}/bin/activate ${CONDAENV}

# Configure Databricks CLI for deployment


echo "${DBURL}
$TOKEN" | databricks configure --token

# Configure Databricks Connect for testing


echo "${DBURL}
$TOKEN
${CLUSTERID}
0
15001" | databricks-connect configure
"""
}
}
stage('Checkout') { // for display purposes
echo "Pulling ${CURRENTRELEASE} Branch from Github"
git branch: CURRENTRELEASE, credentialsId: GITHUBCREDID, url: GITREPOREMOTE
}
stage('Run Unit Tests') {
try {
sh """#!/bin/bash

# Enable Conda environment for tests


source ${CONDAPATH}/bin/activate ${CONDAENV}

# Python tests for libs


python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-libout.xml ${LIBRARYPATH}/python/dbxdemo/test*.py || true
"""
} catch(err) {
step([$class: 'JUnitResultArchiver', testResults: "**/reports/junit/*.xml"])
if (currentBuild.result == 'UNSTABLE')
currentBuild.result = 'FAILURE'
throw err
}
}
stage('Package') {
sh """#!/bin/bash

# Enable Conda environment for tests


source ${CONDAPATH}/bin/activate ${CONDAENV}

# Package Python library to wheel


cd ${LIBRARYPATH}/python/dbxdemo
python3 setup.py sdist bdist_wheel
"""
}
stage('Build Artifact') {
sh """mkdir -p ${BUILDPATH}/Workspace
mkdir -p ${BUILDPATH}/Libraries/python
mkdir -p ${BUILDPATH}/Validation/Output
#Get modified files
git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp --parents -r '{}' ${BUILDPATH}

# Get packaged libs


find ${LIBRARYPATH} -name '*.whl' | xargs -I '{}' cp '{}' ${BUILDPATH}/Libraries/python/

# Generate artifact
tar -czvf Builds/latest_build.tar.gz ${BUILDPATH}
"""
archiveArtifacts artifacts: 'Builds/latest_build.tar.gz'
}
stage('Deploy') {
sh """#!/bin/bash
sh """#!/bin/bash
# Enable Conda environment for tests
source ${CONDAPATH}/bin/activate ${CONDAENV}

# Use Databricks CLI to deploy notebooks


databricks workspace import_dir ${BUILDPATH}/Workspace ${WORKSPACEPATH}

dbfs cp -r ${BUILDPATH}/Libraries/python ${DBFSPATH}


"""
withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
sh """#!/bin/bash

#Get space delimited list of libraries


LIBS=\$(find ${BUILDPATH}/Libraries/python/ -name '*.whl' | sed 's#.*/##' | paste -sd " ")

# Script to uninstall, restart the cluster if needed, and install the library


python3 ${SCRIPTPATH}/installWhlLibrary.py --workspace=${DBURL}\
--token=$TOKEN\
--clusterid=${CLUSTERID}\
--libs=\$LIBS\
--dbfspath=${DBFSPATH}
"""
}
}
stage('Run Integration Tests') {
withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
sh """python3 ${SCRIPTPATH}/executenotebook.py --workspace=${DBURL}\
--token=$TOKEN\
--clusterid=${CLUSTERID}\
--localpath=${NOTEBOOKPATH}/VALIDATION\
--workspacepath=${WORKSPACEPATH}/VALIDATION\
--outfilepath=${OUTFILEPATH}
"""
}
sh """sed -i -e 's #ENV# ${OUTFILEPATH} g' ${SCRIPTPATH}/evaluatenotebookruns.py
python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-notebookout.xml ${SCRIPTPATH}/evaluatenotebookruns.py || true
"""
}
stage('Report Test Results') {
sh """find ${OUTFILEPATH} -name '*.json' -exec gzip --verbose {} \\;
touch ${TESTRESULTPATH}/TEST-*.xml
"""
junit "**/reports/junit/*.xml"
}
}

The remainder of this article discusses each step in the Pipeline.

Define environment variables


You can define environment variables to allow the Pipeline stages to be used in different Pipelines.

NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.

GITREPO : local path to the git repository root


GITREPOREMOTE : Web URL of git repository
GITHUBCREDID : Jenkins credential ID for the GitHub personal access token
CURRENTRELEASE : deployment branch
DBTOKEN : Jenkins credential ID for the Azure Databricks personal access token
DBURL : Web URL for Azure Databricks workspace
SCRIPTPATH : local path to the git project directory for automation scripts
NOTEBOOKPATH : local path to the git project directory for notebooks
LIBRARYPATH : local path to the git project directory for library code or other DBFS code
BUILDPATH : local path to the directory for build artifacts
OUTFILEPATH : local path to the JSON result files generated from automated tests
TESTRESULTPATH : local path to the directory for Junit test result summaries
WORKSPACEPATH : Azure Databricks workspace path for notebooks
DBFSPATH : Azure Databricks DBFS path for libraries and non-notebook code
CLUSTERID : Azure Databricks cluster ID to run tests
CONDAPATH : path to Conda installation
CONDAENV : name of the Conda environment containing build dependency libraries

Set up the pipeline


In the Setup stage you configure Databricks CLI and Databricks Connect with connection information.

def GITREPO = "/var/lib/jenkins/workspace/${env.JOB_NAME}"


def GITREPOREMOTE = "https://github.com/<repo>"
def GITHUBCREDID = "<github-token>"
def CURRENTRELEASE = "<release>"
def DBTOKEN = "<databricks-token>"
def DBURL = "https://<databricks-instance>"
def SCRIPTPATH = "${GITREPO}/Automation/Deployments"
def NOTEBOOKPATH = "${GITREPO}/Workspace"
def LIBRARYPATH = "${GITREPO}/Libraries"
def BUILDPATH = "${GITREPO}/Builds/${env.JOB_NAME}-${env.BUILD_NUMBER}"
def OUTFILEPATH = "${BUILDPATH}/Validation/Output"
def TESTRESULTPATH = "${BUILDPATH}/Validation/reports/junit"
def WORKSPACEPATH = "/Shared/<path>"
def DBFSPATH = "dbfs:<dbfs-path>"
def CLUSTERID = "<cluster-id>"
def CONDAPATH = "<conda-path>"
def CONDAENV = "<conda-env>"

stage('Setup') {
withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
sh """#!/bin/bash
# Configure Conda environment for deployment & testing
source ${CONDAPATH}/bin/activate ${CONDAENV}

# Configure Databricks CLI for deployment


echo "${DBURL}
$TOKEN" | databricks configure --token

# Configure Databricks Connect for testing


echo "${DBURL}
$TOKEN
${CLUSTERID}
0
15001" | databricks-connect configure
"""
}
}
Get the latest changes
The Checkout stage downloads code from the designated branch to the execution agent using a Jenkins
plugin:

stage('Checkout') { // for display purposes


echo "Pulling ${CURRENTRELEASE} Branch from Github"
git branch: CURRENTRELEASE, credentialsId: GITHUBCREDID, url: GITREPOREMOTE
}

Develop unit tests


There are a few different options when deciding how to unit test your code. For library code developed outside
an Azure Databricks notebook, the process is like traditional software development practices. You write a unit
test using a testing framework, like the Python pytest module, and JUnit-formatted XML files store the test
results.
The Azure Databricks process differs in that the code being tested is Apache Spark code intended to be executed
on a Spark cluster, which may run locally or, as in this case, on Azure Databricks. To accommodate this requirement,
you use Databricks Connect. Since the SDK was configured earlier, no changes to the test code are required to
execute the tests on Azure Databricks clusters. You installed Databricks Connect in a Conda virtual environment.
Once the Conda environment is activated, the tests are executed using the Python tool, pytest , to which you
provide the locations for the tests and the resulting output files.
Test library code using Databricks Connect

stage('Run Unit Tests') {


try {
sh """#!/bin/bash
# Enable Conda environment for tests
source ${CONDAPATH}/bin/activate ${CONDAENV}

# Python tests for libs


python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-libout.xml ${LIBRARYPATH}/python/dbxdemo/test*.py || true
"""
} catch(err) {
step([$class: 'JUnitResultArchiver', testResults: "**/reports/junit/*.xml"])
if (currentBuild.result == 'UNSTABLE')
currentBuild.result = 'FAILURE'
throw err
}
}

The following snippet is a library function that might be installed on an Azure Databricks cluster. It is a simple
function that adds a new column, populated by a literal, to an Apache Spark DataFrame.

# addcol.py
import pyspark.sql.functions as F

def with_status(df):
    return df.withColumn("status", F.lit("checked"))

This test passes a mock DataFrame object to the with_status function, defined in addcol.py . The result is then
compared to a DataFrame object containing the expected values. If the values match, which in this case they will,
the test passes.
# test-addcol.py
import pytest

from dbxdemo.spark import get_spark
from dbxdemo.appendcol import with_status

class TestAppendCol(object):

    def test_with_status(self):
        source_data = [
            ("paula", "white", "paula.white@example.com"),
            ("john", "baer", "john.baer@example.com")
        ]
        source_df = get_spark().createDataFrame(
            source_data,
            ["first_name", "last_name", "email"]
        )

        actual_df = with_status(source_df)

        expected_data = [
            ("paula", "white", "paula.white@example.com", "checked"),
            ("john", "baer", "john.baer@example.com", "checked")
        ]
        expected_df = get_spark().createDataFrame(
            expected_data,
            ["first_name", "last_name", "email", "status"]
        )

        assert(expected_df.collect() == actual_df.collect())
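
The test imports get_spark from a dbxdemo.spark helper module that this article does not show. A minimal
sketch of that helper, assuming it only needs to return the active Spark session (which Databricks Connect routes
to your configured cluster), might look like the following:

# spark.py (illustrative sketch of the dbxdemo.spark helper; not shown in the original project)
from pyspark.sql import SparkSession

def get_spark():
    # With Databricks Connect configured, the builder returns a session
    # whose jobs execute on the configured Azure Databricks cluster.
    return SparkSession.builder.getOrCreate()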

Package library code


In the Package stage you package the library code into a Python wheel.

stage('Package') {
sh """#!/bin/bash
# Enable Conda environment for tests
source ${CONDAPATH}/bin/activate ${CONDAENV}

# Package Python library to wheel


cd ${LIBRARYPATH}/python/dbxdemo
python3 setup.py sdist bdist_wheel
"""
}

Generate and store a deployment artifact


Building a deployment artifact for Azure Databricks involves gathering all the new or updated code to be
deployed to the appropriate Azure Databricks environment. In the Build Artifact stage you add the notebook
code to be deployed to the workspace, any whl libraries that were generated by the build process, as well as
the result summaries for the tests, for archiving purposes. To do this you use git diff to flag all new files that
have been included in the most recent git merge. This is only an example method, so the implementation in your
Pipeline may differ, but the objective is to add all files intended for the current release.
stage('Build Artifact') {
sh """mkdir -p ${BUILDPATH}/Workspace
mkdir -p ${BUILDPATH}/Libraries/python
mkdir -p ${BUILDPATH}/Validation/Output
#Get Modified Files
git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp --parents -r '{}' ${BUILDPATH}

# Get packaged libs


find ${LIBRARYPATH} -name '*.whl' | xargs -I '{}' cp '{}' ${BUILDPATH}/Libraries/python/

# Generate artifact
tar -czvf Builds/latest_build.tar.gz ${BUILDPATH}
"""
archiveArtifacts artifacts: 'Builds/latest_build.tar.gz'
}

Deploy artifacts
In the Deploy stage you use the Databricks CLI, which, like the Databricks Connect module used earlier, is
installed in your Conda environment, so you must activate it for this shell session. You use the Workspace CLI
and DBFS CLI to upload the notebooks and libraries, respectively:

databricks workspace import_dir <local build path> <remote workspace path>


dbfs cp -r <local build path> <remote dbfs path>

stage('Deploy') {
sh """#!/bin/bash
# Enable Conda environment for tests
source ${CONDAPATH}/bin/activate ${CONDAENV}

# Use Databricks CLI to deploy notebooks


databricks workspace import_dir ${BUILDPATH}/Workspace ${WORKSPACEPATH}

dbfs cp -r ${BUILDPATH}/Libraries/python ${DBFSPATH}


"""
withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
sh """#!/bin/bash

#Get space delimited list of libraries


LIBS=\$(find ${BUILDPATH}/Libraries/python/ -name '*.whl' | sed 's#.*/##' | paste -sd " ")

# Script to uninstall, restart the cluster if needed, and install the library


python3 ${SCRIPTPATH}/installWhlLibrary.py --workspace=${DBURL}\
--token=$TOKEN\
--clusterid=${CLUSTERID}\
--libs=\$LIBS\
--dbfspath=${DBFSPATH}
"""
}
}

Installing a new version of a library on an Azure Databricks cluster requires that you first uninstall the existing
library. To do this, you invoke the Databricks REST API in a Python script to perform the following steps:
1. Check if the library is installed.
2. Uninstall the library.
3. Restart the cluster if any uninstalls were performed.
a. Wait until the cluster is running again before proceeding.
4. Install the library.
# installWhlLibrary.py
#!/usr/bin/python3
import json
import requests
import sys
import getopt
import time

def main():
    workspace = ''
    token = ''
    clusterid = ''
    libs = ''
    dbfspath = ''

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:d:',
                                   ['workspace=', 'token=', 'clusterid=', 'libs=', 'dbfspath='])
    except getopt.GetoptError:
        print(
            'installWhlLibrary.py -s <workspace> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(
                'installWhlLibrary.py -s <workspace> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
            sys.exit()
        elif opt in ('-s', '--workspace'):
            workspace = arg
        elif opt in ('-t', '--token'):
            token = arg
        elif opt in ('-c', '--clusterid'):
            clusterid = arg
        elif opt in ('-l', '--libs'):
            libs = arg
        elif opt in ('-d', '--dbfspath'):
            dbfspath = arg

    print('-s is ' + workspace)
    print('-t is ' + token)
    print('-c is ' + clusterid)
    print('-l is ' + libs)
    print('-d is ' + dbfspath)

    libslist = libs.split()

    # Uninstall each library if it already exists on the cluster
    i = 0
    for lib in libslist:
        dbfslib = dbfspath + lib
        print(dbfslib + ' before:' + getLibStatus(workspace, token, clusterid, dbfslib))

        if (getLibStatus(workspace, token, clusterid, dbfslib) != 'not found'):
            print(dbfslib + " exists. Uninstalling.")
            i = i + 1
            values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}

            resp = requests.post(workspace + '/api/2.0/libraries/uninstall',
                                 data=json.dumps(values), auth=("token", token))
            runjson = resp.text
            d = json.loads(runjson)
            print(dbfslib + ' after:' + getLibStatus(workspace, token, clusterid, dbfslib))

    # Restart the cluster if any libraries were uninstalled
    if i > 0:
        values = {'cluster_id': clusterid}
        print("Restarting cluster:" + clusterid)
        resp = requests.post(workspace + '/api/2.0/clusters/restart',
                             data=json.dumps(values), auth=("token", token))
        restartjson = resp.text
        print(restartjson)

        # Poll until the cluster is running again (up to 10 checks, 30 seconds apart)
        p = 0
        waiting = True
        while waiting:
            time.sleep(30)
            clusterresp = requests.get(workspace + '/api/2.0/clusters/get?cluster_id=' + clusterid,
                                       auth=("token", token))
            clusterjson = clusterresp.text
            jsonout = json.loads(clusterjson)
            current_state = jsonout['state']
            print(clusterid + " state:" + current_state)
            if current_state in ['RUNNING', 'INTERNAL_ERROR', 'SKIPPED'] or p >= 10:
                break
            p = p + 1

    # Install the libraries
    for lib in libslist:
        dbfslib = dbfspath + lib
        print("Installing " + dbfslib)
        values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}

        resp = requests.post(workspace + '/api/2.0/libraries/install',
                             data=json.dumps(values), auth=("token", token))
        runjson = resp.text
        d = json.loads(runjson)
        print(dbfslib + ' after:' + getLibStatus(workspace, token, clusterid, dbfslib))

def getLibStatus(workspace, token, clusterid, dbfslib):

    resp = requests.get(workspace + '/api/2.0/libraries/cluster-status?cluster_id=' + clusterid,
                        auth=("token", token))
    libjson = resp.text
    d = json.loads(libjson)
    if (d.get('library_statuses')):
        statuses = d['library_statuses']

        for status in statuses:
            if (status['library'].get('whl')):
                if (status['library']['whl'] == dbfslib):
                    return status['status']
        # Library not present in the cluster's status list
        return "not found"
    else:
        # No libraries found
        return "not found"

if __name__ == '__main__':
    main()

Test notebook code using another notebook


Once the artifact has been deployed, it is important to run integration tests to ensure all the code is working
together in the new environment. To do this you can run a notebook containing asserts to test the deployment.
In this case you are using the same test you used in the unit test, but now it imports with_status from the
appendcol library in the wheel that was just installed on the cluster.
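The article does not show the validation notebook itself. As a rough sketch, a cell in such a notebook might look
like the following, assuming the installed dbxdemo wheel exposes the same with_status function tested earlier;
the DataFrame values are illustrative only. If the assert fails, the job run ends in error and the failure appears in
the JSON returned by the Jobs API.

# Illustrative sketch of a validation notebook cell (names assume the dbxdemo wheel installed above)
from dbxdemo.appendcol import with_status

source_df = spark.createDataFrame(
    [("paula", "white", "paula.white@example.com")],
    ["first_name", "last_name", "email"]
)

expected_df = spark.createDataFrame(
    [("paula", "white", "paula.white@example.com", "checked")],
    ["first_name", "last_name", "email", "status"]
)

# A failed assert surfaces as a failed notebook run in the run's JSON output.
assert with_status(source_df).collect() == expected_df.collect()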
To automate this test and include it in your CI/CD Pipeline, use the Databricks REST API to run the notebook
from the Jenkins server. This allows you to check whether the notebook run passed or failed using pytest . If the
asserts in the notebook fail, this will be shown in the JSON output returned by the REST API and subsequently in
the JUnit test results.
stage('Run Integration Tests') {
withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
sh """python3 ${SCRIPTPATH}/executenotebook.py --workspace=${DBURL}\
--token=$TOKEN\
--clusterid=${CLUSTERID}\
--localpath=${NOTEBOOKPATH}/VALIDATION\
--workspacepath=${WORKSPACEPATH}/VALIDATION\
--outfilepath=${OUTFILEPATH}
"""
}
sh """sed -i -e 's #ENV# ${OUTFILEPATH} g' ${SCRIPTPATH}/evaluatenotebookruns.py
python3 -m pytest --junit-xml=${TESTRESULTPATH}/TEST-notebookout.xml ${SCRIPTPATH}/evaluatenotebookruns.py || true
"""
}

This stage calls two Python automation scripts. The first script, executenotebook.py , runs the notebook using the
Create and trigger a one-time run ( POST /jobs/runs/submit ) endpoint which submits an anonymous job. Since
this endpoint is asynchronous, the script uses the run ID initially returned by the REST call to poll for the status of the
job. Once the job has completed, the JSON output is saved to the path specified by the function arguments
passed at invocation.

# executenotebook.py
#!/usr/bin/python3
import json
import requests
import os
import sys
import getopt
import time

def main():
    workspace = ''
    token = ''
    clusterid = ''
    localpath = ''
    workspacepath = ''
    outfilepath = ''

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:w:o:',
                                   ['workspace=', 'token=', 'clusterid=', 'localpath=', 'workspacepath=',
                                    'outfilepath='])
    except getopt.GetoptError:
        print(
            'executenotebook.py -s <workspace> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(
                'executenotebook.py -s <workspace> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
            sys.exit()
        elif opt in ('-s', '--workspace'):
            workspace = arg
        elif opt in ('-t', '--token'):
            token = arg
        elif opt in ('-c', '--clusterid'):
            clusterid = arg
        elif opt in ('-l', '--localpath'):
            localpath = arg
        elif opt in ('-w', '--workspacepath'):
            workspacepath = arg
        elif opt in ('-o', '--outfilepath'):
            outfilepath = arg

    print('-s is ' + workspace)
    print('-t is ' + token)
    print('-c is ' + clusterid)
    print('-l is ' + localpath)
    print('-w is ' + workspacepath)
    print('-o is ' + outfilepath)

    # Generate list of notebooks by walking the local path
    notebooks = []
    for path, subdirs, files in os.walk(localpath):
        for name in files:
            fullpath = path + '/' + name
            # removes localpath to repo but keeps workspace path
            fullworkspacepath = workspacepath + path.replace(localpath, '')

            name, file_extension = os.path.splitext(fullpath)
            if file_extension.lower() in ['.scala', '.sql', '.r', '.py']:
                row = [fullpath, fullworkspacepath, 1]
                notebooks.append(row)

    # Run each element in the list
    for notebook in notebooks:
        nameonly = os.path.basename(notebook[0])
        workspacepath = notebook[1]

        name, file_extension = os.path.splitext(nameonly)

        # workspace path without the file extension
        fullworkspacepath = workspacepath + '/' + name

        print('Running job for:' + fullworkspacepath)

        values = {'run_name': name, 'existing_cluster_id': clusterid, 'timeout_seconds': 3600,
                  'notebook_task': {'notebook_path': fullworkspacepath}}

        resp = requests.post(workspace + '/api/2.0/jobs/runs/submit',
                             data=json.dumps(values), auth=("token", token))
        runjson = resp.text
        print("runjson:" + runjson)
        d = json.loads(runjson)
        runid = d['run_id']

        # Poll for run completion (up to 12 checks, 10 seconds apart)
        i = 0
        waiting = True
        while waiting:
            time.sleep(10)
            jobresp = requests.get(workspace + '/api/2.0/jobs/runs/get?run_id=' + str(runid),
                                   data=json.dumps(values), auth=("token", token))
            jobjson = jobresp.text
            print("jobjson:" + jobjson)
            j = json.loads(jobjson)
            current_state = j['state']['life_cycle_state']
            runid = j['run_id']
            if current_state in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
                break
            i = i + 1

        # Save the JSON output for this run
        if outfilepath != '':
            file = open(outfilepath + '/' + str(runid) + '.json', 'w')
            file.write(json.dumps(j))
            file.close()

if __name__ == '__main__':
    main()
The second script, evaluatenotebookruns.py , defines the test_job_run function, which parses and evaluates the
JSON to determine if the assert statements within the notebook passed or failed. An additional test,
test_performance , catches tests that run longer than expected.

# evaluatenotebookruns.py
import unittest
import json
import glob
import os

class TestJobOutput(unittest.TestCase):

    test_output_path = '#ENV#'

    def test_performance(self):
        path = self.test_output_path
        statuses = []

        for filename in glob.glob(os.path.join(path, '*.json')):
            print('Evaluating: ' + filename)
            data = json.load(open(filename))
            duration = data['execution_duration']
            if duration > 100000:
                status = 'FAILED'
            else:
                status = 'SUCCESS'

            statuses.append(status)

        self.assertFalse('FAILED' in statuses)

    def test_job_run(self):
        path = self.test_output_path
        statuses = []

        for filename in glob.glob(os.path.join(path, '*.json')):
            print('Evaluating: ' + filename)
            data = json.load(open(filename))
            status = data['state']['result_state']
            statuses.append(status)

        self.assertFalse('FAILED' in statuses)

if __name__ == '__main__':
    unittest.main()

As seen earlier in the unit test stage, you use pytest to run the tests and generate the result summaries.
Publish test results
The JSON results are archived and the test results are published to Jenkins using the junit Jenkins plugin. This
enables you to visualize reports and dashboards related to the status of the build process.

stage('Report Test Results') {


sh """find ${OUTFILEPATH} -name '*.json' -exec gzip --verbose {} \\;
touch ${TESTRESULTPATH}/TEST-*.xml
"""
junit "**/reports/junit/*.xml"
}
At this point, the CI/CD pipeline has completed an integration and deployment cycle. By automating this process,
you can ensure that your code has been tested and deployed by an efficient, consistent, and repeatable process.
Databricks SQL CLI
7/21/2022 • 7 minutes to read

IMPORTANT
The Databricks SQL CLI is provided as-is and is not officially supported by Databricks through customer technical support
channels. Support, questions, and feature requests can be communicated through the Issues page of the
databricks/databricks-sql-cli repo on GitHub. Issues with the use of this code will not be answered or investigated by
Databricks Support.

The Databricks SQL command line interface (Databricks SQL CLI) enables you to run SQL queries on your
existing Databricks SQL warehouses from your terminal or Windows Command Prompt instead of from
locations such as the Databricks SQL editor or an Azure Databricks notebook. From the command line, you get
productivity features such as suggestions and syntax highlighting.

Requirements
At least one Databricks SQL warehouse. View your available warehouses. Create a warehouse, if you do not
already have one.
Your warehouse’s connection details. Specifically, you need the Server hostname and HTTP path values.
An Azure Databricks personal access token. Create a personal access token, if you do not already have one.
Python 3.7 or higher. To check whether you have Python installed, run the command python --version from
your terminal or Command Prompt. (On some systems, you may need to enter python3 instead.) Install
Python, if you do not have it already installed.
pip, the package installer for Python. Newer versions of Python install pip by default. To check whether you
have pip installed, run the command pip --version from your terminal or Command Prompt. (On some
systems, you may need to enter pip3 instead.) Install pip, if you do not have it already installed.
The Databricks SQL CLI package from the Python Packaging Index (PyPI). You can use pip to install the
Databricks SQL CLI package from PyPI by running pip install databricks-sql-cli or
python -m pip install databricks-sql-cli .
(Optional) A utility for creating and managing Python virtual environments, such as venv, virtualenv, or
pipenv. Virtual environments help to ensure that you are using the correct versions of Python and the
Databricks SQL CLI together. Setting up and using virtual environments is outside of the scope of this article.
For more information, see Creating Virtual Environments.

Authentication
You must provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, so that
the target warehouse is called with the proper access credentials. You can provide this information in several
ways:
In the dbsqlclirc settings file in its default location (or by specifying an alternate settings file through the
--clirc option each time you run a command with the Databricks SQL CLI). See Settings file.
By setting the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
See Environment variables.
By specifying the --hostname , --http-path , and --access-token options each time you run a command with
the Databricks SQL CLI. See Command options.
Whenever you run the Databricks SQL CLI, it looks for authentication details in the following order, stopping
when it finds the first set of details:
1. The --hostname , --http-path , and --access-token options.
2. The DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
3. The dbsqlclirc settings file in its default location (or an alternate settings file specified by the --clirc
option).
Settings file
To use the dbsqlclirc settings file to provide the Databricks SQL CLI with authentication details for your
Databricks SQL warehouse, run the Databricks SQL CLI for the first time, as follows:

dbsqlcli

The Databricks SQL CLI creates a settings file for you, at ~/.dbsqlcli/dbsqlclirc on Unix, Linux, and macOS, and
at %HOMEDRIVE%%HOMEPATH%\.dbsqlcli\dbsqlclirc or %USERPROFILE%\.dbsqlcli\dbsqlclirc on Windows. To
customize this file:
1. Use a text editor to open and edit the dbsqlclirc file.
2. Scroll to the following section:

# [credentials]
# host_name = ""
# http_path = ""
# access_token = ""

3. Remove the four # characters, and:


a. Next to host_name , enter your warehouse’s Server hostname value from the requirements between
the "" characters.
b. Next to http_path , enter your warehouse’s HTTP path value from the requirements between the ""
characters.
c. Next to access_token , enter your personal access token value from the requirements between the ""
characters.
For example:

[credentials]
host_name = "adb-12345678901234567.8.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/1abc2d3456e7f890a"
access_token = "dapi1234567890b2cd34ef5a67bc8de90fa12b"

4. Save the dbsqlclirc file.

Alternatively, instead of using the dbsqlclirc file in its default location, you can specify a file in a different
location by adding the --clirc command option and the path to the alternate file. That alternate file’s contents
must conform to the preceding syntax.
Environment variables
To use the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH , and DBSQLCLI_ACCESS_TOKEN environment variables to
provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, do the following:
Unix, Linux, and macOS
To set the environment variables for only the current terminal session, run the following commands. To set the
environment variables for all terminal sessions, enter the following commands into your shell’s startup file and
then restart your terminal. In the following commands, replace the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.

export DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
export DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
export DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"

Windows
To set the environment variables for only the current Command Prompt session, run the following commands,
replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.

set DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
set DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
set DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"

To set the environment variables for all Command Prompt sessions, run the following commands and then
restart your Command Prompt, replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.

setx DBSQLCLI_HOST_NAME "adb-12345678901234567.8.azuredatabricks.net"


setx DBSQLCLI_HTTP_PATH "/sql/1.0/warehouses/1abc2d3456e7f890a"
setx DBSQLCLI_ACCESS_TOKEN "dapi1234567890b2cd34ef5a67bc8de90fa12b"

Command options
To use the --hostname , --http-path , and --access-token options to provide the Databricks SQL CLI with
authentication details for your Databricks SQL warehouse, do the following:
Every time you run a command with the Databricks SQL CLI:
Specify the --hostname option and your warehouse’s Server hostname value from the requirements.
Specify the --http-path option and your warehouse’s HTTP path value from the requirements.
Specify the --access-token option and your personal access token value from the requirements.

For example:

dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2" \


--hostname "adb-12345678901234567.8.azuredatabricks.net" \
--http-path "/sql/1.0/warehouses/1abc2d3456e7f890a" \
--access-token "dapi1234567890b2cd34ef5a67bc8de90fa12b"

Query sources
The Databricks SQL CLI enables you to run queries in the following ways:
From a query string.
From a file.
In a read-evaluate-print loop (REPL) approach. This approach provides suggestions as you type.
Query string
To run a query as a string, use the -e option followed by the query, represented as a string. For example:

dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2"

Output:

_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31

To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:

dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2" --table-format ascii

Output:

+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+

For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
File
To run a file that contains SQL, use the -e option followed by the path to a .sql file. For example:

dbsqlcli -e my-query.sql

Contents of the example my-query.sql file:

SELECT * FROM default.diamonds LIMIT 2;

Output:

_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31

To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:
dbsqlcli -e my-query.sql --table-format ascii

Output:

+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+

For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
REPL
To enter read-evaluate-print loop (REPL) mode scoped to the default database, run the following command:

dbsqlcli

You can also enter REPL mode scoped to a specific database, by running the following command:

dbsqlcli <database-name>

For example:

dbsqlcli default

To exit REPL mode, run the following command:

exit

In REPL mode, you can use the following characters and keys:
Use the semicolon ( ; ) to end a line.
Use F3 to toggle multiline mode.
Use the spacebar to show suggestions at the insertion point, if suggestions are not already displayed.
Use the up and down arrows to navigate suggestions.
Use the right arrow to complete the highlighted suggestion.
For example:
dbsqlcli default

hostname:default> SELECT * FROM diamonds LIMIT 2;

+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+

2 rows in set
Time: 0.703s

hostname:default> exit

Additional resources
Databricks SQL CLI README
DataGrip integration with Azure Databricks
7/21/2022 • 5 minutes to read

DataGrip is an integrated development environment (IDE) for database developers that provides a query
console, schema navigation, explain plans, smart code completion, real-time analysis and quick fixes,
refactorings, version control integration, and other features.
This article describes how to use your local development machine to install, configure, and use DataGrip to work
with databases in Azure Databricks.

NOTE
This article was tested with macOS, Databricks JDBC Driver version 2.6.25, and DataGrip version 2021.1.1.

Requirements
Before you install DataGrip, your local development machine must meet the following requirements:
A Linux, macOS, or Windows operating system.
Download the Databricks JDBC Driver onto your local development machine, extracting the
DatabricksJDBC42.jar file from the downloaded DatabricksJDBC42-<version>.zip file.
An Azure Databricks cluster or SQL warehouse to connect DataGrip to.

Step 1: Install DataGrip


Download and install DataGrip.
Linux : Download the .zip file, extract its contents, and then follow the instructions in the
Install-Linux-tar.txt file.
macOS : Download and run the .dmg file.
Windows : Download and run the .exe file.

For more information, see Install DataGrip on the DataGrip website.

Step 2: Configure the Databricks JDBC Driver for DataGrip


Set up DataGrip with information about the Databricks JDBC Driver that you downloaded earlier.
1. Start DataGrip.
2. Click File > Data Sources .
3. In the Data Sources and Drivers dialog box, click the Drivers tab.
4. Click the + (Driver ) button to add a driver.
5. For Name , enter Databricks .
6. On the General tab, in the Driver Files list, click the + (Add ) button.
7. Click Custom JARs .
8. Browse to and select the DatabricksJDBC42.jar file that you extracted earlier, and then click Open .
9. For Class , select com.databricks.client.jdbc.Driver .
10. Click OK .
Step 3: Connect DataGrip to your Azure Databricks databases
Use DataGrip to connect to the cluster or SQL warehouse that you want to use to access the databases in your
Azure Databricks workspace.
1. In DataGrip, click File > Data Sources .
2. On the Data Sources tab, click the + (Add ) button.
3. Select the Databricks driver that you added in the preceding step.
4. On the General tab, for URL , enter the value of the JDBC URL field for your Azure Databricks resource
as follows:
Cluster
a. Find the JDBC URL field value on the JDBC/ODBC tab within the Advanced Options area for your
cluster. The JDBC URL should look similar to this one:

jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/1234567890123456/1234-567890-reef123;AuthMech=3;UID=token;PWD=<personal-access-token>

IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.

b. Replace <personal-access-token> with your personal access token for the Azure Databricks
workspace.

TIP
If you do not want to store your personal access token on your local development machine, omit
UID=token;PWD=<personal-access-token> from the JDBC URL and, in the Save list, choose Never . You
will be prompted for your User ( token ) and Password (your personal access token) each time you try
to connect.

c. For Name , enter Databricks cluster .


For more information, see Data sources and drivers dialog on the DataGrip website.
SQL warehouse
a. Find the JDBC URL field value on the Connection Details tab for your SQL warehouse. The JDBC
URL should look similar to this one:

jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/a123456bcde7f890;

IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.
b. For User , enter token .
c. For Password , enter your personal access token.

TIP
If you do not want to store your personal access token on your local development machine, leave User
and Password blank and, in the Save list, select Never . You will be prompted for your User (the word
token ) and Password (your personal access token) each time you try to connect.

d. For Name , enter Databricks SQL warehouse .


For more information, see Data sources and drivers dialog on the DataGrip website.
5. Click Test Connection .

TIP
You should start your resource before testing your connection. Otherwise the test might take several minutes to
complete while the resource starts.

6. If the connection succeeds, on the Schemas tab, check the boxes for the schemas that you want to be
able to access, for example default .
7. Click OK .
Repeat the instructions in this step for each resource that you want DataGrip to access.

Step 4: Use DataGrip to browse tables


Use DataGrip to access tables in your Azure Databricks workspace.
1. In DataGrip, in the Database window, expand your resource node, expand the schema you want to browse,
and then expand tables .
2. Double-click a table. The first set of rows from the table are displayed.
Repeat the instructions in this step to access additional tables.
To access tables in other schemas, in the Database window’s toolbar, click the Data Source Properties icon. In
the Data Sources and Drivers dialog box, on the Schemas tab, check the box for each additional schema you
want to access, and then click OK .

Step 5: Use DataGrip to run SQL statements


Use DataGrip to load the sample diamonds table from the Sample datasets (databricks-datasets) into the
default database in your workspace and then query the table. For more information, see Create a table. If
you do not want to load a sample table, skip ahead to Next steps.
1. In DataGrip, in the Database window, with the default schema expanded, click File > New > SQL File .
2. Enter a name for the file, for example create_diamonds .
3. In the file tab, enter these SQL statements, which delete a table named diamonds if it exists and then
create a table named diamonds based on the contents of the CSV file at the specified Databricks File
System (DBFS) mount point:
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true");

4. Select the DROP TABLE statement.


5. On the file tab’s toolbar, click the Execute icon.
6. Select DROP TABLE IF EXISTS diamonds; CREATE TABLE diamon... from the drop-down list.

TIP
To change what happens when you click the Execute icon, select Customize in the drop-down list.

7. In the Database window, double-click the diamonds table to see its data. If the diamonds table is not
displayed, click the Refresh button in the window’s toolbar.
To delete the diamonds table:
1. In DataGrip, in the Database window’s toolbar, click the Jump to Query Console button.
2. Select console (Default) .
3. In the console tab, enter this SQL statement:

DROP TABLE diamonds;

4. Select the DROP TABLE statement.


5. On the console tab’s toolbar, click the Execute icon. The diamonds table disappears from the list of tables.
If the diamonds table does not disappear, click the Refresh button in the Database window’s toolbar.

Next steps
Learn more about the Query console in DataGrip.
Learn about the Data editor in DataGrip.
Learn more about the various tool windows in DataGrip.
Learn how to search in DataGrip.
Learn how to export data in DataGrip.
Learn how to find and replace text using regular expressions in DataGrip.

Additional resources
DataGrip documentation
DataGrip Support
DBeaver integration with Azure Databricks
7/21/2022 • 6 minutes to read

DBeaver is a local, multi-platform database tool for developers, database administrators, data analysts, data
engineers, and others who need to work with databases. DBeaver supports Azure Databricks as well as other
popular databases.
This article describes how to use your local development machine to install, configure, and use the free, open
source DBeaver Community Edition (CE) to work with databases in Azure Databricks.

NOTE
This article was tested with macOS, Databricks JDBC Driver version 2.6.25, and DBeaver CE version 22.1.0.

Requirements
Before you install DBeaver, your local development machine must meet the following requirements:
A Linux 64-bit, macOS, or Windows 64-bit operating system. (Linux 32-bit is supported but not
recommended.)
The Databricks JDBC Driver downloaded onto your local development machine, with the DatabricksJDBC42.jar file
extracted from the downloaded DatabricksJDBC42-<version>.zip file.
You must also have an Azure Databricks cluster or SQL warehouse to connect DBeaver to.

Step 1: Install DBeaver


Download and install DBeaver CE as follows:
Linux : Download and run one of the Linux installers from the Download page on the DBeaver website. snap
and flatpak installation options are provided on this page as well.
macOS : Use Homebrew to run brew install --cask dbeaver-community , or use MacPorts to run
sudo port install dbeaver-community . A macOS installer is also available from the Download page on the
DBeaver website.
Windows : Use Chocolatey to run choco install dbeaver . A Windows installer is also available from the
Download page on the DBeaver website.

Step 2: Configure the Databricks JDBC Driver for DBeaver


Set up DBeaver with information about the Databricks JDBC Driver that you downloaded earlier.
1. Start DBeaver.
2. If you are prompted to create a new database, click No .
3. If you are prompted to connect to or select a database, click Cancel .
4. Click Database > Driver Manager .
5. In the Driver Manager dialog box, click New .
6. In the Create new driver dialog box, click the Libraries tab.
7. Click Add File .
8. Browse to the DatabricksJDBC42.jar file that you extracted earlier and click Open .
9. Click Find Class .
10. In the Driver class list, confirm that com.databricks.client.jdbc.Driver is selected.
11. On the Settings tab, for Driver Name , enter Databricks .
12. On the Settings tab, for Class Name , enter com.databricks.client.jdbc.Driver .
13. Click OK .
14. In the Driver Manager dialog box, click Close .

Step 3: Connect DBeaver to your Azure Databricks databases


Use DBeaver to connect to the cluster or SQL warehouse to access the databases in your Azure Databricks
workspace.
1. In DBeaver, click Database > New Database Connection .
2. In the Connect to a database dialog box, on the All tab, click Databricks , and then click Next .
3. Click the Main tab and enter a value in the JDBC URL field for your Azure Databricks resource:
Cluster
a. Find the JDBC URL field value on the JDBC/ODBC tab within the Advanced Options area for your
cluster. The JDBC URL should look similar to this one:
jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/1234567890123456/1234-567890-reef123;AuthMech=3;UID=token;PWD=<personal-access-token>

IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.

b. Replace <personal-access-token> with your personal access token for the Azure Databricks
workspace.
c. Check Save password locally .

TIP
If you do not want to store your personal access token on your local development machine, omit
UID=token;PWD=<personal-access-token> from the JDBC URL and uncheck Save password locally . You will
be prompted for your Username ( token ) and Password (your personal access token) each time you try to
connect.

SQL warehouse
a. Find the JDBC URL field value on the Connection Details tab for your SQL warehouse. The JDBC
URL should look similar to this one:

jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/a123456bcde7f890;

IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.

b. For Username , enter token .


c. For Password , enter your personal access token.
d. Check Save password locally .
TIP
If you do not want to store your personal access token on your local development machine, leave Username and
Password blank and uncheck Save password locally . You will be prompted for your Username (the word
token ) and Password (your personal access token) each time you try to connect.

4. Click Test Connection .

TIP
You should start your Azure Databricks resource before testing your connection. Otherwise the test might take
several minutes to complete while the resource starts.

5. If the connection succeeds, in the Connection Test dialog box, click OK .


6. In the Connect to a database dialog box, click Finish .
In the Database Navigator window, a Databricks entry is displayed. To change the connection’s name to
make it easier to identify:
1. Right-click Databricks , and then click Edit Connection .
2. In the Connection configuration dialog box, click General .
3. For Connection name , replace Databricks with a different name for the connection.
4. Click OK .
Repeat the instructions in this step for each resource that you want DBeaver to access.

Step 4: Use DBeaver to browse data objects


Use DBeaver to access data objects in your Azure Databricks workspace such as tables and table properties,
views, indexes, data types, and other data object types.
1. In DBeaver, in the Database Navigator window, right-click the connection that you want to use.
2. If Connect is enabled, click it. (If Connect is disabled, you are already connected.)

TIP
You should start your resource before trying to connect to it. Otherwise the connection might take several
minutes to complete while the resource starts.

3. Expand the connection that you just connected to.


4. Expand and browse available data objects. Double-click a data object to get more information about it.
Repeat the instructions in this step to access additional data objects.

Step 5: Use DBeaver to run SQL statements


Use DBeaver to load the sample diamonds table from the Sample datasets (databricks-datasets) into the
default database in your workspace and then query the table. For more information, see Create a table. If you
do not want to load a sample table, skip ahead to Next steps.
1. In DBeaver, in the Database Navigator window, right-click the connection that you want to use.
2. If Connect is enabled, click it. (If Connect is disabled, you are already connected.)
TIP
You should start your resource before trying to connect to it. Otherwise the connection might take several
minutes to complete while the resource starts.

3. Click SQL Editor > New SQL Script .


4. On the (connection-name) Script-1 tab, enter these SQL statements, which delete a table named
diamonds if it exists and then create a table named diamonds based on the contents of the CSV file in
the Databricks File System (DBFS) mount point:

DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true");

5. Click SQL Editor > Execute SQL Statement .


6. In the Database Navigator window, expand the default database and click Refresh .
7. Expand Tables , and then double-click diamonds .
8. Within the diamonds tab, click the Data tab to see the table’s data.
To delete the diamonds table:
1. Click SQL Editor > New SQL Script .
2. On the (connection-name) Script-2 tab, enter this SQL statement, which deletes the diamonds table.

DROP TABLE IF EXISTS diamonds;

3. On the SQL Editor menu, click Execute SQL Statement .


4. In the Database Navigator window, right-click the default database and then click Refresh . The
diamonds table disappears from the list of tables.

Next steps
Use the Database object editor to work with database object properties, data, and entity relation diagrams.
Use the Data editor to view and edit data in a database table or view.
Use the SQL editor to work with SQL scripts.
Work with entity relation diagrams (ERDs) in DBeaver.
Import and export data into and from DBeaver.
Migrate data using DBeaver.
Troubleshoot JDBC driver issues with DBeaver.

Additional resources
DBeaver documentation
DBeaver support
DBeaver editions
CloudBeaver
Service principals for Azure Databricks automation
7/21/2022 • 6 minutes to read

A service principal is an identity created for use with automated tools and systems including scripts, apps, and
CI/CD platforms.
As a security best practice, Databricks recommends using an Azure AD service principal and its Azure AD token
instead of your Azure Databricks user or your Azure Databricks personal access token for your workspace user
to give automated tools and systems access to Azure Databricks resources. Some benefits to this approach
include the following:
You can grant and restrict access to Azure Databricks resources for an Azure AD service principal
independently of a user. For instance, this allows you to prohibit an Azure AD service principal from acting as
an admin in your Azure Databricks workspace while still allowing other specific users in your workspace to
continue to act as admins.
Users can safeguard their access tokens from being accessed by automated tools and systems.
You can temporarily disable or permanently delete an Azure AD service principal without impacting other
users. For instance, this allows you to pause or remove access from an Azure AD service principal that you
suspect is being used in a malicious way.
If a user leaves your organization, you can remove that user without impacting any Azure AD service
principal.
To create an Azure AD service principal, you use these tools and APIs:
You create an Azure AD service principal with tools such as the Azure portal.
You create an Azure AD token for an Azure AD service principal with tools such as curl and Postman.
After you create an Azure AD service principal, you add it to your Azure Databricks workspace with the SCIM
API 2.0 (ServicePrincipals). To call this API, you can also use tools such as curl and Postman. You cannot use
the Azure Databricks user interface.
This article describes how to:
1. Add an Azure AD service principal to your Azure Databricks workspace.
2. Create an Azure AD token for the Azure AD service principal.
To create an Azure AD service principal, see Provision a service principal in Azure portal.

Requirements
Access to the Azure portal.
One of the following, which enables you to call the Azure Databricks APIs:
An Azure Databricks personal access token for your Azure Databricks workspace user.
An Azure AD token for your Azure AD application.
A tool to call the Azure Databricks APIs, such as curl or Postman.
If you want to call the Azure Databricks APIs with Postman, note that instead of entering your Azure Databricks
workspace instance name (for example adb-1234567890123456.7.azuredatabricks.net ) and your Azure Databricks
personal access token for your workspace user for every Postman example in this article, you can define and
reuse variables in Postman instead.
If you want to call the Azure Databricks APIs with curl , this article’s curl examples use two environment
variables, DATABRICKS_HOST and DATABRICKS_TOKEN , representing your Azure Databricks per-workspace URL, for
example https://adb-1234567890123456.7.azuredatabricks.net ; and your Azure Databricks personal access token
for your workspace user. To set these environment variables, do the following:
Unix, Linux, and macOS
To set the environment variables for only the current terminal session, run the following commands. To set the
environment variables for all terminal sessions, enter the following commands into your shell’s startup file and
then restart your terminal. Replace the example values here with your own values.

export DATABRICKS_HOST="https://adb-12345678901234567.8.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"

Windows
To set the environment variables for only the current Command Prompt session, run the following commands.
Replace the example values here with your own values.

set DATABRICKS_HOST="https://adb-12345678901234567.8.azuredatabricks.net"
set DATABRICKS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"

To set the environment variables for all Command Prompt sessions, run the following commands and then
restart your Command Prompt. Replace the example values here with your own values.

setx DATABRICKS_HOST "https://adb-12345678901234567.8.azuredatabricks.net"


setx DATABRICKS_TOKEN "dapi1234567890b2cd34ef5a67bc8de90fa12b"

If you want to call the Azure Databricks APIs with curl , also note the following:
This article’s curl examples use shell command formatting for Unix, Linux, and macOS. For the Windows
Command shell, replace \ with ^ , and replace ${...} with %...% .
You can use a tool such as jq to format the JSON-formatted output of curl for easier reading and querying.
This article’s curl examples use jq to format the JSON output.
If you work with multiple Azure Databricks workspaces, instead of constantly changing the DATABRICKS_HOST
and DATABRICKS_TOKEN variables, you can use a .netrc file. If you use a .netrc file, modify this article’s curl
examples as follows:
Change curl -X to curl --netrc -X
Replace ${DATABRICKS_HOST} with your Azure Databricks per-workspace URL, for example
https://adb-1234567890123456.7.azuredatabricks.net
Remove --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \

Add an Azure AD service principal to an Azure Databricks workspace


Step 1: Create the Azure AD service principal
If you already have an Azure AD service principal available, skip ahead to Step 2.
To create an Azure AD service principal, follow the instructions in Provision a service principal in Azure portal.
After you create the Azure AD service principal, copy the following values for the Azure AD service principal, as
you will need them in later steps.
Application (client) ID
Director y (tenant) ID
The client secret’s Value
Step 2: Add the Azure AD service principal to the Azure Databricks workspace
You can use tools such as curl and Postman to add the Azure AD service principal to your Azure Databricks
workspace. In the following instructions, replace:
<application-client-id> with the Application (client) ID value for the Azure AD service principal.
<display-name> with a display name for the Azure AD service principal.
The entitlements array with any additional entitlements for the Azure AD service principal. This example
grants the Azure AD service principal the ability to create clusters. Workspace access and Databricks SQL
access is granted to the Azure AD service principal by default.
<group-id> with the group ID for any group in your Azure Databricks workspace that you want the Azure AD
service principal to belong to. (It can be easier to set access permissions on groups instead of each Azure AD
service principal individually.)
To add additional groups, add each group ID to the groups array.
To get a group ID, call Get groups.
To create a group, Add a group with the user interface or call the Create group API.
To add access permissions to a group, see Manage workspace-level groups for user interface options
or call the Permissions API 2.0.
To not add the Azure AD service principal to any groups, remove the groups array.
Curl
Run the following command. Make sure the add-service-principal.json file is in the same directory where you
run this command.

curl -X POST \
  ${DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals \
  --header "Content-type: application/scim+json" \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --data @add-service-principal.json \
  | jq .

add-service-principal.json :

{
  "applicationId": "<application-client-id>",
  "displayName": "<display-name>",
  "entitlements": [
    {
      "value": "allow-cluster-create"
    }
  ],
  "groups": [
    {
      "value": "<group-id>"
    }
  ],
  "schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
  "active": true
}

Postman
1. Create a new HTTP request (File > New > HTTP Request ).
2. In the HTTP verb drop-down list, select POST .
3. For Enter request URL , enter
https://<databricks-instance-name>/api/2.0/preview/scim/v2/ServicePrincipals , where
<databricks-instance-name> is your Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .

4. On the Authorization tab, in the Type list, select Bearer Token .


5. For Token , enter your Azure Databricks personal access token for your workspace user.
6. On the Headers tab, add the Key and Value pair of Content-Type and application/scim+json

7. On the Body tab, select raw and JSON .


8. Enter the following body payload:

{
  "applicationId": "<application-client-id>",
  "displayName": "<display-name>",
  "entitlements": [
    {
      "value": "allow-cluster-create"
    }
  ],
  "groups": [
    {
      "value": "<group-id>"
    }
  ],
  "schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
  "active": true
}

9. Click Send .
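
If you prefer to script this call rather than use curl or Postman, the same SCIM request can be sent from any HTTP client. The following is a minimal Python sketch that assumes the requests library and the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables described earlier; the application ID, display name, and group ID placeholders are the same ones you would substitute in the curl and Postman examples.

import json
import os

import requests

# Reuse the environment variables described earlier in this article.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Placeholder values; replace them with your own, as in the curl and Postman examples.
payload = {
    "applicationId": "<application-client-id>",
    "displayName": "<display-name>",
    "entitlements": [{"value": "allow-cluster-create"}],
    "groups": [{"value": "<group-id>"}],
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
    "active": True,
}

response = requests.post(
    f"{host}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={
        "Content-Type": "application/scim+json",
        "Authorization": f"Bearer {token}",
    },
    data=json.dumps(payload),
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))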

Create an Azure AD token for an Azure AD service principal


To create an Azure AD token for an Azure AD service principal, follow the instructions in Get an Azure AD access
token.
After you create the Azure AD token, copy the access_token value, as you will need to provide it to your script,
app, or system.
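
As a rough illustration of that flow, the following Python sketch requests an Azure AD token with the OAuth 2.0 client credentials grant. It assumes the requests library and the tenant ID, application (client) ID, and client secret values that you copied earlier; the scope uses the well-known Azure AD application ID for Azure Databricks.

import requests

# Values copied earlier from the Azure AD service principal; the placeholders are examples only.
tenant_id = "<directory-tenant-id>"
client_id = "<application-client-id>"
client_secret = "<client-secret-value>"

# 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the Azure AD application ID for Azure Databricks.
response = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
    },
)
response.raise_for_status()
access_token = response.json()["access_token"]
print(access_token)
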
Azure Databricks for Python developers
7/21/2022 • 8 minutes to read

This section provides a guide to developing notebooks and jobs in Azure Databricks using the Python language.
The first subsection provides links to tutorials for common workflows and tasks. The second subsection provides
links to APIs, libraries, and key tools.
A basic workflow for getting started is:
Import code: Either import your own code from files or Git repos or try a tutorial listed below. Databricks
recommends learning using interactive Databricks Notebooks.
Run your code on a cluster: Either create a cluster of your own, or ensure you have permissions to use a
shared cluster. Attach your notebook to the cluster, and run the notebook.
Beyond this, you can branch out into more specific topics:
Work with larger data sets using Apache Spark
Add visualizations
Automate your workload as a job
Use machine learning to analyze your data
Develop in IDEs

Tutorials
The tutorials below provide example code and notebooks to learn about common workflows. See Import a
notebook for instructions on importing notebook examples into your workspace.
Interactive data science and machine learning
Getting started with Apache Spark DataFrames for data preparation and analytics: Introduction to
DataFrames - Python
End-to-end example of building machine learning models on Azure Databricks. For additional examples, see
10-minute tutorials: Get started with machine learning on Azure Databricks and the MLflow guide’s
Quickstart Python.
Databricks AutoML lets you get started quickly with developing machine learning models on your own
datasets. Its glass-box approach generates notebooks with the complete machine learning workflow, which
you may clone, modify, and rerun.
Data engineering
Introduction to DataFrames - Python provides a walkthrough and FAQ to help you learn about Apache Spark
DataFrames for data preparation and analytics.
Delta Lake quickstart.
Delta Live Tables quickstart provides a walkthrough of Delta Live Tables to build and manage reliable data
pipelines, including Python examples.
Production machine learning and machine learning operations
MLflow Model Registry example
End-to-end example of building machine learning models on Azure Databricks

Reference
The subsections below list key features and tips to help you begin developing in Azure Databricks with Python.
Python APIs
Python code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have
existing code, just import it into Databricks to get started. See Manage code with notebooks and Databricks
Repos below for details.
Databricks can run both single-machine and distributed Python workloads. For single-machine computing, you
can use Python APIs and libraries as usual; for example, pandas and scikit-learn will “just work.” For distributed
Python workloads, Databricks offers two popular APIs out of the box: the Pandas API on Spark and PySpark.
Pandas API on Spark

NOTE
The Koalas open-source project now recommends switching to the Pandas API on Spark. The Pandas API on Spark is
available on clusters that run Databricks Runtime 10.0 (Unsupported) and above. For clusters that run Databricks
Runtime 9.1 LTS and below, use Koalas instead.

pandas is a Python package commonly used by data scientists for data analysis and manipulation. However,
pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs
that work on Apache Spark. This open-source API is an ideal choice for data scientists who are familiar with
pandas but not Apache Spark.
PySpark API
PySpark is the official Python API for Apache Spark. This API provides more flexibility than the Pandas API on
Spark. These links provide an introduction to and reference for PySpark.
Introduction to DataFrames
Introduction to Structured Streaming
PySpark API reference
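
As a minimal, hypothetical illustration of the PySpark DataFrame API (in an Azure Databricks notebook the spark session already exists, so getOrCreate() simply returns it):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In an Azure Databricks notebook, `spark` is predefined; getOrCreate() returns that same session.
spark = SparkSession.builder.getOrCreate()

# A tiny example DataFrame; real workloads typically read from tables or files instead.
df = spark.createDataFrame(
    [("Ideal", 326), ("Premium", 326), ("Good", 327)],
    ["cut", "price"],
)

# A distributed aggregation expressed with the DataFrame API.
df.groupBy("cut").agg(F.avg("price").alias("avg_price")).show()
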
Manage code with notebooks and Databricks Repos
Databricks notebooks support Python. These notebooks provide functionality similar to that of Jupyter, but with
additions such as built-in visualizations using big data, Apache Spark integrations for debugging and
performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by
importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the
notebook.

TIP
To completely reset the state of your notebook, it can be useful to restart the IPython kernel. For Jupyter users, the
“restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the
kernel in a Python notebook, click on the cluster dropdown in the upper-left and click Detach & Re-attach . This
detaches the notebook from your cluster and reattaches it, which restarts the Python process.

Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos
helps with code versioning and collaboration, and it can simplify importing a full repository of code into Azure
Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a
remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to
a cluster, and run the notebook.
Clusters and libraries
Azure Databricks Clusters provide compute management for both single nodes and large clusters. You can
customize cluster hardware and libraries according to your needs. Data scientists will generally begin work
either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach
a notebook to the cluster or run a job on the cluster.
For small workloads which only require single nodes, data scientists can use Single Node clusters for cost
savings.
For detailed tips, see Best practices: Cluster configuration
Administrators can set up cluster policies to simplify and guide cluster creation.
Azure Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box,
including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom
Python libraries to use with notebooks and jobs.
Start with the default libraries in the Databricks Runtime. Use the Databricks Runtime for Machine Learning
for machine learning workloads. For full lists of pre-installed libraries, see Databricks runtime releases.
Customize your environment using Notebook-scoped Python libraries, which allow you to modify your
notebook or job environment with libraries from PyPI or other repositories. The %pip install my_library
magic command installs my_library to all nodes in your currently attached cluster, yet does not interfere
with other workloads on shared clusters.
Install non-Python libraries as Cluster libraries as needed.
For more details, see Libraries.
Visualizations
Azure Databricks Python notebooks have built-in support for many types of visualizations. You can also use
legacy visualizations.
You can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you
can install custom libraries as well. Popular options include:
Bokeh
Matplotlib
Plotly
Jobs
You can automate Python workloads as scheduled or triggered Jobs in Databricks. Jobs can run notebooks,
Python scripts, and Python wheels.
For details on creating a job via the UI, see Create a job.
The Jobs API 2.1 allows you to create, edit, and delete jobs.
The Jobs CLI provides a convenient command line interface for calling the Jobs API.

TIP
To schedule a Python script instead of a notebook, use the spark_python_task field under tasks in the body of a
create job request.
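
For example, the following hypothetical sketch submits such a request to the Jobs API 2.1 with the requests library; the job name, cluster ID, script path, and parameters are placeholders, and DATABRICKS_HOST and DATABRICKS_TOKEN are assumed to hold your per-workspace URL and a personal access token.

import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Hypothetical job definition: run a Python script stored in DBFS on an existing cluster.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run-script",
            "existing_cluster_id": "<cluster-id>",
            "spark_python_task": {
                "python_file": "dbfs:/scripts/etl.py",
                "parameters": ["--date", "2022-07-21"],
            },
        }
    ],
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    data=json.dumps(job_spec),
)
response.raise_for_status()
print(response.json())  # Contains the job_id of the new job on success.
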

Machine learning
Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular
data, deep learning for computer vision and natural language processing, recommendation systems, graph
analytics, and more. For general information about machine learning on Databricks, see the Databricks Machine
Learning guide.
For ML algorithms, you can use pre-installed libraries in the Databricks Runtime for Machine Learning, which
includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost.
You can also install custom libraries.
For machine learning operations (MLOps), Azure Databricks provides a managed service for the open source
library MLflow. MLflow Tracking lets you record model development and save models in reusable formats; the
MLflow Model Registry lets you manage and automate the promotion of models towards production; and Jobs
and MLflow Model Serving allow hosting models as batch and streaming jobs and as REST endpoints. For more
information and examples, see the MLflow guide or the MLflow Python API docs.
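
As a minimal sketch of MLflow Tracking (the mlflow package is preinstalled in the Databricks Runtime for Machine Learning; the parameter and metric names here are arbitrary):

import mlflow

# Log one parameter and one metric to a tracking run; in Azure Databricks the run is recorded
# against the notebook's experiment by default.
with mlflow.start_run(run_name="example-run"):
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)
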
To get started with common machine learning workloads, see the following pages:
Training scikit-learn and tracking with MLflow: 10-minute tutorial: machine learning on Databricks with scikit-
learn
Training deep learning models: Deep learning
Hyperparameter tuning: Parallelize hyperparameter tuning with scikit-learn and MLflow
Graph analytics: GraphFrames user guide - Python
IDEs, developer tools, and APIs
In addition to developing Python code within Azure Databricks notebooks, you can develop externally using
integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize
work between external development environments and Databricks, there are several options:
Code : You can synchronize code using Git. See Git integration with Databricks Repos.
Libraries and Jobs : You can create libraries (such as wheels) externally and upload them to Databricks.
Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See
Libraries and Jobs.
Remote machine execution : You can run code from your local IDE for interactive development and testing.
The IDE can communicate with Azure Databricks to execute large computations on Azure Databricks clusters.
To learn to use Databricks Connect to create this connection, see Use an IDE.
Databricks provides a full set of REST APIs which support automation and integration with external tooling. You
can use APIs to manage resources like clusters and libraries, code and other workspace objects, workloads and
jobs, and more. See REST API (latest).
For more information on IDEs, developer tools, and APIs, see Developer tools and guidance.
Additional resources
The Databricks Academy offers self-paced and instructor-led courses on many topics.
Features that support interoperability between PySpark and pandas
pandas function APIs
pandas user-defined functions
Optimize conversion between PySpark and pandas DataFrames
Python and SQL database connectivity
The Databricks SQL Connector for Python allows you to use Python code to run SQL commands on
Azure Databricks resources; a minimal connection sketch appears after this list.
pyodbc allows you to connect from your local Python code through ODBC to data stored in the
Databricks Lakehouse.
FAQs and tips for moving Python workloads to Databricks
Migrate single node workloads to Azure Databricks
Migrate production workloads to Azure Databricks
Knowledge Base
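
As a minimal sketch of the Databricks SQL Connector for Python mentioned above, assuming the databricks-sql-connector package is installed and using placeholder connection values that you would replace with the ones from your SQL warehouse’s Connection Details tab:

from databricks import sql

# Placeholder connection values; copy the real ones from your SQL warehouse or cluster.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/a123456bcde7f890",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM default.diamonds LIMIT 2")
        for row in cursor.fetchall():
            print(row)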
Pandas API on Spark
7/21/2022 • 2 minutes to read

NOTE
This feature is available on clusters that run Databricks Runtime 10.0 (Unsupported) and above. For clusters that run
Databricks Runtime 9.1 LTS and below, use Koalas instead.

Commonly used by data scientists, pandas is a Python package that provides easy-to-use data structures and
data analysis tools for the Python programming language. However, pandas does not scale out to big data.
Pandas API on Spark fills this gap by providing pandas equivalent APIs that work on Apache Spark. Pandas API
on Spark is useful not only for pandas users but also PySpark users, because pandas API on Spark supports
many tasks that are difficult to do with PySpark, for example plotting data directly from a PySpark DataFrame.

Requirements
Pandas API on Spark is available beginning in Apache Spark 3.2 (which is included beginning in Databricks
Runtime 10.0 (Unsupported)) by using the following import statement:

import pyspark.pandas as ps
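
A minimal, hypothetical example of the API in use; the syntax mirrors pandas while execution is distributed by Spark, and the CSV path is the sample dataset used elsewhere in this documentation:

import pyspark.pandas as ps

# A small pandas-on-Spark DataFrame; the same operations scale out to large tables.
psdf = ps.DataFrame({"cut": ["Ideal", "Premium", "Good"], "price": [326, 326, 327]})
print(psdf.groupby("cut")["price"].mean())

# Read data with a pandas-like API from the sample datasets.
diamonds = ps.read_csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
print(diamonds.head())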

Notebook
The following notebook shows how to migrate from pandas to pandas API on Spark.
pandas to pandas API on Spark notebook
Get notebook

Resources
Pandas API on Spark user guide on the Apache Spark website
Migrating from Koalas to pandas API on Spark on the Apache Spark website
Pandas API on Spark reference on the Apache Spark website
Koalas
7/21/2022 • 2 minutes to read

NOTE
Koalas is deprecated on clusters that run Databricks Runtime 10.0 (Unsupported) and above. For clusters running
Databricks Runtime 10.0 (Unsupported) and above, use Pandas API on Spark instead.
If you try using Koalas on clusters that run Databricks Runtime 10.0 (Unsupported) and above, an informational message
displays, recommending that you use Pandas API on Spark instead.

Koalas provides a drop-in replacement for pandas. Commonly used by data scientists, pandas is a Python
package that provides easy-to-use data structures and data analysis tools for the Python programming
language. However, pandas does not scale out to big data. Koalas fills this gap by providing pandas equivalent
APIs that work on Apache Spark. Koalas is useful not only for pandas users but also PySpark users, because
Koalas supports many tasks that are difficult to do with PySpark, for example plotting data directly from a
PySpark DataFrame.

Requirements
Koalas is included on clusters running Databricks Runtime 7.3 through 9.1. For clusters running Databricks
Runtime 10.0 and above, use Pandas API on Spark instead.
To use Koalas on a cluster running Databricks Runtime 7.0 or below, install Koalas as an Azure Databricks
PyPI library.
To use Koalas in an IDE, notebook server, or other custom applications that connect to an Azure Databricks
cluster, install Databricks Connect and follow the Koalas installation instructions.
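
Once Koalas is available on the cluster, a minimal, hypothetical example looks like the following; the data here is arbitrary and only illustrates the pandas-like API running on Spark:

import pandas as pd
import databricks.koalas as ks

# Convert a small pandas DataFrame to Koalas; the same pandas-like API then runs on Spark.
pdf = pd.DataFrame({"cut": ["Ideal", "Premium", "Good"], "price": [326, 326, 327]})
kdf = ks.from_pandas(pdf)
print(kdf.groupby("cut")["price"].mean())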

Notebook
The following notebook shows how to migrate from pandas to Koalas.
pandas to Koalas notebook
Get notebook

Resources
Koalas documentation
10 Minutes from pandas to Koalas on Apache Spark
Azure Databricks for R developers
7/21/2022 • 2 minutes to read

This section provides a guide to developing notebooks in Azure Databricks using the R language.

R APIs
Azure Databricks supports two APIs that provide an R interface to Apache Spark: SparkR and sparklyr.
SparkR
These articles provide an introduction and reference for SparkR.
SparkR overview
SparkR ML tutorials
SparkR function reference
sparklyr
This article provides an introduction to sparklyr.
sparklyr

Visualizations
Azure Databricks R notebooks support various types of visualizations using the display function.
Visualizations in R

Tools
In addition to Azure Databricks notebooks, you can also use the following R developer tools:
RStudio on Azure Databricks
Shiny on hosted RStudio Server
Use Shiny inside Databricks notebooks
renv on Azure Databricks
Use SparkR and RStudio Desktop with Databricks Connect.
Use sparklyr and RStudio Desktop with Databricks Connect.

Resources
Knowledge Base
SparkR overview
7/21/2022 • 4 minutes to read

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR also supports
distributed machine learning using MLlib.

SparkR in notebooks
For Spark 2.0 and above, you do not need to explicitly pass a sqlContext object to every function call.
For Spark 2.2 and above, notebooks no longer import SparkR by default because SparkR functions were
conflicting with similarly named functions from other popular packages. To use SparkR you can call
library(SparkR) in your notebooks. The SparkR session is already configured, and all SparkR functions will
talk to your attached cluster using the existing session.

SparkR in spark-submit jobs


You can run scripts that use SparkR on Azure Databricks as spark-submit jobs, with minor code modifications.
For an example, see Create and run a spark-submit job for R scripts.

Create SparkR DataFrames


You can create a DataFrame from a local R data.frame , from a data source, or using a Spark SQL query.
From a local R data.frame

The simplest way to create a DataFrame is to convert a local R data.frame into a SparkDataFrame . Specifically
we can use createDataFrame and pass in the local R data.frame to create a SparkDataFrame . Like most other
SparkR functions, createDataFrame syntax changed in Spark 2.0. You can see examples of this in the code
snippet below. For more examples, see createDataFrame.

library(SparkR)
df <- createDataFrame(faithful)

# Displays the content of the DataFrame to stdout


head(df)

Using the data source API


The general method for creating a DataFrame from a data source is read.df . This method takes the path for the
file to load and the type of data source. SparkR supports reading CSV, JSON, text, and Parquet files natively.

library(SparkR)
diamondsDF <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv",
header="true", inferSchema = "true")
head(diamondsDF)

SparkR automatically infers the schema from the CSV file.


Adding a data source connector with Spark Packages
Through Spark Packages you can find data source connectors for popular file formats such as Avro. As an
example, use the spark-avro package to load an Avro file. The availability of the spark-avro package depends on
your cluster’s image version. See Avro file.
First take an existing data.frame , convert to a Spark DataFrame, and save it as an Avro file.

require(SparkR)
irisDF <- createDataFrame(iris)
write.df(irisDF, source = "com.databricks.spark.avro", path = "dbfs:/tmp/iris.avro", mode = "overwrite")

To verify that an Avro file was saved:

%fs ls /tmp/iris.avro

Now use the spark-avro package again to read back the data.

irisDF2 <- read.df(path = "/tmp/iris.avro", source = "com.databricks.spark.avro")


head(irisDF2)

The data source API can also be used to save DataFrames into multiple file formats. For example, you can save
the DataFrame from the previous example to a Parquet file using write.df .

write.df(irisDF2, path="dbfs:/tmp/iris.parquet", source="parquet", mode="overwrite")

%fs ls dbfs:/tmp/iris.parquet

From a Spark SQL query


You can also create SparkR DataFrames using Spark SQL queries.

# Register earlier df as temp view


createOrReplaceTempView(irisDF2, "irisTemp")

# Create a df consisting of only the 'species' column using a Spark SQL query
species <- sql("SELECT species FROM irisTemp")

species is a SparkDataFrame.

DataFrame operations
Spark DataFrames support a number of functions to do structured data processing. Here are some basic
examples. A complete list can be found in the API docs.
Select rows and columns

# Import SparkR package if this is a new notebook


require(SparkR)

# Create DataFrame
df <- createDataFrame(faithful)

# Select only the "eruptions" column


head(select(df, df$eruptions))
# You can also pass in column name as strings
head(select(df, "eruptions"))

# Filter the DataFrame to only retain rows with wait times shorter than 50 mins
head(filter(df, df$waiting < 50))

Grouping and aggregation


SparkDataFrames support a number of commonly used functions to aggregate data after grouping. For
example you can count the number of times each waiting time appears in the faithful dataset.

head(count(groupBy(df, df$waiting)))

# You can also sort the output from the aggregation to get the most common waiting times
waiting_counts <- count(groupBy(df, df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))

Column operations
SparkR provides a number of functions that can be directly applied to columns for data processing and
aggregation. The following example shows the use of basic arithmetic functions.

# Convert waiting time from hours to seconds.


# You can assign this to a new column in the same DataFrame
df$waiting_secs <- df$waiting * 60
head(df)

Machine learning
SparkR exposes most MLlib algorithms. Under the hood, SparkR uses MLlib to train the model.
The following example shows how to build a Gaussian GLM model using SparkR. To run linear regression, set
family to "gaussian" . To run logistic regression, set family to "binomial" . When you use SparkML GLM, SparkR
automatically performs one-hot encoding of categorical features so that it does not need to be done manually.
Beyond String and Double type features, it is also possible to fit over MLlib Vector features, for compatibility
with other MLlib components.

# Create the DataFrame


df <- createDataFrame(iris)

# Fit a linear model over the dataset.


model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Model coefficients are returned in a similar format to R's native glm().


summary(model)

For tutorials, see SparkR ML tutorials.


SparkR ML tutorials
7/21/2022 • 2 minutes to read

Use glm
Load diamonds data and split into training and test sets
Train a linear regression model using glm()
Train a logistic regression model using glm()
Use glm
7/21/2022 • 2 minutes to read

glm fits a Generalized Linear Model, similar to R’s glm().


Syntax : glm(formula, data, family...)

Parameters :
formula : Symbolic description of the model to be fitted, for example: ResponseVariable ~ Predictor1 + Predictor2 .
Supported operators: ~ , + , - , and .
data : Any SparkDataFrame
family : String, "gaussian" for linear regression or "binomial" for logistic regression
lambda : Numeric, Regularization parameter
alpha : Numeric, Elastic-net mixing parameter

Output : MLlib PipelineModel


This tutorial shows how to perform linear and logistic regression on the diamonds dataset.

Load diamonds data and split into training and test sets
require(SparkR)

# Read diamonds.csv dataset as SparkDataFrame


diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
source = "com.databricks.spark.csv", header="true", inferSchema = "true")
diamonds <- withColumnRenamed(diamonds, "", "rowID")

# Split data into Training set and Test set


trainingData <- sample(diamonds, FALSE, 0.7)
testData <- except(diamonds, trainingData)

# Exclude rowIDs
trainingData <- trainingData[, -1]
testData <- testData[, -1]

print(count(diamonds))
print(count(trainingData))
print(count(testData))

head(trainingData)

Train a linear regression model using glm()


This section shows how to predict a diamond’s price from its features by training a linear regression model
using the training data.
There is a mix of categorical features (cut - Ideal, Premium, Very Good, …) and continuous features (depth, carat).
Under the hood, SparkR automatically performs one-hot encoding of such features so that it does not have to
be done manually.
# Family = "gaussian" to train a linear regression model
lrModel <- glm(price ~ ., data = trainingData, family = "gaussian")

# Print a summary of the trained model


summary(lrModel)

Use predict() on the test data to see how well the model works on new data.
Syntax : predict(model, newData)

Parameters :
model : MLlib model
newData : SparkDataFrame, typically your test set

Output : SparkDataFrame

# Generate predictions using the trained model


predictions <- predict(lrModel, newData = testData)

# View predictions against the price column


display(select(predictions, "price", "prediction"))

Evaluate the model.

errors <- select(predictions, predictions$price, predictions$prediction,
                 alias(predictions$price - predictions$prediction, "error"))
display(errors)

# Calculate RMSE
head(select(errors, alias(sqrt(sum(errors$error^2 , na.rm = TRUE) / nrow(errors)), "RMSE")))

Train a logistic regression model using glm()


This section shows how to create a logistic regression on the same dataset to predict a diamond’s cut based on
some of its features.
Logistic regression in MLlib supports only binary classification. To test the algorithm in this example, subset the
data to work with only 2 labels.

# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good"
trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good"))
testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))

# Family = "binomial" to train a logistic regression model


logrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = "binomial")

# Print summary of the trained model


summary(logrModel)
# Generate predictions using the trained model
predictionsLogR <- predict(logrModel, newData = testDataSub)

# View predictions against label column


display(select(predictionsLogR, "label", "prediction"))

Evaluate the model.

errorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction,
                     alias(abs(predictionsLogR$label - predictionsLogR$prediction), "error"))
display(errorsLogR)
SparkR function reference
7/21/2022 • 2 minutes to read

You can find the latest SparkR function reference on spark.apache.org.


You can also view function help in R notebooks or RStudio after you import the SparkR package.
sparklyr
7/21/2022 • 2 minutes to read

Azure Databricks supports sparklyr in notebooks, jobs, and RStudio Desktop.

Requirements
Azure Databricks distributes the latest stable version of sparklyr with every runtime release. You can use
sparklyr in Azure Databricks R notebooks or inside RStudio Server hosted on Azure Databricks by importing the
installed version of sparklyr.
In RStudio Desktop, Databricks Connect allows you to connect sparklyr from your local machine to Azure
Databricks clusters and run Apache Spark code. See Use sparklyr and RStudio Desktop with Databricks Connect.

Connect sparklyr to Azure Databricks clusters


To establish a sparklyr connection, you can use "databricks" as the connection method in spark_connect() . No
additional parameters to spark_connect() are needed, nor is calling spark_install() needed because Spark is
already installed on an Azure Databricks cluster.

# create a sparklyr connection


sc <- spark_connect(method = "databricks")

Progress bars and Spark UI with sparklyr


If you assign the sparklyr connection object to a variable named sc as in the above example, you will see Spark
progress bars in the notebook after each command that triggers Spark jobs. In addition, you can click the link
next to the progress bar to view the Spark UI associated with the given Spark job.

Use sparklyr
After you install sparklyr and establish the connection, all other sparklyr APIs work as they normally do. See the
example notebook for some examples.
sparklyr is usually used along with other tidyverse packages such as dplyr. Most of these packages are
preinstalled on Databricks for your convenience. You can simply import them and start using the API.

Use sparklyr and SparkR together


SparkR and sparklyr can be used together in a single notebook or job. You can import SparkR along with
sparklyr and use its functionality. In Azure Databricks notebooks, the SparkR connection is pre-configured.
Some of the functions in SparkR mask a number of functions in dplyr:

> library(SparkR)
The following objects are masked from ‘package:dplyr’:

    arrange, between, coalesce, collect, contains, count, cume_dist,
    dense_rank, desc, distinct, explain, filter, first, group_by,
    intersect, lag, last, lead, mutate, n, n_distinct, ntile,
    percent_rank, rename, row_number, sample_frac, select, sql,
    summarize, union

If you import SparkR after you imported dplyr, you can reference the functions in dplyr by using the fully
qualified names, for example, dplyr::arrange() . Similarly if you import dplyr after SparkR, the functions in
SparkR are masked by dplyr.
Alternatively, you can selectively detach one of the two packages when you do not need it.

detach("package:dplyr")

Use sparklyr in spark-submit jobs


You can run scripts that use sparklyr on Azure Databricks as spark-submit jobs, with minor code modifications.
Some of the instructions above do not apply to using sparklyr in spark-submit jobs on Azure Databricks. In
particular, you must provide the Spark master URL to spark_connect . For an example, see Create and run a
spark-submit job for R scripts.

Unsupported features
Azure Databricks does not support sparklyr methods such as spark_web() and spark_log() that require a local
browser. However, since the Spark UI is built-in on Azure Databricks, you can inspect Spark jobs and logs easily.
See Cluster driver and worker logs.
Sparklyr notebook
Get notebook
RStudio on Azure Databricks
7/21/2022 • 10 minutes to read

Azure Databricks integrates with RStudio Server, the popular integrated development environment (IDE) for R.
You can use either the Open Source or Pro editions of RStudio Server on Azure Databricks. If you want to use
RStudio Server Pro, you must transfer your existing RStudio Pro license to Azure Databricks (see Get started
with RStudio Workbench (previously RStudio Server Pro)).
Databricks Runtime for Machine Learning includes an unmodified version of RStudio Server Open Source
package for which the source code can be found in GitHub. The following table lists the version of RStudio
Server Open Source that is currently preinstalled on Databricks Runtime for ML versions.

DATABRICKS RUNTIME FOR ML VERSION                   RSTUDIO SERVER VERSION

Databricks Runtime 7.3 LTS ML                       1.2

Databricks Runtime 9.1 LTS ML and 10.4 LTS ML       1.4

RStudio integration architecture


When you use RStudio Server on Azure Databricks, the RStudio Server Daemon runs on the driver node of an
Azure Databricks cluster. The RStudio web UI is proxied through the Azure Databricks web application, which means that
you do not need to make any changes to your cluster network configuration. This diagram demonstrates the
RStudio integration component architecture.

WARNING
Azure Databricks proxies the RStudio web service from port 8787 on the cluster’s Spark driver. This web proxy is intended
for use only with RStudio. If you launch other web services on port 8787, you might expose your users to potential
security exploits. Neither Databricks nor Microsoft is responsible for any issues that result from the installation of
unsupported software on a cluster.

Requirements
The cluster must not have table access control, automatic termination, or credential passthrough enabled.
You must have Can Attach To permission for that cluster. The cluster admin can grant you this permission.
See Cluster access control.
If you want to use the Pro edition, an RStudio Server floating Pro license.
Get started with RStudio Server Open Source
IMPORTANT
If you are using Databricks Runtime 7.0 ML or above, RStudio Server Open Source is already installed and you can skip
the section on installing RStudio Server.

To get started with RStudio Server Open Source on Azure Databricks, you must install RStudio on an Azure
Databricks cluster. You need to perform this installation only once. Installation is usually performed by an
administrator.
Install RStudio Server Open Source
To set up RStudio Server Open Source on an Azure Databricks cluster, you must create an init script to install the
RStudio Server Open Source binary package. See Cluster-scoped init scripts for more details. Here is an example
notebook cell that writes an init script to a location in DBFS.

IMPORTANT
All users have read and write access to DBFS, so the init script can be modified by any user. If this is a potential
issue for you, Databricks recommends that you put the init script on Azure Data Lake Storage Gen2 and restrict
permissions to it.
You may need to modify the package URL depending on the Ubuntu version of your runtime, which you can find
in the release notes.

script = """#!/bin/bash

set -euxo pipefail


RSTUDIO_BIN="/usr/sbin/rstudio-server"

if [[ ! -f "$RSTUDIO_BIN" && $DB_IS_DRIVER = "TRUE" ]]; then


apt-get update
apt-get install -y gdebi-core
cd /tmp
# You can find new releases at https://rstudio.com/products/rstudio/download-server/debian-ubuntu/.
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-2022.02.1-461-amd64.deb -O rstudio-
server.deb
sudo gdebi -n rstudio-server.deb
rstudio-server restart || true
fi
"""

dbutils.fs.mkdirs("/databricks/rstudio")
dbutils.fs.put("/databricks/rstudio/rstudio-install.sh", script, True)

1. Run the code in a notebook to install the script at dbfs:/databricks/rstudio/rstudio-install.sh


2. Before launching a cluster add dbfs:/databricks/rstudio/rstudio-install.sh as an init script. See Cluster-
scoped init scripts for details.
3. Launch the cluster.
Use RStudio Server Open Source
1. Display the details of the cluster on which you installed RStudio and click the Apps tab:
2. In the Apps tab, click the Set up RStudio button. This generates a one-time password for you. Click the
show link to display it and copy the password.

3. Click the Open RStudio UI link to open the UI in a new tab. Enter your username and password in the
login form and sign in.

4. From the RStudio UI, you can import the SparkR package and set up a SparkR session to launch Spark
jobs on your cluster.

library(SparkR)
sparkR.session()
5. You can also attach the sparklyr package and set up a Spark connection.

SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")

Get started with RStudio Workbench (previously RStudio Server Pro)


In February 2022, RStudio Server Pro was renamed RStudio Workbench (FAQ about the name change).
Depending on your license, RStudio Workbench may include RStudio Server Pro.
Set up RStudio license server
To use RStudio Workbench on Azure Databricks, you need to convert your Pro License to a floating license. For
assistance, contact help@rstudio.com. When your license is converted, you must set up a license server for
RStudio Workbench.
To set up a license server:
1. Launch a small instance on your cloud provider network; the license server daemon is very lightweight.
2. Download and install the corresponding version of RStudio License Server on your instance, and start the
service. For detailed instructions, see RStudio Workbench Admin Guide.
3. Make sure that the license server port is open to Azure Databricks instances.
Install RStudio Workbench
To set up RStudio Workbench on an Azure Databricks cluster, you must create an init script to install the RStudio
Workbench binary package and configure it to use your license server for license lease. See Cluster-scoped init scripts for
more details.

NOTE
If you plan to install RStudio Workbench on a Databricks Runtime version that already includes the RStudio Server Open
Source package, you must first uninstall that package for the installation to succeed.

The following is an example notebook cell that generates an init script on DBFS. The script also performs
additional authentication configurations that streamline integration with Azure Databricks.

IMPORTANT
All users have read and write access to DBFS, so the init script can be modified by any user. If this is a potential
issue for you, Databricks recommends that you put the init script on Azure Data Lake Storage Gen2 and restrict
permissions to it.
You may need to modify the package URL depending on the Ubuntu version of your runtime, which you can find
in the release notes.
script = """#!/bin/bash

set -euxo pipefail

if [[ $DB_IS_DRIVER = "TRUE" ]]; then


sudo apt-get update
sudo dpkg --purge rstudio-server # in case open source version is installed.
sudo apt-get install -y gdebi-core alien

## Installing RStudio Workbench


cd /tmp

# You can find new releases at https://rstudio.com/products/rstudio/download-commercial/debian-ubuntu/.


wget https://download2.rstudio.org/server/bionic/amd64/rstudio-workbench-2022.02.1-461.pro1-amd64.deb -O rstudio-workbench.deb
sudo gdebi -n rstudio-workbench.deb

## Configuring authentication
sudo echo 'auth-proxy=1' >> /etc/rstudio/rserver.conf
sudo echo 'auth-proxy-user-header-rewrite=^(.*)$ $1' >> /etc/rstudio/rserver.conf
sudo echo 'auth-proxy-sign-in-url=<domain>/login.html' >> /etc/rstudio/rserver.conf
sudo echo 'admin-enabled=1' >> /etc/rstudio/rserver.conf
sudo echo 'export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' >> /etc/rstudio/rsession-profile

# Enabling floating license


sudo echo 'server-license-type=remote' >> /etc/rstudio/rserver.conf

# Session configurations
sudo echo 'session-rprofile-on-resume-default=1' >> /etc/rstudio/rsession.conf
sudo echo 'allow-terminal-websockets=0' >> /etc/rstudio/rsession.conf

sudo rstudio-server license-manager license-server <license-server-url>


sudo rstudio-server restart || true
fi
"""

dbutils.fs.mkdirs("/databricks/rstudio")
dbutils.fs.put("/databricks/rstudio/rstudio-install.sh", script, True)

1. Replace <domain> with your Azure Databricks URL and <license-server-url> with the URL of your floating
license server.
2. Run the code in a notebook to install the script at dbfs:/databricks/rstudio/rstudio-install.sh.
3. Before launching a cluster, add dbfs:/databricks/rstudio/rstudio-install.sh as an init script. See Cluster-scoped init scripts for details.
4. Launch the cluster.
Use RStudio Server Pro
1. Display the details of the cluster on which you installed RStudio and click the Apps tab:
2. In the Apps tab, click the Set up RStudio button.

3. You do not need the one-time password. Click the Open RStudio UI link and it will open an
authenticated RStudio Pro session for you.
4. From the RStudio UI, you can attach the SparkR package and set up a SparkR session to launch Spark
jobs on your cluster.

library(SparkR)
sparkR.session()

5. You can also attach the sparklyr package and set up a Spark connection.

SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")
Frequently asked questions (FAQ)
What is the difference between RStudio Server Open Source and RStudio Workbench?
RStudio Workbench supports a wide range of enterprise features that are not available on the Open Source
edition. You can see the feature comparison on RStudio’s website.
In addition, RStudio Server Open Source is distributed under the GNU Affero General Public License (AGPL),
while the Pro version comes with a commercial license for organizations that are not able to use AGPL software.
Finally, RStudio Workbench comes with professional and enterprise support from RStudio, PBC, while RStudio
Server Open Source comes with no support.
Can I use my RStudio Workbench / RStudio Server Pro license on Azure Databricks?
Yes, if you already have a Pro or Enterprise license for RStudio Server, you can use that license on Azure
Databricks. See Get started with RStudio Workbench (previously RStudio Server Pro) to learn how to set up
RStudio Workbench on Azure Databricks.
Where does RStudio Server run? Do I need to manage any additional services/servers?
As you can see on the diagram in RStudio integration architecture, the RStudio Server daemon runs on the
driver (master) node of your Azure Databricks cluster. With RStudio Server Open Source, you do not need to run
any additional servers/services. However, for RStudio Workbench, you must manage a separate instance that
runs RStudio License Server.
Can I use RStudio Server on a standard cluster?
Yes, you can. Originally, you were required to use a high concurrency cluster, but that limitation is no longer in
place.
Can I use RStudio Server on a cluster with auto termination?
No, you can’t use RStudio when auto termination is enabled. Auto termination can purge unsaved user scripts
and data inside an RStudio session. To protect users against this unintended data loss scenario, RStudio is
disabled on such clusters by default.
For customers who require cleaning up cluster resources when they are not used, Databricks recommends using
cluster APIs to clean up RStudio clusters based on a schedule.
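For example, the following is a minimal R sketch of such a cleanup call, run from any external scheduler. It assumes a personal access token in the DATABRICKS_TOKEN environment variable, a hypothetical workspace URL and cluster ID, and the Clusters API 2.0 clusters/delete endpoint, which terminates (but does not permanently delete) the cluster.

library(httr)

# Hypothetical workspace URL and cluster ID; replace with your own values.
workspace_url <- "https://adb-1234567890123456.7.azuredatabricks.net"
cluster_id <- "0123-456789-abcde123"
token <- Sys.getenv("DATABRICKS_TOKEN")

# Terminate the RStudio cluster on a schedule to release its resources.
response <- POST(
  paste0(workspace_url, "/api/2.0/clusters/delete"),
  add_headers(Authorization = paste("Bearer", token)),
  body = list(cluster_id = cluster_id),
  encode = "json"
)
stop_for_status(response)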
How should I persist my work on RStudio?
We strongly recommend that you persist your work using a version control system from RStudio. RStudio has
great support for various version control systems and allows you to check in and manage your projects.
You can also save your files (code or data) on the Databricks File System (DBFS). For example, if you save a file
under /dbfs/, it is not deleted when your cluster is terminated or restarted.
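For example, the following minimal R sketch writes a data frame to a hypothetical directory under the /dbfs/ FUSE mount so that it survives cluster termination or restart.

# The target directory is hypothetical; adjust the path for your workspace.
dir.create("/dbfs/home/my_user", recursive = TRUE, showWarnings = FALSE)
write.csv(mtcars, "/dbfs/home/my_user/mtcars.csv", row.names = FALSE)

# Read the file back later, for example after a cluster restart.
persisted <- read.csv("/dbfs/home/my_user/mtcars.csv")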

IMPORTANT
If you do not persist your code through version control or DBFS, you risk losing your work if an admin restarts or
terminates the cluster.

Another method is to save the R notebook to your local file system by exporting it as Rmarkdown , then later
importing the file into the RStudio instance.
The blog Sharing R Notebooks using RMarkdown describes the steps in more detail.
How do I start a SparkR session?
SparkR is contained in Databricks Runtime, but you must load it into RStudio. Run the following code inside
RStudio to initialize a SparkR session.

library(SparkR)
sparkR.session()

If there is an error importing the SparkR package, run .libPaths() and verify that
/home/ubuntu/databricks/spark/R/lib is included in the result.
If it is not included, check the content of /usr/lib/R/etc/Rprofile.site . List
/home/ubuntu/databricks/spark/R/lib/SparkR on the driver to verify that the SparkR package is installed.
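For example, you can run the following checks inside RStudio; the paths are the ones mentioned above.

# Confirm that the SparkR library path appears in the library search path.
.libPaths()

# Inspect the site profile and the SparkR installation on the driver.
file.exists("/usr/lib/R/etc/Rprofile.site")
list.files("/home/ubuntu/databricks/spark/R/lib/SparkR")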
How do I start a sparklyr session?
The sparklyr package must be installed on the cluster. Use one of the following methods to install the
sparklyr package:
As an Azure Databricks library
install.packages() command
RStudio package management UI
After the sparklyr package is installed, run the following code inside RStudio to initialize a SparkR session and then a
sparklyr connection.

SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")

If the sparklyr commands fail, confirm that SparkR::sparkR.session() succeeded.


How does RStudio integrate with Azure Databricks R notebooks?
You can move your work between notebooks and RStudio through version control.
What is the working directory?
When you start a project in RStudio, you choose a working directory. By default this is the home directory on
the driver (master) container where RStudio Server is running. You can change this directory if you want.
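A minimal sketch; the project path is hypothetical and only serves as an example of a directory that persists on the DBFS FUSE mount.

getwd()                                # show the current working directory
setwd("/dbfs/home/my_user/my_project") # switch to a hypothetical persistent directory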
Can I launch Shiny Apps from RStudio running on Azure Databricks?
Yes, you can develop and view Shiny applications inside RStudio Server on Databricks.
I can’t use terminal or git inside RStudio on Azure Databricks. How can I fix that?
Make sure that you have disabled websockets. In RStudio Server Open Source, you can do this from the UI.

In RStudio Server Pro, you can add allow-terminal-websockets=0 to /etc/rstudio/rsession.conf to disable websockets for all users.
I don’t see the Apps tab under cluster details.
This feature is not available to all customers. You must be on the Premium Plan.
Shiny on hosted RStudio Server
7/21/2022 • 6 minutes to read

Shiny is an R package, available on CRAN, used to build interactive R applications and dashboards. You can use
Shiny inside RStudio Server hosted on Azure Databricks clusters. You can also develop, host, and share Shiny
applications directly from an Azure Databricks notebook. See Share Shiny app URL.
To get started with Shiny, see the Shiny tutorials.
This article describes how to run Shiny applications on RStudio on Azure Databricks and use Apache Spark
inside Shiny applications.

Requirements
Shiny is included in Databricks Runtime 7.3 LTS and above. On Databricks Runtime 6.4 Extended Support
and Databricks Runtime 5.5 LTS, you must install the Shiny R package.
RStudio on Azure Databricks.

IMPORTANT
With RStudio Server Pro, you must disable proxied authentication. Make sure auth-proxy=1 is not present inside
/etc/rstudio/rserver.conf .

Get Started with Shiny


1. Open RStudio on Azure Databricks.
2. In RStudio, import the Shiny package and run the example app 01_hello as follows:

> library(shiny)
> runExample("01_hello")

Listening on http://127.0.0.1:3203

A new window appears, displaying the Shiny application.


Run a Shiny app from an R script
To run a Shiny app from an R script, open the R script in the RStudio editor and click the Run App button on the
top right.

Use Apache Spark inside Shiny apps


You can use Apache Spark when developing Shiny applications on Azure Databricks. You can interact with Spark
using both SparkR and sparklyr. You need at least one worker to launch Spark tasks.
The following example uses SparkR to launch Spark jobs. The example uses the ggplot2 diamonds dataset to
plot the price of diamonds by carat. The carat range can be changed using the slider at the top of the application,
and the range of the plot’s x-axis changes accordingly.
library(SparkR)
library(sparklyr)
library(dplyr)
library(ggplot2)
sparkR.session()

sc <- spark_connect(method = "databricks")


diamonds_tbl <- spark_read_csv(sc, path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

# Define the UI
ui <- fluidPage(
sliderInput("carat", "Select Carat Range:",
min = 0, max = 5, value = c(0, 5), step = 0.01),
plotOutput('plot')
)

# Define the server code


server <- function(input, output) {
output$plot <- renderPlot({
# Select diamonds in carat range
df <- diamonds_tbl %>%
dplyr::select("carat", "price") %>%
dplyr::filter(carat >= !!input$carat[[1]], carat <= !!input$carat[[2]])

# Scatter plot with smoothed means


ggplot(df, aes(carat, price)) +
geom_point(alpha = 1/2) +
geom_smooth() +
scale_size_area(max_size = 2) +
ggtitle("Price vs. Carat")
})
}

# Return a Shiny app object

shinyApp(ui = ui, server = server)

Frequently asked questions (FAQ)


How do I install Shiny on Databricks Runtime 6.4 Extended Support and Databricks Runtime 5.5 LTS?
Why is my Shiny app grayed out after some time?
Why does my Shiny viewer window disappear after a while?
My app crashes immediately after launching, but the code appears to be correct. What’s going on?
Why do long Spark jobs never return?
How can I avoid the timeout?
How many connections can be accepted for one Shiny app link during development?
Can I use a different version of the Shiny package than the one installed in Databricks Runtime?
How can I develop a Shiny application that can be published to a Shiny server and access data on Azure
Databricks?
How can I save the Shiny applications that I develop on Azure Databricks?
Can I develop a Shiny application inside an Azure Databricks notebook?
How do I install Shiny on Databricks Runtime 6.4 Extended Support and Databricks Runtime 5.5 LTS?
Install the Shiny package as an Azure Databricks library on the cluster. Using install.packages('shiny') in the
RStudio console or using the RStudio package manager may not work.
Why is my Shiny app grayed out after some time?
If there is no interaction with the Shiny app, the connection to the app closes after about 4 minutes.
To reconnect, refresh the Shiny app page. The dashboard state resets.
Why does my Shiny viewer window disappear after a while?
If the Shiny viewer window disappears after idling for several minutes, it is due to the same timeout as the “gray
out” scenario.
My app crashes immediately after launching, but the code appears to be correct. What’s going on?
There is a 20 MB limit on the total amount of data that can be displayed in a Shiny app on Azure Databricks. If
the application’s total data size exceeds this limit, it will crash immediately after launching. To avoid this,
Databricks recommends reducing the data size, for example by downsampling the displayed data or reducing
the resolution of images.
Why do long Spark jobs never return?
This is also because of the idle timeout. Any Spark job running for longer than the previously mentioned
timeouts is not able to render its result because the connection closes before the job returns.
How can I avoid the timeout?
There is a workaround suggested in this issue thread. The workaround sends heartbeats to keep the
websocket alive when the app is idle. However, if the app is blocked by a long running computation, this
workaround does not work.
Shiny does not support long running tasks. A Shiny blog post recommends using promises and futures
to run long tasks asynchronously and keep the app unblocked. Here is an example that uses heartbeats to
keep the Shiny app alive, and runs a long running Spark job in a future construct.

# Write an app that uses spark to access data on Databricks


# First, install the following packages:
install.packages('future')
install.packages('promises')

library(shiny)
library(promises)
library(future)
plan(multisession)

HEARTBEAT_INTERVAL_MILLIS = 1000 # 1 second

# Define the long Spark job here


run_spark <- function(x) {
# Environment setting
library("SparkR", lib.loc = "/databricks/spark/R/lib")
sparkR.session()

irisDF <- createDataFrame(iris)


collect(irisDF)
Sys.sleep(3)
x + 1
}

run_spark_sparklyr <- function(x) {


# Environment setting
library(sparklyr)
library(dplyr)
library("SparkR", lib.loc = "/databricks/spark/R/lib")
sparkR.session()
sc <- spark_connect(method = "databricks")

iris_tbl <- copy_to(sc, iris, overwrite = TRUE)


collect(iris_tbl)
x + 1
}

ui <- fluidPage(
sidebarLayout(
# Display heartbeat
sidebarPanel(textOutput("keep_alive")),

# Display the Input and Output of the Spark job


mainPanel(
numericInput('num', label = 'Input', value = 1),
actionButton('submit', 'Submit'),
textOutput('value')
)
)
)
server <- function(input, output) {
#### Heartbeat ####
# Define reactive variable
cnt <- reactiveVal(0)
# Define time dependent trigger
autoInvalidate <- reactiveTimer(HEARTBEAT_INTERVAL_MILLIS)
# Time dependent change of variable
observeEvent(autoInvalidate(), { cnt(cnt() + 1) })
# Render print
output$keep_alive <- renderPrint(cnt())

#### Spark job ####


result <- reactiveVal() # the result of the spark job
busy <- reactiveVal(0) # whether the spark job is running
# Launch a spark job in a future when actionButton is clicked
observeEvent(input$submit, {
if (busy() != 0) {
showNotification("Already running Spark job...")
return(NULL)
}
showNotification("Launching a new Spark job...")
# input$num must be read outside the future
input_x <- input$num
fut <- future({ run_spark(input_x) }) %...>% result()
# Or: fut <- future({ run_spark_sparklyr(input_x) }) %...>% result()
busy(1)
# Catch exceptions and notify the user
fut <- catch(fut, function(e) {
result(NULL)
cat(e$message)
showNotification(e$message)
})
fut <- finally(fut, function() { busy(0) })
# Return something other than the promise so shiny remains responsive
NULL
})
# When the spark job returns, render the value
output$value <- renderPrint(result())
}
shinyApp(ui = ui, server = server)

How many connections can be accepted for one Shiny app link during development?
Databricks recommends up to 20.
Can I use a different version of the Shiny package than the one installed in Databricks Runtime?
Yes. See Fix the Version of R Packages.
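For example, one way to pin a different Shiny version is to install it as a notebook-scoped or cluster library from a dated CRAN snapshot, as the renv section later in this guide does for other packages; the snapshot date below is illustrative.

install.packages("shiny", repos = "https://cran.microsoft.com/snapshot/2021-07-16/")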
How can I develop a Shiny application that can be published to a Shiny server and access data on Azure
Databricks?
While you can access data naturally using SparkR or sparklyr during development and testing on Azure
Databricks, after a Shiny application is published to a stand-alone hosting service, it cannot directly access the
data and tables on Azure Databricks.
To enable your application to function outside Azure Databricks, you must rewrite how you access data. There
are a few options:
Use JDBC/ODBC to submit queries to an Azure Databricks cluster.
Use Databricks Connect.
Directly access data on object storage.
Databricks recommends that you work with your Azure Databricks solutions team to find the best approach for
your existing data and analytics architecture.
How can I save the Shiny applications that I develop on Azure Databricks?
You can either save your application code on DBFS through the FUSE mount or check your code into version
control.
Can I develop a Shiny application inside an Azure Databricks notebook?
Yes, you can develop a Shiny application inside an Azure Databricks notebook. For more details see Use Shiny
inside Databricks notebooks.
Use Shiny inside Databricks notebooks
7/21/2022 • 2 minutes to read

IMPORTANT
This feature is in Public Preview.

You can develop, host, and share Shiny applications directly from an Azure Databricks notebook.
To get started with Shiny, see the Shiny tutorials. You can run these tutorials on Azure Databricks notebooks.

Requirements
Databricks Runtime 8.3 or above.

Use Shiny inside R notebooks


The Shiny package is included with Databricks Runtime. You can interactively develop and test Shiny
applications inside Azure Databricks R notebooks similarly to hosted RStudio.
Follow these steps to get started:
1. Create an R notebook.
2. Run this code:

library(shiny)
runExample("01_hello")

3. When the app is ready, the output includes the Shiny app URL as a clickable link which opens a new tab.
See Share Shiny app URL for information about sharing this app with other users.

NOTE
Log messages appear in the command result, similar to the default log message (
Listening on http://0.0.0.0:5150 ) shown in the example.
To stop the Shiny application, click Cancel.
The Shiny application uses the notebook R process. If you detach the notebook from the cluster, or if you cancel the
cell running the application, the Shiny application terminates. You cannot run other cells while the Shiny application is
running.

Run Shiny applications from files


If your Shiny application code is part of a project managed by version control, you can run it inside the
notebook.

NOTE
You must use an absolute path or set the working directory with setwd() .

1. Check out the code from a repository using code similar to:

%sh git clone https://github.com/rstudio/shiny-examples.git


cloning into 'shiny-examples'...

2. To run the application, enter code similar to the following code in another cell:

library(shiny)
runApp("/databricks/driver/shiny-examples/007-widgets/")

Use Apache Spark inside Shiny applications


You can use Apache Spark inside Shiny applications with either SparkR or sparklyr. For more details see install
the Shiny R package.
Use SparkR with Shiny in a notebook

library(shiny)
library(SparkR)
sparkR.session()

ui <- fluidPage(
mainPanel(
textOutput("value")
)
)

server <- function(input, output) {


output$value <- renderText({ nrow(createDataFrame(iris)) })
}

shinyApp(ui = ui, server = server)

Use sparklyr with Shiny in a notebook


library(shiny)
library(sparklyr)

sc <- spark_connect(method = "databricks")

ui <- fluidPage(
mainPanel(
textOutput("value")
)
)

server <- function(input, output) {


output$value <- renderText({
df <- sdf_len(sc, 5, repartition = 1) %>%
spark_apply(function(e) sum(e)) %>%
collect()
df$result
})
}

shinyApp(ui = ui, server = server)

Share Shiny app URL


The Shiny app URL generated when you start an app is shareable with other users. Any Azure Databricks user
with Can Attach To permission on the cluster can view and interact with the app as long as both the app and
the cluster are running.
If the cluster that the app is running on terminates, the app is no longer accessible. You can disable automatic
termination in the cluster settings.
If you attach and run the notebook hosting the Shiny app on a different cluster, the Shiny URL changes. Also, if
you restart the app on the same cluster, Shiny might pick a different random port. To ensure a stable URL, you
can set the shiny.port option, or, when restarting the app on the same cluster, you can specify the port
argument.
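For example, the following minimal sketch pins the port before starting the app; the port number is an arbitrary example value.

library(shiny)

# Fix the port so the generated app URL stays stable on this cluster.
options(shiny.port = 8765)
runExample("01_hello")

# Or pass the port directly when running an app from a file:
# runApp("/databricks/driver/shiny-examples/007-widgets/", port = 8765)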
renv on Azure Databricks
7/21/2022 • 2 minutes to read

renv is an R package that lets users manage R dependencies specific to the notebook.
Using renv , you can create and manage the R library environment for your project, save the state of these
libraries to a lockfile , and later restore libraries as required. Together, these tools can help make projects more
isolated, portable, and reproducible.

Basic renv workflow


In this section:
Install renv
Initialize renv session with pre-installed R libraries
Use renv to install additional packages
Use renv to save your R notebook environment to DBFS
Reinstall a renv environment given a lockfile from DBFS
Install renv

You can install renv as a cluster-scoped library or as a notebook-scoped library. To install renv as a notebook-
scoped library, use:

install.packages("renv", repo = "https://cran.microsoft.com/snapshot/2021-07-16/")

Databricks recommends using a CRAN snapshot as the repository to fix the package version.
Initialize renv session with pre-installed R libraries
The first step when using renv is to initialize a session using renv::init() . Set libPaths to change the default
download location to be your R notebook-scoped library path.

renv::init(settings = list(external.libraries=.libPaths()))
.libPaths(c(.libPaths()[2], .libPaths()))

Use renv to install additional packages


You can now use renv ’s API to install and remove R packages. For example, to install the latest version of
digest , run the following inside of a notebook cell.

renv::install("digest")

To install an old version of digest , run the following inside of a notebook cell.

renv::install("digest@0.6.18")

To install digest from GitHub, run the following inside of a notebook cell.
renv::install("eddelbuettel/digest")

To install a package from Bioconductor, run the following inside of a notebook cell.

# (note: requires the BiocManager package)


renv::install("bioc::Biobase")

Note that the renv::install API uses the renv Cache.


Use renv to save your R notebook environment to DBFS
Run the following command once before saving the environment.

renv::settings$snapshot.type("all")

This sets renv to snapshot all packages that are installed into libPaths , not just the ones that are currently
used in the notebook. See renv documentation for more information.
Now you can run the following inside of a notebook cell to save the current state of your environment.

renv::snapshot(lockfile="/dbfs/PATH/TO/WHERE/YOU/WANT/TO/SAVE/renv.lock", force=TRUE)

This updates the lockfile by capturing all packages installed on libPaths . It also moves your lockfile from
the local filesystem to DBFS, where it persists even if your cluster terminates or restarts.
Reinstall a renv environment given a lockfile from DBFS
First, make sure that your new cluster is running the same Databricks Runtime version as the one on which you first
created the renv environment. This ensures that the pre-installed R packages are identical. You can find a list
of these in each runtime’s release notes. After you Install renv, run the following inside of a notebook cell.

renv::init(settings = list(external.libraries=.libPaths()))
.libPaths(c(.libPaths()[2], .libPaths()))
renv::restore(lockfile="/dbfs/PATH/TO/WHERE/YOU/SAVED/renv.lock", exclude=c("Rserve", "SparkR"))

This copies your lockfile from DBFS into the local file system and then restores any packages specified in the
lockfile .

NOTE
To avoid missing repository errors, exclude the Rserve and SparkR packages from package restoration. Both of these
packages are pre-installed in all runtimes.

renv Cache
A very useful feature of renv is its global package cache, which is shared across all renv projects on the
cluster. It speeds up installation times and saves disk space. The renv cache does not cache packages
downloaded via the devtools API or install.packages() with any additional arguments other than pkgs .
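For example, under these rules the first install below can be served from the shared cache, while the second bypasses it because install.packages() is called with an argument other than pkgs; digest is just an illustrative package.

# Served from the shared renv cache when the package is already cached on the cluster.
renv::install("digest")

# Bypasses the renv cache: install.packages() is passed an extra argument.
install.packages("digest", dependencies = TRUE)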
Azure Databricks for Scala developers
7/21/2022 • 2 minutes to read

This section provides a guide to developing notebooks and jobs in Azure Databricks using the Scala language.

Scala API
These links provide an introduction to and reference for the Apache Spark Scala API.
Introduction to DataFrames
Complex and nested data
Aggregators
Introduction to Datasets
Introduction to Structured Streaming
Apache Spark API reference

Visualizations
Azure Databricks Scala notebooks have built-in support for many types of visualizations. You can also use legacy
visualizations:
Visualization overview
Visualization deep dive in Scala

Interoperability
This section describes features that support interoperability between Scala and SQL.
User-defined functions
User-defined aggregate functions

Tools
In addition to Azure Databricks notebooks, you can also use the following Scala developer tools:
IntelliJ

Libraries
Databricks runtimes provide many libraries. To make third-party or locally-built Scala libraries available to
notebooks and jobs running on your Azure Databricks clusters, you can install libraries following these
instructions:
Install Scala libraries in a cluster

Resources
Knowledge Base
Azure Databricks for SQL developers
7/21/2022 • 2 minutes to read

This section provides a guide to developing notebooks in the Databricks Data Science & Engineering and
Databricks Machine Learning environments using the SQL language.

Databricks SQL
If you are a data analyst who works primarily with SQL queries and BI tools, Databricks SQL provides an
intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake. You
may want to skip this article, which is focused on developing notebooks in the Databricks Data Science &
Engineering and Databricks Machine Learning environments. Instead see:
Queries in Databricks SQL
SQL reference for Databricks SQL

SQL Reference
The SQL language reference that you use depends on the Databricks Runtime version that your cluster is
running:
Databricks Runtime 7.x and above (Spark SQL 3.x)
Databricks Runtime 6.4 Extended Support and Databricks Light 2.4 (Spark SQL 2.4)
For Databricks SQL, see SQL reference for Databricks SQL.

Use cases
Cost-based optimizer
Transactional writes to cloud storage with DBIO
Handling bad records and files
Handling large queries in interactive workflows
Adaptive query execution
Query semi-structured data in SQL
Data skipping index

Visualizations
SQL notebooks support various types of visualizations using the display function.
Create a new visualization

Interoperability
This section describes features that support interoperability between SQL and other languages supported in
Azure Databricks.
User-defined scalar functions - Python
User-defined scalar functions - Scala
User-defined aggregate functions - Scala

Tools
In addition to Azure Databricks notebooks, you can also use various third-party developer tools, data sources,
and other integrations. See Databricks integrations.

Access control
This article describes how to use SQL constructs to control access to database objects:
Data object privileges

Resources
Apache Spark SQL Guide
Delta Lake guide
Knowledge Base
SQL reference for Databricks Runtime 7.3 LTS and above
7/21/2022 • 2 minutes to read

This is a SQL command reference for users on clusters running Databricks Runtime 7.x and above in the
Databricks Data Science & Engineering workspace and Databricks Machine Learning environment.

NOTE
For Databricks Runtime 5.5 LTS and 6.x SQL commands, see SQL reference for Databricks Runtime 5.5 LTS and 6.x.
For the Databricks SQL language reference, see SQL reference for Databricks SQL.

General reference
This general reference describes data types, functions, identifiers, literals, and semantics:
How to read a syntax diagram
Data types and literals
SQL data type rules
Datetime patterns
Functions
Built-in functions
Lambda functions
Window functions
Identifiers
Names
Null semantics
Expressions
JSON path expressions
Partitions
ANSI compliance
Apache Hive compatibility
Principals
Privileges and securable objects
External locations and storage credentials
Delta Sharing
Information schema
Reserved words

DDL statements
You use data definition statements to create or modify the structure of database objects in a database:
ALTER CATALOG
ALTER CREDENTIAL
ALTER DATABASE
ALTER LOCATION
ALTER TABLE
ALTER SCHEMA
ALTER SHARE
ALTER VIEW
COMMENT ON
CREATE BLOOMFILTER INDEX
CREATE CATALOG
CREATE DATABASE
CREATE FUNCTION (External)
CREATE FUNCTION (SQL)
CREATE LOCATION
CREATE RECIPIENT
CREATE SCHEMA
CREATE SHARE
CREATE TABLE
CREATE VIEW
DROP BLOOMFILTER INDEX
DROP CATALOG
DROP DATABASE
DROP CREDENTIAL
DROP FUNCTION
DROP LOCATION
DROP RECIPIENT
DROP SCHEMA
DROP SHARE
DROP TABLE
DROP VIEW
MSCK REPAIR TABLE
TRUNCATE TABLE

DML statements
You use data manipulation statements to add, change, or delete data:
COPY INTO
DELETE FROM
INSERT INTO
INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE DIRECTORY with Hive format
LOAD DATA
MERGE INTO
UPDATE

Data retrieval statements


You use a query to retrieve rows from one or more tables according to the specified clauses. The full syntax and
brief description of supported clauses are explained in the Query article. The related SQL statements SELECT
and VALUES are also included in this section.
Query
SELECT
VALUES
Databricks Runtime also provides the ability to generate the logical and physical plan for a query using the
EXPLAIN statement.

EXPLAIN

Delta Lake statements


You use Delta Lake SQL statements to manage tables stored in Delta Lake format:
CACHE SELECT
CONVERT TO DELTA
DESCRIBE HISTORY
FSCK REPAIR TABLE
OPTIMIZE
REORG TABLE
RESTORE
VACUUM
For details on using Delta Lake statements, see Delta Lake guide.

Auxiliary statements
You use auxiliary statements to collect statistics, manage caching for Apache Spark cache, explore metadata, set
configurations, and manage resources:
Analyze statement
Apache Spark Cache statements
Describe statements
Show statements
Configuration management
Resource management
Analyze statement
ANALYZE TABLE
Apache Spark Cache statements
CACHE TABLE
CLEAR CACHE
REFRESH
REFRESH FUNCTION
REFRESH TABLE
UNCACHE TABLE
Describe statements
DESCRIBE CATALOG
DESCRIBE CREDENTIAL
DESCRIBE DATABASE
DESCRIBE FUNCTION
DESCRIBE LOCATION
DESCRIBE QUERY
DESCRIBE RECIPIENT
DESCRIBE SCHEMA
DESCRIBE SHARE
DESCRIBE TABLE
Show statements
LIST
SHOW ALL IN SHARE
SHOW CATALOGS
SHOW COLUMNS
SHOW CREATE TABLE
SHOW CREDENTIALS
SHOW DATABASES
SHOW FUNCTIONS
SHOW GROUPS
SHOW LOCATIONS
SHOW PARTITIONS
SHOW RECIPIENTS
SHOW SCHEMAS
SHOW SHARES
SHOW TABLE
SHOW TABLES
SHOW TBLPROPERTIES
SHOW USERS
SHOW VIEWS
Configuration management
RESET
SET
SET TIMEZONE
USE CATALOG
USE DATABASE
USE SCHEMA
Resource management
ADD ARCHIVE
ADD FILE
ADD JAR
LIST ARCHIVE
LIST FILE
LIST JAR

Security statements
You use security SQL statements to manage access to data:
ALTER GROUP
CREATE GROUP
DENY
DROP GROUP
GRANT
GRANT SHARE
REPAIR PRIVILEGES
REVOKE
REVOKE SHARE
SHOW GRANTS
SHOW GRANTS ON SHARE
SHOW GRANTS TO RECIPIENT
For details on using these statements, see Data object privileges.
How to read a syntax diagram
7/21/2022 • 2 minutes to read

This section describes the various patterns of syntax used throughout the Databricks Runtime reference.

Base components
Keyword
Token
Clause
Argument
Keyword

SELECT

Keywords in SQL are always capitalized in this document, but they are case insensitive.
Token

( )
< >
.
*
,

The SQL language includes round braces ( ( , ) ) as well as angled braces ( < , > ), dots ( . ), commas ( , ), and
a few other characters. When these characters are present in a syntax diagram you must enter them as is.
Clause

LIMIT clause

SELECT named_expression

named_expression
expression AS alias

A clause represents a named subsection of syntax. A local clause is described in the same syntax diagram that
invokes it. If the clause is common, it links to another section of the Databricks Runtime reference. Some clauses
are known by their main keyword and are depicted with a capital keyword followed by clause. Other clauses are
always lower case and use underscore ( _ ) where appropriate. Local clauses are fully explained within the
following section. All other clauses have a short description with a link to the main page.
Argument

mapExpr

Arguments to functions are specified in camelCase. Databricks Runtime describes the meaning of arguments in
the Arguments section.
Chain of tokens
SELECT expr

Components separated by whitespace must be entered in order, unconditionally, and be separated only by
whitespace or comments. Databricks Runtime supports comments of the form /* ... */ (C-style), and -- ...
, which extends to end of the line.

Choice
Specifies a fork in the syntax.
Mandatory choice

{ INT | INTEGER }

Curly braces { ... } mean you must specify exactly one of the multiple components. Each choice is separated
by a | .
Optional choice

[ ASC | DESC ]

Square brackets [ ... ] indicate you can choose at most one of multiple components. Each choice is separated
by a | .

Grouping
{ SELECT expr }

{ SELECT
expr }

Curly braces { ... } specify that you must provide all the embedded components. If a syntax diagram spans
multiple lines, this form clarifies that it depicts the same syntax.

Option
[ NOT NULL ]

Square brackets [...] specify that the enclosed components are optional.

Repetition
col_option [...]

col_alias [, ...]

{ expr [ AS ] col_alias } [, ...]

The [...] ellipsis notation indicates that you can repeat the immediately preceding component, grouping, or
choice multiple times. If the ellipsis is preceded by another character, such as a separated dot [. ...] , or a
comma [, ...] , you must separate each repetition by that character.
Names
7/21/2022 • 5 minutes to read

Identifies different kinds of objects in Databricks Runtime.

Catalog name
Identifies a catalog. A catalog provides a grouping of objects which can be further subdivided into schemas.
Syntax

catalog_identifier

Parameters
catalog_identifier : An identifier that uniquely identifies the catalog.
Examples

> USE CATALOG hive_metastore;

> CREATE CATALOG mycatalog;

Schema name
Identifies a schema. A schema provides a grouping of objects in a catalog.
Syntax

[ catalog_name . ] schema_identifier

Parameters
catalog_name : The name of an existing catalog.
schema_identifier : An identifier that uniquely identifies the schema.
Examples

> USE SCHEMA default;

> CREATE SCHEMA my_sc;

Database name
A synonym for schema name.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Table name
Identifies a table object. The table can be qualified with a schema name or unqualified using a simple identifier.
Syntax

{ [ schema_name . ] table_identifier [ temporal_spec ] |


{ file_format | `file_format` } . `path_to_table` [ temporal_spec ] [ credential_spec ] }

temporal_spec
{
@ timestamp_encoding |
@V version |
[ FOR ] { SYSTEM_TIMESTAMP | TIMESTAMP } AS OF timestamp_expression |
[ FOR ] { SYSTEM_VERSION | VERSION } AS OF version
}

credential_spec
WITH ( CREDENTIAL credential_name )

Parameters
schema_name : A qualified or unqualified schema name that contains the table.
table_identifier : An identifier that specifies the name of the table or table_alias.
file_format : One of json , csv , avro , parquet , orc , binaryFile , text , delta (case insensitive).
path_to_table : The location of the table in the file system. You must have the ANY_FILE permission to
use this syntax.
temporal_spec : When used references a Delta table at the specified point in time or version.
You can use a temporal specification only within the context of a query or a MERGE USING.
@ timestamp_encoding : A positive Bigint literal that encodes a timestamp in yyyyMMddHHmmssSSS
format.
@V version : A positive Integer literal identifying the version of the Delta table.
timestamp_expression : A simple expression that evaluates to a TIMESTAMP. timestamp_expression
must be a constant expression, but may contain current_date() or current_timestamp() .
version : An Integer literal or String literal identifying the version of the Delta table.
credential_spec
You can use an applicable credential to gain access to a path_to_table which is not embedded in an
external location.
credential_name
The name of the credential used to access the storage location.
If the name is unqualified and does not reference a known table alias, Databricks Runtime first attempts to
resolve the table in the current schema.
If the name is qualified with a schema, Databricks Runtime attempts to resolve the table in the current catalog.
Databricks Runtime raises an error if you use a temporal_spec for a table that is not in Delta Lake format.
Examples
`Employees`

employees

hr.employees

`hr`.`employees`

hive_metastore.default.tab

system.information_schema.columns

delta.`somedir/delta_table`

`csv`.`spreadsheets/data.csv`

`csv`.`spreadsheets/data.csv` WITH (CREDENTIAL some_credential)

View name
Identifies a view. The view can be qualified with a schema name or unqualified using a simple identifier.
Syntax

[ schema_name . ] view_identifier

Parameters
schema_name : The qualified or unqualified name of the schema that contains the view.
view_identifier : An identifier that specifies the name of the view or the view identifier of a CTE.
Examples

`items`

items

hr.items

`hr`.`items`

Column name
Identifies a column within a table or view. The column can be qualified with a table or view name, or unqualified
using a simple identifier.
Syntax

[ { table_name | view_name } . ] column_identifier

Parameters
table_name : A qualified or unqualified table name of the table containing the column.
view_name : A qualified or unqualified view name of the view containing the column.
column_identifier : An identifier that specifies the name of the column.
The identified column must exist within the table or view.
Databricks Runtime supports a special _metadata column. This pseudo column of type struct is part of every
table and can be used to retrieve metadata information about the rows in the table.
Examples

-- An unqualified column name


> SELECT c1 FROM VALUES(1) AS T(c1);
c1
1

-- A qualified column name


> SELECT T.c1 FROM VALUES(1) AS T(c1);
c1
1

-- Using _metadata to retrieve information about rows retrieved from T.


> CREATE TABLE T(c1 INT);
> INSERT INTO T VALUES(1);
> SELECT T._metadata.file_size;
574

Field name
Identifies a field within a struct. The field must be qualified with the path up to the struct containing the field.
Syntax

expr { . field_identifier [. ...] }

Parameters
expr : An expression of type STRUCT.
field_identifier : An identifier that specifies the name of the field.
A deeply nested field can be referenced by specifying the field identifier along the path to the root struct.
Examples

> SELECT addr.address.name


FROM VALUES (named_struct('address', named_struct('number', 5, 'name', 'Main St'),
'city', 'Springfield')) as t(addr);
Main St

Function name
Identifies a function. The function can be qualified with a schema name, or unqualified using a simple identifier.
Syntax

[ schema_name . ] function_identifier

Parameters
schema_name : A qualified or unqualified schema name that contains the function.
function_identifier : An identifier that specifies the name of the function.
Examples
`math`.myplus

myplus

math.`myplus`

Parameter name
Identifies a parameter in the body of a SQL user-defined function (SQL UDF). The function can be qualified with
a function identifier, or unqualified using a simple identifier.
Syntax

[ function_identifier . ] parameter_identifier

Parameters
function_identifier : An identifier that specifies the name of a function.
parameter_identifier : An identifier that specifies the name of a parameter.
Examples

CREATE FUNCTION area(x INT, y INT) RETURNS INT


RETURN area.x + y;

Table alias
Labels a table reference, query, table function, or other form of a relation.
Syntax

[ AS ] table_identifier [ ( column_identifier1 [, ...] ) ]

Parameters
table_identifier : An identifier that specifies the name of the table.
column_identifierN : An optional identifier that specifies the name of the column.
If you provide column identifiers, their number must match the number of columns in the matched relation.
If you don’t provide column identifiers, their names are inherited from the labeled relation.
Examples

> SELECT a, b FROM VALUES (1, 2) AS t(a, b);


a b
1 2

> DELETE FROM emp AS e WHERE e.c1 = 5;

Column alias
Labels the result of an expression in a SELECT list for reference.
If the expression is a table valued generator function, the alias labels the list of columns produced.
Syntax

[AS] column_identifier

[AS] ( column_identifier [, ...] )

Parameters
column_identifier : An identifier that specifies the name of the column.
While column aliases need not be unique within the select list, uniqueness is a requirement to reference an alias
by name.
Examples

> SELECT 1 AS a;
a
1

> SELECT 1 a, 2 b;
a b
1 2

> SELECT 1 AS `a`;


a
1

> SELECT posexplode(array(2)) AS (i, a);


i a
0 2

> SELECT a + a FROM (SELECT 1 AS a);


a
2

Credential name
Identifies a credential to access storage at an external location.
Syntax

credential_identifier

Parameters
credential_identifier : An unqualified identifier that uniquely identifies the credential.
Examples

Location name
Identifies an external storage location.
Syntax

location_identifier

Parameters
location_identifier : An unqualified identifier that uniquely identifies the location.
Examples

`s3-json-data`

s3_json_data

Share name
Identifies a share to access data shared by a provider.
Syntax

share_identifier

Parameters
share_identifier : An unqualified identifier that uniquely identifies the share.
Examples

`public info`

`public-info`

public_info

Recipient name
Identifies a recipient for a share.
Syntax

recipient_identifier

Parameters
recipient_identifier : An unqualified identifier that uniquely identifies the recipient.
Examples

`Good Corp`

`Good-corp`

Good_Corp

Related
Identifiers
File metadata column
Databricks Runtime expression
7/21/2022 • 2 minutes to read

An expression is a formula that computes a result based on literals or references to columns, fields, or variables,
using functions or operators.

Syntax
{ literal |
column_reference |
field_reference |
parameter_reference |
CAST expression |
CASE expression |
expr operator expr |
operator expr |
expr [ expr ] |
function_invocation |
( expr ) |
scalar_subquery }

scalar_subquery
( query )

The brackets [ expr ] are actual brackets and do not indicate optional syntax.

Parameters
literal
A literal of a type described in Data types.
column_reference
A reference to a column in a table or column alias.
field_reference
A reference to a field in a STRUCT type.
parameter_reference
A reference to a parameter of a SQL user defined function from within the body of the function. The
reference may use the unqualified name of the parameter or qualify the name with the function name.
Parameters constitute the outermost scope when resolving identifiers.
CAST expression
An expression casting the argument to a different type.
CASE expression
An expression allowing for conditional evaluation.
expr
An expression itself which is combined with an operator , or which is an argument to a function.
operator
A unary or binary operator.
[ expr ]
A reference to an array element or a map key.
function_invocation
An expression invoking a built-in or user defined function.
The pages for each builtin function and operator describe the data types their parameters expect.
Databricks Runtime performs implicit casting to expected types using SQL data type rules. If an operator
or function is invalid for the provided argument, Databricks Runtime raises an error. Functions also
document which parameters are mandatory or optional.
When invoking a SQL user defined function you may omit arguments for trailing parameters if the
parameters have defined defaults.
( expr )
Enforced precedence that overrides operator precedence.
scalar_subquery :
( query )
An expression based on a query that must return a single column and at most one row.
The pages for each function and operator describe the data types their parameters expect. Databricks Runtime
performs implicit casting to expected types using SQL data type rules. If an operator or function is invalid for the
provided argument, Databricks Runtime raises an error.

Constant expression
An expression that is based only on literals or deterministic functions with no arguments. Databricks Runtime
can execute the expression and use the resulting constant where ordinarily literals are required.

Boolean expression
An expression with a result type of BOOLEAN . A Boolean expression is also sometimes referred to as a condition
or a predicate .

Scalar subquery
An expression of the form ( query ) . The query must return a table that has one column and at most one row.
If the query returns no row, the result is NULL . If the query returns more than one row, Databricks Runtime
returns an error. Otherwise, the result is the value returned by the query.

Simple expression
An expression that does not contain a query , such as a scalar subquery or an EXISTS predicate.

Examples
> SELECT 1;
1

> SELECT (SELECT 1) + 1;


2

> SELECT 1 + 1;
2

> SELECT 2 * (1 + 2);


6

> SELECT 2 * 1 + 2;
4

> SELECT substr('Spark', 1, 2);


Sp

> SELECT c1 + c2 FROM VALUES(1, 2) AS t(c1, c2);


3

> SELECT a[1] FROM VALUES(array(10, 20)) AS T(a);


20

> SELECT true;


true
Reserved words and schemas
7/21/2022 • 2 minutes to read

Reserved words are literals used as keywords by the SQL language which should not be used as identifiers to
avoid unexpected behavior.
Reserved schema names have special meaning to Databricks Runtime.

Reserved words
Databricks Runtime does not formally disallow any specific literals from being used as identifiers.
However, to use any of the following list of identifiers as a table alias, you must surround the name with back-ticks (`).
ANTI
CROSS
EXCEPT
FULL
INNER
INTERSECT
JOIN
LATERAL
LEFT
MINUS
NATURAL
ON
RIGHT
SEMI
UNION
USING

Special words in expressions


The following list of identifiers can be used anywhere, but Databricks Runtime treats them preferentially as
keywords within expressions in certain contexts:
NULL

The SQL NULL value.


DEFAULT

Future use as a column default.


TRUE

The SQL boolean true value.


FALSE
The SQL boolean false value.
To use these as column names, surround the name with back-ticks (for example, `NULL` or `DEFAULT` ) or qualify the column names with a table name or alias.
Databricks Runtime uses the CURRENT_ prefix to refer to some configuration settings or other context
variables. The underbar ( _ ) prefix is intended for Databricks Runtime pseudo columns. An existing pseudo
column is the _metadata column.
Identifiers with these prefixes are not treated preferentially. However, avoid columns or column aliases using
these prefixes to avoid unexpected behavior.

Reserved schema names


Databricks Runtime reserves the following list of schema names for current or future use:
BUILTIN

Future use to qualify builtin functions.


SESSION

Future use to qualify temporary views and functions.


INFORMATION_SCHEMA

Holds the SQL Standard information schema.


Database names starting with SYS

Avoid using these names.

ANSI Reserved words


Databricks Runtime does not enforce ANSI reserved words. The following list of SQL2016 keywords is
provided for informational purposes only.
A
ALL, ALTER, AND, ANY, ARRAY, AS, AT, AUTHORIZATION
B
BETWEEN, BOTH, BY
C
CASE, CAST, CHECK, COLLATE, COLUMN, COMMIT, CONSTRAINT, CREATE, CROSS, CUBE, CURRENT,
CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, CURRENT_USER
D
DELETE, DESCRIBE, DISTINCT, DROP
E
ELSE, END, ESCAPE, EXCEPT, EXISTS, EXTERNAL, EXTRACT
F
FALSE, FETCH, FILTER, FOR, FOREIGN, FROM, FULL, FUNCTION
G
GLOBAL, GRANT, GROUP, GROUPING
H
HAVING
I
IN, INNER, INSERT, INTERSECT, INTERVAL, INTO, IS
J
JOIN
L
LEADING, LEFT, LIKE, LOCAL
N
NATURAL, NO, NOT, NULL
O
OF, ON, ONLY, OR, ORDER, OUT, OUTER, OVERLAPS
P
PARTITION, POSITION, PRIMARY
R
RANGE, REFERENCES, REVOKE, RIGHT, ROLLBACK, ROLLUP, ROW, ROWS
S
SELECT, SESSION_USER, SET, SOME, START
T
TABLE, TABLESAMPLE, THEN, TIME, TO, TRAILING, TRUE, TRUNCATE
U
UNION, UNIQUE, UNKNOWN, UPDATE, USER, USING
V
VALUES
W
WHEN, WHERE, WINDOW, WITH

Examples
-- Using SQL keywords
> CREATE TEMPORARY VIEW where(where) AS (VALUES (1));

> SELECT where from FROM where select;


1

-- Usage of NULL
> SELECT NULL, `null`, T.null FROM VALUES(1) AS T(null);
NULL 1 1

-- current_date is eclipsed by the column alias T.current_date


> SELECT (SELECT current_date), current_date, current_date()
FROM VALUES(1) AS T(current_date);
2021-10-23 1 2021-10-23

-- Reserved keyword ANTI cannot be used as table alias


> SELECT * FROM VALUES(1) AS ANTI;
Error in query: no viable alternative at input 'ANTI'

> SELECT * FROM VALUES(1) AS `ANTI`;


1

Related articles
names
identifiers
Data types
7/21/2022 • 5 minutes to read

Supported data types


Databricks Runtime SQL and DataFrames support the following data types:

DATA TYPE | DESCRIPTION
BIGINT | Represents 8-byte signed integer numbers.
BINARY | Represents byte sequence values.
BOOLEAN | Represents Boolean values.
DATE | Represents values comprising values of fields year, month and day, without a time-zone.
DECIMAL(p,s) | Represents numbers with maximum precision p and fixed scale s.
DOUBLE | Represents 8-byte double-precision floating point numbers.
FLOAT | Represents 4-byte single-precision floating point numbers.
INT | Represents 4-byte signed integer numbers.
INTERVAL intervalQualifier | Represents intervals of time either on a scale of seconds or months.
VOID | Represents the untyped NULL.
SMALLINT | Represents 2-byte signed integer numbers.
STRING | Represents character string values.
TIMESTAMP | Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local timezone.
TINYINT | Represents 1-byte signed integer numbers.
ARRAY<elementType> | Represents values comprising a sequence of elements with the type of elementType.
MAP<keyType,valueType> | Represents values comprising a set of key-value pairs.
STRUCT<[fieldName:fieldType [NOT NULL][COMMENT str][, …]]> | Represents values with the structure described by a sequence of fields.
Data type classification
Data types are grouped into the following classes:
Integral numeric types represent whole numbers:
TINYINT
SMALLINT
INT
BIGINT
Exact numeric types represent base-10 numbers:
Integral numeric
DECIMAL
Binary floating point types use exponents and a binary representation to cover a large range of numbers:
FLOAT
DOUBLE
Numeric types represent all numeric data types:
Exact numeric
Binary floating point
Date-time types represent date and time components:
DATE
TIMESTAMP
Simple types are types defined by holding singleton values:
Numeric
Date-time
BINARY
BOOLEAN
INTERVAL
STRING
Complex types are composed of multiple components of complex or simple types:
ARRAY
MAP
STRUCT

Language mappings
Scala
Spark SQL data types are defined in the package org.apache.spark.sql.types . You access them by importing the
package:

import org.apache.spark.sql.types._

SQL TYPE | DATA TYPE | VALUE TYPE | API TO ACCESS OR CREATE DATA TYPE
TINYINT | ByteType | Byte | ByteType
SMALLINT | ShortType | Short | ShortType
INT | IntegerType | Int | IntegerType
BIGINT | LongType | Long | LongType
FLOAT | FloatType | Float | FloatType
DOUBLE | DoubleType | Double | DoubleType
DECIMAL(p,s) | DecimalType | java.math.BigDecimal | DecimalType
STRING | StringType | String | StringType
BINARY | BinaryType | Array[Byte] | BinaryType
BOOLEAN | BooleanType | Boolean | BooleanType
TIMESTAMP | TimestampType | java.sql.Timestamp | TimestampType
DATE | DateType | java.sql.Date | DateType
year-month interval | YearMonthIntervalType | java.time.Period | YearMonthIntervalType (3)
day-time interval | DayTimeIntervalType | java.time.Duration | DayTimeIntervalType (3)
ARRAY | ArrayType | scala.collection.Seq | ArrayType(elementType [, containsNull]). (2)
MAP | MapType | scala.collection.Map | MapType(keyType, valueType [, valueContainsNull]). (2)
STRUCT | StructType | org.apache.spark.sql.Row | StructType(fields). fields is a Seq of StructField. (4)
StructField | The value type of the data type of this field (for example, Int for a StructField with the data type IntegerType) | StructField(name, dataType [, nullable]). (4)

Java
Spark SQL data types are defined in the package org.apache.spark.sql.types . To access or create a data type,
use factory methods provided in org.apache.spark.sql.types.DataTypes .

SQL TYPE | DATA TYPE | VALUE TYPE | API TO ACCESS OR CREATE DATA TYPE
TINYINT | ByteType | byte or Byte | DataTypes.ByteType
SMALLINT | ShortType | short or Short | DataTypes.ShortType
INT | IntegerType | int or Integer | DataTypes.IntegerType
BIGINT | LongType | long or Long | DataTypes.LongType
FLOAT | FloatType | float or Float | DataTypes.FloatType
DOUBLE | DoubleType | double or Double | DataTypes.DoubleType
DECIMAL(p,s) | DecimalType | java.math.BigDecimal | DataTypes.createDecimalType() or DataTypes.createDecimalType(precision, scale).
STRING | StringType | String | DataTypes.StringType
BINARY | BinaryType | byte[] | DataTypes.BinaryType
BOOLEAN | BooleanType | boolean or Boolean | DataTypes.BooleanType
TIMESTAMP | TimestampType | java.sql.Timestamp | DataTypes.TimestampType
DATE | DateType | java.sql.Date | DataTypes.DateType
year-month interval | YearMonthIntervalType | java.time.Period | YearMonthIntervalType (3)
day-time interval | DayTimeIntervalType | java.time.Duration | DayTimeIntervalType (3)
ARRAY | ArrayType | java.util.List | DataTypes.createArrayType(elementType [, containsNull]). (2)
MAP | MapType | java.util.Map | DataTypes.createMapType(keyType, valueType [, valueContainsNull]). (2)
STRUCT | StructType | org.apache.spark.sql.Row | DataTypes.createStructType(fields). fields is a List or array of StructField. (4)
StructField | The value type of the data type of this field (for example, int for a StructField with the data type IntegerType) | DataTypes.createStructField(name, dataType, nullable). (4)

Python
Spark SQL data types are defined in the package pyspark.sql.types . You access them by importing the package:

from pyspark.sql.types import *


SQL TYPE | DATA TYPE | VALUE TYPE | API TO ACCESS OR CREATE DATA TYPE
TINYINT | ByteType | int or long. (1) | ByteType()
SMALLINT | ShortType | int or long. (1) | ShortType()
INT | IntegerType | int or long | IntegerType()
BIGINT | LongType | long (1) | LongType()
FLOAT | FloatType | float (1) | FloatType()
DOUBLE | DoubleType | float | DoubleType()
DECIMAL(p,s) | DecimalType | decimal.Decimal | DecimalType()
STRING | StringType | string | StringType()
BINARY | BinaryType | bytearray | BinaryType()
BOOLEAN | BooleanType | bool | BooleanType()
TIMESTAMP | TimestampType | datetime.datetime | TimestampType()
DATE | DateType | datetime.date | DateType()
year-month interval | YearMonthIntervalType | Not supported | Not supported
day-time interval | DayTimeIntervalType | datetime.timedelta | DayTimeIntervalType (3)
ARRAY | ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]). (2)
MAP | MapType | dict | MapType(keyType, valueType, [valueContainsNull]). (2)
STRUCT | StructType | list or tuple | StructType(fields). fields is a Seq of StructField. (4)
StructField | The value type of the data type of this field (for example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). (4)

R
SQL type  Data type  Value type  API to access or create data type

TINYINT ByteType integer (1) ‘byte’

SMALLINT ShortType integer (1) ‘short’



INT IntegerType integer ‘integer’

BIGINT LongType integer (1) ‘long’

FLOAT FloatType numeric (1) ‘float’

DOUBLE DoubleType numeric ‘double’

DECIMAL(p,s) DecimalType Not supported Not supported

STRING StringType character ‘string’

BINARY BinaryType raw ‘binary’

BOOLEAN BooleanType logical ‘bool’

TIMESTAMP TimestampType POSIXct ‘timestamp’

DATE DateType Date ‘date’

year-month interval YearMonthIntervalType Not supported Not supported

day-time interval DayTimeIntervalType Not supported Not supported

ARRAY  ArrayType  vector or list  list(type=’array’, elementType=elementType, containsNull=[containsNull]). (2)

MAP  MapType  environment  list(type=’map’, keyType=keyType, valueType=valueType, valueContainsNull=[valueContainsNull]). (2)

STRUCT  StructType  named list  list(type=’struct’, fields=fields). fields is a Seq of StructField. (4)

StructField  The value type of the data type of this field (for example, integer for a StructField with the data type IntegerType)  list(name=name, type=dataType, nullable=[nullable]). (4)

(1) Numbers are converted to the domain at runtime. Make sure that numbers are within range.
(2) The optional value defaults to TRUE .
(3) Interval types
YearMonthIntervalType([startField,] endField) : Represents a year-month interval made up of a contiguous subset of the fields MONTH and YEAR.
startField is the leftmost field, and endField is the rightmost field of the type. Valid values of
startField and endField are 0 (MONTH) and 1 (YEAR) .
DayTimeIntervalType([startField,] endField) : Represents a day-time interval made up of a contiguous subset of the fields DAY, HOUR, MINUTE, and SECOND.
startField is the leftmost field, and endField is the rightmost field of the type. Valid values of
startField and endField are 0 (DAY) , 1 (HOUR) , 2 (MINUTE) , and 3 (SECOND) .
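For example, a minimal SQL sketch (the interval literals and the typeof output shown here are illustrative, not taken from the table above):

-- A year-month interval made up of the YEAR and MONTH fields
> SELECT typeof(INTERVAL '2-3' YEAR TO MONTH);
  interval year to month

-- A day-time interval made up of the DAY through SECOND fields
> SELECT typeof(INTERVAL '1 02:03:04' DAY TO SECOND);
  interval day to second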

(4) StructType

StructType(fields) Represents values with the structure described by a sequence, list, or array of
StructFields (fields). Two fields with the same name are not allowed.
StructField(name, dataType, nullable) Represents a field in a StructType . The name of a field is indicated
by name . The data type of a field is indicated by dataType . nullable indicates whether values of this field
can be null; the default is true.

Related articles
Special floating point values
SQL data type rules
7/21/2022 • 6 minutes to read

Databricks Runtime uses several rules to resolve conflicts among data types:
Promotion safely expands a type to a wider type.
Implicit downcasting narrows a type; it is the opposite of promotion.
Implicit crosscasting transforms a type into a type of another type family.
You can also explicitly cast between many types:
The cast function casts between most types and returns an error if it cannot.
The try_cast function works like the cast function but returns NULL when passed invalid values (see the example below).
Other built-in functions cast between types using provided format directives.
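A minimal sketch of the difference between cast and try_cast (the input values are illustrative):

-- cast raises an error for values it cannot convert; try_cast returns NULL instead.
> SELECT cast('5' AS INT);
  5

> SELECT try_cast('hello' AS INT);
  NULL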

Type promotion
Type promotion is the process of casting a type into another type of the same type family which contains all
possible values of the original type. Therefore, type promotion is a safe operation. For example, TINYINT has a
range from -128 to 127 , so all of its possible values can be safely promoted to INTEGER .
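For instance, a small illustration (the typeof output shown is what promotion is expected to yield):

-- The TINYINT literal 1Y is promoted to INT before the addition.
> SELECT typeof(1Y + 1);
  int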

Type precedence list


The type precedence list defines whether values of a given data type can be implicitly promoted to another data
type.

Data type   Precedence list (from narrowest to widest)

TINYINT     TINYINT -> SMALLINT -> INT -> BIGINT -> DECIMAL -> FLOAT (1) -> DOUBLE

SMALLINT    SMALLINT -> INT -> BIGINT -> DECIMAL -> FLOAT (1) -> DOUBLE

INT         INT -> BIGINT -> DECIMAL -> FLOAT (1) -> DOUBLE

BIGINT      BIGINT -> DECIMAL -> FLOAT (1) -> DOUBLE

DECIMAL     DECIMAL -> FLOAT (1) -> DOUBLE

FLOAT       FLOAT (1) -> DOUBLE

DOUBLE      DOUBLE

DATE        DATE -> TIMESTAMP

TIMESTAMP   TIMESTAMP

ARRAY       ARRAY (2)

BINARY      BINARY

BOOLEAN     BOOLEAN

INTERVAL    INTERVAL

MAP         MAP (2)

STRING      STRING

STRUCT      STRUCT (2)

(1) For least common type resolution, FLOAT is skipped to avoid loss of precision.
(2) For a complex type the precedence rule applies recursively to its component elements.
Strings and NULL
Special rules apply for STRING and untyped NULL :
NULL can be promoted to any other type.
STRING can be promoted to BIGINT , BINARY , BOOLEAN , DATE , DOUBLE , INTERVAL , and TIMESTAMP . If the
actual string value cannot be cast to the least common type, Databricks Runtime raises a runtime error. When
promoting to INTERVAL the string value must match the interval's units.
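A brief illustration of these rules (the literals and outputs are illustrative):

-- Untyped NULL takes on the type of the other argument.
> SELECT typeof(coalesce(NULL, 1Y));
  tinyint

-- The STRING is promoted to DATE for the comparison.
> SELECT DATE'2022-01-01' < '2022-02-02';
  true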
Type precedence graph
This is a graphical depiction of the precedence hierarchy, combining the type precedence list and strings and
NULLs rules.

Least common type resolution


The least common type from a set of types is the narrowest type reachable from the type precedence graph by
all elements of the set of types.
The least common type resolution is used to:
Decide whether a function that expects a parameter of a given type can be invoked using an argument of a
narrower type.
Derive the argument type for a function that expects a shared argument type for multiple parameters, such
as coalesce, in, least, or greatest.
Derive the operand types for operators such as arithmetic operations or comparisons.
Derive the result type for expressions such as the case expression.
Derive the element, key, or value types for array and map constructors.
Derive the result type of UNION, INTERSECT, or EXCEPT set operators.
Special rules are applied if the least common type resolves to FLOAT . If any of the contributing types is an exact
numeric type ( TINYINT , SMALLINT , INTEGER , BIGINT , or DECIMAL ) the least common type is pushed to DOUBLE
to avoid potential loss of digits.

Implicit downcasting and crosscasting


Databricks Runtime employs these forms of implicit casting only on function and operator invocation, and only
where it can unambiguously determine the intent.
Implicit downcasting
Implicit downcasting automatically casts a wider type to a narrower type without requiring you to specify
the cast explicitly. Downcasting is convenient, but it carries the risk of unexpected runtime errors if the
actual value cannot be represented in the narrower type.
Downcasting applies the type precedence list in reverse order.
Implicit crosscasting
Implicit crosscasting casts a value from one type family to another without requiring you to specify the
cast explicitly.
Databricks Runtime supports implicit crosscasting from:
Any simple type, except BINARY , to STRING .
A STRING to any simple type.
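A small sketch of both behaviors (the literals below are illustrative; the outputs are what the rules above predict):

-- Crosscasting: the numeric argument is implicitly cast to STRING.
> SELECT length(12345);
  5

-- Downcasting: the TIMESTAMP is narrowed to DATE and the BIGINT to INTEGER.
> SELECT date_add(TIMESTAMP'2022-03-01 12:00:00', 2L);
  2022-03-03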

Casting on function invocation


Given a resolved function or operator, the following rules apply, in the order they are listed, for each parameter
and argument pair:
If a supported parameter type is part of the argument’s type precedence graph, Databricks Runtime
promotes the argument to that parameter type.
In most cases the function description explicitly states the supported types or chain, such as “any numeric
type”.
For example, sin(expr) operates on DOUBLE but will accept any numeric.
If the expected parameter type is a STRING and the argument is a simple type, Databricks Runtime
crosscasts the argument to the string parameter type.
For example, substr(str, start, len) expects str to be a STRING . Instead, you can pass a numeric or
datetime type.
If the argument type is a STRING and the expected parameter type is a simple type, Databricks Runtime
crosscasts the string argument to the widest supported parameter type.
For example, date_add(date, days) expects a DATE and an INTEGER .
If you invoke date_add() with two STRING s, Databricks Runtime crosscasts the first STRING to DATE and
the second STRING to an INTEGER .
If the function expects a numeric type, such as an INTEGER , or a DATE type, but the argument is a more
general type, such as a DOUBLE or TIMESTAMP , Databricks Runtime implicitly downcasts the argument to
that parameter type.
For example, date_add(date, days) expects a DATE and an INTEGER .
If you invoke date_add() with a TIMESTAMP and a BIGINT , Databricks Runtime downcasts the TIMESTAMP
to DATE by removing the time component and the BIGINT to an INTEGER .
Otherwise, Databricks Runtime raises an error.
Examples
The coalesce function accepts any set of argument types as long as they share a least common type.
The result type is the least common type of the arguments.

-- The least common type of TINYINT and BIGINT is BIGINT


> SELECT typeof(coalesce(1Y, 1L, NULL));
BIGINT

-- INTEGER and DATE do not share a precedence chain or support crosscasting in either direction.
> SELECT typeof(coalesce(1, DATE'2020-01-01'));
Error: Incompatible types [INT, DATE]

-- Both are ARRAYs and the elements have a least common type
> SELECT typeof(coalesce(ARRAY(1Y), ARRAY(1L)))
ARRAY<BIGINT>

-- The least common type of INT and FLOAT is DOUBLE


> SELECT typeof(coalesce(1, 1F))
DOUBLE

> SELECT typeof(coalesce(1L, 1F))


DOUBLE

> SELECT typeof(coalesce(1BD, 1F))


DOUBLE

-- The least common type between an INT and STRING is BIGINT


> SELECT typeof(coalesce(5, '6'));
BIGINT

-- The least common type is a BIGINT, but the value is not BIGINT.
> SELECT coalesce('6.1', 5);
Error: 6.1 is not a BIGINT

-- The least common type between a DECIMAL and a STRING is a DOUBLE


> SELECT typeof(coalesce(1BD, '6'));
DOUBLE

The substring function expects arguments of type STRING for the string and INTEGER for the start and length
parameters.
-- Promotion of TINYINT to INTEGER
> SELECT substring('hello', 1Y, 2);
he

-- No casting
> SELECT substring('hello', 1, 2);
he

-- Casting of a literal string


> SELECT substring('hello', '1', 2);
he

-- Downcasting of a BIGINT to an INT


> SELECT substring('hello', 1L, 2);
he

-- Crosscasting from STRING to INTEGER


> SELECT substring('hello', str, 2)
FROM VALUES(CAST('1' AS STRING)) AS T(str);
he

-- Crosscasting from INTEGER to STRING


> SELECT substring(12345, 2, 2);
23

|| (CONCAT) allows implicit crosscasting to string.

-- A numeric is cast to STRING


> SELECT 'This is a numeric: ' || 5.4E10;
This is a numeric: 5.4E10

-- A date is cast to STRING


> SELECT 'This is a date: ' || DATE'2021-11-30';
This is a date: 2021-11-30

date_add can be invoked with a TIMESTAMP or BIGINT due to implicit downcasting.

> SELECT date_add(TIMESTAMP'2011-11-30 08:30:00', 5L);


2011-12-05

date_add can be invoked with STRING s due to implicit crosscasting.

> SELECT date_add('2011-11-30 08:30:00', '5');


2011-12-05

Related
cast
Data types
Functions
try_cast
Datetime patterns
7/21/2022 • 6 minutes to read

There are several common scenarios for datetime usage in Databricks Runtime:
CSV and JSON data sources use the pattern string for parsing and formatting datetime content.
Datetime functions that convert STRING to and from DATE or TIMESTAMP . For example:
unix_timestamp
date_format
to_unix_timestamp
from_unixtime
to_date
to_timestamp
from_utc_timestamp
to_utc_timestamp

Pattern table
Databricks Runtime uses pattern letters in the following table for date and timestamp parsing and formatting:

Symbol  Meaning  Presentation  Examples

G era text AD; Anno Domini

y year year 2020; 20

D day-of-year number(3) 189

M/L month-of-year month 7; 07; Jul; July

d day-of-month number(3) 28

Q/q quarter-of-year number/text 3; 03; Q3; 3rd quarter

E day-of-week text Tue; Tuesday

F  aligned day of week in month  number(1)  3

a  am-pm-of-day  am-pm  PM

h  clock-hour-of-am-pm (1-12)  number(2)  12

K hour-of-am-pm (0-11) number(2) 0

k clock-hour-of-day (1-24) number(2) 0



H hour-of-day (0-23) number(2) 0

m minute-of-hour number(2) 30

s second-of-minute number(2) 55

S fraction-of-second fraction 978

V  time-zone ID  zone-id  America/Los_Angeles; Z; -08:30

z  time-zone name  zone-name  Pacific Standard Time; PST

O  localized zone-offset  offset-O  GMT+8; GMT+08:00; UTC-08:00;

X  zone-offset ‘Z’ for zero  offset-X  Z; -08; -0830; -08:30; -083015; -08:30:15;

x  zone-offset  offset-x  +0000; -08; -0830; -08:30; -083015; -08:30:15;

Z zone-offset offset-Z +0000; -0800; -08:00;

‘ escape for text delimiter

‘’ single quote literal ‘

[ optional section start

] optional section end

The count of pattern letters determines the format.


Text: The text style is determined based on the number of pattern letters used. Less than 4 pattern letters
will use the short text form, typically an abbreviation, e.g. day-of-week Monday might output “Mon”.
Exactly 4 pattern letters will use the full text form, typically the full description, e.g, day-of-week Monday
might output “Monday”. 5 or more letters will fail.
Number(n): The n here represents the maximum number of letters that this type of datetime pattern can
use. If the count of letters is one, then the value is output using the minimum number of digits and
without padding. Otherwise, the count of digits is used as the width of the output field, with the value
zero-padded as necessary.
Number/Text: If the count of pattern letters is 3 or greater, use the Text rules above. Otherwise use the
Number rules above.
Fraction: Use one or more (up to 9) contiguous 'S' characters, for example, SSSSSS , to parse and
format the fraction of second. For parsing, the acceptable fraction length can be [1, the number of contiguous
‘S’]. For formatting, the fraction length is padded to the number of contiguous ‘S’ with zeros.
Databricks Runtime supports datetimes of micro-of-second precision, which have up to 6 significant digits,
but can parse nano-of-second values, truncating the excess digits.
Year: The count of letters determines the minimum field width below which padding is used. If the count
of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits.
For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to
2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for
negative years. Otherwise, the sign is output if the pad width is exceeded when ‘G’ is not present. 7 or
more letters will fail.
Month: It follows the rule of Number/Text. The text form depends on the letter used: 'M' denotes the ‘standard’
form, and 'L' the ‘stand-alone’ form. These two forms differ only in certain languages.
For example, in Russian, ‘Июль’ is the stand-alone form of July, and ‘Июля’ is the standard form. Here are
examples for all supported pattern letters:
'M' or 'L' : Month number in a year starting from 1. There is no difference between 'M' and
'L' . Months 1 to 9 are printed without padding.

> SELECT date_format(date '1970-01-01', 'M');


1

> SELECT date_format(date '1970-12-01', 'L');


12

'MM' or 'LL' : Month number in a year starting from 1. Zero padding is added for month 1-9.

> SELECT date_format(date '1970-1-01', 'LL');


01

> SELECT date_format(date '1970-09-01', 'MM');


09

'MMM' : Short textual representation in the standard form. The month pattern should be part of a
date pattern, not just a stand-alone month, except in locales where there is no difference between the
standard and stand-alone forms, as in English.

> SELECT date_format(date '1970-01-01', 'd MMM');


1 Jan

-- Passing a format pattern to to_csv()


> SELECT to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'dd MMM', 'locale',
'RU'));
01 янв.

'MMMM' : Full textual month representation in the standard form. It is used for parsing/formatting
months as a part of dates/timestamps.

> SELECT date_format(date '1970-01-01', 'd MMMM');


1 January

-- Passing a format pattern to to_csv()


> SELECT to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'd MMMM', 'locale',
'RU'));
1 января

am-pm: This outputs the am-pm-of-day. Pattern letter count must be 1.


Zone ID(V): This outputs the time-zone ID. Pattern letter count must be 2.
Zone names(z): This outputs the display textual name of the time-zone ID. If the count of letters is one,
two or three, then the short name is output. If the count of letters is four, then the full name is output. Five
or more letters will fail.
Offset X and x: This formats the offset based on the number of pattern letters. One letter outputs just the
hour, such as ‘+01’, unless the minute is non-zero in which case the minute is also output, such as
‘+0130’. Two letters outputs the hour and minute, without a colon, such as ‘+0130’. Three letters outputs
the hour and minute, with a colon, such as ‘+01:30’. Four letters outputs the hour and minute and
optional second, without a colon, such as ‘+013015’. Five letters outputs the hour and minute and
optional second, with a colon, such as ‘+01:30:15’. Six or more letters will fail. Pattern letter ‘X’ (upper
case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case)
will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset O: This formats the localized offset based on the number of pattern letters. One letter outputs the
short form of the localized offset, which is localized offset text, such as ‘GMT’, with hour without leading
zero, optional 2-digit minute and second if non-zero, and colon, for example ‘GMT+8’. Four letters
outputs the full form, which is localized offset text, such as ‘GMT’, with 2-digit hour and minute field,
optional second field if non-zero, and colon, for example ‘GMT+08:00’. Any other count of letters will fail.
Offset Z: This formats the offset based on the number of pattern letters. One, two or three letters outputs
the hour and minute, without a colon, such as ‘+0130’. The output is ‘+0000’ when the offset is zero. Four
letters outputs the full form of localized offset, equivalent to four letters of Offset-O. The output is the
corresponding localized offset text if the offset is zero. Five letters outputs the hour, minute, with optional
second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. Six or more letters will fail.
Optional section start and end: Use [] to define an optional section, which may be nested. During
formatting, all valid data is output even if it is in the optional section. During parsing, the whole section may
be missing from the parsed string. An optional section is started by [ and ended using ] (or at the end
of the pattern).
The symbols ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format . They are not
allowed for datetime parsing, e.g. to_timestamp .
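As a brief illustration of the fraction and optional-section rules above (a sketch; the timestamp values are illustrative):

-- 'SSS' pads the fraction of second to three digits when formatting.
> SELECT date_format(TIMESTAMP'2022-01-01 10:30:00.1', 'HH:mm:ss.SSS');
  10:30:00.100

-- The optional fractional part may be absent from the parsed string.
> SELECT to_timestamp('2022-01-01 10:30:00', 'yyyy-MM-dd HH:mm:ss[.SSS]');
  2022-01-01 10:30:00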

Related articles
date_format function
from_unixtime function
from_utc_timestamp function
to_date function
to_timestamp function
to_utc_timestamp function
to_unix_timestamp function
unix_timestamp function
Functions
7/21/2022 • 2 minutes to read

Spark SQL provides two function features to meet a wide range of needs: built-in functions and user-defined
functions (UDFs).

Built-in functions
This article presents the usages and descriptions of categories of frequently used built-in functions for
aggregation, arrays and maps, dates and timestamps, and JSON data.
Built-in functions

SQL user-defined functions


SQL user-defined functions (UDFs) are functions you can define yourself which can return scalar values or result
sets.
See CREATE FUNCTION (SQL) for more information.
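For instance, a minimal SQL UDF sketch (the function name and body below are hypothetical, for illustration only):

-- A scalar SQL UDF that converts Fahrenheit to Celsius.
> CREATE FUNCTION to_celsius(f DOUBLE) RETURNS DOUBLE
  RETURN (f - 32) * 5 / 9;

> SELECT to_celsius(212);
  100.0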

User-defined functions
UDFs allow you to define your own functions when the system’s built-in functions are not enough to perform
the desired task. To use UDFs, you first define the function, then register the function with Spark, and finally call
the registered function. A UDF can act on a single row or act on multiple rows at once. Spark SQL also supports
integration of existing Hive implementations of UDFs, user defined aggregate functions (UDAF), and user
defined table functions (UDTF).
User-defined aggregate functions (UDAFs)
Integration with Hive UDFs, UDAFs, and UDTFs
User-defined scalar functions (UDFs)
Built-in functions
7/21/2022 • 28 minutes to read

This article presents links to and descriptions of built-in operators, and functions for strings and binary types,
numeric scalars, aggregations, windows, arrays, maps, dates and timestamps, casting, CSV data, JSON data,
XPath manipulation, and miscellaneous functions.
Also see:
Alphabetic list of built-in functions

Operators and predicates


For information on how operators are parsed with respect to each other, see Operator precedence.

Operator  Syntax  Description

& expr1 & expr2 Returns the bitwise AND of expr1


and expr2 .

and expr1 and expr2 Returns the logical AND of expr1 and
expr2 .

* multiplier * multiplicand Returns multiplier multiplied by


multiplicand .

!= expr1 != expr2 Returns true if expr1 does not equal


expr2 , or false otherwise.

! !expr Returns the logical NOT of a Boolean


expression.

between  expr1 [not] between expr2 and expr3  Tests whether expr1 is greater than or equal to expr2 and less than or equal to expr3 .

[]  arrayExpr [ indexExpr ]  Returns the element at position indexExpr of ARRAY arrayExpr .

[] mapExpr [ keyExpr ] Returns value at keyExpr of MAP


mapExpr

^ expr1 ^ expr2 Returns the bitwise exclusive


OR (XOR) of expr1 and expr2 .

: jsonStr : jsonPath Returns fields extracted from the


jsonStr .

::  expr :: type  Casts the value expr to the target data type type .

div divisor div dividend Returns the integral part of the


division of divisor by dividend .

== expr1 == expr2 Returns true if expr1 equals


expr2 , or false otherwise.

= expr1 = expr2 Returns true if expr1 equals


expr2 , or false otherwise.

>= expr1 >= expr2 Returns true if expr1 is greater


than or equal to expr2 , or false
otherwise.

> expr1 > expr2 Returns true if expr1 is greater


than expr2 , or false otherwise.

exists exists(query) Returns true if query returns at least


one row, or false otherwise.

ilike str [not] ilike (pattern[ESCAPE Returns true if str matches


escape]) pattern with escape case-
insensitively.

ilike str [not] ilike {ANY|SOME|ALL} Returns true if str matches any/all
([pattern[, ...]])
patterns case-insensitively.

in elem [not] in (expr1[, ...]) Returns true if elem equals any


exprN .

in elem [not] in (query) Returns true if elem equals any


row in query .

is distinct expr1 is [not] distinct from Tests whether the arguments have
expr2 different values where NULLs are
considered as comparable values.

is false expr is [not] false Tests whether expr is false .

is null expr is [not] null Returns true if expr is (not) NULL .

is true expr is [not] true Tests whether expr is true .

like str [not] like (pattern[ESCAPE Returns true if str matches


escape])
pattern with escape .

like str [not] like {ANY|SOME|ALL} Returns true if str matches any/all
([pattern[, ...]]) patterns.

<=> expr1 <=> expr2 Returns the same result as the


EQUAL(=) for non-null operands, but
returns true if both are NULL ,
false if one of the them is NULL .

<=  expr1 <= expr2  Returns true if expr1 is less than or equal to expr2 , or false otherwise.

<> expr1 <> expr2 Returns true if expr1 does not


equal expr2 , or false otherwise.

<  expr1 < expr2  Returns true if expr1 is less than expr2 , or false otherwise.

- expr1 - expr2 Returns the subtraction of expr2


from expr1 .

not not expr Returns the logical NOT of a Boolean


expression.

% dividend % divisor Returns the remainder after


dividend / divisor .

|| expr1 || expr2 Returns the concatenation of expr1


and expr2 .

| expr1 | expr2 Returns the bitwise OR of expr1


and expr2 .

+ expr1 + expr2 Returns the sum of expr1 and


expr2 .

regexp str [not] regexp regex Returns true if str matches regex .

regexp_like str [not] regexp_like regex Returns true if str matches regex .

rlike str [not] rlike regex Returns true if str matches regex .

/ dividend / divisor Returns dividend divided by


divisor .

~ ~ expr Returns the bitwise NOT of expr .

Operator precedence
Precedence  Operator

1 : , :: , [ ]

2 - (unary), + (unary), ~

3 * , / , % , div

4 + , - , ||

5 &

6 ^

7 |

8 = , == , <=> , <> , != , < , <= , > , >=

9 not , exists

10 between , in , rlike , regexp , ilike , like ,


is [not] [NULL, true, false] ,
is [not] distinct from

11 and

12 or
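For example (a small illustration of the table above; the literals are arbitrary):

-- 'and' binds tighter than 'or', so this is evaluated as true OR (false AND false).
> SELECT true or false and false;
  true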

String and binary functions


Function  Description

expr1 || expr2 Returns the concatenation of expr1 and expr2 .

aes_decrypt(expr, key[, mode[, padding]]) Decrypts a binary expr using AES encryption.

aes_encrypt(expr, key[, mode[, padding]]) Encrypts a binary expr using AES encryption.

ascii(str) Returns the ASCII code point of the first character of str .

base64(expr) Converts expr to a base 64 string.

bin(expr) Returns the binary representation of expr .

binary(expr) Casts the value of expr to BINARY.

bit_length(expr) Returns the bit length of string data or number of bits of


binary data.

btrim(str [, trimStr]) Returns str with leading and trailing characters removed.

char(expr) Returns the character at the supplied UTF-16 code point.



char_length(expr) Returns the character length of string data or number of


bytes of binary data.

character_length(expr) Returns the character length of string data or number of


bytes of binary data.

charindex(substr, str[, pos]) Returns the position of the first occurrence of substr in
str after position pos .

chr(expr) Returns the character at the supplied UTF-16 code point.

concat(expr1, expr2[, …]) Returns the concatenation of the arguments.

concat_ws(sep[, expr1[, …]]) Returns the concatenation strings separated by sep .

contains(expr, subExpr) Returns true if expr STRING or BINARY contains


subExpr .

crc32(expr) Returns a cyclic redundancy check value of expr .

decode(expr, charSet) Translates binary expr to a string using the character set
encoding charSet .

encode(expr, charSet) Returns the binary representation of a string using the


charSet character encoding.

endswith(expr, endExpr) Returns true if expr STRING or BINARY ends with


endExpr .

find_in_set(searchExpr, sourceExpr) Returns the position of a string within a comma-separated


list of strings.

format_number(expr, scale) Formats expr like #,###,###.## , rounded to scale


decimal places.

format_number(expr, fmt) Formats expr like fmt .

format_string(strfmt[, obj1 [, …]]) Returns a formatted string from printf-style format strings.

hex(expr) Converts expr to hexadecimal.

str ilike (pattern[ESCAPE escape]) Returns true if str matches pattern with escape case
insensitively.

initcap(expr) Returns expr with the first letter of each word in


uppercase.

instr(str, substr) Returns the (1-based) index of the first occurrence of


substr in str .

lcase(expr) Returns expr with all characters changed to lowercase.



left(str, len) Returns the leftmost len characters from str .

length(expr) Returns the character length of string data or number of


bytes of binary data.

levenshtein(str1, str2) Returns the Levenshtein distance between the strings str1
and str2 .

str like (pattern[ESCAPE escape]) Returns true if str matches pattern with escape .

locate(substr, str[, pos]) Returns the position of the first occurrence of substr in
str after position pos .

lower(expr) Returns expr with all characters changed to lowercase.

lpad(expr, len[, pad]) Returns expr , left-padded with pad to a length of len .

ltrim([trimstr,] str) Returns str with leading characters within trimStr


removed.

md5(expr) Returns an MD5 128-bit checksum of expr as a hex string.

octet_length(expr) Returns the byte length of string data or number of bytes of


binary data.

overlay(input PLACING replace FROM pos [FOR len]) Replaces input with replace that starts at pos and is
of length len .

parse_url(url, partToExtract[, key]) Extracts a part from url .

position(substr, str[, pos]) Returns the position of the first occurrence of substr in
str after position pos .

position(substr IN str) Returns the position of the first occurrence of substr in str .

printf(strfmt[, obj1 [, …]]) Returns a formatted string from printf-style format strings.

str regexp regex Returns true if str matches regex .

str regexp_like regex Returns true if str matches regex .

regexp_extract(str, regexp[, idx]) Extracts the first string in str that matches the regexp
expression and corresponds to the regex group index.

regexp_extract_all(str, regexp[, idx]) Extracts all strings in str that match the regexp
expression and correspond to the regex group index.

regexp_replace(str, regexp, rep[, position]) Replaces all substrings of str that match regexp with
rep .

repeat(expr, n) Returns the string that repeats expr n times.

replace(str, search [, replace]) Replaces all occurrences of search with replace .

reverse(expr) Returns a reversed string or an array with reverse order of


elements.

right(str, len) Returns the rightmost len characters from the string str
.

str rlike regex Returns true if str matches regex .

rpad(expr, len[, pad]) Returns expr , right-padded with pad to a length of len .

rtrim([trimStr,] str) Returns str with trailing characters removed.

sentences(str[, lang, country]) Splits str into an array of array of words.

sha(expr) Returns a sha1 hash value as a hex string of expr .

sha1(expr) Returns a sha1 hash value as a hex string of expr .

sha2(expr, bitLength) Returns a checksum of the SHA-2 family as a hex string of


expr .

soundex(expr) Returns the soundex code of the string.

space(n) Returns a string consisting of n spaces.

split(str, regex[, limit]) Splits str around occurrences that match regex and
returns an array with a length of at most limit .

split_part(str, delim, partNum) Splits str around occurrences of delim and returns the
partNum part.

startswith(expr, startExpr) Returns true if expr STRING or BINARY starts with


startExpr .

string(expr) Casts the value expr to STRING.

substr(expr, pos[, len]) Returns the substring of expr that starts at pos and is of
length len .

substr(expr FROM pos[ FOR len]) Returns the substring of expr that starts at pos and is of
length len .

substring(expr, pos[, len]) Returns the substring of expr that starts at pos and is of
length len .

substring(expr FROM pos[ FOR len]) Returns the substring of expr that starts at pos and is of
length len .

substring_index(expr, delim, count) Returns the substring of expr before count occurrences
of the delimiter delim .

translate(expr, from, to) Returns an expr where all characters in from have been
replaced with those in to .

trim([[BOTH | LEADING | TRAILING] [trimStr] FROM] str) Trim characters from a string.

ucase(expr) Returns expr with all characters changed to uppercase.

unbase64(expr) Returns a decoded base64 string as binary.

unhex(expr) Converts hexadecimal expr to BINARY.

upper(expr) Returns expr with all characters changed to uppercase.

Numeric scalar functions


Function  Description

~ expr Returns the bitwise NOT of expr .

dividend / divisor Returns dividend divided by divisor .

expr1 | expr2 Returns the bitwise OR of expr1 and expr2 .

- expr Returns the negated value of expr .

expr1 - expr2 Returns the subtraction of expr2 from expr1 .

+ expr Returns the value of expr .

expr1 + expr2 Returns the sum of expr1 and expr2 .

dividend % divisor Returns the remainder after dividend / divisor .

expr1 ^ expr2 Returns the bitwise exclusive OR (XOR) of expr1 and


expr2 .

expr1 & expr2 Returns the bitwise AND of expr1 and expr2 .

multiplier * multiplicand Returns multiplier multiplied by multiplicand .

abs(expr) Returns the absolute value of the numeric value in expr .



acos(expr) Returns the inverse cosine (arccosine) of expr .

acosh(expr) Returns the inverse hyperbolic cosine of expr .

asin(expr) Returns the inverse sine (arcsine) of expr .

asinh(expr) Returns the inverse hyperbolic sine of expr .

atan(expr) Returns the inverse tangent (arctangent) of expr .

atan2(exprY, exprX) Returns the angle in radians between the positive x-axis of a
plane and the point specified by the coordinates ( exprX ,
exprY ).

atanh(expr) Returns inverse hyperbolic tangent of expr .

bigint(expr) Casts the value expr to BIGINT.

bit_count(expr) Returns the number of bits set in the argument.

bit_get(expr, pos) Returns the value of a bit in a binary representation of an


integral numeric.

bit_reverse(expr) Returns the value obtained by reversing the order of the bits
in the argument.

bround(expr[,targetScale]) Returns the rounded expr using HALF_EVEN rounding


mode.

cbrt(expr) Returns the cube root of expr .

ceil(expr[,targetScale]) Returns the smallest number not smaller than expr


rounded up to targetScale digits relative to the decimal
point.

ceiling(expr[,targetScale]) Returns the smallest number not smaller than expr


rounded up to targetScale digits relative to the decimal
point.

conv(num, fromBase, toBase) Converts num from fromBase to toBase .

cos(expr) Returns the cosine of expr .

cosh(expr) Returns the hyperbolic cosine of expr .

cot(expr) Returns the cotangent of expr .

csc(expr) Returns the cosecant of expr .

decimal(expr) Casts the value expr to DECIMAL.



degrees(expr) Converts radians to degrees.

divisor div dividend Returns the integral part of the division of divisor by
dividend .

double(expr) Casts the value expr to DOUBLE.

e() Returns the constant e .

exp(expr) Returns e to the power of expr .

expm1(expr) Returns exp(expr) - 1 .

factorial(expr) Returns the factorial of expr .

float(expr) Casts the value expr to FLOAT.

floor(expr[,targetScale]) Returns the largest number not greater than expr rounded
down to targetScale digits relative to the decimal point.

getbit(expr, pos) Returns the value of a bit in a binary representation of an


integral numeric.

hypot(expr1, expr2) Returns sqrt(expr1 * expr1 + expr2 * expr2) .

int(expr) Casts the value expr to INTEGER.

isnan(expr) Returns true if expr is NaN .

ln(expr) Returns the natural logarithm (base e ) of expr .

log([base,] expr) Returns the logarithm of expr with base .

log1p(expr) Returns log(1 + expr) .

log2(expr) Returns the logarithm of expr with base 2 .

log10(expr) Returns the logarithm of expr with base 10 .

mod(dividend, divisor) Returns the remainder after dividend / divisor .

nanvl(expr1, expr2) Returns expr1 if it’s not NaN , or expr2 otherwise.

negative(expr) Returns the negated value of expr .

pi() Returns pi.

pmod(dividend, divisor) Returns the positive remainder after dividend / divisor .



positive(expr) Returns the value of expr .

pow(expr1, expr2) Raises expr1 to the power of expr2 .

power(expr1, expr2) Raises expr1 to the power of expr2 .

radians(expr) Converts expr in degrees to radians.

rand([seed]) Returns a random value between 0 and 1.

randn([seed]) Returns a random value from a standard normal distribution.

random([seed]) Returns a random value between 0 and 1.

rint(expr) Returns expr rounded to a whole number as a DOUBLE.

round(expr[,targetScale]) Returns the rounded expr using HALF_UP rounding


mode.

sec(expr) Returns the secant of expr .

sin(expr) Returns the sine of expr .

shiftleft(expr, n) Returns a bitwise left shifted by n bits.

shiftright(expr, n) Returns a bitwise signed integral number right shifted by n bits.

shiftrightunsigned(expr, n) Returns a bitwise unsigned integral number right shifted by n bits.

sign(expr) Returns -1.0, 0.0, or 1.0 as expr is negative, 0, or positive.

signum(expr) Returns -1.0, 0.0, or 1.0 as expr is negative, 0, or positive.

sinh(expr) Returns the hyperbolic sine of expr .

smallint(expr) Casts the value expr to SMALLINT.

sqrt(expr) Returns the square root of expr .

tan(expr) Returns the tangent of expr .

tanh(expr) Returns the hyperbolic tangent of expr .

tinyint(expr) Casts expr to TINYINT.

to_number(expr, fmt ) Returns expr cast to DECIMAL using formatting fmt .



try_add(expr1, expr2) Returns the sum of expr1 and expr2 , or NULL in case of
error.

try_divide(dividend, divisor) Returns dividend divided by divisor , or NULL if


divisor is 0.

try_multiply(multiplier, multiplicand) Returns multiplier multiplied by multiplicand , or


NULL on overflow.

try_subtract(expr1, expr2) Returns the subtraction of expr2 from expr1 , or NULL


on overflow.

try_to_number(expr, fmt ) Returns expr cast to DECIMAL using formatting fmt , or


NULL if expr does not match the format.

Aggregate functions
Function  Description

any(expr) Returns true if at least one value of expr in the group is


true.

approx_count_distinct(expr[,relativeSD]) Returns the estimated number of distinct values in expr


within the group.

approx_percentile(expr,percentage[,accuracy]) Returns the approximate percentile of the expr within the


group.

approx_top_k(expr[,k[,maxItemsTracked]]) Returns the top k most frequently occurring item values in


an expr along with their approximate counts.

array_agg(expr) Returns an array consisting of all values in expr within the


group.

avg(expr) Returns the mean calculated from values of a group.

bit_and(expr) Returns the bitwise AND of all input values in the group.

bit_or(expr) Returns the bitwise OR of all input values in the group.

bit_xor(expr) Returns the bitwise XOR of all input values in the group.

bool_and(expr) Returns true if all values in expr are true within the group.

bool_or(expr) Returns true if at least one value in expr is true within the
group.

collect_list(expr) Returns an array consisting of all values in expr within the


group.

collect_set(expr) Returns an array consisting of all unique values in expr


within the group.

corr(expr1,expr2) Returns Pearson coefficient of correlation between a group of


number pairs.

count(*) Returns the total number of retrieved rows in a group,


including rows containing null.

count(expr[, …]) Returns the number of rows in a group for which the
supplied expressions are all non-null.

count_if(expr) Returns the number of true values for the group in expr .

count_min_sketch(expr, epsilon, confidence, seed) Returns a count-min sketch of all values in the group in
expr with the epsilon , confidence and seed .

covar_pop(expr1,expr2) Returns the population covariance of number pairs in a


group.

covar_samp(expr1,expr2) Returns the sample covariance of number pairs in a group.

every(expr) Returns true if all values of expr in the group are true.

first(expr[,ignoreNull]) Returns the first value of expr for a group of rows.

first_value(expr[,ignoreNull]) Returns the first value of expr for a group of rows.

kurtosis(expr) Returns the kurtosis value calculated from values of a group.

last(expr[,ignoreNull]) Returns the last value of expr for the group of rows.

last_value(expr[,ignoreNull]) Returns the last value of expr for the group of rows.

max(expr) Returns the maximum value of expr in a group.

max_by(expr1,expr2) Returns the value of an expr1 associated with the


maximum value of expr2 in a group.

mean(expr) Returns the mean calculated from values of a group.

min(expr) Returns the minimum value of expr in a group.

min_by(expr1, expr2) Returns the value of an expr1 associated with the


minimum value of expr2 in a group.

percentile(expr, percentage [,frequency]) Returns the exact percentile value of expr at the specified
percentage .

percentile_approx(expr,percentage[,accuracy]) Returns the approximate percentile of the expr within the


group.

percentile_cont(pct) WITHIN GROUP (ORDER BY key) Returns the interpolated percentile of the key within the
group.

percentile_disc(pct) WITHIN GROUP (ORDER BY key) Returns the discrete percentile of the key within the group.

regr_avgx(yExpr, xExpr) Returns the mean of xExpr calculated from values of a


group where xExpr and yExpr are NOT NULL.

regr_avgy(yExpr, xExpr) Returns the mean of yExpr calculated from values of a


group where xExpr and yExpr are NOT NULL.

regr_count(yExpr, xExpr) Returns the number of non-null value pairs yExpr , xExpr
in the group.

regr_r2(yExpr, xExpr) Returns the coefficient of determination from values of a


group where xExpr and yExpr are NOT NULL.

regr_sxx(yExpr, xExpr) Returns the sum of squares of the xExpr values of a group
where xExpr and yExpr are NOT NULL.

regr_sxy(yExpr, xExpr) Returns the sum of products of yExpr and xExpr


calculated from values of a group where xExpr and
yExpr are NOT NULL.

regr_syy(yExpr, xExpr) Returns the sum of squares of the yExpr values of a group
where xExpr and yExpr are NOT NULL.

skewness(expr) Returns the skewness value calculated from values of a


group.

some(expr) Returns true if at least one value of expr in a group is


true .

std(expr) Returns the sample standard deviation calculated from the


values within the group.

stddev(expr) Returns the sample standard deviation calculated from the


values within the group.

stddev_pop(expr) Returns the population standard deviation calculated from


values of a group.

stddev_samp(expr) Returns the sample standard deviation calculated from


values of a group.

sum(expr) Returns the sum calculated from values of a group.

try_avg(expr) Returns the mean calculated from values of a group, NULL if


there is an overflow.

try_sum(expr) Returns the sum calculated from values of a group, NULL if


there is an overflow.

var_pop(expr) Returns the population variance calculated from values of a


group.

var_samp(expr) Returns the sample variance calculated from values of a


group.

variance(expr) Returns the sample variance calculated from values of a


group.

Ranking window functions


Function  Description

dense_rank() Returns the rank of a value compared to all values in the


partition.

ntile(n) Divides the rows for each window partition into n buckets
ranging from 1 to at most n .

percent_rank() Computes the percentage ranking of a value within the


partition.

rank() Returns the rank of a value compared to all values in the


partition.

row_number() Assigns a unique, sequential number to each row, starting


with one, according to the ordering of rows within the
window partition.

Analytic window functions


Function  Description

cume_dist() Returns the position of a value relative to all values in the


partition.

lag(expr[,offset[,default]]) Returns the value of expr from a preceding row within the
partition.

lead(expr[,offset[,default]]) Returns the value of expr from a subsequent row within


the partition.

nth_value(expr, offset[, ignoreNulls]) Returns the value of expr at a specific offset in the
window.

Array functions
Function  Description

arrayExpr[indexExpr] Returns element at position indexExpr of ARRAY


arrayExpr .

aggregate(expr,start,merge[,finish]) Aggregates elements in an array using a custom aggregator.

array([expr [, …]]) Returns an array with the elements in expr .

array_contains(array,value) Returns true if array contains value .

array_distinct(array) Removes duplicate values from array .

array_except(array1,array2) Returns an array of the elements in array1 but not in


array2 .

array_intersect(array1,array2) Returns an array of the elements in the intersection of


array1 and array2 .

array_join(array,delimiter[,nullReplacement]) Concatenates the elements of array .

array_max(array) Returns the maximum value in array .

array_min(array) Returns the minimum value in array .

array_position(array,element) Returns the position of the first occurrence of element in


array .

array_remove(array,element) Removes all occurrences of element from array .

array_repeat(element,count) Returns an array containing element count times.

array_size(array) Returns the number of elements in array .

array_sort(array,func) Returns array sorted according to func .

array_union(array1,array2) Returns an array of the elements in the union of array1


and array2 without duplicates.

arrays_overlap(array1, array2) Returns true if the intersection of array1 and array2 is


not empty.

arrays_zip(array1 [, …]) Returns a merged array of structs in which the nth struct
contains all Nth values of input arrays.

cardinality(expr) Returns the size of expr .

concat(expr1, expr2 [, …]) Returns the concatenation of the arguments.

element_at(arrayExpr, index) Returns the element of an arrayExpr at index .

exists(expr, pred) Returns true if pred is true for any element in expr .

explode(expr) Returns rows by un-nesting expr .

explode_outer(expr) Returns rows by un-nesting expr using outer semantics.

filter(expr,func) Filters the array in expr using the function func .

flatten(arrayOfArrays) Transforms an array of arrays into a single array.

forall(expr, predFunc) Tests whether predFunc holds for all elements in the array.

inline(expr) Explodes an array of structs into a table.

inline_outer(expr) Explodes an array of structs into a table with outer


semantics.

posexplode(expr) Returns rows by un-nesting the array with numbering of


positions.

posexplode_outer(expr) Returns rows by un-nesting the array with numbering of


positions using OUTER semantics.

reduce(expr,start,merge[,finish]) Aggregates elements in an array using a custom aggregator.

reverse(array) Returns a reversed string or an array with reverse order of


elements.

sequence(start,stop,step) Generates an array of elements from start to stop


(inclusive), incrementing by step .

shuffle(array) Returns a random permutation of the array in expr .

size(expr) Returns the cardinality of expr .

slice(expr,start,length) Returns a subset of an array.

sort_array(expr[,ascendingOrder]) Returns the array in expr in sorted order.

transform(expr, func) Transforms elements in an array in expr using the function


func .

try_element_at(arrayExpr, index) Returns the element of an arrayExpr at index , or NULL


if index is out of bounds.

zip_with(expr1, expr2, func) Merges the arrays in expr1 and expr2 , element-wise,
into a single array using func .

Map functions
Function  Description

mapExpr[keyExpr] Returns value at keyExpr of MAP mapExpr .

cardinality(expr) Returns the size of expr .

element_at(mapExpr, key) Returns the value of mapExpr for key .

explode(expr) Returns rows by un-nesting expr .

explode_outer(expr) Returns rows by un-nesting expr using outer semantics.

map([{key1, value1}[, …]]) Creates a map with the specified key-value pairs.

map_concat([expr1 [, …]]) Returns the union of all expr map expressions.

map_contains_key(map, key) Returns true if map contains key , false otherwise.

map_entries(map) Returns an unordered array of all entries in map .

map_filter(expr, func) Filters entries in the map in expr using the function func .

map_from_arrays(keys, values) Creates a map with a pair of the keys and values arrays.

map_from_entries(expr) Creates a map created from the specified array of entries.

map_keys(map) Returns an unordered array containing the keys of map .

map_values(map) Returns an unordered array containing the values of map .

map_zip_with(map1, map2, func) Merges map1 and map2 into a single map.

size(expr) Returns the cardinality of expr .

transform_keys(expr, func) Transforms keys in a map in expr using the function func
.

transform_values(expr, func) Transforms values in a map in expr using the function


func .

try_element_at(mapExpr, key) Returns the value of mapExpr for key , or NULL if key
does not exist.

Date, timestamp, and interval functions


For information on date and timestamp formats, see Datetime patterns.

Function  Description

intervalExpr / divisor Returns interval divided by divisor .



- intervalExpr Returns the negated value of intervalExpr .

intervalExpr1 - intervalExpr2 Returns the subtraction of intervalExpr2 from


intervalExpr1 .

datetimeExpr1 - datetimeExpr2 Returns the subtraction of datetimeExpr2 from


datetimeExpr1 .

+ intervalExpr Returns the value of intervalExpr .

intervalExpr1 + intervalExpr2 Returns the sum of intervalExpr1 and intervalExpr2 .

intervalExpr * multiplicand Returns intervalExpr multiplied by multiplicand .

abs(expr) Returns the absolute value of the interval value in expr .

add_months(startDate,numMonths) Returns the date that is numMonths after startDate .

current_date() Returns the current date at the start of query evaluation.

current_timestamp() Returns the current timestamp at the start of query


evaluation.

current_timezone() Returns the current session local timezone.

date(expr) Casts the value expr to DATE.

date_add(startDate,numDays) Returns the date numDays after startDate .

date_format(expr,fmt) Converts a timestamp to a string in the format fmt .

date_from_unix_date(days) Creates a date from the number of days since 1970-01-01 .

date_part(field,expr) Extracts a part of the date, timestamp, or interval.

date_sub(startDate,numDays) Returns the date numDays before startDate .

date_trunc(field,expr) Returns timestamp truncated to the unit specified in field .

dateadd(unit, value, expr) Adds value unit s to a timestamp expr .

datediff(endDate,startDate) Returns the number of days from startDate to endDate .

datediff(unit, start, stop) Returns the difference between two timestamps measured in
unit s.

day(expr) Returns the day of month of the date or timestamp.

dayofmonth(expr) Returns the day of month of the date or timestamp.



dayofweek(expr) Returns the day of week of the date or timestamp.

dayofyear(expr) Returns the day of year of the date or timestamp.

divisor div dividend Returns the integral part of the division of interval divisor
by interval dividend .

extract(field FROM source) Returns field of source .

from_unixtime(unixTime,fmt) Returns unixTime in fmt .

from_utc_timestamp(expr,timezone) Returns a timestamp in expr specified in UTC in the


timezone timeZone .

hour(expr) Returns the hour component of a timestamp.

last_day(expr) Returns the last day of the month that the date belongs to.

make_date(year,month,day) Creates a date from year , month , and day fields.

make_dt_interval([days[, hours[, mins[, secs]]]]) Creates a day-time interval from days , hours , mins
and secs .

make_interval(years, months, weeks, days, hours, mins, secs) Deprecated: Creates an interval from years , months ,
weeks , days , hours , mins and secs .

make_timestamp(year,month,day,hour,min,sec[,timezone]) Creates a timestamp from year , month , day , hour ,


min , sec , and timezone fields.

make_ym_interval([years[, months]]) Creates a year-month interval from years , and months .

minute(expr) Returns the minute component of the timestamp in expr .

month(expr) Returns the month component of the timestamp in expr .

months_between(expr1,expr2[,roundOff]) Returns the number of months elapsed between dates or


timestamps in expr1 and expr2 .

next_day(expr,dayOfWeek) Returns the first date which is later than expr and named
as in dayOfWeek .

now() Returns the current timestamp at the start of query


evaluation.

quarter(expr) Returns the quarter of the year for expr in the range 1 to
4.

second(expr) Returns the second component of the timestamp in expr .



sign(expr) Returns -1.0, 0.0, or 1.0 as interval expr is negative, 0, or


positive.

signum(expr) Returns -1.0, 0.0, or 1.0 as interval expr is negative, 0, or


positive.

timestamp(expr) Casts expr to TIMESTAMP.

timestamp_micros(expr) Creates a timestamp expr microseconds since UTC epoch.

timestamp_millis(expr) Creates a timestamp expr milliseconds since UTC epoch.

timestamp_seconds(expr) Creates timestamp expr seconds since UTC epoch.

timestampadd(unit, value, expr) Adds value unit s to a timestamp expr .

timestampdiff(unit, start, stop) Returns the difference between two timestamps measured in
unit s.

to_date(expr[,fmt]) Returns expr cast to a date using an optional formatting.

to_timestamp(expr[,fmt]) Returns expr cast to a timestamp using an optional


formatting.

to_unix_timestamp(expr[,fmt]) Returns the timestamp in expr as a UNIX timestamp.

to_utc_timestamp(expr,timezone) Returns the timestamp in expr in a different timezone as


UTC.

trunc(expr, fmt) Returns a date with a portion of the date truncated to
the unit specified by the format model fmt .

try_add(expr1, expr2) Returns the sum of expr1 and expr2 , or NULL in case of
error.

try_divide(dividend, divisor) Returns dividend divided by divisor , or NULL if


divisor is 0.

try_multiply(multiplier, multiplicand) Returns multiplier multiplied by multiplicand , or


NULL on overflow.

try_subtract(expr1, expr2) Returns the subtraction of expr2 from expr1 , or NULL


on overflow.

unix_date(expr) Returns the number of days since 1970-01-01 .

unix_micros(expr) Returns the number of microseconds since


1970-01-01 00:00:00 UTC .

unix_millis(expr) Returns the number of milliseconds since


1970-01-01 00:00:00 UTC .

unix_seconds(expr) Returns the number of seconds since


1970-01-01 00:00:00 UTC .

unix_timestamp([expr[, fmt]]) Returns the UNIX timestamp of the current or specified time.

weekday(expr) Returns the day of the week of expr .

weekofyear(expr) Returns the week of the year of expr .

year(expr) Returns the year component of expr .

window(expr, width[, step[, start]]) Creates a hopping based sliding-window over a timestamp
expression.

Cast functions and constructors


For information on casting between types, see cast function and try_cast function.

Function  Description

array([expr [, …]]) Returns an array with the elements in expr .

bigint(expr) Casts the value expr to BIGINT.

binary(expr) Casts the value of expr to BINARY.

boolean(expr) Casts expr to Boolean.

cast(expr AS type) Casts the value expr to the target data type type .

expr :: type Casts the value expr to the target data type type .

date(expr) Casts the value expr to DATE.

decimal(expr) Casts the value expr to DECIMAL.

double(expr) Casts the value expr to DOUBLE.

float(expr) Casts the value expr to FLOAT.

int(expr) Casts the value expr to INTEGER.

make_date(year,month,day) Creates a date from year , month , and day fields.

make_dt_interval([days[, hours[, mins[, secs]]]]) Creates a day-time interval from days , hours , mins
and secs .

make_interval(years, months, weeks, days, hours, mins, secs) Creates an interval from years , months , weeks , days ,
hours , mins and secs .

make_timestamp(year,month,day,hour,min,sec[,timezone]) Creates a timestamp from year , month , day , hour ,


min , sec , and timezone fields.

make_ym_interval([years[, months]]) Creates a year-month interval from years , and months .

map([{key1, value1} [, …]]) Creates a map with the specified key-value pairs.

named_struct({name1, val1} [, …]) Creates a struct with the specified field names and values.

smallint(expr) Casts the value expr to SMALLINT.

string(expr) Casts the value expr to STRING.

struct(expr1 [, …]) Creates a STRUCT with the specified field values.

tinyint(expr) Casts expr to TINYINT.

timestamp(expr) Casts expr to TIMESTAMP.

to_number(expr, fmt) Returns expr cast to DECIMAL using formatting fmt .

try_cast(expr AS type) Casts the value expr to the target data type type safely.

try_to_number(expr, fmt) Returns expr cast to DECIMAL using formatting fmt , or


NULL if expr is not valid.

CSV functions
Function  Description

from_csv(csvStr, schema[, options]) Returns a struct value with the csvStr and schema .

schema_of_csv(csv[, options]) Returns the schema of a CSV string in DDL format.

to_csv(expr[, options]) Returns a CSV string with the specified struct value.

JSON functions
Function  Description

jsonStr : jsonPath Returns fields extracted from the jsonStr .

from_json(jsonStr, schema[, options]) Returns a struct value with the jsonStr and schema .

get_json_object(expr, path) Extracts a JSON object from path .



json_array_length(jsonArray) Returns the number of elements in the outermost JSON array.

json_object_keys(jsonObject) Returns all the keys of the outermost JSON object as an array.

json_tuple(jsonStr, path1 [, …]) Returns multiple JSON objects as a tuple.

schema_of_json(json[, options]) Returns the schema of a JSON string in DDL format.

to_json(expr[, options]) Returns a JSON string with the struct specified in expr .
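For instance, using an inline JSON literal (a minimal sketch; struct rendering may vary by client):

> SELECT get_json_object('{"a": 1, "b": "two"}', '$.b');
two

> SELECT json_array_length('[1, 2, 3]'), json_object_keys('{"a": 1, "b": 2}');
3 ["a","b"]

> SELECT to_json(named_struct('a', 1));
{"a":1}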

XPath functions
FUNCTION DESCRIPTION

xpath(xml, xpath) Returns values within the nodes of xml that match xpath .

xpath_boolean(xml, xpath) Returns true if the xpath expression evaluates to true , or if a matching node in xml is found.

xpath_double(xml, xpath) Returns a DOUBLE value from an XML document.

xpath_float(xml, xpath) Returns a FLOAT value from an XML document.

xpath_int(xml, xpath) Returns an INTEGER value from an XML document.

xpath_long(xml, xpath) Returns a BIGINT value from an XML document.

xpath_number(xml, xpath) Returns a DOUBLE value from an XML document.

xpath_short(xml, xpath) Returns a SHORT value from an XML document.

xpath_string(xml, xpath) Returns the contents of the first XML node that matches the
XPath expression.
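For example, using an inline XML literal (a minimal sketch):

> SELECT xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()');
["b1","b2"]

> SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)'), xpath_boolean('<a><b>1</b></a>', 'a/b');
3 true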

Miscellaneous functions
FUNCTION DESCRIPTION

assert_true(expr) Returns an error if expr is not true.

CASE expr { WHEN opt1 THEN res1 } […] [ELSE def] END Returns resN for the first optN that equals expr or
def if none matches.

CASE { WHEN cond1 THEN res1 } […] [ELSE def] END Returns resN for the first condN that evaluates to true, or
def if none found.

coalesce(expr1, expr2 [, …]) Returns the first non-null argument.


cube (expr1 [, …]) Creates a multi-dimensional cube using the specified expression columns.

current_catalog() Returns the current catalog.

current_database() Returns the current schema.

current_schema() Returns the current schema.

current_user() Returns the current user.

current_version() Returns the current version of Databricks Runtime.

decode(expr, { key, value } [, …] [,defValue]) Returns the value matching the key.

elt(index, expr1 [, …] ) Returns the nth expression.

greatest(expr1 [, …]) Returns the largest value of all arguments, skipping null
values.

grouping(col) Indicates whether a specified column in a GROUPING SET , ROLLUP , or CUBE represents a subtotal.

grouping_id([col1 [, …]]) Returns the level of grouping for a set of columns.

hash(expr1 [, …]) Returns a hashed value of the arguments.

java_method(class, method[, arg1 [, …]]) Calls a method with reflection.

if(cond, expr1, expr2) Returns expr1 if cond is true , or expr2 otherwise.

iff(cond, expr1, expr2) Returns expr1 if cond is true , or expr2 otherwise.

ifnull(expr1, expr2) Returns expr2 if expr1 is NULL , or expr1 otherwise.

input_file_block_length() Returns the length in bytes of the block being read.

input_file_block_start() Returns the start offset in bytes of the block being read.

input_file_name() Returns the name of the file being read, or empty string if
not available.

is_member(group) Returns true if the current user is a member of group.

isnull(expr) Returns true if expr is NULL .

isnotnull(expr) Returns true if expr is not NULL .

least(expr1 [, …]) Returns the smallest value of all arguments, skipping null
values.

monotonically_increasing_id() Returns monotonically increasing 64-bit integers.

nullif(expr1, expr2) Returns NULL if expr1 equals expr2 , or expr1 otherwise.

nvl(expr1, expr2) Returns expr2 if expr1 is NULL , or expr1 otherwise.

nvl2(expr1, expr2, expr3) Returns expr2 if expr1 is not NULL , or expr3 otherwise.

raise_error(expr) Throws an exception with expr as the message.

range(end) Returns a table of values within a specified range.

range(start, end [, step [, numParts]]) Returns a table of values within a specified range.

reflect(class, method[, arg1 [, …]]) Calls a method with reflection.

spark_partition_id() Returns the current partition ID.

stack(numRows, expr1 [, …]) Separates expr1 , …, exprN into numRows rows.

uuid() Returns a universally unique identifier (UUID) string.

window(expr, width[, step [, start]]) Creates a hopping-based sliding window over a timestamp expression.

xxhash64(expr1 [, …]) Returns a 64-bit hashed value of the arguments.

version() Returns the Apache Spark version.
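A few of these combined in illustrative queries (a minimal sketch; the argument values are arbitrary):

> SELECT coalesce(NULL, 'fallback'), nvl2(NULL, 'a', 'b'), if(2 > 1, 'yes', 'no');
fallback b yes

> SELECT nullif(1, 1), greatest(10, 2, 37), least(10, 2, 37);
NULL 37 2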


Alphabetic list of built-in functions
7/21/2022 • 4 minutes to read

This article provides an alphabetically ordered list of built-in functions and operators in Databricks Runtime.
abs function
acos function
acosh function
add_months function
aes_decrypt function
aes_encrypt function
aggregate function
& (ampersand sign) operator
and predicate
any aggregate function
approx_count_distinct aggregate function
approx_percentile aggregate function
approx_top_k aggregate function
array function
array_agg aggregate function
array_contains function
array_distinct function
array_except function
array_intersect function
array_join function
array_max function
array_min function
array_position function
array_remove function
array_repeat function
array_size function
array_sort function
array_union function
arrays_overlap function
arrays_zip function
ascii function
asin function
asinh function
assert_true function
* (asterisk sign) operator
atan function
atan2 function
atanh function
avg aggregate function
!= (bangeq sign) operator
! (bang sign) operator
base64 function
between predicate
bigint function
bin function
binary function
bit_and aggregate function
bit_count function
bit_get function
bit_length function
bit_or aggregate function
bit_reverse function
bit_xor aggregate function
bool_and aggregate function
bool_or aggregate function
boolean function
[ ] (bracket sign) operator (Databricks Runtime)
bround function
btrim function
cardinality function
^ (caret sign) operator
case expression
cast function
cbrt function
ceil function
ceiling function
char function
char_length function
character_length function
charindex function
chr function
coalesce function
collect_list aggregate function
collect_set aggregate function
:: (colon colon sign) operator
: (colon sign) operator (Databricks Runtime)
concat function
concat_ws function
contains function
conv function
corr aggregate function
cos function
cosh function
cot function
count aggregate function
count_if aggregate function
count_min_sketch aggregate function
covar_pop aggregate function
covar_samp aggregate function
crc32 function
csc function
cube function
cume_dist analytic window function
current_catalog function
current_database function
current_date function
current_schema function
current_timestamp function
current_timezone function
current_user function
current_version function
date function
date_add function
date_format function
date_from_unix_date function
date_part function
date_sub function
date_trunc function
dateadd function
datediff function
datediff (timestamp) function
day function
dayofmonth function
dayofweek function
dayofyear function
decimal function
decode function
decode (character set) function
degrees function
dense_rank ranking window function
div operator
double function
e function
element_at function
elt function
encode function
endswith function
== (eq eq sign) operator
= (eq sign) operator
every aggregate function
exists function
exp function
explode table-valued generator function
explode_outer table-valued generator function
expm1 function
extract function
factorial function
filter function
find_in_set function
first aggregate function
first_value aggregate function
flatten function
float function
floor function
forall function
format_number function
format_string function
from_csv function
from_json function
from_unixtime function
from_utc_timestamp function
get_json_object function
getbit function
greatest function
grouping function
grouping_id function
>= (gt eq sign) operator
> (gt sign) operator
hash function
hex function
hour function
hypot function
if function
iff function
ifnull function
ilike operator
in predicate
initcap function
inline table-valued generator function
inline_outer table-valued generator function
input_file_block_length function
input_file_block_start function
input_file_name function
instr function
int function
is_member function
is distinct operator
is false operator
isnan function
isnotnull function
isnull function
is null operator
is true operator
java_method function
json_array_length function
json_object_keys function
json_tuple table-valued generator function
kurtosis aggregate function
lag analytic window function
last aggregate function
last_day function
last_value aggregate function
lcase function
lead analytic window function
least function
left function
length function
levenshtein function
like operator
ln function
locate function
log function
log10 function
log1p function
log2 function
lower function
lpad function
<=> (lt eq gt sign) operator
<= (lt eq sign) operator
<> (lt gt sign) operator
ltrim function
< (lt sign) operator
make_date function
make_dt_interval function
make_interval function
make_timestamp function
make_ym_interval function
map function
map_concat function
map_contains_key function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
max aggregate function
max_by aggregate function
md5 function
mean aggregate function
min aggregate function
min_by aggregate function
- (minus sign) operator
- (minus sign) unary operator
minute function
mod function
monotonically_increasing_id function
month function
months_between function
named_struct function
nanvl function
negative function
next_day function
not operator
now function
nth_value analytic window function
ntile ranking window function
nullif function
nvl function
nvl2 function
octet_length function
or operator
overlay function
parse_url function
percent_rank ranking window function
percentile aggregate function
percentile_approx aggregate function
percentile_cont aggregate function
percentile_disc aggregate function
% (percent sign) operator
pi function
|| (pipe pipe sign) operator
| (pipe sign) operator
+ (plus sign) operator
+ (plus sign) unary operator
pmod function
posexplode table-valued generator function
posexplode_outer table-valued generator function
position function
positive function
pow function
power function
printf function
quarter function
radians function
raise_error function
rand function
randn function
random function
range table-valued function
rank ranking window function
reduce function
reflect function
regexp operator
regexp_extract function
regexp_extract_all function
regexp_like function
regexp_replace function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_r2 aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
repeat function
replace function
reverse function
right function
rint function
rlike operator
round function
row_number ranking window function
rpad function
rtrim function
schema_of_csv function
schema_of_json function
sec function
second function
sentences function
sequence function
sha function
sha1 function
sha2 function
shiftleft function
shiftright function
shiftrightunsigned function
shuffle function
sign function
signum function
sin function
sinh function
size function
skewness aggregate function
/ (slash sign) operator
slice function
smallint function
some aggregate function
sort_array function
soundex function
space function
spark_partition_id function
split function
split_part function
sqrt function
stack table-valued generator function
startswith function
std aggregate function
stddev aggregate function
stddev_pop aggregate function
stddev_samp aggregate function
str_to_map function
string function
struct function
substr function
substring function
substring_index function
sum aggregate function
tan function
tanh function
~ (tilde sign) operator
timestamp function
timestamp_micros function
timestamp_millis function
timestamp_seconds function
timestampadd function
timestampdiff function
tinyint function
to_csv function
to_date function
to_json function
to_number function
to_timestamp function
to_unix_timestamp function
to_utc_timestamp function
transform function
transform_keys function
transform_values function
translate function
trim function
trunc function
try_add function
try_avg aggregate function
try_cast function
try_divide function
try_element_at function
try_multiply function
try_subtract function
try_sum aggregate function
try_to_number function
typeof function
ucase function
unbase64 function
unhex function
unix_date function
unix_micros function
unix_millis function
unix_seconds function
unix_timestamp function
upper function
uuid function
var_pop aggregate function
var_samp aggregate function
variance aggregate function
version function
weekday function
weekofyear function
width_bucket function
window grouping expression
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xxhash64 function
year function
zip_with function
Azure Databricks lambda functions
7/21/2022 • 2 minutes to read

A parameterized expression that can be passed to a function to control its behavior.


For example, the array_sort function accepts a lambda function as an argument to define a custom sort order.

Syntax
{ param -> expr |
(param1 [, ...] ) -> expr }

Parameters
paramN : An identifier used by the parent function to pass arguments for the lambda function.
expr : Any simple expression referencing paramN , which does not contain a subquery.

Returns
The result type is defined by the result type of expr .
If there is more than one paramN , the parameter names must be unique. The types of the parameters are set by
the invoking function. The expression must be valid for these types and the result type must match the defined
expectations of the invoking functions.

Examples
The array_sort function expects a lambda function with two parameters. The parameter types will be the type of the elements of the array to be sorted. The expression is expected to return an INTEGER where -1 means param1 < param2 , 0 means param1 = param2 , and 1 otherwise.
To sort an ARRAY of STRING in right-to-left lexical order, you can use the following lambda function.

(p1, p2) -> CASE WHEN p1 = p2 THEN 0
                 WHEN reverse(p1) < reverse(p2) THEN -1
                 ELSE 1 END

Lambda functions are defined and used ad hoc. So the function definition is the argument:

> SELECT array_sort(array('Hello', 'World'),
                    (p1, p2) -> CASE WHEN p1 = p2 THEN 0
                                     WHEN reverse(p1) < reverse(p2) THEN -1
                                     ELSE 1 END);
 [World, Hello]

Related articles
aggregate function
array_sort function
exists function
filter function
forall function
map_filter function
map_zip_with function
transform function
transform_keys function
transform_values function
zip_with function
Window functions
7/21/2022 • 3 minutes to read

Functions that operate on a group of rows, referred to as a window, and calculate a return value for each row
based on the group of rows. Window functions are useful for processing tasks such as calculating a moving
average, computing a cumulative statistic, or accessing the value of rows given the relative position of the
current row.

Syntax
function OVER { window_name | ( window_name ) | window_spec }

function:
{ ranking_function | analytic_function | aggregate_function }

window_spec:
( [ PARTITION BY partition [ , ... ] ] [ order_by ] [ window_frame ] )

Parameters
function
The function operating on the window. Different classes of functions support different configurations of
window specifications.
ranking_function
Any of the Ranking window functions.
If specified, the window_spec must include an ORDER BY clause, but not a window_frame clause.
analytic_function
Any of the Analytic window functions.
aggregate_function
Any of the Aggregate functions.
If specified, the function must not include a FILTER clause.
window_spec
This clause defines how the rows will be grouped, sorted within the group, and which rows within a
partition a function operates on.
partition
One or more expressions used to specify a group of rows defining the scope on which the function
operates. If no PARTITION clause is specified the partition comprises all rows.
order_by
The ORDER BY clause specifies the order of rows within a partition.
window_frame
The window frame clause specifies a sliding subset of rows within the partition on which the
aggregate or analytics function operates.
You can specify SORT BY as an alias for ORDER BY.
You can also specify CLUSTER BY or DISTRIBUTE BY as an alias for PARTITION BY.

Examples
> CREATE TABLE employees
(name STRING, dept STRING, salary INT, age INT);
> INSERT INTO employees
    VALUES ('Lisa', 'Sales', 10000, 35),
           ('Evan', 'Sales', 32000, 38),
           ('Fred', 'Engineering', 21000, 28),
           ('Alex', 'Sales', 30000, 33),
           ('Tom', 'Engineering', 23000, 33),
           ('Jane', 'Marketing', 29000, 28),
           ('Jeff', 'Marketing', 35000, 38),
           ('Paul', 'Engineering', 29000, 23),
           ('Chloe', 'Engineering', 23000, 25),
           ('Helen', 'Marketing', 29000, 40);

> SELECT name, dept, salary, age FROM employees;


Chloe Engineering 23000 25
Fred Engineering 21000 28
Paul Engineering 29000 23
Helen Marketing 29000 40
Tom Engineering 23000 33
Jane Marketing 29000 28
Jeff Marketing 35000 38
Evan Sales 32000 38
Lisa Sales 10000 35
Alex Sales 30000 33

> SELECT name,
         dept,
         salary,
         RANK() OVER (PARTITION BY dept ORDER BY salary) AS rank
    FROM employees;
Lisa Sales 10000 1
Alex Sales 30000 2
Evan Sales 32000 3
Fred Engineering 21000 1
Tom Engineering 23000 2
Chloe Engineering 23000 2
Paul Engineering 29000 4
Helen Marketing 29000 1
Jane Marketing 29000 1
Jeff Marketing 35000 3

> SELECT name,
         dept,
         salary,
         DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS dense_rank
    FROM employees;
Lisa Sales 10000 1
Alex Sales 30000 2
Evan Sales 32000 3
Fred Engineering 21000 1
Tom Engineering 23000 2
Chloe Engineering 23000 2
Paul Engineering 29000 3
Helen Marketing 29000 1
Jane Marketing 29000 1
Jeff Marketing 35000 2

> SELECT name,
         dept,
         age,
         CUME_DIST() OVER (PARTITION BY dept ORDER BY age
                           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cume_dist
    FROM employees;
Alex Sales 33 0.3333333333333333
Lisa Sales 35 0.6666666666666666
Evan Sales 38 1.0
Paul Engineering 23 0.25
Chloe Engineering 25 0.5
Fred Engineering 28 0.75
Tom Engineering 33 1.0
Jane Marketing 28 0.3333333333333333
Jeff Marketing 38 0.6666666666666666
Helen Marketing 40 1.0

> SELECT name,
         dept,
         salary,
         MIN(salary) OVER (PARTITION BY dept ORDER BY salary) AS min
    FROM employees;
Lisa Sales 10000 10000
Alex Sales 30000 10000
Evan Sales 32000 10000
Helen Marketing 29000 29000
Jane Marketing 29000 29000
Jeff Marketing 35000 29000
Fred Engineering 21000 21000
Tom Engineering 23000 21000
Chloe Engineering 23000 21000
Paul Engineering 29000 21000

> SELECT name,
         dept,
         salary,
         LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS lag,
         LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY salary) AS lead
    FROM employees;
Lisa Sales 10000 NULL 30000
Alex Sales 30000 10000 32000
Evan Sales 32000 30000 0
Fred Engineering 21000 NULL 23000
Chloe Engineering 23000 21000 23000
Tom Engineering 23000 23000 29000
Paul Engineering 29000 23000 0
Helen Marketing 29000 NULL 29000
Jane Marketing 29000 29000 35000
Jeff Marketing 35000 29000 0

Related articles
SELECT
ORDER BY
window frame clause
Aggregate functions
Ranking window functions
Analytic window functions
User-defined scalar functions (UDFs)
7/21/2022 • 2 minutes to read

User-defined scalar functions (UDFs) are user-programmable routines that act on one row. This documentation
lists the classes that are required for creating and registering UDFs. It also contains examples that demonstrate
how to define and register UDFs and invoke them in Spark SQL.

UserDefinedFunction class
To define the properties of a user-defined function, you can use some of the methods defined in this class.
asNonNullable(): UserDefinedFunction : Updates UserDefinedFunction to non-nullable.
asNondeterministic(): UserDefinedFunction : Updates UserDefinedFunction to nondeterministic.
withName(name: String): UserDefinedFunction : Updates UserDefinedFunction with a given name.

Examples
Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession
  .builder()
  .appName("Spark SQL UDF scalar example")
  .getOrCreate()

// Define and register a zero-argument non-deterministic UDF


// UDF is deterministic by default, i.e. produces the same result for the same input.
val random = udf(() => Math.random())
spark.udf.register("random", random.asNondeterministic())
spark.sql("SELECT random()").show()
// +-------+
// |UDF() |
// +-------+
// |xxxxxxx|
// +-------+

// Define and register a one-argument UDF


val plusOne = udf((x: Int) => x + 1)
spark.udf.register("plusOne", plusOne)
spark.sql("SELECT plusOne(5)").show()
// +------+
// |UDF(5)|
// +------+
// | 6|
// +------+

// Define a two-argument UDF and register it with Spark in one step


spark.udf.register("strLenScala", (_: String).length + (_: Int))
spark.sql("SELECT strLenScala('test', 1)").show()
// +--------------------+
// |strLenScala(test, 1)|
// +--------------------+
// | 5|
// +--------------------+

// UDF in a WHERE clause


spark.udf.register("oneArgFilter", (n: Int) => { n > 5 })
spark.range(1, 10).createOrReplaceTempView("test")
spark.sql("SELECT * FROM test WHERE oneArgFilter(id)").show()
// +---+
// | id|
// +---+
// | 6|
// | 7|
// | 8|
// | 9|
// +---+

Java

import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.types.DataTypes;

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL UDF scalar example")
  .getOrCreate();

// Define and register a zero-argument non-deterministic UDF


// UDF is deterministic by default, i.e. produces the same result for the same input.
UserDefinedFunction random = udf(
() -> Math.random(), DataTypes.DoubleType
);
random.asNondeterministic();
spark.udf().register("random", random);
spark.sql("SELECT random()").show();
// +-------+
// |UDF() |
// +-------+
// |xxxxxxx|
// +-------+

// Define and register a one-argument UDF


spark.udf().register("plusOne", new UDF1<Integer, Integer>() {
@Override
public Integer call(Integer x) {
return x + 1;
}
}, DataTypes.IntegerType);
spark.sql("SELECT plusOne(5)").show();
// +----------+
// |plusOne(5)|
// +----------+
// | 6|
// +----------+

// Define and register a two-argument UDF


UserDefinedFunction strLen = udf(
(String s, Integer x) -> s.length() + x, DataTypes.IntegerType
);
spark.udf().register("strLen", strLen);
spark.sql("SELECT strLen('test', 1)").show();
// +------------+
// |UDF(test, 1)|
// +------------+
// | 5|
// +------------+

// UDF in a WHERE clause


spark.udf().register("oneArgFilter", new UDF1<Long, Boolean>() {
@Override
public Boolean call(Long x) {
return x > 5;
}
}, DataTypes.BooleanType);
spark.range(1, 10).createOrReplaceTempView("test");
spark.sql("SELECT * FROM test WHERE oneArgFilter(id)").show();
// +---+
// | id|
// +---+
// | 6|
// | 7|
// | 8|
// | 9|
// +---+

Related statements
User-defined aggregate functions (UDAFs)
Integration with Hive UDFs, UDAFs, and UDTFs
User-defined aggregate functions (UDAFs)
7/21/2022 • 4 minutes to read

User-defined aggregate functions (UDAFs) are user-programmable routines that act on multiple rows at once
and return a single aggregated value as a result. This documentation lists the classes that are required for
creating and registering UDAFs. It also contains examples that demonstrate how to define and register UDAFs in
Scala and invoke them in Spark SQL.

Aggregator
Syntax Aggregator[-IN, BUF, OUT]

A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements
of a group and reduce them to a single value.
IN : The input type for the aggregation.
BUF : The type of the intermediate value of the reduction.
OUT : The type of the final output result.
bufferEncoder : Encoder[BUF]
The Encoder for the intermediate value type.
finish(reduction: BUF): OUT
Transform the output of the reduction.
merge(b1: BUF, b2: BUF): BUF
Merge two intermediate values.
outputEncoder : Encoder[OUT]
The Encoder for the final output value type.
reduce(b: BUF, a: IN): BUF
Aggregate input value a into the current intermediate value. For performance, the function may modify b
and return it instead of constructing a new object for b .
zero: BUF
The initial value of the intermediate result for this aggregation.

Examples
Type-safe user-defined aggregate functions
User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract class. For
example, a type-safe user-defined average can look like:
Untyped user-defined aggregate functions
Typed aggregations, as described above, may also be registered as untyped aggregating UDFs for use with
DataFrames. For example, a user-defined average for untyped DataFrames can look like:
Scala

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}


import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Long, Average, Double] {


// A zero value for this aggregation. Should satisfy the property that any b + zero = b
def zero: Average = Average(0L, 0L)
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
def reduce(buffer: Average, data: Long): Average = {
buffer.sum += data
buffer.count += 1
buffer
}
// Merge two intermediate values
def merge(b1: Average, b2: Average): Average = {
b1.sum += b2.sum
b1.count += b2.count
b1
}
// Transform the output of the reduction
def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
// The Encoder for the intermediate value type
def bufferEncoder: Encoder[Average] = Encoders.product
// The Encoder for the final output value type
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Register the function to access it


spark.udf.register("myAverage", functions.udaf(MyAverage))

val df = spark.read.format("json").load("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+

val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")


result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+

Java

import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Aggregator;
import org.apache.spark.sql.functions;
public static class Average implements Serializable {
private long sum;
private long count;

// Constructors, getters, setters...
}

public static class MyAverage extends Aggregator<Long, Average, Double> {


// A zero value for this aggregation. Should satisfy the property that any b + zero = b
public Average zero() {
return new Average(0L, 0L);
}
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
public Average reduce(Average buffer, Long data) {
long newSum = buffer.getSum() + data;
long newCount = buffer.getCount() + 1;
buffer.setSum(newSum);
buffer.setCount(newCount);
return buffer;
}
// Merge two intermediate values
public Average merge(Average b1, Average b2) {
long mergedSum = b1.getSum() + b2.getSum();
long mergedCount = b1.getCount() + b2.getCount();
b1.setSum(mergedSum);
b1.setCount(mergedCount);
return b1;
}
// Transform the output of the reduction
public Double finish(Average reduction) {
return ((double) reduction.getSum()) / reduction.getCount();
}
// The Encoder for the intermediate value type
public Encoder<Average> bufferEncoder() {
return Encoders.bean(Average.class);
}
// The Encoder for the final output value type
public Encoder<Double> outputEncoder() {
return Encoders.DOUBLE();
}
}

// Register the function to access it


spark.udf().register("myAverage", functions.udaf(new MyAverage(), Encoders.LONG()));

Dataset<Row> df = spark.read().format("json").load("examples/src/main/resources/employees.json");
df.createOrReplaceTempView("employees");
df.show();
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+

Dataset<Row> result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees");


result.show();
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+

SQL
-- Compile and place UDAF MyAverage in a JAR file called `MyAverage.jar` in /tmp.
CREATE FUNCTION myAverage AS 'MyAverage' USING JAR '/tmp/MyAverage.jar';

SHOW USER FUNCTIONS;


+------------------+
| function|
+------------------+
| default.myAverage|
+------------------+

CREATE TEMPORARY VIEW employees


USING org.apache.spark.sql.json
OPTIONS (
path "examples/src/main/resources/employees.json"
);

SELECT * FROM employees;


+-------+------+
| name|salary|
+-------+------+
|Michael| 3000|
| Andy| 4500|
| Justin| 3500|
| Berta| 4000|
+-------+------+

SELECT myAverage(salary) as average_salary FROM employees;


+--------------+
|average_salary|
+--------------+
| 3750.0|
+--------------+

Related statements
Scalar user defined functions (UDFs)
Integration with Hive UDFs, UDAFs, and UDTFs
Integration with Hive UDFs, UDAFs, and UDTFs
7/21/2022 • 2 minutes to read

Spark SQL supports integration of Hive UDFs, UDAFs, and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs
work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows
and return a single aggregated row as a result. In addition, Hive also supports UDTFs (User Defined Tabular
Functions) that act on one row as input and return multiple rows as output. To use Hive UDFs/UDAFs/UDTFs, the
user should register them in Spark, and then use them in Spark SQL queries.

Examples
Hive has two UDF interfaces: UDF and GenericUDF. An example below uses GenericUDFAbs derived from
GenericUDF .

-- Register `GenericUDFAbs` and use it in Spark SQL.


-- Note that, if you use your own programmed one, you need to add a JAR containing it
-- into a classpath,
-- e.g., ADD JAR yourHiveUDF.jar;
CREATE TEMPORARY FUNCTION testUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs';

SELECT * FROM t;
+-----+
|value|
+-----+
| -1.0|
| 2.0|
| -3.0|
+-----+

SELECT testUDF(value) FROM t;


+--------------+
|testUDF(value)|
+--------------+
| 1.0|
| 2.0|
| 3.0|
+--------------+

An example below uses GenericUDTFExplode derived from GenericUDTF.


-- Register `GenericUDTFExplode` and use it in Spark SQL
CREATE TEMPORARY FUNCTION hiveUDTF
AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode';

SELECT * FROM t;
+------+
| value|
+------+
|[1, 2]|
|[3, 4]|
+------+

SELECT hiveUDTF(value) FROM t;


+---+
|col|
+---+
| 1|
| 2|
| 3|
| 4|
+---+

Hive has two UDAF interfaces: UDAF and GenericUDAFResolver. An example below uses GenericUDAFSum
derived from GenericUDAFResolver .

-- Register `GenericUDAFSum` and use it in Spark SQL


CREATE TEMPORARY FUNCTION hiveUDAF
AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';

SELECT * FROM t;
+---+-----+
|key|value|
+---+-----+
| a| 1|
| a| 2|
| b| 3|
+---+-----+

SELECT key, hiveUDAF(value) FROM t GROUP BY key;


+---+---------------+
|key|hiveUDAF(value)|
+---+---------------+
| b| 3|
| a| 3|
+---+---------------+
JSON path expression
7/21/2022 • 2 minutes to read

A JSON path expression is used to extract values from a JSON string using the : operator.

Syntax
{ { identifier | [ field ] | [ * ] | [ index ] }
[ . identifier | [ field ] | [ * ] | [ index ] ] [...] }

The brackets surrounding field , * and index are actual brackets and do not indicate optional syntax.

Parameters
identifier : A case insensitive identifier of a JSON field.
[ field ] : A bracketed case sensitive STRING literal identifying a JSON field.
[ * ] : Identifying all elements in a JSON array.
[ index ] : An integer literal identifying a specific element in a 0-based JSON array.

Returns
A STRING.
When a JSON field exists with an un-delimited null value, you will receive a SQL NULL value for that column,
not a null text value.
You can use :: operator to cast values to basic data types.
Use the from_json function to cast nested results into more complex data types, such as arrays or structs.

Notes
You can use an un-delimited identifier to refer to a JSON field if the name does not contain spaces or special
characters, and there is no field of the same name in a different case.
Use a delimited identifier if the name contains spaces or special characters, as long as there is no field of the same name in a different case.
The [ field ] notation can always be used, but requires you to exactly match the case of the field.
If Databricks Runtime cannot uniquely identify a field an error is returned. If no match is found for any field
Databricks Runtime returns NULL .

Examples
The following examples use the data created with the statement in Example data.
In this section:
Extract using identifier and delimiters
Extract nested fields
Extract values from arrays
NULL behavior
Cast values
Example data
Extract using identifier and delimiters

> SELECT raw:owner, raw:OWNER, raw:['owner'], raw:['OWNER'] FROM store_data;


amy amy amy NULL

-- Use backticks to escape special characters. References are case insensitive when you use backticks.
-- Use brackets to make them case sensitive.
> SELECT raw:`zip code`, raw:`Zip Code`, raw:['fb:testid'] FROM store_data;
94025 94025 1234

Extract nested fields

-- Use dot notation


> SELECT raw:store.bicycle FROM store_data;
'{ "price":19.95, "color":"red" }'

-- Use brackets
> SELECT raw:['store']['bicycle'] FROM store_data;
'{ "price":19.95, "color":"red" }'

Extract values from arrays

-- Index elements
> SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data;
'{ "weight":8, "type":"apple" }' '{ "weight":9, "type":"pear" }'

-- Extract subfields from arrays


> SELECT raw:store.book[*].isbn FROM store_data;
'[ null, "0-553-21311-3", "0-395-19395-8" ]'

-- Access arrays within arrays or structs within arrays


> SELECT raw:store.basket[*],
raw:store.basket[*][0] first_of_baskets,
raw:store.basket[0][*] first_basket,
raw:store.basket[*][*] all_elements_flattened,
raw:store.basket[0][2].b subfield
FROM store_data;
basket                       first_of_baskets   first_basket        all_elements_flattened            subfield
---------------------------- ------------------ ------------------- --------------------------------- --------
[                            [                  [                   [1,2,{"b":"y","a":"x"},3,4,5,6]   y
  [1,2,{"b":"y","a":"x"}],     1,                 1,
  [3,4],                       3,                 2,
  [5,6]                        5                  {"b":"y","a":"x"}
]                            ]                  ]

NULL behavior

> SELECT '{"key":null}':key IS NULL sql_null, '{"key":"null"}':key IS NULL;


true false

Cast values
-- price is returned as a double, not a string
> SELECT raw:store.bicycle.price::double FROM store_data
19.95

-- use from_json to cast into more complex types


> SELECT from_json(raw:store.bicycle, 'price double, color string') bicycle FROM store_data
'{ "price":19.95, "color":"red" }'

-- the column returned is an array of string arrays


> SELECT from_json(raw:store.basket[*], 'array<array<string>>') baskets FROM store_data
'[
["1","2","{\"b\":\"y\",\"a\":\"x\"}"],
["3","4"],
["5","6"]
]'

Example data

CREATE TABLE store_data AS SELECT


'{
"store":{
"fruit": [
{"weight":8,"type":"apple"},
{"weight":9,"type":"pear"}
],
"basket":[
[1,2,{"b":"y","a":"x"}],
[3,4],
[5,6]
],
"book":[
{
"author":"Nigel Rees",
"title":"Sayings of the Century",
"category":"reference",
"price":8.95
},
{
"author":"Herman Melville",
"title":"Moby Dick",
"category":"fiction",
"price":8.99,
"isbn":"0-553-21311-3"
},
{
"author":"J. R. R. Tolkien",
"title":"The Lord of the Rings",
"category":"fiction",
"reader":[
{"age":25,"name":"bob"},
{"age":26,"name":"jack"}
],
"price":22.99,
"isbn":"0-395-19395-8"
}
],
"bicycle":{
"price":19.95,
"color":"red"
}
},
"owner":"amy",
"zip code":"94025",
"fb:testid":"1234"
}' as raw
Related functions
: operator
Partitions
7/21/2022 • 2 minutes to read

A partition is composed of a subset of rows in a table that share the same value for a predefined subset of
columns called the partitioning columns. Using partitions can speed up queries against the table as well as data
manipulation.
To use partitions, you define the set of partitioning columns when you create a table by including the
PARTITIONED BY clause.
When inserting or manipulating rows in a table Databricks Runtime automatically dispatches rows into the
appropriate partitions.
You can also specify the partition directly using a PARTITION clause.
This syntax is also available for tables that don’t use Delta Lake format, to DROP, ADD or RENAME partitions
quickly by using the ALTER TABLE statement.

PARTITIONED BY
The PARTITIONED BY clause specifies a list of columns along which the new table is partitioned.
Syntax

PARTITIONED BY ( { partition_column [ column_type ] } [, ...] )

Parameters
partition_column
An identifier may reference a column_identifier in the table. If you specify more than one column there
must be no duplicates. If you reference all columns in the table’s column_specification an error is raised.
column_type
Unless the partition_column refers to a column_identifier in the table’s column_specification ,
column_type defines the data type of the partition_column .
Not all data types supported by Databricks Runtime are supported by all data sources.
Notes
Unless you define a Delta Lake table, partitioning columns that reference columns in the column specification
are always moved to the end of the table.

PARTITION
You use the PARTITION clause to identify a partition to be queried or manipulated.
A partition is identified by naming all its columns and associating each with a value. You need not specify them
in a specific order.
Unless you are adding a new partition to an existing table you may omit columns or values to indicate that the
operation applies to all partitions matching the specified subset of columns.
PARTITION ( { partition_column [ = partition_value | LIKE pattern ] } [ , ... ] )

Parameters
partition_column
A column named as a partition column of the table. You may not specify the same column twice.
= partition_value

A literal of a data type matching the type of the partition column. If you omit a partition value the
specification will match all values for this partition column.
LIKE pattern

This form is only allowed in ALTER SHARE ADD TABLE.


Matches the string representation of partition_column to pattern . pattern must be a string literal as
used in LIKE.

Examples
-- Use the PARTITIONED BY clause in a table definition
> CREATE TABLE student(university STRING,
major STRING,
name STRING)
PARTITIONED BY(university, major);

> CREATE TABLE professor(name STRING)


PARTITIONED BY(university STRING,
department STRING);

-- Use the PARTITION specification to INSERT into a table


> INSERT INTO student
PARTITION(university= 'TU Kaiserslautern') (major, name)
SELECT major, name FROM freshmen;

-- Use the partition specification to add and drop a partition


> CREATE TABLE log(date DATE, id INT, event STRING)
USING CSV LOCATION 'dbfs:/log'
PARTITIONED BY (date);

> ALTER TABLE log ADD PARTITION(date = DATE'2021-09-10');

> ALTER TABLE log DROP PARTITION(date = DATE'2021-09-10');

-- Drop all partitions from the named university, independent of the major.
> ALTER TABLE student DROP PARTITION(university = 'TU Kaiserslautern');
Principal
7/21/2022 • 2 minutes to read

A principal is a user, service principal, or group known to the metastore. Principals can be granted privileges and
may own securable objects.

Syntax
{ `<user>@<domain-name>` |
  `<sp-application-id>` |
  group_name |
  USERS |
  ACCOUNT USERS }

Parameters
<user>@<domain-name>

An individual user. You must quote the identifier with back-ticks (`) due to the @ character.
<sp-application-id>

A service principal, specified by its applicationId value. You must quote the identifier with back-ticks (`)
due to the dash characters in the ID.
group_name
An identifier specifying a group of users or groups.
USERS

The root group to which all workspace level users belong.


ACCOUNT USERS

The root group to which all account level users belong.

Workspace and Account level principals


Databricks Runtime supports two distinct sets of principals: workspace level and account level.
If you attempt to GRANT a privilege to a securable_object you will receive a “user not found” error if the
principal does not apply to the securable object.
Workspace level principal
Workspace level principals are managed in each workspace. They apply to all objects defined in the
hive_metastore catalog. You can also create and manage workspace level groups using the following
statements:
ALTER GROUP
CREATE GROUP
DROP GROUP
Account level principal
Account level principals are global within the account. They are managed outside of SQL and
apply to all objects outside the hive_metastore catalog.

Examples
-- Granting a privilege to the user alf@melmak.et
> GRANT SELECT ON TABLE t TO `alf@melmak.et`;

-- Granting a privilege to the service principal fab9e00e-ca35-11ec-9d64-0242ac120002


> GRANT SELECT ON TABLE t TO `fab9e00e-ca35-11ec-9d64-0242ac120002`;

-- Revoking a privilege from the general public group.


> REVOKE SELECT ON TABLE t FROM users;

-- Transferring ownership of an object to `some_group`


> ALTER SCHEMA some_schema OWNER TO some_group;

Related
ALTER GROUP
CREATE GROUP
GRANT
REVOKE
Privileges and securable objects
7/21/2022 • 3 minutes to read

A privilege is a right granted to a principal to operate on a securable object.

Securable objects
A securable object is an object defined in the metastore on which privileges can be granted to a principal.
To manage privileges on any object you must be its owner or an administrator.
Syntax

securable_object
{ ANONYMOUS FUNCTION |
ANY FILE |
CATALOG [ catalog_name ] |
{ SCHEMA | DATABASE } schema_name |
EXTERNAL LOCATION location_name |
FUNCTION function_name |
STORAGE CREDENTIAL credential_name |
[ TABLE ] table_name |
VIEW view_name }

Parameters
ANONYMOUS FUNCTION

You can grant the privilege to SELECT from anonymous functions.


ANY FILE

You can grant the privilege to SELECT and MODIFY any file in the filesystem.
CATALOG catalog_name
You can grant CREATE , CREATE_NAMED_FUNCTION , and USAGE on a catalog. The default catalog name is
hive_metastore . If the catalog name is hive_metastore you can also grant SELECT , READ_METADATA , and
MODIFY to grant these privileges on any existing and future securable objects within the catalog.

{ SCHEMA | DATABASE } schema_name


You can grant CREATE , CREATE_NAMED_FUNCTION , and USAGE on a schema.
You can also grant SELECT , READ_METADATA , and MODIFY to grant these privileges on any existing and
future securable objects within the schema.
EXTERNAL LOCATION location_name
You can grant CREATE TABLE , READ FILES , and WRITE FILES on an external location.
FUNCTION function_name
You can grant SELECT on a user defined function.
STORAGE CREDENTIAL credential_name
You can grant CREATE TABLE , READ FILES , and WRITE FILES on a storage credential.
[ TABLE ] table_name
You can grant SELECT , and MODIFY on a table.
For the principal to use SELECT or MODIFY on a table it must also have USAGE privilege on the table’s
schema and catalog.
VIEW view_name
You can grant SELECT on a view.
For the principal to use SELECT from a view it must also have USAGE privilege on the view’s schema and
catalog.

Privilege types
CREATE

Create objects other than external user defined functions (UDF) within the catalog or schema.
CREATE_NAMED_FUNCTION

Create external user defined functions within the catalog or schema.


CREATE TABLE

Create external tables using the storage credential or external location.


MODIFY

COPY INTO, UPDATE, DELETE, INSERT, or MERGE INTO the table.


If the securable_object is the hive_metastore or a schema within it, granting MODIFY will grant MODIFY
on all current and future tables and views within the securable object.
MODIFY_CLASSPATH

Add files to the Spark class path to create named non-SQL functions.
READ_METADATA

Discover the securable object in SHOW and interrogate the object in DESCRIBE.
If the securable object is the hive_metastore catalog or a schema within it, granting READ_METADATA will
grant READ_METADATA on all current and future tables and views within the securable object.
READ FILES

Query files directly using the storage credential or external location.


SELECT

Query a table or view, invoke a user defined or anonymous function, or select ANY FILE . The user needs
SELECT on the table, view, or function, as well as USAGE on the object’s schema and catalog.

If the securable object is the hive_metastore or a schema within it, granting SELECT will grant SELECT on
all current and future tables and views within the securable object.
USAGE

Required, but not sufficient to reference any objects in a catalog or schema. The principal also needs to
have privileges on the individual securable objects.
WRITE FILES

Directly COPY INTO files governed by the storage credential or external location.
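For illustration, each privilege type is granted to a principal on its securable object with a GRANT statement of the following shape; the schema, location, table, and group names below are hypothetical:

> GRANT CREATE ON SCHEMA some_schema TO `alf@melmak.et`;

> GRANT READ FILES ON EXTERNAL LOCATION some_location TO finance;

> GRANT MODIFY ON TABLE some_table TO engineers;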

Privilege matrix
The following table shows which privileges are associated with which securable objects.

PRIVILEGE TYPE          APPLIES TO SECURABLE OBJECTS
CREATE                  CATALOG, SCHEMA
CREATE_NAMED_FUNCTION   CATALOG, SCHEMA
CREATE TABLE            EXTERNAL LOCATION, STORAGE CREDENTIAL
MODIFY                  ANY FILE, CATALOG (HMS), SCHEMA (HMS), TABLE
MODIFY_CLASSPATH        CATALOG
READ_METADATA           CATALOG (HMS), SCHEMA (HMS), TABLE (HMS), VIEW (HMS)
READ FILES              EXTERNAL LOCATION, STORAGE CREDENTIAL
SELECT                  ANONYMOUS FUNCTION, ANY FILE, CATALOG (HMS), SCHEMA (HMS), FUNCTION, TABLE, VIEW
USAGE                   CATALOG, SCHEMA
WRITE FILES             EXTERNAL LOCATION, STORAGE CREDENTIAL

HMS: This privilege applies only to securable objects in the hive_metastore catalog.

Examples
-- Grant a privilege to the user alf@melmak.et
> GRANT SELECT ON TABLE t TO `alf@melmak.et`;

-- Revoke a privilege from the user alf@melmak.et


> REVOKE USAGE ON SCHEMA some_schema FROM `alf@melmak.et`;

Related
GRANT
Principal
REVOKE
External locations and storage credentials
7/21/2022 • 3 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

To interface with storage not managed by Databricks Runtime in a secure manner use a:
storage credential
A SQL object used to abstract long term credentials from cloud storage providers.
Since: Databricks Runtime 10.3
external location
A SQL Object used to associate a URL with a storage credential.
Since: Databricks Runtime 10.3
external table
A table with a storage path contained within an external location.

Storage credential
A storage credential is a securable SQL object encapsulating an identity for the cloud service provider, any of:
An AWS IAM role
An Azure service principal
Once a storage credential is created access to it can be granted to principals (users and groups).
A user or group with permission to use a storage credential can access any storage path covered by the storage
credential by using WITH (CREDENTIAL = credential) in your SQL command.
For more fine-grained access control, combine a storage credential with an external location.
Storage credential names are unqualified and must be unique within the metastore.
Related articles
Create a storage credential (CLI)
ALTER STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
DESCRIBE STORAGE CREDENTIAL
SHOW STORAGE CREDENTIALS
GRANT
REVOKE

External location
An external location is a securable SQL object that combines a storage path with a storage credential that
authorizes access to that path.
After an external location is created, you can grant access to it to account-level principals (users and groups).
A user or group with permission to use an external location can access any storage path within the location’s
path without direct access to the storage credential.
To further refine access control you can use GRANT on external tables to encapsulate access to individual files
within an external location.
External location names are unqualified and must be unique within the metastore.
The storage path of any external location may not be contained within another external location’s storage path,
or within an external table’s storage path using an explicit storage credential.

External table
An external table is a table that references an external storage path by using a LOCATION clause.
The storage path should be an existing external location to which you have been granted access.
Alternatively you can reference a storage credential to which you have been granted access.
Using external tables abstracts away the storage path, external location, and storage credential for users who
are granted access to the external table.

WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.

Graphical Representation of relationships


The following diagram describes the relationship between:
storage credentials
external locations
external tables
storage paths
IAM entities
Azure service accounts
Examples
Using the CLI, create a storage credential my_azure_storage_cred for an Azure service principal.

databricks unity-catalog create-storage-credential --json '{"name": "my_azure_storage_cred",


"azure_service_principal": {"directory_id": "12345678-9abc-def0-1234-56789abcdef0", "application_id":
"23456789-9abc-def0-1234-56789abcdef0", "client_secret": "Cli3nt5ecr3t"}}'

The rest of the commands can be run within SQL.


-- Grant access to the storage credential
> GRANT READ FILES ON STORAGE CREDENTIAL my_azure_storage_cred TO ceo;

-- ceo can directly read from any storage path using my_azure_storage_cred
> SELECT count(1) FROM `delta`.`abfss://depts/finance/forecast/somefile` WITH (CREDENTIAL
my_azure_storage_cred);
100
> SELECT count(1) FROM `delta`.`abfss://depts/hr/employees` WITH (CREDENTIAL my_azure_storage_cred);
2017

-- Create an external location on specific path to which `my_azure_storage_cred` has access


> CREATE EXTERNAL LOCATION finance_loc URL 'abfss://depts/finance'
WITH (CREDENTIAL my_azure_storage_cred)
COMMENT 'finance';

-- Grant access to the finance location to the finance group

> GRANT READ FILES ON EXTERNAL LOCATION finance_loc TO finance;

-- `finance` can read from any storage path under abfss://depts/finance but nowhere else
> SELECT count(1) FROM `delta`.`abfss://depts/finance/forecast/somefile` WITH (CREDENTIAL
my_azure_storage_cred);
100
> SELECT count(1) FROM `delta`.`abfss://depts/hr/employees` WITH (CREDENTIAL my_azure_storage_cred);
Error

-- `finance` can create an external table over specific object within the `finance_loc` location
> CREATE TABLE sec_filings LOCATION 'abfss://depts/finance/sec_filings';

-- Grant access to sec filings to all employees


> GRANT SELECT ON TABLE sec_filings TO employee;

-- Any member of the`employee` group can securely read SEC filings


> SELECT count(1) FROM sec_filings;
20

Related articles
Create a storage credential (CLI)
ALTER STORAGE CREDENTIAL
ALTER TABLE
CREATE LOCATION
DESCRIBE STORAGE CREDENTIAL
DESCRIBE TABLE
DROP STORAGE CREDENTIAL
DROP TABLE
SHOW STORAGE CREDENTIALS
SHOW TABLES
GRANT
REVOKE
Delta Sharing
7/21/2022 • 2 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Delta Sharing is an open protocol for secure data sharing with other organizations regardless of which
computing platforms they use. It can share collections of tables in a Unity Catalog metastore in real time without
copying them, so that data recipients can immediately begin working with the latest version of the shared data.
Since: Databricks Runtime 10.3
There are two components to Delta Sharing:
Shares
A share provides a logical grouping for the tables you intend to share.
Recipients
A recipient identifies an organization with which you want to share any number of shares.

Shares
A share is a container instantiated with the CREATE SHARE command. Once created you can iteratively register a
collection of existing tables defined within the metastore using the ALTER SHARE command. You can register
tables under their original name, qualified by their original schema, or provide alternate exposed names.
You must be a metastore admin or account admin to create, alter, and drop shares.
Examples

-- Create share `customer_share` only if share with same name doesn't exist, with a comment.
> CREATE SHARE IF NOT EXISTS customer_share COMMENT 'This is customer share';

-- Add 2 tables to the share.


-- Expose my_schema.tab1 under a different name.
-- Expose only two partitions of other_schema.tab2
> ALTER SHARE customer_share ADD TABLE my_schema.tab1 AS their_schema.tab1;
> ALTER SHARE customer_share ADD TABLE other_schema.tab2 PARTITION (c1 = 5), (c1 = 7);

-- List the content of the share


> SHOW ALL IN SHARE customer_share;
 name              type  shared_object          added_at                     added_by                   comment partitions
 ----------------- ----- ---------------------- ---------------------------- -------------------------- ------- ------------------
 other_schema.tab2 TABLE main.other_schema.tab2 2022-01-01T00:00:01.000+0000 alwaysworks@databricks.com NULL    (c1 = 5), (c1 = 7)
 their_schema.tab1 TABLE main.my_schema.tab1    2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com NULL
Related articles
ALTER SHARE
CREATE SHARE
DESCRIBE SHARE
DROP SHARE
SHOW SHARES

Recipients
A recipient is an object you create using CREATE RECIPIENT to represent an organization to which you want to
allow access to shares. When you create a recipient, Databricks Runtime generates an activation link you can send
to the organization. To retrieve the activation link after creation you use DESCRIBE RECIPIENT.
Once a recipient has been created you can give it SELECT privileges on shares of your choice using GRANT ON
SHARE.
You must be a metastore administrator to create recipients, drop recipients, and grant access to shares.
Examples

-- Create a recipient.
> CREATE RECIPIENT IF NOT EXISTS other_org COMMENT 'other.org';

-- Retrieve the activation link to send to other.org


> DESCRIBE RECIPIENT other_org;
 name      created_at                   created_by                 comment   activation_link active_token_id                      active_token_expiration_time rotated_token_id rotated_token_expiration_time
 --------- ---------------------------- -------------------------- --------- --------------- ------------------------------------ ---------------------------- ---------------- -----------------------------
 other_org 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com other.org https://....    0160c81f-5262-40bb-9b03-3ee12e6d98d7 9999-12-31T23:59:59.999+0000 NULL             NULL

-- Choose shares that other.org has access to


> GRANT SELECT ON SHARE customer_share TO RECIPIENT other_org;

Related articles
CREATE RECIPIENT
DESCRIBE RECIPIENT
DROP RECIPIENT
SHOW RECIPIENT
ARRAY type
7/21/2022 • 2 minutes to read

Represents values comprising a sequence of elements with the type of elementType .

Syntax
ARRAY < elementType >

elementType : Any data type defining the type of the elements of the array.

Limits
The array type supports sequences of any length greater than or equal to 0.

Literals
See array function for details on how to produce literal array values.
See [ ] operator for details on how to retrieve elements from an array.

Examples
> SELECT ARRAY(1, 2, 3);
[1, 2, 3]

> SELECT CAST(ARRAY(1, 2, 3) AS ARRAY<TINYINT>);


[1, 2, 3]

> SELECT typeof(ARRAY());


ARRAY<NULL>

> SELECT CAST(ARRAY(ARRAY(1, 2), ARRAY(3, 4)) AS ARRAY<ARRAY<BIGINT>>);


[[1, 2], [3, 4]]

> SELECT a[1] FROM VALUES(ARRAY(3, 4)) AS T(a);


4

Related
[]
MAP type
STRUCT type
array function
cast function
BIGINT type
7/21/2022 • 2 minutes to read

Represents 8-byte signed integer numbers.

Syntax
{ BIGINT |
LONG }

Limits
The range of numbers is from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.

Literals
[ + | - ] digit [ ... ] [L]

digit : Any numeral from 0 to 9.


If the literal is not post-fixed with L (or l ) and it is within the range for an INT it will be implicitly turned into
an INT.

Examples
> SELECT +1L;
1

> SELECT CAST('5' AS BIGINT);


5

> SELECT typeof(-2147483);


INT

> SELECT typeof(123456789012345);


BIGINT

Related
TINYINT type
SMALLINT type
INT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
BINARY type
7/21/2022 • 2 minutes to read

Represents byte sequence values.

Syntax
BINARY

Limits
The type supports byte sequences of any length greater or equal to 0.

Literals
X { 'num [ ... ]' | "num [ ... ]" }

num : Any hexadecimal number from 0 to F.


The prefix X is case insensitive.
If the hexadecimal string literal has an odd length the parser prepends a 0.

Examples
> SELECT X'1';
[01]

> SELECT X'1ABF';


[1A BF]

> SELECT X'';


[ ]

> SELECT CAST('Spark' AS BINARY);


[53 70 61 72 6B]

Related
STRING type
cast function
BOOLEAN type
7/21/2022 • 2 minutes to read

Represents Boolean values.

Syntax
BOOLEAN

Limits
The type supports true and false values.

Literals
{ TRUE | FALSE }

Examples
> SELECT true;
TRUE

> SELECT typeof(false);


BOOLEAN

> SELECT CAST(0 AS BOOLEAN);


FALSE

> SELECT CAST(-1 AS BOOLEAN);


TRUE

> SELECT CAST('true' AS BOOLEAN);


TRUE

Related
cast function
DATE type
7/21/2022 • 2 minutes to read

Represents values comprising values of fields year, month, and day, without a time-zone.

Syntax
DATE

Limits
The range of dates supported is June 23 -5877641 CE to July 11 +5881580 CE .

Literals
DATE dateString

dateString
{ '[+|-]yyyy[...]' |
'[+|-]yyyy[...]-[m]m' |
'[+|-]yyyy[...]-[m]m-[d]d' |
'[+|-]yyyy[...]-[m]m-[d]d[T]' }

+ or - : An optional sign. - indicates BCE, + indicates CE (default).
yyyy[...] : A year comprising at least four digits.
[m]m : A one or two digit month between 01 and 12.
[d]d : A one or two digit day between 01 and 31.

The prefix DATE is case insensitive.


If the literal does not represent a proper date, Azure Databricks raises an error.

Examples
> SELECT DATE'0000';
0000-01-01

> SELECT DATE'2020-12-31';


2020-12-31

> SELECT DATE'2021-7-1T';


2021-07-01

> SELECT cast('1908-03-15' AS DATE)


1908-03-15

> SELECT DATE'-10000-01-01'


-10000-01-01

Related
TIMESTAMP type
INTERVAL type
cast function
DECIMAL type
7/21/2022 • 2 minutes to read

Represents numbers with a specified maximum precision and fixed scale.

Syntax
{ DECIMAL | DEC | NUMERIC } [ ( p [ , s ] ) ]

p : Optional maximum precision (total number of digits) of the number between 1 and 38. The default is 10.
s : Optional scale of the number between 0 and p . The number of digits to the right of the decimal point. The default is 0.

Limits
The range of numbers:
-1Ep + 1 to -1E-s
0
+1E-s to +1Ep - 1
For example a DECIMAL(5, 2) has a range of: -999.99 to 999.99.
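As an illustrative sketch (assuming ANSI mode, where an out-of-range cast raises an error rather than returning NULL), casting a value outside this range to DECIMAL(5, 2) fails:

> SELECT CAST(1000.00 AS DECIMAL(5, 2));
Error: CANNOT_CHANGE_DECIMAL_PRECISION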

Literals
decimal_digits { [ BD ] | [ exponent BD ] }
| digit [ ... ] [ exponent ] BD

decimal_digits:
[ + | - ] { digit [ ... ] . [ digit [ ... ] ]
| . digit [ ... ] }

exponent:
E [ + | - ] digit [ ... ]

digit : Any numeral from 0 to 9.


The BD postfix and exponent E are case insensitive.

Examples
> SELECT +1BD;
1

> SELECT 5E3BD;


5000

> SELECT 5.321E2BD;


532.1

> SELECT -6.45


-6.45

> SELECT typeof(6.45);


DECIMAL(3,2)

> SELECT CAST(5.345 AS DECIMAL(3, 2));


5.35

> SELECT typeof(CAST(5.345 AS DECIMAL));


DECIMAL(10, 0)

> SELECT typeof(CAST(5.345 AS DECIMAL(2)));


DECIMAL(2, 0)

Related
TINYINT type
SMALLINT type
INT type
BIGINT type
FLOAT type
DOUBLE type
cast function
DOUBLE type
7/21/2022 • 2 minutes to read

Represents 8-byte double-precision floating point numbers.

Syntax
DOUBLE

Limits
The range of numbers is:
Negative infinity
-1.79769E+308 to -2.225E-307
0
+2.225E-307 to +1.79769E+308
Positive infinity
NaN (not a number)

Literals
decimal_digits { D | exponent [ D ] }
| digit [ ... ] { exponent [ D ] | [ exponent ] D }

decimal_digits:
[ + | - ] { digit [ ... ] . [ digit [ ... ] ]
| . digit [ ... ] }

exponent:
E [ + | - ] digit [ ... ]

digit : Any numeral from 0 to 9.


The D postfix and E exponent are case insensitive.

Notes
DOUBLE is a base-2 numeric type. When given a literal which is base-10 the representation may not be exact.
Use DECIMAL type to accurately represent fractional or large base-10 numbers.
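A minimal sketch of this inexactness, assuming standard IEEE 754 double arithmetic for the displayed digits:

> SELECT 0.1D + 0.2D;
0.30000000000000004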

Examples
> SELECT +1D;
1.0

> SELECT 5E10;


5E10

> SELECT 5.3E10;


5.3E10

> SELECT -.1D;


-0.1

> SELECT 2.D;


2.0

> SELECT -5555555555555555.1D


-5.555555555555555E15

> SELECT CAST(-6.1 AS DOUBLE)


-6.1

Related
TINYINT type
SMALLINT type
INT type
BIGINT type
DECIMAL type
FLOAT type
cast function
Special floating point values
FLOAT type
7/21/2022 • 2 minutes to read

Represents 4-byte single-precision floating point numbers.

Syntax
{ FLOAT | REAL }

Limits
The range of numbers is:
Negative infinity
-3.402E+38 to -1.175E-37
0
+1.175E-37 to +3.402E+38
Positive infinity
NaN (not a number)

Literals
decimal_digits [ exponent ] F
| [ + | - ] digit [ ... ] [ exponent ] F

decimal_digits:
[ + | - ] { digit [ ... ] . [ digit [ ... ] ]
| . digit [ ... ] }

exponent:
E [ + | - ] digit [ ... ]

digit : Any numeral from 0 to 9.


The F postfix and E exponent are case insensitive.

Notes
FLOAT is a base-2 numeric type. When given a literal which is base-10 the representation may not be exact. Use
DECIMAL type to accurately represent fractional or large base-10 numbers.
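A sketch of this precision loss, assuming IEEE 754 single-precision rounding (FLOAT keeps roughly 7 significant decimal digits):

> SELECT 123456789F;
1.23456792E8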

Examples
> SELECT +1F;
1.0

> SELECT 5E10F;


5E10

> SELECT 5.3E10F;


5.3E10

> SELECT -.1F;


-0.1

> SELECT 2.F;


2.0

> SELECT -5555555555555555.1F


-5.5555558E15

> SELECT CAST(6.1 AS FLOAT)


6.1

Related
TINYINT type
SMALLINT type
INT type
BIGINT type
DECIMAL type
DOUBLE type
cast function
Special floating point values
INT type
7/21/2022 • 2 minutes to read

Represents 4-byte signed integer numbers.

Syntax
{ INT | INTEGER }

Limits
The range of numbers is from -2,147,483,648 to 2,147,483,647.

Literals
[ + | - ] digit [ ... ]

digit : Any numeral from 0 to 9.


If the literal is outside of the range for an INT it will be implicitly turned into a BIGINT.

Examples
> SELECT +1;
1

> SELECT CAST('5' AS INT);


5

> SELECT typeof(-2147483649);


BIGINT

Related
TINYINT type
SMALLINT type
BIGINT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
INTERVAL type
7/21/2022 • 2 minutes to read

Represents intervals of time either on a scale of seconds or months.

Syntax
INTERVAL { yearMonthIntervalQualifier | dayTimeIntervalQualifier }

yearMonthIntervalQualifier
{ YEAR [TO MONTH] |
MONTH }

dayTimeIntervalQualifier
{ DAY [TO { HOUR | MINUTE | SECOND } ] |
HOUR [TO { MINUTE | SECOND } ] |
MINUTE [TO SECOND] |
SECOND }

Notes
Intervals covering years or months are called year-month intervals.
Intervals covering days, hours, minutes, or seconds are called day-time intervals.
You cannot combine or compare year-month and day-time intervals.
Day-time intervals are strictly based on 86400s/day and 60s/min.
Seconds are always considered to include microseconds.

Limits
A year-month interval has a maximal range of +/- 178,956,970 years and 11 months.
A day-time interval has a maximal range of +/- 106,751,991 days, 23 hours, 59 minutes, and 59.999999
seconds.

Literals
year-month interval
INTERVAL [+|-] yearMonthIntervalString yearMonthIntervalQualifier

day-time interval
INTERVAL [+|-] dayTimeIntervalString dayTimeIntervalQualifier

yearMonthIntervalString
{ '[+|-] y[...]' |
'[+|-] y[...]-[m]m' }

dayTimeIntervalString
{ '[+|-] d[...]' |
'[+|-] d[...] [h]h' |
'[+|-] d[...] [h]h:[m]m' |
'[+|-] d[...] [h]h:[m]m:[s]s' |
'[+|-] d[...] [h]h:[m]m:[s]s.ms[ms][ms][us][us][us]' |
'[+|-] h[...]' |
'[+|-] h[...]:[m]m' |
'[+|-] h[...]:[m]m:[s]s' |
'[+|-] h[...]:[m]m:[s]s.ms[ms][ms][us][us][us]' |
'[+|-] m[...]' |
'[+|-] m[...]:[s]s' |
'[+|-] m[...]:[s]s.ms[ms][ms][us][us][us]' |
'[+|-] s[...]' |
'[+|-] s[...].ms[ms][ms][us][us][us]' }

y : The elapsed number of years.


m : The elapsed number of months.
d : The elapsed number of days.
h : The elapsed number of hours.
m : The elapsed number of minutes.
s : The elapsed number of seconds.
ms : The elapsed number of milliseconds.
us : The elapsed number of microseconds.

Unless a unit constitutes the leading unit of the intervalQualifier it must fall within the defined range:
Months: between 0 and 11
Hours: between 0 and 23
Minutes: between 0 and 59
Seconds: between 0.000000 and 59.999999
You can prefix a sign either inside or outside intervalString . If there is one - sign, the interval is negative. If
there are two or no - signs, the interval is positive. If the components in the intervalString do not match up
with the components in the intervalQualifier an error is raised. If the intervalString value does not fit into
the range specified by the intervalQualifier an error is raised.

Examples
> SELECT INTERVAL '100-00' YEAR TO MONTH;
100-0

> SELECT INTERVAL '-3600' MONTH;


-300-0

> SELECT INTERVAL -'200:13:50.3' HOUR TO SECOND;


-200:13:50.300000000

> SELECT typeof(INTERVAL -'200:13:50.3' HOUR TO SECOND);


interval hour to second

> SELECT CAST('11 23:4:0' AS INTERVAL DAY TO SECOND);


11 23:04:00.000000000
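The sign rules described in the Literals section can also be sketched directly; the output format follows the YEAR TO MONTH example above, and two - signs yield a positive interval:

> SELECT INTERVAL -'100-00' YEAR TO MONTH;
-100-0

> SELECT INTERVAL -'-100-00' YEAR TO MONTH;
100-0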

Related
DATE type
TIMESTAMP type
cast function
MAP type
7/21/2022 • 2 minutes to read

Represents values comprising a set of key-value pairs.

Syntax
MAP <keyType, valueType>

keyType : Any data type other than MAP specifying the keys.
valueType : Any data type specifying the values.

Limits
The map type supports maps of any cardinality greater or equal to 0.
The keys must be unique and not be NULL.
MAP is not a comparable data type.
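A minimal sketch of the NULL-key restriction (the exact error message may vary by runtime version):

> SELECT map(NULL, 1);
Error: Cannot use null as map key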

Literals
See map function for details on how to produce literal map values.
See [ ] operator for details on how to retrieve values from a map by key.

Examples
> SELECT map('red', 1, 'green', 2);
{red->1, green->2}

> SELECT typeof(CAST(NULL AS MAP<TIMESTAMP, INT>));


MAP<TIMESTAMP, INT>

> SELECT map(array(1, 2), map('green', 5));


{[1, 2]->{green->5}}

> SELECT CAST(map(struct('Hello', 'World'), 'Greeting') AS MAP<STRUCT<w1:string, w2:string>, string>);


{{Hello, World}->Greeting}

> SELECT m['red'] FROM VALUES(map('red', 1, 'green', 2)) AS T(m);


1

> SELECT map('red', 1) = map('red', 1);


Error: EqualTo does not support ordering on type map<string,int>

Related
[ ] operator
ARRAY type
STRUCT type
map function
cast function
VOID type
7/21/2022 • 2 minutes to read

Represents the untyped NULL value

Syntax
{ NULL | VOID }

Limits
The only value the VOID type can hold is NULL.

Literals
NULL

Examples
> SELECT typeof(NULL);
VOID

> SELECT typeof(cast(NULL AS VOID));


VOID

Related
cast function
SMALLINT type
7/21/2022 • 2 minutes to read

Represents 2-byte signed integer numbers.

Syntax
{ SMALLINT | SHORT }

Limits
The range of numbers is from -32,768 to 32,767.

Literals
[ + | - ] digit [ ... ] S

digit : Any numeral from 0 to 9.


The postfix S is case insensitive.

Examples
> SELECT +1S;
1

> SELECT CAST('5' AS SMALLINT);


5

Related
TINYINT type
INT type
BIGINT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
STRING type
7/21/2022 • 2 minutes to read

Represents character string values. The type supports character sequences of any length greater or equal to 0.

Syntax
STRING

Literals
[r|R]'c [ ... ]'

r or R

Since: Databricks Runtime 10.0


Optional prefix denoting a raw-literal.
c

Any character from the Unicode character set.


Unless the string is prefixed with r , use \ to escape special characters (e.g. ' or \ ).
If the string is prefixed with r there is no escape character.
You can use double quotes ( " ) instead of single quotes ( ' ) to delimit a string literal.

Examples
> SELECT 'Spark';
Spark

> SELECT CAST(5 AS STRING);


5

> SELECT 'O\'Connell'


O'Connell

> SELECT 'Some\nText'


Some
Text

> SELECT r'Some\nText'


Some\nText

> SELECT '서울시'


서울시

> SELECT ''

> SELECT '\\'


\

> SELECT r'\\'


\\

Related
cast function
STRUCT type
7/21/2022 • 2 minutes to read

Represents values with the structure described by a sequence of fields.

Syntax
STRUCT < [fieldName [:] fieldType [NOT NULL] [COMMENT str] [, …] ] >

fieldName : An identifier naming the field. The names need not be unique.
fieldType : Any data type.
NOT NULL : When specified the struct guarantees that the value of this field is never NULL.
COMMENT str : An optional string literal describing the field.

Limits
The type supports any number of fields greater or equal to 0.

Literals
See struct function and named_struct function for details on how to produce literal struct values.

Examples
> SELECT struct('Spark', 5);
{Spark, 5}

> SELECT typeof(named_struct('Field1', 'Spark', 'Field2', 5));


struct<Field1:string,Field2:int>

> SELECT typeof(struct('Spark', 5));


struct<col1:string,col2:int>

> SELECT typeof(CAST(NULL AS STRUCT<Field1:INT NOT NULL COMMENT 'The first field.',Field2:ARRAY<INT>>));
struct<Field1:int,Field2:array<int>>
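Individual fields of a struct can be referenced with dot notation, as in this small sketch (the table alias T(s) is used only for illustration):

> SELECT s.Field1 FROM VALUES (named_struct('Field1', 'Spark', 'Field2', 5)) AS T(s);
Spark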

Related
ARRAY type
MAP type
struct function
named_struct function
cast function
TIMESTAMP type
7/21/2022 • 2 minutes to read

Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local
time-zone. The timestamp value represents an absolute point in time.

Syntax
TIMESTAMP

Limits
The range of timestamps supported is June 23 -5877641 CE to July 11 +5881580 CE .

Literals
TIMESTAMP timestampString

timestampString
{ '[+|-]yyyy[...]' |
'[+|-]yyyy[...]-[m]m' |
'[+|-]yyyy[...]-[m]m-[d]d' |
'[+|-]yyyy[...]-[m]m-[d]d ' |
'[+|-]yyyy[...]-[m]m-[d]d[T][h]h[:]' |
'[+|-]yyyy[..]-[m]m-[d]d[T][h]h:[m]m[:]' |
'[+|-]yyyy[...]-[m]m-[d]d[T][h]h:[m]m:[s]s[.]' |
'[+|-]yyyy[...]-[m]m-[d]d[T][h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zoneId]' }

+ or - : An optional sign. - indicates BCE, + indicates CE (default).
yyyy : A year comprising at least four digits.
[m]m : A one or two digit month between 01 and 12.
[d]d : A one or two digit day between 01 and 31.
h[h] : A one or two digit hour between 00 and 23.
m[m] : A one or two digit minute between 00 and 59.
s[s] : A one or two digit second between 00 and 59.
[ms][ms][ms][us][us][us] : Up to 6 digits of fractional seconds.

zoneId :
Z - Zulu time zone UTC+0
+|-[h]h:[m]m
An ID with one of the prefixes UTC+, UTC-, GMT+, GMT-, UT+ or UT-, and a suffix in the formats:
+|-h[h]
+|-hh[:]mm
+|-hh:mm:ss
+|-hhmmss
Region-based zone IDs in the form <area>/<city> , for example, Europe/Paris .
If the month or day components are not specified they default to 1. If hour, minute, or second components are
not specified they default to 0. If no zoneId is specified it defaults to the session time zone.
If the literal does not represent a proper timestamp, Azure Databricks raises an error.

Notes
Timestamps with local timezone are internally normalized and persisted in UTC. Whenever the value or a
portion of it is extracted the local session timezone is applied.

Examples
> SELECT TIMESTAMP'0000';
0000-01-01 00:00:00

> SELECT TIMESTAMP'2020-12-31';


2020-12-31 00:00:00

> SELECT TIMESTAMP'2021-7-1T8:43:28.123456';


2021-07-01 08:43:28.123456

> SELECT current_timezone(), TIMESTAMP'2021-7-1T8:43:28UTC+3';


America/Los_Angeles 2021-06-30 22:43:28

> SELECT CAST('1908-03-15 10:1:17' AS TIMESTAMP)


1908-03-15 10:01:17

> SELECT TIMESTAMP'+10000';


+10000-01-01 00:00:00

Related
DATE type
INTERVAL type
cast function
TINYINT type
7/21/2022 • 2 minutes to read

Represents 1-byte signed integer numbers.

Syntax
{ TINYINT | BYTE }

Limits
The range of numbers is from -128 to 127.

Literals
[ + | - ] digit [ ... ] Y

digit : Any numeral from 0 to 9.


The Y postfix is case insensitive.

Examples
> SELECT +1Y;
1

> SELECT CAST('5' AS TINYINT);


5

Related
SMALLINT type
INT type
BIGINT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
Special floating point values (Databricks SQL)
7/21/2022 • 2 minutes to read

Several special floating point values are treated in a case-insensitive manner:


Inf, +Inf, Infinity, +Infinity: positive infinity
-Inf, -Infinity: negative infinity
NaN: not a number

Positive and negative infinity semantics


Positive and negative infinity have the following semantics:
Positive infinity multiplied by any positive value returns positive infinity.
Negative infinity multiplied by any positive value returns negative infinity.
Positive infinity multiplied by any negative value returns negative infinity.
Negative infinity multiplied by any negative value returns positive infinity.
Positive or negative infinity multiplied by 0 returns NaN.
Positive or negative infinity is equal to itself.
In aggregations, all positive infinity values are grouped together. Similarly, all negative infinity values are
grouped together.
Positive infinity and negative infinity are treated as normal values in join keys.
Positive infinity sorts lower than NaN and higher than any other values.
Negative infinity sorts lower than any other values.

NaN semantics
When dealing with float or double types that do not exactly match standard floating point semantics, NaN
has the following semantics:
NaN = NaN returns true.
In aggregations, all NaN values are grouped together.
NaN is treated as a normal value in join keys.
NaN values go last when in ascending order, larger than any other numeric value.

Examples
> SELECT double('infinity');
Infinity

> SELECT float('-inf');


-Infinity

> SELECT float('NaN');


NaN

> SELECT double('infinity') * 0;


NaN

> SELECT double('-infinity') * (-1234567);


Infinity

> SELECT double('infinity') < double('NaN');


true

> SELECT double('NaN') = double('NaN');


true

> SELECT double('inf') = double('infinity');


true

> SELECT COUNT(*), c2


FROM VALUES (1, double('infinity')),
(2, double('infinity')),
(3, double('inf')),
(4, double('-inf')),
(5, double('NaN')),
(6, double('NaN')),
(7, double('-infinity'))
AS test(c1, c2)
GROUP BY c2;
2 NaN
2 -Infinity
3 Infinity
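The ordering semantics above can also be sketched directly; negative infinity sorts lowest and NaN sorts last in ascending order:

> SELECT col FROM VALUES (double('NaN')), (double('infinity')), (double('-infinity')), (0.0D) AS t(col) ORDER BY col;
-Infinity
0.0
Infinity
NaN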

Related
FLOAT type (Databricks SQL)
DOUBLE type (Databricks SQL)
abs function
7/21/2022 • 2 minutes to read

Returns the absolute value of the numeric value in expr .

Syntax
abs(expr)

Arguments
expr : An expression that evaluates to a numeric or interval.

Interval arguments are supported since Databricks Runtime 10.1.

Returns
A numeric or interval of the same type as expr .
For integral numeric types the function can return an ARITHMETIC_OVERFLOW error.

WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.
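A minimal sketch of the wrap-around behavior, assuming ANSI mode has been disabled for the session:

> SET spark.sql.ansi.enabled = false;
> SELECT abs(cast(-32768 AS SMALLINT));
-32768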

Examples
> SELECT abs(-1);
1

> SELECT abs(cast(-32768 AS Smallint))


Error: ARITHMETIC_OVERFLOW

Related functions
sign function
signum function
negative function
positive function
acos function
7/21/2022 • 2 minutes to read

Returns the inverse cosine (arccosine) of expr .

Syntax
acos(expr)

Arguments
expr : A numeric expression.

Returns
A DOUBLE. If the argument is out of bounds, NaN is returned.

Examples
> SELECT acos(1);
0.0
> SELECT acos(2);
NaN

Related functions
cos function
acosh function
acosh function
7/21/2022 • 2 minutes to read

Returns the inverse hyperbolic cosine of expr .

Syntax
acosh(expr)

Arguments
expr : A numeric expression.

Returns
A DOUBLE. If the argument is out of bounds, NaN is returned.

Examples
> SELECT acosh(1);
0.0
> SELECT acosh(0);
NaN

Related functions
cos function
acos function
add_months function
7/21/2022 • 2 minutes to read

Returns the date that is numMonths after startDate .

Syntax
add_months(startDate, numMonths)

Arguments
startDate : A DATE expression.
numMonths : An integral number.

Returns
A DATE. If the day of the resulting date exceeds the number of days in that month, the result is rounded down to the last day of the
month. If the result exceeds the supported range for a date an overflow error is reported.

Examples
> SELECT add_months('2016-08-31', 1);
2016-09-30

> SELECT add_months('2016-08-31', -6);


2016-02-29

Related functions
dateadd function
datediff (timestamp) function
months_between function
aes_decrypt function
7/21/2022 • 2 minutes to read

Decrypts a binary produced using AES encryption.


Since: Databricks Runtime 10.3

Syntax
aes_decrypt(expr, key [, mode [, padding]])

Arguments
expr : The BINARY expression to be decrypted.
key : A BINARY expression. Must match the key originally used to produce the encrypted value and be 16,
24, or 32 bytes long.
mode : An optional STRING expression describing the encryption mode used to produce the encrypted value.
padding : An optional STRING expression describing how encryption handled padding of the value to key
length.

Returns
A BINARY.
mode must be one of (case insensitive):
'ECB' : Use Electronic CodeBook (ECB) mode.
'GCM' : Use Galois/Counter Mode (GCM) . This is the default.

padding must be one of (case insensitive):


'NONE' : Uses no padding. Valid only for 'GCM' .
'PKCS' : Uses Public Key Cryptography Standards (PKCS) padding. Valid only for 'ECB' .
'DEFAULT' : Uses 'NONE' for 'GCM' and 'PKCS' for 'ECB' mode.

Examples
> SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));
4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn

> SELECT cast(aes_decrypt(unbase64('4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn'),


'abcdefghijklmnop') AS STRING);
Spark

> SELECT base64(aes_encrypt('Spark SQL', '1234567890abcdef', 'ECB', 'PKCS'));


3lmwu+Mw0H3fi5NDvcu9lg==

> SELECT cast(aes_decrypt(unbase64('3lmwu+Mw0H3fi5NDvcu9lg=='),


'1234567890abcdef', 'ECB', 'PKCS') AS STRING);
Spark SQL

> SELECT base64(aes_encrypt('Spark SQL', '1234567890abcdef', 'GCM'));


2sXi+jZd/ws+qFC1Tnzvvde5lz+8Haryz9HHBiyrVohXUG7LHA==

> SELECT cast(aes_decrypt(unbase64('2sXi+jZd/ws+qFC1Tnzvvde5lz+8Haryz9HHBiyrVohXUG7LHA=='),


'1234567890abcdef', 'GCM') AS STRING);
Spark SQL

Related functions
aes_encrypt function
aes_encrypt function
7/21/2022 • 2 minutes to read

Encrypts a binary using AES encryption.


Since: Databricks Runtime 10.3

Syntax
aes_encrypt(expr, key [, mode [, padding]])

Arguments
expr : The BINARY expression to be encrypted.
key : A BINARY expression. The key to be used to encrypt expr . It must be 16, 24, or 32 bytes long.
mode : An optional STRING expression describing the encryption mode.
padding : An optional STRING expression describing how encryption handles padding of the value to key
length.

Returns
A BINARY.
mode must be one of (case insensitive):
'ECB' : Use Electronic CodeBook (ECB) mode.
'GCM' : Use Galois/Counter Mode (GCM) . This is the default.

padding must be one of (case insensitive):


'NONE' : Uses no padding. Valid only for 'GCM' .
'PKCS' : Uses Public Key Cryptography Standards (PKCS) padding. Valid only for 'ECB' . PKCS padding
adds between 1 and key-length number of bytes to pad expr to a multiple of key length. The value of each
pad byte is the number of bytes being padded.
'DEFAULT' : Uses 'NONE' for 'GCM' and 'PKCS' for 'ECB' mode.

Examples
> SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));
4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn

> SELECT cast(aes_decrypt(unbase64('4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn'),


'abcdefghijklmnop') AS STRING);
Spark

> SELECT base64(aes_encrypt('Spark SQL', '1234567890abcdef', 'ECB', 'PKCS'));


3lmwu+Mw0H3fi5NDvcu9lg==

> SELECT cast(aes_decrypt(unbase64('3lmwu+Mw0H3fi5NDvcu9lg=='),


'1234567890abcdef', 'ECB', 'PKCS') AS STRING);
Spark SQL

> SELECT base64(aes_encrypt('Spark SQL', '1234567890abcdef', 'GCM'));


2sXi+jZd/ws+qFC1Tnzvvde5lz+8Haryz9HHBiyrVohXUG7LHA==

> SELECT cast(aes_decrypt(unbase64('2sXi+jZd/ws+qFC1Tnzvvde5lz+8Haryz9HHBiyrVohXUG7LHA=='),


'1234567890abcdef', 'GCM') AS STRING);
Spark SQL
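The key may also be 24 or 32 bytes long for AES-192 or AES-256. A round-trip sketch with a hypothetical 32-byte key (GCM output includes a random component, so encrypt and decrypt are combined to keep the result deterministic):

> SELECT cast(aes_decrypt(aes_encrypt('Spark', '0000111122223333abcdefghijklmnop'),
                          '0000111122223333abcdefghijklmnop') AS STRING);
Spark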

Related functions
aes_decrypt function
aggregate function
7/21/2022 • 2 minutes to read

Aggregates elements in an array using a custom aggregator.

Syntax
aggregate(expr, start, merge [, finish])

Arguments
expr : An ARRAY expression.
start : An initial value of any type.
merge : A lambda function used to aggregate the current element.
finish : An optional lambda function used to finalize the aggregation.

Returns
The result type matches the result type of the finish lambda function if it is specified, or the type of start otherwise.
Applies an expression to an initial state and all elements in the array, and reduces this to a single state. The final
state is converted into the final result by applying a finish function.
The merge function takes two parameters. The first being the accumulator, the second the element to be
aggregated. The accumulator and the result must be of the type of start . The optional finish function takes
one parameter and returns the final result.
This function is a synonym for reduce function.

Examples
> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x);
6
> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10);
60

> SELECT aggregate(array(1, 2, 3, 4),


named_struct('sum', 0, 'cnt', 0),
(acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
acc -> acc.sum / acc.cnt) AS avg
2.5

Related functions
array function
reduce function
& (ampersand sign) operator
7/21/2022 • 2 minutes to read

Returns the bitwise AND of expr1 and expr2 .

Syntax
expr1 & expr2

Arguments
expr1 : An integral numeric type expression.
expr2 : An integral numeric type expression.

Returns
The result type matches the widest type of expr1 and expr2 .

Examples
> SELECT 3 & 5;
1

Related functions
| (pipe sign) operator
~ (tilde sign) operator
^ (caret sign) operator
bit_count function
and predicate
7/21/2022 • 2 minutes to read

Returns the logical AND of expr1 and expr2 .

Syntax
expr1 and expr2

Arguments
expr1 : A BOOLEAN expression
expr2 : A BOOLEAN expression

Returns
A BOOLEAN.

Examples
> SELECT true and true;
true
> SELECT true and false;
false
> SELECT true and NULL;
NULL
> SELECT false and NULL;
false

Related functions
or operator
not operator
any aggregate function
7/21/2022 • 2 minutes to read

Returns true if at least one value of expr in the group is true.

Syntax
any(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BOOLEAN.
The any aggregate function is synonymous with the max aggregate function, but limited to a BOOLEAN argument.

Examples
> SELECT any(col) FROM VALUES (true), (false), (false) AS tab(col);
true

> SELECT any(col) FROM VALUES (NULL), (true), (false) AS tab(col);


true

> SELECT any(col) FROM VALUES (false), (false), (NULL) AS tab(col);


false

> SELECT any(col1) FILTER (WHERE col2 = 1)


FROM VALUES (false, 1), (false, 2), (true, 2), (NULL, 2) AS tab(col1, col2);
false

Related functions
max aggregate function
min aggregate function
approx_count_distinct aggregate function
7/21/2022 • 2 minutes to read

Returns the estimated number of distinct values in expr within the group.

Syntax
approx_count_distinct(expr[, relativeSD]) [FILTER ( WHERE cond ) ]

Arguments
expr : Can be of any type for which equivalence is defined.
relativeSD : Defines the maximum relative standard deviation allowed.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BIGINT.

Examples
> SELECT approx_count_distinct(col1) FROM VALUES (1), (1), (2), (2), (3) tab(col1);
3
> SELECT approx_count_distinct(col1) FILTER(WHERE col2 = 10)
FROM VALUES (1, 10), (1, 10), (2, 10), (2, 10), (3, 10), (1, 12) AS tab(col1, col2);
3
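The relativeSD argument can also be passed explicitly; a sketch with a tighter 1% target (the result is unchanged here because the input is small):

> SELECT approx_count_distinct(col1, 0.01) FROM VALUES (1), (1), (2), (2), (3) AS tab(col1);
3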

Related functions
approx_percentile aggregate function
approx_top_k aggregate function
approx_percentile aggregate function
7/21/2022 • 2 minutes to read

Returns the approximate percentile of the expr within the group.

Syntax
approx_percentile ( [ALL | DISTINCT] expr, percentile [, accuracy] ) [FILTER ( WHERE cond ) ]

Arguments
expr : A numeric expression.
percentile : A numeric literal between 0 and 1 or a literal array of numeric values, each between 0 and 1.
accuracy : An INTEGER literal greater than 0. If accuracy is omitted it is set to 10000.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The aggregate function returns the expression that is the smallest value in the ordered group (sorted from least
to greatest) such that no more than percentile of expr values is less than the value or equal to that value.
If percentile is an array, approx_percentile returns the approximate percentile array of expr at percentile .
The accuracy parameter controls approximation accuracy at the cost of memory. A higher value of accuracy
yields better accuracy; 1.0/accuracy is the relative error of the approximation. This function is a synonym for
percentile_approx aggregate function.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT approx_percentile(col, array(0.5, 0.4, 0.1), 100) FROM VALUES (0), (1), (2), (10) AS tab(col);
[1,1,0]
> SELECT approx_percentile(col, 0.5, 100) FROM VALUES (0), (6), (6), (7), (9), (10) AS tab(col);
6
> SELECT approx_percentile(DISTINCT col, 0.5, 100) FROM VALUES (0), (6), (6), (7), (9), (10) AS tab(col);
7

Related functions
approx_count_distinct aggregate function
approx_top_k aggregate function
percentile aggregate function
percentile_approx aggregate function
percentile_cont aggregate function
approx_top_k aggregate function
7/21/2022 • 2 minutes to read

Returns the top k most frequently occurring item values in an expr along with their approximate counts.
Since: Databricks Runtime 10.2

Syntax
approx_top_k(expr[, k[, maxItemsTracked]]) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression of STRING, BOOLEAN, DATE, TIMESTAMP, or numeric type.
k : An optional INTEGER literal greater than 0. If k is not specified, it defaults to 5 .
maxItemsTracked : An optional INTEGER literal greater than or equal to k . If maxItemsTracked is not specified,
it defaults to 10000 .
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
Results are returned as an ARRAY of type STRUCT, where each STRUCT contains an item field for the value
(with its original input type) and a count field (of type LONG) with the approximate number of occurrences. The
array is sorted by count descending.
The aggregate function returns the top k most frequently occurring item values in an expression expr along
with their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked where
numRows is the total number of rows. Higher values of maxItemsTracked provide better accuracy at the cost of
increased memory usage. Expressions that have fewer than maxItemsTracked distinct items will yield exact item
counts. Results include NULL values as their own item in the results.

Examples
> SELECT approx_top_k(expr) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
[{'item':4,'count':2},{'item':1,'count':2},{'item':0,'count':2},{'item':3,'count':1},{'item':2,'count':1}]

> SELECT approx_top_k(expr, 2) FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' AS tab(expr);
[{'item':'c','count':4},{'item':'d','count':2}]

> SELECT approx_top_k(expr, 10, 100) FROM VALUES (0), (1), (1), (2), (2), (2) AS tab(expr);
[{'item':2,'count':3},{'item':1,'count':2},{'item':0,'count':1}]

Related functions
approx_count_distinct aggregate function
approx_percentile aggregate function
array function
7/21/2022 • 2 minutes to read

Returns an array with the elements in expr .

Syntax
array(expr [, ...])

Arguments
exprN : Elements of any type that share a least common type.

Returns
An array of elements of exprNs least common type.
If the array is empty or all elements are NULL the result type is an array of type null.

Examples
-- an array of integers
> SELECT array(1, 2, 3);
[1,2,3]
-- an array of strings
> SELECT array(1.0, 1, 'hello');
[1.0,1,hello]

Related
[ ] operator
map function
collect_set aggregate function
collect_list aggregate function
SQL data type rules
array_agg aggregate function
7/21/2022 • 2 minutes to read

Returns an array consisting of all values in expr within the group.


Since: Databricks Runtime 10.4

Syntax
array_agg ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression of any type.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
An ARRAY of the argument type.
The order of elements in the array is non-deterministic. NULL values are excluded.
If DISTINCT is specified the function collects only unique values and is a synonym for collect_set aggregate
function.
This function is a synonym for collect_list aggregate function.

Examples
> SELECT array_agg(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2,1]
> SELECT array_agg(DISTINCT col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2]

Related functions
array function
collect_list aggregate function
collect_set aggregate function
array_contains function
7/21/2022 • 2 minutes to read

Returns true if array contains value .

Syntax
array_contains(array, value)

Arguments
array : An ARRAY to be searched.
value : An expression with a type sharing a least common type with the array elements.

Returns
A BOOLEAN. If value is NULL , the result is NULL . If any element in array is NULL , the result is NULL if value
is not matched to any other element.

Examples
> SELECT array_contains(array(1, 2, 3), 2);
true
> SELECT array_contains(array(1, NULL, 3), 2);
NULL
> SELECT array_contains(array(1, 2, 3), NULL);
NULL

Related
arrays_overlap function
array_position function
SQL data type rules
array_distinct function
7/21/2022 • 2 minutes to read

Removes duplicate values from array .

Syntax
array_distinct(array)

Arguments
array : An ARRAY expression.

Returns
The function returns an array of the same type as the input argument where all duplicate values have been
removed.

Examples
> SELECT array_distinct(array(1, 2, 3, NULL, 3));
[1,2,3,NULL]

Related functions
array_except function
array_intersect function
array_sort function
array_remove function
array_union function
array_except function
7/21/2022 • 2 minutes to read

Returns an array of the elements in array1 but not in array2 .

Syntax
array_except(array1, array2)

Arguments
array1 : An ARRAY of any type with comparable elements.
array2 : An ARRAY of elements sharing a least common type with the elements of array1 .

Returns
An ARRAY of matching type to array1 with no duplicates.

Examples
> SELECT array_except(array(1, 2, 2, 3), array(1, 1, 3, 5));
[2]

Related
array_distinct function
array_intersect function
array_sort function
array_remove function
array_union function
SQL data type rules
array_intersect function
7/21/2022 • 2 minutes to read

Returns an array of the elements in the intersection of array1 and array2 .

Syntax
array_intersect(array1, array2)

Arguments
array1 : An ARRAY of any type with comparable elements.
array2 : An ARRAY of elements sharing a least common type with the elements of array1 .

Returns
An ARRAY of matching type to array1 with no duplicates and elements contained in both array1 and array2 .

Examples
> SELECT array_intersect(array(1, 2, 3), array(1, 3, 3, 5));
[1,3]

Related
array_distinct function
array_except function
array_sort function
array_remove function
array_union function
SQL data type rules
array_join function
7/21/2022 • 2 minutes to read

Concatenates the elements of array .

Syntax
array_join(array, delimiter [, nullReplacement])

Arguments
array : Any ARRAY type, but its elements are interpreted as strings.
delimiter : A STRING used to separate the concatenated array elements.
nullReplacement : A STRING used to express a NULL value in the result.

Returns
A STRING where the elements of array are separated by delimiter and null elements are replaced with
nullReplacement . If nullReplacement is omitted, null elements are filtered out. If any argument is NULL , the
result is NULL .

Examples
> SELECT array_join(array('hello', 'world'), ' ');
hello world
> SELECT array_join(array('hello', NULL ,'world'), ' ');
hello world
> SELECT array_join(array('hello', NULL ,'world'), ' ', ',');
hello , world

Related functions
concat function
concat_ws function
array_max function
7/21/2022 • 2 minutes to read

Returns the maximum value in array .

Syntax
array_max(array)

Arguments
array : Any ARRAY with elements for which order is supported.

Returns
The result matches the type of the elements. NULL elements are skipped. If array is empty, or contains only
NULL elements, NULL is returned.

Examples
> SELECT array_max(array(1, 20, NULL, 3));
20

Related functions
array_min function
array_min function
7/21/2022 • 2 minutes to read

Returns the minimum value in array .

Syntax
array_min(array)

Arguments
array : Any ARRAY with elements for which order is supported.

Returns
The result matches the type of the elements. NULL elements are skipped. If array is empty, or contains only
NULL elements, NULL is returned.

Examples
> SELECT array_min(array(1, 20, NULL, 3));
1

Related functions
array_max function
array_position function
7/21/2022 • 2 minutes to read

Returns the position of the first occurrence of element in array .

Syntax
array_position(array, element)

Arguments
array : An ARRAY with comparable elements.
element : An expression matching the types of the elements in array .

Returns
A long type.
Array indexing starts at 1. If the element value is NULL a NULL is returned.

Examples
> SELECT array_position(array(3, 2, 1, 4, 1), 1);
3
> SELECT array_position(array(3, NULL, 1), NULL)
NULL

Related functions
array_contains function
arrays_overlap function
array_remove function
7/21/2022 • 2 minutes to read

Removes all occurrences of element from array .

Syntax
array_remove(array, element)

Arguments
array : An ARRAY.
element : An expression of a type sharing a least common type with the elements of array .

Returns
The result type matches the type of the array.
If the element to be removed is NULL , the result is NULL .

Examples
> SELECT array_remove(array(1, 2, 3, NULL, 3, 2), 3);
[1,2,NULL,2]
> SELECT array_remove(array(1, 2, 3, NULL, 3, 2), NULL);
NULL

Related
array_except function
SQL data type rules
array_repeat function
7/21/2022 • 2 minutes to read

Returns an array containing element count times.

Syntax
array_repeat(element, count)

Arguments
element : An expression of any type.
count : An INTEGER greater or equal to 0.

Returns
An ARRAY consisting of count copies of element .

Examples
> SELECT array_repeat('123', 2);
[123, 123]

Related functions
array function
array_size function
7/21/2022 • 2 minutes to read

Returns the number of elements in array .


Since: Databricks Runtime 10.5

Syntax
array_size(array)

Arguments
array : An ARRAY expression.

Returns
An INTEGER.

Examples
> SELECT array_size(array(1, NULL, 3, NULL));
4

> SELECT array_size(array());


0

Related
array function
element_at function
array_sort function
7/21/2022 • 2 minutes to read

Returns array sorted according to func .

Syntax
array_sort(array [, func])

Arguments
array : An expression that evaluates to an array.
func : An optional lambda function defining the sort order.

Returns
The result type matches the type of array .
If func is omitted, the array is sorted in ascending order.
If func is provided it takes two arguments representing two elements of the array.
The function must return -1, 0, or 1 depending on whether the first element is less than, equal to, or greater than
the second element.
If the func returns other values (including NULL), array_sort fails and raises an error.
NULL elements are placed at the end of the returned array.

Examples
> SELECT array_sort(array(5, 6, 1),
(left, right) -> CASE WHEN left < right THEN -1
WHEN left > right THEN 1 ELSE 0 END);
[1,5,6]
> SELECT array_sort(array('bc', 'ab', 'dc'),
(left, right) -> CASE WHEN left IS NULL and right IS NULL THEN 0
WHEN left IS NULL THEN -1
WHEN right IS NULL THEN 1
WHEN left < right THEN 1
WHEN left > right THEN -1 ELSE 0 END);
[dc,bc,ab]
> SELECT array_sort(array('b', 'd', null, 'c', 'a'));
[a,b,c,d,NULL]

Related functions
array_distinct function
array_intersect function
array_except function
array_remove function
array_union function
sort_array function
array_union function
7/21/2022 • 2 minutes to read

Returns an array of the elements in the union of array1 and array2 without duplicates.

Syntax
array_union(array1, array2)

Arguments
array1 : An ARRAY.
array2 : An ARRAY of the same type as array1 .

Returns
An ARRAY of the same type as array1 .

Examples
> SELECT array_union(array(1, 2, 2, 3), array(1, 3, 5));
[1,2,3,5]

Related functions
array_distinct function
array_intersect function
array_except function
array_sort function
array_remove function
zip_with function
arrays_overlap function
7/21/2022 • 2 minutes to read

Returns true if the intersection of array1 and array2 is not empty.

Syntax
arrays_overlap (array1, array2)

Arguments
array1 : An ARRAY.
array2 : An ARRAY sharing a least common type with array1 .

Returns
The result is a BOOLEAN, which is true if there is overlap.
If the arrays have no common non-null element, they are both non-empty, and either of them contains a null
element, the result is NULL . Otherwise the result is false .

Examples
> SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5));
true
> SELECT arrays_overlap(array(1, 2, NULL, 3), array(NULL, 4, 5));
NULL

Related
array_contains function
array_position function
SQL data type rules
arrays_zip function
7/21/2022 • 2 minutes to read

Returns a merged array of structs in which the nth struct contains all Nth values of input arrays.

Syntax
arrays_zip (array1 [, ...])

Arguments
arrayN : An ARRAY.

Returns
An ARRAY of STRUCT where the type of the nth field matches the type of the elements of arrayN .
The number of array arguments can be 0 or more. If the function is called without arguments it returns an
empty array of an empty struct. Arrays that are shorter than the largest array are extended with null elements.

Examples
> SELECT arrays_zip(array(1, 2, 3), array(2, 3, 4));
[{1,2},{2,3},{3,4}]
> SELECT arrays_zip(array(1, 2), array(2, 3), array(3, 4));
[{1,2,3},{2,3,4}]
> SELECT arrays_zip(array(1, 2), array('shoe', 'string', 'budget'));
[{1, shoe},{2, string},{null,budget}]
> SELECT arrays_zip();
[{}]

Related functions
ascii function
7/21/2022 • 2 minutes to read

Returns the ASCII code point of the first character of str .

Syntax
ascii(str)

Arguments
str : A STRING.

Returns
An INTEGER.
If str is empty, the result is 0. If the first character is not an ASCII character or part of the Latin-1 Supplement
range of UTF-16, the result is undefined.

Examples
> SELECT ascii('234');
50
> SELECT ascii('');
0

Related functions
chr function
char function
asin function
7/21/2022 • 2 minutes to read

Returns the inverse sine (arcsine) of expr .

Syntax
asin(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE. If the argument is out of bounds, the result is NaN .

Examples
> SELECT asin(0);
0.0
> SELECT asin(2);
NaN

Related functions
sin function
acos function
atan function
asinh function
asinh function
7/21/2022 • 2 minutes to read

Returns the inverse hyperbolic sine of expr .

Syntax
asinh(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
The result type is DOUBLE. If the argument is out of bounds, the result is NaN .

Examples
> SELECT asinh(0);
0.0

Related functions
sin function
asin function
assert_true function
7/21/2022 • 2 minutes to read

Returns an error if expr is not true.

Syntax
assert_true(expr)

Arguments
expr : A BOOLEAN expression.

Returns
An untyped NULL if no error is returned.

Examples
> SELECT assert_true(0 < 1);
NULL
> SELECT assert_true(0 > 1);
'0 > 1' is not true

Related functions
* (asterisk sign) operator
7/21/2022 • 2 minutes to read

Returns multiplier multiplied by multiplicand .

Syntax
multiplier * multiplicand

Arguments
multiplier : A numeric or INTERVAL expression.
multiplicand : A numeric expression or INTERVAL expression.

You may not specify an INTERVAL for both arguments.

Returns
If both multiplier and multiplicand are DECIMAL, the result is DECIMAL.
If multiplier or multiplicand is an INTERVAL, the result is of the same type.
If both multiplier and multiplicand are integral numeric types, the result is the larger of the two types.
In all other cases the result is a DOUBLE.
If either the multiplier or the multiplicand is 0, the operator returns 0.
If the result of the multiplication is outside the bound for the result type an ARITHMETIC_OVERFLOW error is
raised.
Use try_multiply to return NULL on overflow.

WARNING
If spark.sql.ansi.enabled is false the result “wraps” if it is out of bounds for integral types and NULL for fractional types.

Examples
> SELECT 3 * 2;
6

> SELECT 2L * 2L;


4L

> SELECT INTERVAL '3' YEAR * 3;


9-0

> SELECT 100Y * 100Y;


Error: ARITHMETIC_OVERFLOW
Related functions
div operator
- (minus sign) operator
+ (plus sign) operator
/ (slash sign) operator
sum aggregate function
try_multiply function
atan function
7/21/2022 • 2 minutes to read

Returns the inverse tangent (arctangent) of expr .

Syntax
atan(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT atan(0);
0.0

Related functions
atan2 function
acos function
asin function
tan function
atan2 function
7/21/2022 • 2 minutes to read

Returns the angle in radians between the positive x-axis of a plane and the point specified by the coordinates (
exprX , exprY ).

Syntax
atan2(exprY, exprX)

Arguments
exprY : An expression that evaluates to a numeric.
exprX : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT atan2(0, 0);
0.0

Related functions
atan function
acos function
asin function
tan function
atanh function
7/21/2022 • 2 minutes to read

Returns inverse hyperbolic tangent of expr .

Syntax
atanh(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE. If the argument is out of bounds, the result is NaN.

Examples
> SELECT atanh(0);
0.0
> SELECT atanh(2);
NaN

Related functions
atan function
tan function
atan2 function
acosh function
asinh function
avg aggregate function
7/21/2022 • 2 minutes to read

Returns the mean calculated from values of a group.

Syntax
avg( [ALL | DISTINCT] expr) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric or an interval.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type is computed as for the arguments:
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
year-month interval: The result is an INTERVAL YEAR TO MONTH .
day-time interval: The result is an INTERVAL DAY TO SECOND .
In all other cases the result is a DOUBLE.
Nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified the average is computed after duplicates have been removed.
If the result overflows the result type, Databricks Runtime raises an ARITHMETIC_OVERFLOW error or
CANNOT_CHANGE_DECIMAL_PRECISION error. To return a NULL instead use try_avg.

WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but return NULL.

This function is a synonym for mean aggregate function.

Examples
> SELECT avg(col) FROM VALUES (1), (2), (3) AS tab(col);
2.0

> SELECT avg(DISTINCT col) FROM VALUES (1), (1), (2) AS tab(col);
1.5

> SELECT avg(col) FROM VALUES (1), (2), (NULL) AS tab(col);


1.5

> SELECT avg(col) FROM VALUES (INTERVAL '1' YEAR), (INTERVAL '2' YEAR) AS tab(col);
1-6

-- Overflow results in NULL for try_avg()


> SELECT try_avg(col) FROM VALUES (5e37::DECIMAL(38, 0)), (5e37::DECIMAL(38, 0)) AS tab(col);
NULL

-- Overflow causes error for avg() in ANSI mode.


> SELECT avg(col) FROM VALUES (5e37::DECIMAL(38, 0)), (5e37::DECIMAL(38, 0)) AS tab(col);
Error: CANNOT_CHANGE_DECIMAL_PRECISION

Related functions
aggregate function
max aggregate function
mean aggregate function
min aggregate function
try_avg aggregate function
try_sum aggregate function
sum aggregate function
!= (bangeq sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 does not equal expr2 , or false otherwise.

Syntax
expr1 != expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .

Returns
A BOOLEAN.
This function is a synonym for <> (lt gt sign) operator.

Examples
> SELECT 2 != 2;
false
> SELECT 3 != 2;
true
> SELECT 1 != '1';
false
> SELECT true != NULL;
NULL
> SELECT NULL != NULL;
NULL

Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
! (bang sign) operator
7/21/2022 • 2 minutes to read

Returns the logical NOT of a Boolean expression.

Syntax
!expr

Arguments
expr : A BOOLEAN expression.

Returns
A BOOLEAN.
This operator is a synonym for not operator.

Examples
> SELECT !true;
false
> SELECT !false;
true
> SELECT !NULL;
NULL

Related functions
and predicate
or operator
not operator
base64 function
7/21/2022 • 2 minutes to read

Converts expr to a base 64 string.

Syntax
base64(expr)

Arguments
expr : A BINARY expression or a STRING which the function will interpret as BINARY.

Returns
A STRING.

Examples
> SELECT base64('Spark SQL');
U3BhcmsgU1FM

Related functions
unbase64 function
between predicate
7/21/2022 • 2 minutes to read

Tests whether expr1 is greater than or equal to expr2 and less than or equal to expr3 .

Syntax
expr1 [not] between expr2 and expr3

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with all other arguments.
expr3 : An expression that shares a least common type with all other arguments.

Returns
The results is a BOOLEAN.
If not is specified the function is a synonym for expr1 < expr2 or expr1 > expr3 .
Without not the function is a synonym for expr1 >= expr2 and expr1 <= expr3 .

Examples
> SELECT 4 between 3 and 5;
true
> SELECT 4 not between 3 and 5;
false
> SELECT 4 not between NULL and 5;
NULL

Related
in predicate
and predicate
SQL data type rules
bigint function
7/21/2022 • 2 minutes to read

Casts the value expr to BIGINT.

Syntax
bigint(expr)

Arguments
expr : Any expression which is castable to BIGINT.

Returns
A BIGINT.
This function is a synonym for CAST(expr AS BIGINT) .
See cast function for details.

Examples
> SELECT bigint(current_timestamp);
1616168320
> SELECT bigint('5');
5

Related functions
cast function
bin function
7/21/2022 • 2 minutes to read

Returns the binary representation of expr .

Syntax
bin(expr)

Arguments
expr : A BIGINT expression.

Returns
A STRING consisting of 1s and 0s.

Examples
> SELECT bin(13);
1101
> SELECT bin(-13);
1111111111111111111111111111111111111111111111111111111111110011
> SELECT bin(13.3);
1101

Related functions
binary function
7/21/2022 • 2 minutes to read

Casts the value of expr to BINARY.

Syntax
binary(expr)

Arguments
expr : Any expression that can be cast to BINARY.

Returns
A BINARY.
This function is a synonym for CAST(expr AS BINARY) .
See cast function for details.

Examples
> SELECT binary('Spark SQL');
[53 70 61 72 6B 20 53 51 4C]

> SELECT binary(1984);


[00 00 07 C0]

Related functions
cast function
bit_and aggregate function
7/21/2022 • 2 minutes to read

Returns the bitwise AND of all input values in the group.

Syntax
bit_and(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to an integral numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type matches the argument type.

Examples
> SELECT bit_and(col) FROM VALUES (3), (5) AS tab(col);
1
> SELECT bit_and(col) FILTER(WHERE col < 6) FROM VALUES (3), (5), (6) AS tab(col);
1

Related functions
bit_or aggregate function
bit_xor aggregate function
some aggregate function
bit_count function
7/21/2022 • 2 minutes to read

Returns the number of bits set in the argument.

Syntax
bit_count(expr)

Arguments
expr : A BIGINT or BOOLEAN expression.

Returns
An INTEGER.

Examples
> SELECT bit_count(0);
0
> SELECT bit_count(5);
2
> SELECT bit_count(-1);
64
> SELECT bit_count(true);
1

Related functions
| (pipe sign) operator
& (ampersand sign) operator
^ (caret sign) operator
~ (tilde sign) operator
bit_get function
7/21/2022 • 2 minutes to read

Returns the value of a bit in a binary representation of an integral numeric.


Since: Databricks Runtime 10.0

Syntax
bit_get(expr, pos))

Arguments
expr : An expression that evaluates to an integral numeric.
pos : An expression of type INTEGER.

Returns
The result type is an INTEGER.
The result value is 1 if the bit is set, 0 otherwise.
Bits are counted right to left and 0-based.
If pos is outside the bounds of the data type of expr Databricks Runtime raises an error.
bit_get is a synonym of getbit.

Examples
> SELECT hex(23Y), bit_get(23Y, 3);
17 0

> SELECT hex(23Y), bit_get(23Y, 0);
17 1

> SELECT bit_get(23Y, 8);


Error: Invalid bit position: 8 exceeds the bit upper limit

> SELECT bit_get(23Y, -1);


Error: Invalid bit position: -1 is less than zero

Related functions
bit_reverse function
getbit function
~ (tilde sign) operator
bit_length function
7/21/2022 • 2 minutes to read

Returns the bit length of string data or number of bits of binary data.

Syntax
bit_length(expr)

Arguments
expr : A BINARY or STRING expression.

Returns
An INTEGER.

Examples
> SELECT bit_length('Spark SQL');
72
> SELECT bit_length('北京');
48

Related functions
length function
char_length function
character_length function
bit_or aggregate function
7/21/2022 • 2 minutes to read

Returns the bitwise OR of all input values in the group.

Syntax
bit_or(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to an integral numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type matches the argument type.

Examples
> SELECT bit_or(col) FROM VALUES (3), (5) AS tab(col);
7
> SELECT bit_or(col) FILTER(WHERE col < 8) FROM VALUES (3), (5), (8) AS tab(col);
7

Related functions
bit_and aggregate function
bit_xor aggregate function
some aggregate function
bit_xor aggregate function
7/21/2022 • 2 minutes to read

Returns the bitwise XOR of all input values in the group.

Syntax
bit_xor ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to an integral numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type matches the argument type.
If DISTINCT is specified the aggregate operates only on distinct values.

Examples
> SELECT bit_xor(col) FROM VALUES (3), (3), (5) AS tab(col);
5
> SELECT bit_xor(DISTINCT col) FROM VALUES (3), (3), (5) AS tab(col);
6

Related functions
bit_or aggregate function
bit_and aggregate function
some aggregate function
bool_and aggregate function
7/21/2022 • 2 minutes to read

Returns true if all values in expr are true within the group.

Syntax
bool_and(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BOOLEAN.

Examples
> SELECT bool_and(col) FROM VALUES (true), (true), (true) AS tab(col);
true
> SELECT bool_and(col) FROM VALUES (NULL), (true), (true) AS tab(col);
true
> SELECT bool_and(col) FROM VALUES (true), (false), (true) AS tab(col);
false

Related functions
bool_or aggregate function
every aggregate function
bool_or aggregate function
7/21/2022 • 2 minutes to read

Returns true if at least one value in expr is true within the group.

Syntax
bool_or(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BOOLEAN.

Examples
> SELECT bool_or(col) FROM VALUES (true), (false), (false) AS tab(col);
true
> SELECT bool_or(col) FROM VALUES (NULL), (true), (false) AS tab(col);
true
> SELECT bool_or(col) FROM VALUES (false), (false), (NULL) AS tab(col);
false

Related functions
bool_and aggregate function
every aggregate function
some aggregate function
boolean function
7/21/2022 • 2 minutes to read

Casts expr to boolean.

Syntax
boolean(expr)

Arguments
expr : Any expression that can be cast to BOOLEAN.

Returns
A BOOLEAN.
This function is a synonym for CAST(expr AS BOOLEAN) .
See cast function for details.

Examples
> SELECT boolean(1);
true
> SELECT boolean(0);
false

> SELECT boolean(2);


true

Related functions
cast function
bround function
7/21/2022 • 2 minutes to read

Returns the rounded expr using HALF_EVEN rounding mode.

Syntax
bround(expr [,targetScale] )

Arguments
expr : A numeric expression.
targetScale : An INTEGER expression greater or equal to 0. If targetScale is omitted the default is 0.

Returns
If expr is DECIMAL the result is DECIMAL with a scale that is the smaller of expr scale and targetScale .
In HALF_EVEN rounding, also known as Gaussian or banker’s rounding, the digit 5 is rounded towards an even
digit.

Examples
> SELECT bround(2.5, 0);
2
> SELECT bround(2.6, 0);
3
> SELECT bround(3.5, 0);
4
> SELECT bround(2.25, 1);
2.2

Related functions
floor function
ceiling function
ceil function
round function
btrim function
7/21/2022 • 2 minutes to read

Returns str with leading and trailing characters removed.


Since: Databricks Runtime 10.0

Syntax
btrim( str [, trimStr ] )

Arguments
str : A STRING expression to be trimmed.
trimStr : An optional STRING expression with characters to be trimmed. The default is a space character.

Returns
A STRING.
The function removes any leading and trailing characters within trimStr from str .

Examples
> SELECT 'X' || btrim(' SparkSQL ') || 'X';
XSparkSQLX

> SELECT btrim('abcaabaSparkSQLabcaaba', 'abc');


SparkSQL

Related functions
lpad function
ltrim function
rpad function
rtrim function
trim function
cardinality function
7/21/2022 • 2 minutes to read

Returns the size of expr .

Syntax
cardinality(expr)

Arguments
expr : An ARRAY or MAP expression.

Returns
An INTEGER.

Examples
> SELECT cardinality(array('b', 'd', 'c', 'a'));
4
> SELECT cardinality(map('a', 1, 'b', 2));
2

Related functions
^ (caret sign) operator
7/21/2022 • 2 minutes to read

Returns the bitwise exclusive OR (XOR) of expr1 and expr2 .

Syntax
expr1 ^ expr2

Arguments
expr1 : An integral numeric type expression.
expr2 : An integral numeric type expression.

Returns
The result type matches the least common type of expr1 and expr2 .

Examples
> SELECT 3 ^ 5;
6

Related functions
& (ampersand sign) operator
~ (tilde sign) operator
| (pipe sign) operator
bit_count function
case expression
7/21/2022 • 2 minutes to read

Returns resN for the first optN that equals expr or def if none matches.
Returns resN for the first condN evaluating to true, or def if none found.

Syntax
CASE expr {WHEN opt1 THEN res1} [...] [ELSE def] END

CASE {WHEN cond1 THEN res1} [...] [ELSE def] END

Arguments
expr : Any expression for which comparison is defined.
optN : An expression that has a least common type with expr and all other optN .
resN : Any expression that has a least common type with all other resN and def .
def : An optional expression that has a least common type with all resN .
condN : A BOOLEAN expression.

Returns
The result type matches the least common type of resN and def .
If def is omitted the default is NULL. Conditions are evaluated in order and only the resN or def which yields
the result is executed.

Examples
> SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
1.0
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
2.0
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;
NULL
> SELECT CASE 3 WHEN 1 THEN 'A' WHEN 2 THEN 'B' WHEN 3 THEN 'C' END;
C

Related articles
decode function
coalesce function
nullif function
nvl function
nvl2 function
SQL data type rules
cast function
7/21/2022 • 13 minutes to read

Casts the value expr to the target data type type .

Syntax
cast(sourceExpr AS targetType)

Arguments
sourceExpr : Any castable expression.
targetType : The data type of the result.

Returns
The result is type targetType .
The following combinations of data type casting are valid:

A VOID source can be cast to any target type.
numeric can be cast to: numeric, STRING, TIMESTAMP, BOOLEAN.
STRING can be cast to: numeric, STRING, DATE, TIMESTAMP, year-month interval, day-time interval, BOOLEAN, BINARY.
DATE can be cast to: STRING, DATE, TIMESTAMP.
TIMESTAMP can be cast to: numeric, STRING, DATE, TIMESTAMP.
year-month interval can be cast to: STRING, year-month interval.
day-time interval can be cast to: STRING, day-time interval.
BOOLEAN can be cast to: numeric, STRING, TIMESTAMP, BOOLEAN.
BINARY can be cast to: numeric, STRING, BINARY.
ARRAY can be cast to: STRING, ARRAY.
MAP can be cast to: STRING, MAP.
STRUCT can be cast to: STRING, STRUCT.

Rules and limitations based on targetType


WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.
A sourceExpr value with an invalid format or invalid characters for targetType will result in a NULL .

numeric
If the targetType is a numeric and sourceExpr is of type:
VOID
The result is a NULL of the specified numeric type.
numeric
If targetType is an integral numeric, the result is sourceExpr truncated to a whole number.
Otherwise the result is sourceExpr rounded to fit the available scale of targetType .
If the value is outside the range of targetType , an overflow error is raised.
Use try_cast to turn overflow errors into NULL .
STRING
sourceExpr is read as a literal value of the targetType .
If sourceExpr doesn’t comply with the format for literal values, an error is raised.
If the value is outside the range of the targetType , an overflow error is raised.
Use try_cast to turn overflow and invalid format errors into NULL .
TIMESTAMP
The result is the number of seconds elapsed between 1970-01-01 00:00:00 UTC and sourceExpr .
If targetType is an integral numeric, the result is truncated to a whole number.
Otherwise the result is rounded to fit the available scale of targetType .
If the result is outside the range of targetType , an overflow error is raised.
Use try_cast to turn overflow errors into NULL .
BOOLEAN
If sourceExpr is:
true : The result is 1.
false : The result is 0.
NULL : The result is NULL .

Examples

> SELECT cast(NULL AS INT);


NULL

> SELECT cast(5.6 AS INT);


5

> SELECT cast(5.6 AS DECIMAL(2, 0));


6

> SELECT cast(-5.6 AS INT);


-5

> SELECT cast(-5.6 AS DECIMAL(2, 0));


-6

> SELECT cast(128 AS TINYINT);


Overflow

> SELECT cast(128 AS DECIMAL(2, 0));


Overflow

> SELECT cast('123' AS INT);


123

> SELECT cast('123.0' AS INT);


Invalid format

> SELECT cast(TIMESTAMP'1970-01-01 00:00:01' AS LONG);


1

> SELECT cast(TIMESTAMP'1970-01-01 00:00:00.000001' AS DOUBLE);


1.0E-6

> SELECT cast(TIMESTAMP'2022-02-01 00:00:00' AS SMALLINT);


error: overflow
> SELECT cast(true AS INT);
1
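
-- Illustrative only: try_cast (see above) is expected to return NULL instead of raising the overflow error.
> SELECT try_cast(128 AS TINYINT);
NULL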

STRING
If the targetType is a STRING type and sourceExpr is of type:
VOID
The result is a NULL string.
exact numeric
The result is the literal number with an optional minus-sign and no leading zeros except for the single
digit to the left of the decimal point. If the targetType is DECIMAL(p, s) with s greater than 0, a decimal
point is added and trailing zeros are added up to scale.
floating-point binary
If the absolute number is less than 10,000,000 and greater than or equal to 0.001 , the result is expressed
without scientific notation with at least one digit on either side of the decimal point.
Otherwise Databricks Runtime uses a mantissa followed by E and an exponent. The mantissa has an
optional leading minus sign followed by one digit to the left of the decimal point, and the minimal
number of digits greater than zero to the right. The exponent has an optional leading minus sign.
DATE
If the year is between 9999 BCE and 9999 CE, the result is a dateString of the form -YYYY-MM-DD and
YYYY-MM-DD respectively.

For years prior to or after this range, the necessary number of digits are added to the year component and
+ is used for CE.

TIMESTAMP
If the year is between 9999 BCE and 9999 CE, the result is a timestampString of the form
-YYYY-MM-DD hh:mm:ss and YYYY-MM-DD hh:mm:ss respectively.

For years prior to or after this range, the necessary number of digits are added to the year component and
+ is used for CE.

Fractional seconds .f... are added if necessary.


year-month inter val
The result is its shortest representation of the interval literal. If the interval is negative, the sign is
embedded in the interval-string . For units smaller than 10, leading zeros are omitted.
A typical year-month interval string has the form:
INTERVAL 'Y' YEAR
INTERVAL 'Y-M' YEAR TO MONTH
INTERVAL 'M' MONTH
day-time inter val
The result is its shortest representation of the interval literal. If the interval is negative, the sign is
embedded in the interval-string . For units smaller than 10, leading zeros are omitted.
A typical day time interval string has the form:
INTERVAL 'D' DAY
INTERVAL 'D h' DAY TO HOUR
INTERVAL 'D h:m' DAY TO MINUTE
INTERVAL 'D h:m:s' DAY TO SECOND
INTERVAL 'h' HOUR
INTERVAL 'h:m' HOUR TO MINUTE
INTERVAL 'm:s' MINUTE TO SECOND
INTERVAL 's' SECOND
BOOLEAN
The result of the true boolean is the STRING literal true , for false it’s the STRING literal false , and
for NULL it’s the NULL string.
BINARY
The result is the binary sourceExpr interpreted as a UTF-8 character sequence.
Databricks Runtime doesn’t validate the UTF-8 characters. A cast from BINARY to STRING will never inject
substitution characters or raise an error.
ARRAY
The result is a comma separated list of cast elements, which is braced with square brackets [ ] . One
space follows each comma. A NULL element is translated to a literal null .
Databricks Runtime doesn’t quote or otherwise mark individual elements, which may themselves contain
brackets or commas.
MAP
The result is a comma separated list of cast key value pairs, which is braced with curly braces { } . One
space follows each comma. Each key value pair is separated by a -> . A NULL map value is translated to
literal null .
Databricks Runtime doesn’t quote or otherwise mark individual keys or values, which may themselves
contain curly braces, commas, or -> .
STRUCT
The result is a comma separated list of cast field values, which is braced with curly braces { } . One
space follows each comma. A NULL field value is translated to a literal null .
Databricks Runtime doesn’t quote or otherwise mark individual field values, which may themselves
contain curly braces or commas.
Examples

> SELECT cast(NULL AS STRING);


NULL

> SELECT cast(-3Y AS STRING);


-3

> SELECT cast(5::DECIMAL(10, 5) AS STRING);


5.00000

> SELECT cast(12345678e-4 AS STRING);


1234.5678

> SELECT cast(1e7 as string);


1.0E7

> SELECT cast(1e6 as string);


1000000.0

> SELECT cast(1e-4 as string);


1.0E-4
> SELECT cast(1e-3 as string);
0.001

> SELECT cast(12345678e7 AS STRING);


1.2345678E14

> SELECT cast(DATE'1900-12-31' AS STRING);


1900-12-31

-- Caesar no more
> SELECT cast(DATE'-0044-03-15' AS STRING);
-0044-03-15

> SELECT cast(DATE'100000-12-31' AS STRING);


+100000-12-31

> SELECT cast(current_timestamp() AS STRING);


2022-04-02 22:29:09.783

> SELECT cast(INTERVAL -'13-02' YEAR TO MONTH AS STRING);


INTERVAL '-13-2' YEAR TO MONTH

> SELECT cast(INTERVAL '12:04.9900' MINUTE TO SECOND AS STRING);


INTERVAL '12:04.99' MINUTE TO SECOND

> SELECT cast(true AS STRING);


true

> SELECT cast(false AS STRING);


false

-- A bad UTF-8 string


> SELECT cast(x'33800033' AS STRING);
3� 3

> SELECT hex(cast(x'33800033' AS STRING));


33800033

> SELECT cast(array('hello', NULL, 'world') AS STRING);


[hello, null, world]

> SELECT cast(array('hello', 'wor, ld') AS STRING);


[hello, wor, ld]

> SELECT cast(array() AS STRING);


[]

> SELECT cast(map('hello', 1, 'world', null) AS STRING);


{hello -> 1, world -> null}

> SELECT cast(map('hello -> 1', DATE'2022-01-01') AS STRING);


{hello -> 1 -> 2022-01-01}

> SELECT cast(map() AS STRING);


{}

> SELECT cast(named_struct('a', 5, 'b', 6, 'c', NULL) AS STRING);


{5, 6, null}

> SELECT cast(named_struct() AS STRING);


{}

DATE
If the targetType is a DATE type and sourceExpr is of type:
VOID
The result is a NULL DATE.
STRING
sourceExpr must be a valid dateString.
If sourceExpr is not a valid dateString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
TIMESTAMP
The result is date portion of the timestamp sourceExpr .
Examples

> SELECT cast(NULL AS DATE);


NULL

> SELECT cast('1900-10-01' AS DATE);


1900-10-01

-- There is no February 30.


> SELECT cast('1900-02-30' AS DATE);
Error

> SELECT cast(TIMESTAMP'1900-10-01 12:13:14' AS DATE);


1900-10-01

TIMESTAMP
If the targetType is a TIMESTAMP type and sourceExpr is of type:
VOID
The result is a NULL TIMESTAMP.
numeric
sourceExpr is read as the number of seconds since 1970-01-01 00:00:00 UTC .
Fractions smaller than microseconds are truncated.
If the value is outside of the range of TIMESTAMP , an overflow error is raised.
Use try_cast to turn overflow errors into NULL .
STRING
sourceExpr must be a valid timestampString.
If sourceExpr is not a valid timestampString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
DATE
The result is the sourceExpr DATE at 00:00:00 hrs.
Examples
> SELECT cast(NULL AS TIMESTAMP);
NULL

> SET TIME ZONE '+00:00';


> SELECT cast(0.0 AS TIMESTAMP);
1970-01-01 00:00:00

> SELECT cast(0.0000009 AS TIMESTAMP);


1970-01-01 00:00:00

> SELECT cast(1e20 AS TIMESTAMP);


Error: overflow

> SELECT cast('1900' AS TIMESTAMP);


1900-01-01 00:00:00

> SELECT cast('1900-10-01 12:13:14' AS TIMESTAMP);


1900-10-01 12:13:14

> SELECT cast('1900-02-30 12:13:14' AS TIMESTAMP);


Error

> SELECT cast(DATE'1900-10-01' AS TIMESTAMP);


1900-10-01 00:00:00

year-month interval
If the targetType is a year-month interval and sourceExpr is of type:
VOID
The result is a NULL year-month interval.
STRING
sourceExpr must be a valid yearMonthIntervalString.
If sourceExpr is not a valid yearMonthIntervalString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
year-month inter val
If the targetType yearMonthIntervalQualifier includes MONTH the value remains unchanged, but is
reinterpreted to match the target type.
Otherwise, if the source type yearMonthIntervalQualifier includes MONTH , the result is truncated to full
years.
Examples
> SELECT cast(NULL AS INTERVAL YEAR);
NULL

> SELECT cast('1-4' AS INTERVAL YEAR TO MONTH)::STRING;


INTERVAL '1-4' YEAR TO MONTH

> SELECT cast('1' AS INTERVAL YEAR TO MONTH);


error

> SELECT cast(INTERVAL '1-4' YEAR TO MONTH AS INTERVAL MONTH)::STRING;


INTERVAL '16' MONTH

> SELECT cast(INTERVAL '1-11' YEAR TO MONTH AS INTERVAL YEAR)::STRING;


INTERVAL '1' YEAR

day-time interval
If the targetType is a day-time interval and sourceExpr is of type:
VOID
The result is a NULL day-time interval.
STRING
sourceExpr must be a valid dayTimeIntervalString.
If sourceExpr is not a valid dayTimeIntervalString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
day-time inter val
If the targetType dayTimeIntervalQualifier includes the smallest unit of the source type
dayTimeIntervalQualifier, the value remains unchanged, but is reinterpreted to match the target type.
Otherwise, the sourceExpr interval is truncated to fit the targetType .

> SELECT cast(NULL AS INTERVAL HOUR);


NULL

> SELECT cast('1 4:23' AS INTERVAL DAY TO MINUTE)::STRING;


INTERVAL '1 04:23' DAY TO MINUTE

> SELECT cast('1' AS INTERVAL DAY TO MINUTE);


error

> SELECT cast(INTERVAL '1 4:23' DAY TO MINUTE AS INTERVAL MINUTE)::STRING;


INTERVAL '1703' MINUTE

> SELECT cast(INTERVAL '1 4:23' DAY TO MINUTE AS INTERVAL HOUR)::STRING;


INTERVAL '28' HOUR

BOOLEAN
If the targetType is a BOOLEAN and sourceExpr is of type:
VOID
The result is a NULL Boolean.
numeric
If sourceExpr is:
0 : The result is false .
NULL : The result is NULL .
special floating point value : The result is true .
Otherwise the result is true .
STRING
If sourceExpr is (case insensitive):
'T', 'TRUE', 'Y', 'YES', or '1' : The result is true .
'F', 'FALSE', 'N', 'NO', or '0' : The result is false .
NULL : The result is NULL .
Otherwise Databricks Runtime returns an invalid input syntax for type boolean error.
Use try_cast to turn invalid data errors into NULL .
Examples

> SELECT cast(NULL AS BOOLEAN);


NULL

> SELECT cast('T' AS BOOLEAN);


true

> SELECT cast('True' AS BOOLEAN);


true

> SELECT cast('1' AS BOOLEAN);


true

> SELECT cast('0' AS BOOLEAN);


false

> SELECT cast('n' AS BOOLEAN);


false

> SELECT cast('on' AS BOOLEAN);


error: invalid input syntax for type boolean

> SELECT cast(0 AS BOOLEAN);


false

> SELECT cast(0.0E10 AS BOOLEAN);


false

> SELECT cast(1 AS BOOLEAN);


true

> SELECT cast(0.1 AS BOOLEAN);


true

> SELECT cast('NaN'::FLOAT AS BOOLEAN);


true
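
-- Illustrative only: try_cast (see above) is expected to return NULL instead of raising the invalid input error.
> SELECT try_cast('on' AS BOOLEAN);
NULL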

BINARY
If the targetType is a BINARY and sourceExpr is of type:
VOID
The result is a NULL Binary.
STRING
The result is the UTF-8 encoding of the sourceExpr .
Examples

> SELECT cast(NULL AS BINARY);


NULL

> SELECT hex(cast('Spark SQL' AS BINARY));


537061726B2053514C

> SELECT hex(cast('Oдesa' AS BINARY));


4FD0B4657361

ARRAY
If the targetType is an ARRAY and sourceExpr is of type:
VOID
The result is a NULL of the targetType .
ARRAY
If the cast from sourceElementType to targetElementType is supported, the result is an
ARRAY<targetElementType> with all elements cast to the targetElementType .
Databricks Runtime raises an error if the cast isn’t supported or if any of the elements can’t be cast.
Use try_cast to turn invalid data or overflow errors into NULL .
Examples

> SELECT cast(NULL AS ARRAY<INT>);


NULL

> SELECT cast(array('t', 'f', NULL) AS ARRAY<BOOLEAN>);


[true, false, NULL]

> SELECT cast(array('t', 'f', NULL) AS INTERVAL YEAR);


error: cannot cast array<string> to interval year

> SELECT cast(array('t', 'f', 'o') AS ARRAY<BOOLEAN>);


error: invalid input syntax for type boolean: o.

MAP
If the targetType is an MAP<targetKeyType, targetValueType> and sourceExpr is of type:
VOID
The result is a NULL of the targetType .
MAP<sourceKeyType, sourceValueType>
If the casts from sourceKeyType to targetKeyType and sourceValueType to targetValueType are
supported, the result is an MAP<targetKeyType, targetValueType> with all keys cast to the targetKeyType
and all values cast to the targetValueType .
Databricks Runtime raises an error if the cast isn’t supported or if any of the keys or values can’t be cast.
Use try_cast to turn invalid data or overflow errors into NULL .
Examples
> SELECT cast(NULL AS MAP<STRING, INT>);
NULL

> SELECT cast(map('10', 't', '15', 'f', '20', NULL) AS MAP<INT, BOOLEAN>);
{10:true,15:false,20:null}

> SELECT cast(map('10', 't', '15', 'f', '20', NULL) AS MAP<INT, ARRAY<INT>>);
error: cannot cast map<string,string> to map<int,array<int>>

> SELECT cast(map('10', 't', '15', 'f', '20', 'o') AS MAP<INT, BOOLEAN>);
error: invalid input syntax for type boolean: o.

STRUCT
If the targetType is a STRUCT<[targetFieldName:targetFieldType [NOT NULL][COMMENT str][, …]]> and
sourceExpr is of type:

VOID
The result is a NULL of the targetType .
STRUCT<[sourceFieldName:sourceFieldType [NOT NULL][COMMENT str][, …]]>
The sourceExpr can be cast to targetType if all of these conditions are true:
The source type has the same number of fields as the target
For all fields: sourceFieldTypeN can be cast to the targetFieldTypeN .
For all field values: The source field value N can be cast to targetFieldTypeN and the value isn’t null if
target field N is marked as NOT NULL .
sourceFieldName s, source NOT NULL constraints, and source COMMENT s need not match the targetType
and are ignored.
Databricks Runtime raises an error if the cast isn’t supported or if any of the keys or values can’t be cast.
Use try_cast to turn invalid data or overflow errors into NULL .
Examples

> SELECT cast(NULL AS STRUCT<a:INT>);


NULL

> SELECT cast(named_struct('a', 't', 'b', '1900') AS STRUCT<b:BOOLEAN, c:DATE NOT NULL COMMENT 'Hello'>);
{"b":true,"c":1900-01-01}

> SELECT cast(named_struct('a', 't', 'b', NULL::DATE) AS STRUCT<b:BOOLEAN, c:DATE NOT NULL COMMENT
'Hello'>);
error: cannot cast struct<a:string,b:date> to struct<b:boolean,c:date>

> SELECT cast(named_struct('a', 't', 'b', '1900') AS STRUCT<b:BOOLEAN, c:ARRAY<INT>>);


error: cannot cast struct<a:string,b:string> to struct<b:boolean,c:array<int>>

> SELECT cast(named_struct('a', 't', 'b', 'hello') AS STRUCT<b:BOOLEAN, c:DATE>);


error: Cannot cast hello to DateType

Related functions
:: (colon colon sign) operator
try_cast function
cbrt function
7/21/2022 • 2 minutes to read

Returns the cube root of expr .

Syntax
cbrt(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT cbrt(27.0);
3.0

Related functions
sqrt function
ceil function
7/21/2022 • 2 minutes to read

Returns the smallest number not smaller than expr rounded up to targetScale digits relative to the decimal
point.

Syntax
ceil(expr [, targetScale])

Arguments
expr : An expression that evaluates to a numeric.
targetScale: An optional INTEGER literal greater than -38 specifying by how many digits after the decimal point to round up.
Since: Databricks Runtime 10.5

Returns
If no targetScale is given:
If expr is DECIMAL(p, s) , returns DECIMAL(p - s + 1, 0) .
For all other cases, returns a BIGINT.
If targetScale is specified and expr is a:
TINYINT
Returns a DECIMAL(p, 0) with p = max(3, -targetScale + 1) .
SMALLINT
Returns a DECIMAL(p, 0) with p = max(5, -targetScale + 1) .
INTEGER
Returns a DECIMAL(p, 0) with p = max(10, -targetScale + 1) .
BIGINT
Returns a DECIMAL(p, 0) with p = max(20, -targetScale + 1) .
FLOAT
Returns a DECIMAL(p, s) with p = max(14, -targetScale + 1) and s = min(7, max(0, targetScale)) .
DOUBLE
Returns a DECIMAL(p, s) with p = max(30, -targetScale + 1) and s = min(15, max(0, targetScale)) .
DECIMAL(p_in, s_in)
Returns a DECIMAL(p, s) with p = max(p_in - s_in + 1, -targetScale + 1) and s = min(s_in, max(0, targetScale)) .

If targetScale is negative the rounding occurs to -targetScale digits to the left of the decimal point.
The default targetScale is 0, which rounds up to the next bigger integral number.
This function is a synonym of ceiling function.

Examples
> SELECT ceil(-0.1);
0

> SELECT ceil(5);


5

> SELECT ceil(5.4);


6

> SELECT ceil(3345.1, -2);


3400

> SELECT ceil(-12.345, 1);


-12.3

Related functions
floor function
ceiling function
bround function
round function
ceiling function
7/21/2022 • 2 minutes to read

Returns the smallest number not smaller than expr rounded up to targetScale digits relative to the decimal
point.

Syntax
ceiling(expr [, targetScale])

Arguments
expr : An expression that evaluates to a numeric.
targetScale: An optional INTEGER literal greater than -38 specifying by how many digits after the decimal point to round up.
Since: Databricks Runtime 10.5

Returns
If no targetScale is given:
If expr is DECIMAL(p, s) , returns DECIMAL(p - s + 1, 0) .
For all other cases, returns a BIGINT.
If targetScale is specified and expr is a:
TINYINT
Returns a DECIMAL(p, 0) with p = max(3, -targetScale + 1) .
SMALLINT
Returns a DECIMAL(p, 0) with p = max(5, -targetScale + 1) .
INTEGER
Returns a DECIMAL(p, 0) with p = max(10, -targetScale + 1) .
BIGINT
Returns a DECIMAL(p, 0) with p = max(20, -targetScale + 1) .
FLOAT
Returns a DECIMAL(p, s) with p = max(14, -targetScale + 1) and s = min(7, max(0, targetScale)) .
DOUBLE
Returns a DECIMAL(p, s) with p = max(30, -targetScale + 1) and s = min(15, max(0, targetScale)) .
DECIMAL(p_in, s_in)
Returns a DECIMAL(p, s) with p = max(p_in - s_in + 1, -targetScale + 1) and s = min(s_in, max(0, targetScale)) .

If targetScale is negative the rounding occurs to -targetScale digits to the left of the decimal point.
The default targetScale is 0, which rounds up to the next bigger integral number.
This function is a synonym of ceil function.

Examples
> SELECT ceiling(-0.1);
0

> SELECT ceiling(5);


5

> SELECT ceiling(5.4);


6

> SELECT ceiling(3345.1, -2);


3400

> SELECT ceiling(-12.345, 1);


-12.3

Related functions
floor function
ceil function
bround function
round function
char function
7/21/2022 • 2 minutes to read

Returns the character at the supplied UTF-16 code point.

Syntax
char(expr)

Arguments
expr : An expression that evaluates to an integral numeric.

Returns
A STRING.
If the argument is less than 0, an empty string is returned. If the argument is larger than 255 , it is treated as
modulo 256. This implies char covers the ASCII and Latin-1 Supplement range of UTF-16.
This function is a synonym for chr function.

Examples
> SELECT char(65);
A
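
-- Illustrative only: 321 is larger than 255, so it is expected to be treated as 321 % 256 = 65.
> SELECT char(321);
A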

Related functions
chr function
ascii function
char_length function
7/21/2022 • 2 minutes to read

Returns the character length of string data or number of bytes of binary data.

Syntax
char_length(expr)

Arguments
expr : A BINARY or STRING expression.

Returns
The result type is INTEGER.
The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
This function is a synonym for character_length function and length function.

Examples
> SELECT char_length('Spark SQL ');
10
> SELECT char_length('床前明月光');
5

Related functions
character_length function
length function
character_length function
7/21/2022 • 2 minutes to read

Returns the character length of string data or number of bytes of binary data.

Syntax
character_length(expr)

Arguments
expr : A BINARY or STRING expression.

Returns
The result type is INTEGER.
The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
This function is a synonym for char_length function and length function.

Examples
> SELECT character_length('Spark SQL ');
10
> SELECT character_length('床前明月光');
5

Related functions
char_length function
length function
chr function
7/21/2022 • 2 minutes to read

Returns the character at the supplied UTF-16 code point.

Syntax
chr(expr)

Arguments
expr : An expression that evaluates to an integral numeric.

Returns
The result type is STRING.
If the argument is less than 0, an empty string is returned. If the argument is larger than 255 , it is treated as
modulo 256. This implies char covers the ASCII and Latin-1 Supplement range of UTF-16.
This function is a synonym for char function.

Examples
> SELECT chr(65);
A

Related functions
char function
ascii function
coalesce function
7/21/2022 • 2 minutes to read

Returns the first non-null argument.

Syntax
coalesce(expr1 [, ...] )

Arguments
exprN : Any expression that shares a least common type across all exprN.

Returns
The result type is the least common type of the arguments.
There must be at least one argument. Unlike for regular functions where all arguments are evaluated before
invoking the function, coalesce evaluates arguments left to right until a non-null value is found. If all
arguments are NULL , the result is NULL .

Examples
> SELECT coalesce(NULL, 1, NULL);
1
> SELECT coalesce(NULL, 5 / 0);
Division by zero

> SELECT coalesce(2, 5 / 0);


2

> SELECT coalesce(NULL, 'hello');


hello

Related
nvl function
nvl2 function
SQL data type rules
collect_list aggregate function
7/21/2022 • 2 minutes to read

Returns an array consisting of all values in expr within the group.

Syntax
collect_list ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression of any type.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
An ARRAY of the argument type.
The order of elements in the array is non-deterministic. NULL values are excluded.
If DISTINCT is specified the function collects only unique values and is a synonym for collect_set aggregate
function.
This function is a synonym for array_agg aggregate function.

Examples
> SELECT collect_list(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2,1]
> SELECT collect_list(DISTINCT col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2]

Related functions
array_agg aggregate function
array function
collect_set aggregate function
collect_set aggregate function
7/21/2022 • 2 minutes to read

Returns an array consisting of all unique values in expr within the group.

Syntax
collect_set(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression of any type.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
An ARRAY of the argument type.
The order of elements in the array is non-deterministic. NULL values are excluded.

Examples
> SELECT collect_set(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2]
> SELECT collect_set(col1) FILTER(WHERE col2 = 10)
FROM VALUES (1, 10), (2, 10), (NULL, 10), (1, 10), (3, 12) AS tab(col1, col2);
[1,2]

Related functions
array function
collect_list aggregate function
:: (colon colon sign) operator
7/21/2022 • 2 minutes to read

Casts the value expr to the target data type type .

Syntax
expr :: type

Arguments
expr : Any castable expression.

Returns
The result is type type .
This operator is a synonym for cast(expr AS type) ; see the cast function for a detailed description.

Examples
> SELECT '20.5'::INTEGER;
20
> SELECT typeof(NULL::STRING);
string

Related functions
cast function
try_cast function
concat function
7/21/2022 • 2 minutes to read

Returns the concatenation of the arguments.

Syntax
concat(expr1, expr2 [, ...] )

Arguments
exprN : Expressions which are all STRING, all BINARY or all ARRAYs of STRING or BINARY.

Returns
The result type matches the argument types.
There must be at least one argument. This function is a synonym for || (pipe pipe sign) operator.

Examples
> SELECT concat('Spark', 'SQL');
SparkSQL
> SELECT concat(array(1, 2, 3), array(4, 5), array(6));
[1,2,3,4,5,6]

Related functions
|| (pipe pipe sign) operator
array_join function
array_union function
concat_ws function
concat_ws function
7/21/2022 • 2 minutes to read

Returns the concatenation of the strings separated by sep .

Syntax
concat_ws(sep [, expr1 [, ...] ])

Arguments
sep : A STRING expression.
exprN : Each exprN can be either a STRING or an ARRAY of STRING.

Returns
The result type is STRING.
If sep is NULL the result is NULL. exprN that are NULL are ignored. If only the separator is provided, or all
exprN are NULL, the result is an empty string.

Examples
> SELECT concat_ws(' ', 'Spark', 'SQL');
Spark SQL
> SELECT concat_ws('s');
''
> SELECT concat_ws(',', 'Spark', array('S', 'Q', NULL, 'L'), NULL);
Spark,S,Q,L
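
-- Illustrative only: a NULL separator is expected to yield NULL.
> SELECT concat_ws(NULL, 'Spark', 'SQL');
NULL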

Related functions
|| (pipe pipe sign) operator
concat function
array_join function
contains function
7/21/2022 • 2 minutes to read

Returns true if expr contains subExpr .


Since: Databricks Runtime 10.3

Syntax
contains(expr, subExpr)

Arguments
expr : A STRING or BINARY within which to search.
subExpr : The STRING or BINARY to search for.

Returns
A BOOLEAN. If expr or subExpr are NULL , the result is NULL . If subExpr is the empty string or empty binary
the result is true .
Since: Databricks Runtime 10.5
The function operates in BINARY mode if both arguments are BINARY.

Examples
> SELECT contains(NULL, 'Spark');
NULL

> SELECT contains('SparkSQL', NULL);


NULL

> SELECT contains('SparkSQL', 'Spark');


true

> SELECT contains('SparkSQL', 'ark');


true

> SELECT contains('SparkSQL', 'SQL');


true

> SELECT contains('SparkSQL', 'Spork');


false

> SELECT contains('SparkSQL', '');


true

> SELECT contains(x'120033', x'00');


true

Related
array_contains function
conv function
7/21/2022 • 2 minutes to read

Converts num from fromBase to toBase .

Syntax
conv(num, fromBase, toBase)

Arguments
num : A STRING expression expressing a number in fromBase .
fromBase : An INTEGER expression denoting the source base.
toBase : An INTEGER expression denoting the target base.

Returns
A STRING.
The function supports base 2 to base 36. The digit ‘A’ (or ‘a’) represents decimal 10 and ‘Z’ (or ‘z’) represents
decimal 35.

Examples
> SELECT conv('100', 2, 10);
4
> SELECT conv('-10', 16, 10);
-16
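
-- Illustrative only: digits above 9 are expressed with letters, case insensitive.
> SELECT conv('ff', 16, 10);
255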

Related functions
corr aggregate function
7/21/2022 • 2 minutes to read

Returns the Pearson coefficient of correlation between a group of number pairs.

Syntax
corr ( [ALL | DISTINCT] expr1, expr2 ) [FILTER ( WHERE cond ) ]

Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr1 , expr2 pairs.

Examples
> SELECT corr(c1, c2) FROM VALUES (3, 2), (3, 3), (3, 3), (6, 4) as tab(c1, c2);
0.816496580927726

> SELECT corr(DISTINCT c1, c2) FROM VALUES (3, 2), (3, 3), (3, 3), (6, 4) as tab(c1, c2);
0.8660254037844387
> SELECT corr(DISTINCT c1, c2) FILTER(WHERE c1 != c2)
FROM VALUES (3, 2), (3, 3), (3, 3), (6, 4) as tab(c1, c2);
1.0

Related functions
cos function
7/21/2022 • 2 minutes to read

Returns the cosine of expr .

Syntax
cos(expr)

Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.

Returns
A DOUBLE.

Examples
> SELECT cos(0);
1.0
> SELECT cos(pi());
-1.0

Related functions
sin function
tan function
acos function
cosh function
cosh function
7/21/2022 • 2 minutes to read

Returns the hyperbolic cosine of expr .

Syntax
cosh(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT cosh(0);
1.0

Related functions
sinh function
cos function
cot function
7/21/2022 • 2 minutes to read

Returns the cotangent of expr .

Syntax
cot(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT cot(1);
0.6420926159343306

Related functions
cos function
cosh function
tan function
tanh function
count aggregate function
7/21/2022 • 2 minutes to read

Returns the number of retrieved rows in a group.

Syntax
count ( [DISTINCT | ALL] * ) [FILTER ( WHERE cond ) ]

count ( [DISTINCT | ALL] expr[, expr...] ) [FILTER ( WHERE cond ) ]

Arguments
expr : Any expression.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BIGINT.
If * is specified, rows containing NULL values are also counted.
If expr is specified, only rows for which all expr are not NULL are counted.
If DISTINCT is specified, duplicate rows are not counted.

Examples
> SELECT count(*) FROM VALUES (NULL), (5), (5), (20) AS tab(col);
4
> SELECT count(col) FROM VALUES (NULL), (5), (5), (20) AS tab(col);
3
> SELECT count(col) FILTER(WHERE col < 10)
FROM VALUES (NULL), (5), (5), (20) AS tab(col);
2
> SELECT count(DISTINCT col) FROM VALUES (NULL), (5), (5), (10) AS tab(col);
2

Related functions
avg aggregate function
sum aggregate function
min aggregate function
max aggregate function
count_if aggregate function
count_if aggregate function
7/21/2022 • 2 minutes to read

Returns the number of true values for the group in expr .

Syntax
count_if ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BIGINT.
count_if(expr) FILTER(WHERE cond) is equivalent to count_if(expr AND cond) .
If DISTINCT is specified only unique rows are counted.

Examples
> SELECT count_if(col % 2 = 0) FROM VALUES (NULL), (0), (1), (2), (2), (3) AS tab(col);
3
> SELECT count_if(DISTINCT col % 2 = 0) FROM VALUES (NULL), (0), (1), (2), (2), (3) AS tab(col);
2
> SELECT count_if(col IS NULL) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
1
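
-- Illustrative only: both forms below are expected to return the same count (the even values 0 and 2).
> SELECT count_if(col % 2 = 0) FILTER(WHERE col < 3) FROM VALUES (0), (1), (2), (3), (4) AS tab(col);
2
> SELECT count_if(col % 2 = 0 AND col < 3) FROM VALUES (0), (1), (2), (3), (4) AS tab(col);
2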

Related functions
avg aggregate function
sum aggregate function
min aggregate function
max aggregate function
count aggregate function
count_min_sketch aggregate function
7/21/2022 • 2 minutes to read

Returns a count-min sketch of all values in the group in expr with the epsilon , confidence and seed .

Syntax
count_min_sketch ( [ALL | DISTINCT] expr, epsilon, confidence, seed ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to an integral numeric, STRING, or BINARY.
epsilon : A DOUBLE literal greater than 0 describing the relative error.
confidence : A DOUBLE literal greater than 0 and less than 1.
seed : An INTEGER literal.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BINARY.
Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT hex(count_min_sketch(col, 0.5d, 0.5d, 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000
000000000000
> SELECT hex(count_min_sketch(DISTINCT col, 0.5d, 0.5d, 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000020000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000010000
000000000000

Related functions
covar_pop aggregate function
7/21/2022 • 2 minutes to read

Returns the population covariance of number pairs in a group.

Syntax
covar_pop ( [ALL | DISTINCT] expr1, expr2 ) [FILTER ( WHERE cond ) ]

Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr1 , expr2 pairs.

Examples
> SELECT covar_pop(c1, c2) FROM VALUES (1,1), (2,2), (2,2), (3,3) AS tab(c1, c2);
0.5
> SELECT covar_pop(DISTINCT c1, c2) FROM VALUES (1,1), (2,2), (2,2), (3,3) AS tab(c1, c2);
0.6666666666666666

Related functions
covar_samp aggregate function
covar_samp aggregate function
7/21/2022 • 2 minutes to read

Returns the sample covariance of number pairs in a group.

Syntax
covar_samp ( [ALL | DISTINCT] expr1, expr2 ) [FILTER ( WHERE cond ) ]

Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr1 , expr2 pairs.

Examples
> SELECT covar_samp(c1, c2) FROM VALUES (1,1), (2,2), (2, 2), (3,3) AS tab(c1, c2);
0.6666666666666666
> SELECT covar_samp(DISTINCT c1, c2) FROM VALUES (1,1), (2,2), (2, 2), (3,3) AS tab(c1, c2);
1.0

Related functions
covar_pop aggregate function
crc32 function
7/21/2022 • 2 minutes to read

Returns a cyclic redundancy check value of expr .

Syntax
crc32(expr)

Arguments
expr : A BINARY expression.

Returns
A BIGINT.

Examples
> SELECT crc32('Spark');
1557323817

Related functions
hash function
md5 function
sha function
sha1 function
sha2 function
csc function
7/21/2022 • 2 minutes to read

Returns the cosecant of expr .


Since: Databricks Runtime 10.1

Syntax
csc(expr)

Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.

Returns
A DOUBLE.
csc(expr) is equivalent to 1 / sin(expr)

Examples
> SELECT csc(pi() / 2);
1.0

> SELECT csc(0);


Infinity

Related functions
acos function
cos function
cosh function
sec function
sin function
tan function
cube function
7/21/2022 • 2 minutes to read

Creates a multi-dimensional cube using the specified expression columns.

Syntax
cube (expr1 [, ...] )

Arguments
exprN : Any expression that can be grouped.

Returns
The function must be the only grouping expression in the GROUP BY clause. See GROUP BY clause for details.

Examples
> SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY cube(name,
age);
Bob 5 1
Alice 2 1
Alice NULL 1
NULL 2 1
NULL NULL 2
Bob NULL 1
NULL 5 1

Related functions
GROUP BY clause
cume_dist analytic window function
7/21/2022 • 2 minutes to read

Returns the position of a value relative to all values in the partition.

Syntax
cume_dist()

Arguments
This function takes no arguments.

Returns
A DOUBLE.
The OVER clause of the window function must include an ORDER BY clause. If the order is not unique the
duplicates share the same relative later position. cume_dist() over(order by expr) is similar, but not identical to
rank() over(order by position) / count(*) since rank ranking window function produces the earliest absolute
order.

Examples
> SELECT a, b, cume_dist() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3),
('A1', 1) tab(a, b);
A1 1 0.6666666666666666
A1 1 0.6666666666666666
A1 2 1.0
A2 3 1.0

Related functions
rank ranking window function
dense_rank ranking window function
row_number ranking window function
Window functions
current_catalog function
7/21/2022 • 2 minutes to read

Returns the current catalog.

Syntax
current_catalog()

Arguments
This function takes no argument.

Returns
A STRING.

Examples
> SELECT current_catalog();
spark_catalog

Related functions
current_schema function
current_database function
7/21/2022 • 2 minutes to read

Returns the current schema.

Syntax
current_database()

Arguments
This function takes no arguments

Returns
A STRING.
This function is an alias for current_schema function.

Examples
> SELECT current_database();
default

Related functions
current_catalog function
current_schema function
current_date function
7/21/2022 • 2 minutes to read

Returns the current date at the start of query evaluation.

Syntax
current_date()

Arguments
This function takes no arguments.

Returns
A DATE.
The braces are optional.

Examples
> SELECT current_date();
2020-04-25
> SELECT current_date;
2020-04-25

Related functions
current_timestamp function
current_timezone function
now function
current_schema function
7/21/2022 • 2 minutes to read

Returns the current schema.

Syntax
current_schema()

Arguments
This function takes no arguments.

Returns
A STRING.

Examples
> SELECT current_schema();
default

Related functions
current_catalog function
current_timestamp function
7/21/2022 • 2 minutes to read

Returns the current timestamp at the start of query evaluation.

Syntax
current_timestamp()

Arguments
This function takes no arguments.

Returns
A TIMESTAMP.
The braces are optional.

Examples
> SELECT current_timestamp();
2020-04-25 15:49:11.914
> SELECT current_timestamp;
2020-04-25 15:49:11.914

Related functions
current_date function
current_timezone function
now function
current_timezone function
7/21/2022 • 2 minutes to read

Returns the current session local timezone.

Syntax
current_timezone()

Arguments
This function takes no arguments.

Returns
A STRING.

Examples
> SELECT current_timezone();
Asia/Shanghai

Related functions
current_date function
current_timestamp function
now function
current_user function
7/21/2022 • 2 minutes to read

Returns the user executing the statement.


Since: Databricks Runtime 10.0

Syntax
current_user()

Arguments
This function takes no arguments.

Returns
A STRING.
The braces are optional.

Examples
> SELECT current_user();
user1
> SELECT current_user;
user1

Related functions
is_member function
current_version function
7/21/2022 • 2 minutes to read

Returns the current version of Databricks Runtime.


Since: Databricks Runtime 11.0

Syntax
current_version()

Arguments
This function takes no arguments.

Returns
A STRUCT with the following fields:
dbr_version : A STRING with the current version of Databricks Runtime.
dbsql_version : A NULL STRING in Databricks Runtime.
u_build_hash : A STRING used by Azure Databricks support.
r_build_hash : A STRING used by Azure Databricks support.

Examples
> SELECT current_version().dbr_version;
11.0

> SELECT current_version();


{ 11.0, NULL, ..., ... }

Related functions
version function
date function
7/21/2022 • 2 minutes to read

Syntax
date(expr)

Casts the value expr to DATE.

Arguments
expr : An expression that can be cast to DATE.

Returns
A DATE.
This function is a synonym for CAST(expr AS DATE) .
See cast function for details.

Examples
> SELECT date('2021-03-21');
2021-03-21

Related functions
cast function
date_add function
7/21/2022 • 2 minutes to read

Returns the date numDays after startDate .

Syntax
date_add(startDate, numDays)

Arguments
startDate : A DATE expression.
numDays : An INTEGER expression.

Returns
A DATE.
If numDays is negative, abs(numDays) days are subtracted from startDate .
If the result date overflows the date range the function raises an error.

Examples
> SELECT date_add('2016-07-30', 1);
2016-07-31

Related functions
date_from_unix_date function
date_sub function
datediff function
months_between function
timestampadd function
date_format function
7/21/2022 • 2 minutes to read

Converts a timestamp to a string in the format fmt .

Syntax
date_format(expr, fmt)

Arguments
expr : A DATE, TIMESTAMP, or a STRING in a valid datetime format.
fmt: A STRING expression describing the desired format.

Returns
A STRING.
See Datetime patterns for details on valid formats.

Examples
> SELECT date_format('2016-04-08', 'y');
2016

Related functions
Datetime patterns
date_from_unix_date function
7/21/2022 • 2 minutes to read

Creates a date from the number of days since 1970-01-01 .

Syntax
date_from_unix_date(days)

Arguments
days : An INTEGER expression.

Returns
A DATE.
If days is negative the days are subtracted from 1970-01-01 .
This function is a synonym for date_add(DATE'1970-01-01', days) .

Examples
> SELECT date_from_unix_date(1);
1970-01-02

Related functions
date_add function
date_sub function
date_part function
7/21/2022 • 2 minutes to read

Extracts a part of the date, timestamp, or interval.

Syntax
date_part(field, expr)

Arguments
field : A STRING literal. See extract function for details.
expr : A DATE, TIMESTAMP, or INTERVAL expression.

Returns
If field is ‘SECOND’, a DECIMAL(8, 6) . In all other cases, an INTEGER.
The date_part function is a synonym for extract(field FROM expr) .

Examples
> SELECT date_part('YEAR', TIMESTAMP'2019-08-12 01:00:00.123456');
2019
> SELECT date_part('WEEK', TIMESTAMP'2019-08-12 01:00:00.123456');
33
> SELECT date_part('DAY', DATE'2019-08-12');
224
> SELECT date_part('SECONDS', TIMESTAMP'2019-10-01 00:00:01.000001');
1.000001
> SELECT date_part('MONTHS', INTERVAL '2-11' YEAR TO MONTH);
11
> SELECT date_part('SECONDS', INTERVAL '5:00:30.001' HOUR TO SECOND);
30.001000

Related functions
extract function
date_sub function
7/21/2022 • 2 minutes to read

Returns the date numDays before startDate .

Syntax
date_sub(startDate, numDays)

Arguments
startDate : A DATE expression.
numDays : An INTEGER expression.

Returns
A DATE.
If numDays is negative, abs(numDays) days are added to startDate .
If the result date overflows the date range the function raises an error.

Examples
> SELECT date_sub('2016-07-30', 1);
2016-07-29

Related functions
date_add function
date_from_unix_date function
datediff function
months_between function
timestampadd function
date_trunc function
7/21/2022 • 2 minutes to read

Returns timestamp truncated to the unit specified in field .

Syntax
date_trunc(field, expr)

Arguments
field : A STRING literal.
expr : A DATE, TIMESTAMP, or STRING with a valid timestamp format.

Returns
A TIMESTAMP.
Valid units for field are:
‘YEAR’, ‘YYYY’, ‘YY’: truncate to the first date of the year that the expr falls in; the time part is zeroed out
‘QUARTER’: truncate to the first date of the quarter that the expr falls in; the time part is zeroed out
‘MONTH’, ‘MM’, ‘MON’: truncate to the first date of the month that the expr falls in; the time part is zeroed out
‘WEEK’: truncate to the Monday of the week that the expr falls in; the time part is zeroed out
‘DAY’, ‘DD’: zero out the time part
‘HOUR’: zero out the minute and second with fraction part
‘MINUTE’: zero out the second with fraction part
‘SECOND’: zero out the second fraction part
‘MILLISECOND’: zero out the microseconds
‘MICROSECOND’: everything remains

Examples
> SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359');
2015-01-01 00:00:00
> SELECT date_trunc('MM', '2015-03-05T09:32:05.359');
2015-03-01 00:00:00
> SELECT date_trunc('DD', '2015-03-05T09:32:05.359');
2015-03-05 00:00:00
> SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359');
2015-03-05 09:00:00
> SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456');
2015-03-05 09:32:05.123
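
-- Illustrative only: 'WEEK' is expected to truncate to the Monday of the week (2015-03-05 was a Thursday).
> SELECT date_trunc('WEEK', '2015-03-05T09:32:05.359');
2015-03-02 00:00:00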

Related functions
trunc function
dateadd function
7/21/2022 • 2 minutes to read

Adds value unit s to a timestamp expr .


Since: Databricks Runtime 10.4

Syntax
dateadd(unit, value, expr)

unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY | DAYOFYEAR |
WEEK |
MONTH |
QUARTER |
YEAR }

Arguments
unit : A unit of measure.
value : A numeric expression with the number of unit s to add to expr .
expr : A TIMESTAMP expression.

Returns
A TIMESTAMP.
If value is negative it is subtracted from the expr . If unit is MONTH , QUARTER , or YEAR the day portion of the
result will be adjusted to result in a valid date.
The function returns an overflow error if the result is beyond the supported range of timestamps.
dateadd is a synonym for timestampadd.

Examples
> SELECT dateadd(MICROSECOND, 5, TIMESTAMP'2022-02-28 00:00:00');
2022-02-28 00:00:00.000005

-- March 31, 2022 minus 1 month yields February 28, 2022


> SELECT dateadd(MONTH, -1, TIMESTAMP'2022-03-31 00:00:00');
2022-02-28 00:00:00.000000

Related functions
add_months function
date_add function
date_sub function
timestamp function
timestampadd function
datediff function
7/21/2022 • 2 minutes to read

Returns the number of days from startDate to endDate .

Syntax
datediff(endDate, startDate)

Arguments
endDate : A DATE expression.
startDate : A DATE expression.

Returns
An INTEGER.
If endDate is before startDate the result is negative.
To measure the difference between two dates in units other than days use datediff (timestamp) function.

Examples
> SELECT datediff('2009-07-31', '2009-07-30');
1
> SELECT datediff('2009-07-30', '2009-07-31');
-1

Related functions
date_add function
date_sub function
datediff (timestamp) function
datediff (timestamp) function
7/21/2022 • 2 minutes to read

Returns the difference between two timestamps measured in unit s.


Since: Databricks Runtime 10.4

Syntax
datediff(unit, start, end)

unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY |
WEEK |
MONTH |
QUARTER |
YEAR }

Arguments
unit : A unit of measure.
start : A starting TIMESTAMP expression.
end : A ending TIMESTAMP expression.

Returns
A BIGINT.
If start is greater than end the result is negative.
The function counts whole elapsed units based on UTC with a DAY being 86400 seconds.
One month is considered elapsed when the calendar month has increased and the calendar day and time is
equal to or greater than the start. Weeks, quarters, and years follow from that.
datediff (timestamp) is a synonym for timestampdiff function.

Examples
-- One second shy of a month elapsed
> SELECT datediff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 11:59:59');
0

-- One month has passed even though it's not the end of the month yet, because day and time line up.
> SELECT datediff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 12:00:00');
1

-- Start is greater than the end


> SELECT datediff(YEAR, DATE'2021-01-01', DATE'1900-03-28');
-120

Related functions
add_months function
date_add function
date_sub function
datediff function
timestamp function
timestampadd function
day function
7/21/2022 • 2 minutes to read

Returns the day of month of the date or timestamp.

Syntax
day(expr)

Arguments
expr : A DATE or TIMESTAMP expression.

Returns
An INTEGER.
This function is a synonym for extract(DAY FROM expr) .

Examples
> SELECT day('2009-07-30');
30

Related functions
dayofmonth function
dayofweek function
dayofyear function
hour function
minute function
second function
extract function
dayofmonth function
7/21/2022 • 2 minutes to read

Returns the day of month of the date or timestamp.

Syntax
dayofmonth(expr)

Arguments
expr : A DATE or TIMESTAMP expression.

Returns
An INTEGER.
This function is a synonym for extract(DAY FROM expr) .

Examples
> SELECT dayofmonth('2009-07-30');
30

Related functions
day function
dayofweek function
dayofyear function
hour function
minute function
second function
extract function
dayofweek function
7/21/2022 • 2 minutes to read

Returns the day of week of the date or timestamp.

Syntax
dayofweek(expr)

Arguments
expr : A DATE or TIMESTAMP expression.

Returns
An INTEGER where 1 = Sunday , and 7 = Saturday .
This function is a synonym for extract(DAYOFWEEK FROM expr) .

Examples
> SELECT dayofweek('2009-07-30');
5

Related functions
day function
dayofmonth function
dayofyear function
hour function
minute function
second function
extract function
weekday function
dayofyear function
7/21/2022 • 2 minutes to read

Returns the day of year of the date or timestamp.

Syntax
dayofyear(expr)

Arguments
expr : A DATE or TIMESTAMP expression.

Returns
An INTEGER.
This function is a synonym for extract(DOY FROM expr) .

Examples
> SELECT dayofyear('2016-04-09');
100

Related functions
day function
dayofmonth function
dayofweek function
hour function
minute function
second function
extract function
weekday function
decimal function
7/21/2022 • 2 minutes to read

Casts the value expr to DECIMAL.

Syntax
decimal(expr)

Arguments
expr : An expression that can be cast to DECIMAL.

Returns
The result is DECIMAL(10, 0).
This function is a synonym for CAST(expr AS decimal(10, 0))

See cast function for details on casting.

Examples
> SELECT decimal('5.2');
5

Related functions
cast function
decode function
7/21/2022 • 2 minutes to read

Returns the value matching the key.


Since: Databricks Runtime 9.1

Syntax
decode(expr, { key1, value1 } [, ...] [, defValue])

Arguments
expr : Any expression of a comparable type.
keyN : An expression that matches the type of expr .
valueN : An expression that shares a least common type with defValue and the other valueN s.
defValue : An optional expression that shares a least common type with valueN .

Returns
The result is of the least common type of the valueN and defValue .
The function returns the first valueN for which keyN matches expr. For this function NULL matches NULL. If
no keyN matches expr , defValue is returned if it exists. If no defValue was specified the result is NULL.

Examples
> SELECT decode(5, 6, 'Spark', 5, 'SQL', 4, 'rocks');
SQL
> SELECT decode(NULL, 6, 'Spark', NULL, 'SQL', 4, 'rocks');
SQL
> SELECT decode(7, 6, 'Spark', 5, 'SQL', 'rocks');
rocks

Related functions
case expression
decode (character set) function
decode (character set) function
7/21/2022 • 2 minutes to read

Translates binary expr to a string using the character set encoding charSet .

Syntax
decode(expr, charSet)

Arguments
expr : A BINARY expression encoded in charset .
charSet : A STRING expression.

Returns
A STRING.
If charSet does not match the encoding the result is undefined. charSet must be one of (case insensitive):
‘US-ASCII’
‘ISO-8859-1’
‘UTF-8’
‘UTF-16BE’
‘UTF-16LE’
‘UTF-16’

Examples
> SELECT encode('Spark SQL', 'UTF-16');
[FE FF 00 53 00 70 00 61 00 72 00 6B 00 20 00 53 00 51 00 4C]
> SELECT decode(X'FEFF0053007000610072006B002000530051004C', 'UTF-16')
Spark SQL

Related functions
encode function
decode function
degrees function
7/21/2022 • 2 minutes to read

Converts radians to degrees.

Syntax
degrees(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
Given an angle in radians, this function returns the associated degrees.

Examples
> SELECT degrees(3.141592653589793);
180.0

Related functions
radians function
dense_rank ranking window function
7/21/2022 • 2 minutes to read

Returns the rank of a value compared to all values in the partition.

Syntax
dense_rank()

Arguments
This function takes no arguments.

Returns
An INTEGER.
The OVER clause of the window function must include an ORDER BY clause. Unlike the function rank ranking
window function, dense_rank will not produce gaps in the ranking sequence. Unlike row_number ranking
window function, dense_rank does not break ties. If the order is not unique the duplicates share the same
relative later position.

Examples
> SELECT a,
b,
dense_rank() OVER(PARTITION BY a ORDER BY b),
rank() OVER(PARTITION BY a ORDER BY b),
row_number() OVER(PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1 1 1
A1 1 1 1 2
A1 2 2 3 3
A2 3 1 1 1

Related functions
rank ranking window function
row_number ranking window function
cume_dist analytic window function
Window functions
div operator
7/21/2022 • 2 minutes to read

Returns the integral part of the division of divisor by dividend .

Syntax
divisor div dividend

Arguments
divisor : An expression that evaluates to a numeric or interval.
dividend : A matching interval type if divisor is an interval, a numeric otherwise.

Interval is supported Since : Databricks Runtime 10.1

Returns
A BIGINT
If dividend is 0 , INTERVAL '0' SECOND or INTERVAL '0' MONTH the operator raises a DIVIDE_BY_ZERO error.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error on division by 0.

Examples
> SELECT 3 div 2;
1
> SELECT -5.9 div 1;
-5

> SELECT -5.9 div 0;


Error: DIVIDE_BY_ZERO

> SELECT INTERVAL '100' HOUR div INTERVAL '1' DAY;


4

Related functions
/ (slash sign) operator
* (asterisk sign) operator
+ (plus sign) operator
- (minus sign) operator
double function
7/21/2022 • 2 minutes to read

Casts the value expr to DOUBLE.

Syntax
double(expr)

Arguments
expr : An expression that can be cast to DOUBLE.

Returns
A DOUBLE.
This function is a synonym for CAST(expr AS DOUBLE) .
See cast function for details.

Examples
> SELECT double('5.2');
5.2

Related functions
cast function
e function
7/21/2022 • 2 minutes to read

Returns the constant e .

Syntax
e()

Arguments
This function takes no arguments.

Returns
The result is Euler's number e as a DOUBLE.

Examples
> SELECT e();
2.7182818284590455

Related functions
exp function
pi function
element_at function
7/21/2022 • 2 minutes to read

Returns the element of an arrayExpr at index .


Returns the value of mapExpr for key .

Syntax
element_at(arrayExpr, index)

element_at(mapExpr, key)

Arguments
arrayExpr : An ARRAY expression.
index : An INTEGER expression.
mapExpr : A MAP expression.
key : An expression matching the type of the keys of mapExpr

Returns
If the first argument is an ARRAY:
The result is of the type of the elements of arrayExpr .
abs(index) must be between 1 and the length of the array.
If index is negative the function accesses elements from the last to the first.
The function raises INVALID_ARRAY_INDEX_IN_ELEMENT_AT error if abs(index) exceeds the length of the
array.
If the first argument is a MAP and key cannot be matched to an entry in mapExpr the function raises a
MAP_KEY_DOES_NOT_EXIST error.

NOTE
If spark.sql.ansi.failOnElementNotExists is false the function returns NULL instead of raising errors.

Examples
> SELECT element_at(array(1, 2, 3), 2);
2
> SELECT try_element_at(array(1, 2, 3), 5);
NULL
> SELECT element_at(array(1, 2, 3), 5);
Error: INVALID_ARRAY_INDEX_IN_ELEMENT_AT
> SELECT element_at(map(1, 'a', 2, 'b'), 2);
b
> SELECT try_element_at(map(1, 'a', 2, 'b'), 3);
NULL
> SELECT element_at(map(1, 'a', 2, 'b'), 3);
Error: MAP_KEY_DOES_NOT_EXIST
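
An additional example illustrating the negative-index behavior described in the Returns section (this example is not from the original reference; the result assumes standard Spark SQL semantics, where a negative index counts from the end of the array):

> SELECT element_at(array(1, 2, 3), -1);
3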

Related functions
array_contains function
array_position function
try_element_at function
elt function
7/21/2022 • 2 minutes to read

Returns the nth expression.

Syntax
elt(index, expr1 [, ...])

Arguments
index : An INTEGER expression greater than 0.
exprN : Any expression that shares a least common type with all other exprN .

Returns
The result has the type of the least common type of the exprN .
index must be between 1 and the number of exprN . If index is out of bounds, an INVALID_ARRAY_INDEX error is
raised.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error if the index is out of bounds.

Examples
> SELECT elt(1, 'scala', 'java');
scala
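
A further illustrative example (not from the original reference) selecting an index other than 1; the result assumes standard Spark SQL behavior:

> SELECT elt(2, 'scala', 'java');
java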

Related functions
element_at function
encode function
7/21/2022 • 2 minutes to read

Returns the binary representation of a string using the charSet character encoding.

Syntax
encode(expr, charSet)

Arguments
expr : A STRING expression to be encoded.
charSet : A STRING expression specifying the encoding.

Returns
A BINARY.
The charset must be one of (case insensitive):
‘US-ASCII’
‘ISO-8859-1’
‘UTF-8’
‘UTF-16BE’
‘UTF-16LE’
‘UTF-16’

Examples
> SELECT encode('Spark SQL', 'UTF-16');
[FE FF 00 53 00 70 00 61 00 72 00 6B 00 20 00 53 00 51 00 4C]
> SELECT decode(X'FEFF0053007000610072006B002000530051004C', 'UTF-16')
Spark SQL

Related functions
decode (character set) function
endswith function
7/21/2022 • 2 minutes to read

Returns true if expr ends with endExpr .


Since: Databricks Runtime 10.3

Syntax
endswith(expr, endExpr)

Arguments
expr : A STRING or BINARY expression.
endExpr : A STRING or BINARY expression which is compared to the end of expr .

Returns
A BOOLEAN.
If expr or endExpr is NULL , the result is NULL .
If endExpr is the empty string or empty binary the result is true .
Since: Databricks Runtime 10.5
The function operates in BINARY mode if both arguments are BINARY.

Examples
> SELECT endswith('SparkSQL', 'SQL');
true
> SELECT endswith('SparkSQL', 'sql');
false
> SELECT endswith('SparkSQL', NULL);
NULL
> SELECT endswith(NULL, 'Spark');
NULL
> SELECT endswith('SparkSQL', '');
true
> SELECT endswith(x'110033', x'33');
true

Related
contains function
startswith function
substr function
== (eq eq sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 equals expr2 , or false otherwise.

Syntax
expr1 == expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .

Returns
A BOOLEAN.
This function is a synonym for = (eq sign) operator.

Examples
> SELECT 2 == 2;
true
> SELECT 1 == '1';
true
> SELECT true == NULL;
NULL
> SELECT NULL == NULL;
NULL

Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
!= (bangeq sign) operator
== (eq eq sign) operator
<> (lt gt sign) operator
SQL data type rules
= (eq sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 equals expr2 , or false otherwise.

Syntax
expr1 = expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .

Returns
A BOOLEAN.
This function is a synonym for == (eq eq sign) operator.

Examples
> SELECT 2 = 2;
true
> SELECT 1 = '1';
true
> SELECT true = NULL;
NULL
> SELECT NULL = NULL;
NULL

Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
!= (bangeq sign) operator
== (eq eq sign) operator
<> (lt gt sign) operator
SQL data type rules
every aggregate function
7/21/2022 • 2 minutes to read

Returns true if all values of expr in the group are true.

Syntax
every(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BOOLEAN.
This function is a synonym for bool_and aggregate function.

Examples
> SELECT every(col) FROM VALUES (true), (true), (true) AS tab(col);
true
> SELECT every(col) FROM VALUES (NULL), (true), (true) AS tab(col);
true
> SELECT every(col) FROM VALUES (true), (false), (true) AS tab(col);
false
> SELECT every(col1) FILTER(WHERE col2 = 1)
FROM VALUES (true, 1), (false, 2), (true, 1) AS tab(col1, col2);
true

Related functions
bool_and aggregate function
bool_or aggregate function
some aggregate function
exists function
7/21/2022 • 2 minutes to read

Returns true if func is true for any element in expr or query returns at least one row.

Syntax
exists(expr, func)

exists(query)

Arguments
expr : An ARRAY expression.
func : A lambda function.
query : Any SELECT.

Returns
A BOOLEAN.
The lambda function must result in a boolean and operate on one parameter, which represents an element in the
array.
exists(query) can only be used in the WHERE clause and a few other specific cases.

Examples
> SELECT exists(array(1, 2, 3), x -> x % 2 == 0);
true
> SELECT exists(array(1, 2, 3), x -> x % 2 == 10);
false
> SELECT exists(array(1, NULL, 3), x -> x % 2 == 0);
NULL
> SELECT exists(array(0, NULL, 2, 3, NULL), x -> x IS NULL);
true
> SELECT exists(array(1, 2, 3), x -> x IS NULL);
false

> SELECT count(*) FROM VALUES(1)
    WHERE exists(SELECT * FROM VALUES(1), (2), (3) AS t(c1) WHERE c1 = 2);
1
> SELECT count(*) FROM VALUES(1)
    WHERE exists(SELECT * FROM VALUES(1), (NULL), (3) AS t(c1) WHERE c1 = 2);
0
> SELECT count(*) FROM VALUES(1)
    WHERE NOT exists(SELECT * FROM VALUES(1), (NULL), (3) AS t(c1) WHERE c1 = 2);
1

Related functions
filter function
array_contains function
exp function
7/21/2022 • 2 minutes to read

Returns e to the power of expr .

Syntax
exp(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT exp(0);
1.0
> SELECT exp(1);
2.7182818284590455

Related functions
e function
expm1 function
ln function
pow function
explode table-valued generator function
7/21/2022 • 2 minutes to read

Returns rows by un-nesting expr .

Syntax
explode(expr)

Arguments
expr : An ARRAY or MAP expression.

Returns
A set of rows composed of the other expressions in the select list and either the elements of the array or the
keys and values of the map. If expr is NULL no rows are produced.
explode can only be placed in the select list or a LATERAL VIEW. When placing the function in the SELECT list
there must be no other generator function in the same SELECT list.
The column produced by explode of an array is named col by default, but can be aliased. The columns for a
map are by default called key and value . They can also be aliased using an alias tuple such as
AS (myKey, myValue) .

Examples
> SELECT explode(array(10, 20)) AS elem, 'Spark';
10 Spark
20 Spark
> SELECT explode(map(1, 'a', 2, 'b')) AS (num, val), 'Spark';
1 a Spark
2 b Spark

> SELECT explode(array(1, 2)), explode(array(3, 4));
Error: unsupported generator
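
The LATERAL VIEW placement mentioned above can be sketched as follows. This example is illustrative and not from the original reference; the derived table and the aliases t, x, and elem are made up, and the result assumes standard Spark SQL semantics:

> SELECT elem FROM (SELECT 1) AS t LATERAL VIEW explode(array(10, 20)) x AS elem;
10
20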

Related functions
explode_outer table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
explode_outer table-valued generator function
7/21/2022 • 2 minutes to read

Returns rows by un-nesting expr using outer semantics.

Syntax
explode_outer(expr)

Arguments
expr : An ARRAY or MAP expression.

Returns
A set of rows composed of the other expressions in the select list and either the elements of the array or the
keys and values of the map. If expr is NULL a single row with NULLs for the array or map values is produced.
explode_outer can only be placed in the select list or a LATERAL VIEW. When placing the function in the select
list there must be no other generator function in the same select list.
The column produced by explode_outer of an array is named col by default, but can be aliased. The columns for a
map are by default called key and value . They can also be aliased using an alias tuple such as
AS (myKey, myValue) .

Examples
> SELECT explode_outer(array(10, 20)) AS elem, 'Spark';
10 Spark
20 Spark
> SELECT explode_outer(map(1, 'a', 2, 'b')) AS (num, val), 'Spark';
1 a Spark
2 b Spark
> SELECT explode_outer(cast(NULL AS array<int>)), 'Spark';
NULL Spark
> SELECT explode_outer(array(1, 2)), explode_outer(array(3, 4));
Error: unsupported generator

Related functions
explode table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
expm1 function
7/21/2022 • 2 minutes to read

Returns exp(expr) - 1 .

Syntax
expm1(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT expm1(0);
0.0

Related functions
e function
exp function
extract function
7/21/2022 • 2 minutes to read

Returns field of source .

Syntax
extract(field FROM source)

Arguments
field : A keyword that selects which part of source should be extracted.
source : A DATE, TIMESTAMP, or INTERVAL expression.

Returns
If field is SECOND, a DECIMAL(8, 6). In all other cases, an INTEGER.
Supported values of field when source is DATE or TIMESTAMP are:
“YEAR”, (“Y”, “YEARS”, “YR”, “YRS”) - the year field
“YEAROFWEEK” - the ISO 8601 week-numbering year that the datetime falls in. For example, 2005-01-02 is
part of the 53rd week of year 2004, so the result is 2004
“QUARTER”, (“QTR”) - the quarter (1 - 4) of the year that the datetime falls in
“MONTH”, (“MON”, “MONS”, “MONTHS”) - the month field (1 - 12)
“WEEK”, (“W”, “WEEKS”) - the number of the ISO 8601 week-of-week-based-year. A week is considered to
start on a Monday and week 1 is the first week with >3 days. In the ISO week-numbering system, it is
possible for early-January dates to be part of the 52nd or 53rd week of the previous year, and for late-
December dates to be part of the first week of the next year. For example, 2005-01-02 is part of the 53rd
week of year 2004, while 2012-12-31 is part of the first week of 2013
“DAY”, (“D”, “DAYS”) - the day of the month field (1 - 31)
“DAYOFWEEK”,(“DOW”) - the day of the week for datetime as Sunday(1) to Saturday(7)
“DAYOFWEEK_ISO”,(“DOW_ISO”) - ISO 8601 based day of the week for datetime as Monday(1) to Sunday(7)
“DOY” - the day of the year (1 - 365/366)
“HOUR”, (“H”, “HOURS”, “HR”, “HRS”) - The hour field (0 - 23)
“MINUTE”, (“M”, “MIN”, “MINS”, “MINUTES”) - the minutes field (0 - 59)
“SECOND”, (“S”, “SEC”, “SECONDS”, “SECS”) - the seconds field, including fractional parts
Supported values of field when source is INTERVAL are:
“YEAR”, (“Y”, “YEARS”, “YR”, “YRS”) - the total months / 12
“MONTH”, (“MON”, “MONS”, “MONTHS”) - the total months % 12
“DAY”, (“D”, “DAYS”) - the days part of interval
“HOUR”, (“H”, “HOURS”, “HR”, “HRS”) - how many hours the microseconds contains
“MINUTE”, (“M”, “MIN”, “MINS”, “MINUTES”) - how many minutes left after taking hours from microseconds
“SECOND”, (“S”, “SEC”, “SECONDS”, “SECS”) - how many seconds with fractions left after taking hours and
minutes from microseconds
Examples
> SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456');
2019
> SELECT extract(week FROM TIMESTAMP'2019-08-12 01:00:00.123456');
33
> SELECT extract(DOY FROM DATE'2019-08-12');
224
> SELECT extract(SECONDS FROM TIMESTAMP'2019-10-01 00:00:01.000001');
1.000001
> SELECT extract(MONTHS FROM INTERVAL '2-11' YEAR TO MONTH);
11
> SELECT extract(SECONDS FROM INTERVAL '5:00:30.001' HOUR TO SECOND);
30.001000

Related functions
date_part function
dayofweek function
dayofmonth function
dayofyear function
factorial function
7/21/2022 • 2 minutes to read

Returns the factorial of expr .

Syntax
factorial(expr)

Arguments
expr : An INTEGER expression between 0 and 20.

Returns
A BIGINT.
If expr is out of bounds, the function returns NULL.

Examples
> SELECT factorial(5);
120

Related functions
filter function
7/21/2022 • 2 minutes to read

Filters the array in expr using the function func .

Syntax
filter(expr, func)

Arguments
expr : An ARRAY expression.
func : A lambda function.

Returns
The result is of the same type as expr .
The lambda function may use one or two parameters where the first parameter represents the element and the
second the index into the array.

Examples
> SELECT filter(array(1, 2, 3), x -> x % 2 == 1);
[1,3]
> SELECT filter(array(0, 2, 3), (x, i) -> x > i);
[2,3]
> SELECT filter(array(0, null, 2, 3, null), x -> x IS NOT NULL);
[0,2,3]

Related functions
exists function
forall function
map_filter function
find_in_set function
7/21/2022 • 2 minutes to read

Returns the position of a string within a comma-separated list of strings.

Syntax
find_in_set(searchExpr, sourceExpr)

Arguments
searchExpr : A STRING expression specifying the “word” to be searched.
sourceExpr : A STRING expression with commas separating “words”.

Returns
An INTEGER. The resulting position is 1-based and points to the first letter of the match. If no match is found for
searchExpr in sourceExpr or searchExpr contains a comma, 0 is returned.

Examples
> SELECT find_in_set('ab','abc,b,ab,c,def');
3
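
An additional example illustrating the comma rule stated in the Returns section (not from the original reference; the result follows from that rule under standard Spark SQL behavior):

> SELECT find_in_set('ab,c', 'abc,b,ab,c,def');
0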

Related functions
array_contains function
first aggregate function
7/21/2022 • 2 minutes to read

Returns the first value of expr for a group of rows.

Syntax
first(expr [, ignoreNull]) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]

Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal defaulting to false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .

Returns
The result has the same type as expr .
first is a synonym for first_value aggregate function.
This function is non-deterministic.

Examples
> SELECT first(col) FROM VALUES (10), (5), (20) AS tab(col);
10
> SELECT first(col) FROM VALUES (NULL), (5), (20) AS tab(col);
NULL
> SELECT first(col) IGNORE NULLS FROM VALUES (NULL), (5), (20) AS tab(col);
5

Related functions
min aggregate function
max aggregate function
last aggregate function
first_value aggregate function
first_value aggregate function
7/21/2022 • 2 minutes to read

Returns the first value of expr for a group of rows.

Syntax
first_value(expr[, ignoreNull]) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]

Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal defaulting to false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .

Returns
The result has the same type as expr .
first_value is a synonym for first aggregate function.
This function is non-deterministic.

Examples
> SELECT first_value(col) FROM VALUES (10), (5), (20) AS tab(col);
10
> SELECT first_value(col) FROM VALUES (NULL), (5), (20) AS tab(col);
NULL
> SELECT first_value(col) IGNORE NULLS FROM VALUES (NULL), (5), (20) AS tab(col);
5

Related functions
min aggregate function
max aggregate function
last aggregate function
first aggregate function
flatten function
7/21/2022 • 2 minutes to read

Transforms an array of arrays into a single array.

Syntax
flatten(expr)

Arguments
expr : An ARRAY of ARRAY expression.

Returns
The result matches the type of the nested arrays within expr .

Examples
> SELECT flatten(array(array(1, 2), array(3, 4)));
[1,2,3,4]

Related functions
float function
7/21/2022 • 2 minutes to read

Casts the value expr to FLOAT.

Syntax
float(expr)

Arguments
expr : An expression that can be cast to FLOAT.

Returns
A FLOAT.
This function is a synonym for CAST(expr AS FLOAT) .
See cast function for details.

Examples
> SELECT float('5.2');
5.2

Related functions
cast function
floor function
7/21/2022 • 2 minutes to read

Returns the largest number not bigger than expr rounded down to targetScale digits relative to the decimal
point.

Syntax
floor(expr [, targetScale])

Arguments
expr : An expression that evaluates to a numeric.
targetScale : An optional INTEGER literal greater than -38 specifying by how many digits after the
decimal point to round down.
Since: Databricks Runtime 10.5

Returns
If no targetScale is given:
If expr is DECIMAL(p, s) , returns DECIMAL(p - s + 1, 0) .
For all other cases, returns a BIGINT.
If targetScale is specified and expr is a:
TINYINT: returns a DECIMAL(p, 0) with p = max(3, -targetScale + 1).
SMALLINT: returns a DECIMAL(p, 0) with p = max(5, -targetScale + 1).
INTEGER: returns a DECIMAL(p, 0) with p = max(10, -targetScale + 1).
BIGINT: returns a DECIMAL(p, 0) with p = max(20, -targetScale + 1).
FLOAT: returns a DECIMAL(p, s) with p = max(14, -targetScale + 1) and s = min(7, max(0, targetScale)).
DOUBLE: returns a DECIMAL(p, s) with p = max(30, -targetScale + 1) and s = min(15, max(0, targetScale)).
DECIMAL(p_in, s_in): returns a DECIMAL(p, s) with p = max(p_in - s_in + 1, -targetScale + 1) and
s = min(s_in, max(0, targetScale)).
If targetScale is negative the rounding occurs to -targetScale digits to the left of the decimal point.
The default targetScale is 0, which rounds down to the next smaller integral number.

Examples
> SELECT floor(-0.1);
-1
> SELECT floor(5);
5
> SELECT floor(3345.1, -2);
3300
> SELECT floor(-12.345, 1);
-12.4

Related functions
ceiling function
ceil function
bround function
round function
forall function
7/21/2022 • 2 minutes to read

Tests whether func holds for all elements in the array.

Syntax
forall(expr, func)

Arguments
expr : An ARRAY expression.
func : A lambda function returning a BOOLEAN.

Returns
A BOOLEAN.
The lambda function uses one parameter passing an element of the array.

Examples
> SELECT forall(array(1, 2, 3), x -> x % 2 == 0);
false
> SELECT forall(array(2, 4, 8), x -> x % 2 == 0);
true
> SELECT forall(array(1, NULL, 3), x -> x % 2 == 0);
false
> SELECT forall(array(2, NULL, 8), x -> x % 2 == 0);
NULL

Related functions
filter function
exists function
format_number function
7/21/2022 • 2 minutes to read

Formats expr like #,###,###.## , rounded to scale decimal places.


Formats expr like fmt .

Syntax
format_number(expr, scale)

format_number(expr, fmt)

Arguments
expr : An expression that evaluates to a numeric.
scale : An INTEGER expression greater or equal to 0.
fmt : A STRING expression specifying a format.

Returns
A STRING.
A negative scale produces a null.

Examples
> SELECT format_number(12332.123456, 4);
12,332.1235
> SELECT format_number(12332.123456, '#.###');
12332.123
> SELECT format_number(12332.123456, 'EUR ,###.-');
EUR 12,332.-
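
An additional example illustrating the negative-scale behavior noted in the Returns section (not from the original reference; the result follows from the rule stated above):

> SELECT format_number(12332.123456, -1);
NULL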

Related functions
format_string function
format_string function
7/21/2022 • 2 minutes to read

Returns a formatted string from printf-style format strings.

Syntax
format_string(strfmt [, obj1 [, ...] ])

Arguments
strfmt : A STRING expression.
objN : STRING or numeric expressions.

Returns
A STRING.

Examples
> SELECT format_string('Hello World %d %s', 100, 'days');
Hello World 100 days

Related functions
format_number function
from_csv function
7/21/2022 • 5 minutes to read

Returns a struct value with the csvStr and schema .

Syntax
from_csv(csvStr, schema [, options])

Arguments
csvStr : A STRING expression specifying a row of CSV data.
schema : A STRING literal or invocation of schema_of_csv function.
options : An optional MAP<STRING,STRING> literal specifying directives.

Returns
A STRUCT with field names and types matching the schema definition.
csvStr should be well formed with respect to the schema and options . schema must be defined as comma-
separated column name and data type pairs, as used in, for example, CREATE TABLE .
options , if provided, can be any of the following:
sep (default , ): sets a separator for each field and value. This separator can be one or more characters.
encoding (default UTF-8): decodes the CSV files by the specified encoding type.
quote (default " ): sets a single character used for escaping quoted values where the separator can be part
of the value. To turn off quoting, set this option to an empty string rather than null. This
behavior is different from com.databricks.spark.csv .
escape (default \ ): sets a single character used for escaping quotes inside an already quoted value.
charToEscapeQuoteEscaping (default escape or \0 ): sets a single character used for escaping the escape for
the quote character. The default value is escape character when escape and quote characters are different,
\0 otherwise.
comment (default empty string): sets a single character used for skipping lines beginning with this character.
By default, it is disabled.
header (default false ): uses the first line as names of columns.
enforceSchema (default true ): If it is set to true, the specified or inferred schema is forcibly applied to
datasource files, and headers in CSV files are ignored. If the option is set to false, the schema is validated
against all headers in CSV files in the case when the header option is set to true. Field names in the schema
and column names in CSV headers are checked by their positions taking into account
spark.sql.caseSensitive . Though the default value is true, it is recommended to disable the enforceSchema
option to avoid incorrect results.
inferSchema (default false ): infers the input schema automatically from data. It requires one extra pass
over the data.
samplingRatio (default 1.0): defines fraction of rows used for schema inferring.
ignoreLeadingWhiteSpace (default false ): a flag indicating whether or not leading whitespaces from values
being read should be skipped.
ignoreTrailingWhiteSpace (default false ): a flag indicating whether or not trailing whitespaces from values
being read should be skipped.
nullValue (default empty string): sets the string representation of a null value.
emptyValue (default empty string): sets the string representation of an empty value.
nanValue (default NaN ): sets the string representation of a non-number value.
positiveInf (default Inf ): sets the string representation of a positive infinity value.
negativeInf (default -Inf) : sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd ): sets the string that indicates a date format. Custom date formats follow the
formats at Datetime patterns. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] ): sets the string that indicates a timestamp
format. Custom date formats follow the formats at Datetime patterns. This applies to timestamp type.
maxColumns (default 20480 ): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default -1): defines the maximum number of characters allowed for any specified value
being read. By default, it is -1 meaning unlimited length
unescapedQuoteHandling (default STOP_AT_DELIMITER ): defines how the CSV parser handles values with
unescaped quotes.
STOP_AT_CLOSING_QUOTE : If unescaped quotes are found in the input, accumulate the quote character
and proceed parsing the value as a quoted value, until a closing quote is found.
BACK_TO_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted
value. This will make the parser accumulate all characters of the current parsed value until the
delimiter is found. If no delimiter is found in the value, the parser will continue accumulating
characters from the input until a delimiter or line ending is found.
STOP_AT_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted
value. This will make the parser accumulate all characters until the delimiter or a line ending is found
in the input.
SKIP_VALUE : If unescaped quotes are found in the input, the content parsed for the specified
value is skipped and the value set in nullValue is produced instead.
RAISE_ERROR : If unescaped quotes are found in the input, a TextParsingException is thrown.
mode (default PERMISSIVE ): allows a mode for dealing with corrupt records during parsing. It supports the
following case-insensitive modes. Spark tries to parse only required columns in CSV under column pruning.
Therefore, corrupt records can be different based on required set of fields. This behavior can be controlled by
spark.sql.csv.parser.columnPruning.enabled (enabled by default).
PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field configured by
columnNameOfCorruptRecord , and sets malformed fields to null. To keep corrupt records, a user can set
a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does
not have the field, it drops corrupt records during parsing. A record with fewer or more tokens than
schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length
of the schema, sets null to extra fields. When the record has more tokens than the length of the
schema, it drops extra tokens.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord ): allows
renaming the new field having malformed string created by PERMISSIVE mode. This overrides
spark.sql.columnNameOfCorruptRecord .
multiLine (default false ): parse one record, which may span multiple lines.
locale (default en-US ): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while
parsing dates and timestamps.
lineSep (default covers all \r , \r\n , and \n ): defines the line separator that should be used for parsing.
Maximum length is 1 character.
pathGlobFilter : an optional glob pattern to only include files with paths matching the pattern. The syntax
follows org.apache.hadoop.fs.GlobFilter . It does not change the behavior of partition discovery.

Examples
> SELECT from_csv('1, 0.8', 'a INT, b DOUBLE');
{1,0.8}
> SELECT from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{"time":2015-08-26 00:00:00}

Related functions
from_json function
schema_of_json function
schema_of_csv function
to_json function
to_csv function
from_json function
7/21/2022 • 2 minutes to read

Returns a struct value with the jsonStr and schema .

Syntax
from_json(jsonStr, schema [, options])

Arguments
jsonStr : A STRING expression specifying a JSON document.
schema : A STRING literal or invocation of schema_of_json function.
options : An optional MAP<STRING,STRING> literal specifying directives.

Returns
A struct with field names and types matching the schema definition.
jsonStr should be well formed with respect to schema and options . schema must be defined as comma-
separated column name and data type pairs as used in for example CREATE TABLE .
options , if provided, can be any of the following:
primitivesAsString (default false ): infers all primitive values as a string type.
prefersDecimal (default false ): infers all floating-point values as a decimal type. If the values do not fit in
decimal, then it infers them as doubles.
allowComments (default false ): ignores Java and C++ style comment in JSON records.
allowUnquotedFieldNames (default false ): allows unquoted JSON field names.
allowSingleQuotes (default true ): allows single quotes in addition to double quotes.
allowNumericLeadingZeros (default false ): allows leading zeros in numbers (for example, 00012 ).
allowBackslashEscapingAnyCharacter (default false ): allows accepting quoting of all character using
backslash quoting mechanism.
allowUnquotedControlChars (default false ): allows JSON Strings to contain unquoted control characters
(ASCII characters with value less than 32, including tab and line feed characters) or not.
mode (default PERMISSIVE ): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field configured by
columnNameOfCorruptRecord , and sets malformed fields to null. To keep corrupt records, you can set a
string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not
have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a
columnNameOfCorruptRecord field in an output schema.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord ): allows
renaming the new field having malformed string created by PERMISSIVE mode. This overrides
spark.sql.columnNameOfCorruptRecord .
dateFormat (default yyyy-MM-dd ): sets the string that indicates a date format. Custom date formats follow the
formats at Datetime patterns. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] ): sets the string that indicates a timestamp
format. Custom date formats follow the formats at Datetime patterns. This applies to timestamp type.
multiLine (default false ): parses one record, which may span multiple lines, per file.
encoding (by default it is not set): allows to forcibly set one of standard basic or extended encoding for the
JSON files. For example UTF-16BE, UTF-32LE. If the encoding is not specified and multiLine is set to true , it
is detected automatically.
lineSep (default covers all \r , \r\n and \n ): defines the line separator that should be used for parsing.
samplingRatio (default 1.0): defines fraction of input JSON objects used for schema inferring.
dropFieldIfAllNull (default false ): whether to ignore column of all null values or empty array/struct
during schema inference.
locale (default is en-US ): sets a locale as language tag in IETF BCP 47 format. For instance, this is used
while parsing dates and timestamps.
allowNonNumericNumbers (default true ): allows JSON parser to recognize set of not-a-number ( NaN ) tokens
as legal floating number values:
+INF for positive infinity, as well as alias of +Infinity and Infinity .
-INF for negative infinity, as well as alias -Infinity .
NaN for other not-a-numbers, like result of division by zero.

Examples
> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
{1,0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{2015-08-26 00:00:00}
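
An additional illustrative example with a nested schema (not from the original reference; the schema string and the displayed result assume standard Spark SQL behavior):

> SELECT from_json('{"a":{"b":1}}', 'a STRUCT<b: INT>');
{{1}}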

Related functions
: operator
from_csv function
schema_of_json function
schema_of_csv function
to_csv function
to_json function
json_object_keys function
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
from_unixtime function
7/21/2022 • 2 minutes to read

Returns unixTime in fmt .

Syntax
from_unixtime(unixTime [, fmt])

Arguments
unixTime : A BIGINT expression representing seconds elapsed since 1969-12-31 at 16:00:00.
fmt: An optional STRING expression with a valid format.

Returns
A STRING.
See Datetime patterns for valid formats. The ‘yyyy-MM-dd HH:mm:ss’ pattern is used if omitted.

Examples
> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
1969-12-31 16:00:00
> SELECT from_unixtime(0);
1969-12-31 16:00:00

Related functions
to_unix_timestamp function
Datetime patterns
from_utc_timestamp function
7/21/2022 • 2 minutes to read

Returns a timestamp in expr specified in UTC in the timezone timeZone .

Syntax
from_utc_timestamp(expr, timeZone)

Arguments
expr : A TIMESTAMP expression with a UTC timestamp.
timeZone : A STRING expression that is a valid timezone.

Returns
A TIMESTAMP.

Examples
> SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul');
2016-08-31 09:00:00
> SELECT from_utc_timestamp('2017-07-14 02:40:00.0', 'GMT+1');
'2017-07-14 03:40:00.0'

Related functions
to_utc_timestamp function
getbit function
7/21/2022 • 2 minutes to read

Returns the value of a bit in a binary representation of an integral numeric.


Since: Databricks Runtime 10.0

Syntax
getbit(expr, pos)

Arguments
expr : An expression that evaluates to an integral numeric.
pos : An expression of type INTEGER.

Returns
The result type is INTEGER.
The result value is 1 if the bit is set, 0 otherwise.
Bits are counted right to left and 0-based.
If pos is outside the bounds of the data type of expr Databricks Runtime raises an error.
getbit is a synonym of bit_get.

Examples
> SELECT hex(23Y), getbit(23Y, 3);
17 0
> SELECT hex(23Y), getbit(23Y, 0);
17 1
> SELECT getbit(23Y, 8);
Error: Invalid bit position: 8 exceeds the bit upper limit
> SELECT getbit(23Y, -1);
Error: Invalid bit position: -1 is less than zero

Related functions
bit_get function
bit_reverse function
~ (tilde sign) operator
get_json_object function
7/21/2022 • 2 minutes to read

Extracts a JSON object from path .

Syntax
get_json_object(expr, path)

Arguments
expr : A STRING expression containing well formed JSON.
path : A STRING literal with a well formed JSON path.

Returns
A STRING.
If the object cannot be found null is returned.

Examples
> SELECT get_json_object('{"a":"b"}', '$.a');
b
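
An additional illustrative example with a nested path (not from the original reference; the result assumes standard JSON path semantics):

> SELECT get_json_object('{"a":{"b":"c"}}', '$.a.b');
c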

Related functions
json_tuple table-valued generator function
greatest function
7/21/2022 • 2 minutes to read

Returns the greatest value of all arguments, skipping null values.

Syntax
greatest(expr1 [, ...])

Arguments
exprN : Any expression of a comparable type with a shared least common type across all exprN .

Returns
The result type is the least common type of the arguments.

Examples
> SELECT greatest(10, 9, 2, 4, 3);
10

Related
least function
SQL data type rules
grouping function
7/21/2022 • 2 minutes to read

Indicates whether a specified column in a GROUPING SET , ROLLUP , or CUBE represents a subtotal.

Syntax
grouping(col)

Arguments
col : A column reference identified in a GROUPING SET , ROLLUP , or CUBE .

Returns
An INTEGER.
The result is 1 for a specified row if the row represents a subtotal over the grouping of col , or 0 if it is not.

Examples
> SELECT name, grouping(name), sum(age) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY
cube(name);
Alice 0 2
Bob 0 5
NULL 1 7

Related functions
grouping_id function
grouping_id function
7/21/2022 • 2 minutes to read

Returns the level of grouping for a set of columns.

Syntax
grouping_id( [col1 [, ...] ] )

Arguments
colN : A column reference identified in a GROUPING SET , ROLLUP , or CUBE .

Returns
A BIGINT.
The function combines the grouping function for several columns into one by assigning each column a bit in a
bit vector. The col1 is represented by the highest order bit. A bit is set to 1 if the row computes a subtotal for
the corresponding column.
Specifying no argument is equivalent to specifying all columns listed in the GROUPING SET , CUBE , or ROLLUP .

Examples
> SELECT name, age, grouping_id(name, age),
conv(cast(grouping_id(name, age) AS STRING), 10, 2),
avg(height)
FROM VALUES (2, 'Alice', 165), (5, 'Bob', 180) people(age, name, height)
GROUP BY cube(name, age)
Alice 2 0 0 165.0
Alice NULL 1 1 165.0
NULL 2 2 10 165.0
NULL NULL 3 11 172.5
Bob NULL 1 1 180.0
Bob 5 0 0 180.0
NULL 5 2 10 180.0

Related functions
grouping function
>= (gt eq sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 is greater than or equal to expr2 , or false otherwise.

Syntax
expr1 >= expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .

Returns
A BOOLEAN.

Examples
> SELECT 2 >= 1;
true
> SELECT 2.0 >= '2.1';
false
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-08-01 04:17:52');
false
> SELECT 1 >= NULL;
NULL

Related
!= (bangeq sign) operator
<= (lt eq sign) operator
< (lt sign) operator
> (gt sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
> (gt sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 is greater than expr2 , or false otherwise.

Syntax
expr1 > expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .

Returns
A BOOLEAN.

Examples
> SELECT 2 > 1;
true
> SELECT 2 > '1.1';
true
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-07-30 04:17:52');
false
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-08-01 04:17:52');
false
> SELECT 1 > NULL;
NULL

Related
!= (bangeq sign) operator
<= (lt eq sign) operator
< (lt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
hash function
7/21/2022 • 2 minutes to read

Returns a hash value of the arguments.

Syntax
hash(expr1, ...)

Arguments
exprN : An expression of any type.

Returns
An INTEGER.

Examples
> SELECT hash('Spark', array(123), 2);
-1321691492

Related functions
crc32 function
md5 function
sha function
sha1 function
sha2 function
xxhash64 function
hex function
7/21/2022 • 2 minutes to read

Converts expr to hexadecimal.

Syntax
hex(expr)

Arguments
expr : A BIGINT, BINARY, or STRING expression.

Returns
A STRING.
The function returns the hexadecimal representation of the argument.

Examples
> SELECT hex(17);
11
> SELECT hex('Spark SQL');
537061726B2053514C

Related functions
unhex function
hour function
7/21/2022 • 2 minutes to read

Returns the hour component of a timestamp.

Syntax
hour(expr)

Arguments
expr : A TIMESTAMP expression.

Returns
An INTEGER.
This function is a synonym for extract(HOUR FROM expr) .

Examples
> SELECT hour('2009-07-30 12:58:59');
12

Related functions
dayofmonth function
dayofweek function
dayofyear function
day function
minute function
extract function
hypot function
7/21/2022 • 2 minutes to read

Returns sqrt(expr1 * expr1 + expr2 * expr2) .

Syntax
hypot(expr1, expr2)

Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT hypot(3, 4);
5.0

Related functions
cbrt function
sqrt function
if function
7/21/2022 • 2 minutes to read

Returns expr1 if cond is true , or expr2 otherwise.

Syntax
if(cond, expr1, expr2)

Arguments
cond : A BOOLEAN expression.
expr1 : An expression of any type.
expr2 : An expression that shares a least common type with expr1 .

Returns
The result is the common maximum type of expr1 and expr2 .
This function is a synonym for iff function.

Examples
> SELECT if(1 < 2, 'a', 'b');
a

Related functions
case expression
decode function
iff function
ifnull function
7/21/2022 • 2 minutes to read

Returns expr2 if expr1 is NULL , or expr1 otherwise.

Syntax
ifnull(expr1, expr2)

Arguments
expr1 : An expression of any type.
expr2 : An expression sharing a least common type with expr1 .

Returns
The result type is the least common type of expr1 and expr2 .
This function is a synonym for coalesce function.

Examples
> SELECT ifnull(NULL, array('2'));
[2]

Related functions
coalesce function
if function
nvl function
nvl2 function
in predicate
7/21/2022 • 2 minutes to read

Returns true if elem equals any exprN or a row in query .

Syntax
elem in ( expr1 [, ...] )

elem in ( query )

Arguments
elem : An expression of any comparable type.
exprN : An expression of any type sharing a least common type with all other arguments.
query : Any query. The result must share a least common type with elem . If the query returns more than
one column elem must be a tuple (STRUCT) with the same number of fields.

Returns
The result is a BOOLEAN.

Examples
> SELECT 1 in(1, 2, 3);
true
> SELECT 1 in(2, 3, 4);
false
> SELECT (1, 2) IN ((1, 2), (2, 3));
true
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 1), named_struct('a', 1, 'b', 3));
false
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 2), named_struct('a', 1, 'b', 3));
true
> SELECT 1 IN (SELECT * FROM VALUES(1), (2));
true
> SELECT (1, 2) IN (SELECT c1, c2 FROM VALUES(1, 2), (3, 4) AS T(c1, c2));
true

Related functions
exists function
array_contains function
SELECT
initcap function
7/21/2022 • 2 minutes to read

Returns expr with the first letter of each word in uppercase.

Syntax
initcap(expr)

Arguments
expr : A STRING expression.

Returns
A STRING.
All other letters are in lowercase. Words are delimited by white space.

Examples
> SELECT initcap('sPark sql');
Spark Sql

Related functions
lower function
lcase function
ucase function
upper function
inline table-valued generator function
7/21/2022 • 2 minutes to read

Explodes an array of structs into a table.

Syntax
inline(expr)

Arguments
expr : An ARRAY expression.

Returns
A set of rows composed of the other expressions in the select list and the fields of the structs.
If expr is NULL no rows are produced.
inline can only be placed in the select list or a LATERAL VIEW. When placing the function in the select list there
must be no other generator function in the same select list.
The columns produced by inline are named “col1”, “col2”, etc by default, but can be aliased using an alias tuple
such as AS (myCol1, myCol2) .

Examples
> SELECT inline(array(struct(1, 'a'), struct(2, 'b'))), 'Spark SQL';
1 a Spark SQL
2 b Spark SQL
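
The aliasing described above can be sketched as follows (an illustrative example, not from the original reference; the aliases num and val are made up):

> SELECT inline(array(struct(1, 'a'), struct(2, 'b'))) AS (num, val);
1 a
2 b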

Related functions
explode table-valued generator function
explode_outer table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline_outer table-valued generator function
inline_outer table-valued generator function
7/21/2022 • 2 minutes to read

Explodes an array of structs into a table with OUTER semantics.

Syntax
inline_outer(expr)

Arguments
expr : An ARRAY expression.

Returns
A set of rows composed of the other expressions in the select list and the fields of the structs.
If expr is NULL, or the array is empty, a single row with NULLs for the attributes is produced.
inline_outer can only be placed in the select list or a LATERAL VIEW. When placing the function in the select list there
must be no other generator function in the same select list.
The columns produced by inline_outer are named “col1”, “col2”, etc by default, but can be aliased using an alias
tuple such as AS (myCol1, myCol2) .

Examples
> SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')));
1 a
2 b

Related functions
explode table-valued generator function
explode_outer table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline table-valued generator function
input_file_block_length function
7/21/2022 • 2 minutes to read

Returns the length in bytes of the block being read.

Syntax
input_file_block_length()

Arguments
This function takes no arguments.

Returns
A BIGINT.
If the information is not available -1 is returned.
The function is non-deterministic.

Examples
> SELECT input_file_block_length();
-1

Related functions
input_file_block_start function
input_file_name function
input_file_block_start function
7/21/2022 • 2 minutes to read

Returns the start offset in bytes of the block being read.

Syntax
input_file_block_start()

Arguments
This function takes no arguments.

Returns
A BIGINT.
If the information is not available -1 is returned.
The function is non-deterministic.

Examples
> SELECT input_file_block_start();
-1

Related functions
input_file_block_length function
input_file_name function
input_file_name function
7/21/2022 • 2 minutes to read

Returns the name of the file being read, or empty string if not available.

Syntax
input_file_name()

Arguments
This function takes no arguments.

Returns
A STRING.
If the information is not available an empty string is returned.
The function is non-deterministic.

Examples
> SELECT input_file_name();

Related functions
input_file_block_length function
input_file_block_start function
instr function
7/21/2022 • 2 minutes to read

Returns the (1-based) index of the first occurrence of substr in str .

Syntax
instr(str, substr)

Arguments
str : A STRING expression.
substr : A STRING expression.

Returns
A BIGINT.
If substr cannot be found the function returns 0.

Examples
> SELECT instr('SparkSQL', 'SQL');
6
> SELECT instr('SparkSQL', 'R');
0

Related functions
locate function
position function
int function
7/21/2022 • 2 minutes to read

Casts the value expr to INTEGER.

Syntax
int(expr)

Arguments
expr : Any expression which is castable to INTEGER.

Returns
An INTEGER.
This function is a synonym for CAST(expr AS INTEGER) .

Examples
> SELECT int(-5.6);
-5
> SELECT int('5');
5

Related functions
cast function
is distinct operator
7/21/2022 • 2 minutes to read

Tests whether the arguments have different values where NULLs are considered as comparable values.

Syntax
expr1 is [not] distinct from expr2

Arguments
expr1 : An expression of a comparable type.
expr2 : An expression of a type sharing a least common type with expr1 .

Returns
A BOOLEAN.
If both expr1 and expr2 are NULL they are considered not distinct.
If only one of expr1 and expr2 is NULL the expressions are considered distinct.
If both expr1 and expr2 are not NULL they are considered distinct if expr1 <> expr2 .

Examples
> SELECT NULL is distinct from NULL;
false
> SELECT NULL is distinct from 5;
true
> SELECT 1 is distinct from 5;
true
> SELECT NULL is not distinct from 5;
false

Related
= (eq sign) operator
!= (bangeq sign) operator
<=> (lt eq gt sign) operator
isnan function
is true operator
SQL data type rules
is false operator
7/21/2022 • 2 minutes to read

Tests whether expr is false .

Syntax
expr is [not] false

Arguments
expr : A BOOLEAN, STRING, or numeric expression.

Returns
A BOOLEAN.
If expr is a STRING of case insensitive value 't' or 'true' it is interpreted as a BOOLEAN true. Any other
non-NULL string is interpreted as false.
If expr is a numeric of value 1 it is interpreted as a BOOLEAN true. Any other non-NULL number is
interpreted as false.
If not is specified this operator returns true if expr is true or NULL and false otherwise.
If not is not specified the operator returns true if expr is false and false otherwise.

Examples
> SELECT 5 is false;
true
> SELECT NULL is not false;
true
> SELECT 'true' is not false;
true
> SELECT true is not false;
true

Related functions
isnotnull function
isnull function
is null operator
isnan function
is true operator
isnan function
7/21/2022 • 2 minutes to read

Returns true if expr is NaN .

Syntax
isnan(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A BOOLEAN.

Examples
> SELECT isnan(cast('NaN' as double));
true
> SELECT isnan(7);
false

Related functions
isnotnull function
isnull function
isnotnull function
7/21/2022 • 2 minutes to read

Returns true if expr is not NULL .

Syntax
isnotnull(expr)

Arguments
expr : An expression of any type.

Returns
A BOOLEAN.
This function is a synonym for expr IS NOT NULL .

Examples
> SELECT isnotnull(1);
true

Related functions
isnull function
isnan function
is null operator
isnull function
7/21/2022 • 2 minutes to read

Returns true if expr is NULL .

Syntax
isnull(expr)

Arguments
expr : An expression of any type.

Returns
A BOOLEAN.
This function is a synonym for expr IS NULL .

Examples
> SELECT isnull(1);
false

Related functions
isnotnull function
isnan function
is null operator
is null operator
7/21/2022 • 2 minutes to read

Tests whether expr is NULL .

Syntax
expr is [not] null

Arguments
expr : An expression of any type.

Returns
A BOOLEAN.
If not is specified this operator is a synonym for isnotnull(expr) .
Otherwise the operator is a synonym for isnull(expr) .

Examples
> SELECT 1 is null;
false
> SELECT 1 is not null;
true

Related functions
isnotnull function
isnull function
isnan function
is false operator
is true operator
is true operator
7/21/2022 • 2 minutes to read

Tests whether expr is true .

Syntax
expr is [not] true

Arguments
expr : A BOOLEAN, STRING, or numeric expression.

Returns
A BOOLEAN.
If expr is a STRING of case insensitive value 't' or 'true' it is interpreted as a BOOLEAN true. Any other
non-NULL string is interpreted as false.
If expr is a numeric of value 1 it is interpreted as a BOOLEAN true. Any other non-NULL number is
interpreted as false.
If not is specified this operator returns true if expr is false or NULL and false otherwise.
If not is not specified the operator returns true if expr is true and false otherwise.

Examples
> SELECT 1 is true;
true
> SELECT NULL is not true;
true
> SELECT 'true' is not true;
false
> SELECT false is not true;
true

Related functions
isnotnull function
isnull function
is null operator
isnan function
is false operator
java_method function
7/21/2022 • 2 minutes to read

Calls a method with reflection.

Syntax
java_method(class, method [, arg1 [, ...] ] )

Arguments
class : A STRING literal specifying the java class.
method : A STRING literal specifying the java method.
argn : An expression with a type appropriate for the selected method.

Returns
A STRING.

Examples
> SELECT java_method('java.util.UUID', 'randomUUID');
c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT java_method('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

Related functions
reflect function
json_array_length function
7/21/2022 • 2 minutes to read

Returns the number of elements in the outermost JSON array.

Syntax
json_array_length(jsonArray)

Arguments
jsonArray : A JSON array.

Returns
An INTEGER.
The function returns NULL if jsonArray is not a valid JSON string or NULL .

Databricks Runtime versions


NOTE
Available in Databricks Runtime 8.0 and above.

Examples
> SELECT json_array_length('[1,2,3,4]');
4
> SELECT json_array_length('[1,2,3,{"f1":1,"f2":[5,6]},4]');
5
> SELECT json_array_length('[1,2');
NULL

Related functions
: operator
json_object_keys function
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
schema_of_json function
to_json function
json_object_keys function
7/21/2022 • 2 minutes to read

Returns all the keys of the outermost JSON object as an array.

Syntax
json_object_keys(jsonObject)

Arguments
jsonObject : A STRING expression of a valid JSON object format.

Returns
An ARRAY.
If ‘jsonObject’ is any other valid JSON string, an invalid JSON string or an empty string, the function returns
NULL.

Examples
> SELECT json_object_keys('{}');
[]
> SELECT json_object_keys('{"key": "value"}');
[key]
> SELECT json_object_keys('{"f1":"abc","f2":{"f3":"a", "f4":"b"}}');
[f1,f2]

Related functions
: operator
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
schema_of_json function
to_json function
json_tuple table-valued generator function
7/21/2022 • 2 minutes to read

Returns multiple JSON objects as a tuple.

Syntax
json_tuple(jsonStr, path1 [, ...] )

Arguments
jsonStr : A STRING expression with well formed JSON.
pathN : A STRING literal with a JSON path.

Returns
A row composed of the other expressions in the select list and the JSON objects.
If any object cannot be found NULL is returned for that object. The produced columns are by default
named c1, c2, and so on, but can be aliased using AS (myC1, myC2, …) .
json_tuple can only be placed in the select list or a LATERAL VIEW. When placing the function in the select list
there must be no other generator function in the same select list.

Examples
> SELECT json_tuple('{"a":1, "b":2}', 'a', 'b'), 'Spark SQL';
1 2 Spark SQL
> SELECT json_tuple('{"a":1, "b":2}', 'a', 'c'), 'Spark SQL';
1 NULL Spark SQL
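
The aliasing described above can be sketched as follows (an illustrative example, not from the original reference; the aliases x and y are made up):

> SELECT json_tuple('{"a":1, "b":2}', 'a', 'b') AS (x, y);
1 2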

Related functions
: operator
json_object_keys function
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
schema_of_json function
to_json function
kurtosis aggregate function
7/21/2022 • 2 minutes to read

Returns the kurtosis value calculated from values of a group.

Syntax
kurtosis ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT kurtosis(col) FROM VALUES (-10), (-20), (100), (100), (1000) AS tab(col);
0.16212458373485106
> SELECT kurtosis(DISTINCT col) FROM VALUES (-10), (-20), (100), (100), (1000) AS tab(col);
-0.7014368047529627
> SELECT kurtosis(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col);
0.19432323191699075

Related functions
skewness aggregate function
lag analytic window function
7/21/2022 • 2 minutes to read

Returns the value of expr from a preceding row within the partition.

Syntax
lag( expr [, offset [, default] ] )

Arguments
expr : An expression of any type.
offset : An optional INTEGER literal specifying the offset.
default : An expression of the same type as expr .

Returns
The result type matches expr .
If offset is positive the value originates from the row preceding the current row by offset as specified by the
ORDER BY in the OVER clause. An offset of 0 uses the current row's value. A negative offset uses the value from
a row following the current row. If you do not specify offset it defaults to 1, the immediately preceding row.
If there is no row at the specified offset within the partition, the specified default is used. The default default
is NULL . You must provide an ORDER BY clause.
This function is a synonym to lead(expr, -offset, default) .

Examples
> SELECT a, b, lag(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 NULL
A1 1 1
A1 2 1
A2 3 NULL
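
An additional example illustrating the offset and default arguments described above (not from the original reference; the result assumes standard Spark SQL window semantics):

> SELECT a, b, lag(b, 1, 0) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 0
A1 1 1
A1 2 1
A2 3 0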

Related functions
lead analytic window function
last aggregate function
last_value aggregate function
first_value aggregate function
Window functions
last aggregate function
7/21/2022 • 2 minutes to read

Returns the last value of expr for the group of rows.

Syntax
last(expr [, ignoreNull] ) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]

Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal defaulting to false. The default for ignoreNull is false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .

Returns
The result type matches expr .
The function is a synonym for last_value aggregate function.
This function is non-deterministic.

Examples
> SELECT last(col) FROM VALUES (10), (5), (20) AS tab(col);
20
> SELECT last(col) FROM VALUES (10), (5), (NULL) AS tab(col);
NULL
> SELECT last(col) IGNORE NULLS FROM VALUES (10), (5), (NULL) AS tab(col);
5

Related functions
last_value aggregate function
first aggregate function
first_value aggregate function
last_day function
7/21/2022 • 2 minutes to read

Returns the last day of the month that the date belongs to.

Syntax
last_day(expr)

Arguments
expr : A DATE expression.

Returns
A DATE.

Examples
> SELECT last_day('2009-01-12');
2009-01-31

Related functions
next_day function
last_value aggregate function
7/21/2022 • 2 minutes to read

Returns the last value of expr for the group of rows.

Syntax
last_value(expr [, ignoreNull] ) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]

Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal defaulting to false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .

Returns
The result type matches expr .
The function is a synonym for last aggregate function.
This function is non-deterministic.

Examples
> SELECT last_value(col) FROM VALUES (10), (5), (20) AS tab(col);
20
> SELECT last_value(col) FROM VALUES (10), (5), (NULL) AS tab(col);
NULL
> SELECT last_value(col) IGNORE NULLS FROM VALUES (10), (5), (NULL) AS tab(col);
5

Related functions
last aggregate function
first aggregate function
first_value aggregate function
lcase function
7/21/2022 • 2 minutes to read

Returns expr with all characters changed to lowercase.

Syntax
lcase(expr)

Arguments
expr : A STRING expression.

Returns
A STRING.

Examples
> SELECT lcase('LowerCase');
lowercase

Related functions
lower function
initcap function
ucase function
upper function
lead analytic window function
7/21/2022 • 2 minutes to read

Returns the value of expr from a subsequent row within the partition.

Syntax
lead(expr [, offset [, default] ] )

Arguments
expr : An expression of any type.
offset : An optional INTEGER literal specifying the offset.
default : An expression of the same type as expr .

Returns
The result type matches expr .
If offset is positive the value originates from the row following the current row by offset rows, as specified by the ORDER BY in the OVER clause. An offset of 0 uses the current row’s value. A negative offset uses the value from a row preceding the current row. If you do not specify offset it defaults to 1, the immediately following row.
If there is no row at the specified offset within the partition the specified default is used. The default for default is NULL. An ORDER BY clause must be provided.
This function is a synonym to lag(expr, -offset, default) .

Examples
> SELECT a, b, lead(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 1
A1 1 2
A1 2 NULL
A2 3 NULL
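
As with lag, an explicit offset and default can replace the NULLs; a sketch with the same made-up rows and expected output:
> SELECT a, b, lead(b, 1, 0) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1
A1 1 2
A1 2 0
A2 3 0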

Related functions
lag analytic window function
last aggregate function
last_value aggregate function
first_value aggregate function
Window functions
least function
7/21/2022 • 2 minutes to read

Returns the least value of all parameters, skipping null values.

Syntax
least(expr1 [, ...] )

Arguments
exprN : An expression of any type that shares a least common type with all other arguments.

Returns
The result is the least common type of all arguments.

Examples
> SELECT least(10, 9, 2, 4, 3);
2
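
A sketch illustrating that NULL values are skipped (expected output shown):
> SELECT least(10, NULL, 3);
3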

Related
greatest function
SQL data type rules
left function
7/21/2022 • 2 minutes to read

Returns the leftmost len characters from str .

Syntax
left(str, len)

Arguments
str : A STRING expression.
len : An INTEGER expression.

Returns
A STRING.
If len is less than 1, an empty string is returned.

Examples
> SELECT left('Spark SQL', 3);
Spa
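
A sketch of the len less than 1 case described above; the concatenation markers are only there to make the empty result visible (expected output shown):
> SELECT '[' || left('Spark SQL', 0) || ']';
[]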

Related functions
right function
substr function
length function
7/21/2022 • 2 minutes to read

Returns the character length of string data or number of bytes of binary data.

Syntax
length(expr)

Arguments
expr : A STRING or BINARY expression.

Returns
An INTEGER.
The length of string data includes the trailing spaces. The length of binary data includes trailing binary zeros.
This function is a synonym for character_length function and char_length function.

Examples
> SELECT length('Spark SQL ');
10
> SELECT length('床前明月光');
5
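
A sketch of the BINARY case, assuming the hex literal x'537061726B' (the five ASCII bytes of 'Spark'); the expected result is the byte count:
> SELECT length(x'537061726B');
5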

Related functions
character_length function
char_length function
levenshtein function
7/21/2022 • 2 minutes to read

Returns the Levenshtein distance between the strings str1 and str2 .

Syntax
levenshtein(str1, str2)

Arguments
str1 : A STRING expression.
str2 : A STRING expression.

Returns
An INTEGER.

Examples
> SELECT levenshtein('kitten', 'sitting');
3

Related functions
like operator
7/21/2022 • 2 minutes to read

Returns true if str matches pattern with escape .

Syntax
str [ NOT ] like ( pattern [ ESCAPE escape ] )

str [ NOT ] like { ANY | SOME | ALL } ( [ pattern [, ...] ] )

Arguments
str : A STRING expression.
pattern : A STRING expression.
escape : A single character STRING literal.
ANY or SOME or ALL :
Since: Databricks Runtime 9.1
If ALL is specified, like returns true if str matches all patterns; with ANY or SOME it returns true if str matches at least one pattern.

Returns
A BOOLEAN.
The pattern is a string which is matched literally, with the exception of the following special symbols:
_ matches any one character in the input (similar to . in POSIX regular expressions)
% matches zero or more characters in the input (similar to .* in POSIX regular expressions).

The default escape character is the '\' . If an escape character precedes a special symbol or another escape
character, the following character is matched literally. It is invalid to escape any other character.
String literals are unescaped. For example, in order to match '\abc' , the pattern should be '\\abc' .
str NOT like ... is equivalent to NOT(str like ...) .

Examples
> SELECT like('Spark', '_park');
true

> SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';


true

> SELECT '%SystemDrive%/Users/John' like '/%SystemDrive/%//Users%' ESCAPE '/';


true

> SELECT like('Spock', '_park');


false

> SELECT 'Spark' like SOME ('_park', '_ock')


true

> SELECT 'Spark' like ALL ('_park', '_ock')


false

Related functions
ilike operator
rlike operator
regexp operator
ln function
7/21/2022 • 2 minutes to read

Returns the natural logarithm (base e ) of expr .

Syntax
ln(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
If expr is less than or equal to 0 the result is NULL.

Examples
> SELECT ln(1);
0.0

Related functions
exp function
expm1 function
e function
log function
log10 function
log1p function
locate function
7/21/2022 • 2 minutes to read

Returns the position of the first occurrence of substr in str after position pos .

Syntax
locate(substr, str [, pos] )

Arguments
substr : A STRING expression.
str : A STRING expression.
pos : An optional INTEGER expression.

Returns
An INTEGER.
The specified pos and return value are 1-based. If pos is omitted substr is searched from the beginning of
str . If pos is less than 1 the result is 0.

This function is a synonym for position function.

Examples
> SELECT locate('bar', 'foobarbar');
4
> SELECT locate('bar', 'foobarbar', 5);
7
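
A sketch of the pos less than 1 case described above (expected output shown):
> SELECT locate('bar', 'foobarbar', 0);
0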

Related functions
position function
instr function
charindex function
log function
7/21/2022 • 2 minutes to read

Returns the logarithm of expr with base .

Syntax
log( [base,] expr)

Arguments
base : An optional expression that evaluates to a numeric.
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
If base or expr are less than or equal to 0 the result is NULL. log(expr) is a synonym for ln(expr) .

Examples
> SELECT log(10, 100);
2.0
> SELECT log(e());
1.0

Related functions
log10 function
ln function
log1p function
pow function
log10 function
7/21/2022 • 2 minutes to read

Returns the logarithm of expr with base 10 .

Syntax
log10(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
If expr is less than or equal to 0 the result is NULL.

Examples
> SELECT log10(10);
1.0

Related functions
log function
log2 function
ln function
log1p function
pow function
log1p function
7/21/2022 • 2 minutes to read

Returns log(1 + expr) .

Syntax
log1p(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
If expr is less than or equal to -1 the result is NULL.

Examples
> SELECT log1p(0);
0.0

Related functions
log function
ln function
log10 function
expm1 function
e function
exp function
pow function
log2 function
7/21/2022 • 2 minutes to read

Returns the logarithm of expr with base 2 .

Syntax
log2(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
If expr is less than or equal to 0 the result is NULL.

Examples
> SELECT log2(2);
1.0

Related functions
log function
ln function
log1p function
log10 function
pow function
lower function
7/21/2022 • 2 minutes to read

Returns expr with all characters changed to lowercase.

Syntax
lower(expr)

Arguments
expr : A STRING expression.

Returns
A STRING.

Examples
> SELECT lower('LowerCase');
lowercase

Related functions
lcase function
initcap function
ucase function
upper function
lpad function
7/21/2022 • 2 minutes to read

Returns expr , left-padded with pad to a length of len .

Syntax
lpad(expr, len [, pad] )

Arguments
expr : A STRING or BINARY expression to be padded.
len : An INTEGER expression specifying the length of the result string.
pad : An optional STRING or BINARY expression specifying the padding.

Returns
A BINARY if both expr and pad are BINARY, otherwise STRING.
If expr is longer than len , the return value is shortened to len characters. If you do not specify pad , a
STRING expr is padded to the left with space characters, whereas a BINARY expr is padded to the left with
x’00’ bytes. If len is less than 1, an empty string is returned.
BINARY is supported since: Databricks Runtime 11.0.

Examples
> SELECT lpad('hi', 5, 'ab');
abahi
> SELECT lpad('hi', 1, '??');
h
> SELECT lpad('hi', 5);
hi

> SELECT hex(lpad(x'1020', 5, x'05'));


0505051020

Related functions
rpad function
ltrim function
rtrim function
trim function
<=> (lt eq gt sign) operator
7/21/2022 • 2 minutes to read

Returns the same result as the equal (=) operator for non-null operands, but returns true if both are NULL , false if one of them is NULL .

Syntax
expr1 <=> expr2

Arguments
expr1 : An expression of a comparable type.
expr2 : An expression that shares a least common type with expr1 .

Returns
A BOOLEAN.
This operator is a synonym for expr1 is not distinct from expr2

Examples
> SELECT 2 <=> 2;
true
> SELECT 1 <=> '1';
true
> SELECT true <=> NULL;
false
> SELECT NULL <=> NULL;
true

Related
!= (bangeq sign) operator
< (lt sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<= (lt eq sign) operator
= (eq sign) operator
<> (lt gt sign) operator
is distinct operator
SQL data type rules
<= (lt eq sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 is less than or equal to expr2 , or false otherwise.

Syntax
expr1 <= expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .

Returns
A BOOLEAN.

Examples
> SELECT 2 <= 2;
true
> SELECT 1.0 <= '1';
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-08-01 04:17:52');
true
> SELECT 1 <= NULL;
NULL

Related
!= (bangeq sign) operator
< (lt sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
<> (lt gt sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 does not equal expr2 , or false otherwise.

Syntax
expr1 <> expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .

Returns
A BOOLEAN.
This function is a synonym for != (bangeq sign) operator.

Examples
> SELECT 2 <> 2;
false
> SELECT 3 <> 2;
true
> SELECT 1 <> '1';
false
> SELECT true <> NULL;
NULL
> SELECT NULL <> NULL;
NULL

Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
!= (bangeq sign) operator
SQL data type rules
ltrim function
7/21/2022 • 2 minutes to read

Returns str with leading characters within trimStr removed.

Syntax
ltrim( [trimStr ,] str)

Arguments
trimStr : An optional STRING expression with the string to be trimmed.
str : A STRING expression from which to trim.

Returns
A STRING.
The default for trimStr is a single space. The function removes any leading characters within trimStr from
str .

Examples
> SELECT '+' || ltrim(' SparkSQL ') || '+';
+SparkSQL +
> SELECT '+' || ltrim('abc', 'acbabSparkSQL ') || '+';
+SparkSQL +

Related functions
btrim function
rpad function
lpad function
rtrim function
trim function
< (lt sign) operator
7/21/2022 • 2 minutes to read

Returns true if expr1 is less than expr2 , or false otherwise.

Syntax
expr1 < expr2

Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .

Returns
A BOOLEAN.

Examples
> SELECT 1 < 2;
true
> SELECT 1.1 < '1';
false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-07-30 04:17:52');
false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-08-01 04:17:52');
true
> SELECT 1 < NULL;
NULL

Related
!= (bangeq sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
make_date function
7/21/2022 • 2 minutes to read

Creates a date from year , month , and day fields.

Syntax
make_date(year, month, day)

Arguments
year : An INTEGER expression evaluating to a value from 1 to 9999.
month : An INTEGER expression evaluating to a value from 1 (January) to 12 (December).
day : An INTEGER expression evaluating to a value from 1 to 31.

Returns
A DATE.
If any of the arguments is out of bounds, the function raises an error.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed arguments.

Examples
> SELECT make_date(2013, 7, 15);
2013-07-15
> SELECT make_date(2019, 13, 1);
NULL
> SELECT make_date(2019, 7, NULL);
NULL
> SELECT make_date(2019, 2, 30);
NULL

Related functions
make_timestamp function
make_interval function
make_dt_interval function
7/21/2022 • 2 minutes to read

Creates an interval from days , hours , mins and secs .


Since: Databricks Runtime 10.0

Syntax
make_dt_interval( [ days [, hours [, mins [, secs] ] ] ] )

Arguments
days : An integral number of days, positive or negative
hours : An integral number of hours, positive or negative
mins : An integral number of minutes, positive or negative
secs : A number of seconds with the fractional part in microsecond precision.

Returns
An INTERVAL DAY TO SECOND .
Unspecified arguments are defaulted to 0. If you provide no arguments the result is an
INTERVAL '0 00:00:00.000000000' DAY TO SECOND .

The function is equivalent to executing:


INTERVAL days DAYS + INTERVAL hours HOURS + INTERVAL mins MINUTES + INTERVAL secs SECONDS .
As such each unit can be outside of its natural range as well as negative.

Examples
> SELECT make_dt_interval(100, 13);
100 13:00:00.000000000

> SELECT make_dt_interval(100, null);


NULL

> SELECT make_dt_interval(0, 25);


1 01:00:00.000000000

> SELECT make_dt_interval(0, 0, 1, -0.1);


0 00:00:59.900000000

Related functions
make_date function
make_timestamp function
make_ym_interval function
make_interval function
7/21/2022 • 2 minutes to read

Creates an interval from years , months , weeks , days , hours , mins and secs .

WARNING
This constructor is deprecated since it generates an INTERVAL which cannot be compared or operated upon. Please use
make_ym_interval or make_dt_interval to produce intervals.

Syntax
make_interval( [years [, months [, weeks [, days [, hours [, mins [, secs] ] ] ] ] ] ] )

Arguments
years : An integral number of years, positive or negative
months : An integral number of months, positive or negative
weeks : An integral number of weeks, positive or negative
days : An integral number of days, positive or negative
hours : An integral number of hours, positive or negative
mins : An integral number of minutes, positive or negative
secs : A number of seconds with the fractional part in microsecond precision.

Returns
An INTERVAL.
Unspecified arguments are defaulted to 0. If you provide no arguments the result is an INTERVAL with 0
seconds.

Examples
> SELECT make_interval(100, 11);
100 years 11 months
> SELECT make_interval(100, null);
NULL
> SELECT make_interval();
0 seconds
> SELECT make_interval(0, 0, 1, 1, 12, 30, 01.001001);
8 days 12 hours 30 minutes 1.001001 seconds

Related functions
make_dt_interval function
make_date function
make_timestamp function
make_ym_interval function
make_ym_interval function
7/21/2022 • 2 minutes to read

Creates an interval from years and months .


Since: Databricks Runtime 10.0

Syntax
make_ym_interval( [ years [, months ] ] )

Arguments
years : An integral number of years, positive or negative
months : An integral number of months, positive or negative

Returns
An INTERVAL YEAR TO MONTH .
Unspecified arguments are defaulted to 0. If you provide no arguments the result is an
INTERVAL '0-0' YEAR TO MONTH .

The function is equivalent to executing:


INTERVAL years YEARS + INTERVAL months MONTHS .
As such each unit can be outside of its natural range as well as negative.

Examples
> SELECT make_ym_interval(100, 5);
100-5

> SELECT make_ym_interval(100, null);


NULL

> SELECT make_ym_interval(0, 13);


1-1

> SELECT make_ym_interval(1, -1);


0-11

Related functions
make_dt_interval function
make_date function
make_timestamp function
make_timestamp function
7/21/2022 • 2 minutes to read

Creates a timestamp from year , month , day , hour , min , sec , and timezone fields.

Syntax
make_timestamp(year, month, day, hour, min, sec [, timezone] )

Arguments
year : An INTEGER expression evaluating to a value from 1 to 9999.
month : An INTEGER expression evaluating to a value from 1 (January) to 12 (December).
day : An INTEGER expression evaluating to a value from 1 to 31.
hour : An INTEGER expression evaluating to a value between 0 and 23.
min : An INTEGER expression evaluating to a value between 0 and 59.
sec : A numeric expression evaluating to a value between 0 and 60.
timezone : An optional STRING expression evaluating to a valid timezone string. For example: CET, UTC.

Returns
A TIMESTAMP.
If any of the arguments is out of bounds, the function returns an error. If sec is 60 it is interpreted as 0 and a
minute is added to the result.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for out of bounds arguments.

Examples
> SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887);
2014-12-28 06:30:45.887
> SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET');
2014-12-27 21:30:45.887
> SELECT make_timestamp(2019, 6, 30, 23, 59, 60);
2019-07-01 00:00:00
> SELECT make_timestamp(2019, 13, 1, 10, 11, 12, 'PST');
NULL
> SELECT make_timestamp(NULL, 7, 22, 15, 30, 0);
NULL

Related functions
make_date function
map function
7/21/2022 • 2 minutes to read

Creates a map with the specified key-value pairs.

Syntax
map( [key1, value1] [, ...] )

Arguments
keyN : An expression of any comparable type. All keyN must share a least common type.
valueN : An expression of any type. All valueN must share a least common type.

Returns
A MAP with keys typed as the least common type of keyN and values typed as the least common type of
valueN .

There can be 0 or more pairs.


If there is a duplicate key or a NULL key the function raises an error.

Examples
> SELECT map(1.0, '2', 3.0, '4');
{1.0 -> 2, 3.0 -> 4}

Related
[ ] operator
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
SQL data type rules
map_concat function
7/21/2022 • 2 minutes to read

Returns the union of all expr map expressions.

Syntax
map_concat([ expr1 [, ...] ] )

Arguments
exprN : A MAP expression. All exprN must share a least common type.

Returns
A MAP of the least common type of exprN .
If no argument is provided, an empty map. If there is a key collision an error is raised.

Examples
> SELECT map_concat(map(1, 'a', 2, 'b'), map(3, 'c'));
{1 -> a, 2 -> b, 3 -> c}

Related
map function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
SQL data type rules
map_contains_key function
7/21/2022 • 2 minutes to read

Returns true if map contains key .


Since: Databricks Runtime 10.3

Syntax
map_contains_key(map, key)

Arguments
map : A MAP expression to be searched.
key : An expression with a type sharing a least common type with the map keys.

Returns
A BOOLEAN. If map or key is NULL , the result is NULL .

Examples
> SELECT map_contains_key(map(1, 'a', 2, 'b'), 2);
true

> SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);


false

> SELECT map_contains_key(map(1, 'a', 2, 'b'), NULL);


NULL

Related
array_contains function
map function
map_keys function
map_values function
SQL data type rules
map_entries function
7/21/2022 • 2 minutes to read

Returns an unordered array of all entries in map .

Syntax
map_entries(map)

Arguments
map : A MAP expression.

Returns
An ARRAY of STRUCTs holding key-value pairs.

Examples
> SELECT map_entries(map(1, 'a', 2, 'b'));
[{1, a}, {2, b}]

Related functions
map function
map_concat function
map_from_entries function
map_filter function
map_from_arrays function
map_keys function
map_values function
map_zip_with function
map_filter function
7/21/2022 • 2 minutes to read

Filters entries in the map in expr using the function func .

Syntax
map_filter(expr, func)

Arguments
expr : A MAP expression.
func : A lambda function with two parameters returning a BOOLEAN. The first parameter takes the key, the second parameter takes the value.

Returns
The result is the same type as expr .

Examples
> SELECT map_filter(map(1, 0, 2, 2, 3, -1), (k, v) -> k > v);
{1 -> 0, 3 -> -1}

Related functions
map function
map_concat function
map_entries function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
map_from_arrays function
7/21/2022 • 2 minutes to read

Creates a map with a pair of the keys and values arrays.

Syntax
map_from_arrays(keys, values)

Arguments
keys : An ARRAY expression without duplicates or NULL.
values : An ARRAY expression of the same cardinality as keys .

Returns
A MAP where keys are of the element type of keys and values are of the element type of values .

Examples
> SELECT map_from_arrays(array(1.0, 3.0), array('2', '4'));
{1.0 -> 2, 3.0 -> 4}

Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_entries function
map_keys function
map_values function
map_zip_with function
map_from_entries function
7/21/2022 • 2 minutes to read

Creates a map from the specified array of entries.

Syntax
map_from_entries(expr)

Arguments
expr : An ARRAY expression of STRUCT with two fields.

Returns
A MAP where keys are the first field of the structs and values the second. There must be no duplicates or nulls in
the first field (the key).

Examples
> SELECT map_from_entries(array(struct(1, 'a'), struct(2, 'b')));
{1 -> a, 2 -> b}

Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_keys function
map_values function
map_zip_with function
map_keys function
7/21/2022 • 2 minutes to read

Returns an unordered array containing the keys of map .

Syntax
map_keys(map)

Arguments
map : A MAP expression.

Returns
An ARRAY where the element type matches the map key type.

Examples
> SELECT map_keys(map(1, 'a', 2, 'b'));
[1,2]

Related functions
map function
map_concat function
map_contains_key function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_values function
map_zip_with function
map_values function
7/21/2022 • 2 minutes to read

Returns an unordered array containing the values of map .

Syntax
map_values(map)

Arguments
map : A MAP expression.

Returns
An ARRAY where the element type matches the map value type.

Examples
> SELECT map_values(map(1, 'a', 2, 'b'));
[a,b]

Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_zip_with function
map_zip_with function
7/21/2022 • 2 minutes to read

Merges map1 and map2 into a single map.

Syntax
map_zip_with(map1, map2, func)

Arguments
map1 : A MAP expression.
map2 : A MAP expression of the same key type as map1 .
func : A lambda function taking three parameters. The first parameter is the key, followed by the values from
each map.

Returns
A MAP where the key matches the key type of the input maps and the value is typed by the return type of the
lambda function.
If a key is not matched by one side the respective value provided to the lambda function is NULL.

Examples
> SELECT map_zip_with(map(1, 'a', 2, 'b'), map(1, 'x', 2, 'y'), (k, v1, v2) -> concat(v1, v2));
{1 -> ax, 2 -> by}
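
A sketch of the unmatched-key case described above; coalesce substitutes '?' for the NULL passed to the lambda (expected output shown):
> SELECT map_zip_with(map(1, 'a'), map(1, 'x', 2, 'y'), (k, v1, v2) -> concat(coalesce(v1, '?'), coalesce(v2, '?')));
{1 -> ax, 2 -> ?y}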

Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
max aggregate function
7/21/2022 • 2 minutes to read

Returns the maximum value of expr in a group.

Syntax
max(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression of any type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type matches the type of the argument.

Examples
> SELECT max(col) FROM VALUES (10), (50), (20) AS tab(col);
50

Related functions
min aggregate function
avg aggregate function
max_by aggregate function
min_by aggregate function
mean aggregate function
max_by aggregate function
7/21/2022 • 2 minutes to read

Returns the value of expr1 associated with the maximum value of expr2 in a group.

Syntax
max_by(expr1, expr2) [FILTER ( WHERE cond ) ]

Arguments
expr1 : An expression of any type.
expr2 : An expression of a type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type matches the type of expr1 .
This function is non-deterministic if expr2 is not unique within the group.

Examples
> SELECT max_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
b
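
A sketch of the FILTER clause with the same made-up rows; only rows with y below 30 are considered, so the expected result is 'c':
> SELECT max_by(x, y) FILTER (WHERE y < 30) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
c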

Related functions
min aggregate function
avg aggregate function
max aggregate function
min_by aggregate function
md5 function
7/21/2022 • 2 minutes to read

Returns an MD5 128-bit checksum of expr as a hex string.

Syntax
md5(expr)

Arguments
expr : A BINARY expression.

Returns
A STRING.

Examples
> SELECT md5('Spark');
8cde774d6f7333752ed72cacddb05126

Related functions
crc32 function
hash function
sha function
sha1 function
sha2 function
mean aggregate function
7/21/2022 • 2 minutes to read

Returns the mean calculated from values of a group.

Syntax
mean ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type is computed as for the arguments:
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
year-month interval: The result is an INTERVAL YEAR TO MONTH .
day-time interval: The result is an INTERVAL DAY TO SECOND .
In all other cases the result is a DOUBLE.
Nulls within the group are ignored. If a group is empty or consists only of nulls the result is NULL.
If DISTINCT is specified the mean is computed after duplicates have been removed.
If the result overflows the result type, Databricks Runtime raises an overflow error. To return a NULL instead use
try_avg.

WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but return NULL.

This function is a synonym for avg aggregate function.

Examples
> SELECT mean(col) FROM VALUES (1), (2), (3) AS tab(col);
2.0

> SELECT avg(DISTINCT col) FROM VALUES (1), (1), (2) AS tab(col);
1.5

> SELECT mean(col) FROM VALUES (1), (2), (NULL) AS tab(col);


1.5

> SELECT mean(col) FROM VALUES (INTERVAL '1' YEAR), (INTERVAL '2' YEAR) AS tab(col);
1-6

Related functions
aggregate function
avg aggregate function
max aggregate function
min aggregate function
sum aggregate function
try_avg aggregate function
min aggregate function
7/21/2022 • 2 minutes to read

Returns the minimum value of expr in a group.

Syntax
min(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression of any type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type matches the type of the argument.

Examples
> SELECT min(col) FROM VALUES (10), (50), (20) AS tab(col);
10

Related functions
max aggregate function
avg aggregate function
max_by aggregate function
min_by aggregate function
mean aggregate function
min_by aggregate function
7/21/2022 • 2 minutes to read

Returns the value of expr1 associated with the minimum value of expr2 in a group.

Syntax
min_by(expr1, expr2) [FILTER ( WHERE cond ) ]

Arguments
expr1 : An expression of any type.
expr2 : An expression of a type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type matches the type of expr1 .
This function is non-deterministic if expr2 is not unique within the group.

Examples
> SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
a

Related functions
min aggregate function
avg aggregate function
max aggregate function
max_by aggregate function
- (minus sign) operator
7/21/2022 • 2 minutes to read

Returns the subtraction of expr2 from expr1 .

Syntax
expr1 - expr2

Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : The accepted type depends on the type of expr1 :
If expr1 is a numeric, expr2 must be a numeric expression.
If expr1 is a year-month or day-time interval, expr2 must be of the matching class of interval.
Otherwise expr2 must be a DATE or TIMESTAMP.

Returns
The result type is determined in the following order:
If expr1 is a numeric, the result is the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 is a TIMESTAMP and expr2 is an interval the result is a TIMESTAMP.
If expr1 and expr2 are DATEs the result is an INTERVAL DAYS .
If expr1 or expr2 are TIMESTAMP the result is an INTERVAL DAY TO SECOND .
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
When you subtract a year-month interval from a DATE, Databricks Runtime ensures that the resulting date is
well-formed.
If the result overflows the result type, Databricks Runtime raises an ARITHMETIC_OVERFLOW error.
Use try_subtract to return NULL on overflow.

WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.

Examples
> SELECT 2 - 1;
1

> SELECT DATE'2021-03-20' - INTERVAL '2' MONTH


2021-01-20

> SELECT TIMESTAMP'2021-03-20 12:15:29' - INTERVAL '3' SECOND


2021-03-20 12:15:26

> SELECT typeof(INTERVAL '3' DAY - INTERVAL '2' HOUR);


interval day to hour

> SELECT typeof(current_date - (current_date + INTERVAL '1' DAY));


interval day

> SELECT typeof(current_timestamp - (current_date + INTERVAL '1' DAY));


interval day to second

> SELECT DATE'2021-03-31' - INTERVAL '1' MONTH;


2021-02-28

> SELECT -100Y - 100Y;


Error: ARITHMETIC_OVERFLOW

Related functions
* (asterisk sign) operator
+ (plus sign) operator
/ (slash sign) operator
sum aggregate function
try_add function
try_subtract function
- (minus sign) unary operator
7/21/2022 • 2 minutes to read

Returns the negated value of expr .

Syntax
- expr

Arguments
expr : An expression that evaluates to a numeric or interval.

Returns
The result type matches the argument type.
For integral numeric types the function can return an ARITHMETIC_OVERFLOW error.
This function is a synonym for negative function.

WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.

Examples
> SELECT -(1);
-1

> SELECT -(cast(-32768 AS smallint));


Error: ARITHMETIC_OVERFLOW

> SELECT -INTERVAL '5' MONTH;


-0-5

Related functions
sign function
abs function
positive function
+ (plus sign) unary operator
negative function
minute function
7/21/2022 • 2 minutes to read

Returns the minute component of the timestamp in expr .

Syntax
minute(expr)

Arguments
expr : A TIMESTAMP expression or a STRING of a valid timestamp format.

Returns
An INTEGER.
This function is a synonym for extract(MINUTES FROM expr) .

Examples
> SELECT minute('2009-07-30 12:58:59');
58

Related functions
dayofmonth function
dayofweek function
dayofyear function
day function
hour function
month function
extract function
mod function
7/21/2022 • 2 minutes to read

Returns the remainder after dividend / divisor .

Syntax
mod(dividend, divisor)

Arguments
dividend : An expression that evaluates to a numeric.
divisor : An expression that evaluates to a numeric.

Returns
If both dividend and divisor are of DECIMAL , the result matches the divisor’s type. In all other cases, a
DOUBLE.
If divisor is 0, the function raises a DIVIDE_BY_ZERO error.
This function is equivalent to the % (percent sign) operator.

Examples
> SELECT MOD(2, 1.8);
0.2

> SELECT MOD(2, 0);


Error: DIVIDE_BY_ZERO
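
A sketch with a negative dividend; unlike pmod, the expected result keeps the sign of the dividend:
> SELECT mod(-10, 3);
-1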

Related functions
% (percent sign) operator
/ (slash sign) operator
pmod function
monotonically_increasing_id function
7/21/2022 • 2 minutes to read

Returns monotonically increasing 64-bit integers.

Syntax
monotonically_increasing_id()

Arguments
This function takes no arguments.

Returns
A BIGINT.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

Examples
> SELECT monotonically_increasing_id();
0

Related functions
uuid function
month function
7/21/2022 • 2 minutes to read

Returns the month component of the timestamp in expr .

Syntax
month(expr)

Arguments
expr : A TIMESTAMP expression or a STRING of a valid timestamp format.

Returns
An INTEGER.
This function is a synonym for extract(MONTH FROM expr) .

Examples
> SELECT month('2016-07-30');
7

Related functions
dayofmonth function
dayofweek function
dayofyear function
day function
hour function
minute function
extract function
months_between function
7/21/2022 • 2 minutes to read

Returns the number of months elapsed between dates or timestamps in expr1 and expr2 .

Syntax
months_between(expr1, expr2 [, roundOff] )

Arguments
expr1 : A DATE or TIMESTAMP expression.
expr2 : An expression of the same type as expr1 .
roundOff : An optional BOOLEAN expression.

Returns
A DOUBLE.
If expr1 is later than expr2 , the result is positive.
If expr1 and expr2 are on the same day of the month, or both are the last day of the month, time of day is
ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless
roundOff =false.

Examples
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30');
3.94959677
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false);
3.9495967741935485

Related functions
- (minus sign) operator
add_months function
datediff function
datediff (timestamp) function
date_add function
date_sub function
dateadd function
named_struct function
7/21/2022 • 2 minutes to read

Creates a struct with the specified field names and values.

Syntax
named_struct( {name1, val1} [, ...] )

Arguments
nameN : A STRING literal naming field N.
valN : An expression of any type specifying the value for field N.

Returns
A struct with field N matching the type of valN .

Examples
> SELECT named_struct('a', 1, 'b', 2, 'c', 3);
{1, 2, 3}

Related functions
struct function
map function
str_to_map function
array function
nanvl function
7/21/2022 • 2 minutes to read

Returns expr1 if it’s not NaN , or expr2 otherwise.

Syntax
nanvl(expr1, expr2)

Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT nanvl(cast('NaN' AS DOUBLE), 123);
123.0

Related functions
coalesce function
negative function
7/21/2022 • 2 minutes to read

Returns the negated value of expr .

Syntax
negative(expr)

Arguments
expr : An expression that evaluates to a numeric or interval.

Returns
The result type matches the argument type.
For integral numeric types the function can return an ARITHMETIC_OVERFLOW error.
This function is a synonym for - (minus sign) unary operator.

WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.

Examples
> SELECT negative(1);
-1

> SELECT negative(cast(-32768 AS smallint))


Error: ARITHMETIC_OVERFLOW

Related functions
sign function
abs function
positive function
- (minus sign) unary operator
+ (plus sign) unary operator
next_day function
7/21/2022 • 2 minutes to read

Returns the first date which is later than expr and named as in dayOfWeek .

Syntax
next_day(expr, dayOfWeek)

Arguments
expr : A DATE expression.
dayOfWeek : A STRING expression identifying a day of the week.

Returns
A DATE.
dayOfWeek must be one of the following (case insensitive):
'SU' , 'SUN' , 'SUNDAY'
'MO' , 'MON' , 'MONDAY'
'TU' , 'TUE' , 'TUESDAY'
'WE' , 'WED' , 'WEDNESDAY'
'TH' , 'THU' , 'THURSDAY'
'FR' , 'FRI' , 'FRIDAY'
'SA' , 'SAT' , 'SATURDAY'

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for a malformed dayOfWeek .

Examples
> SELECT next_day('2015-01-14', 'TU');
2015-01-20
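
Because dayOfWeek is case insensitive, full day names in any case are also accepted; a sketch with the expected output:
> SELECT next_day('2015-01-14', 'friday');
2015-01-16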

Related functions
dayofweek function
last_day function
not operator
7/21/2022 • 2 minutes to read

Returns logical negation of the argument.

Syntax
not expr

Arguments
expr : A BOOLEAN expression.

Returns
A BOOLEAN.
This operator is an alias for ! (bang sign) operator.

Examples
> SELECT not true;
false
> SELECT not false;
true
> SELECT not NULL;
NULL

Related functions
& (ampersand sign) operator
| (pipe sign) operator
! (bang sign) operator
now function
7/21/2022 • 2 minutes to read

Returns the current timestamp at the start of query evaluation.

Syntax
now()

Arguments
This function takes no arguments.

Returns
A TIMESTAMP.

Examples
> SELECT now();
2020-04-25 15:49:11.914

Related functions
current_date function
current_timezone function
current_timestamp function
nth_value analytic window function
7/21/2022 • 2 minutes to read

Returns the value at a specific offset in the window.

Syntax
nth_value(expr, offset) [ IGNORE NULLS | RESPECT NULLS ]

Arguments
expr : An expression of any type.
offset : An INTEGER literal greater than 0.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used any expr value that is NULL is ignored in the
count. The default is RESPECT NULLS .

Returns
The result type matches the expr type.
The window function returns the value of expr at the row that is the offset th row from the beginning of the
window frame.
If there is no such offset th row, returns NULL .
You must use the ORDER BY clause with this function. If the order is non-unique, the result is non-deterministic.

Examples
> SELECT a, b, nth_value(b, 2) OVER (PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1
A1 1 1
A1 2 1
A2 3 NULL

Related functions
lag analytic window function
lead analytic window function
first aggregate function
last aggregate function
Window functions
ntile ranking window function
7/21/2022 • 2 minutes to read

Divides the rows for each window partition into n buckets ranging from 1 to at most n .

Syntax
ntile([n])

Arguments
n : An optional INTEGER literal greater than 0.

Returns
An INTEGER.
The default for n is 1. If n is greater than the number of rows in the window, each row is placed in its own bucket and the remaining buckets are left empty. You must use the ORDER BY clause with this function.
If the order is non-unique, the result is non-deterministic.

Examples
> SELECT a, b, ntile(2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 1
A1 1 1
A1 2 2
A2 3 1
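
A sketch of the case where n exceeds the number of rows; each row lands in its own bucket (expected output shown):
> SELECT a, b, ntile(5) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 1), ('A1', 2), ('A1', 3) tab(a, b);
A1 1 1
A1 2 2
A1 3 3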

Related functions
Window functions
nullif function
7/21/2022 • 2 minutes to read

Returns NULL if expr1 equals expr2 , or expr1 otherwise.

Syntax
nullif(expr1, expr2)

Arguments
expr1 : An expression of any type.
expr2 : An expression of the same type as expr1 .

Returns
NULL if expr1 equals to expr2 , or expr1 otherwise.

Examples
> SELECT nullif(2, 2);
NULL
> SELECT nullif(2, 3);
2

Related functions
coalesce function
decode function
case expression
nvl function
7/21/2022 • 2 minutes to read

Returns expr2 if expr1 is NULL , or expr1 otherwise.

Syntax
nvl(expr1, expr2)

Arguments
expr1 : An expression of any type.
expr2 : An expression that shares a least common type with expr1 .

Returns
The result type is the least common type of the argument types.
This function is a synonym for coalesce(expr1, expr2) .

Examples
> SELECT nvl(NULL, 2);
2
> SELECT nvl(3, 2);
3

Related functions
coalesce function
nvl2 function
nvl2 function
7/21/2022 • 2 minutes to read

Returns expr2 if expr1 is not NULL , or expr3 otherwise.

Syntax
nvl2(expr1, expr2, expr3)

Arguments
expr1 : An expression of any type.
expr2 : An expression of any type.
expr3 : An expression that shares a least common type with expr2 .

Returns
The result is least common type of expr2 and expr3 .
This function is a synonym for CASE WHEN expr1 IS NOT NULL THEN expr2 ELSE expr3 END .

Examples
> SELECT nvl2(NULL, 2, 1);
1
> SELECT nvl2('spark', 2, 1);
2

Related functions
coalesce function
case expression
nvl function
octet_length function
7/21/2022 • 2 minutes to read

Returns the byte length of string data or number of bytes of binary data.

Syntax
octet_length(expr)

Arguments
expr : A STRING or BINARY expression.

Returns
An INTEGER.

Examples
> SELECT octet_length('Spark SQL');
9
> SELECT octet_length('서울시');
9

Related functions
bit_length function
length function
char_length function
character_length function
or operator
7/21/2022 • 2 minutes to read

Returns the logical OR of expr1 and expr2 .

Syntax
expr1 or expr2

Arguments
expr1 : A BOOLEAN expression.
expr2 : A BOOLEAN expression.

Returns
A BOOLEAN.

Examples
> SELECT true or false;
true
> SELECT false or false;
false
> SELECT true or NULL;
true
> SELECT false or NULL;
NULL

Related functions
and predicate
or operator
not operator
! (bang sign) operator
overlay function
7/21/2022 • 2 minutes to read

Replaces input with replace that starts at pos and is of length len .

Syntax
overlay(input, replace, pos[, len])

overlay(input PLACING replace FROM pos [FOR len])

Arguments
input : A STRING or BINARY expression.
replace : An expression of the same type as input .
pos : An INTEGER expression.
len : An optional INTEGER expression.

Returns
The result type matches the type of input .
If pos is negative the position is counted starting from the back. len must be greater than or equal to 0. len
specifies the length of the snippet within input to be replaced. The default for len is the length of replace .

Examples
> SELECT overlay('Spark SQL', 'ANSI ', 7, 0);
Spark ANSI SQL
> SELECT overlay('Spark SQL' PLACING '_' FROM 6);
Spark_SQL
> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7);
Spark CORE
> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0);
Spark ANSI SQL
> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4);
Structured SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
[53 70 61 72 6B 5F 53 51 4C]

Related functions
replace function
regexp_replace function
parse_url function
7/21/2022 • 2 minutes to read

Extracts a part from url .

Syntax
parse_url(url, partToExtract [, key] )

Arguments
url : A STRING expression.
partToExtract : A STRING expression.
key : A STRING expression.

Returns
A STRING.
partToExtract must be one of:
'HOST'
'PATH'
'QUERY'
'REF'
'PROTOCOL'
'FILE'
'AUTHORITY'
'USERINFO'

key is case sensitive.


If a requested partToExtract or key is not found, NULL is returned.
Databricks Runtime returns an error if the url string is invalid.

NOTE
If spark.sql.ansi.enabled is false parse_url returns NULL if the url string is invalid.

Examples
> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
spark.apache.org

> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');


query=1

> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');


1

> SELECT parse_url('http://spark. apache.org/path?query=1', 'QUERY', 'query');


Error: Illegal argument

Related functions
percent_rank ranking window function
7/21/2022 • 2 minutes to read

Computes the percentage ranking of a value within the partition.

Syntax
percent_rank()

Arguments
The function takes no arguments

Returns
A DOUBLE.
The function is defined as the rank within the window minus one, divided by the number of rows within the window minus one. If there is only one row in the window the rank is 0.
As an expression the semantics can be expressed as:
nvl((rank() OVER(PARTITION BY p ORDER BY o) - 1) / nullif(count(1) OVER(PARTITION BY p) - 1, 0), 0)

This function is similar, but not the same as cume_dist analytic window function.
You must include ORDER BY clause in the window specification.

Examples
> SELECT a, b, percent_rank() OVER (PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A1', 3), ('A1', 6), ('A1', 7), ('A1', 7), ('A2', 3), ('A1', 1)
tab(a, b)
A1 1 0.0
A1 1 0.0
A1 2 0.3333333333333333
A1 3 0.5
A1 6 0.6666666666666666
A1 7 0.8333333333333334
A1 7 0.8333333333333334
A2 3 0.0

Related functions
cume_dist analytic window function
rank ranking window function
Window functions
percentile aggregate function
7/21/2022 • 2 minutes to read

Returns the exact percentile value of expr at the specified percentage in a group.

Syntax
percentile ( [ALL | DISTINCT] expr, percentage [, frequency] ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
percentage : A numeric expression between 0 and 1 or an ARRAY of numeric expressions, each between 0
and 1.
frequency : An optional integral number literal greater than 0.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
DOUBLE if percentage is numeric, or an ARRAY of DOUBLE if percentage is an ARRAY.
Frequency describes the number of times expr must be counted. A frequency of 10 for a specific value is
equivalent to that value appearing 10 times in the window at a frequency of 1. The default frequency is 1.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT percentile(col, 0.3) FROM VALUES (0), (10), (10) AS tab(col);
6.0

> SELECT percentile(DISTINCT col, 0.3) FROM VALUES (0), (10), (10) AS tab(col);
3.0

> SELECT percentile(col, 0.3, freq) FROM VALUES (0, 1), (10, 2) AS tab(col, freq);
6.0

> SELECT percentile(col, array(0.25, 0.75)) FROM VALUES (0), (10) AS tab(col);
[2.5,7.5]

Related functions
approx_percentile aggregate function
percentile_approx aggregate function
percentile_cont aggregate function
percentile_approx aggregate function
7/21/2022 • 2 minutes to read

Returns the approximate percentile of the expr within the group.

Syntax
percentile_approx ( [ALL | DISTINCT ] expr, percentile [, accuracy] ) [FILTER ( WHERE cond ) ]

Arguments
expr : A numeric expression.
percentile : A numeric literal between 0 and 1 or a literal array of numeric values, each between 0 and 1.
accuracy : An INTEGER literal greater than 0. If accuracy is omitted it is set to 10000 .
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The aggregate function returns the expression which is the smallest value in the ordered group (sorted from least to greatest) such that no more than percentile of expr values is less than or equal to that value. If percentile is an array, percentile_approx returns the approximate percentile array of expr at the specified percentile.
The accuracy parameter controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for approx_percentile aggregate function.

Examples
> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), 100)
FROM VALUES (0), (1), (2), (10) AS tab(col);
[1,1,0]

> SELECT percentile_approx(col, 0.5, 100)


FROM VALUES (0), (6), (7), (9), (10), (10), (10) AS tab(col);
9

> SELECT percentile_approx(DISTINCT col, 0.5, 100)


FROM VALUES (0), (6), (7), (9), (10), (10), (10) AS tab(col);
7

Related functions
approx_percentile aggregate function
approx_count_distinct aggregate function
percentile aggregate function
percentile_cont aggregate function
percentile_cont aggregate function
7/21/2022 • 2 minutes to read

Returns the value that corresponds to the percentile of the provided sortKeys using a continuous distribution model.
Since: Databricks Runtime 10.3

Syntax
percentile_cont ( percentile )
WITHIN GROUP (ORDER BY sortKey [ASC | DESC] )

Arguments
percentile : A numeric literal between 0 and 1 or a literal array of numeric literals, each between 0 and 1.
sortKey : A numeric expression over which the percentile will be computed.
ASC or DESC : Optionally specify whether the percentile is computed using ascending or descending order.
The default is ASC .

Returns
DOUBLE if percentile is numeric, or an ARRAY of DOUBLE if percentile is an ARRAY.
The aggregate function returns the interpolated percentile within the group of sortKeys .

Examples
-- Return the median, 40%-ile and 10%-ile.
> SELECT percentile_cont(array(0.5, 0.4, 0.1)) WITHIN GROUP (ORDER BY col)
FROM VALUES (0), (1), (2), (10) AS tab(col);
[1.5, 1.2000000000000002, 0.30000000000000004]

-- Return the interpolated median.


> SELECT percentile_cont(0.50) WITHIN GROUP (ORDER BY col)
FROM VALUES (0), (6), (6), (7), (9), (10) AS tab(col);
6.5

Related functions
percentile_approx aggregate function
approx_count_distinct aggregate function
percentile aggregate function
percentile_disc aggregate function
percentile_disc aggregate function
7/21/2022 • 2 minutes to read

Returns the value that corresponds to the percentile of the provided sortKey using a discrete distribution
model.
Since: Databricks Runtime 11.0

Syntax
percentile_disc ( percentile )
WITHIN GROUP (ORDER BY sortKey [ASC | DESC] )

Arguments
percentile : A numeric literal between 0 and 1 or a literal array of numeric literals, each between 0 and 1.
sortKey : A numeric expression over which the percentile is computed.
ASC or DESC : Optionally specify whether the percentile is computed using ascending or descending order.
The default is ASC .

Returns
DOUBLE if percentile is numeric, or an ARRAY of DOUBLE if percentile is an ARRAY.
The aggregate function returns the sortKey value that matches the percentile within the group of sortKeys .

Examples
-- Return the median, 40%-ile and 10%-ile.
> SELECT percentile_disc(array(0.5, 0.4, 0.1)) WITHIN GROUP (ORDER BY col)
FROM VALUES (0), (1), (2), (10) AS tab(col);
[1, 1, 0]

-- Return the discrete median.


> SELECT percentile_disc(0.50) WITHIN GROUP (ORDER BY col)
FROM VALUES (0), (6), (6), (7), (9), (10) AS tab(col);
6

Related functions
percentile_approx aggregate function
approx_count_distinct aggregate function
percentile aggregate function
percentile_cont aggregate function
% (percent sign) operator
7/21/2022 • 2 minutes to read

Returns the remainder after dividend / divisor .

Syntax
dividend % divisor

Arguments
dividend : An expression that evaluates to a numeric.
divisor : An expression that evaluates to a numeric.

Returns
If both dividend and divisor are of DECIMAL, the result matches the divisor’s type. In all other cases, a
DOUBLE.
If divisor is 0 (zero) the function raises a DIVIDE_BY_ZERO error.
This function is equivalent to mod function.

Examples
> SELECT 2 % 1.8;
0.2

> SELECT 2 % 0;


Error: DIVIDE_BY_ZERO

Related functions
mod function
/ (slash sign) operator
pmod function
pi function
7/21/2022 • 2 minutes to read

Returns pi.

Syntax
pi()

Arguments
The function takes no argument.

Returns
A DOUBLE.

Examples
> SELECT pi();
3.141592653589793

Related functions
e function
|| (pipe pipe sign) operator
7/21/2022 • 2 minutes to read

Returns the concatenation of expr1 and expr2 .

Syntax
expr1 || expr2

Arguments
expr1 : A STRING, BINARY or ARRAY of STRING or BINARY expression.
expr2 : An expression with type matching expr1 .

Returns
The result type matches the argument types.
This operator is a synonym for concat function

Examples
> SELECT 'Spark' || 'SQL';
SparkSQL
> SELECT array(1, 2, 3) || array(4, 5) || array(6);
[1,2,3,4,5,6]

Related functions
concat function
array_join function
array_union function
concat_ws function
| (pipe sign) operator
7/21/2022 • 2 minutes to read

Returns the bitwise OR of expr1 and expr2 .

Syntax
expr1 | expr2

Arguments
expr1 : An integral numeric type expression.
expr2 : An integral numeric type expression.

Returns
The result type matches the widest type of expr1 and expr2 .

Examples
> SELECT 3 | 5;
7

Related functions
& (ampersand sign) operator
~ (tilde sign) operator
^ (caret sign) operator
bit_count function
+ (plus sign) operator
7/21/2022 • 2 minutes to read

Returns the sum of expr1 and expr2 .

Syntax
expr1 + expr2

Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : If expr1 is a numeric, expr2 must be a numeric expression; otherwise it must be an INTERVAL.

Returns
If expr1 is a numeric, the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
When you add a year-month interval to a DATE, Databricks Runtime ensures that the resulting date is well-
formed.
If the result overflows the result type, Databricks Runtime raises an ARITHMETIC_OVERFLOW error.
Use try_add to return NULL on overflow.

WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.

Examples
> SELECT 1 + 2;
3

> SELECT DATE'2021-03-20' + INTERVAL '2' MONTH


2021-05-20

> SELECT TIMESTAMP'2021-03-20 12:15:29' + INTERVAL '3' SECOND


2021-03-20 12:15:32

> SELECT typeof(INTERVAL '3' DAY + INTERVAL '2' HOUR)


interval day to hour

> SELECT DATE'2021-03-31' + INTERVAL '1' MONTH;


2021-04-30

> SELECT 127Y + 1Y;


Error: ARITHMETIC_OVERFLOW

Related functions
* (asterisk sign) operator
- (minus sign) operator
/ (slash sign) operator
sum aggregate function
try_add function
try_divide function
+ (plus sign) unary operator
7/21/2022 • 2 minutes to read

Returns the value of expr .

Syntax
+ expr

Arguments
expr : An expression that evaluates to a numeric or INTERVAL.

Returns
The result type matches the argument.
This function is a no-op.
+ is a synonym for positive function.

Examples
> SELECT +(1);
1
> SELECT +(-1);
-1

Related functions
negative function
abs function
sign function
- (minus sign) unary operator
positive function
pmod function
7/21/2022 • 2 minutes to read

Returns the positive remainder after dividend / divisor .

Syntax
pmod(dividend, divisor)

Arguments
dividend : An expression that evaluates to a numeric.
divisor : An expression that evaluates to a numeric.

Returns
If both dividend and divisor are of DECIMAL the result matches the type of divisor . In all other cases a
DOUBLE.
If divisor is 0 the function raises a DIVIDE_BY_ZERO error.
Unlike the mod function, the result is non-negative when divisor is positive (for example, pmod(-10, 3) returns 2 while mod(-10, 3) returns -1).

Examples
> SELECT pmod(10, 3);
1

> SELECT pmod(-10, 3);


2

> SELECT pmod(-10, 0);


Error: DIVIDE_BY_ZERO
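
A sketch contrasting pmod with mod for a negative dividend, matching the example above (expected output: 2 and -1):
> SELECT pmod(-10, 3), mod(-10, 3);
2 -1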

Related functions
mod function
% (percent sign) operator
posexplode table-valued generator function
7/21/2022 • 2 minutes to read

Returns rows by un-nesting the array with numbering of positions.

Syntax
posexplode(expr)

Arguments
expr : An ARRAY or MAP expression.

Returns
A set of rows composed of the other expressions in the select list, the position of the elements in the array or
map, and the elements of the array, or keys and values of the map.
If expr is NULL , no rows are produced.
The columns produced by posexplode of an array are named pos, and col by default, but can be aliased. You can
also alias them using an alias tuple such as AS (myPos, myValue) .
The columns for maps are by default called pos, key and value. You can also alias them using an alias tuple such
as AS (myPos, myKey, myValue) .
You can place posexplode only in the select list or a LATERAL VIEW. When placing the function in the select list
there must be no other generator function in the same select list.

Examples
> SELECT posexplode(array(10, 20)) AS (r, elem), 'Spark';
0 10 Spark
1 20 Spark
> SELECT posexplode(map(1, 'a', 2, 'b')) AS (r, num, val), 'Spark';
0 1 a Spark
1 2 b Spark
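
A sketch of the LATERAL VIEW form mentioned above; the inline table alias t(x) and view alias p are hypothetical names used only for illustration:

> SELECT x, pos, col
    FROM VALUES (array(10, 20)) AS t(x)
    LATERAL VIEW posexplode(x) p AS pos, col;
[10,20] 0 10
[10,20] 1 20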

Related functions
explode table-valued generator function
explode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
posexplode_outer table-valued generator function
stack table-valued generator function
posexplode_outer table-valued generator function

Returns rows by un-nesting the array with numbering of positions using OUTER semantics.

Syntax
posexplode_outer(expr)

Arguments
expr : An ARRAY or MAP expression.

Returns
A set of rows composed of the other expressions in the select list, the position of the elements in the array or
map, and the elements of the array, or keys and values of the map.
If expr is NULL , a single row with NULLs for the array or map values.
The columns produced by posexplode_outer of an array are named pos and col by default, but can be aliased.
You can also alias them using an alias tuple such as AS (myPos, myValue) .
The columns for maps are by default called pos , key , and value . You can also alias them using an alias tuple
such as AS (myPos, myKey, myValue) .
You can place posexplode_outer only in the select list or a LATERAL VIEW . When placing the function in the select
list there must be no other generator function in the same select list.

Examples
> SELECT posexplode_outer(array(10, 20)) AS (r, elem), 'Spark';
0 10 Spark
1 20 Spark
> SELECT posexplode_outer(map(1, 'a', 2, 'b')) AS (r, num, val), 'Spark';
0 1 a Spark
1 2 b Spark
> SELECT posexplode_outer(cast(NULL AS array<int>)), 'Spark';
NULL Spark

Related functions
explode table-valued generator function
explode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
posexplode table-valued generator function
position function

Returns the position of the first occurrence of substr in str after position pos .

Syntax
position(substr, str [, pos] )

position(substr IN str)

Arguments
substr : A STRING expression.
str : A STRING expression.
pos : An INTEGER expression.

Returns
An INTEGER.
The specified pos and return value are 1-based. If pos is omitted, substr is searched from the beginning of
str . If pos is less than 1, the result is 0.

This function is a synonym for locate function.

Examples
> SELECT position('bar', 'foobarbar');
4
> SELECT position('bar', 'foobarbar', 5);
7
> SELECT position('bar' IN 'foobarbar');
4
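
As a hedged illustration of the note above, a pos value less than 1 yields 0:

> SELECT position('bar', 'foobarbar', 0);
0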

Related functions
charindex function
instr function
locate function
positive function

Returns the value of expr .

Syntax
positive(expr)

Arguments
expr : An expression that evaluates to a numeric or INTERVAL.

Returns
The result type matches the argument.
This function is a no-op.
positive is a synonym for + (plus sign) unary operator.

Examples
> SELECT positive(1);
1
> SELECT positive(-1);
-1

Related functions
negative function
abs function
sign function
+ (plus sign) unary operator
- (minus sign) unary operator
pow function

Raises expr1 to the power of expr2 .

Syntax
pow(expr1, expr2)

Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.

Returns
A DOUBLE.
This function is a synonym for power function.

Examples
> SELECT pow(2, 3);
8.0

Related functions
exp function
log function
log10 function
ln function
power function
power function

Raises expr1 to the power of expr2 .

Syntax
power(expr1, expr2)

Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.

Returns
A DOUBLE.
This function is a synonym for pow function.

Examples
> SELECT power(2, 3);
8.0

Related functions
exp function
log function
log10 function
ln function
pow function
printf function

Returns a formatted string from printf-style format strings.

Syntax
printf(strfmt[, obj1, ...])

Arguments
strfmt : A STRING expression.
objN : A STRING or numeric expression.

Returns
A STRING.

Examples
> SELECT printf('Hello World %d %s', 100, 'days');
Hello World 100 days

Related functions
format_number function
format_string function
quarter function

Returns the quarter of the year for expr in the range 1 to 4.

Syntax
quarter(expr)

Arguments
expr : A DATE or TIMESTAMP expression.

Returns
An INTEGER.
This function is a synonym for extract(QUARTER FROM expr) .

Examples
> SELECT quarter('2016-08-31');
3

Related functions
day function
dayofweek function
dayofyear function
extract function
month function
radians function

Converts expr in degrees to radians.

Syntax
radians(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
Given an angle in degrees, returns the associated radians.

Examples
> SELECT radians(180);
3.141592653589793

Related functions
degrees function
pi function
raise_error function

Throws an exception with expr as the message.

Syntax
raise_error(expr)

Arguments
expr : A STRING expression.

Returns
The NULL type.
The function raises a runtime error with expr as the error message.

Examples
> SELECT raise_error('custom error message');
custom error message

Related functions
assert_true function
rand function

Returns a random value between 0 and 1.

Syntax
rand( [seed] )

Arguments
seed : An optional INTEGER literal.

Returns
A DOUBLE.
The function generates pseudo random results with independent and identically distributed (i.i.d.) uniformly
distributed values in [0, 1).
This function is non-deterministic.
rand is a synonym for random function.

Examples
> SELECT rand();
0.9629742951434543
> SELECT rand(0);
0.8446490682263027
> SELECT rand(null);
0.8446490682263027

Related functions
randn function
random function
randn function

Returns a random value from a standard normal distribution.

Syntax
randn( [seed] )

Arguments
seed : An optional INTEGER literal.

Returns
A DOUBLE.
The function generates pseudo random results with independent and identically distributed (i.i.d.) values
drawn from the standard normal distribution.
This function is non-deterministic.

Examples
> SELECT randn();
-0.3254147983080288
> SELECT randn(0);
1.1164209726833079
> SELECT randn(null);
1.1164209726833079

Related functions
rand function
random function
random function

Returns a random value between 0 and 1.

Syntax
random( [seed] )

Arguments
seed : An optional INTEGER literal.

Returns
A DOUBLE.
The function generates pseudo random results with independent and identically distributed uniformly
distributed values in [0, 1) .
This function is non-deterministic.
random is a synonym for rand function.

Examples
> SELECT rand();
0.9629742951434543
> SELECT rand(0);
0.8446490682263027
> SELECT rand(null);
0.8446490682263027

Related functions
randn function
rand function
range table-valued function

Returns a table of values within a specified range.

Syntax
range(end)

range(start, end [, step [, numParts] ] )

Arguments
start : An optional BIGINT literal defaulted to 0, marking the first value generated.
end : A BIGINT literal marking the endpoint (exclusive) of the number generation.
step : An optional BIGINT literal defaulted to 1, specifying the increment used when generating values.
numParts : An optional INTEGER literal specifying how the production of rows is spread across partitions.

Returns
A table with a single BIGINT column named id .

Examples
> SELECT spark_partition_id(), t.* FROM range(5) AS t;
3 0
6 1
9 2
12 3
15 4

> SELECT * FROM range(-3, 0);


-3
-2
-1

> SELECT spark_partition_id(), t.* FROM range(0, -5, -1, 2) AS t;


0 0
0 -1
1 -2
1 -3
1 -4

Related functions
sequence function
rank ranking window function

Returns the rank of a value compared to all values in the partition.

Syntax
rank()

Arguments
This function takes no arguments.

Returns
An INTEGER.
The OVER clause of the window function must include an ORDER BY clause.
Unlike the function dense_rank , rank will produce gaps in the ranking sequence. Unlike row_number , rank does
not break ties.
If the order is not unique, the duplicates share the same relative earlier position.

Examples
> SELECT a,
b,
dense_rank() OVER(PARTITION BY a ORDER BY b),
rank() OVER(PARTITION BY a ORDER BY b),
row_number() OVER(PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1 1 1
A1 1 1 1 2
A1 2 2 3 3
A2 3 1 1 1

Related functions
dense_rank ranking window function
row_number ranking window function
cume_dist analytic window function
Window functions
reflect function

Calls a method with reflection.

Syntax
reflect(class, method [, arg1] [, ...])

Arguments
class : A STRING literal specifying the Java class.
method : A STRING literal specifying the Java method.
argN : An expression with a type appropriate for the selected method.

Returns
A STRING.

Examples
> SELECT reflect('java.util.UUID', 'randomUUID');
c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

Related functions
java_method function
regexp operator

Returns true if str matches regex .


Since: Databricks Runtime 10.0

Syntax
str [NOT] regexp regex

Arguments
str : A STRING expression to be matched.
regex : A STRING expression with a matching pattern.

Returns
A BOOLEAN.
The regex string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regex can be '^\\abc$' .

str NOT regexp ... is equivalent to NOT(str regexp ...) .


regexp is a synonym for rlike operator.

Examples
> SELECT '%SystemDrive%\\Users\\John' regexp '%SystemDrive%\\\\Users.*';
true

Related functions
ilike operator
like operator
regexp_extract_all function
regexp_replace function
split function
rlike operator
regexp_extract function

Extracts the first string in str that matches the regexp expression and corresponds to the regex group index.

Syntax
regexp_extract(str, regexp [, idx] )

Arguments
str : A STRING expression to be matched.
regexp : A STRING expression with a matching pattern.
idx : An optional integral numeric expression greater than or equal to 0. The default is 1.

Returns
A STRING.
The regexp string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regexp can be '^\\abc$' . regexp may contain multiple groups. idx
indicates which regex group to extract. An idx of 0 means matching the entire regular expression.

Examples
> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
100
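
As a hedged illustration of the note above, an idx of 0 returns the entire match rather than a single group:

> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 0);
100-200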

Related functions
ilike operator
like operator
regexp_extract_all function
regexp_replace function
split function
regexp operator
rlike operator
regexp_extract_all function

Extracts all strings in str that match the regexp expression and correspond to the regex group index.

Syntax
regexp_extract_all(str, regexp [, idx] )

Arguments
str : A STRING expression to be matched.
regexp : A STRING expression with a matching pattern.
idx : An optional integral numeric expression greater than or equal to 0. The default is 1.

Returns
An ARRAY of STRING.
The regexp string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regexp can be '^\\abc$' . regexp may contain multiple groups. idx
indicates which regex group to extract. An idx of 0 means match the entire regular expression.

Examples
> SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
[100, 300]

Related functions
ilike operator
like operator
regexp_extract function
regexp_replace function
split function
regexp_like function

Returns true if str matches regex .

Syntax
regexp_like( str, regex )

Arguments
str : A STRING expression to be matched.
regex : A STRING expression with a matching pattern.

Since: Databricks Runtime 10.0

Returns
A BOOLEAN.
The regex string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regex can be '^\\abc$' .

regexp_like is a function synonym for rlike operator.

Examples
> SELECT regexp_like('%SystemDrive%\\Users\\John', '%SystemDrive%\\\\Users.*');
true

Related functions
ilike operator
like operator
regexp operator
regexp_extract_all function
regexp_replace function
rlike operator
split function
regexp_replace function

Replaces all substrings of str that match regexp with rep .

Syntax
regexp_replace(str, regexp, rep [, position] )

Arguments
str : A STRING expression to be matched.
regexp : A STRING expression with a matching pattern.
rep : A STRING expression which is the replacement string.
position : An optional integral numeric literal greater than 0, stating where to start matching. The default is 1.

Returns
A STRING.
The regexp string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regexp can be '^\\abc$' . Searching starts at position . The default is 1, which
marks the beginning of str . If position exceeds the character length of str , the result is str .

Examples
> SELECT regexp_replace('100-200', '(\\d+)', 'num');
num-num
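
As a hedged illustration of the note above, a position beyond the character length of str leaves the string unchanged:

> SELECT regexp_replace('100-200', '(\\d+)', 'num', 8);
100-200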

Related functions
ilike operator
like operator
regexp_extract function
regexp_extract_all function
regr_avgx aggregate function

Returns the mean of xExpr calculated from values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 10.5

Syntax
regr_avgx( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]

Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.

Returns
The result type depends on the type of xExpr :
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
Otherwise the result is a DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified the average is computed after duplicates have been removed.
regr_avgx(y, x) is a synonym for avg(x) FILTER(WHERE x IS NOT NULL AND y IS NOT NULL) .

Examples
> SELECT regr_avgx(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
2.6666666666666665

Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
regr_avgy aggregate function

Returns the mean of yExpr calculated from values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 10.5

Syntax
regr_avgy( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]

Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.

Returns
The result type depends on the type of yExpr :
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
Otherwise the result is a DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified the average is computed after duplicates have been removed.
regr_avgy(y, x) is a synonym for avg(y) FILTER(WHERE x IS NOT NULL AND y IS NOT NULL) .

Examples
> SELECT regr_avgy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
1.6666666666666667

Related functions
avg aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
regr_count aggregate function

Returns the number of non-null value pairs yExpr , xExpr in the group.
Since: Databricks Runtime 10.5

Syntax
regr_count ( [ALL | DISTINCT] yExpr, xExpr ) [FILTER ( WHERE cond ) ]

Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.

Returns
A BIGINT.
regr_count(yExpr, xExpr) is equivalent to count_if(yExpr IS NOT NULL AND xExpr IS NOT NULL) .
If DISTINCT is specified only unique rows are counted.

Examples
> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS t(y, x);
4

> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, NULL), (2, 3), (2, 4) AS t(y, x);
3

> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, NULL), (NULL, 3), (2, 4) AS t(y, x);
2

Related functions
avg aggregate function
count aggregate function
count_if aggregate function
min aggregate function
max aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
sum aggregate function
regr_r2 aggregate function

Returns the coefficient of determination from values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 11.0

Syntax
regr_r2( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]

Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.

Returns
A DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the result is computed after duplicates are removed.

Examples
> SELECT regr_r2(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
1

Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxx aggregate function

Returns the sum of squares of the xExpr values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 11.0

Syntax
regr_sxx( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]

Arguments
yExpr : A numeric expression, the dependent variable .
xExpr : A numeric expression, the independent variable .
cond : An optional boolean expression filtering the rows used for the function.

Returns
The result type is DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the result is computed after duplicates are removed.
regr_sxx(y, x) is a synonym for regr_count(y, x) * var_pop(x) .

Examples
> SELECT regr_sxx(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
0.6666666666666666

Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxy aggregate function

Returns the sum of products of yExpr and xExpr calculated from values of a group where xExpr and yExpr
are NOT NULL.
Since: Databricks Runtime 11.0

Syntax
regr_sxy( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]

Arguments
yExpr : A numeric expression, the dependent variable .
xExpr : A numeric expression, the independent variable .
cond : An optional boolean expression filtering the rows used for the function.

Returns
The result type is a DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the result is computed after duplicates are removed.
regr_sxy(y, x) is a synonym for regr_count(y, x) * covar_pop(y, x) .

Examples
> SELECT regr_sxy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
0.6666666666666666

Related functions
covar_pop aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_syy aggregate function

Returns the sum of squares of the yExpr values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 11.0

Syntax
regr_syy( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]

Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.

Returns
The result type is DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the result is computed after duplicates are removed.
regr_syy(y, x) is a synonym for regr_count(y, x) * var_pop(y) .

Examples
> SELECT regr_syy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
0.6666666666666666

Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
repeat function

Returns the string that repeats expr n times.

Syntax
repeat(expr, n)

Arguments
expr : A STRING expression.
n : An INTEGER expression.

Returns
A STRING.
If n is less than 1, an empty string.

Examples
> SELECT repeat('123', 2);
123123

Related functions
space function
replace function

Replaces all occurrences of search with replace .

Syntax
replace(str, search [, replace] )

Arguments
str : A STRING expression to be searched.
search : A STRING expression to be replaced.
replace : An optional STRING expression to replace search with. The default is an empty string.

Returns
A STRING.
If you do not specify replace, or if it is an empty string, the matched search string is removed from str .

Examples
> SELECT replace('ABCabc', 'abc', 'DEF');
ABCDEF
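
As a hedged illustration of the default behavior described above, omitting replace removes the matched string:

> SELECT replace('ABCabc', 'abc');
ABC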

Related functions
overlay function
signum function
regexp_replace function
translate function
reverse function

Returns a reversed string or an array with reverse order of elements.

Syntax
reverse(expr)

Arguments
expr : A STRING or ARRAY expression.

Returns
The result type matches the type of expr .

Examples
> SELECT reverse('Spark SQL');
LQS krapS
> SELECT reverse(array(2, 1, 4, 3));
[3,4,1,2]

Related functions
right function

Returns the rightmost len characters from the string str .

Syntax
right(str, len)

Arguments
str : A STRING expression.
len : An integral number expression.

Returns
A STRING.
If len is less than or equal to 0, the result is an empty string.

Examples
> SELECT right('Spark SQL', 3);
SQL

Related functions
substr function
substring function
left function
rint function

Returns expr rounded to a whole number as a DOUBLE.

Syntax
rint(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
This function is a synonym for round(expr, 0) .

Examples
> SELECT rint(12.3456);
12.0

Related functions
ceil function
ceiling function
floor function
round function
rlike operator

Returns true if str matches regex .

Syntax
str [NOT] rlike regex

Arguments
str : A STRING expression to be matched.
regex : A STRING expression with a matching pattern.

Returns
A BOOLEAN.
The regex string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regex can be '^\\abc$' .

rlike is a synonym for regexp operator.


str NOT rlike ... is equivalent to NOT(str rlike ...) .

Examples
> SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*';
true

Related functions
ilike operator
like operator
regexp operator
regexp_extract_all function
regexp_like function
regexp_replace function
split function
round function

Returns the rounded expr using HALF_UP rounding mode.

Syntax
round(expr [,targetScale] )

Arguments
expr : A numeric expression.
targetScale : An INTEGER expression greater than or equal to 0. If targetScale is omitted, the default is 0.

Returns
If expr is DECIMAL the result is DECIMAL with a scale that is the smaller of expr scale and targetScale .
In HALF_UP rounding, the digit 5 is rounded up.

Examples
> SELECT round(2.5, 0);
3
> SELECT round(2.6, 0);
3
> SELECT round(3.5, 0);
4
> SELECT round(2.25, 1);
2.2
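
For contrast, bround (listed under Related functions) uses HALF_EVEN rounding, where a trailing 5 rounds to the nearest even digit; a hedged example:

> SELECT bround(2.5, 0);
2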

Related functions
floor function
ceiling function
ceil function
bround function
row_number ranking window function

Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the
window partition.

Syntax
row_number()

Arguments
The function takes no arguments.

Returns
An INTEGER.
The OVER clause of the window function must include an ORDER BY clause. Unlike rank and dense_rank ,
row_number breaks ties.

If the order is not unique, the result is non-deterministic.

Examples
> SELECT a,
b,
dense_rank() OVER(PARTITION BY a ORDER BY b),
rank() OVER(PARTITION BY a ORDER BY b),
row_number() OVER(PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1 1 1
A1 1 1 1 2
A1 2 2 3 3
A2 3 1 1 1

Related functions
rank ranking window function
dense_rank ranking window function
cume_dist analytic window function
Window functions
rpad function

Returns expr , right-padded with pad to a length of len .

Syntax
rpad(expr, len [, pad] )

Arguments
expr : A STRING or BINARY expression to be padded.
len : An INTEGER expression.
pad : An optional STRING or BINARY expression with the pattern for padding. The default is a space
character for STRING and x'00' for BINARY.

Returns
A BINARY if both expr and pad are BINARY. Otherwise, returns a STRING.
If expr is longer than len , the return value is shortened to len characters. If you do not specify pad , a
STRING expr is padded to the right with space characters, whereas a BINARY expr is padded to the right with
x'00' bytes. If len is less than 1, the result is an empty string.
BINARY is supported since: Databricks Runtime 11.0.

Examples
> SELECT rpad('hi', 5, 'ab');
hiaba
> SELECT rpad('hi', 1, '??');
h
> SELECT rpad('hi', 5);
hi

> SELECT hex(rpad(x'1020', 5, x'05'))


1020050505

Related functions
lpad function
ltrim function
rtrim function
trim function
rtrim function

Returns str with trailing characters removed.

Syntax
rtrim( [trimStr ,] str)

Arguments
trimStr : An optional STRING expression with characters to be trimmed. The default is a space character.
str : A STRING expression to be trimmed.

Returns
A STRING.
The function removes any trailing characters within trimStr from str .

Examples
> SELECT rtrim('SparkSQL ') || '+';
SparkSQL+
> SELECT rtrim('ab', 'SparkSQLabcaaba');
SparkSQLabc

Related functions
btrim function
lpad function
ltrim function
rpad function
trim function
schema_of_csv function

Returns the schema of a CSV string in DDL format.

Syntax
schema_of_csv(csv [, options] )

Arguments
csv : A STRING literal with valid CSV data.
options : An optional MAP literal where keys and values are STRING.

Returns
A STRING composing a struct. The field names are derived by position as _cN . The values hold the derived
formatted SQL types. For details on options see from_csv function.

Examples
> DESCRIBE SELECT schema_of_csv('1,abc');
STRUCT<`_c0`: INT, `_c1`: STRING>

Related functions
from_csv function
schema_of_json function

Returns the schema of a JSON string in DDL format.

Syntax
schema_of_json(json [, options] )

Arguments
json : A STRING literal with JSON.
options : An optional MAP literal with keys and values being STRING.

Returns
A STRING holding a definition of an array of structs with n fields of strings where the column names are
derived from the JSON keys. The field values hold the derived formatted SQL types. For details on options, see
from_json function.

Examples
> SELECT schema_of_json('[{"col":0}]');
ARRAY<STRUCT<`col`: BIGINT>>
> SELECT schema_of_json('[{"col":01}]', map('allowNumericLeadingZeros', 'true'));
ARRAY<STRUCT<`col`: BIGINT>>

Related functions
from_json function
sec function

Returns the secant of expr .


Since: Databricks Runtime 10.1

Syntax
sec(expr)

Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.

Returns
A DOUBLE.
sec(expr) is equivalent to 1 / cos(expr)

Examples
> SELECT sec(pi());
-1.0

> SELECT sec(0);


1.0

Related functions
acos function
cos function
cosh function
csc function
sin function
tan function
second function

Returns the second component of the timestamp in expr .

Syntax
second(expr)

Arguments
expr : A TIMESTAMP expression.

Returns
An INTEGER.
This function is equivalent to int(extract(SECOND FROM timestamp)) .

Examples
> SELECT second('2009-07-30 12:58:59');
59

Related functions
extract function
hour function
minute function
sentences function

Splits str into an array of array of words.

Syntax
sentences(str [, lang, country] )

Arguments
str : A STRING expression to be parsed.
lang : An optional STRING expression with a language code from ISO 639 Alpha-2 (e.g. ‘DE’) , Alpha-3, or a
language subtag of up to 8 characters.
country : An optional STRING expression with a country code from ISO 3166 alpha-2 country code or a UN
M.49 numeric-3 area code.

Returns
An ARRAY of ARRAY of STRING.
The default for lang is 'en' and for country is 'US'.

Examples
> SELECT sentences('Hi there! Good morning.');
[[Hi, there],[Good, morning]]
> SELECT sentences('Hi there! Good morning.', 'en', 'US');
[[Hi, there],[Good, morning]]

Related functions
split function
sequence function

Generates an array of elements from start to stop (inclusive), incrementing by step .

Syntax
sequence(start, stop [, step] )

Arguments
start : An expression of an integral numeric type, DATE, or TIMESTAMP.
stop : If start is numeric, an integral numeric; otherwise a DATE or TIMESTAMP.
step : An INTERVAL expression if start is a DATE or TIMESTAMP, or an integral numeric otherwise.

Returns
An ARRAY of least common type of start and stop .
By default step is 1 if start is less than or equal to stop , otherwise -1.
For DATE or TIMESTAMP sequences the default step is INTERVAL '1' DAY or INTERVAL '-1' DAY, respectively.
If start is greater than stop then step must be negative, and vice versa.

Examples
> SELECT sequence(1, 5);
[1,2,3,4,5]

> SELECT sequence(5, 1);


[5,4,3,2,1]

> SELECT sequence(DATE'2018-01-01', DATE'2018-03-01', INTERVAL 1 MONTH);


[2018-01-01,2018-02-01,2018-03-01]

Related functions
repeat function
sha function

Returns a sha1 hash value as a hex string of expr .

Syntax
sha(expr)

Arguments
expr : A BINARY or STRING expression.

Returns
A STRING.
This function is a synonym for sha1 function.

Examples
> SELECT sha('Spark');
85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c

Related functions
sha1 function
sha2 function
crc32 function
md5 function
hash function
sha1 function

Returns a sha1 hash value as a hex string of expr .

Syntax
sha1(expr)

Arguments
expr : A BINARY or STRING expression.

Returns
A STRING.
This function is a synonym for sha function.

Examples
> SELECT sha1('Spark');
85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c

Related functions
sha function
sha2 function
crc32 function
md5 function
hash function
sha2 function

Returns a checksum of the SHA-2 family as a hex string of expr .

Syntax
sha2(expr, bitLength)

Arguments
expr : A BINARY or STRING expression.
bitLength : An INTEGER expression.

Returns
A STRING.
bitLength can be 0, 224 , 256 , 384 , or 512 . bitLength 0 is equivalent to 256 .

Examples
> SELECT sha2('Spark', 256);
529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b

Related functions
sha function
sha1 function
crc32 function
md5 function
hash function
shiftleft function

Returns expr bitwise left shifted by n bits.

Syntax
shiftleft(expr, n)

Arguments
expr : An INTEGER or BIGINT expression.
n : An INTEGER expression.

Returns
The result matches the type of expr .
If n is less than 0 the result is 0.

Examples
> SELECT shiftleft(2, 1);
4

Related functions
shiftright function
shiftrightunsigned function
shiftright function

Returns expr bitwise right shifted by n bits, preserving the sign.

Syntax
shiftright(expr, n)

Arguments
expr : An INTEGER or BIGINT expression.
n : An INTEGER expression specifying the number of bits to shift.

Returns
The result type matches expr .
When expr is negative (that is, the highest order bit is set) the result remains negative because the highest
order bit is sticky. When n is negative the result is 0.

Examples
> SELECT shiftright(4, 1);
2
> SELECT shiftright(-4, 1);
-2

Related functions
shiftleft function
shiftrightunsigned function
shiftrightunsigned function

Returns expr bitwise right shifted by n bits, without sign extension.

Syntax
shiftrightunsigned(expr, n)

Arguments
expr : An INTEGER or BIGINT expression.
n : An INTEGER expression specifying the number of bits to shift.

Returns
The result type matches expr .
When n is negative the result is 0.

Examples
> SELECT shiftrightunsigned(4, 1);
2
> SELECT shiftrightunsigned(-4, 1);
2147483646

Related functions
shiftleft function
shiftright function
shuffle function

Returns a random permutation of the array in expr .

Syntax
shuffle(expr)

Arguments
expr : An ARRAY expression.

Returns
The result type matches the type of expr .
This function is non-deterministic.

Examples
> SELECT shuffle(array(1, 20, 3, 5));
[3,1,5,20]
> SELECT shuffle(array(1, 20, NULL, 3));
[20,NULL,3,1]

Related functions
array_sort function
sort_array function
sign function

Returns -1.0, 0.0, or 1.0 as expr is negative, 0, or positive.

Syntax
sign(expr)

Arguments
expr : An expression that evaluates to a numeric or interval.

Interval is supported since : Databricks Runtime 10.1

Returns
A DOUBLE.
This function is a synonym for signum function.

Examples
> SELECT sign(40);
1.0

> SELECT sign(INTERVAL'-1' DAY)


-1.0

Related functions
abs function
signum function
negative function
positive function
signum function

Returns -1.0, 0.0, or 1.0 as expr is negative, 0, or positive.

Syntax
signum(expr)

Arguments
expr : An expression that evaluates to a numeric or interval.

Interval is supported since: Databricks Runtime 10.1

Returns
A DOUBLE.
This function is a synonym for sign function.

Examples
> SELECT signum(40);
1.0

> SELECT signum(INTERVAL'-1' DAY)


-1.0

Related functions
abs function
sign function
negative function
positive function
sin function

Returns the sine of expr .

Syntax
sin(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT sin(0);
0.0

Related functions
cos function
sinh function
tan function
sinh function

Returns the hyperbolic sine of expr .

Syntax
sinh(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.

Examples
> SELECT sinh(0);
0.0

Related functions
cos function
cosh function
sin function
tanh function
tan function
size function

Returns the cardinality of the array or map in expr .

Syntax
size(expr)

Arguments
expr : An ARRAY or MAP expression.

Returns
An INTEGER.

NOTE
If spark.sql.ansi.enabled is false size(NULL) returns -1 instead of NULL .

Examples
> SELECT size(array('b', 'd', 'c', 'a'));
4
> SELECT size(map('a', 1, 'b', 2));
2
> SELECT size(NULL);
NULL

Related functions
length function
skewness aggregate function

Returns the skewness value calculated from values of a group.

Syntax
skewness ( [ALL | DISTINCT ] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT skewness(col) FROM VALUES (-10), (-20), (100), (1000), (1000) AS tab(col);
0.3853941073355022
> SELECT skewness(DISTINCT col) FROM VALUES (-10), (-20), (100), (1000), (1000) AS tab(col);
1.1135657469022011
> SELECT skewness(col) FROM VALUES (-1000), (-100), (10), (20) AS tab(col);
-1.1135657469022011

Related functions
kurtosis aggregate function
/ (slash sign) operator

Returns dividend divided by divisor .

Syntax
dividend / divisor

Arguments
dividend : A numeric or INTERVAL expression.
divisor : A numeric expression.

Returns
If both dividend and divisor are DECIMAL, the result is DECIMAL.
If dividend is a year-month interval, the result is an INTERVAL YEAR TO MONTH .
If dividend is a day-time interval, the result is an INTERVAL DAY TO SECOND .
In all other cases, a DOUBLE.
If the divisor is 0, the operator returns a DIVIDE_BY_ZERO error.
Use try_divide to return NULL on division-by-zero.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for division by 0.

Examples
> SELECT 3 / 2;
1.5

> SELECT 2L / 2L;


1.0

> SELECT INTERVAL '3:15' HOUR TO MINUTE / 3


0 01:05:00.000000

> SELECT 3 / 0;
Error: DIVIDE_BY_ZERO

Related functions
* (asterisk sign) operator
div operator
- (minus sign) operator
+ (plus sign) operator
sum aggregate function
try_divide function
slice function

Returns a subset of an array.

Syntax
slice(expr, start, length)

Arguments
expr : An ARRAY expression.
start : An INTEGER expression.
length : An INTEGER expression that is greater or equal to 0.

Returns
The result is of the type of expr .
The function subsets array expr starting from index start (array indices start at 1), or starting from the end if
start is negative, with the specified length . If the requested array slice does not overlap with the actual length
of the array, an empty array is returned.

Examples
> SELECT slice(array(1, 2, 3, 4), 2, 2);
[2,3]
> SELECT slice(array(1, 2, 3, 4), -2, 2);
[3,4]

Related functions
array function
smallint function

Casts the value expr to SMALLINT.

Syntax
smallint(expr)

Arguments
expr : Any expression which is castable to SMALLINT.

Returns
The result is SMALLINT.
This function is a synonym for CAST(expr AS SMALLINT) .
See cast function for details.

Examples
> SELECT smallint(-5.6);
-5
> SELECT smallint('5');
5

Related functions
cast function
some aggregate function

Returns true if at least one value of expr in a group is true .

Syntax
some(expr) [FILTER ( WHERE cond ) ]

Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A BOOLEAN.

Examples
> SELECT some(col) FROM VALUES (true), (false), (false) AS tab(col);
true
> SELECT some(col) FROM VALUES (NULL), (true), (false) AS tab(col);
true
> SELECT some(col) FROM VALUES (false), (false), (NULL) AS tab(col);
false

Related functions
bool_and aggregate function
bool_or aggregate function
every aggregate function
sort_array function

Returns the array in expr in sorted order.

Syntax
sort_array(expr [, ascendingOrder] )

Arguments
expr : An ARRAY expression of sortable elements.
ascendingOrder : An optional BOOLEAN expression defaulting to true .

Returns
The result type matches expr .
Sorts the input array in ascending or descending order according to the natural ordering of the array elements.
NULL elements are placed at the beginning of the returned array in ascending order or at the end of the
returned array in descending order.

Examples
> SELECT sort_array(array('b', 'd', NULL, 'c', 'a'), true);
[NULL,a,b,c,d]
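
A hedged illustration of the descending case described above, with NULL elements placed at the end:

> SELECT sort_array(array('b', 'd', NULL, 'c', 'a'), false);
[d,c,b,a,NULL]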

Related functions
array_sort function
soundex function

Returns the soundex code of the string.

Syntax
soundex(expr)

Arguments
expr : A STRING expression.

Returns
A STRING.

Examples
> SELECT soundex('Miller');
M460

Related functions
space function

Returns a string consisting of n spaces.

Syntax
space(n)

Arguments
n : An INTEGER expression.

Returns
A STRING.
If n is less than or equal to 0 an empty string.

Examples
> SELECT concat('1', space(2), '1');
1 1

Related functions
repeat function
spark_partition_id function

Returns the current partition ID.

Syntax
spark_partition_id()

Arguments
The function takes no arguments.

Returns
An INTEGER.

Examples
> SELECT spark_partition_id();
0

Related functions
split function

Splits str around occurrences that match regex and returns an array with a length of at most limit .

Syntax
split(str, regex [, limit] )

Arguments
str : A STRING expression to be split.
regex : A STRING expression that is a Java regular expression used to split str .
limit : An optional INTEGER expression defaulting to 0 (no limit).

Returns
An ARRAY of STRING.
If limit > 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will
contain all input beyond the last matched regex .
If limit <= 0: regex will be applied as many times as possible, and the resulting array can be of any size.

Examples
> SELECT split('oneAtwoBthreeC', '[ABC]');
[one,two,three,]
> SELECT split('oneAtwoBthreeC', '[ABC]', -1);
[one,two,three,]
> SELECT split('oneAtwoBthreeC', '[ABC]', 2);
[one,twoBthreeC]

Related functions
regexp_extract function
regexp_extract_all function
split_part function
split_part function

Splits str around occurrences of delim and returns the partNum part.
Since: Databricks Runtime 11.0

Syntax
split_part(str, delim, partNum)

Arguments
str : A STRING expression to be split.
delim : A STRING expression serving as delimiter for the parts.
partNum : An INTEGER expression selecting the part to be returned.

Returns
A STRING.
If partNum >= 1: The partNum-th part counting from the beginning of str will be returned.
If partNum <= -1: The abs(partNum)-th part counting from the end of str will be returned.
partNum must not be 0. split_part returns an empty string if partNum is beyond the number of parts in str .

Examples
> SELECT '->' || split_part('Hello,world,!', ',', 1) || '<-';
->Hello<-

> SELECT '->' || split_part('Hello,world,!', ',', 2) || '<-';


-><-

> SELECT '->' || split_part('Hello,world,!', ',', 100) || '<-';


-><-

> SELECT '->' || split_part('Hello,world,!', ',', -2) || '<-';


->!<-

> SELECT '->' || split_part('Hello,world,!', ',', -100) || '<-';


-><-

> SELECT '->' || split_part('', ',', 1) || '<-';


-><-

> SELECT '->' || split_part('Hello', '', 3) || '<-';


-><-

> SELECT '->' || split_part('Hello,World,!', ',', 0) || '<-';


ERROR: Index out of bound
Related functions
split function
sqrt function

Returns the square root of expr .

Syntax
sqrt(expr)

Arguments
expr : An expression that evaluates to a numeric.

Returns
A DOUBLE.
If expr is negative the result is NaN.

Examples
> SELECT sqrt(4);
2.0

Related functions
cbrt function
stack table-valued generator function

Separates expr1 , …, exprN into numRows rows.

Syntax
stack(numRows, expr1 [, ...] )

Arguments
numRows : An INTEGER literal greater than 0 specifying the number of rows produced.
exprN : An expression of any type. The type of any exprN must match the type of expr(N+numRows) .

Returns
A set of numRows rows which includes all other columns in the select list and max(1, (N/numRows)) columns
produced by this function. An incomplete row is padded with NULL .
By default the produced columns are named col0, … col(n-1) . The column aliases can be specified using for
example, AS (myCol1, .. myColn) .
You can place stack only in the select list or a LATERAL VIEW. When placing the function in the select list there
must be no other generator function in the same select list.

Examples
SELECT 'hello', stack(2, 1, 2, 3) AS (first, second), 'world';
-- hello 1 2 world
-- hello 3 NULL world

Related functions
explode table-valued generator function
explode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
posexplode_outer table-valued generator function
posexplode table-valued generator function
startswith function

Returns true if expr begins with startExpr .


Since: Databricks Runtime 10.3

Syntax
startswith(expr, startExpr)

Arguments
expr : A STRING expression.
startExpr : A STRING expression which is compared to the start of expr .

Returns
A BOOLEAN.
If expr or startExpr is NULL , the result is NULL .
If startExpr is the empty string or empty binary the result is true .
Since: Databricks Runtime 10.5
The function operates in BINARY mode if both arguments are BINARY.

Examples
> SELECT startswith('SparkSQL', 'Spark');
true

> SELECT startswith('SparkSQL', 'spark');


false

> SELECT startswith('SparkSQL', NULL);


NULL

> SELECT startswith(NULL, 'Spark');


NULL

> SELECT startswith('SparkSQL', '');


true

> SELECT startswith(x'110033', x'11');


true

Related functions
contains function
endswith function
substr function
std aggregate function

Returns the sample standard deviation calculated from the values within the group.

Syntax
std ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for stddev aggregate function.

Examples
> SELECT std(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9574271077563381
> SELECT std(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0

Related functions
stddev aggregate function
stddev_pop aggregate function
stddev_samp aggregate function
stddev aggregate function

Returns the sample standard deviation calculated from the values within the group.

Syntax
stddev ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for std aggregate function.

Examples
> SELECT stddev(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9574271077563381
> SELECT stddev(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0

Related functions
std aggregate function
stddev_pop aggregate function
stddev_samp aggregate function
stddev_pop aggregate function

Returns the population standard deviation calculated from values of a group.

Syntax
stddev_pop ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT stddev_pop(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.82915619758885
> SELECT stddev_pop(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.816496580927726

Related functions
std aggregate function
stddev aggregate function
stddev_samp aggregate function
stddev_samp aggregate function

Returns the sample standard deviation calculated from values of a group.

Syntax
stddev_samp ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT stddev_samp(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9574271077563381
> SELECT stddev_samp(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0

Related functions
std aggregate function
stddev aggregate function
stddev_pop aggregate function
str_to_map function

Creates a map after splitting the input into key-value pairs using delimiters.

Syntax
str_to_map(expr [, pairDelim [, keyValueDelim] ] )

Arguments
expr : A STRING expression.
pairDelim : An optional STRING literal defaulting to ',' that specifies how to split entries.
keyValueDelim : An optional STRING literal defaulting to ':' that specifies how to split each key-value pair.

Returns
A MAP of STRING for both keys and values.
Both pairDelim and keyValueDelim are treated as regular expressions.

Examples
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');
{a -> 1, b -> 2, c -> 3}
> SELECT str_to_map('a');
{a->NULL}
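
A hedged illustration using the default delimiters (',' for pairs and ':' within each key-value pair):

> SELECT str_to_map('a:1,b:2,c:3');
{a -> 1, b -> 2, c -> 3}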

Related functions
map function
string function

Casts the value expr to STRING.

Syntax
string(expr)

Arguments
expr : An expression that can be cast to STRING.

Returns
A STRING.
This function is a synonym for cast(expr AS STRING) .

See cast function for details.

Examples
> SELECT string(5);
5
> SELECT string(current_date);
2021-04-01

Related functions
cast function
struct function

Creates a STRUCT with the specified field values.

Syntax
struct(expr1 [, ...] )

Arguments
exprN : An expression of any type.

Returns
A struct with fieldN matching the type of exprN .
Fields are named colN .

Examples
> SELECT struct(1, 2, 3);
{1, 2, 3}

Related functions
named_struct function
map function
str_to_map function
substr function

Returns the substring of expr that starts at pos and is of length len .

Syntax
substr(expr, pos [, len] )

substr(expr FROM pos[ FOR len])

Arguments
expr : A BINARY or STRING expression.
pos : An integral numeric expression specifying the starting position.
len : An optional integral numeric expression.

Returns
The result matches the type of expr .
pos is 1 based. If pos is negative the start is determined by counting characters (or bytes for BINARY) from the
end.
If len is less than 1 the result is empty.
If len is omitted the function returns all characters or bytes starting with pos .
This function is a synonym for substring function.

Examples
> SELECT substr('Spark SQL', 5);
k SQL
> SELECT substr('Spark SQL', -3);
SQL
> SELECT substr('Spark SQL', 5, 1);
k
> SELECT substr('Spark SQL' FROM 5);
k SQL
> SELECT substr('Spark SQL' FROM -3);
SQL
> SELECT substr('Spark SQL' FROM 5 FOR 1);
k
> SELECT substr('Spark SQL' FROM -10 FOR 5);
Spar

Related functions
substring function
substring function

Returns the substring of expr that starts at pos and is of length len .

Syntax
substring(expr, pos [, len])

substring(expr FROM pos [FOR len] )

Arguments
expr : A BINARY or STRING expression.
pos : An integral numeric expression specifying the starting position.
len : An optional integral numeric expression.

Returns
A STRING.
pos is 1 based. If pos is negative the start is determined by counting characters (or bytes for BINARY) from the
end.
If len is less than 1 the result is empty.
If len is omitted the function returns all characters or bytes starting with pos .
This function is a synonym for substr function.

Examples
> SELECT substring('Spark SQL', 5);
k SQL
> SELECT substring('Spark SQL', -3);
SQL
> SELECT substring('Spark SQL', 5, 1);
k
> SELECT substring('Spark SQL' FROM 5);
k SQL
> SELECT substring('Spark SQL' FROM -3);
SQL
> SELECT substring('Spark SQL' FROM 5 FOR 1);
k
> SELECT substring('Spark SQL' FROM -10 FOR 5);
Spar

Related functions
substr function
substring_index function

Returns the substring of expr before count occurrences of the delimiter delim .

Syntax
substring_index(expr, delim, count)

Arguments
expr : A STRING or BINARY expression.
delim : An expression matching the type of expr specifying the delimiter.
count : An INTEGER expression to count the delimiters.

Returns
The result matches the type of expr .
If count is positive, everything to the left of the final delimiter (counting from the left) is returned.
If count is negative, everything to the right of the final delimiter (counting from the right) is returned.

Examples
> SELECT substring_index('www.apache.org', '.', 2);
www.apache
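
As a hedged illustration of a negative count, everything to the right of the second delimiter counting from the right is returned:

> SELECT substring_index('www.apache.org', '.', -2);
apache.org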

Related functions
substr function
substring function
sum aggregate function

Returns the sum calculated from values of a group.

Syntax
sum ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric or interval.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
If expr is an integral number type, a BIGINT. If expr is DECIMAL(p, s) the result is
DECIMAL(p + min(10, 31-p), s) . If expr is an interval the result type matches expr .

Otherwise, a DOUBLE.
If DISTINCT is specified only unique values are summed up.
If the result overflows the result type, Databricks Runtime raises an ARITHMETIC_OVERFLOW error. To return
NULL instead, use try_sum .

WARNING
If spark.sql.ansi.enabled is false an overflow of BIGINT will not cause an error but “wrap” the result.

Examples
> SELECT sum(col) FROM VALUES (5), (10), (15) AS tab(col);
30

> SELECT sum(col) FILTER(WHERE col < 15)
    FROM VALUES (5), (10), (15) AS tab(col);
15

> SELECT sum(DISTINCT col) FROM VALUES (5), (10), (10), (15) AS tab(col);
30

> SELECT sum(col) FROM VALUES (NULL), (10), (15) AS tab(col);


25

> SELECT sum(col) FROM VALUES (NULL), (NULL) AS tab(col);


NULL

-- try_sum overflows a BIGINT


> SELECT try_sum(c1) FROM VALUES(5E18::BIGINT), (5E18::BIGINT) AS tab(c1);
NULL

-- In ANSI mode sum returns an error if it overflows BIGINT


> SELECT sum(c1) FROM VALUES(5E18::BIGINT), (5E18::BIGINT) AS tab(c1);
Error: ARITHMETIC_OVERFLOW

-- try_sum overflows an INTERVAL


> SELECT try_sum(c1) FROM VALUES(INTERVAL '100000000' YEARS), (INTERVAL '100000000' YEARS) AS tab(c1);
NULL

-- sum returns an error on INTERVAL overflow


> SELECT sum(c1) FROM VALUES(INTERVAL '100000000' YEARS), (INTERVAL '100000000' YEARS) AS tab(c1);
Error: ARITHMETIC_OVERFLOW

Related functions
aggregate function
avg aggregate function
max aggregate function
mean aggregate function
min aggregate function
try_avg aggregate function
try_sum aggregate function
tan function

Returns the tangent of expr .

Syntax
tan(expr)

Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.

Returns
A DOUBLE.

Examples
> SELECT tan(0);
0.0

Related functions
tanh function
cos function
sin function
tanh function

Returns the hyperbolic tangent of expr .

Syntax
tanh(expr)

Arguments
expr : An expression that evaluates to a numeric expressing the hyperbolic angle.

Returns
A DOUBLE.

Examples
> SELECT tanh(0);
0.0

Related functions
tan function
cosh function
sinh function
~ (tilde sign) operator

Returns the bitwise NOT of expr .

Syntax
~ expr

Arguments
expr : An integral numeric type expression.

Returns
The result type matches the type of expr .

Examples
> SELECT ~ 0;
-1

Related functions
& (ampersand sign) operator
| (pipe sign) operator
^ (caret sign) operator
bit_count function
timestamp function

Casts expr to TIMESTAMP.

Syntax
timestamp(expr)

Arguments
expr : Any expression that can be cast to TIMESTAMP.

Returns
A TIMESTAMP.
This function is a synonym for CAST(expr AS TIMESTAMP) .
For details see cast function.

Examples
> SELECT timestamp('2020-04-30 12:25:13.45');
2020-04-30 12:25:13.45
> SELECT timestamp(date'2020-04-30');
2020-04-30 00:00:00
> SELECT timestamp(123);
1969-12-31 16:02:03

Related functions
cast function
timestamp_micros function
7/21/2022 • 2 minutes to read

Creates a timestamp expr microseconds since UTC epoch.

Syntax
timestamp_micros(expr)

Arguments
expr : An integral numeric expression specifying microseconds.

Returns
A TIMESTAMP.

Examples
> SELECT timestamp_micros(1230219000123123);
2008-12-25 07:30:00.123123

Related functions
timestamp function
timestamp_millis function
timestamp_seconds function
timestamp_millis function
7/21/2022 • 2 minutes to read

Creates a timestamp expr milliseconds since UTC epoch.

Syntax
timestamp_millis(expr)

Arguments
expr : An integral numeric expression specifying milliseconds.

Returns
A TIMESTAMP.

Examples
> SELECT timestamp_millis(1230219000123);
2008-12-25 07:30:00.123

Related functions
timestamp function
timestamp_micros function
timestamp_seconds function
timestamp_seconds function
7/21/2022 • 2 minutes to read

Creates a timestamp expr seconds since the UTC epoch.

Syntax
timestamp_seconds(expr)

Arguments
expr : A numeric expression specifying seconds.

Returns
A TIMESTAMP.

Examples
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00
> SELECT timestamp_seconds(1230219000.123);
2008-12-25 07:30:00.123

Related functions
timestamp function
timestamp_micros function
timestamp_millis function
timestampadd function
7/21/2022 • 2 minutes to read

Adds value unit s to a timestamp expr .


Since: Databricks Runtime 10.4

Syntax
timestampadd(unit, value, expr)

unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY | DAYOFYEAR |
WEEK |
MONTH |
QUARTER |
YEAR }

Arguments
unit : A unit of measure.
value : A numeric expression with the number of unit s to add to expr .
expr : A TIMESTAMP expression.

Returns
A TIMESTAMP.
If value is negative it is subtracted from expr . If unit is MONTH , QUARTER , or YEAR the day portion of the
result will be adjusted to result in a valid date.
The function returns an overflow error if the result is beyond the supported range of timestamps.

Examples
> SELECT timestampadd(MICROSECOND, 5, TIMESTAMP'2022-02-28 00:00:00');
2022-02-28 00:00:00.000005

-- March 31, 2022 minus 1 month yields February 28, 2022


> SELECT timestampadd(MONTH, -1, TIMESTAMP'2022-03-31 00:00:00');
2022-02-28 00:00:00.000000

Related functions
add_months function
date_add function
date_sub function
dateadd function
timestamp function
timestampdiff function
7/21/2022 • 2 minutes to read

Returns the difference between two timestamps measured in unit s.


Since: Databricks Runtime 10.4

Syntax
timestampdiff(unit, start, end)

unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY |
WEEK |
MONTH |
QUARTER |
YEAR }

Arguments
unit : A unit of measure.
start : A starting TIMESTAMP expression.
end : An ending TIMESTAMP expression.

Returns
A BIGINT.
If start is greater than end the result is negative.
The function counts whole elapsed units based on UTC with a DAY being 86400 seconds.
One month is considered elapsed when the calendar month has increased and the calendar day and time are
equal to or greater than the start. Weeks, quarters, and years follow from that.

Examples
-- One second shy of a month elapsed
> SELECT timestampdiff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 11:59:59');
0

-- One month has passed even though it's not the end of the month yet, because day and time line up.
> SELECT timestampdiff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 12:00:00');
1

-- Start is greater than the end


> SELECT timestampdiff(YEAR, DATE'2021-01-01', DATE'1900-03-28');
-120
Related functions
add_months function
date_add function
date_sub function
datediff function
datediff (timestamp) function
timestamp function
timestampadd function
tinyint function
7/21/2022 • 2 minutes to read

Casts expr to TINYINT.

Syntax
tinyint(expr)

Arguments
expr : Any expression which is castable to TINYINT.

Returns
The result is TINYINT.
This function is a synonym for CAST(expr AS TINYINT) .
See cast function for details.

Examples
> SELECT tinyint('12');
12
> SELECT tinyint(5.4);
5

Related functions
cast function
to_csv function
7/21/2022 • 2 minutes to read

Returns a CSV string with the specified struct value.

Syntax
to_csv(expr [, options] )

Arguments
expr : A STRUCT expression.
options : An optional MAP literal expression with keys and values being STRING.

Returns
A STRING.
See from_csv function for details on possible options .

Examples
> SELECT to_csv(named_struct('a', 1, 'b', 2));
1,2
> SELECT to_csv(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat',
'dd/MM/yyyy'));
26/08/2015

Related functions
from_csv function
schema_of_csv function
to_json function
from_json function
schema_of_json function
to_date function
7/21/2022 • 2 minutes to read

Returns expr cast to a date using an optional formatting.

Syntax
to_date(expr [, fmt] )

Arguments
expr : A STRING expression representing a date.
fmt: An optional format STRING expression.

Returns
A DATE.
If fmt is supplied, it must conform with Datetime patterns.
If fmt is not supplied, the function is a synonym for cast(expr AS DATE) .
If fmt is malformed or its application does not result in a well-formed date, the function raises an error.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed dates.

Examples
> SELECT to_date('2009-07-30 04:17:52');
2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31

Related functions
cast function
date function
to_timestamp function
Datetime patterns
to_json function
7/21/2022 • 2 minutes to read

Returns a JSON string with the struct specified in expr .

Syntax
to_json(expr [, options] )

Arguments
expr : A STRUCT expression.
options : An optional MAP literal expression with keys and values being STRING.

Returns
A STRING.
See from_json function for details on possible options .

Examples
> SELECT to_json(named_struct('a', 1, 'b', 2));
{"a":1,"b":2}
> SELECT to_json(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat',
'dd/MM/yyyy'));
{"time":"26/08/2015"}
> SELECT to_json(array(named_struct('a', 1, 'b', 2)));
[{"a":1,"b":2}]
> SELECT to_json(map('a', named_struct('b', 1)));
{"a":{"b":1}}
> SELECT to_json(map(named_struct('a', 1),named_struct('b', 2)));
{"[1]":{"b":2}}
> SELECT to_json(map('a', 1));
{"a":1}
> SELECT to_json(array((map('a', 1))));
[{"a":1}]

Related functions
: operator
from_csv function
schema_of_csv function
from_json function
schema_of_json function
to_number function
7/21/2022 • 2 minutes to read

Returns expr cast to DECIMAL using formatting fmt .


Since: Databricks Runtime 10.5

Syntax
to_number(expr, fmt)

fmt
{ ' [ MI | S ] [ L | $ ]
[ 0 | 9 | G | , ] [...]
[ . | D ]
[ 0 | 9 ] [...]
[ L | $ ] [ PR | MI | S ] ' }

Arguments
expr : A STRING expression representing a number. expr may include leading or trailing spaces.
fmt : A STRING literal, specifying the expected format of expr .

Returns
A DECIMAL(p, s) where p is the total number of digits ( 0 or 9 ) and s is the number of digits after the
decimal point, or 0 if there is none.
fmt can contain the following elements (case insensitive):
0 or 9

Specifies an expected digit between 0 and 9 . A 0 to the left of the decimal point indicates that expr
must have at least as many digits. A leading 9 indicates that expr may omit these digits.
expr must not be larger than the number of digits to the left of the decimal point allows.
Digits to the right of the decimal point indicate the maximum number of digits expr may have to the right
of the decimal point.
. or D

Specifies the position of the decimal point.


expr does not need to include a decimal point.
, or G

Specifies the position of the , grouping (thousands) separator. There must be a 0 or 9 to the left and
right of each grouping separator. expr must match the grouping separator relevant to the size of the
number.
L or $
Specifies the location of the $ currency sign. This character may only be specified once.
S or MI

Specifies the position of an optional ‘+’ or ‘-‘ sign for S , and ‘-‘ only for MI . This directive may be
specified only once.
PR

Only allowed at the end of the format string; specifies that expr indicates a negative number with
wrapping angled brackets ( <1> ).
If expr contains any characters other than 0 through 9 , or characters permitted in fmt , an error is returned.
To return NULL instead of an error for invalid expr use try_to_number().

Examples
-- The format expects:
-- * an optional sign at the beginning,
-- * followed by a dollar sign,
-- * followed by a number between 3 and 6 digits long,
-- * thousands separators,
-- * up to two digits beyond the decimal point.
> SELECT to_number('-$12,345.67', 'S$999,099.99');
-12345.67

-- Plus is optional, and so are fractional digits.


> SELECT to_number('$345', 'S$999,099.99');
345.00

-- The format requires at least three digits.


> SELECT to_number('$45', 'S$999,099.99');
Error: Invalid number

-- The format requires at least three digits.


> SELECT try_to_number('$45', 'S$999,099.99');
NULL

-- The format requires at least three digits


> SELECT to_number('$045', 'S$999,099.99');
45.00

-- Using brackets to denote negative values


> SELECT to_number('<1234>', '999999PR');
-1234

Related functions
cast function
to_date function
try_to_number function
to_timestamp function
7/21/2022 • 2 minutes to read

Returns expr cast to a timestamp using an optional formatting.

Syntax
to_timestamp(expr [, fmt] )

Arguments
expr : A STRING expression representing a timestamp.
fmt: An optional format STRING expression.

Returns
A TIMESTAMP.
If fmt is supplied, it must conform with Datetime patterns.
If fmt is not supplied, the function is a synonym for cast(expr AS TIMESTAMP) .
If fmt is malformed or its application does not result in a well formed timestamp, the function raises an error.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed timestamps.

Examples
> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00

> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');


2016-12-31 00:00:00

Related functions
cast function
timestamp function
to_date function
Datetime patterns
to_unix_timestamp function
7/21/2022 • 2 minutes to read

Returns the timestamp in expr as a UNIX timestamp.

Syntax
to_unix_timestamp(expr [, fmt] )

Arguments
expr : A STRING expression representing a timestamp.
fmt: An optional format STRING expression.

Returns
A BIGINT.
If fmt is supplied, it must conform with Datetime patterns.
If fmt is not supplied, the function is a synonym for cast(expr AS TIMESTAMP) .
If fmt is malformed or its application does not result in a well formed timestamp, the function raises an error.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed timestamps.

Examples
> SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460098800

Related functions
from_unixtime function
Datetime patterns
to_utc_timestamp function
7/21/2022 • 2 minutes to read

Returns the timestamp in expr , interpreted in timezone , converted to UTC.

Syntax
to_utc_timestamp(expr, timezone)

Arguments
expr : A TIMESTAMP expression.
timezone : A STRING expression that is a valid timezone.

Returns
A TIMESTAMP.

Examples
> SELECT to_utc_timestamp('2016-08-31', 'Asia/Seoul');
2016-08-30 15:00:00
> SELECT to_utc_timestamp( '2017-07-14 02:40:00.0', 'GMT+1');
2017-07-14 01:40:00.0

Related functions
from_utc_timestamp function
transform function
7/21/2022 • 2 minutes to read

Transforms elements in an array in expr using the function func .

Syntax
transform(expr, func)

Arguments
expr : An ARRAY expression.
func : A lambda function.

Returns
An ARRAY of the type of the lambda function’s result.
The lambda function must have 1 or 2 parameters. The first parameter represents the element, the optional
second parameter represents the index of the element.
The lambda function produces a new value for each element in the array.

Examples
> SELECT transform(array(1, 2, 3), x -> x + 1);
[2,3,4]
> SELECT transform(array(1, 2, 3), (x, i) -> x + i);
[1,3,5]

Related functions
transform_keys function
transform_values function
transform_keys function
7/21/2022 • 2 minutes to read

Transforms keys in a map in expr using the function func .

Syntax
transform_keys(expr, func)

Arguments
expr : A MAP expression.
func : A lambda function.

Returns
A MAP where the keys have the type of the result of the lambda functions and the values have the type of the
expr MAP values.

The lambda function must have 2 parameters. The first parameter represents the key. The second parameter
represents the value.
The lambda function produces a new key for each entry in the map.

Examples
> SELECT transform_keys(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 1);
{2 -> 1, 3 -> 2, 4 -> 3}
> SELECT transform_keys(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
{2 -> 1, 4 -> 2, 6 -> 3}

Related functions
transform function
transform_values function
transform_values function
7/21/2022 • 2 minutes to read

Transforms values in a map in expr using the function func .

Syntax
transform_values(expr, func)

Arguments
expr : A MAP expression.
func : A lambda function.

Returns
A MAP where the values have the type of the result of the lambda functions and the keys have the type of the
expr MAP keys.

The lambda function must have 2 parameters. The first parameter represents the key. The second parameter
represents the value.
The lambda function produces a new value for each entry in the map.

Examples
> SELECT transform_values(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> v + 1);
{1 -> 2, 2 -> 3, 3 -> 4}
> SELECT transform_values(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
{1 -> 2, 2 -> 4, 3 -> 6}

Related functions
transform function
transform_keys function
translate function
7/21/2022 • 2 minutes to read

Returns an expr where all characters in from have been replaced with those in to .

Syntax
translate(expr, from, to)

Arguments
expr : A STRING expression.
from : A STRING expression consisting of a set of characters to be replaced.
to : A STRING expression consisting of a matching set of characters to replace from .

Returns
A STRING.
The function replaces all occurrences of any character in from with the corresponding character in to.
If to has a shorter length than from unmatched characters are removed.

Examples
> SELECT translate('AaBbCc', 'abc', '123');
A1B2C3
> SELECT translate('AaBbCc', 'abc', '1');
A1BC
> SELECT translate('AaBbCc', 'abc', '');
ABC

Related functions
replace function
overlay function
regexp_replace function
trim function
7/21/2022 • 2 minutes to read

Removes the leading and trailing space characters from str .


Removes the leading space characters from str .
Removes the trailing space characters from str .
Removes the leading and trailing trimStr characters from str .
Removes the leading trimStr characters from str .
Removes the trailing trimStr characters from str .

Syntax
trim(str)

trim(BOTH FROM str)

trim(LEADING FROM str)

trim(TRAILING FROM str)

trim(trimStr FROM str)


trim(BOTH trimStr FROM str)

trim(LEADING trimStr FROM str)

trim(TRAILING trimStr FROM str)

Arguments
trimStr : A STRING expression with a set of characters to be trimmed.
str : A STRING expression to be trimmed.

Returns
A STRING.

Examples
> SELECT '+' || trim(' SparkSQL ') || '+';
+SparkSQL+
> SELECT '+' || trim(BOTH FROM ' SparkSQL ') || '+';
+SparkSQL+
> SELECT '+' || trim(LEADING FROM ' SparkSQL ') || '+';
+SparkSQL +
> SELECT '+' || trim(TRAILING FROM ' SparkSQL ') || '+';
+ SparkSQL+
> SELECT trim('SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ

Related functions
btrim function
lpad function
ltrim function
rpad function
rtrim function
trunc function
7/21/2022 • 2 minutes to read

Returns a date with the portion of the date truncated to the unit specified by the format model fmt .

Syntax
trunc(expr, fmt)

Arguments
expr : A DATE expression.
fmt : A STRING expression specifying how to truncate.

Returns
A DATE.
fmt must be one of (case insensitive):
'YEAR' , 'YYYY' , 'YY' - truncate to the first date of the year that the date falls in.
'QUARTER' - truncate to the first date of the quarter that the date falls in.
'MONTH' , 'MM' , 'MON' - truncate to the first date of the month that the date falls in.
'WEEK' - truncate to the Monday of the week that the date falls in.

Examples
> SELECT trunc('2019-08-04', 'week');
2019-07-29
> SELECT trunc('2019-08-04', 'quarter');
2019-07-01
> SELECT trunc('2009-02-12', 'MM');
2009-02-01
> SELECT trunc('2015-10-27', 'YEAR');
2015-01-01

Related functions
date_trunc function
try_add function
7/21/2022 • 2 minutes to read

Returns the sum of expr1 and expr2 , or NULL in case of error.


Since: Databricks Runtime 10.0

Syntax
try_add ( expr1 , expr2 )

Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : If expr1 is numeric, expr2 must be a numeric expression, or an INTERVAL otherwise.

Returns
If expr1 is a numeric, the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
If the result overflows the result type Databricks Runtime returns NULL.
When you add a year-month interval to a DATE, Databricks Runtime ensures that the resulting date is well
formed.

Examples
> SELECT try_add(1, 2);
3

> SELECT try_add(DATE'2021-03-20', INTERVAL '2' MONTH);


2021-05-20

> SELECT try_add(TIMESTAMP'2021-03-20 12:15:29', INTERVAL '3' SECOND);


2021-03-20 12:15:32

> SELECT typeof(try_add(INTERVAL '3' DAY, INTERVAL '2' HOUR));


interval day to hour

> SELECT try_add(DATE'2021-03-31', INTERVAL '1' MONTH);


2021-04-30

> SELECT try_add(127Y, 1Y);


NULL

Related functions
- (minus sign) operator
/ (slash sign) operator
* (asterisk sign) operator
sum aggregate function
try_divide function
try_avg aggregate function
7/21/2022 • 2 minutes to read

Returns the mean calculated from values of a group. If there is an overflow, returns NULL.
Since: Databricks Runtime 11.0

Syntax
try_avg( [ALL | DISTINCT] expr) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that returns a numeric or an interval value.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
The result type is computed as for the arguments:
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
year-month interval: The result is an INTERVAL YEAR TO MONTH .
day-time interval: The result is an INTERVAL DAY TO SECOND .
In all other cases the result is a DOUBLE.
Nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the average is computed after duplicates are removed.
To raise an error instead of NULL in case of an overflow use avg.

Examples
> SELECT try_avg(col) FROM VALUES (1), (2), (3) AS tab(col);
2.0

> SELECT try_avg(DISTINCT col) FROM VALUES (1), (1), (2) AS tab(col);
1.5

> SELECT try_avg(col) FROM VALUES (1), (2), (NULL) AS tab(col);


1.5

> SELECT try_avg(col) FROM VALUES (INTERVAL '1' YEAR), (INTERVAL '2' YEAR) AS tab(col);
1-6

-- Overflow results in NULL for try_avg()


> SELECT try_avg(col) FROM VALUES (5e37::DECIMAL(38, 0)), (5e37::DECIMAL(38, 0)) AS tab(col);
NULL

-- Overflow causes error for avg() in ANSI mode.


> SELECT avg(col) FROM VALUES (5e37::DECIMAL(38, 0)), (5e37::DECIMAL(38, 0)) AS tab(col);
CANNOT_CHANGE_DECIMAL_PRECISION error

Related functions
avg aggregate function
aggregate function
max aggregate function
mean aggregate function
min aggregate function
try_sum aggregate function
sum aggregate function
try_cast function
7/21/2022 • 2 minutes to read

Returns the value of sourceExpr cast to data type targetType if possible, or NULL if not possible.
Since: Databricks Runtime 10.0

Syntax
try_cast(sourceExpr AS targetType)

Arguments
sourceExpr : Any castable expression.
targetType : The type of the result.

Returns
The result is of type targetType .
This function is a more relaxed variant of cast function which includes a detailed description.
try_cast differs from the cast function by tolerating the following conditions, as long as the cast from the type of
sourceExpr to targetType is supported:
If a sourceExpr value cannot fit within the domain of targetType the result is NULL instead of an overflow
error.
If a sourceExpr value is not well formed or contains invalid characters the result is NULL instead of an
invalid data error.
Exceptions to the above are:
Casting to a STRUCT field with NOT NULL property.
Casting a MAP key.

Examples
> SELECT try_cast('10' AS INT);
10

> SELECT try_cast('a' AS INT);


NULL
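
A value that cannot fit within the domain of the target type also yields NULL rather than an overflow error; a small illustrative example (TINYINT covers -128 to 127):

> SELECT try_cast(257 AS TINYINT);
NULL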

Related functions
:: (colon colon sign) operator
cast function
try_divide function
7/21/2022 • 2 minutes to read

Returns dividend divided by divisor , or NULL if divisor is 0.


Since: Databricks Runtime 10.0

Syntax
try_divide(dividend, divisor)

Arguments
dividend : A numeric or INTERVAL expression.
divisor : A numeric expression.

Returns
If both dividend and divisor are DECIMAL, the result is DECIMAL.
If dividend is a year-month interval, the result is an INTERVAL YEAR TO MONTH .
If dividend is a day-time interval, the result is an INTERVAL DAY TO SECOND .
In all other cases, a DOUBLE.
If the divisor is 0, the operator returns NULL.

Examples
> SELECT try_divide(3, 2);
1.5

> SELECT try_divide(2L, 2L);


1.0

> SELECT try_divide(INTERVAL '3:15' HOUR TO MINUTE, 3);


0 01:05:00.000000

> SELECT try_divide(3 , 0)


NULL

Related functions
* (asterisk sign) operator
div operator
- (minus sign) operator
+ (plus sign) operator
sum aggregate function
try_add function
try_element_at function
7/21/2022 • 2 minutes to read

Returns the element of an arrayExpr at index , or NULL if index is out of bounds.


Returns the value of mapExpr for key , or NULL if key does not exist.
Since: Databricks Runtime 10.0

Syntax
try_element_at(arrayExpr, index)

try_element_at(mapExpr, key)

Arguments
arrayExpr : An ARRAY expression.
index : An INTEGER expression.
mapExpr : A MAP expression.
key : An expression matching the type of the keys of mapExpr

Returns
If the first argument is an ARRAY:
The result is of the type of the elements of arrayExpr .
abs(index) must not be 0.
If index is negative the function accesses elements from the last to the first.
The function returns NULL if abs(index) exceeds the length of the array, or if key does not exist in the map.

Examples
> SELECT try_element_at(array(1, 2, 3), 2);
2

> SELECT try_element_at(array(1, 2, 3), 5);


NULL
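
A negative index counts from the end of the array, as noted above; for example:

> SELECT try_element_at(array(1, 2, 3), -1);
3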

> SELECT element_at(array(1, 2, 3), 5);


Error: INVALID_ARRAY_INDEX_IN_ELEMENT_AT

> SELECT try_element_at(map(1, 'a', 2, 'b'), 2);


b

> SELECT try_element_at(map(1, 'a', 2, 'b'), 3);


NULL

> SELECT element_at(map(1, 'a', 2, 'b'), 3);


Error: MAP_KEY_DOES_NOT_EXIST

Related functions
array_contains function
array_position function
element_at function
try_multiply function
7/21/2022 • 2 minutes to read

Returns multiplier multiplied by multiplicand , or NULL on overflow.


Since: Databricks Runtime 10.4

Syntax
try_multiply(multiplier, multiplicand)

Arguments
multiplier : A numeric or INTERVAL expression.
multiplicand : A numeric expression or INTERVAL expression.

You may not specify an INTERVAL for both arguments.

Returns
If both multiplier and multiplicand are DECIMAL, the result is DECIMAL.
If multiplier or multiplicand is an INTERVAL, the result is of the same type.
If both multiplier and multiplicand are integral numeric types the result is the larger of the two types.
In all other cases the result is a DOUBLE.
If either the multiplier or the multiplicand is 0, the operator returns 0.
If the result of the multiplication is outside the bound for the result type the result is NULL .

Examples
> SELECT try_multiply(3, 2);
6

> SELECT try_multiply(2L, 2L);


4L

> SELECT try_multiply(INTERVAL '3' YEAR, 3);


9-0

> SELECT try_multiply(100Y, 100Y);


NULL

Related functions
* (asterisk sign) operator
div operator
- (minus sign) operator
+ (plus sign) operator
sum aggregate function
try_add function
try_divide function
try_subtract function
try_subtract function
7/21/2022 • 2 minutes to read

Returns the subtraction of expr2 from expr1 , or NULL on overflow.


Since: Databricks Runtime 10.4

Syntax
try_subtract ( expr1 , expr2 )

Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : If expr1 is numeric, expr2 must be a numeric expression, or an INTERVAL otherwise.

Returns
If expr1 is a numeric, the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
If the result overflows the result type Databricks Runtime returns NULL.
When you subtract a year-month interval from a DATE, Databricks Runtime ensures that the resulting date is
well formed.

Examples
> SELECT try_subtract(1, 2);
-1

> SELECT try_subtract(DATE'2021-03-20', INTERVAL '2' MONTH);


2021-01-20

> SELECT try_subtract(TIMESTAMP'2021-03-20 12:15:29', INTERVAL '3' SECOND);


2021-03-20 12:15:26

> SELECT typeof(try_subtract(INTERVAL '3' DAY, INTERVAL '2' HOUR));


interval day to hour

> SELECT try_subtract(DATE'2021-03-31', INTERVAL '1' MONTH);


2021-02-28

> SELECT try_subtract(-128Y, 1Y);


NULL

Related functions
- (minus sign) operator
/ (slash sign) operator
* (asterisk sign) operator
sum aggregate function
try_add function
try_divide function
try_multiply function
try_sum aggregate function
7/21/2022 • 2 minutes to read

Returns the sum calculated from values of a group, or NULL if there is an overflow.
Since: Databricks Runtime 10.5

Syntax
try_sum ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric or interval.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
If expr is an integral number type, a BIGINT.
If expr is DECIMAL(p, s) the result is DECIMAL(p + min(10, 31-p), s) .
If expr is an interval the result type matches expr .
Otherwise, a DOUBLE.
If DISTINCT is specified only unique values are summed up.
If the result overflows the result type Databricks Runtime returns NULL. To return an error instead use sum.

Examples
> SELECT try_sum(col) FROM VALUES (5), (10), (15) AS tab(col);
30

> SELECT try_sum(col) FILTER(WHERE col < 15)
    FROM VALUES (5), (10), (15) AS tab(col);
15

> SELECT try_sum(DISTINCT col) FROM VALUES (5), (10), (10), (15) AS tab(col);
30

> SELECT try_sum(col) FROM VALUES (NULL), (10), (15) AS tab(col);


25

> SELECT try_sum(col) FROM VALUES (NULL), (NULL) AS tab(col);


NULL

-- try_sum overflows a BIGINT


> SELECT try_sum(c1) FROM VALUES(5E18::BIGINT), (5E18::BIGINT) AS tab(c1);
NULL

-- In ANSI mode sum returns an error if it overflows BIGINT


> SELECT sum(c1) FROM VALUES(5E18::BIGINT), (5E18::BIGINT) AS tab(c1);
ERROR

-- try_sum overflows an INTERVAL


> SELECT try_sum(c1) FROM VALUES(INTERVAL '100000000' YEARS), (INTERVAL '100000000' YEARS) AS tab(c1);
NULL

-- sum returns an error on INTERVAL overflow


> SELECT sum(c1) FROM VALUES(INTERVAL '100000000' YEARS), (INTERVAL '100000000' YEARS) AS tab(c1);
Error: ARITHMETIC_OVERFLOW

Related functions
aggregate function
avg aggregate function
max aggregate function
mean aggregate function
min aggregate function
sum aggregate function
try_to_number function
7/21/2022 • 2 minutes to read

Returns expr cast to DECIMAL using formatting fmt , or NULL if expr does not match the format.
Since: Databricks Runtime 10.5

Syntax
try_to_number(expr, fmt)

fmt
{ ' [ MI | S ] [ L | $ ]
[ 0 | 9 | G | , ] [...]
[ . | D ]
[ 0 | 9 ] [...]
[ L | $ ] [ PR | MI | S ] ' }

Arguments
expr : A STRING expression representing a number. expr may include leading or trailing spaces.
fmt : A STRING literal, specifying the expected format of expr .

Returns
A DECIMAL(p, s) where p is the total number of digits ( 0 or 9 ) and s is the number of digits after the
decimal point, or 0 if there are no digits after the decimal point.
fmt can contain the following elements (case insensitive):
0 or 9

Specifies an expected digit between 0 and 9 . A 0 to the left of the decimal points indicates that expr
must have at least as many digits. Leading 9 indicate that expr may omit these digits.
expr must not be larger that the number of digits to the left of the decimal point allows.
Digits to the right of the decimal indicate the maximum number of digits expr may have to the right of
the decimal point specified by fmt .
. or D

Specifies the position of the decimal point.


expr does not need to include a decimal point.
, or G

Specifies the position of the , grouping (thousands) separator. There must be a 0 or 9 to the left and
right of each grouping separator. expr must match the grouping separator relevant to the size of the
number.
L or $
Specifies the location of the $ currency sign. This character may only be specified once.
S or MI

Specifies the position of an optional ‘+’ or ‘-‘ sign for S , and ‘-‘ only for MI . This directive may be
specified only once.
PR

Specifies that expr indicates a negative number with wrapping angled brackets ( <1> ).
If expr contains any characters other than 0 through 9 , or those permitted in fmt , a NULL is returned.
For strict semantics, use to_number().

Examples
-- The format expects:
-- * an optional sign at the beginning,
-- * followed by a dollar sign,
-- * followed by a number between 3 and 6 digits long,
-- * thousands separators,
-- * up to two digits beyond the decimal point.
> SELECT try_to_number('-$12,345.67', 'S$999,099.99');
-12345.67

-- Plus is optional, and so are fractional digits.


> SELECT try_to_number('$345', 'S$999,099.99');
345.00

-- The format requires at least three digits.


> SELECT to_number('$45', 'S$999,099.99');
Error: Invalid number

-- The format requires at least three digits.


> SELECT try_to_number('$45', 'S$999,099.99');
NULL

-- The format requires at least three digits


> SELECT try_to_number('$045', 'S$999,099.99');
45.00

-- Using brackets to denote negative values


> SELECT try_to_number('<1234>', '999999PR');
-1234

Related functions
cast function
to_date function
to_number function
typeof function
7/21/2022 • 2 minutes to read

Returns a DDL-formatted type string for the data type of the input.

Syntax
typeof(expr)

Arguments
expr : Any expression.

Returns
A STRING.

Examples
> SELECT typeof(1);
int
> SELECT typeof(array(1));
array<int>

Related functions
ucase function
7/21/2022 • 2 minutes to read

Returns expr with all characters changed to uppercase.

Syntax
ucase(expr)

Arguments
expr : A STRING expression.

Returns
A STRING.
This function is a synonym for upper function.

Examples
> SELECT ucase('SparkSql');
SPARKSQL

Related functions
lower function
initcap function
upper function
unbase64 function
7/21/2022 • 2 minutes to read

Returns a decoded base64 string as binary.

Syntax
unbase64(expr)

Arguments
expr : A STRING expression in a base64 format.

Returns
A BINARY.

Examples
> SELECT cast(unbase64('U3BhcmsgU1FM') AS STRING);
Spark SQL

Related functions
base64 function
unhex function
7/21/2022 • 2 minutes to read

Converts hexadecimal expr to BINARY.

Syntax
unhex(expr)

Arguments
expr : A STRING expression of hexadecimal characters.

Returns
The result is BINARY.
If the length of expr is odd, the first character is discarded and the result is padded with a null byte. If expr
contains non hex characters the result is NULL.

Examples
> SELECT decode(unhex('537061726B2053514C'), 'UTF-8');
Spark SQL

Related functions
hex function
unix_date function
7/21/2022 • 2 minutes to read

Returns the number of days since 1970-01-01 .

Syntax
unix_date(expr)

Arguments
expr : A DATE expression.

Returns
An INTEGER.

Examples
> SELECT unix_date(DATE('1970-01-02'));
1

Related functions
unix_micros function
unix_millis function
unix_seconds function
unix_micros function
7/21/2022 • 2 minutes to read

Returns the number of microseconds since 1970-01-01 00:00:00 UTC .

Syntax
unix_micros(expr)

Arguments
expr : A TIMESTAMP expression.

Returns
A BIGINT.

Examples
> SELECT unix_micros(TIMESTAMP('1970-01-01 00:00:01Z'));
1000000

Related functions
unix_date function
unix_millis function
unix_seconds function
unix_millis function
7/21/2022 • 2 minutes to read

Returns the number of milliseconds since 1970-01-01 00:00:00 UTC .

Syntax
unix_millis(expr)

Arguments
expr : A TIMESTAMP expression.

Returns
A BIGINT.
The function truncates higher levels of precision.

Examples
> SELECT unix_millis(TIMESTAMP('1970-01-01 00:00:01Z'));
1000

Related functions
unix_date function
unix_micros function
unix_seconds function
unix_seconds function
7/21/2022 • 2 minutes to read

Returns the number of seconds since 1970-01-01 00:00:00 UTC .

Syntax
unix_seconds(expr)

Arguments
expr : A TIMESTAMP expression.

Returns
A BIGINT.
The function truncates higher levels of precision.

Examples
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1

Related functions
unix_date function
unix_micros function
unix_millis function
unix_timestamp function
7/21/2022 • 2 minutes to read

Returns the UNIX timestamp of current or specified time.

Syntax
unix_timestamp([expr [, fmt] ] )

Arguments
expr : An optional DATE, TIMESTAMP, or a STRING expression in a valid datetime format.
fmt : An optional STRING expression specifying the format if expr is a STRING.

Returns
A BIGINT.
If no argument is provided the default is the current timestamp. fmt is ignored if expr is a DATE or
TIMESTAMP. If expr is a STRING fmt is used to translate the string to a TIMESTAMP before computing the unix
timestamp.
The default fmt value is 'yyyy-MM-dd HH:mm:ss' .
See Datetime patterns for valid date and time format patterns.
If fmt or expr are invalid the function raises an error.

NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed timestamps.

Examples
> SELECT unix_timestamp();
1476884637
> SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460041200

Related functions
timestamp function
upper function
7/21/2022 • 2 minutes to read

Returns expr with all characters changed to uppercase.

Syntax
upper(expr)

Arguments
expr : A STRING expression.

Returns
A STRING.
This function is a synonym for ucase function.

Examples
> SELECT upper('SparkSql');
SPARKSQL

Related functions
lower function
initcap function
ucase function
uuid function
7/21/2022 • 2 minutes to read

Returns a universally unique identifier (UUID) string.

Syntax
uuid()

Arguments
The function takes no argument.

Returns
A STRING formatted as a canonical UUID 36-character string.
The function is non-deterministic.

Examples
> SELECT uuid();
46707d92-02f4-4817-8116-a4c3b23e6266

Related functions
rand function
var_pop aggregate function
7/21/2022 • 2 minutes to read

Returns the population variance calculated from values of a group.

Syntax
var_pop ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.

Examples
> SELECT var_pop(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.6875
> SELECT var_pop(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.6666666666666666

Related functions
var_samp aggregate function
variance aggregate function
var_samp aggregate function
7/21/2022 • 2 minutes to read

Returns the sample variance calculated from values of a group.

Syntax
var_samp ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for variance aggregate function.

Examples
> SELECT var_samp(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9166666666666666
> SELECT var_samp(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0

Related functions
var_pop aggregate function
variance aggregate function
variance aggregate function
7/21/2022 • 2 minutes to read

Returns the sample variance calculated from values of a group.

Syntax
variance ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]

Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.

Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for var_samp aggregate function.

Examples
> SELECT variance(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9166666666666666
> SELECT variance(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0

Related functions
var_pop aggregate function
var_samp aggregate function
version function
7/21/2022 • 2 minutes to read

Returns the Apache Spark version.


Use current_version to retrieve the Databricks Runtime version.

Syntax
version()

Arguments
The function takes no argument.

Returns
A STRING that contains two fields, the first being a release version and the second being a git revision.

Examples
> SELECT version();
3.1.0 a6d6ea3efedbad14d99c24143834cd4e2e52fb40

> SELECT current_version().dbr_version;


11.0

Related functions
current_version function
weekday function
7/21/2022 • 2 minutes to read

Returns the day of the week of expr .

Syntax
weekday(expr)

Arguments
expr : A DATE or TIMESTAMP expression.

Returns
An INTEGER where 0 = Monday and 6 = Sunday.
This function is a synonym for extract(DAYOFWEEK_ISO FROM expr) - 1 .

Examples
> SELECT weekday(DATE'2009-07-30'), extract(DAYOFWEEK_ISO FROM DATE'2009-07-30');
3 4

Related functions
day function
dayofmonth function
dayofyear function
extract function
dayofweek function
weekofyear function
7/21/2022 • 2 minutes to read

Returns the week of the year of expr .

Syntax
weekofyear(expr)

Arguments
expr : A DATE expression.

Returns
An INTEGER.
A week is considered to start on a Monday and week 1 is the first week with >3 days. This function is a synonym
for extract(WEEK FROM expr) .

Examples
> SELECT weekofyear('2008-02-20');
8

Related functions
day function
dayofmonth function
dayofyear function
extract function
dayofweek function
width_bucket function
7/21/2022 • 2 minutes to read

Returns the bucket number for a value in an equi-width histogram.

Syntax
width_bucket(expr, minExpr, maxExpr, numBuckets)

Arguments
expr : A numeric expression to be bucketed.
minExpr : A numeric expression providing a lower bound for the buckets.
maxExpr : A numeric expression providing an upper bound for the buckets.
numBuckets : An INTEGER expression greater than 0 specifying the number of buckets.

Returns
An INTEGER.
The function divides the range between minExpr and maxExpr into numBuckets slices of equal size. The result is
the slice into which expr falls.
If expr is outside of minExpr the result is 0.
If expr is outside of maxExpr the result is numBuckets + 1 .

minExpr can be greater than maxExpr .
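
As a worked illustration of the first example below: with minExpr = 0.2, maxExpr = 10.6, and numBuckets = 5, each bucket spans (10.6 - 0.2) / 5 = 2.08, so the value 5.3 falls into bucket 3, which covers 4.36 (inclusive) up to 6.44 (exclusive).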

Examples
> SELECT width_bucket(5.3, 0.2, 10.6, 5);
3
> SELECT width_bucket(-2.1, 1.3, 3.4, 3);
0
> SELECT width_bucket(8.1, 0.0, 5.7, 4);
5
> SELECT width_bucket(-0.9, 5.2, 0.5, 2);
3

Related functions
window grouping expression
7/21/2022 • 2 minutes to read

Creates a hopping-based sliding window over a timestamp expression.

Syntax
window(expr, width [, slide [, start] ] )

Arguments
expr : A TIMESTAMP expression specifying the subject of the window.
width : A STRING literal representing the width of the window as an INTERVAL DAY TO SECOND literal.
slide : An optional STRING literal representing how often a new window starts, expressed as an INTERVAL
DAY TO SECOND literal.
start : An optional STRING literal representing an offset from 1970-01-01 00:00:00 UTC at which the windowing
starts, expressed as an INTERVAL DAY TO SECOND literal.

Returns
Returns a set of groupings which can be operated on with aggregate functions. The GROUP BY column name is
window . It is of type STRUCT<start:TIMESTAMP, end:TIMESTAMP>

slide must be less than or equal to width . start must be less than slide .
If slide < width the rows in each group overlap. By default slide equals width , so the rows of expr are partitioned
into non-overlapping groups. The windowing starts at 1970-01-01 00:00:00 UTC + start . The default for start is '0 SECONDS' .

Examples
> SELECT window, min(val), max(val), count(val)
FROM VALUES (TIMESTAMP'2020-08-01 12:20:21', 17),
(TIMESTAMP'2020-08-01 12:20:22', 12),
(TIMESTAMP'2020-08-01 12:23:10', 8),
(TIMESTAMP'2020-08-01 12:25:05', 11),
(TIMESTAMP'2020-08-01 12:28:59', 15),
(TIMESTAMP'2020-08-01 12:30:01', 23),
(TIMESTAMP'2020-08-01 12:30:15', 2),
(TIMESTAMP'2020-08-01 12:35:22', 16) AS S(stamp, val)
GROUP BY window(stamp, '2 MINUTES 30 SECONDS', '30 SECONDS', '15 SECONDS');
{2020-08-01 12:19:15, 2020-08-01 12:21:45} 12 17 2
{2020-08-01 12:18:15, 2020-08-01 12:20:45} 12 17 2
{2020-08-01 12:20:15, 2020-08-01 12:22:45} 12 17 2
{2020-08-01 12:19:45, 2020-08-01 12:22:15} 12 17 2
{2020-08-01 12:18:45, 2020-08-01 12:21:15} 12 17 2
{2020-08-01 12:21:45, 2020-08-01 12:24:15} 8 8 1
{2020-08-01 12:22:45, 2020-08-01 12:25:15} 8 11 2
{2020-08-01 12:21:15, 2020-08-01 12:23:45} 8 8 1
{2020-08-01 12:22:15, 2020-08-01 12:24:45} 8 8 1
{2020-08-01 12:20:45, 2020-08-01 12:23:15} 8 8 1
{2020-08-01 12:23:45, 2020-08-01 12:26:15} 11 11 1
{2020-08-01 12:23:15, 2020-08-01 12:25:45} 11 11 1
{2020-08-01 12:24:45, 2020-08-01 12:27:15} 11 11 1
{2020-08-01 12:24:15, 2020-08-01 12:26:45} 11 11 1
{2020-08-01 12:27:15, 2020-08-01 12:29:45} 15 15 1
{2020-08-01 12:27:45, 2020-08-01 12:30:15} 15 23 2
{2020-08-01 12:28:45, 2020-08-01 12:31:15} 2 23 3
{2020-08-01 12:26:45, 2020-08-01 12:29:15} 15 15 1
{2020-08-01 12:28:15, 2020-08-01 12:30:45} 2 23 3
{2020-08-01 12:29:45, 2020-08-01 12:32:15} 2 23 2
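
With slide and start omitted, the windows tumble; reusing the same sample rows with a 10 minute width should produce aligned, non-overlapping groups (a minimal sketch):

> SELECT window, min(val), max(val), count(val)
    FROM VALUES (TIMESTAMP'2020-08-01 12:20:21', 17),
                (TIMESTAMP'2020-08-01 12:20:22', 12),
                (TIMESTAMP'2020-08-01 12:23:10', 8),
                (TIMESTAMP'2020-08-01 12:25:05', 11),
                (TIMESTAMP'2020-08-01 12:28:59', 15),
                (TIMESTAMP'2020-08-01 12:30:01', 23),
                (TIMESTAMP'2020-08-01 12:30:15', 2),
                (TIMESTAMP'2020-08-01 12:35:22', 16) AS S(stamp, val)
    GROUP BY window(stamp, '10 MINUTES');
{2020-08-01 12:20:00, 2020-08-01 12:30:00} 8 17 5
{2020-08-01 12:30:00, 2020-08-01 12:40:00} 2 23 3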

Related functions
cube function
xpath function
7/21/2022 • 2 minutes to read

Returns values within the nodes of xml that match xpath .

Syntax
xpath(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
An ARRAY of STRING.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b/text()');
[b1, b2, b3]

Related functions
xpath_boolean function
xpath_double function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_boolean function
7/21/2022 • 2 minutes to read

Returns true if the xpath expression evaluates to true , or if a matching node in xml is found.

Syntax
xpath_boolean(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
A BOOLEAN.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_boolean('<a><b>1</b></a>','a/b');
true

Related functions
xpath function
xpath_double function
xpath_float function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_double function
7/21/2022 • 2 minutes to read

Returns a DOUBLE value from an XML document.

Syntax
xpath_double(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
A DOUBLE.
The result is zero if no match is found, or NaN if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_double('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3.0

Related functions
xpath function
xpath_boolean function
xpath_float function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_float function
7/21/2022 • 2 minutes to read

Returns a FLOAT value from an XML document.

Syntax
xpath_float(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
The result is FLOAT.
The result is zero if no match is found, or NaN if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_float('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3.0

Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_int function
7/21/2022 • 2 minutes to read

Returns an INTEGER value from an XML document.

Syntax
xpath_int(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
An INTEGER.
The result is zero if no match is found, or if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3

Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_long function
7/21/2022 • 2 minutes to read

Returns a BIGINT value from an XML document.

Syntax
xpath_long(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
A BIGINT.
The result is zero if no match is found, or if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_long('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3

Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_int function
xpath_number function
xpath_short function
xpath_string function
xpath_number function
7/21/2022 • 2 minutes to read

Returns a DOUBLE value from an XML document.

Syntax
xpath_number(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
A DOUBLE.
The result is zero if no match is found, or NaN if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_number('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3.0

Related functions
xpath function
xpath_boolean function
xpath_float function
xpath_int function
xpath_long function
xpath_double function
xpath_short function
xpath_string function
xpath_short function
7/21/2022 • 2 minutes to read

Returns a SMALLINT value from an XML document.

Syntax
xpath_short(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
The result is SMALLINT.
The result is zero if no match is found, or if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_short('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3

Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_long function
xpath_number function
xpath_int function
xpath_string function
xpath_string function
7/21/2022 • 2 minutes to read

Returns the contents of the first XML node that matches the XPath expression.

Syntax
xpath_string(xml, xpath)

Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.

Returns
The result is STRING.
The function raises an error if xml or xpath are malformed.

Examples
> SELECT xpath_string('<a><b>b</b><c>cc</c></a>','a/c');
cc

Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_long function
xpath_number function
xpath_int function
xpath_short function
xxhash64 function
7/21/2022 • 2 minutes to read

Returns a 64-bit hash value of the arguments.

Syntax
xxhash64(expr1 [, ...] )

Arguments
exprN : An expression of any type.

Returns
A BIGINT.

Examples
> SELECT xxhash64('Spark', array(123), 2);
5602566077635097486

Related functions
hash function
crc32 function
year function
7/21/2022 • 2 minutes to read

Returns the year component of expr .

Syntax
year(expr)

Arguments
expr : A DATE or TIMESTAMP expression.

Returns
An INTEGER.
This function is a synonym for extract(YEAR FROM expr) .

Examples
> SELECT year('2016-07-30');
2016

Related functions
dayofmonth function
dayofweek function
day function
hour function
minute function
extract function
zip_with function
7/21/2022 • 2 minutes to read

Merges the arrays in expr1 and expr2 , element-wise, into a single array using func .

Syntax
zip_with(expr1, expr2, func)

Arguments
expr1 : An ARRAY expression.
expr2 : An ARRAY expression.
func : A lambda function taking two parameters.

Returns
An ARRAY of the result of the lambda function.
If one array is shorter, nulls are appended at the end to match the length of the longer array before applying
func .

Examples
> SELECT zip_with(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x));
[{a, 1}, {b, 2}, {c, 3}]
> SELECT zip_with(array(1, 2), array(3, 4), (x, y) -> x + y);
[4,6]
> SELECT zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y));
[ad, be, cf]
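
When the arrays differ in length, the shorter one is padded with NULL before func is applied; a small sketch (the coalesce guards against the padded NULL):

> SELECT zip_with(array(1, 2), array(10, 20, 30), (x, y) -> coalesce(x, 0) + y);
[11,22,30]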

Related functions
array_distinct function
array_intersect function
array_except function
array_sort function
array_remove function
array_union function
Identifiers
7/21/2022 • 2 minutes to read

An identifier is a string used to identify an object such as a table, view, schema, or column. Databricks Runtime
has regular identifiers and delimited identifiers, which are enclosed within backticks. All identifiers are case-
insensitive.

Syntax
Regular identifiers

{ letter | digit | '_' } [ ... ]

NOTE
If spark.sql.ansi.enabled is set to true , you cannot use an ANSI SQL reserved keyword as an identifier. For details,
see ANSI Compliance.

Delimited identifiers

`c [ ... ]`

Parameters
letter : Any letter from A-Z or a-z.
digit : Any numeral from 0 to 9.
c : Any character from the character set. Use ` to escape special characters (for example, `.` ).

Examples
-- This CREATE TABLE fails because of the illegal identifier name a.b
CREATE TABLE test (a.b int);
no viable alternative at input 'CREATE TABLE test (a.'(line 1, pos 20)

-- This CREATE TABLE works


CREATE TABLE test (`a.b` int);

-- This CREATE TABLE fails because the special character ` is not escaped
CREATE TABLE test1 (`a`b` int);
no viable alternative at input 'CREATE TABLE test (`a`b`'(line 1, pos 23)

-- This CREATE TABLE works


CREATE TABLE test (`a``b` int);
NULL semantics
7/21/2022 • 11 minutes to read

A table consists of a set of rows and each row contains a set of columns. A column is associated with a data type
and represents a specific attribute of an entity (for example, age is a column of an entity called person ).
Sometimes, the value of a column specific to a row is not known at the time the row comes into existence. In
SQL , such values are represented as NULL . This section details the semantics of NULL values handling in
various operators, expressions and other SQL constructs.
The following illustrates the schema layout and data of a table named person . The data contains NULL values in
the age column and this table is used in various examples in the sections below.

Id Name Age
--- -------- ----
100 Joe 30
200 Marry NULL
300 Mike 18
400 Fred 50
500 Albert NULL
600 Michelle 30
700 Dan 50

Comparison operators
Databricks Runtime supports the standard comparison operators such as > , >= , = , < and <= . The result of
these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL . In
order to compare the NULL values for equality, Databricks Runtime provides a null-safe equal operator ( <=> ),
which returns False when one of the operands is NULL and returns True when both operands are NULL .
The following table illustrates the behavior of comparison operators when one or both operands are NULL :

Left Operand   Right Operand   >      >=     =      <      <=     <=>
NULL           Any value       NULL   NULL   NULL   NULL   NULL   False
Any value      NULL            NULL   NULL   NULL   NULL   NULL   False
NULL           NULL            NULL   NULL   NULL   NULL   NULL   True

Examples
-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
> SELECT 5 > null AS expression_output;
expression_output
-----------------
null

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
> SELECT null = null AS expression_output;
expression_output
-----------------
null

-- Null-safe equal operator returns `False` when one of the operands is `NULL`
> SELECT 5 <=> null AS expression_output;
expression_output
-----------------
false

-- Null-safe equal operator returns `True` when both operands are `NULL`
> SELECT NULL <=> NULL;
expression_output
-----------------
true
-----------------

Logical operators
Databricks Runtime supports standard logical operators such as AND , OR and NOT . These operators take
Boolean expressions as the arguments and return a Boolean value.

The following tables illustrate the behavior of logical operators when one or both operands are NULL .

Left Operand   Right Operand   OR     AND
True           NULL            True   NULL
False          NULL            NULL   False
NULL           True            True   NULL
NULL           False           NULL   False
NULL           NULL            NULL   NULL

Operand   NOT
NULL      NULL

Examples
-- `OR` returns `True` when one operand is `True`, even if the other operand is `NULL`.
> SELECT (true OR null) AS expression_output;
expression_output
-----------------
true

-- `OR` returns `NULL` when one operand is `NULL` and the other is `False`.
> SELECT (null OR false) AS expression_output;
expression_output
-----------------
null

-- `NOT` returns `NULL` when its operand is `NULL`.
> SELECT NOT(null) AS expression_output;
expression_output
-----------------
null

Expressions
The comparison operators and logical operators are treated as expressions in Databricks Runtime. Databricks
Runtime also supports other forms of expressions, which can be broadly classified as:
Null intolerant expressions
Expressions that can process NULL value operands
The result of these expressions depends on the expression itself.
Null intolerant expressions
Null intolerant expressions return NULL when one or more arguments of the expression are NULL ; most
expressions fall into this category.
Examples

> SELECT concat('John', null) AS expression_output;


expression_output
-----------------
null

> SELECT positive(null) AS expression_output;


expression_output
-----------------
null

> SELECT to_date(null) AS expression_output;


expression_output
-----------------
null

Expressions that can process null value operands


This class of expressions is designed to handle NULL values. The result of these expressions depends on the
expression itself. As an example, the function isnull returns true on NULL input and false on non-NULL
input, whereas the function coalesce returns the first non-NULL value in its list of operands. However,
coalesce returns NULL when all its operands are NULL . Below is an incomplete list of expressions of this
category.
COALESCE
NULLIF
IFNULL
NVL
NVL2
ISNAN
NANVL
ISNULL
ISNOTNULL
ATLEASTNNONNULLS
IN
Examples

> SELECT isnull(null) AS expression_output;


expression_output
-----------------
true

-- Returns the first occurrence of non `NULL` value.


> SELECT coalesce(null, null, 3, null) AS expression_output;
expression_output
-----------------
3

-- Returns `NULL` as all its operands are `NULL`.


> SELECT coalesce(null, null, null, null) AS expression_output;
expression_output
-----------------
null

> SELECT isnan(null) AS expression_output;


expression_output
-----------------
false
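
Two more functions from the list above behave similarly: nullif returns NULL when its arguments are equal, and nvl falls back to its second argument on NULL input. For example:

> SELECT nullif(2, 2) AS expression_output;
expression_output
-----------------
null

> SELECT nvl(null, 'fallback') AS expression_output;
expression_output
-----------------
fallback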

Built-in aggregate expressions


Aggregate functions compute a single result by processing a set of input rows. Below are the rules for how NULL
values are handled by aggregate functions.
NULL values are ignored by all aggregate functions.
The only exception to this rule is the COUNT(*) function.
Some aggregate functions return NULL when all input values are NULL or the input data set is empty. The
list of these functions is:
MAX
MIN
SUM
AVG
EVERY
ANY
SOME

Examples
-- `count(*)` does not skip `NULL` values.
> SELECT count(*) FROM person;
count(1)
--------
7

-- `NULL` values in column `age` are skipped from processing.


> SELECT count(age) FROM person;
count(age)
----------
5

-- `count(*)` on an empty input set returns 0. This is unlike the other
-- aggregate functions, such as `max`, which return `NULL`.
> SELECT count(*) FROM person where 1 = 0;
count(1)
--------
0

-- `NULL` values are excluded from computation of maximum value.


> SELECT max(age) FROM person;
max(age)
--------
50

-- `max` returns `NULL` on an empty input set.


> SELECT max(age) FROM person where 1 = 0;
max(age)
--------
null

Condition expressions in WHERE , HAVING , and JOIN clauses


The WHERE and HAVING operators filter rows based on a user-specified condition. A JOIN operator is used to combine
rows from two tables based on a join condition. For all three operators, a condition expression is a Boolean
expression and can return True , False , or Unknown (NULL) . The condition is “satisfied” if the result is
True .

Examples
-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
> SELECT * FROM person WHERE age > 0;
name age
-------- ---
Michelle 30
Fred 50
Mike 18
Dan 50
Joe 30

-- The `IS NULL` expression is used in a disjunction to select the persons
-- with unknown (`NULL`) ages.
> SELECT * FROM person WHERE age > 0 OR age IS NULL;
name age
-------- ----
Albert null
Michelle 30
Fred 50
Mike 18
Dan 50
Marry null
Joe 30

-- Persons with unknown (`NULL`) ages are skipped from processing.


> SELECT * FROM person GROUP BY age HAVING max(age) > 18;
age count(1)
--- --------
50 2
30 2

-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
> SELECT * FROM person p1, person p2
WHERE p1.age = p2.age
AND p1.name = p2.name;
name age name age
-------- --- -------- ---
Michelle 30 Michelle 30
Fred 50 Fred 50
Mike 18 Mike 18
Dan 50 Dan 50
Joe 30 Joe 30

-- The age column from both legs of join are compared using null-safe equal which
-- is why the persons with unknown age (`NULL`) are qualified by the join.
> SELECT * FROM person p1, person p2
WHERE p1.age <=> p2.age
AND p1.name = p2.name;
name age name age
-------- ---- -------- ----
Albert null Albert null
Michelle 30 Michelle 30
Fred 50 Fred 50
Mike 18 Mike 18
Dan 50 Dan 50
Marry null Marry null
Joe 30 Joe 30

Aggregate operators ( GROUP BY , DISTINCT )


As discussed in Comparison operators, two NULL values are not equal. However, for the purposes of grouping
and distinct processing, two or more NULL values are grouped into the same bucket. This
behavior conforms with the SQL standard and with other enterprise database management systems.
Examples

-- `NULL` values are put in one bucket in `GROUP BY` processing.


> SELECT age, count(*) FROM person GROUP BY age;
age count(1)
---- --------
null 2
50 2
30 2
18 1

-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
> SELECT DISTINCT age FROM person;
age
----
null
50
30
18

Sort operator ( ORDER BY clause)


Databricks Runtime supports a null ordering specification in the ORDER BY clause. Databricks Runtime processes the
ORDER BY clause by placing all NULL values first or last depending on the null ordering specification. By
default, all NULL values are placed first.
Examples
-- `NULL` values are shown first and the other values
-- are sorted in ascending order.
> SELECT age, name FROM person ORDER BY age;
age name
---- --------
null Marry
null Albert
18 Mike
30 Michelle
30 Joe
50 Fred
50 Dan

-- Column values other than `NULL` are sorted in ascending
-- order and `NULL` values are shown last.
> SELECT age, name FROM person ORDER BY age NULLS LAST;
age name
---- --------
18 Mike
30 Michelle
30 Joe
50 Dan
50 Fred
null Marry
null Albert

-- Column values other than `NULL` are sorted in descending
-- order and `NULL` values are shown last.
> SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
age name
---- --------
50 Fred
50 Dan
30 Michelle
30 Joe
18 Mike
null Marry
null Albert

Set operators ( UNION , INTERSECT , EXCEPT )


NULL values are compared in a null-safe manner for equality in the context of set operations. That means when
comparing rows, two NULL values are considered equal unlike the regular EqualTo ( = ) operator.
Examples
> CREATE VIEW unknown_age SELECT * FROM person WHERE age IS NULL;

-- Only common rows between the two legs of `INTERSECT` are in the
-- result set. The comparison between columns of the row is done
-- in a null-safe manner.
> SELECT name, age FROM person
INTERSECT
SELECT name, age from unknown_age;
name age
------ ----
Albert null
Marry null

-- `NULL` values from two legs of the `EXCEPT` are not in output.
-- This basically shows that the comparison happens in a null-safe manner.
> SELECT age, name FROM person
EXCEPT
SELECT age, name FROM unknown_age;
age name
--- --------
30 Joe
50 Fred
30 Michelle
18 Mike
50 Dan

-- Performs a `UNION` operation between two sets of data.
-- The comparison between columns of the row is done in
-- a null-safe manner.
> SELECT name, age FROM person
UNION
SELECT name, age FROM unknown_age;
name age
-------- ----
Albert null
Joe 30
Michelle 30
Marry null
Fred 50
Mike 18
Dan 50

EXISTS and NOT EXISTS subqueries


In Databricks Runtime, EXISTS and NOT EXISTS expressions are allowed inside a WHERE clause. These are
Boolean expressions that return either TRUE or FALSE . In other words, EXISTS is a membership condition and
returns TRUE when the subquery it refers to returns one or more rows. Similarly, NOT EXISTS is a non-
membership condition and returns TRUE when no rows are returned from the subquery.
These two expressions are not affected by presence of NULL in the result of the subquery. They are normally
faster because they can be converted to semijoins and anti-semijoins without special provisions for null
awareness.
Examples
-- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
> SELECT * FROM person WHERE EXISTS (SELECT null);
name age
-------- ----
Albert null
Michelle 30
Fred 50
Mike 18
Dan 50
Marry null
Joe 30

-- The `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
-- the subquery produces no rows. In this case, the subquery returns 1 row.
> SELECT * FROM person WHERE NOT EXISTS (SELECT null);
name age
---- ---

-- `NOT EXISTS` expression returns `TRUE`.


> SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
name age
-------- ----
Albert null
Michelle 30
Fred 50
Mike 18
Dan 50
Marry null
Joe 30

IN and NOT IN subqueries


In Databricks Runtime, IN and NOT IN expressions are allowed inside a WHERE clause of a query. Unlike the
EXISTS expression, an IN expression can return a TRUE , FALSE , or UNKNOWN (NULL) value. Conceptually, an IN
expression is semantically equivalent to a set of equality conditions separated by a disjunctive operator ( OR ). For
example, c1 IN (1, 2, 3) is semantically equivalent to (c1 = 1 OR c1 = 2 OR c1 = 3) .
As far as handling NULL values is concerned, the semantics can be deduced from the NULL value handling in
comparison operators ( = ) and logical operators ( OR ). To summarize, below are the rules for computing the
result of an IN expression.
TRUE is returned when the non-NULL value in question is found in the list
FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL
values
UNKNOWN is returned when the value is NULL , or the non-NULL value is not found in the list and the list
contains at least one NULL value

NOT IN always returns UNKNOWN when the list contains NULL , regardless of the input value. This is because
IN returns UNKNOWN if the value is not in the list containing NULL , and because NOT UNKNOWN is again UNKNOWN .
Examples
-- The subquery has only `NULL` value in its result set. Therefore,
-- the result of `IN` predicate is UNKNOWN.
> SELECT * FROM person WHERE age IN (SELECT null);
name age
---- ---

-- The subquery has `NULL` value in the result set as well as a valid
-- value `50`. Rows with age = 50 are returned.
> SELECT * FROM person
WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
name age
---- ---
Fred 50
Dan 50

-- Since subquery has `NULL` value in the result set, the `NOT IN`
-- predicate would return UNKNOWN. Hence, no rows are
-- qualified for this query.
> SELECT * FROM person
WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
name age
---- ---
Information schema (Databricks Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

The INFORMATION_SCHEMA is a SQL-standard-based, system-provided schema present in every catalog other than
the HIVE_METASTORE catalog.
Within the information schema you find a set of views describing the objects known to the schema’s catalog that
you are privileged to see. The information schema of the SYSTEM catalog returns information about objects
across all catalogs within the metastore.
The purpose of the information schema is to provide a SQL-based, self-describing API to the metadata.
Since: Databricks Runtime 10.2

Entity relationship diagram of the information schema


The following entity relationship (ER) diagram provides an overview of the relations within the information
schema and how they relate to each other.
Information schema views
NAME  DESCRIPTION

CATALOG_PRIVILEGES Lists principals which have privileges on the catalog.

CATALOGS Describes catalogs.

COLUMNS Describes columns of tables and views in the catalog.

CHECK_CONSTRAINTS Reserved for future use.

INFORMATION_SCHEMA_CATALOG_NAME Returns the name of this information schema’s catalog.

REFERENTIAL_CONSTRAINTS Reserved for future use.

TABLE_PRIVILEGES Lists principals which have privileges on the tables and views
in the catalog.

TABLES Describes tables and views defined within the catalog.

SCHEMA_PRIVILEGES Lists principals which have privileges on the schemas in the catalog.

SCHEMATA Describes schemas within the catalog.

VIEWS Describes view-specific information about the views in the catalog.

Notes
While identifiers are case insensitive when referenced in SQL statements, they are stored in the information
schema as STRING . This implies that you must either search for them using the case in which the identifier is
stored, or use case-insensitive functions such as ilike.
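
As a minimal sketch of this note (assuming, as is typical, that the identifier is stored in lower case), the same view name can be matched either with the stored case or case-insensitively with ilike:

-- Matches only the stored (lower-case) form of the identifier.
> SELECT table_name
  FROM information_schema.tables
  WHERE table_name = 'catalogs';

-- Matches regardless of the case used in the predicate.
> SELECT table_name
  FROM information_schema.tables
  WHERE table_name ILIKE 'CATALOGS';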

Examples
> SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type = 'DOUBLE'
AND table_schema = 'information_schema';

Related articles
SHOW
DESCRIBE
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
(Databricks Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Lists principals which have privileges on the catalogs.


Since: Databricks Runtime 10.2

Definition
The CATALOG_PRIVILEGES relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

GRANTOR STRING No Yes Principal that granted


the privilege.

GRANTEE STRING No Yes Principal to which the


privilege has been
granted.

CATALOG_NAME STRING No Yes Catalog on which


privilege has been
granted.

PRIVILEGE_TYPE STRING No Yes Privilege being


granted.

IS_GRANTABLE STRING No Yes Always NO .


Reserved for future
use.

Constraints
The following constraints apply to the CATALOG_PRIVILEGES relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key CATPRIVS_PK GRANTOR, GRANTEE , Unique identifier for the


CATALOG_NAME , privilege granted.
PRIVILEGE_TYPE

Foreign key CATPRIVS_CATS_FK CATALOG_NAME References CATALOGS

Examples
> SELECT catalog_name, grantee
FROM information_schema.catalog_privileges;

Related
Information schema
INFORMATION_SCHEMA.CATALOGS
SHOW GRANTS
INFORMATION_SCHEMA.CATALOGS (Databricks
Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Describes catalogs.
Information is displayed only for catalogs the user has permission to interact with.
Since: Databricks Runtime 10.2

Definition
The CATALOGS relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

TABLE_CATALOG STRING No Yes Name of the catalog.

COMMENT STRING Yes No An optional


comment that
describes the catalog.

CREATED TIMESTAMP No No Timestamp when the


catalog was created.

CREATED_BY STRING No No Principal who created


the catalog.

LAST_ALTERED TIMESTAMP No No Timestamp when the


catalog was last
altered in any way.

LAST_ALTERED_BY STRING No No Principal who last


altered the catalog.

Constraints
The following constraints apply to the CATALOGS relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key CATALOGS_PK CATALOG_NAME Unique identifier for the


catalog.

Examples
> SELECT catalog_owner
FROM information_schema.catalogs;
system

Related
Information schema
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.INFORMATION_SCHEMA_CATALOG_NAME
INFORMATION_SCHEMA.CHECK_CONSTRAINTS
(Databricks Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Reserved for future use.


This relation will describe check constraints defined on tables.
Since: Databricks Runtime 10.2

Definition
The CHECK_CONSTRAINTS relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

CONSTRAINT_CATALOG STRING No Yes Catalog containing


the check constraint.

CONSTRAINT_SCHEMA STRING No Yes Schema containing


the check constraint.

CONSTRAINT_NAME STRING No Yes Name of the check


constraint.

CHECK_CLAUSE STRING No Yes The text of the check


constraint condition.

SQL_PATH STRING No Yes Always NULL ,


reserved for future
use.

Constraints
The following constraints apply to the CHECK_CONSTRAINTS relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key CHK_CONS_PK CONSTRAINT_CATALOG , Unique identifier for the


CONSTRAINT_SCHEMA , constraint.
CONSTRAINT_NAME

Examples
> SELECT constraint_name, check_clause
FROM information_schema.check_constraints
WHERE constraint_schema = 'information_schema';

Related
Information schema
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.COLUMNS (Databricks
Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Describes columns of tables and views (relations) in the catalog.


The rows returned are limited to the relations the user is privileged to interact with.
Since: Databricks Runtime 10.2

Definition
The COLUMNS relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

TABLE_CATALOG STRING No Yes Catalog that contains


the relation.

TABLE_SCHEMA STRING No Yes Schema that contains


the relation.

TABLE_NAME STRING No Yes Name of the relation


the column is part of.

COLUMN_NAME STRING No Yes Name of the column.

ORDINAL_POSITION INTEGER No Yes The position


(numbered from 1 )
of the column within
the relation.

COLUMN_DEFAULT STRING No Yes Always NULL ,


reserved for future
use.

IS_NULLABLE STRING No Yes YES if column is


nullable, NO
otherwise.

DATA_TYPE STRING No Yes The simple data type


name of the column,
or STRUCT , or
ARRAY .

FULL_DATA_TYPE STRING No No The data type as


specified in the
column definition.

CHARACTER_MAXIMUM_LENGTH INTEGER Yes Yes Always NULL ,
reserved for future
use.

CHARACTER_OCTET_LENGTH STRING Yes Yes Always NULL ,


reserved for future
use.

NUMERIC_PRECISION INTEGER Yes Yes For base-2 integral


numeric types,
FLOAT , and
DOUBLE , the
number of supported
bits. For DECIMAL
the number of digits,
NULL otherwise.

NUMERIC_PRECISION_RADIX INTEGER Yes Yes For DECIMAL 10, for


all other numeric
types 2, NULL
otherwise.

NUMERIC_SCALE INTEGER Yes Yes For integral numeric


types 0, for
DECIMAL the
number of digits to
the right of the
decimal point, NULL
otherwise.

DATETIME_PRECISION INTEGER Yes Yes For DATE 0, for


TIMESTAMP , and
INTERVAL …
SECOND 3, any other
INTERVAL 0, NULL
otherwise.

INTERVAL_TYPE STRING Yes Yes For INTERVAL the


unit portion of the
interval, e.g.
'YEAR TO MONTH' ,
NULL otherwise.

INTERVAL_PRECISION INTEGER Yes Yes Always NULL ,


reserved for future
use.

MAXIMUM_CARDINALITY INTEGER Yes Yes Always NULL ,


reserved for future
use.

IS_IDENTITY STRING No Yes Always ‘NO’, reserved


for future use.

IDENTITY_GENERATION STRING Yes Yes Always NULL ,


reserved for future
use.

IDENTITY_START STRING Yes Yes Always NULL ,


reserved for future
use.

IDENTITY_INCREMENT STRING Yes Yes Always NULL ,


reserved for future
use.

IDENTITY_MAXIMUM STRING Yes Yes Always NULL ,


reserved for future
use.

IDENTITY_MINIMUM STRING Yes Yes Always NULL ,


reserved for future
use.

IDENTITY_CYCLE STRING Yes Yes Always NULL ,


reserved for future
use.

IS_GENERATED STRING Yes Yes Always NULL ,


reserved for future
use.

GENERATION_EXPRESSION STRING Yes Yes Always NULL ,


reserved for future
use.

IS_SYSTEM_TIME_PERIOD_START
STRING No Yes Always NO , reserved
for future use.

IS_SYSTEM_TIME_PERIOD_END
STRING No Yes Always NO , reserved
for future use.

SYSTEM_TIME_PERIOD_TIMESTAMP_GENERATION
STRING Yes Yes Always NULL ,
reserved for future
use.

IS_UPDATABLE STRING No Yes YES if column is


updatable, NO
otherwise.

PARTITION_ORDINAL_POSITION
INTEGER Yes No Position (numbered
from 1 ) of the
column in the
partition, NULL if
not a partitioning
column.

COMMENT STRING Yes No Optional description


of the column.

Constraints
The following constraints apply to the COLUMNS relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key COLUMNS_PK TABLE_CATALOG , Unique identifier for the


TABLE_SCHEMA , column.
TABLE_NAME ,
COLUMN_NAME

Unique key COLUMNS_UK TABLE_CATALOG , Unique identifier the


TABLE_SCHEMA , column.
TABLE_NAME ,
ORDINAL_POSITION )

Foreign key COLUMN_TABLES_FK TABLE_CATALOG , References TABLES.


TABLE_SCHEMA ,
TABLE_NAME

Examples
> SELECT ordinal_position, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'information_schema'
AND table_name = 'catalog_privileges'
ORDER BY ordinal_position;
1 grantor STRING
2 grantee STRING
3 catalog_name STRING
4 privilege_type STRING
5 is_grantable STRING

Related
DESCRIBE TABLE
Information schema
INFORMATION_SCHEMA.TABLES
SHOW COLUMNS
SHOW TABLE
SHOW TABLES
INFORMATION_SCHEMA.INFORMATION_SCHEMA_CATALOG_NAME
(Databricks Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Returns the name of this information schema’s catalog.


Since: Databricks Runtime 10.2

Definition
The INFORMATION_SCHEMA_CATALOG_NAME relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

CATALOG_NAME STRING No Yes The name of the


catalog containing
this information
schema.

Constraints
The following constraints apply to the INFORMATION_SCHEMA_CATALOG_NAME relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key INFOCAT_PK CATALOG_NAME Unique identifier the


catalog

Foreign key INFOCAT_CATS_FK CATALOG_NAME References CATALOGS.

Examples
> SELECT catalog_name
FROM information_schema.information_schema_catalog_name;
default

Related
Information schema
INFORMATION_SCHEMA.CATALOGS
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
(Databricks Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Reserved for future use.


This relation will describe referential integrity (RI) constraints defined on tables.
Since: Databricks Runtime 10.2

Definition
The REFERENTIAL_CONSTRAINTS relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

CONSTRAINT_CATALOG STRING No Yes Catalog containing


the foreign key
constraint.

CONSTRAINT_SCHEMA STRING No Yes Schema (database)


containing the
foreign key
constraints.

CONSTRAINT_NAME STRING No Yes Name of the foreign key constraint.

UNIQUE_CONSTRAINT_CATALOG
STRING No Yes Catalog containing
the referenced
constraint.

UNIQUE_CONSTRAINT_SCHEMA STRING No Yes Database (schema)
containing the
referenced constraint.

UNIQUE_CONSTRAINT_NAME STRING No Yes Name of the


referenced constraint.

MATCH_OPTION STRING No Yes Always FULL ,


reserved for future
use.

UPDATE_RULE STRING No Yes Always NO ACTION ,


reserved for future
use.

DELETE_RULE STRING No Yes Always NO ACTION ,


reserved for future
use.

Constraints
The following constraints apply to the REFERENTIAL_CONSTRAINTS relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key REF_CONS_PK CONSTRAINT_CATALOG , Unique identifier for the


CONSTRAINT_SCHEMA , constraint.
CONSTRAINT_NAME

Examples
> SELECT constraint_name, unique_constraint_name
FROM information_schema.referential_constraints
WHERE constraint_schema = 'information_schema';

Related
Information schema
INFORMATION_SCHEMA.CHECK_CONSTRAINTS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
(Databricks Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Lists principals which have privileges on the schemas in the catalog.


The rows returned are limited to the schemas the user is privileged to interact with.
Since: Databricks Runtime 10.2

Definition
The SCHEMA_PRIVILEGES relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

GRANTOR STRING No Yes Principal that granted


the privilege.

GRANTEE STRING No Yes Principal to which the


privilege has been
granted.

CATALOG_NAME STRING No Yes Catalog of schema


on which privilege
has been granted.

SCHEMA_NAME STRING No Yes Schema on which


privilege has been
granted.

PRIVILEGE_TYPE STRING No Yes Privilege being


granted.

IS_GRANTABLE STRING No Yes Always NO .


Reserved for future
use.

Constraints
The following constraints apply to the SCHEMA_PRIVILEGES relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key SCHEMAPRIVS_PK GRANTOR , GRANTEE , Unique identifier for the


CATALOG_NAME , privilege granted.
SCHEMA_NAME ,
PRIVILEGE_TYPE

Foreign key SCHEMAPRIVS_SCHEMATA_FK CATALOG_NAME , References SCHEMATA


SCHEMA_NAME

Examples
> SELECT catalog_name, schema_name, grantee
FROM information_schema.schema_privileges;

Related
Information schema
INFORMATION_SCHEMA.SCHEMATA
SHOW GRANTS
INFORMATION_SCHEMA.SCHEMATA (Databricks
Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Describes schemas within the catalog.


The rows returned are limited to the schemas the user has permission to interact with.
Since: Databricks Runtime 10.2

Definition
The SCHEMATA relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

CATALOG_NAME STRING No Yes Catalog containing


the schema.

SCHEMA_NAME STRING No Yes Name of the schema.

SCHEMA_OWNER STRING No No User or group


(principal) that
currently owns the
schema.

COMMENT STRING Yes No An optional


comment that
describes the
relation.

CREATED TIMESTAMP No No Timestamp when the


relation was created.

CREATED_BY STRING No No Principal which


created the relation.

LAST_ALTERED TIMESTAMP No No Timestamp when the


relation definition
was last altered in
any way.

LAST_ALTERED_BY STRING No No Principal which last


altered the relation.

Constraints
The following constraints apply to the SCHEMATA relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key SCHEMATA_PK CATALOG_NAME , Unique identifier for the


SCHEMA_NAME schema.

Foreign key SCHEMATA_CATS_FK CATALOG_NAME References CATALOGS.

Examples
> SELECT schema_owner
FROM information_schema.schemata
WHERE schema_name = 'default';
system

Related
DESCRIBE DATABASE
Information schema
INFORMATION_SCHEMA.CATALOGS
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
SHOW DATABASES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
(Databricks Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Lists principals which have privileges on the tables and views in the catalog.


Since: Databricks Runtime 10.2

Definition
The TABLE_PRIVILEGES relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

GRANTOR STRING No Yes Principal that granted


the privilege.

GRANTEE STRING No Yes Principal to which the


privilege has been
granted.

TABLE_CATALOG STRING No Yes Catalog of relation


on which privilege
has been granted.

TABLE_SCHEMA STRING No Yes Schema of relation


on which privilege
has been granted.

TABLE_NAME STRING No Yes Relation on which


privilege has been
granted.

PRIVILEGE_TYPE STRING No Yes Privilege being


granted.

IS_GRANTABLE STRING No Yes Always NO .


Reserved for future
use.

Constraints
The following constraints apply to the TABLE_PRIVILEGES relation:
CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key TABLEPRIVS_PK GRANTOR, GRANTEE , Unique identifier for the


TABLE_CATALOG , privilege granted.
TABLE_SCHEMA ,
TABLE_NAME ,
PRIVILEGE_TYPE

Foreign key TABLEPRIVS_TABLES_FK TABLE_CATALOG , References TABLES


TABLE_SCHEMA ,
TABLE_NAME

Examples
> SELECT table_catalog, table_schema, table_name, grantee
FROM information_schema.table_privileges;

Related
Information schema
INFORMATION_SCHEMA.TABLES
SHOW GRANTS
INFORMATION_SCHEMA.TABLES (Databricks
Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Contains the object-level metadata for tables and views (relations) within the local catalog, or across all catalogs if
owned by the SYSTEM catalog.
The rows returned are limited to the relations the user is privileged to interact with.
Since: Databricks Runtime 10.2

Definition
The TABLES relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

TABLE_CATALOG STRING No Yes Catalog that contains


the relation.

TABLE_SCHEMA STRING No Yes Schema that contains


the relation.

TABLE_NAME STRING No Yes Name of the relation.

TABLE_TYPE STRING No Yes One of


'BASE TABLE' ,
'VIEW' .

IS_INSERTABLE_INTO STRING No Yes 'YES' if the relation


can be inserted into,
'NO' otherwise.

COMMIT_ACTION STRING No Yes Always 'PRESERVE' .


Reserved for future
use.

TABLE_OWNER STRING No No User or group


(principal) currently
owning the relation.

COMMENT STRING Yes No An optional


comment that
describes the
relation.

CREATED TIMESTAMP No No Timestamp when the


relation was created.

CREATED_BY STRING No No Principal which


created the relation.

LAST_ALTERED TIMESTAMP No No Timestamp when the


relation definition
was last altered in
any way.

LAST_ALTERED_BY STRING No No Principal which last


altered the relation.

DATA_SOURCE_FORMAT STRING No No Format of the data


source such as
PARQUET , or CSV .

STORAGE_SUB_DIRECTORY STRING Yes No Path to the storage


of an external table,
NULL otherwise.

Constraints
The following constraints apply to the TABLES relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key TABLES_PK TABLE_CATALOG , Unique identifier for the


TABLE_SCHEMA , relation.
TABLE_NAME

Foreign key TABLES_SCHEMATA_FK TABLE_CATALOG , References SCHEMATA.


TABLE_SCHEMA

Examples
> SELECT table_owner
FROM information_schema.tables
WHERE table_schema = 'information_schema'
AND table_name = 'columns';
system

Related
DESCRIBE TABLE
Information schema
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.TABLE_PRIVILEGES
INFORMATION_SCHEMA.VIEWS
SHOW CREATE TABLE
SHOW PARTITIONS
SHOW TABLE
SHOW TABLES
SHOW TBLPROPERTIES
INFORMATION_SCHEMA.VIEWS (Databricks
Runtime)
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Describes view specific information about views in the catalog.


The rows returned are limited to the views the user is privileged to interact with.
Since: Databricks Runtime 10.2

Definition
The VIEWS relation contains the following columns:

NAME  DATA TYPE  NULLABLE  STANDARD  DESCRIPTION

TABLE_CATALOG STRING No Yes Catalog containing


the view.

TABLE_SCHEMA STRING No Yes Schema containing


the view.

TABLE_NAME STRING No Yes Name of the relation.

VIEW_DEFINITION STRING Yes Yes The view text if the


user owns the view,
NULL otherwise.

CHECK_OPTION STRING No Yes Always 'NONE' .


Reserved for future
use.

IS_UPDATABLE STRING No Yes Always NO .


Reserved for future
use

IS_INSERTABLE_INTO STRING No Yes Always NO .


Reserved for future
use.

SQL_PATH STRING Yes Yes Always NULL .


Reserved for future
use.

Constraints
The following constraints apply to the VIEWS relation:

CLASS  NAME  COLUMN LIST  DESCRIPTION

Primary key VIEWS_PK TABLE_CATALOG , Unique identifier for the


TABLE_SCHEMA , view.
TABLE_NAME

Foreign key VIEWS_TABLES_FK TABLE_CATALOG , References TABLES.


TABLE_SCHEMA ,
TABLE_NAME

Examples
> SELECT is_insertable_into
FROM information_schema.views
WHERE table_schema = 'information_schema'
AND table_name = 'columns';
NO

Related
DESCRIBE TABLE
Information schema
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
SHOW CREATE TABLE
SHOW VIEWS
ANSI compliance
7/21/2022 • 16 minutes to read

Spark SQL has two options to support compliance with the ANSI SQL standard: spark.sql.ansi.enabled and
spark.sql.storeAssignmentPolicy .

When spark.sql.ansi.enabled is set to true , Spark SQL uses an ANSI compliant dialect instead of being Hive
compliant. For example, Spark will throw an exception at runtime instead of returning null results if the inputs to
a SQL operator/function are invalid. Some ANSI dialect features may not be from the ANSI SQL standard
directly, but their behaviors align with ANSI SQL’s style.
Moreover, Spark SQL has an independent option to control implicit casting behavior when inserting rows into a
table. The casting behaviors are defined as store assignment rules in the standard.
When spark.sql.storeAssignmentPolicy is set to ANSI , Spark SQL complies with the ANSI store assignment
rules. This is a separate configuration because its default value is ANSI , while the configuration
spark.sql.ansi.enabled is disabled by default.
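
For example, both options can be set for the current session with SET (a minimal sketch; the values shown are the ones discussed above):

-- Opt in to the ANSI dialect for this session.
> SET spark.sql.ansi.enabled=true;

-- Keep the default ANSI store assignment policy (set here explicitly for illustration).
> SET spark.sql.storeAssignmentPolicy=ANSI;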

The following table summarizes the behavior:

PROPERTY NAME  DEFAULT  MEANING

spark.sql.ansi.enabled false When true, Spark attempts to conform


to the ANSI SQL specification:

* Throws a runtime exception if an


overflow occurs in any operation on an
integer or decimal field.
* Forbids using the reserved keywords
of ANSI SQL as identifiers in the SQL
parser.

spark.sql.storeAssignmentPolicy ANSI When inserting a value into a column


with a different data type, Spark
performs type conversion. There are
three policies for the type coercion
rules: ANSI , legacy , and strict .

* ANSI : Spark performs the type


coercion as per ANSI SQL. In practice,
the behavior is mostly the same as
PostgreSQL. It disallows certain
unreasonable type conversions such as
converting string to int or double to
boolean.
* legacy : Spark allows the type
coercion as long as it is a valid Cast,
which is very loose. For example,
converting string to int or double to
boolean is allowed. It is also the only
behavior in Spark 2.x and it is
compatible with Hive.
* strict : Spark doesn’t allow any
possible precision loss or data
truncation in type coercion, for
example, converting double to int or
decimal to double is not allowed.

The following subsections present behavior changes in arithmetic operations, type conversions, and SQL
parsing when ANSI mode is enabled. There are three kinds of type conversions in Spark SQL, and this
article introduces them one by one: cast, store assignment, and type coercion.

Arithmetic operations
In Spark SQL, arithmetic operations performed on numeric types (with the exception of decimal) are not
checked for overflows by default. This means that if an operation causes an overflow, the result is the same
as with the corresponding operation in a Java or Scala program (for example, if the sum of 2 integers is higher
than the maximum representable value, the result is a negative number). On the other hand, Spark SQL returns
null for decimal overflows. When spark.sql.ansi.enabled is set to true and an overflow occurs in numeric and
interval arithmetic operations, it throws an arithmetic exception at runtime.

-- `spark.sql.ansi.enabled=true`
> SELECT 2147483647 + 1;
error: integer overflow

-- `spark.sql.ansi.enabled=false`
> SELECT 2147483647 + 1;
-2147483648

Cast
When spark.sql.ansi.enabled is set to true , explicit casting by CAST syntax throws a runtime exception for
illegal cast patterns defined in the standard, such as casts from a string to an integer.
The CAST clause of Spark ANSI mode follows the syntax rules of section 6.13 “cast specification” in ISO/IEC
9075-2:2011 Information technology — Database languages - SQL — Part 2: Foundation (SQL/Foundation),
except it specially allows the following straightforward type conversions which are disallowed as per the ANSI
standard:
NumericType <=> BooleanType
StringType <=> BinaryType
The valid combinations of source and target data type in a CAST expression are given by the following table. “Y”
indicates that the combination is syntactically valid without restriction and “N” indicates that the combination is
not valid.

SOURCE\TARGET  NUMERIC  STRING  DATE  TIMESTAMP  INTERVAL  BOOLEAN  BINARY  ARRAY  MAP  STRUCT

Numeric        Y        Y       N     N          N         Y        N       N      N    N
String         Y        Y       Y     Y          Y         Y        Y       N      N    N
Date           N        Y       Y     Y          N         N        N       N      N    N
Timestamp      N        Y       Y     Y          N         N        N       N      N    N
Interval       N        Y       N     N          Y         N        N       N      N    N
Boolean        Y        Y       N     N          N         Y        N       N      N    N
Binary         Y        N       N     N          N         N        Y       N      N    N
Array          N        N       N     N          N         N        N       Y      N    N
Map            N        N       N     N          N         N        N       N      Y    N
Struct         N        N       N     N          N         N        N       N      N    Y
-- Examples of explicit casting

-- `spark.sql.ansi.enabled=true`
> SELECT CAST('a' AS INT);
error: invalid input syntax for type numeric: a

> SELECT CAST(2147483648L AS INT);


error: Casting 2147483648 to int causes overflow

> SELECT CAST(DATE'2020-01-01' AS INT)


error: cannot resolve 'CAST(DATE '2020-01-01' AS INT)' due to data type mismatch: cannot cast date to int.
To convert values from date to int, you can use function UNIX_DATE instead.

-- `spark.sql.ansi.enabled=false` (This is a default behavior)


> SELECT cast('a' AS INT);
null

> SELECT CAST(2147483648L AS INT);


-2147483648

> SELECT CAST(DATE'2020-01-01' AS INT);


null

-- Examples of store assignment rules


> CREATE TABLE t (v INT);

-- `spark.sql.storeAssignmentPolicy=ANSI`
> INSERT INTO t VALUES ('1');
error: Cannot write incompatible data to table '`default`.`t`':
- Cannot safely cast 'v': string to int;

-- `spark.sql.storeAssignmentPolicy=LEGACY` (This is a legacy behavior until Spark 2.x)


> INSERT INTO t VALUES ('1');
> SELECT * FROM t;
1

Store assignment
As mentioned at the beginning, when spark.sql.storeAssignmentPolicy is set to ANSI (which is the default
value), Spark SQL complies with the ANSI store assignment rules on table insertions. The valid combinations of
source and target data type in table insertions are given by the following table.

SOURCE\TARGET  NUMERIC  STRING  DATE  TIMESTAMP  INTERVAL  BOOLEAN  BINARY  ARRAY  MAP  STRUCT

Numeric        Y        Y       N     N          N         N        N       N      N    N
String         N        Y       N     N          N         N        N       N      N    N
Date           N        Y       Y     Y          N         N        N       N      N    N
Timestamp      N        Y       Y     Y          N         N        N       N      N    N
Interval       N        Y       N     N          Y         N        N       N      N    N
Boolean        N        Y       N     N          N         Y        N       N      N    N
Binary         N        Y       N     N          N         N        Y       N      N    N
Array          N        N       N     N          N         N        N       Y*     N    N
Map            N        N       N     N          N         N        N       N      Y*   N
Struct         N        N       N     N          N         N        N       N      N    Y*

* For Array/Map/Struct types, the data type check rule applies recursively to their component elements.
During table insertion, Spark will throw exception on numeric value overflow.

> CREATE TABLE test(i INT);


> INSERT INTO test VALUES (2147483648L);
error: Casting 2147483648 to int causes overflow

Type coercion
Type Promotion and Precedence
When spark.sql.ansi.enabled is set to true , Spark SQL uses several rules that govern how conflicts between
data types are resolved. At the heart of this conflict resolution is the Type Precedence List which defines whether
values of a given data type can be promoted to another data type implicitly.

DATA TYPE  PRECEDENCE LIST (FROM NARROWEST TO WIDEST)

Byte Byte -> Short -> Int -> Long -> Decimal -> Float* ->
Double

Short Short -> Int -> Long -> Decimal-> Float* -> Double

Int Int -> Long -> Decimal -> Float* -> Double

Long Long -> Decimal -> Float* -> Double

Decimal Decimal -> Float* -> Double

Float Float -> Double

Double Double

Date Date -> Timestamp

Timestamp Timestamp

String String

Binary Binary

Boolean Boolean

Interval Interval

Map Map**

Array Array**

Struct Struct**

* For least common type resolution, float is skipped to avoid loss of precision.
** For a complex type, the precedence rule applies recursively to its component elements.
Special rules apply for the String type and untyped NULL. A NULL can be promoted to any other type, while a
String can be promoted to any simple data type.
This is a graphical depiction of the precedence list as a directed tree:

Least Common Type Resolution


The least common type from a set of types is the narrowest type reachable from the precedence list by all
elements of the set of types.
The least common type resolution is used to:
Decide whether a function expecting a parameter of a type can be invoked using an argument of a narrower
type.
Derive the argument type for functions which expect a shared argument type for multiple parameters, such
as coalesce, least, or greatest.
Derive the operand types for operators such as arithmetic operations or comparisons.
Derive the result type for expressions such as the case expression.
Derive the element, key, or value types for array and map constructors.
Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is
INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.

-- The coalesce function accepts any set of argument types as long as they share a least common type.
-- The result type is the least common type of the arguments.
> SET spark.sql.ansi.enabled=true;

> SELECT typeof(coalesce(1Y, 1L, NULL));


BIGINT

> SELECT typeof(coalesce(1, DATE'2020-01-01'));


Error: Incompatible types [INT, DATE]

> SELECT typeof(coalesce(ARRAY(1Y), ARRAY(1L)));


ARRAY<BIGINT>

> SELECT typeof(coalesce(1, 1F));


DOUBLE

> SELECT typeof(coalesce(1L, 1F));


DOUBLE

> SELECT (typeof(coalesce(1BD, 1F)));


DOUBLE

-- The substring function expects arguments of type INT for the start and length parameters.
> SELECT substring('hello', 1Y, 2);
he

> SELECT substring('hello', '1', 2);


he

> SELECT substring('hello', 1L, 2);


Error: Argument 2 requires an INT type.

> SELECT substring('hello', str, 2) FROM VALUES(CAST('1' AS STRING)) AS T(str);


Error: Argument 2 requires an INT type.

SQL functions
The behavior of some SQL functions can be different under ANSI mode ( spark.sql.ansi.enabled=true ).
size : This function returns null for null input under ANSI mode.
element_at :
This function throws ArrayIndexOutOfBoundsException if using invalid indices.
This function throws NoSuchElementException if key does not exist in map.
elt : This function throws ArrayIndexOutOfBoundsException if using invalid indices.
make_date : This function fails with an exception if the result date is invalid.
make_timestamp : This function fails with an exception if the result timestamp is invalid.
make_interval : This function fails with an exception if the result interval is invalid.
next_day : This function throws IllegalArgumentException if input is not a valid day of week.
parse_url : This function throws IllegalArgumentException if an input string is not a valid url.
to_date : This function fails with an exception if the input string can’t be parsed, or the pattern string is
invalid.
to_timestamp : This function fails with an exception if the input string can’t be parsed, or the pattern string is
invalid.
to_unix_timestamp : This function fails with an exception if the input string can’t be parsed, or the pattern
string is invalid.
unix_timestamp : This function fails with an exception if the input string can’t be parsed, or the pattern string
is invalid.
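
For example (a hedged sketch; the exact error text varies by runtime version), make_date returns NULL with ANSI mode off but fails with ANSI mode on:

-- `spark.sql.ansi.enabled=false`
> SELECT make_date(2020, 13, 1);
null

-- `spark.sql.ansi.enabled=true`
> SELECT make_date(2020, 13, 1);
error: Invalid value for MonthOfYear (valid values 1 - 12): 13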

SQL operators
The behavior of some SQL operators can be different under ANSI mode ( spark.sql.ansi.enabled=true ).
array_col[index] : This operator throws ArrayIndexOutOfBoundsException if using invalid indices.
map_col[key] : This operator throws NoSuchElementException if key does not exist in map.
CAST(string_col AS TIMESTAMP) : This operator fails with an exception if the input string can’t be parsed.
CAST(string_col AS DATE) : This operator fails with an exception if the input string can’t be parsed.
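
A brief hedged sketch of two of these operators (error messages are abbreviated and may differ between releases):

-- `spark.sql.ansi.enabled=true`
> SELECT map('a', 1)['b'];
error: NoSuchElementException: key not found: b

> SELECT CAST('not-a-date' AS DATE);
error: Cannot cast 'not-a-date' to DateType

-- `spark.sql.ansi.enabled=false`
> SELECT CAST('not-a-date' AS DATE);
null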

Useful Functions for ANSI Mode


When ANSI mode is on, it throws exceptions for invalid operations. You can use the following SQL functions to
suppress such exceptions.
try_cast : identical to CAST , except that it returns a NULL result instead of throwing an exception on a runtime
error.
try_add : identical to the add operator + , except that it returns a NULL result instead of throwing an exception
on integral value overflow.
try_divide : identical to the division operator / , except that it returns a NULL result instead of throwing an
exception on division by 0.
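
A minimal sketch of these functions: where the corresponding plain operation would raise an exception under ANSI mode, the try_* variant returns NULL instead.

-- `spark.sql.ansi.enabled=true`
> SELECT try_cast('a' AS INT);
null

> SELECT try_add(2147483647, 1);
null

> SELECT try_divide(10, 0);
null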

SQL keywords
When spark.sql.ansi.enabled is true, Spark SQL will use the ANSI mode parser. In this mode, Spark SQL has
two kinds of keywords:
Reserved keywords: Keywords that are reserved and can’t be used as identifiers for table, view, column,
function, alias, etc.
Non-reserved keywords: Keywords that have a special meaning only in particular contexts and can be used
as identifiers in other contexts. For example, EXPLAIN SELECT ... is a command, but EXPLAIN can be used as
an identifier in other places.
When the ANSI mode is disabled, Spark SQL has two kinds of keywords:
Non-reserved keywords: Same definition as the one when the ANSI mode enabled.
Strict-non-reserved keywords: A strict version of non-reserved keywords, which cannot be used as a table
alias.
By default, spark.sql.ansi.enabled is false.
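
As a hedged sketch of the difference (the table and column names here are hypothetical and the parser error is abbreviated), a keyword such as SELECT can be used as a column name only when it is non-reserved, that is, when ANSI mode is disabled:

-- `spark.sql.ansi.enabled=false`: SELECT is non-reserved and works as a column name.
> CREATE TABLE kw_demo (select INT);

-- `spark.sql.ansi.enabled=true`: SELECT is reserved and the statement fails to parse.
> CREATE TABLE kw_demo (select INT);
error: no viable alternative at input 'select'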
Below is a list of all the keywords in Spark SQL.

KEYWORD  SPARK SQL ANSI MODE  SPARK SQL DEFAULT MODE  SQL-2016

ADD non-reserved non-reserved non-reserved

AFTER non-reserved non-reserved non-reserved

ALL reserved non-reserved reserved



ALTER non-reserved non-reserved reserved

ALWAYS non-reserved non-reserved non-reserved

ANALYZE non-reserved non-reserved non-reserved

AND reserved non-reserved reserved

ANTI non-reserved strict-non-reserved non-reserved

ANY reserved non-reserved reserved

ARCHIVE non-reserved non-reserved non-reserved

ARRAY non-reserved non-reserved reserved

AS reserved non-reserved reserved

ASC non-reserved non-reserved non-reserved

AT non-reserved non-reserved reserved

AUTHORIZATION reserved non-reserved reserved

BETWEEN non-reserved non-reserved reserved

BOTH reserved non-reserved reserved

BUCKET non-reserved non-reserved non-reserved

BUCKETS non-reserved non-reserved non-reserved

BY non-reserved non-reserved reserved

CACHE non-reserved non-reserved non-reserved

CASCADE non-reserved non-reserved non-reserved

CASE reserved non-reserved reserved

CAST reserved non-reserved reserved

CHANGE non-reserved non-reserved non-reserved

CHECK reserved non-reserved reserved

CLEAR non-reserved non-reserved non-reserved

CLUSTER non-reserved non-reserved non-reserved



CLUSTERED non-reserved non-reserved non-reserved

CODEGEN non-reserved non-reserved non-reserved

COLLATE reserved non-reserved reserved

COLLECTION non-reserved non-reserved non-reserved

COLUMN reserved non-reserved reserved

COLUMNS non-reserved non-reserved non-reserved

COMMENT non-reserved non-reserved non-reserved

COMMIT non-reserved non-reserved reserved

COMPACT non-reserved non-reserved non-reserved

COMPACTIONS non-reserved non-reserved non-reserved

COMPUTE non-reserved non-reserved non-reserved

CONCATENATE non-reserved non-reserved non-reserved

CONSTRAINT reserved non-reserved reserved

COST non-reserved non-reserved non-reserved

CREATE reserved non-reserved reserved

CROSS reserved strict-non-reserved reserved

CUBE non-reserved non-reserved reserved

CURRENT non-reserved non-reserved reserved

CURRENT_DATE reserved non-reserved reserved

CURRENT_TIME reserved non-reserved reserved

CURRENT_TIMESTAMP reserved non-reserved reserved

CURRENT_USER reserved non-reserved reserved

DATA non-reserved non-reserved non-reserved

DATABASE non-reserved non-reserved non-reserved

DATABASES non-reserved non-reserved non-reserved



DAY non-reserved non-reserved non-reserved

DBPROPERTIES non-reserved non-reserved non-reserved

DEFINED non-reserved non-reserved non-reserved

DELETE non-reserved non-reserved reserved

DELIMITED non-reserved non-reserved non-reserved

DESC non-reserved non-reserved non-reserved

DESCRIBE non-reserved non-reserved reserved

DFS non-reserved non-reserved non-reserved

DIRECTORIES non-reserved non-reserved non-reserved

DIRECTORY non-reserved non-reserved non-reserved

DISTINCT reserved non-reserved reserved

DISTRIBUTE non-reserved non-reserved non-reserved

DIV non-reserved non-reserved not a keyword

DROP non-reserved non-reserved reserved

ELSE reserved non-reserved reserved

END reserved non-reserved reserved

ESCAPE reserved non-reserved reserved

ESCAPED non-reserved non-reserved non-reserved

EXCEPT reserved strict-non-reserved reserved

EXCHANGE non-reserved non-reserved non-reserved

EXISTS non-reserved non-reserved reserved

EXPLAIN non-reserved non-reserved non-reserved

EXPORT non-reserved non-reserved non-reserved

EXTENDED non-reserved non-reserved non-reserved

EXTERNAL non-reserved non-reserved reserved



EXTRACT non-reserved non-reserved reserved

FALSE reserved non-reserved reserved

FETCH reserved non-reserved reserved

FIELDS non-reserved non-reserved non-reserved

FILTER reserved non-reserved reserved

FILEFORMAT non-reserved non-reserved non-reserved

FIRST non-reserved non-reserved non-reserved

FN non-reserved non-reserved non-reserved

FOLLOWING non-reserved non-reserved non-reserved

FOR reserved non-reserved reserved

FOREIGN reserved non-reserved reserved

FORMAT non-reserved non-reserved non-reserved

FORMATTED non-reserved non-reserved non-reserved

FROM reserved non-reserved reserved

FULL reserved strict-non-reserved reserved

FUNCTION non-reserved non-reserved reserved

FUNCTIONS non-reserved non-reserved non-reserved

GENERATED non-reserved non-reserved non-reserved

GLOBAL non-reserved non-reserved reserved

GRANT reserved non-reserved reserved

GRANTS non-reserved non-reserved non-reserved

GROUP reserved non-reserved reserved

GROUPING non-reserved non-reserved reserved

HAVING reserved non-reserved reserved

HOUR non-reserved non-reserved non-reserved



IF non-reserved non-reserved not a keyword

IGNORE non-reserved non-reserved non-reserved

IMPORT non-reserved non-reserved non-reserved

IN reserved non-reserved reserved

INDEX non-reserved non-reserved non-reserved

INDEXES non-reserved non-reserved non-reserved

INNER reserved strict-non-reserved reserved

INPATH non-reserved non-reserved non-reserved

INPUTFORMAT non-reserved non-reserved non-reserved

INSERT non-reserved non-reserved reserved

INTERSECT reserved strict-non-reserved reserved

INTERVAL non-reserved non-reserved reserved

INTO reserved non-reserved reserved

IS reserved non-reserved reserved

ITEMS non-reserved non-reserved non-reserved

JOIN reserved strict-non-reserved reserved

KEY non-reserved non-reserved non-reserved

KEYS non-reserved non-reserved non-reserved

LAST non-reserved non-reserved non-reserved

LATERAL reserved strict-non-reserved reserved

LAZY non-reserved non-reserved non-reserved

LEADING reserved non-reserved reserved

LEFT reserved strict-non-reserved reserved

LIKE non-reserved non-reserved reserved

ILIKE non-reserved non-reserved non-reserved



LIMIT non-reserved non-reserved non-reserved

LINES non-reserved non-reserved non-reserved

LIST non-reserved non-reserved non-reserved

LOAD non-reserved non-reserved non-reserved

LOCAL non-reserved non-reserved reserved

LOCATION non-reserved non-reserved non-reserved

LOCK non-reserved non-reserved non-reserved

LOCKS non-reserved non-reserved non-reserved

LOGICAL non-reserved non-reserved non-reserved

MACRO non-reserved non-reserved non-reserved

MAP non-reserved non-reserved non-reserved

MATCHED non-reserved non-reserved non-reserved

MERGE non-reserved non-reserved non-reserved

MINUTE non-reserved non-reserved non-reserved

MINUS non-reserved strict-non-reserved non-reserved

MONTH non-reserved non-reserved non-reserved

MSCK non-reserved non-reserved non-reserved

NAMESPACE non-reserved non-reserved non-reserved

NAMESPACES non-reserved non-reserved non-reserved

NATURAL reserved strict-non-reserved reserved

NO non-reserved non-reserved reserved

NOT reserved non-reserved reserved

NULL reserved non-reserved reserved

NULLS non-reserved non-reserved non-reserved

OF non-reserved non-reserved reserved



ON reserved strict-non-reserved reserved

ONLY reserved non-reserved reserved

OPTION non-reserved non-reserved non-reserved

OPTIONS non-reserved non-reserved non-reserved

OR reserved non-reserved reserved

ORDER reserved non-reserved reserved

OUT non-reserved non-reserved reserved

OUTER reserved non-reserved reserved

OUTPUTFORMAT non-reserved non-reserved non-reserved

OVER non-reserved non-reserved non-reserved

OVERLAPS reserved non-reserved reserved

OVERLAY non-reserved non-reserved non-reserved

OVERWRITE non-reserved non-reserved non-reserved

PARTITION non-reserved non-reserved reserved

PARTITIONED non-reserved non-reserved non-reserved

PARTITIONS non-reserved non-reserved non-reserved

PERCENT non-reserved non-reserved non-reserved

PIVOT non-reserved non-reserved non-reserved

PLACING non-reserved non-reserved non-reserved

POSITION non-reserved non-reserved reserved

PRECEDING non-reserved non-reserved non-reserved

PRIMARY reserved non-reserved reserved

PRINCIPALS non-reserved non-reserved non-reserved

PROPERTIES non-reserved non-reserved non-reserved

PURGE non-reserved non-reserved non-reserved



QUALIFY reserved non-reserved reserved

QUERY non-reserved non-reserved non-reserved

RANGE non-reserved non-reserved reserved

RECIPIENT non-reserved non-reserved non-reserved

RECIPIENTS non-reserved non-reserved non-reserved

RECORDREADER non-reserved non-reserved non-reserved

RECORDWRITER non-reserved non-reserved non-reserved

RECOVER non-reserved non-reserved non-reserved

REDUCE non-reserved non-reserved non-reserved

REFERENCES reserved non-reserved reserved

REFRESH non-reserved non-reserved non-reserved

REGEXP non-reserved non-reserved not a keyword

REMOVE non-reserved non-reserved non-reserved

RENAME non-reserved non-reserved non-reserved

REPAIR non-reserved non-reserved non-reserved

REPLACE non-reserved non-reserved non-reserved

RESET non-reserved non-reserved non-reserved

RESPECT non-reserved non-reserved non-reserved

RESTRICT non-reserved non-reserved non-reserved

REVOKE non-reserved non-reserved reserved

RIGHT reserved strict-non-reserved reserved

RLIKE non-reserved non-reserved non-reserved

ROLE non-reserved non-reserved non-reserved

ROLES non-reserved non-reserved non-reserved

ROLLBACK non-reserved non-reserved reserved



ROLLUP non-reserved non-reserved reserved

ROW non-reserved non-reserved reserved

ROWS non-reserved non-reserved reserved

SCHEMA non-reserved non-reserved non-reserved

SCHEMAS non-reserved non-reserved not a keyword

SECOND non-reserved non-reserved non-reserved

SELECT reserved non-reserved reserved

SEMI non-reserved strict-non-reserved non-reserved

SEPARATED non-reserved non-reserved non-reserved

SERDE non-reserved non-reserved non-reserved

SERDEPROPERTIES non-reserved non-reserved non-reserved

SESSION_USER reserved non-reserved reserved

SET non-reserved non-reserved reserved

SETS non-reserved non-reserved non-reserved

SHARE non-reserved non-reserved non-reserved

SHARES non-reserved non-reserved non-reserved

SHOW non-reserved non-reserved non-reserved

SKEWED non-reserved non-reserved non-reserved

SOME reserved non-reserved reserved

SORT non-reserved non-reserved non-reserved

SORTED non-reserved non-reserved non-reserved

START non-reserved non-reserved reserved

STATISTICS non-reserved non-reserved non-reserved

STORED non-reserved non-reserved non-reserved

STRATIFY non-reserved non-reserved non-reserved



STRUCT non-reserved non-reserved non-reserved

SUBSTR non-reserved non-reserved non-reserved

SUBSTRING non-reserved non-reserved non-reserved

SYNC non-reserved non-reserved non-reserved

TABLE reserved non-reserved reserved

TABLES non-reserved non-reserved non-reserved

TABLESAMPLE non-reserved non-reserved reserved

TBLPROPERTIES non-reserved non-reserved non-reserved

TEMP non-reserved non-reserved not a keyword

TEMPORARY non-reserved non-reserved non-reserved

TERMINATED non-reserved non-reserved non-reserved

THEN reserved non-reserved reserved

TIME reserved non-reserved reserved

TO reserved non-reserved reserved

TOUCH non-reserved non-reserved non-reserved

TRAILING reserved non-reserved reserved

TRANSACTION non-reserved non-reserved non-reserved

TRANSACTIONS non-reserved non-reserved non-reserved

TRANSFORM non-reserved non-reserved non-reserved

TRIM non-reserved non-reserved non-reserved

TRUE non-reserved non-reserved reserved

TRUNCATE non-reserved non-reserved reserved

TRY_CAST non-reserved non-reserved non-reserved

TYPE non-reserved non-reserved non-reserved

UNARCHIVE non-reserved non-reserved non-reserved



UNBOUNDED non-reserved non-reserved non-reserved

UNCACHE non-reserved non-reserved non-reserved

UNION reserved strict-non-reserved reserved

UNIQUE reserved non-reserved reserved

UNKNOWN reserved non-reserved reserved

UNLOCK non-reserved non-reserved non-reserved

UNSET non-reserved non-reserved non-reserved

UPDATE non-reserved non-reserved reserved

USE non-reserved non-reserved non-reserved

USER reserved non-reserved reserved

USING reserved strict-non-reserved reserved

VALUES non-reserved non-reserved reserved

VIEW non-reserved non-reserved non-reserved

VIEWS non-reserved non-reserved non-reserved

WHEN reserved non-reserved reserved

WHERE reserved non-reserved reserved

WINDOW non-reserved non-reserved reserved

WITH reserved non-reserved reserved

YEAR non-reserved non-reserved non-reserved

ZONE non-reserved non-reserved non-reserved


ALTER CATALOG
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Transfers the ownership of a catalog to a new principal.


Since: Databricks Runtime 10.3

Syntax
ALTER CATALOG [ catalog_name ] OWNER TO principal

Parameters
catalog_name
The name of the catalog to be altered. If you provide no name, the default is hive_metastore .
OWNER TO principal
Transfers ownership of the catalog to principal .

Examples
-- Creates a catalog named `some_cat`.
> CREATE CATALOG some_cat;

-- Transfer ownership of the catalog to another user


> ALTER CATALOG some_cat OWNER TO `alf@melmak.et`;
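
-- As an additional, illustrative sketch (not from the original example set): when the
-- catalog name is omitted the statement applies to the default catalog, hive_metastore.
> ALTER CATALOG OWNER TO `alf@melmak.et`;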

Related articles
CREATE CATALOG
DESCRIBE CATALOG
DROP CATALOG
SHOW CATALOGS
ALTER STORAGE CREDENTIAL
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Renames a storage credential.


Since: Databricks Runtime 10.3

Syntax
ALTER STORAGE CREDENTIAL credential_name
{ RENAME TO to_credential_name |
OWNER TO principal }

Parameters
credential_name
Identifies the storage credential being altered.
RENAME TO to_credential_name
Renames the credential to a new name. The name must be unique among all credentials in the metastore.
OWNER TO principal
Transfers ownership of the storage credential to principal .

Examples
> ALTER STORAGE CREDENTIAL street_cred RENAME TO good_cred;

> ALTER STORAGE CREDENTIAL street_cred OWNER TO `alf@melmak.et`

Related articles
DESCRIBE STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
Principal
SHOW STORAGE CREDENTIAL
ALTER DATABASE
7/21/2022 • 2 minutes to read

An alias for ALTER SCHEMA.


While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
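
As an illustrative sketch (reusing the inventory schema from the ALTER SCHEMA examples), the
following two statements are equivalent:

> ALTER DATABASE inventory SET DBPROPERTIES ('Edited-by' = 'John');
> ALTER SCHEMA inventory SET DBPROPERTIES ('Edited-by' = 'John');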

Related articles
ALTER SCHEMA
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
SHOW SCHEMAS
ALTER EXTERNAL LOCATION
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Alters properties of an external location or renames the location.


Since: Databricks Runtime 10.3

Syntax
ALTER EXTERNAL LOCATION location_name
{ RENAME TO to_location_name |
SET URL url [FORCE] |
SET STORAGE CREDENTIAL credential_name |
OWNER TO principal }

Parameters
location_name
Identifies the external location being altered.
RENAME TO to_location_name
Renames the location to a new name. The name must be unique among all locations in the metastore.
SET URL url [FORCE]
url must be a STRING literal with the location of the cloud storage described as an absolute URL.
Unless you specify FORCE the statement will fail if the location is currently in use.
SET STORAGE CREDENTIAL credential_name
Updates the named credential used to access this location. If the credential does not exist Databricks
Runtime raises an error.
OWNER TO principal
Transfers ownership of the storage location to principal .

Examples
-- Rename a location
> ALTER EXTERNAL LOCATION descend_loc RENAME TO decent_loc;

-- Redirect the URL associated with the location


> ALTER EXTERNAL LOCATION best_loc SET URL 'abfss://us-east-1-prod/best_location' FORCE;

-- Change the credentials used to access the location


> ALTER EXTERNAL LOCATION best_loc SET STORAGE CREDENTIAL street_cred;

-- Change ownership of the external location


> ALTER EXTERNAL LOCATION best_loc OWNER TO `alf@melmak.et`

Related articles
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
External locations and storage credentials
SHOW EXTERNAL LOCATIONS
ALTER SCHEMA
7/21/2022 • 2 minutes to read

Alters metadata associated with a schema by setting DBPROPERTIES . The specified property values override any
existing value with the same property name. An error message is issued if the schema is not found in the
system. This command is mostly used to record the metadata for a schema and may be used for auditing
purposes.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Syntax
ALTER { SCHEMA | DATABASE } schema_name
{ SET DBPROPERTIES ( { key = val } [, ...] ) |
OWNER TO principal }

Parameters
schema_name
The name of the schema to be altered.
DBPROPERTIES ( key = val [, …] )
The schema properties to be set or unset.
OWNER TO principal
Transfers ownership of the schema to principal .

Examples
-- Creates a schema named `inventory`.
> CREATE SCHEMA inventory;

-- Alters the schema to set properties `Edited-by` and `Edit-date`.


> ALTER SCHEMA inventory SET DBPROPERTIES ('Edited-by' = 'John', 'Edit-date' = '01/01/2001');

-- Verify that properties are set.


> DESCRIBE SCHEMA EXTENDED inventory;
database_description_item database_description_value
------------------------- ------------------------------------------
Database Name inventory
Description
Location file:/temp/spark-warehouse/inventory.db
Properties ((Edit-date,01/01/2001), (Edited-by,John))

-- Transfer ownership of the schema to another user


> ALTER SCHEMA inventory OWNER TO `alf@melmak.et`

Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
SHOW SCHEMAS
ALTER SHARE
7/21/2022 • 2 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Adds or removes tables to or from the share. Transfers the ownership of a share to a new principal.
Since: Databricks Runtime 10.3

Syntax
ALTER SHARE share_name
{ alter_table |
REMOVE TABLE clause }

alter_table
{ ADD [ TABLE ] table_name [ COMMENT comment ]
[ PARTITION clause ] [ AS table_share_name ] }

Parameters
share_name
The name of the share to be altered.
alter_table

Adds a table or partitions of a table to the share.


ADD [ TABLE ] table_name
Identifies the table to be added. The table must not reside in the hive_metastore .
COMMENT comment

An optional string literal attached to the table share as a comment.


PARTITION clause
One or more partitions of the table to be added. The partition keys must match the partitioning
of the table and be associated with values. If no PARTITION clause is present ADD TABLE adds the
entire table.
AS table_share_name
Optionally exposes the table under a different name. The name can be qualified with a database
(schema) name. If no table_share_name is specified the table will be known under its own name.
REMOVE [ TABLE ] table_name
Remove the table identified by table_name from the share.

Examples
-- Creates a share named `some_share`.
> CREATE SHARE some_share;

-- Add a table to the share.


> ALTER SHARE some_share
ADD TABLE my_db.my_tab
COMMENT 'some comment'
PARTITION(c1_int = 5, c2_date LIKE '2021%')
AS shared_db.shared_tab;

-- Remove the table again


> ALTER SHARE some_share
REMOVE TABLE shared_db.shared_tab;

Related articles
CREATE SHARE
DESCRIBE SHARE
DROP SHARE
SHOW SHARES
ALTER TABLE
7/21/2022 • 10 minutes to read

Alters the schema or properties of a table.


For type changes or renaming columns in Delta Lake see rewrite the data.
To change the comment on a table use COMMENT ON.
If the table is cached, the command clears cached data of the table and all its dependents that refer to it. The
cache will be lazily filled when the table or the dependents are accessed the next time.

Required permissions
If you use Unity Catalog you must have OWNERSHIP on the table to use ALTER TABLE to:
Change the owner
Grant permissions on the table
Change the table name
For all other metadata operations on a table (for example updating comments, properties, or columns) you can
make updates if you have the MODIFY permission on the table.

Syntax
ALTER TABLE table_name
{ RENAME TO clause |
ADD COLUMN clause |
ALTER COLUMN clause |
DROP COLUMN clause |
RENAME COLUMN clause |
ADD CONSTRAINT clause |
DROP CONSTRAINT clause |
ADD PARTITION clause |
DROP PARTITION clause |
RENAME PARTITION clause |
RECOVER PARTITIONS clause |
SET TBLPROPERTIES clause |
UNSET TBLPROPERTIES clause |
SET SERDE clause |
SET LOCATION clause |
OWNER TO clause }

Parameters
table_name
Identifies the table being altered. The name must not include a temporal specification.
RENAME TO to_table_name
Renames the table within the same schema.
to_table_name
Identifies the new table name. The name must not include a temporal specification.
ADD COLUMN
This clause is not supported for JDBC data sources.
Adds one or more columns to the table, or fields to existing columns in a Delta Lake table.

{ ADD [ COLUMN | COLUMNS ]
  ( { { column_identifier | field_name } data_type [ COMMENT comment ]
      [ FIRST | AFTER identifier ] } [, ...] ) }

column_identifier
The name of the column to be added. The name must be unique within the table.
Unless FIRST or AFTER name are specified the column or field will be appended at the end.
field_name
The fully qualified name of the field to be added to an existing column. All components of the path
to the nested field must exist and the field name itself must be unique.
COMMENT comment
An optional STRING literal describing the added column or field.
FIRST
If specified the column will be added as the first column of the table, or the field will be added as
the first field in the containing struct.
AFTER identifier
If specified the column or field will be added immediately after the field or column identifier .
ALTER COLUMN
Changes a property or the location of a column.

{ { ALTER | CHANGE } [ COLUMN ] { column_identifier | field_name }
  { COMMENT comment |
    { FIRST | AFTER column_identifier } |
    { SET | DROP } NOT NULL |
    SYNC IDENTITY } }

column_identifier
The name of the column to be altered.
field_name
The fully qualified name of the field to be altered. All components of the path to the nested field
must exist.
COMMENT comment
Changes the description of the column_name column. comment must be a STRING literal.
FIRST or AFTER identifier
Moves the column from its current position to the front ( FIRST ) or immediately AFTER the
identifier . This clause is only supported if table_name is a Delta table.
SET NOT NULL or DROP NOT NULL
Changes the domain of valid column values to exclude nulls SET NOT NULL , or include nulls
DROP NOT NULL . This option is only supported for Delta Lake tables. Delta Lake will ensure the
constraint is valid for all existing and new data.
SYNC IDENTITY
Since: Databricks Runtime 10.3
Synchronize the metadata of an identity column with the actual data. When you write your own
values to an identity column, it might not comply with the metadata. This option evaluates the
state and updates the metadata to be consistent with the actual data. After this command, the next
automatically assigned identity value will start from start + (n + 1) * step , where n is the
smallest value that satisfies start + n * step >= max() (for a positive step).
This option is only supported for identity columns on Delta Lake tables.
DROP COLUMN
Since: Databricks Runtime 11.0
Drop one or more columns or fields in a Delta Lake table.
When you drop a column or field, you must drop dependent check constraints and generated columns.

DROP [ COLUMN | COLUMNS ] [ IF EXISTS ] ( { column_identifier | field_name } [, ...] )

IF EXISTS
When you specify IF EXISTS , Databricks Runtime ignores an attempt to drop columns that do not
exist. Otherwise, dropping non-existing columns will cause an error.
column_identifier
The name of the existing column.
field_name
The fully qualified name of an existing field.
RENAME COLUMN
Renames a column or field in a Delta Lake table.
When you rename a column or field you also need to change dependent check constraints and generated
columns. Any primary keys and foreign keys using the column will be dropped. In case of foreign keys
you must own the table on which the foreign key is defined.

RENAME COLUMN { column_identifier TO to_column_identifier |
                field_name TO to_field_identifier }

column_identifier
The existing name of the column.
to_column_identifier
The new column identifier. The identifier must be unique within the table.
field_name
The existing fully qualified name of a field.
to_field_identifier
The new field identifier. The identifier must be unique within the local struct.
ADD CONSTRAINT
Adds a check constraint, foreign key constraint, or primary key constraint to the table.
Foreign keys and primary keys are not supported for tables in the hive_metastore catalog.
DROP CONSTRAINT
Drops a primary key, foreign key, or check constraint from the table.
ADD PARTITION
If specified adds one or more partitions to the table. Adding partitions is not supported for Delta Lake
tables.

ADD [IF NOT EXISTS] { PARTITION clause [ LOCATION path ] } [...]

IF NOT EXISTS
An optional clause directing Databricks Runtime to ignore the statement if the partition already
exists.
PARTITION clause
A partition to be added. The partition keys must match the partitioning of the table and be
associated with values. If the partition already exists an error is raised unless IF NOT EXISTS has
been specified.
LOCATION path
path must be a STRING literal representing an optional location pointing to the partition.
If no location is specified the location will be derived from the location of the table and the
partition keys.
If there are files present at the location they populate the partition and must be compatible with
the data_source of the table and its options.
DROP PARTITION
If specified this clause drops one or more partitions from the table, optionally deleting any files at the
partitions’ locations.
Delta Lake tables do not support dropping of partitions.

DROP [ IF EXISTS ] PARTITION clause [, ...] [PURGE]

IF EXISTS
When you specify IF EXISTS Azure Databricks will ignore an attempt to drop partitions that do
not exist. Otherwise, non-existing partitions will cause an error.
PARTITION clause
Specifies a partition to be dropped. If the partition is only partially identified a slice of partitions is
dropped.
PURGE
If set, the table catalog must remove partition data by skipping the Trash folder even when the
catalog has configured one. The option is applicable only for managed tables. It is effective only
when the file system supports a Trash folder and the catalog has been configured to move dropped
partitions to the Trash folder. There is no Trash folder in AWS S3, so the option is not effective
there. There is no need to manually delete files after dropping partitions.
RENAME PARTITION
Replaces the keys of a partition.
Delta Lake tables do not support renaming partitions.

from_partition_clause RENAME TO to_partition_clause

from_partition_clause
The definition of the partition to be renamed.
to_partition_clause
The new definition for this partition. A partition with the same keys must not already exist.
RECOVER PARTITIONS
This clause does not apply to Delta Lake tables.
Instructs Databricks Runtime to scan the table’s location and add any files to the table which have been
added directly to the filesystem.
SET TBLPROPERTIES
Sets or resets one or more user defined properties.
UNSET TBLPROPERTIES
Removes one or more user defined properties.
SET LOCATION
Moves the location of a partition or table.
Delta Lake does not support moving individual partitions of a Delta Lake table.

[ PARTITION clause ] SET LOCATION path

PARTITION clause
Optionally identifies the partition for which the location will be changed. If you omit naming a
partition Azure Databricks moves the location of the table.
LOCATION path
path must be a STRING literal. Specifies the new location for the partition or table.
Files in the original location will not be moved to the new location.
OWNER TO principal
Transfers ownership of the table to principal .

Examples
For Delta Lake add and alter column examples, see
Explicitly update schema
Add columns
Change column comment or ordering
Constraints

-- RENAME table
> DESCRIBE student;
col_name data_type comment
----------------------- --------- -------
name string NULL
rollno int NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL

> ALTER TABLE Student RENAME TO StudentInfo;

-- After Renaming the table


> DESCRIBE StudentInfo;
col_name data_type comment
----------------------- --------- -------
name string NULL
rollno int NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL

-- RENAME partition
> SHOW PARTITIONS StudentInfo;
partition
---------
age=10
age=11
age=12

> ALTER TABLE default.StudentInfo PARTITION (age='10') RENAME TO PARTITION (age='15');

-- After renaming Partition


> SHOW PARTITIONS StudentInfo;
partition
---------
age=11
age=12
age=15

-- Add new columns to a table


> DESCRIBE StudentInfo;
col_name data_type comment
----------------------- --------- -------
name string NULL
rollno int NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL

> ALTER TABLE StudentInfo ADD columns (LastName string, DOB timestamp);

-- After Adding New columns to the table


> DESCRIBE StudentInfo;
col_name data_type comment
----------------------- --------- -------
name string NULL
rollno int NULL
LastName string NULL
DOB timestamp NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL

-- Add a new partition to a table


> SHOW PARTITIONS StudentInfo;
partition
---------
age=11
age=12
age=15

> ALTER TABLE StudentInfo ADD IF NOT EXISTS PARTITION (age=18);

-- After adding a new partition to the table


> SHOW PARTITIONS StudentInfo;
partition
---------
age=11
age=12
age=15
age=18

-- Drop a partition from the table


> SHOW PARTITIONS StudentInfo;
partition
---------
age=11
age=12
age=15
age=18

> ALTER TABLE StudentInfo DROP IF EXISTS PARTITION (age=18);

-- After dropping the partition of the table


> SHOW PARTITIONS StudentInfo;
partition
---------
age=11
age=12
age=15

-- Adding multiple partitions to the table


> SHOW PARTITIONS StudentInfo;
partition
---------
age=11
age=12
age=15

> ALTER TABLE StudentInfo ADD IF NOT EXISTS PARTITION (age=18) PARTITION (age=20);

-- After adding multiple partitions to the table


> SHOW PARTITIONS StudentInfo;
partition
---------
age=11
age=12
age=15
age=18
age=20

-- ALTER or CHANGE COLUMNS


> DESCRIBE StudentInfo;
col_name data_type comment
----------------------- --------- -------
name string NULL
rollno int NULL
LastName string NULL
DOB timestamp NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL

> ALTER TABLE StudentInfo ALTER COLUMN name COMMENT "new comment";

--After ALTER or CHANGE COLUMNS


> DESCRIBE StudentInfo;
col_name data_type comment
----------------------- --------- -----------
name string new comment
rollno int NULL
LastName string NULL
DOB timestamp NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL

-- RENAME COLUMN
> ALTER TABLE StudentInfo RENAME COLUMN name TO FirstName;

--After RENAME COLUMN


> DESCRIBE StudentInfo;
col_name data_type comment
----------------------- --------- -----------
FirstName string new comment
rollno int NULL
LastName string NULL
DOB timestamp NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL

-- Change the file Location


> ALTER TABLE dbx.tab1 PARTITION (a='1', b='2') SET LOCATION '/path/to/part/ways'

-- SET SERDE/ SERDE Properties


> ALTER TABLE test_tab SET SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe';

> ALTER TABLE dbx.tab1 SET SERDE 'org.apache.hadoop' WITH SERDEPROPERTIES ('k' = 'v', 'kay' = 'vee')

-- SET TABLE PROPERTIES


> ALTER TABLE dbx.tab1 SET TBLPROPERTIES ('winner' = 'loser');

-- DROP TABLE PROPERTIES


> ALTER TABLE dbx.tab1 UNSET TBLPROPERTIES ('winner');
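
-- The statements below are an illustrative sketch, not part of the original example set.
-- They show clauses documented above that the examples do not cover. The Delta table
-- `events` and its columns are hypothetical; NOT NULL, SYNC IDENTITY, DROP COLUMN, and
-- constraints are supported only for Delta Lake tables.

-- Assume: CREATE TABLE events (id BIGINT GENERATED BY DEFAULT AS IDENTITY,
--                              ts TIMESTAMP, payload STRING);

-- Disallow NULL values in an existing column
> ALTER TABLE events ALTER COLUMN ts SET NOT NULL;

-- Re-synchronize identity metadata after writing explicit values to the identity column
> ALTER TABLE events ALTER COLUMN id SYNC IDENTITY;

-- Add and then drop a named check constraint
> ALTER TABLE events ADD CONSTRAINT recent_ts CHECK (ts > '2020-01-01');
> ALTER TABLE events DROP CONSTRAINT recent_ts;

-- Drop a column that is no longer needed (requires column mapping on the Delta table)
> ALTER TABLE events DROP COLUMN payload;

-- For a non-Delta table, register partition directories added directly to storage
> ALTER TABLE StudentInfo RECOVER PARTITIONS;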

Related articles
ALTER VIEW
ADD CONSTRAINT
COMMENT ON
CREATE TABLE
DROP CONSTRAINT
DROP TABLE
MSCK REPAIR TABLE
PARTITION
ALTER VIEW
7/21/2022 • 2 minutes to read

Alters metadata associated with the view. It can change the definition of the view, rename the view, and set or
unset the metadata of the view by setting TBLPROPERTIES .
If the view is cached, the command clears cached data of the view and all its dependents that refer to it. The
view’s cache will be lazily filled when the view is accessed the next time. The command leaves view’s dependents
as uncached.

Syntax
ALTER VIEW view_name
{ rename |
SET TBLPROPERTIES clause |
UNSET TBLPROPERTIES clause |
alter_body |
owner_to }

rename
RENAME TO to_view_name

alter_body
AS query

property_key
{ identifier [. ...] | string_literal }

owner_to
OWNER TO principal

Parameters
view_name
Identifies the view to be altered.
RENAME TO to_view_name
Renames the existing view within the schema.
to_view_name specifies the new name of the view. If the to_view_name already exists, a
TableAlreadyExistsException is thrown. If to_view_name is qualified it must match the schema name of
view_name .

SET TBLPROPERTIES
Sets or resets one or more user defined properties.
UNSET TBLPROPERTIES
Removes one or more user defined properties.
AS query
A query that constructs the view from base tables or other views.
This clause is equivalent to a CREATE OR REPLACE VIEW statement on an existing view.
OWNER TO principal
Transfers ownership of the view to principal . Unless the view is defined in the hive_metastore you may
only transfer ownership to a group you belong to.
Examples

-- Rename only changes the view name.


-- The source and target schemas of the view have to be the same.
-- Use qualified or unqualified name for the source and target view.
> ALTER VIEW tempsc1.v1 RENAME TO tempsc1.v2;

-- Verify that the new view is created.


> DESCRIBE TABLE EXTENDED tempsc1.v2;
c1 int NULL
c2 string NULL

# Detailed Table Information


Database tempsc1
Table v2

-- Before ALTER VIEW SET TBLPROPERTIES


> DESCRIBE TABLE EXTENDED tempsc1.v2;
c1 int null
c2 string null

# Detailed Table Information


Database tempsc1
Table v2
Table Properties [....]

-- Set properties in TBLPROPERTIES


> ALTER VIEW tempsc1.v2 SET TBLPROPERTIES ('created.by.user' = "John", 'created.date' = '01-01-2001' );

-- Use `DESCRIBE TABLE EXTENDED tempsc1.v2` to verify


> DESCRIBE TABLE EXTENDED tempsc1.v2;
c1 int NULL
c2 string NULL

# Detailed Table Information


Database tempsc1
Table v2
Table Properties [created.by.user=John, created.date=01-01-2001, ....]

-- Remove the key created.by.user and created.date from `TBLPROPERTIES`


> ALTER VIEW tempsc1.v2 UNSET TBLPROPERTIES (`created`.`by`.`user`, created.date);

-- Use `DESCRIBE TABLE EXTENDED tempsc1.v2` to verify the changes


> DESCRIBE TABLE EXTENDED tempsc1.v2;
c1 int NULL
c2 string NULL

# Detailed Table Information


Database tempsc1
Table v2
Table Properties [....]

-- Change the view definition


> ALTER VIEW tempsc1.v2 AS SELECT * FROM tempsc1.v1;

-- Use `DESCRIBE TABLE EXTENDED` to verify


> DESCRIBE TABLE EXTENDED tempsc1.v2;
c1 int NULL
c2 string NULL

# Detailed Table Information


# Detailed Table Information
Database tempsc1
Table v2
Type VIEW
View Text select * from tempsc1.v1
View Original Text select * from tempsc1.v1

-- Transfer ownership of a view to another user


> ALTER VIEW v1 OWNER TO `alf@melmak.et`

Related articles
DESCRIBE TABLE
CREATE VIEW
DROP VIEW
SHOW VIEWS
CREATE CATALOG
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Creates a catalog with the specified name. If a catalog with the same name already exists, an exception is thrown.
Since: Databricks Runtime 10.3

Syntax
CREATE CATALOG [ IF NOT EXISTS ] catalog_name
[ COMMENT comment ]

Parameters
catalog_name
The name of the catalog to be created.
IF NOT EXISTS
Creates a catalog with the given name if it does not exist. If a catalog with the same name already exists,
nothing will happen.
comment
An optional STRING literal. The description for the catalog.

Examples
-- Create catalog `customer_cat`. This throws exception if catalog with name customer_cat
-- already exists.
> CREATE CATALOG customer_cat;

-- Create catalog `customer_cat` only if catalog with same name doesn't exist.
> CREATE CATALOG IF NOT EXISTS customer_cat;

-- Create catalog `customer_cat` only if catalog with same name doesn't exist, with a comment.
> CREATE CATALOG IF NOT EXISTS customer_cat COMMENT 'This is customer catalog';

Related articles
DESCRIBE CATALOG
DROP CATALOG
CREATE DATABASE
7/21/2022 • 2 minutes to read

An alias for CREATE SCHEMA.


While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
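
As an illustrative sketch (reusing the customer_sc schema from the CREATE SCHEMA examples), the
following two statements are equivalent:

> CREATE DATABASE IF NOT EXISTS customer_sc;
> CREATE SCHEMA IF NOT EXISTS customer_sc;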

Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
CREATE FUNCTION (External)
7/21/2022 • 3 minutes to read

Creates a temporary or permanent external function. Temporary functions are scoped at a session level where
as permanent functions are created in the persistent catalog and are made available to all sessions. The
resources specified in the USING clause are made available to all executors when they are executed for the first
time.
In addition to the SQL interface, Spark allows you to create custom user defined scalar and aggregate functions
using Scala, Python, and Java APIs. See User-defined scalar functions (UDFs) and User-defined aggregate
functions (UDAFs) for more information.

Syntax
CREATE [ OR REPLACE ] [ TEMPORARY ] FUNCTION [ IF NOT EXISTS ]
function_name AS class_name [ resource_locations ]

Parameters
OR REPLACE
If specified, the resources for the function are reloaded. This is mainly useful to pick up any changes made
to the implementation of the function. This parameter is mutually exclusive to IF NOT EXISTS and cannot
be specified together.
TEMPORARY
Indicates the scope of function being created. When TEMPORARY is specified, the created function is valid
and visible in the current session. No persistent entry is made in the catalog for these kind of functions.
IF NOT EXISTS
If specified, creates the function only when it does not exist. The creation of function succeeds (no error is
thrown) if the specified function already exists in the system. This parameter is mutually exclusive to
OR REPLACE and cannot be specified together.

function_name
A name for the function. The function name may be optionally qualified with a schema name.
class_name
The name of the class that provides the implementation for function to be created. The implementing
class should extend one of the base classes as follows:
Should extend UDF or UDAF in org.apache.hadoop.hive.ql.exec package.
Should extend AbstractGenericUDAFResolver, GenericUDF , or GenericUDTF in
org.apache.hadoop.hive.ql.udf.generic package.
Should extend UserDefinedAggregateFunction in org.apache.spark.sql.expressions package.
resource_locations
The list of resources that contain the implementation of the function along with its dependencies.
Syntax: USING { { (JAR | FILE | ARCHIVE) resource_uri } , ... }

Examples
-- 1. Create a simple UDF `SimpleUdf` that increments the supplied integral value by 10.
-- import org.apache.hadoop.hive.ql.exec.UDF;
-- public class SimpleUdf extends UDF {
-- public int evaluate(int value) {
-- return value + 10;
-- }
-- }
-- 2. Compile and place it in a JAR file called `SimpleUdf.jar` in /tmp.

-- Create a table called `test` and insert two rows.


> CREATE TABLE test(c1 INT);
> INSERT INTO test VALUES (1), (2);

-- Create a permanent function called `simple_udf`.


> CREATE FUNCTION simple_udf AS 'SimpleUdf'
USING JAR '/tmp/SimpleUdf.jar';

-- Verify that the function is in the registry.


> SHOW USER FUNCTIONS;
function
------------------
default.simple_udf

-- Invoke the function. Every selected value should be incremented by 10.


> SELECT simple_udf(c1) AS function_return_value FROM test;
function_return_value
---------------------
11
12

-- Created a temporary function.


> CREATE TEMPORARY FUNCTION simple_temp_udf AS 'SimpleUdf'
USING JAR '/tmp/SimpleUdf.jar';

-- Verify that the newly created temporary function is in the registry.


-- The temporary function does not have a qualified
-- schema associated with it.
> SHOW USER FUNCTIONS;
function
------------------
default.simple_udf
simple_temp_udf

-- 1. Modify `SimpleUdf`'s implementation to increment the supplied integral value by 20.


-- import org.apache.hadoop.hive.ql.exec.UDF;

-- public class SimpleUdfR extends UDF {


-- public int evaluate(int value) {
-- return value + 20;
-- }
-- }
-- 2. Compile and place it in a jar file called `SimpleUdfR.jar` in /tmp.

-- Replace the implementation of `simple_udf`


> CREATE OR REPLACE FUNCTION simple_udf AS 'SimpleUdfR'
USING JAR '/tmp/SimpleUdfR.jar';

-- Invoke the function. Every selected value should be incremented by 20.


> SELECT simple_udf(c1) AS function_return_value FROM test;
function_return_value
---------------------
21
22

Related articles
CREATE FUNCTION (SQL)
SHOW FUNCTIONS
DESCRIBE FUNCTION
DROP FUNCTION
CREATE EXTERNAL LOCATION
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Creates an external location with the specified name. If a location with the same name already exists, an
exception is thrown.
Since: Databricks Runtime 10.3

Syntax
CREATE EXTERNAL LOCATION [IF NOT EXISTS] location_name
URL url
WITH (STORAGE CREDENTIAL credential_name)
[COMMENT comment]

Parameters
location_name
The name of the location to be created.
IF NOT EXISTS
Creates a location with the given name if it does not exist. If a location with the same name already exists,
nothing will happen.
url
A STRING literal with the location of the cloud storage described as an absolute URL.
credential_name
The named credential used to connect to this location.
comment
An optional description for the location, or NULL . The default is NULL .

Examples
-- Create a location accessed using the abfss_remote_cred credential
> CREATE EXTERNAL LOCATION abfss_remote URL 'abfss://us-east-1/location'
WITH (STORAGE CREDENTIAL abfss_remote_cred)
COMMENT 'Default source for Azure external data';
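
-- As an additional, illustrative sketch (not from the original example): use IF NOT EXISTS
-- so the statement is a no-op when the location already exists.
> CREATE EXTERNAL LOCATION IF NOT EXISTS abfss_remote URL 'abfss://us-east-1/location'
    WITH (STORAGE CREDENTIAL abfss_remote_cred);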

Related articles
ALTER EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
External locations and storage credentials
SHOW EXTERNAL LOCATIONS
CREATE RECIPIENT
7/21/2022 • 2 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Creates a recipient with the specified name and generates an activation link. If a recipient with the same name
already exists, an exception is thrown.
To create and manage a recipient you must be a metastore administrator and Databricks Runtime must be
configured with the Unity Catalog metastore.
Use DESCRIBE RECIPIENT to retrieve the activation link.
Since: Databricks Runtime 10.3

Syntax
CREATE RECIPIENT [ IF NOT EXISTS ] recipient_name
[ COMMENT comment ]

Parameters
recipient_name
The name of the recipient to be created.
IF NOT EXISTS
Creates a recipient with the given name if it does not exist. If a recipient with the same name already
exists, nothing will happen.
comment
An optional STRING literal. The description for the recipient.

Examples
-- Create recipient `other_corp`. This throws an exception if a recipient with name other_corp
-- already exists.
> CREATE RECIPIENT other_corp;

-- Create recipient `other_corp` only if a recipient with the same name doesn't exist.
> CREATE RECIPIENT IF NOT EXISTS other_corp;

-- Create recipient `other_corp` only if a recipient with same name doesn't exist, with a comment.
> CREATE RECIPIENT IF NOT EXISTS other_corp COMMENT 'This is Other Corp';

-- Retrieve the activation link


> DESCRIBE RECIPIENT other_corp;
name                           other_corp
created_at                     2022-01-01T00:00:00.000+0000
created_by                     alwaysworks@databricks.com
comment                        This is Other Corp
activation_link                https://send/this
active_token_id                0160c81f-5262-40bb-9b03-3ee12e6d98d7
active_token_expiration_time   9999-12-31T23:59:59.999+0000
rotated_token_id               NULL
rotated_token_expiration_time  NULL

Related articles
DESCRIBE RECIPIENT
DROP RECIPIENT
SHOW RECIPIENTS
CREATE SCHEMA
7/21/2022 • 2 minutes to read

Creates a schema with the specified name. If a schema with the same name already exists, an exception is
thrown.

Syntax
CREATE SCHEMA [ IF NOT EXISTS ] schema_name
[ COMMENT schema_comment ]
[ LOCATION schema_directory ]
[ WITH DBPROPERTIES ( property_name = property_value [ , ... ] ) ]

Parameters
schema_name
The name of the schema to be created.
IF NOT EXISTS
Creates a schema with the given name if it does not exist. If a schema with the same name already exists,
nothing will happen.
schema_directory
Path of the file system in which the specified schema is to be created. If the specified path does not exist
in the underlying file system, creates a directory with the path. If the location is not specified, the schema
is created in the default warehouse directory, whose path is configured by the static configuration
spark.sql.warehouse.dir .

WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.

schema_comment
The description for the schema.
WITH DBPROPERTIES ( property_name = property_value [ , ... ] )
The properties for the schema in key-value pairs.

Examples
-- Create schema `customer_sc`. This throws exception if schema with name customer_sc
-- already exists.
> CREATE SCHEMA customer_sc;

-- Create schema `customer_sc` only if schema with same name doesn't exist.
> CREATE SCHEMA IF NOT EXISTS customer_sc;

-- Create schema `customer_sc` only if schema with same name doesn't exist with
-- `Comments`,`Specific Location` and `Database properties`.
> CREATE SCHEMA IF NOT EXISTS customer_sc COMMENT 'This is customer schema' LOCATION '/user'
WITH DBPROPERTIES (ID=001, Name='John');

-- Verify that properties are set.


> DESCRIBE SCHEMA EXTENDED customer_sc;
database_description_item database_description_value
------------------------- --------------------------
Database Name customer_sc
Description This is customer schema
Location hdfs://hacluster/user
Properties ((ID,001), (Name,John))

Related articles
DESCRIBE SCHEMA
DROP SCHEMA
CREATE SHARE
7/21/2022 • 2 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Creates a share with the specified name. If a share with the same name already exists, an exception is thrown.
To create and manage a share you must be a metastore administrator and Databricks Runtime must be
configured with the Unity Catalog metastore.
To add content to the share use ALTER SHARE
Since: Databricks Runtime 10.3

Syntax
CREATE SHARE [ IF NOT EXISTS ] share_name
[ COMMENT comment ]

Parameters
share_name
The name of the share to be created.
IF NOT EXISTS
Creates a share with the given name if it does not exist. If a share with the same name already exists,
nothing will happen.
comment
An optional STRING literal. The description for the share.

Examples
-- Create share `customer_share`. This throws exception if a share with name customer_share
-- already exists.
> CREATE SHARE customer_share;

-- Create share `customer_share` only if share with same name doesn't exist.
> CREATE SHARE IF NOT EXISTS customer_share;

-- Create share `customer_share` only if share with same name doesn't exist, with a comment.
> CREATE SHARE IF NOT EXISTS customer_share COMMENT 'This is customer share';
Related articles
ALTER SHARE
DESCRIBE SHARE
DROP SHARE
CREATE FUNCTION (SQL)
7/21/2022 • 6 minutes to read

Since: Databricks Runtime 9.1


Creates a SQL scalar or table function that takes a set of arguments and returns a scalar value or a set of rows.

NOTE
This statement is supported only for functions created in the hive_metastore catalog.

Syntax
CREATE [OR REPLACE] [TEMPORARY] FUNCTION [IF NOT EXISTS]
function_name ( [ function_parameter [, ...] ] )
RETURNS { data_type | TABLE ( column_spec [, ...] ) }
[ characteristic [...] ]
RETURN { expression | query }

function_parameter
parameter_name data_type [DEFAULT default_expression] [COMMENT parameter_comment]

column_spec
column_name data_type [COMMENT column_comment]

characteristic
{ LANGUAGE SQL |
[NOT] DETERMINISTIC |
COMMENT function_comment |
[CONTAINS SQL | READS SQL DATA] |
SQL SECURITY DEFINER }

Parameters
OR REPLACE
If specified, the function with the same name and signature (number of parameters and parameter types)
is replaced. You cannot replace an existing function with a different signature. This is mainly useful to
update the function body and the return type of the function. You cannot specify this parameter with
IF NOT EXISTS .

TEMPORARY
The scope of the function being created. When you specify TEMPORARY , the created function is valid and
visible in the current session. No persistent entry is made in the catalog.
IF NOT EXISTS
If specified, creates the function only when it does not exist. The creation of the function succeeds (no
error is thrown) if the specified function already exists in the system. You cannot specify this parameter
with OR REPLACE .
function_name
A name for the function. For a permanent function, you can optionally qualify the function name with a
schema name. If the name is not qualified the permanent function is created in the current schema.
function_parameter
Specifies a parameter of the function.
parameter_name
The parameter name must be unique within the function.
data_type
Any supported data type.
DEFAULT default_expression
Since: Databricks Runtime 10.4
An optional default to be used when a function invocation does not assign an argument to the
parameter. default_expression must be castable to data_type . The expression must not reference
another parameter or contain a subquery.
When you specify a default for one parameter, all following parameters must also have a default.
COMMENT comment
An optional description of the parameter. comment must be a STRING literal.
RETURNS data_type
The return data type of the scalar function.
RETURNS TABLE (column_spec [,…] )
The signature of the result of the table function.
column_name
The column name must be unique within the signature.
data_type
Any supported data type.
COMMENT column_comment
An optional description of the column. comment must be a STRING literal.
RETURN { expression | query }
The body of the function. For a scalar function, it can either be a query or an expression. For a table
function, it can only be a query. The expression cannot contain:
Aggregate functions
Window functions
Ranking functions
Row producing functions such as explode
Within the body of the function you can refer to a parameter by its unqualified name or by qualifying the
parameter with the function name.
characteristic
All characteristic clauses are optional. You can specify any number of them in any order, but you can
specify each clause only once.
LANGUAGE SQL
The language of the function. SQL is the only supported language.
[NOT] DETERMINISTIC
Whether the function is deterministic. A function is deterministic when it returns only one result
for a given set of arguments.
COMMENT function_comment
A comment for the function. function_comment must be a STRING literal.
CONTAINS SQL or READS SQL DATA
Whether a function reads data directly or indirectly from a table or a view. When the function
reads SQL data, you cannot specify CONTAINS SQL . If you don’t specify either clause, the property is
derived from the function body.
SQL SECURITY DEFINER
The body of the function and any default expressions are executed using the authorization of the
owner of the function. This is the only supported behavior.

Examples
Create and use a SQL scalar function
Create and use a function that uses DEFAULTs
Create a SQL table function
Replace a SQL function
Describe a SQL function
Create and use a SQL scalar function

> CREATE VIEW t(c1, c2) AS VALUES (0, 1), (1, 2);
-- Create a temporary function with no parameter.
> CREATE TEMPORARY FUNCTION hello() RETURNS STRING RETURN 'Hello World!';

> SELECT hello();


Hello World!

-- Create a permanent function with parameters.


> CREATE FUNCTION area(x DOUBLE, y DOUBLE) RETURNS DOUBLE RETURN x * y;

-- Use a SQL function in the SELECT clause of a query.


> SELECT area(c1, c2) AS area FROM t;
0.0
2.0

-- Use a SQL function in the WHERE clause of a query.


> SELECT * FROM t WHERE area(c1, c2) > 0;
1 2

-- Compose SQL functions.


> CREATE FUNCTION square(x DOUBLE) RETURNS DOUBLE RETURN area(x, x);

> SELECT c1, square(c1) AS square FROM t;


0 0.0
1 1.0

-- Create a non-deterministic function


> CREATE FUNCTION roll_dice()
RETURNS INT
NOT DETERMINISTIC
CONTAINS SQL
COMMENT 'Roll a single 6 sided die'
RETURN (rand() * 6)::INT + 1;
-- Roll a single 6-sided die
> SELECT roll_dice();
3

Create and use a function that uses DEFAULTs


-- Extend the function to support variable number of sides and dice.
-- Use defaults to support a variable number of arguments
> DROP FUNCTION roll_dice;
> CREATE FUNCTION roll_dice(num_dice INT DEFAULT 1 COMMENT 'number of dice to roll (Default: 1)',
num_sides INT DEFAULT 6 COMMENT 'number of sides per die (Default: 6)')
RETURNS INT
NOT DETERMINISTIC
CONTAINS SQL
COMMENT 'Roll a number of n-sided dice'
RETURN aggregate(sequence(1, roll_dice.num_dice, 1),
0,
(acc, x) -> (rand() * roll_dice.num_sides)::int,
acc -> acc + roll_dice.num_dice);

-- Roll a single 6-sided die still works


> SELECT roll_dice();
3

-- Roll 3 6-sided dice


> SELECT roll_dice(3);
15

-- Roll 3 10-sided dice


> SELECT roll_dice(3, 10)
21

-- Create a SQL function with a scalar subquery.


> CREATE VIEW scores(player, score) AS VALUES (0, 1), (0, 2), (1, 2), (1, 5);

> CREATE FUNCTION avg_score(p INT) RETURNS FLOAT


COMMENT 'get an average score of the player'
RETURN SELECT AVG(score) FROM scores WHERE player = p;

> SELECT c1, avg_score(c1) FROM t;


0 1.5
1 3.5

Create a SQL table function


-- Produce all weekdays between two dates
> CREATE FUNCTION weekdays(start DATE, end DATE)
RETURNS TABLE(day_of_week STRING, day DATE)
RETURN SELECT extract(DAYOFWEEK_ISO FROM day), day
FROM (SELECT sequence(weekdays.start, weekdays.end)) AS T(days)
LATERAL VIEW explode(days) AS day
WHERE extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;

-- Return all weekdays


> SELECT weekdays.day_of_week, day
FROM weekdays(DATE'2022-01-01', DATE'2022-01-14');
1 2022-01-03
2 2022-01-04
3 2022-01-05
4 2022-01-06
5 2022-01-07
1 2022-01-10
2 2022-01-11
3 2022-01-12
4 2022-01-13
5 2022-01-14

-- Return weekdays for date ranges originating from a LATERAL correlation


> SELECT weekdays.*
FROM VALUES (DATE'2020-01-01'),
(DATE'2021-01-01'),
(DATE'2022-01-01') AS starts(start),
LATERAL weekdays(start, start + INTERVAL '7' DAYS);
3 2020-01-01
4 2020-01-02
5 2020-01-03
1 2020-01-06
2 2020-01-07
3 2020-01-08
5 2021-01-01
1 2021-01-04
2 2021-01-05
3 2021-01-06
4 2021-01-07
5 2021-01-08
1 2022-01-03
2 2022-01-04
3 2022-01-05
4 2022-01-06
5 2022-01-07

Replace a SQL function

-- Replace a SQL scalar function.


> CREATE OR REPLACE FUNCTION square(x DOUBLE) RETURNS DOUBLE RETURN x * x;

-- Replace a SQL table function.


> CREATE OR REPLACE FUNCTION getemps(deptno INT)
RETURNS TABLE (name STRING)
RETURN SELECT name FROM employee e WHERE e.deptno = getemps.deptno;

-- Describe a SQL table function.


> DESCRIBE FUNCTION getemps;
Function: default.getemps
Type: TABLE
Input: deptno INT
Returns: name STRING
NOTE
You cannot replace an existing function with a different signature.

Describe a SQL function

> DESCRIBE FUNCTION hello();


Function: hello
Type: SCALAR
Input: ()
Returns: STRING

> DESCRIBE FUNCTION area;


Function: default.area
Type: SCALAR
Input: x DOUBLE
y DOUBLE
Returns: DOUBLE

> DESCRIBE FUNCTION roll_dice;


Function: default.roll_dice
Type: SCALAR
Input: num_dice INT
num_sides INT
Returns: INT

> DESCRIBE FUNCTION EXTENDED roll_dice;


Function: default.roll_dice
Type: SCALAR
Input: num_dice INT DEFAULT 1 'number of dice to roll (Default: 1)'
num_sides INT DEFAULT 6 'number of sides per die (Default: 6)'
Returns: INT
Comment: Roll a number of n-sided dice
Deterministic: false
Data Access: CONTAINS SQL
Configs: ...
Owner: the.house@always.wins
Create Time: Sat Feb 12 09:29:02 PST 2022
Body: aggregate(sequence(1, roll_dice.num_dice, 1),
0,
(acc, x) -> (rand() * roll_dice.num_sides)::int,
acc -> acc + roll_dice.num_dice)

Related articles
CREATE FUNCTION (External)
DROP FUNCTION
SHOW FUNCTIONS
DESCRIBE FUNCTION
GRANT
REVOKE
CREATE TABLE
7/21/2022 • 2 minutes to read

Defines a table in an existing schema.


You can use any of the following means to create a table for different purposes:
CREATE TABLE [USING]
Use this syntax if the new table will be:
Based on a column definition you provide.
Derived from data at an existing storage location.
Derived from a query.
CREATE TABLE (Hive format)
This statement matches CREATE TABLE [USING] using Hive syntax.
CREATE TABLE [USING] is preferred.
CREATE TABLE LIKE
Using this syntax you create a new table based on the definition, but not the data, of another table.
CREATE TABLE CLONE
You can use table cloning for Delta Lake tables to achieve two major goals:
Make a complete, independent copy of a table including its definition and data at a specific version.
This is called a DEEP CLONE .
Make a copy of the definition of the table which refers to the original table’s storage for the initial data
at a specific version. Updates on either the source or the new table will not affect the other. However
the new table depends on the source table’s existence and column definition.
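
As a minimal sketch (the table names are illustrative and not taken from this article), the two
clone variants look like this:

-- Full, independent copy of the source table's definition and data
> CREATE TABLE sales_deep_copy DEEP CLONE sales;

-- Copy of the definition only; initial data is read from the source table's storage
> CREATE TABLE sales_shallow_copy SHALLOW CLONE sales;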

Related articles
ALTER TABLE
DROP TABLE
CREATE TABLE (Hive format)
CREATE TABLE [USING]
CREATE TABLE LIKE
CREATE TABLE CLONE
CREATE TABLE [USING]
7/21/2022 • 6 minutes to read

Defines a managed or external table, optionally using a data source.

Syntax
{ { [CREATE OR] REPLACE TABLE | CREATE TABLE [ IF NOT EXISTS ] }
table_name
[ table_specification ] [ USING data_source ]
[ table_clauses ]
[ AS query ] }

table_specification
( { column_identifier column_type [ NOT NULL ]
[ GENERATED ALWAYS AS ( expr ) |
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] ]
[ COMMENT column_comment ]
[ column_constraint ] } [, ...]
[ , table_constraint ] [...] )

table_clauses
{ OPTIONS clause |
PARTITIONED BY clause |
clustered_by_clause |
LOCATION path [ WITH ( CREDENTIAL credential_name ) ] |
COMMENT table_comment |
TBLPROPERTIES clause } [...]

clustered_by_clause
{ CLUSTERED BY ( cluster_column [, ...] )
[ SORTED BY ( { sort_column [ ASC | DESC ] } [, ...] ) ]
INTO num_buckets BUCKETS }

Parameters
REPL ACE
If specified replaces the table and its content if it already exists. This clause is only supported for Delta
Lake tables.

NOTE
Azure Databricks strongly recommends using REPLACE instead of dropping and re-creating Delta Lake tables.

IF NOT EXISTS
If specified and a table with the same name already exists, the statement is ignored.
IF NOT EXISTS cannot coexist with REPLACE , which means CREATE OR REPLACE TABLE IF NOT EXISTS is not
allowed.
table_name
The name of the table to be created. The name must not include a temporal specification. If the name is
not qualified the table is created in the current schema.
table_specification
This optional clause defines the list of columns, their types, properties, descriptions, and column
constraints.
If you do not define columns for the table schema you must specify either AS query or LOCATION .
column_identifier
A unique name for the column.
column_type
Specifies the data type of the column. Not all data types supported by Azure Databricks are
supported by all data sources.
NOT NULL
If specified the column will not accept NULL values. This clause is only supported for Delta Lake
tables.
GENERATED ALWAYS AS ( expr )
When you specify this clause the value of this column is determined by the specified expr .
expr may be composed of literals, column identifiers within the table, and deterministic, built-in
SQL functions or operators except:
Aggregate functions
Analytic window functions
Ranking window functions
Table valued generator functions
Also expr must not contain any subquery.
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [
INCREMENT BY step ] ) ]
Since: Databricks Runtime 10.3
Defines an identity column. When you write to the table, and do not provide values for the identity
column, it will be automatically assigned a unique and statistically increasing (or decreasing if
step is negative) value. This clause is only supported for Delta Lake tables. This clause can only be
used for columns with BIGINT data type.
The automatically assigned values start with start and increment by step . Assigned values are
unique but are not guaranteed to be contiguous. Both parameters are optional, and the default
value is 1. step cannot be 0 .
If the automatically assigned values are beyond the range of the identity column type, the query
will fail.
When ALWAYS is used, you cannot provide your own values for the identity column.
The following operations are not supported:
PARTITIONED BY an identity column
UPDATE an identity column
COMMENT column_comment
A string literal to describe the column.
column_constraint
Adds a primary key or foreign key constraint to the column in a Delta Lake table.
Constraints are not supported for tables in the hive_metastore catalog.
To add a check constraint to a Delta Lake table use ALTER TABLE.
table_constraint
Adds a primary key or foreign key constraints to the Delta Lake table.
Constraints are not supported for tables in the hive_metastore catalog.
To add a check constraint to a Delta Lake table use ALTER TABLE.
USING data_source
The file format to use for the table. data_source must be one of:
TEXT
AVRO
CSV
JSON
JDBC
PARQUET
ORC
DELTA
LIBSVM
a fully-qualified class name of a custom implementation of
org.apache.spark.sql.sources.DataSourceRegister .
If USING is omitted, the default is DELTA .
For any data_source other than DELTA you must also specify a LOCATION unless the table catalog is
hive_metastore .
HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and
row_format using the OPTIONS clause, which is a case-insensitive string map. The option_keys are:

FILEFORMAT
INPUTFORMAT
OUTPUTFORMAT
SERDE
FIELDDELIM
ESCAPEDELIM
MAPKEYDELIM
LINEDELIM
table_clauses
Optionally specify location, partitioning, clustering, options, comments, and user defined properties for
the new table. Each sub clause may only be specified once.
PARTITIONED BY
An optional clause to partition the table by a subset of columns.

NOTE
Unless you define a Delta Lake table, partitioning columns referencing the columns in the column
specification are always moved to the end of the table.

clustered_by_clause
Optionally cluster the table or each partition into a fixed number of hash buckets using a subset of
the columns.
Clustering is not supported for Delta Lake tables.
CLUSTERED BY
Specifies the set of columns by which to cluster each partition, or the table if no partitioning
is specified.
cluster_column
An identifier referencing a column_identifier in the table. If you specify more than
one column there must be no duplicates. Since a clustering operates on the partition
level you must not name a partition column also as a cluster column.
SORTED BY
Optionally maintains a sort order for rows in a bucket.
sort_column
A column to sort the bucket by. The column must not be a partition column. Sort
columns must be unique.
ASC or DESC
Optionally specifies whether sort_column is sorted in ascending ( ASC ) or
descending ( DESC ) order. The default values is ASC .
INTO num_buckets BUCKETS
An INTEGER literal specifying the number of buckets into which each partition (or the table
if no partitioning is specified) is divided.
LOCATION path [ WITH ( CREDENTIAL credential_name ) ]
An optional path to the directory where table data is stored, which could be a path on distributed
storage. path must be a STRING literal. If you specify no location the table is considered a
managed table and Azure Databricks creates a default table location.

Specifying a location makes the table an external table .


For tables that do not reside in the hive_metastore catalog the table path must be protected by
an external location unless a valid storage credential is specified.
For a Delta Lake table the table configuration is inherited from the LOCATION if data is present.
Therefore if any TBLPROPERTIES , column_specification , or PARTITIONED BY clauses are specified for
Delta Lake tables they must exactly match the Delta Lake location data.
OPTIONS
Sets or resets one or more user defined table options.
COMMENT table_comment
A string literal to describe the table.
TBLPROPERTIES
Optionally sets one or more user defined properties.
AS query
This optional clause populates the table using the data from query . When you specify a query you must
not also specify a column_specification . The table schema will be derived from the query.
Note that Azure Databricks overwrites the underlying data source with the data of the input query, to
make sure that the table created contains exactly the same data as the input query.

Examples
-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table


> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory


> CREATE TABLE student USING CSV LOCATION '/mnt/csv_files';

-- Specify table comment and properties


> CREATE TABLE student (id INT, name STRING, age INT)
COMMENT 'this is a comment'
TBLPROPERTIES ('foo'='bar');

-- Specify table comment and properties with different clauses order


> CREATE TABLE student (id INT, name STRING, age INT)
TBLPROPERTIES ('foo'='bar')
COMMENT 'this is a comment';

-- Create partitioned table


> CREATE TABLE student (id INT, name STRING, age INT)
PARTITIONED BY (age);

-- Create a table with a generated column


> CREATE TABLE rectangles(a INT, b INT,
area INT GENERATED ALWAYS AS (a * b));
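
-- The statements below are an illustrative sketch, not part of the original example set.
-- Table and column names are hypothetical.

-- Create a Delta table with an identity column
> CREATE TABLE events (
    id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
    event STRING);

-- Replace an existing Delta table and its content
> CREATE OR REPLACE TABLE student (id INT, name STRING, age INT);

-- Create a bucketed Parquet table (clustering is not supported for Delta Lake tables)
> CREATE TABLE student_buckets (id INT, name STRING)
    USING PARQUET
    CLUSTERED BY (id) SORTED BY (name ASC) INTO 4 BUCKETS;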

Related articles
ALTER TABLE
CONSTRAINT
CREATE TABLE LIKE
CREATE TABLE CLONE
DROP TABLE
PARTITIONED BY
Table properties and table options
CREATE TABLE with Hive format
7/21/2022 • 3 minutes to read

Defines a table using Hive format.

Syntax
CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
[ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
[ COMMENT table_comment ]
[ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
| ( col_name1, col_name2, ... ) ]
[ ROW FORMAT row_format ]
[ STORED AS file_format ]
[ LOCATION path ]
[ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
[ AS select_statement ]

row_format:
: SERDE serde_class [ WITH SERDEPROPERTIES (k1=v1, k2=v2, ... ) ]
| DELIMITED [ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escaped_char ] ]
[ COLLECTION ITEMS TERMINATED BY collection_items_terminated_char ]
[ MAP KEYS TERMINATED BY map_key_terminated_char ]
[ LINES TERMINATED BY row_terminated_char ]
[ NULL DEFINED AS null_char ]

The clauses between the column definition clause and the AS SELECT clause can appear in any order. For
example, you can write COMMENT table_comment after TBLPROPERTIES .

NOTE
In Databricks Runtime 8.0 and above you must specify either the STORED AS or ROW FORMAT clause. Otherwise, the SQL
parser uses the CREATE TABLE [USING] syntax to parse it and creates a Delta table by default.

Parameters
table_identifier
A table name, optionally qualified with a schema name.
Syntax: [schema_name.] table_name

EXTERNAL
Defines the table using the path provided in LOCATION .
PARTITIONED BY
Partitions the table by the specified columns.
ROW FORMAT
Use the SERDE clause to specify a custom SerDe for one table. Otherwise, use the DELIMITED clause to
use the native SerDe and specify the delimiter, escape character, null character and so on.
SERDE
Specifies a custom SerDe for one table.
serde_class
Specifies a fully-qualified class name of a custom SerDe.
SERDEPROPERTIES
A list of key-value pairs used to tag the SerDe definition.
DELIMITED
The DELIMITED clause can be used to specify the native SerDe and state the delimiter, escape character,
null character and so on.
FIELDS TERMINATED BY
Used to define a column separator.
ESCAPED BY
Used to define the escape mechanism.
COLLECTION ITEMS TERMINATED BY
Used to define a collection item separator.
MAP KEYS TERMINATED BY
Used to define a map key separator.
LINES TERMINATED BY
Used to define a row separator.
NULL DEFINED AS
Used to define the specific value for NULL .
STORED AS
The file format for the table. Available formats include TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET ,
and AVRO . Alternatively, you can specify your own input and output formats through INPUTFORMAT and
OUTPUTFORMAT . Only formats TEXTFILE , SEQUENCEFILE , and RCFILE can be used with ROW FORMAT SERDE
and only TEXTFILE can be used with ROW FORMAT DELIMITED .
LOCATION
Path to the directory where table data is stored, which could be a path on distributed storage.
COMMENT
A string literal to describe the table.
TBLPROPERTIES
A list of key-value pairs used to tag the table definition.
AS select_statement
Populates the table using the data from the select statement.

Examples
--Use hive format
CREATE TABLE student (id INT, name STRING, age INT) STORED AS ORC;

--Use data from another table


CREATE TABLE student_copy STORED AS ORC
AS SELECT * FROM student;

--Specify table comment and properties


CREATE TABLE student (id INT, name STRING, age INT)
COMMENT 'this is a comment'
STORED AS ORC
TBLPROPERTIES ('foo'='bar');

--Specify table comment and properties with different clauses order


CREATE TABLE student (id INT, name STRING, age INT)
STORED AS ORC
TBLPROPERTIES ('foo'='bar')
COMMENT 'this is a comment';

--Create partitioned table


CREATE TABLE student (id INT, name STRING)
PARTITIONED BY (age INT)
STORED AS ORC;

--Create partitioned table with different clauses order


CREATE TABLE student (id INT, name STRING)
STORED AS ORC
PARTITIONED BY (age INT);

--Use Row Format and file format


CREATE TABLE student (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

--Use complex datatype


CREATE EXTERNAL TABLE family(
name STRING,
friends ARRAY<STRING>,
children MAP<STRING, INT>,
address STRUCT<street: STRING, city: STRING>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '_'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'foonull'
STORED AS TEXTFILE
LOCATION '/tmp/family/';

--Use predefined custom SerDe


CREATE TABLE avroExample
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{ "namespace": "org.apache.hive",
"name": "first_schema",
"type": "record",
"fields": [
{ "name":"string1", "type":"string" },
{ "name":"string2", "type":"string" }
] }');

--Use personalized custom SerDe (we may need to `ADD JAR xxx.jar` first to ensure we can find the serde_class,
--or you may run into `CLASSNOTFOUND` exception)
ADD JAR /tmp/hive_serde_example.jar;

CREATE EXTERNAL TABLE family (id INT, name STRING)
ROW FORMAT SERDE 'com.ly.spark.serde.SerDeExample'
STORED AS INPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleInputFormat'
OUTPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleOutputFormat'
LOCATION '/tmp/family/';

Related statements
CREATE TABLE [USING]
CREATE TABLE LIKE
CREATE TABLE LIKE
7/21/2022 • 3 minutes to read

Defines a table using the definition and metadata of an existing table or view.
The statement does not inherit primary key or foreign key constraints from the source table.
Delta Lake does not support CREATE TABLE LIKE . Instead use CREATE TABLE AS.

Syntax
CREATE TABLE [ IF NOT EXISTS ] table_name LIKE source_table_name [table_clauses]

table_clauses
{ USING data_source |
LOCATION path |
TBLPROPERTIES clause |
ROW FORMAT row_format |
STORED AS file_format } [...]

row_format
{ SERDE serde_class [ WITH SERDEPROPERTIES (serde_key = serde_val [, ...] ) ] |
{ DELIMITED [ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escaped_char ] ]
[ COLLECTION ITEMS TERMINATED BY collection_items_terminated_char ]
[ MAP KEYS TERMINATED BY map_key_terminated_char ]
[ LINES TERMINATED BY row_terminated_char ]
[ NULL DEFINED AS null_char ] } }

property_key
{ identifier [. ...] | string_literal }

Parameters
IF NOT EXISTS
If specified, the statement is ignored if the table_name already exists.
table_name
The name of the table to create. The name must not include a temporal specification. If the name is not
qualified the table is created in the current schema. A table_name must not exist already.
source_table_name
The name of the table whose definition is copied. The table must not be a Delta Lake table.
table_clauses
Optionally specify a data source format, location, and user defined properties for the new table. Each sub
clause may only be specified once.
LOCATION path
Path to the directory where table data is stored, which could be a path on distributed storage. If
you specify a location the new table becomes an external table . If you do not specify a location
the table is a managed table .
TBLPROPERTIES
Optionally sets one or more user defined properties.
USING data_source
The file format to use for the table. data_source must be one of:
TEXT
CSV
JSON
JDBC
PARQUET
ORC
HIVE
LIBSVM
a fully-qualified class name of a custom implementation of
org.apache.spark.sql.sources.DataSourceRegister .
HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and
row_format using the OPTIONS clause, which is a case-insensitive string map. The option keys are
FILEFORMAT , INPUTFORMAT , OUTPUTFORMAT , SERDE , FIELDDELIM , ESCAPEDELIM , MAPKEYDELIM , and
LINEDELIM .

If you do not specify USING the format of the source table will be inherited.
ROW FORMAT row_format
To specify a custom SerDe, set to SERDE and specify the fully-qualified class name of a custom
SerDe and optional SerDe properties. To use the native SerDe, set to DELIMITED and specify the
delimiter, escape character, null character and so on.
SERDEPROPERTIES
A list of key-value pairs used to tag the SerDe definition.
FIELDS TERMINATED BY
Define a column separator.
ESCAPED BY
Define the escape mechanism.
COLLECTION ITEMS TERMINATED BY
Define a collection item separator.
MAP KEYS TERMINATED BY
Define a map key separator.
LINES TERMINATED BY
Define a row separator.
NULL DEFINED AS
Define the specific value for NULL .
STORED AS
The file format for the table. Available formats include TEXTFILE , SEQUENCEFILE , RCFILE ,
ORC , PARQUET , and AVRO . Alternatively, you can specify your own input and output formats
through INPUTFORMAT and OUTPUTFORMAT . Only formats TEXTFILE , SEQUENCEFILE , and
RCFILE can be used with ROW FORMAT SERDE and only TEXTFILE can be used with
ROW FORMAT DELIMITED .

Examples
-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/mnt/data_files';

-- Create table like using a data source


> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/mnt/csv_files';

-- Table is created as external table at the location specified


> CREATE TABLE Student_Dupli LIKE Student LOCATION '/root1/home';

-- Create table like using a rowformat


> CREATE TABLE Student_Dupli LIKE Student
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('owner'='xxxx');

Related articles
CREATE TABLE [USING]
CREATE TABLE CLONE
DROP TABLE
ALTER TABLE
CREATE TABLE with Hive format
Table properties
CREATE VIEW
7/21/2022 • 2 minutes to read

Constructs a virtual table that has no physical data based on the result-set of a SQL query. ALTER VIEW and
DROP VIEW only change metadata.

Syntax
CREATE [ OR REPLACE ] [ [ GLOBAL ] TEMPORARY ] VIEW [ IF NOT EXISTS ] view_name
[ column_list ]
[ COMMENT view_comment ]
[ TBLPROPERTIES clause ]
AS query

column_list
( { column_alias [ COMMENT column_comment ] } [, ...] )

Parameters
OR REPLACE
If a view of the same name already exists, it is replaced. To replace an existing view you must be its owner.
[ GLOBAL ] TEMPORARY
TEMPORARY views are session-scoped and are dropped when the session ends because they skip persisting
the definition in the underlying metastore, if any. GLOBAL TEMPORARY views are tied to a system preserved
temporary schema global_temp .
IF NOT EXISTS
Creates the view only if it does not exist. If a view by this name already exists the CREATE VIEW statement
is ignored.
You may specify at most one of IF NOT EXISTS or OR REPLACE .
view_name
The name of the newly created view. A temporary view’s name must not be qualified. The fully qualified
view name must be unique.
column_list
Optionally labels the columns in the query result of the view. If you provide a column list the number of
column aliases must match the number of expressions in the query. In case no column list is specified
aliases are derived from the body of the view.
column_alias
The column aliases must be unique.
column_comment
An optional STRING literal describing the column alias.
view_comment
An optional STRING literal providing a view-level comment.
TBLPROPERTIES
Optionally sets one or more user defined properties.
AS query
A query that constructs the view from base tables or other views.

Examples
-- Create or replace view for `experienced_employee` with comments.
> CREATE OR REPLACE VIEW experienced_employee
(id COMMENT 'Unique identification number', Name)
COMMENT 'View for experienced employees'
AS SELECT id, name
FROM all_employee
WHERE working_years > 5;

-- Create a temporary view `subscribed_movies` if it does not exist.


> CREATE TEMPORARY VIEW IF NOT EXISTS subscribed_movies
AS SELECT mo.member_id, mb.full_name, mo.movie_title
FROM movies AS mo
INNER JOIN members AS mb
ON mo.member_id = mb.id;
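
-- A minimal sketch of a GLOBAL TEMPORARY view, reusing the movies table from the example above;
-- such views must be referenced through the system preserved schema `global_temp`.
> CREATE GLOBAL TEMPORARY VIEW all_movies
    AS SELECT movie_title FROM movies;

> SELECT * FROM global_temp.all_movies;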

Related articles
ALTER VIEW
DROP VIEW
query
SHOW VIEWS
Table properties
DROP CATALOG
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Drops a catalog. An exception is thrown if the catalog does not exist in the metastore.
Since: Databricks Runtime 10.3

Syntax
DROP CATALOG [ IF EXISTS ] catalog_name [ RESTRICT | CASCADE ]

Parameters
IF EXISTS
If specified, no exception is thrown when the catalog does not exist.
catalog_name
The name of an existing catalog in the metastore. If the name does not exist, an exception is thrown.
RESTRICT
If specified, will restrict dropping a non-empty catalog and is enabled by default.
CASCADE
If specified, will drop all the associated databases (schemas) and the objects within them.

Examples
-- Create a `vaccine` catalog
> CREATE CATALOG vaccine COMMENT 'This catalog is used to maintain information about vaccines';

-- Drop the catalog and its schemas


> DROP CATALOG vaccine CASCADE;

-- Drop the catalog using IF EXISTS and only if it is empty.


> DROP CATALOG IF EXISTS vaccine RESTRICT;

Related articles
CREATE CATALOG
DESCRIBE CATALOG
SHOW CATALOGS
DROP STORAGE CREDENTIAL
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Drops an existing storage credential. An exception is thrown if the location does not exist in the metastore.
Since: Databricks Runtime 10.3

Syntax
DROP STORAGE CREDENTIAL [ IF EXISTS ] credential_name [ FORCE ]

Parameters
IF EXISTS

If specified, no exception is thrown when the credential does not exist.


credential_name
The name of an existing credential in the metastore. If the name does not exist, an exception is thrown
unless IF EXISTS has been specified.
FORCE

Optionally force the credential to be dropped even if it is used by existing objects. If FORCE is not
specified an error is raised if the credential is in use.

Examples
> DROP STORAGE CREDENTIAL street_cred FORCE;
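
-- Where the credential may not exist, IF EXISTS suppresses the exception; a minimal sketch
-- reusing the credential name from the example above.
> DROP STORAGE CREDENTIAL IF EXISTS street_cred;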

Related articles
ALTER STORAGE CREDENTIAL
DESCRIBE STORAGE CREDENTIAL
SHOW STORAGE CREDENTIALS
DROP DATABASE
7/21/2022 • 2 minutes to read

An alias for DROP SCHEMA.


While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
SHOW SCHEMAS
DROP FUNCTION
7/21/2022 • 2 minutes to read

Drops a temporary or permanent user-defined function (UDF).

Syntax
DROP [ TEMPORARY ] FUNCTION [ IF EXISTS ] function_name

Parameters
function_name
The name of an existing function. The function name may be optionally qualified with a schema name.
TEMPORARY
Used to delete a TEMPORARY function.
IF EXISTS
If specified, no exception is thrown when the function does not exist.

Examples
-- Create a permanent function `hello`
> CREATE FUNCTION hello() RETURNS STRING RETURN 'Hello World!';

-- Create a temporary function `hello`


> CREATE TEMPORARY FUNCTION hello() RETURNS STRING RETURN 'Good morning!';

-- List user functions


> SHOW USER FUNCTIONS;
default.hello
hello

-- Drop a permanent function


> DROP FUNCTION hello;

-- Try to drop a permanent function which is not present


> DROP FUNCTION hello;
Function 'default.hello' not found in schema 'default'

-- List the functions after dropping, it should list only temporary function
> SHOW USER FUNCTIONS;
hello

-- Drop a temporary function if exists


> DROP TEMPORARY FUNCTION IF EXISTS hello;

Related statements
CREATE FUNCTION (External)
CREATE FUNCTION (SQL)
DESCRIBE FUNCTION
SHOW FUNCTIONS
DROP EXTERNAL LOCATION
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Drops an external location. An exception is thrown if the location does not exist in the metastore.
Since: Databricks Runtime 10.3

Syntax
DROP EXTERNAL LOCATION [ IF EXISTS ] location_name [ FORCE ]

Parameters
IF EXISTS

If specified, no exception is thrown when the location does not exist.


location_name
The name of an existing location in the metastore. If the name does not exist, an exception is thrown
unless IF EXISTS has been specified.
FORCE

Optionally force the location to be dropped even if it is used by existing external tables. If FORCE is not
specified an error is raised if the location is in use.

Examples
> DROP EXTERNAL LOCATION some_location FORCE;
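
-- Where the location may not exist, IF EXISTS suppresses the exception; a minimal sketch
-- reusing the location name from the example above.
> DROP EXTERNAL LOCATION IF EXISTS some_location;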

Related articles
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
External locations and storage credentials
SHOW EXTERNAL LOCATIONS
DROP RECIPIENT
7/21/2022 • 2 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Drops a recipient. An exception is thrown if the recipient does not exist in the system.
Since: Databricks Runtime 10.3

Syntax
DROP RECIPIENT [ IF EXISTS ] recipient_name

Parameters
IF EXISTS
If specified, no exception is thrown when the recipient does not exist.
recipient_name
The name of an existing recipient in the system. If the name does not exist, an exception is thrown.

Examples
-- Create `other_corp` recipient
> CREATE RECIPIENT other_corp COMMENT 'OtherCorp.com';

-- Retrieve the activation link


> DESCRIBE RECIPIENT other_corp;
name       created_at                   created_by                 comment       activation_link   active_token_id                      active_token_expiration_time rotated_token_id rotated_token_expiration_time
---------- ---------------------------- -------------------------- ------------- ----------------- ------------------------------------ ---------------------------- ---------------- -----------------------------
other_corp 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com OtherCorp.com https://send/this 0160c81f-5262-40bb-9b03-3ee12e6d98d7 9999-12-31T23:59:59.999+0000 NULL             NULL

-- Drop the recipient


> DROP RECIPIENT other_corp;

-- Drop the recipient using IF EXISTS.


> DROP RECIPIENT IF EXISTS other_corp;

Related articles
CREATE RECIPIENT
DESCRIBE RECIPIENT
SHOW RECIPIENTS
DROP SCHEMA
7/21/2022 • 2 minutes to read

Drops a schema and deletes the directory associated with the schema from the file system. An exception is
thrown if the schema does not exist in the system.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Syntax
DROP SCHEMA [ IF EXISTS ] schema_name [ RESTRICT | CASCADE ]

Parameters
IF EXISTS
If specified, no exception is thrown when the schema does not exist.
schema_name
The name of an existing schema in the system. If the name does not exist, an exception is thrown.
RESTRICT
If specified, will restrict dropping a non-empty schema and is enabled by default.
CASCADE
If specified, will drop all the associated tables and functions.

WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.

Examples
-- Create `inventory_schema` Database
> CREATE SCHEMA inventory_schema COMMENT 'This schema is used to maintain Inventory';

-- Drop the schema and its tables


> DROP SCHEMA inventory_schema CASCADE;

-- Drop the schema using IF EXISTS


> DROP SCHEMA IF EXISTS inventory_schema CASCADE;

Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
SHOW SCHEMAS
DROP SHARE
7/21/2022 • 2 minutes to read

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Drops a share. An exception is thrown if the share does not exist in the system.
Since: Databricks Runtime 10.3

Syntax
DROP SHARE [ IF EXISTS ] share_name

Parameters
IF EXISTS
If specified, no exception is thrown when the share does not exist.
share_name
The name of an existing share. If the name does not exist, an exception is thrown.

Examples
-- Create `vaccine` share
> CREATE SHARE vaccine COMMENT 'This share is used to share information about vaccines';

-- Drop the share


> DROP SHARE vaccine;

-- Drop the share using IF EXISTS.


> DROP SHARE IF EXISTS vaccine;

Related articles
ALTER SHARE
CREATE SHARE
DESCRIBE SHARE
SHOW SHARES
DROP TABLE
7/21/2022 • 2 minutes to read

Deletes the table and removes the directory associated with the table from the file system if the table is not
an EXTERNAL table. An exception is thrown if the table does not exist.

In case of an external table, only the associated metadata information is removed from the metastore schema.
Any foreign key constraints referencing the table are also dropped.
If the table is cached, the command uncaches the table and all its dependents.

Syntax
DROP TABLE [ IF EXISTS ] table_name

Parameter
IF EXISTS
If specified, no exception is thrown when the table does not exist.
table_name
The name of the table to be dropped. The name must not include a temporal specification.

Examples
-- Assumes a table named `employeetable` exists.
> DROP TABLE employeetable;

-- Assumes a table named `employeetable` exists in the `userdb` schema


> DROP TABLE userdb.employeetable;

-- Assumes a table named `employeetable` does not exist.


-- Throws exception
> DROP TABLE employeetable;
Error: Table or view not found: employeetable;

-- Assumes a table named `employeetable` does not exist,Try with IF EXISTS


-- this time it will not throw exception
> DROP TABLE IF EXISTS employeetable;

Related articles
CREATE TABLE
CREATE SCHEMA
DROP SCHEMA
DROP VIEW
7/21/2022 • 2 minutes to read

Removes the metadata associated with a specified view from the catalog.

Syntax
DROP VIEW [ IF EXISTS ] view_name

Parameter
IF EXISTS
If specified, no exception is thrown when the view does not exist.
view_name
The name of the view to be dropped.

Examples
-- Assumes a view named `employeeView` exists.
> DROP VIEW employeeView;

-- Assumes a view named `employeeView` exists in the `usersc` schema


> DROP VIEW usersc.employeeView;

-- Assumes a view named `employeeView` does not exist.


-- Throws exception
> DROP VIEW employeeView;
Error: Table or view not found: employeeView;

-- Assumes a view named `employeeView` does not exist. Try with IF EXISTS
-- this time it will not throw exception
> DROP VIEW IF EXISTS employeeView;

Related articles
CREATE VIEW
ALTER VIEW
SHOW VIEWS
CREATE SCHEMA
DROP SCHEMA
MSCK REPAIR TABLE
7/21/2022 • 2 minutes to read

Recovers all of the partitions in the directory of a table and updates the Hive metastore. When creating a table
using PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the
partitioned table is created from existing data, partitions are not registered automatically in the Hive metastore;
you must run MSCK REPAIR TABLE to register the partitions.
Another way to recover partitions is to use ALTER TABLE RECOVER PARTITIONS.
This statement does not apply to Delta Lake tables.
If the table is cached, the command clears cached data of the table and all its dependents that refer to it. The
cache will be lazily filled when the next time the table or the dependents are accessed.

Syntax
MSCK REPAIR TABLE table_name [ {ADD | DROP | SYNC} PARTITIONS]

Parameters
table_name
The name of the partitioned table to be repaired.
ADD or DROP or SYNC PARTITIONS

Since: Databricks Runtime 10.0


Specifies how to recover partitions. If not specified, ADD is the default.
ADD , the command adds new partitions to the session catalog for all sub-folders in the base table
folder that don’t belong to any table partitions.
DROP , the command drops all partitions from the session catalog that have non-existing locations in
the file system.
SYNC is the combination of DROP and ADD .

Examples
-- create a partitioned table from existing data /tmp/namesAndAges.parquet
> CREATE TABLE t1 (name STRING, age INT) USING parquet PARTITIONED BY (age)
LOCATION "/tmp/namesAndAges.parquet";

-- SELECT * FROM t1 does not return results


> SELECT * FROM t1;

-- run MSCK REPAIR TABLE to recovers all the partitions


> MSCK REPAIR TABLE t1;

-- SELECT * FROM t1 returns results


> SELECT * FROM t1;
name age
------- ---
Michael 20
Justin 19
Andy 30
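
-- In Databricks Runtime 10.0 and above the recovery mode can be stated explicitly; a minimal sketch
-- using SYNC PARTITIONS against the same table, which both adds newly discovered partitions and
-- drops partitions whose locations no longer exist.
> MSCK REPAIR TABLE t1 SYNC PARTITIONS;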

Related articles
ALTER TABLE
MSCK REPAIR PRIVILEGES
TRUNCATE TABLE
7/21/2022 • 2 minutes to read

Removes all the rows from a table or partition(s). The table must not be a view or an external or temporary
table. In order to truncate multiple partitions at once, specify the partitions in partition_spec . If no
partition_spec is specified, removes all partitions in the table.

If the table is cached, the command clears cached data of the table and all its dependents that refer to it. The
cache will be lazily filled when the table or the dependents are accessed the next time.

Syntax
TRUNCATE TABLE table_name [ PARTITION clause ]

Parameters
table_name
The name of the table to truncate. The name must not include a temporal specification.
PARTITION clause
Optional specification of a partition.

Examples
-- Create table Student with partition
> CREATE TABLE Student (name STRING, rollno INT) PARTITIONED BY (age INT);

> SELECT * FROM Student;


name rollno age
---- ------ ---
ABC 1 10
DEF 2 10
XYZ 3 12

-- Remove all rows from the table in the specified partition


> TRUNCATE TABLE Student partition(age=10);

-- After truncate execution, records belonging to partition age=10 are removed


> SELECT * FROM Student;
name rollno age
---- ------ ---
XYZ 3 12

-- Remove all rows from the table from all partitions


> TRUNCATE TABLE Student;

> SELECT * FROM Student;


name rollno age
---- ------ ---

Related articles
DROP TABLE
ALTER TABLE
PARTITION
USE CATALOG
7/21/2022 • 2 minutes to read

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Sets the current catalog. After the current catalog is set, partially qualified and unqualified identifiers for tables,
functions, and views that are referenced by SQL statements are resolved from the current catalog.
Setting the catalog also resets the current database to default .
Since: Databricks Runtime 10.3

Syntax
{ USE | SET } CATALOG [ catalog_name | ' catalog_name ' ]

Parameter
catalog_name
Name of the catalog to use. If the catalog does not exist, an exception is thrown.

Examples
-- Use the 'hive_metastore' which exists.
> USE CATALOG hive_metastore;

> USE CATALOG 'hive_metastore';

-- Use the 'some_catalog' which doesn't exist


> USE CATALOG `some_catalog`;
Error: Catalog 'some_catalog' not found;

-- Setting the catalog resets the database to `default`


> USE CATALOG some_cat;
> SELECT current_database(), current_catalog();
some_cat default

-- Setting the schema within the current catalog


> USE DATABASE some_db;
> SELECT current_database(), current_catalog();
some_cat some_db

-- Resetting both catalog and schema


> USE DATABASE main.my_db;
> SELECT current_database(), current_catalog();
main my_db

-- Setting the catalog resets the database to `default` again


> USE CATALOG some_cat;
> SELECT current_database(), current_catalog();
some_cat default
Related articles
CREATE CATALOG
DROP CATALOG
USE DATABASE
7/21/2022 • 2 minutes to read

An alias for USE SCHEMA.


While usage of SCHEMA , NAMESPACE and DATABASE is interchangeable, SCHEMA is preferred.

Related articles
CREATE SCHEMA
DROP SCHEMA
USE SCHEMA
USE SCHEMA
7/21/2022 • 2 minutes to read

Since: Databricks Runtime 10.2


Sets the current schema. After the current schema is set, unqualified references to objects such as tables,
functions, and views that are referenced by SQL statements are resolved from the current schema. The default schema
name is default .
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Syntax
USE [SCHEMA] schema_name

Parameter
schema_name
Name of the schema to use. If schema_name is qualified the current catalog is also set to the specified
catalog name. If the schema does not exist, an exception is thrown.

Examples
-- Use the 'userschema' which exists.
> USE SCHEMA userschema;

-- Use the 'userschema1' which doesn't exist


> USE SCHEMA userschema1;
Error: Database 'userschema1' not found;

-- Setting the catalog resets the schema to `default`


> USE CATALOG some_cat;
> SELECT current_schema(), current_catalog();
some_cat default

-- Setting the schema within the current catalog


> USE SCHEMA some_schema;
> SELECT current_schema(), current_catalog();
some_cat some_schema

-- Resetting both catalog and schema


> USE SCHEMA main.my_schema;
> SELECT current_schema(), current_catalog();
main my_schema

-- Setting the catalog resets the schema to `default` again


> USE CATALOG some_cat;
> SELECT current_schema(), current_catalog();
some_cat default

Related articles
CREATE SCHEMA
DROP SCHEMA
Table properties and table options
7/21/2022 • 3 minutes to read

Defines user defined tags for tables and views.


table properties
A table property is a key-value pair which you can initialize when you perform a CREATE TABLE or a
CREATE VIEW. You can UNSET existing or SET new or existing table properties using ALTER TABLE or
ALTER VIEW.
You can use table properties to tag tables with information not tracked by SQL.
table options
The purpose of table options is to pass storage properties to the underlying storage, such as SERDE
properties to Hive.
A table option is a key-value pair which you can initialize when you perform a CREATE TABLE. You cannot
SET or UNSET a table option.

TBLPROPERTIES
Sets one or more table properties in a new table or view.
You can use table properties to tag tables with information not tracked by SQL.
Syntax

TBLPROPERTIES ( property_key [ = ] property_val [, ...] )

property_key
{ identifier [. ...] | string_literal }

Parameters
property_key
The property key. The key can consist of one or more identifiers separated by a dot, or a string literal.
Property keys must be unique and are case sensitive.
property_val
The value for the property. The value must be a BOOLEAN, STRING, INTEGER, or DECIMAL literal.
Examples

-- Create table with user defined table properties


> CREATE TABLE T(c1 INT) TBLPROPERTIES('this.is.my.key' = 12, this.is.my.key2 = true);
> SHOW TBLPROPERTIES T;
option.serialization.format 1
this.is.my.key 12
this.is.my.key2 true
transient_lastDdlTime 1649783569
SET TBLPROPERTIES
Sets one or more table properties in an existing table or view.
Syntax

SET TBLPROPERTIES ( property_key [ = ] property_val [, ...] )

property_key
{ identifier [. ...] | string_literal }

Parameters
property_key
The property key. The key can consist of one or more identifiers separated by a dot, or a string literal.
Property keys must be unique and are case sensitive.
property_val
The new value for the property. The value must be a BOOLEAN, STRING, INTEGER, or DECIMAL literal.
Examples

-- Alter a table's table properties.


> ALTER TABLE T SET TBLPROPERTIES(this.is.my.key = 14, 'this.is.my.key2' = false);
> SHOW TBLPROPERTIES T;
option.serialization.format 1
this.is.my.key 14
this.is.my.key2 false
transient_lastDdlTime 1649783980

UNSET TBLPROPERTIES
Removes one or more table properties from a table or view.
Syntax

UNSET TBLPROPERTIES [ IF EXISTS ] ( property_key [, ...] )

property_key
{ identifier [. ...] | string_literal }

Parameters
IF EXISTS
An optional clause directing Databricks Runtime not to raise an error if any of the property keys do not
exist.
property_key
The property key to remove. The key can consist of one or more identifiers separated by a dot, or a string
literal.
Property keys are case sensitive. If property_key doesn’t exist an error is raised unless IF EXISTS has
been specified.
Examples
-- Remove a table's table properties.
> ALTER TABLE T UNSET TBLPROPERTIES(this.is.my.key, 'this.is.my.key2');
> SHOW TBLPROPERTIES T;
option.serialization.format 1
transient_lastDdlTime 1649784415

OPTIONS
Sets one or more table options in a new table.
The purpose of table options is to pass storage properties to the underlying storage, such as SERDE properties
to Hive.
Specifying table options for Delta Lake tables will also echo these options as table properties.
Syntax

OPTIONS ( property_key [ = ] property_val [, ...] )

property_key
{ identifier [. ...] | string_literal }

Parameters
property_key
The property key. The key can consist of one or more identifiers separated by a dot, or a string literal.
Property keys must be unique and are case sensitive.
property_val
The value for the property. The value must be a BOOLEAN, STRING, INTEGER, or DECIMAL literal.
Examples

-- Create table with user defined table option


-- The options appear with an `option.` prefix.
> CREATE TABLE T(c1 INT) OPTIONS(this.is.my.key = 'green');
> SHOW TBLPROPERTIES T;
option.this.is.my.key green
option.serialization.format 2

Reserved table property keys


Databricks Runtime reserves some property keys for its own use and raises an error if you attempt to use them:
external
Use CREATE EXTERNAL TABLE to create an external table.
location
Use the LOCATION clauses of ALTER TABLE and CREATE TABLE to set a table location.
owner
Use the OWNER TO clause of ALTER TABLE and ALTER VIEW to transfer ownership of a table or view.
provider
Use the USING clause of CREATE TABLE to set the data source of a table.

You should not use property keys starting with the option identifier. This prefix identifier will be filtered out in
SHOW TBLPROPERTIES. The option prefix is also used to display table options.

Common TBLPROPERTIES and OPTIONS keys


The following settings are commonly used with Delta Lake:
delta.appendOnly : Set to true to disable UPDATE and DELETE operations.
delta.dataSkippingNumIndexedCols : Set to the number of leading columns for which to collect and consider
statistics.
delta.deletedFileRetentionDuration : Set to an interval such as 'interval 7 days' to control when VACUUM is
allowed to delete files.
delta.logRetentionDuration : Set to an interval such as 'interval 60 days' to control how long history is
kept for time travel queries.
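
-- As a sketch of how these keys are applied, the properties can be set on an existing Delta table
-- with ALTER TABLE ... SET TBLPROPERTIES as described earlier (the table name `events` is hypothetical).
> ALTER TABLE events SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 60 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  );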

Related articles
CREATE TABLE [USING]
CREATE TABLE CLONE
DROP TABLE
ALTER TABLE
INSERT INTO
7/21/2022 • 6 minutes to read

Inserts new rows into a table and optionally truncates the table or partitions. You specify the inserted rows by
value expressions or the result of a query.

Syntax
INSERT { OVERWRITE | INTO } [ TABLE ] table_name
[ PARTITION clause ]
[ ( column_name [, ...] ) ]
query

NOTE
When you INSERT INTO a Delta table schema enforcement and evolution is supported. If a column’s data type cannot
be safely cast to a Delta table’s data type, a runtime exception is thrown. If schema evolution is enabled, new columns can
exist as the last columns of your schema (or nested columns) for the schema to evolve.

Parameters
INTO or OVERWRITE

If you specify OVERWRITE the following applies:


Without a partition_spec the table is truncated before inserting the first row.
Otherwise all partitions matching the partition_spec are truncated before inserting the first row.
If you specify INTO all rows inserted are additive to the existing rows.
table_name
Identifies the table to be inserted to. The name must not include a temporal specification.
PARTITION clause
An optional parameter that specifies a target partition for the insert. You may also only partially specify
the partition.
( column_name [, …] )
An optional permutation of all the columns in the table. You can use this clause to map the columns if the
columns returned by the query do not line up with the natural order of the columns.
query
A query that produces the rows to be inserted.
You must match the number of columns returned by the query with the columns in the table excluding
partitioning columns with assigned values in the PARTITION clause.
If a data type cannot be safely cast to the matching column data type, a runtime exception is thrown.
If schema evolution is enabled, new columns can exist as the last columns of your schema (or nested
columns) for the schema to evolve.

Dynamic partition inserts


In PARTITION clause, the partition column values are optional. When the partition specification part_spec is not
completely provided, such inserts are called dynamic partition inserts or multi-partition inserts. When the
values are not specified, these columns are referred to as dynamic partition columns; otherwise, they are static
partition columns. For example, the partition spec (p1 = 3, p2, p3) has a static partition column ( p1 ) and two
dynamic partition columns ( p2 and p3 ).
In PARTITION clause, static partition keys must come before the dynamic partition keys. This means all partition
columns having constant values must appear before other partition columns that do not have an assigned
constant value.
The partition values of dynamic partition columns are determined during the execution. The dynamic partition
columns must be specified last in both part_spec and the input result set (of the row value lists or the select
query). They are resolved by position, instead of by names. Thus, the orders must be exactly matched.
The DataFrameWriter APIs do not have an interface to specify partition values. Therefore, the insertInto() API
is always using dynamic partition mode.
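
-- A minimal sketch of a dynamic partition insert, assuming a hypothetical table `sales` partitioned
-- by (region, year) with one data column `amount`; `region` is a static partition column and `year`
-- is a dynamic partition column resolved by position from the last column of the query.
> INSERT INTO sales PARTITION (region = 'US', year)
  SELECT amount, year FROM staged_sales;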

Examples
In this section:
INSERT INTO
Insert with a column list
Insert with both a partition spec and a column list
INSERT OVERWRITE
INSERT INTO
Single row insert using a VALUES clause

> CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
PARTITIONED BY (student_id);

> INSERT INTO students VALUES
('Amy Smith', '123 Park Ave, San Jose', 111111);

> SELECT * FROM students;


name address student_id
--------- --------------------- ----------
Amy Smith 123 Park Ave,San Jose 111111

Multi-row insert using a VALUES clause

> INSERT INTO students VALUES
('Bob Brown', '456 Taylor St, Cupertino', 222222),
('Cathy Johnson', '789 Race Ave, Palo Alto', 333333);

> SELECT * FROM students;


name address student_id
------------- ------------------------ ----------
Amy Smith 123 Park Ave, San Jose 111111
Bob Brown 456 Taylor St, Cupertino 222222
Cathy Johnson 789 Race Ave, Palo Alto 333333

Insert using a subquery


-- Assuming the persons table has already been created and populated.
> SELECT * FROM persons;
name address ssn
------------- ------------------------- ---------
Dora Williams 134 Forest Ave, Melo Park 123456789
Eddie Davis 245 Market St, Milpitas 345678901

> INSERT INTO students PARTITION (student_id = 444444)
SELECT name, address FROM persons WHERE name = "Dora Williams";

> SELECT * FROM students;


name address student_id
------------- ------------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111
Bob Brown 456 Taylor St, Cupertino 222222
Cathy Johnson 789 Race Ave, Palo Alto 333333
Dora Williams 134 Forest Ave, Melo Park 444444

Insert using a TABLE clause

-- Assuming the visiting_students table has already been created and populated.
> SELECT * FROM visiting_students;
name address student_id
------------- --------------------- ----------
Fleur Laurent 345 Copper St, London 777777
Gordon Martin 779 Lake Ave, Oxford 888888

> INSERT INTO students TABLE visiting_students;

> SELECT * FROM students;


name address student_id
------------- ------------------------- ----------
Amy Smith 123 Park Ave,San Jose 111111
Bob Brown 456 Taylor St, Cupertino 222222
Cathy Johnson 789 Race Ave, Palo Alto 333333
Dora Williams 134 Forest Ave, Melo Park 444444
Fleur Laurent 345 Copper St, London 777777
Gordon Martin 779 Lake Ave, Oxford 888888

Insert into a directory

> CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
PARTITIONED BY (student_id)
LOCATION "/mnt/user1/students";

> INSERT INTO delta.`/mnt/user1/students` VALUES
('Amy Smith', '123 Park Ave, San Jose', 111111);
> SELECT * FROM students;
name address student_id
------------- ------------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111

Insert with a column list

> INSERT INTO students (address, name, student_id) VALUES
('Hangzhou, China', 'Kent Yao', 11215016);
> SELECT * FROM students WHERE name = 'Kent Yao';
name address student_id
--------- ---------------------- ----------
Kent Yao Hangzhou, China 11215016

Insert with both a partition spec and a column list


> INSERT INTO students PARTITION (student_id = 11215017) (address, name) VALUES
('Hangzhou, China', 'Kent Yao Jr.');
> SELECT * FROM students WHERE student_id = 11215017;
name address student_id
------------ ---------------------- ----------
Kent Yao Jr. Hangzhou, China 11215017

INSERT OVERWRITE
Insert using a VALUES clause

-- Assuming the students table has already been created and populated.
> SELECT * FROM students;
name address student_id
------------- ------------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111
Bob Brown 456 Taylor St, Cupertino 222222
Cathy Johnson 789 Race Ave, Palo Alto 333333
Dora Williams 134 Forest Ave, Melo Park 444444
Fleur Laurent 345 Copper St, London 777777
Gordon Martin 779 Lake Ave, Oxford 888888
Helen Davis 469 Mission St, San Diego 999999
Jason Wang 908 Bird St, Saratoga 121212

> INSERT OVERWRITE students VALUES
('Ashua Hill', '456 Erica Ct, Cupertino', 111111),
('Brian Reed', '723 Kern Ave, Palo Alto', 222222);

> SELECT * FROM students;


name address student_id
---------- ----------------------- ----------
Ashua Hill 456 Erica Ct, Cupertino 111111
Brian Reed 723 Kern Ave, Palo Alto 222222

Insert using a subquery

-- Assuming the persons table has already been created and populated.
> SELECT * FROM persons;
name address ssn
------------- ------------------------- ---------
Dora Williams 134 Forest Ave, Melo Park 123456789
Eddie Davis 245 Market St,Milpitas 345678901

> INSERT OVERWRITE students PARTITION (student_id = 222222)
SELECT name, address FROM persons WHERE name = "Dora Williams";

> SELECT * FROM students;


name address student_id
------------- ------------------------- ----------
Ashua Hill 456 Erica Ct, Cupertino 111111
Dora Williams 134 Forest Ave, Melo Park 222222

Insert using a TABLE clause


-- Assuming the visiting_students table has already been created and populated.
> SELECT * FROM visiting_students;
name address student_id
------------- --------------------- ----------
Fleur Laurent 345 Copper St, London 777777
Gordon Martin 779 Lake Ave, Oxford 888888

> INSERT OVERWRITE students TABLE visiting_students;

> SELECT * FROM students;


name address student_id
------------- --------------------- ----------
Fleur Laurent 345 Copper St, London 777777
Gordon Martin 779 Lake Ave, Oxford 888888

Insert overwrite a directory

> CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
PARTITIONED BY (student_id)
LOCATION "/mnt/user1/students";

> INSERT OVERWRITE delta.`/mnt/user1/students` VALUES
('Amy Smith', '123 Park Ave, San Jose', 111111);
> SELECT * FROM students;
name address student_id
------------- ------------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111

Related articles
COPY
DELETE
MERGE
PARTITION
query
UPDATE
INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE DIRECTORY with Hive format
INSERT OVERWRITE DIRECTORY with Hive format
7/21/2022 • 2 minutes to read

Overwrites the existing data in the directory with the new values using Hive SerDe . Hive support must be
enabled to use this command. You specify the inserted rows by value expressions or the result of a query.

Syntax
INSERT OVERWRITE [ LOCAL ] DIRECTORY directory_path
[ ROW FORMAT row_format ] [ STORED AS file_format ]
{ VALUES ( { value | NULL } [ , ... ] ) [ , ( ... ) ] | query }

Parameters
directory_path
The destination directory. The LOCAL keyword specifies that the directory is on the local file system.
row_format
The row format for this insert. Valid options are SERDE clause and DELIMITED clause. SERDE clause can
be used to specify a custom SerDe for this insert. Alternatively, DELIMITED clause can be used to specify
the native SerDe and state the delimiter, escape character, null character, and so on.
file_format
The file format for this insert. Valid options are TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET , and
AVRO . You can also specify your own input and output format using INPUTFORMAT and OUTPUTFORMAT .
ROW FORMAT SERDE can only be used with TEXTFILE , SEQUENCEFILE , or RCFILE , while
ROW FORMAT DELIMITED can only be used with TEXTFILE .

VALUES ( { value | NULL } [ , … ] ) [ , ( … ) ]


The values to be inserted. Either an explicitly specified value or a NULL can be inserted. A comma must be
used to separate each value in the clause. More than one set of values can be specified to insert multiple
rows.
query
A query that produces the rows to be inserted. One of following formats:
A SELECT statement
A TABLE statement
A FROM statement

Examples
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination'
STORED AS orc
SELECT * FROM test_table;

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM test_table;
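
-- The VALUES form follows the same syntax; a minimal sketch writing literal rows
-- (the row values are hypothetical).
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination'
STORED AS orc
VALUES ('Amy Smith', 1), ('Bob Brown', 2);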

Related statements
INSERT INTO
INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE DIRECTORY
7/21/2022 • 2 minutes to read

Overwrites the existing data in the directory with the new values using a given Spark file format. You specify the
inserted row by value expressions or the result of a query.

Syntax
INSERT OVERWRITE [ LOCAL ] DIRECTORY [ directory_path ]
USING file_format [ OPTIONS ( key [ = ] val [ , ... ] ) ]
{ VALUES ( { value | NULL } [ , ... ] ) [ , ( ... ) ] | query }

Parameters
directory_path
The destination directory. It can also be specified in OPTIONS using path . The LOCAL keyword is used to
specify that the directory is on the local file system.
file_format
The file format to use for the insert. Valid options are TEXT , CSV , JSON , JDBC , PARQUET , ORC , HIVE ,
LIBSVM , or a fully qualified class name of a custom implementation of
org.apache.spark.sql.execution.datasources.FileFormat .

OPTIONS ( key [ = ] val [ , … ] )


Specifies one or more options for the writing of the file format.
VALUES ( { value | NULL } [ , … ] ) [ , ( … ) ]
The values to be inserted. Either an explicitly specified value or a NULL can be inserted. A comma must be
used to separate each value in the clause. More than one set of values can be specified to insert multiple
rows.
query
A query that produces the rows to be inserted. One of following formats:
A SELECT statement
A TABLE statement
A FROM statement

Examples
INSERT OVERWRITE DIRECTORY '/tmp/destination'
USING parquet
OPTIONS (col1 1, col2 2, col3 'test')
SELECT * FROM test_table;

INSERT OVERWRITE DIRECTORY
USING parquet
OPTIONS ('path' '/tmp/destination', col1 1, col2 2, col3 'test')
SELECT * FROM test_table;

Related statements
INSERT INTO
INSERT OVERWRITE DIRECTORY with Hive format
LOAD DATA
7/21/2022 • 2 minutes to read

Loads the data into a Hive SerDe table from the user specified directory or file. If a directory is specified then all
the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the
LOAD DATA statement takes an optional partition specification. When a partition is specified, the data files (when
input source is a directory) or the single file (when input source is a file) are loaded into the partition of the
target table.
If the table is cached, the command clears cached data of the table and all its dependents that refer to it. The
cache will be lazily filled when the table or the dependents are accessed the next time.

Syntax
LOAD DATA [ LOCAL ] INPATH path [ OVERWRITE ] INTO TABLE table_name [ PARTITION clause ]

Parameters
path
Path of the file system. It can be either an absolute or a relative path.
table_name
Identifies the table to be inserted to. The name must not include a temporal specification.
PARTITION clause
An optional parameter that specifies a target partition for the insert. You may also only partially specify
the partition.
LOCAL
If specified, it causes the INPATH to be resolved against the local file system, instead of the default file
system, which is typically a distributed storage.
OVERWRITE
By default, new data is appended to the table. If OVERWRITE is used, the table is instead overwritten with
new data.

Examples
-- Example without partition specification.
-- Assuming the students table has already been created and populated.
> SELECT * FROM students;
name address student_id
--------- ---------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111

> CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT) USING HIVE;

-- Assuming the students table is in '/user/hive/warehouse/'


> LOAD DATA LOCAL INPATH '/user/hive/warehouse/students' OVERWRITE INTO TABLE test_load;

> SELECT * FROM test_load;


name address student_id
--------- ---------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111

-- Example with partition specification.


> CREATE TABLE test_partition (c1 INT, c2 INT, c3 INT) PARTITIONED BY (c2, c3);

> INSERT INTO test_partition PARTITION (c2 = 2, c3 = 3) VALUES (1);

> INSERT INTO test_partition PARTITION (c2 = 5, c3 = 6) VALUES (4);

> INSERT INTO test_partition PARTITION (c2 = 8, c3 = 9) VALUES (7);

> SELECT * FROM test_partition;


c1 c2 c3
--- --- ---
1 2 3
4 5 6
7 8 9

> CREATE TABLE test_load_partition (c1 INT, c2 INT, c3 INT) USING HIVE PARTITIONED BY (c2, c3);

-- Assuming the test_partition table is in '/user/hive/warehouse/'


> LOAD DATA LOCAL INPATH '/user/hive/warehouse/test_partition/c2=2/c3=3'
OVERWRITE INTO TABLE test_load_partition PARTITION (c2=2, c3=3);

> SELECT * FROM test_load_partition;


c1 c2 c3
--- --- ---
1 2 3

Related articles
INSERT INTO
COPY INTO
EXPLAIN
7/21/2022 • 2 minutes to read

Provides the logical or physical plans for an input statement. By default, this clause provides information about a
physical plan only.

Syntax
EXPLAIN [ EXTENDED | CODEGEN | COST | FORMATTED ] statement

Parameters
EXTENDED
Generates the parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan. The
parsed logical plan is an unresolved plan extracted from the query. The analyzed logical plan resolves
unresolvedAttribute and unresolvedRelation into fully typed objects. The optimized logical plan
transforms through a set of optimization rules, resulting in the physical plan.
CODEGEN
Generates code for the statement, if any, and a physical plan.
COST
If plan node statistics are available, generates a logical plan and the statistics.
FORMATTED
Generates two sections: a physical plan outline and node details.
statement
A SQL statement to be explained.

Examples
-- Default Output
> EXPLAIN select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;

+----------------------------------------------------+
| plan|
+----------------------------------------------------+
| == Physical Plan ==
*(2) HashAggregate(keys=[k#33], functions=[sum(cast(v#34 as bigint))])
+- Exchange hashpartitioning(k#33, 200), true, [id=#59]
+- *(1) HashAggregate(keys=[k#33], functions=[partial_sum(cast(v#34 as bigint))])
+- *(1) LocalTableScan [k#33, v#34]
|
+----------------------------------------------------+

-- Using Extended
> EXPLAIN EXTENDED select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;

+----------------------------------------------------+
| plan|
+----------------------------------------------------+
| == Parsed Logical Plan ==
'Aggregate ['k], ['k, unresolvedalias('sum('v), None)]
+- 'SubqueryAlias `t`
+- 'UnresolvedInlineTable [k, v], [List(1, 2), List(1, 3)]

== Analyzed Logical Plan ==


k: int, sum(v): bigint
Aggregate [k#47], [k#47, sum(cast(v#48 as bigint)) AS sum(v)#50L]
+- SubqueryAlias `t`
+- LocalRelation [k#47, v#48]

== Optimized Logical Plan ==


Aggregate [k#47], [k#47, sum(cast(v#48 as bigint)) AS sum(v)#50L]
+- LocalRelation [k#47, v#48]

== Physical Plan ==
*(2) HashAggregate(keys=[k#47], functions=[sum(cast(v#48 as bigint))], output=[k#47, sum(v)#50L])
+- Exchange hashpartitioning(k#47, 200), true, [id=#79]
+- *(1) HashAggregate(keys=[k#47], functions=[partial_sum(cast(v#48 as bigint))], output=[k#47, sum#52L])
+- *(1) LocalTableScan [k#47, v#48]
|
+----------------------------------------------------+

-- Using Formatted

> EXPLAIN FORMATTED select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;

+----------------------------------------------------+
| plan|
+----------------------------------------------------+
| == Physical Plan ==
* HashAggregate (4)
+- Exchange (3)
+- * HashAggregate (2)
+- * LocalTableScan (1)

(1) LocalTableScan [codegen id : 1]


Output: [k#19, v#20]

(2) HashAggregate [codegen id : 1]


Input: [k#19, v#20]

(3) Exchange
Input: [k#19, sum#24L]

(4) HashAggregate [codegen id : 2]


Input: [k#19, sum#24L]
|
+----------------------------------------------------+
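
-- The COST and CODEGEN variants follow the same pattern; a minimal sketch (output omitted here,
-- since it depends on available statistics and the generated code).
> EXPLAIN COST select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;
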
Query
7/21/2022 • 2 minutes to read

Retrieves result sets from one or more tables.

Syntax
[ common_table_expression ]
{ subquery | set_operator }
[ ORDER BY clause | { [ DISTRIBUTE BY clause ] [ SORT BY clause ] } | CLUSTER BY clause ]
[ WINDOW clause ]
[ LIMIT clause ]

subquery
{ SELECT clause |
VALUES clause |
( query ) |
TABLE [ table_name | view_name ]}

Parameters
common table expression
Common table expressions (CTE) are one or more named queries which can be reused multiple times
within the main query block to avoid repeated computations or to improve readability of complex, nested
queries.
subquer y
One of several constructs producing an intermediate result set.
SELECT
A subquery consisting of a SELECT FROM WHERE pattern.
VALUES
Specifies an inline temporary table.
( query )
A nested invocation of a query which may contain set operators or common table expressions.
TABLE
Returns the entire table or view.
table_name
Identifies the table to be returned.
view_name
Identifies the view to be returned.
set_operator
A construct combining subqueries using UNION , EXCEPT , or INTERSECT operators.
ORDER BY
An ordering of the rows of the complete result set of the query. The output rows are ordered across the
partitions. This parameter is mutually exclusive with SORT BY , CLUSTER BY , and DISTRIBUTE BY and
cannot be specified together.
DISTRIBUTE BY
A set of expressions by which the result rows are repartitioned. This parameter is mutually exclusive with
ORDER BY and CLUSTER BY and cannot be specified together.

SORT BY
An ordering by which the rows are ordered within each partition. This parameter is mutually exclusive
with ORDER BY and CLUSTER BY and cannot be specified together.
CLUSTER BY
A set of expressions that is used to repartition and sort the rows. Using this clause has the same effect of
using DISTRIBUTE BY and SORT BY together.
LIMIT
The maximum number of rows that can be returned by a statement or subquery. This clause is mostly
used in conjunction with ORDER BY to produce a deterministic result.
WINDOW
Defines named window specifications that can be shared by multiple Window functions in the
select_query .
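
-- As a sketch of how these clauses compose (the table `orders` and its columns are hypothetical):
-- a common table expression feeds the subquery, whose result is then ordered and limited.
> WITH recent AS (SELECT customer_id, amount FROM orders WHERE order_date > '2022-01-01')
  SELECT customer_id, sum(amount) AS total
  FROM recent
  GROUP BY customer_id
  ORDER BY total DESC
  LIMIT 10;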

Related articles
CLUSTER BY clause
Common table expression (CTE)
DISTRIBUTE BY clause
GROUP BY clause
HAVING clause
Hints
VALUES clause
JOIN
LATERAL VIEW clause
LIMIT clause
ORDER BY clause
PIVOT clause
TABLESAMPLE clause
Set operator
SORT BY clause
Table-valued function (TVF)
WHERE clause
WINDOW clause
Window functions
SELECT
7/21/2022 • 5 minutes to read

Composes a result set from one or more tables. The SELECT clause can be part of a query which also includes
common table expressions (CTE), set operations, and various other clauses.

Syntax
SELECT [ hints ] [ ALL | DISTINCT ] { named_expression | star_clause } [, ...]
FROM from_item [, ...]
[ LATERAL VIEW clause ]
[ PIVOT clause ]
[ WHERE clause ]
[ GROUP BY clause ]
[ HAVING clause]
[ QUALIFY clause ]

from_item
{ table_name [ TABLESAMPLE clause ] [ table_alias ] |
JOIN clause |
[ LATERAL ] table_valued_function [ table_alias ] |
VALUES clause |
[ LATERAL ] ( query ) [ TABLESAMPLE clause ] [ table_alias ] }

named_expression
expression [ column_alias ]

star_clause
[ { table_name | view_name } . ] * [ except_clause ]

except_clause
EXCEPT ( { column_name | field_name } [, ...] )

Parameters
hints
Hints help the Databricks Runtime optimizer make better planning decisions. Databricks Runtime
supports hints that influence selection of join strategies and repartitioning of the data.
ALL
Select all matching rows from the relation. Enabled by default.
DISTINCT
Select all matching rows from the relation after removing duplicates in results.
named_expression
An expression with an optional assigned name.
expression
A combination of one or more values, operators, and SQL functions that evaluates to a value.
column_alias
An optional column identifier naming the expression result. If no column_alias is provided
Databricks Runtime derives one.
star_clause
A shorthand to name all the referencable columns in the FROM clause. The list of columns is ordered by
the order of from_item s and the order of columns within each from_item .
The _metadata column is not included in this list. You must reference it explicitly.
table_name
If present limits the columns to be named to those in the specified referencable table.
view_name
If specified limits the columns to be expanded to those in the specified referencable view.
except_clause
Since: Databricks Runtime 11.0
Optionally prunes columns or fields from the referencable set of columns identified in the star_clause .
column_name
A column that is part of the set of columns that you can reference.
field_name
A reference to a field in a column of the set of columns that you can reference. If you exclude all
fields from a STRUCT , the result is an empty STRUCT .
Each name must reference a column included in the set of columns that you can reference or their fields.
Otherwise, Databricks Runtime raises a MISSING_COLUMN error. If names overlap or are not unique,
Databricks Runtime raises an EXCEPT_OVERLAPPING_COLUMNS error.
from_item
A source of input for the SELECT . One of the following:
table_name
Identifies a table that may contain a temporal specification. See Query an older snapshot of a table
(time travel) for details.
view_name
Identifies a view.
JOIN
Combines two or more relations using a join.
[ LATERAL ] table_valued_function
Invokes a table function. To refer to columns exposed by a preceding from_item in the same FROM
clause you must specify LATERAL .
VALUES
Defines an inline table.
[ LATERAL ] ( query )
Computes a relation using a query. A query prefixed by LATERAL may reference columns exposed
by a preceding from_item in the same FROM clause. Such a construct is called a correlated or
dependent query.
LATERAL is supported since Databricks Runtime 9.0.
TABLESAMPLE
Optionally reduce the size of the result set by only sampling a fraction of the rows.
table_alias
Optionally specifies a label for the from_item . If the table_alias includes column_identifier s
their number must match the number of columns in the from_item .
PIVOT
Used to pivot the data; you can get aggregated values based on specific column values.
LATERAL VIEW
Used in conjunction with generator functions such as EXPLODE , which generates a virtual table containing
one or more rows. LATERAL VIEW applies the rows to each original output row.
WHERE
Filters the result of the FROM clause based on the supplied predicates.
GROUP BY
The expressions that are used to group the rows. This is used in conjunction with aggregate functions (
MIN , MAX , COUNT , SUM , AVG ) to group rows based on the grouping expressions and aggregate values
in each group. When a FILTER clause is attached to an aggregate function, only the matching rows are
passed to that function.
HAVING
The predicates by which the rows produced by GROUP BY are filtered. The HAVING clause is used to filter
rows after the grouping is performed. If you specify HAVING without GROUP BY , it indicates a GROUP BY
without grouping expressions (global aggregate).
QUALIFY
The predicates that are used to filter the results of window functions. To use QUALIFY , at least one
window function is required to be present in the SELECT list or the QUALIFY clause.
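For illustration, a minimal sketch of referencing the _metadata column explicitly; the file-backed table name events is an assumption.

-- _metadata is not expanded by the star clause and must be selected explicitly.
> SELECT *, _metadata FROM events;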

Select on Delta table


In addition to the standard SELECT options, Delta tables support the time travel options described in this
section. For details, see Query an older snapshot of a table (time travel).
AS OF syntax

table_identifier TIMESTAMP AS OF timestamp_expression

table_identifier VERSION AS OF version

where
timestamp_expression can be any one of:
'2018-10-18T22:15:12.013Z' , that is, a string that can be cast to a timestamp
cast('2018-10-18 13:36:32 CEST' as timestamp)
'2018-10-18' , that is, a date string
In Databricks Runtime 6.6 and above:
current_timestamp() - interval 12 hours
date_sub(current_date(), 1)
Any other expression that is or can be cast to a timestamp
version is a long value that can be obtained from the output of DESCRIBE HISTORY table_spec .

Neither timestamp_expression nor version can be subqueries.


Example

> SELECT * FROM events TIMESTAMP AS OF '2018-10-18T22:15:12.013Z'

> SELECT * FROM delta.`/mnt/delta/events` VERSION AS OF 123
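As an additional hedged sketch, the relative timestamp forms listed above (Databricks Runtime 6.6 and above) can be used in the same way; the table name events follows the example above.

> SELECT * FROM events TIMESTAMP AS OF current_timestamp() - interval 12 hours

> SELECT * FROM events TIMESTAMP AS OF date_sub(current_date(), 1)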

@ syntax
Use the @ syntax to specify the timestamp or version. The timestamp must be in yyyyMMddHHmmssSSS format.
You can specify a version after @ by prepending a v to the version. For example, to query version 123 for the
table events , specify events@v123 .
Example

> SELECT * FROM events@20190101000000000

> SELECT * FROM events@v123

Examples
-- select all referencable columns from all tables
> SELECT * FROM VALUES(1, 2) AS t1(c1, c2), VALUES(3, 4) AS t2(c3, c4);
1 2 3 4

-- select all referencable columns from one table


> SELECT t2.* FROM VALUES(1, 2) AS t1(c1, c2), VALUES(3, 4) AS t2(c3, c4);
3 4

-- select all referencable columns from all tables except t2.c4


> SELECT * EXCEPT(c4) FROM VALUES(1, 2) AS t1(c1, c2), VALUES(3, 4) AS t2(c3, c4);
1 2 3

-- select all referencable columns from a table, except a nested field.


> SELECT * EXCEPT(c2.b) FROM VALUES(1, named_struct('a', 2, 'b', 3)) AS t(c1, c2);
1 { "a" : 2 }

-- Removing all fields results in an empty struct


> SELECT * EXCEPT(c2.b, c2.a) FROM VALUES(1, named_struct('a', 2, 'b', 3)) AS t(c1, c2);
1 { }

-- Overlapping names result in an error


> SELECT * EXCEPT(c2, c2.a) FROM VALUES(1, named_struct('a', 2, 'b', 3)) AS t(c1, c2);
Error: EXCEPT_OVERLAPPING_COLUMNS
Related articles
CLUSTER BY clause
Common table expression (CTE)
DISTRIBUTE BY clause
GROUP BY clause
HAVING clause
QUALIFY clause
Hints
VALUES clause
JOIN
LATERAL VIEW clause
LIMIT clause
ORDER BY clause
PIVOT clause
Query
TABLESAMPLE clause
Set operators
SORT BY clause
Table-valued function (TVF)
WHERE clause
WINDOW clause
Window functions
VALUES clause
7/21/2022 • 2 minutes to read

Produces an inline temporary table for use within the query.

Syntax
VALUES {expression | ( expression [, ...] ) } [, ...] [table_alias]

SELECT expression [, ...] [table_alias]

Parameters
expression
A combination of one or more values, operators and SQL functions that results in a value.
table_alias
An optional label to allow the result set to be referenced by name.
Each tuple constitutes a row.
If there is more than one row, the number of fields in each tuple must match.
When using the VALUES syntax, if no tuples are specified, each expression equates to a single-field tuple.
When using the SELECT syntax, all expressions constitute a single-row temporary table.
The nth field of each tuple must share a least common type. If table_alias specifies column names, their
number must match the number of expressions per tuple.
The result is a temporary table where each column’s type is the least common type of the matching tuples’ fields.

Examples
-- single row, without a table alias
> VALUES ("one", 1);
one 1

-- Multiple rows, one column


> VALUES 1, 2, 3;
1
2
3

-- three rows with a table alias


> SELECT data.a, b
FROM VALUES ('one', 1),
('two', 2),
('three', NULL) AS data(a, b);
one 1
two 2
three NULL

-- complex types with a table alias


> SELECT a, b
FROM VALUES ('one', array(0, 1)),
('two', array(2, 3)) AS data(a, b);
one [0, 1]
two [2, 3]

-- Using the SELECT syntax


> SELECT 'one', 2
one 2

Related articles
Query
SELECT
CLUSTER BY clause
7/21/2022 • 2 minutes to read

Repartitions the data based on the input expressions and then sorts the data within each partition. This is
semantically equivalent to performing a DISTRIBUTE BY followed by a SORT BY. This clause only ensures that the
resultant rows are sorted within each partition and does not guarantee a total order of output.

Syntax
CLUSTER BY expression [, ...]

Parameters
expression
Specifies a combination of one or more values, operators, and SQL functions that results in a value.

Examples
> CREATE TEMP VIEW person (name, age)
AS VALUES ('Zen Hui', 25),
('Anil B', 18),
('Shone S', 16),
('Mike A', 25),
('John A', 18),
('Jack N', 16);

-- Reduce the number of shuffle partitions to 2 to illustrate the behavior of `CLUSTER BY`.
-- It's easier to see the clustering and sorting behavior with fewer partitions.
> SET spark.sql.shuffle.partitions = 2;

-- Select the rows with no ordering. Please note that without any sort directive, the results
-- of the query are not deterministic. It's included here to show the difference in behavior
-- of a query when `CLUSTER BY` is not used vs when it's used. The query below produces rows
-- where the age column is not sorted.
> SELECT age, name FROM person;
16 Shone S
25 Zen Hui
16 Jack N
25 Mike A
18 John A
18 Anil B

-- Produces rows clustered by age. Persons with same age are clustered together.
-- In the query below, persons with age 18 and 25 are in first partition and the
-- persons with age 16 are in the second partition. The rows are sorted based
-- on age within each partition.
> SELECT age, name FROM person CLUSTER BY age;
18 John A
18 Anil B
25 Zen Hui
25 Mike A
16 Shone S
16 Jack N
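As a sketch of the equivalence noted above, combining DISTRIBUTE BY and SORT BY explicitly produces the same clustering and partition-local sorting as the CLUSTER BY query (using the person view defined above):

> SELECT age, name FROM person DISTRIBUTE BY age SORT BY age;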
Related articles
Query
DISTRIBUTE BY
SORT BY
Common table expression (CTE)
7/21/2022 • 2 minutes to read

Defines a temporary result set that you can reference possibly multiple times within the scope of a SQL
statement. A CTE is used mainly in a SELECT statement.

Syntax
WITH common_table_expression [, ...]

common_table_expression
view_identifier [ ( column_identifier [, ...] ) ] [ AS ] ( query )

Parameters
view_identifier
An identifier by which the common_table_expression can be referenced
column_identifier
An optional identifier by which a column of the common_table_expression can be referenced.
If column_identifiers are specified their number must match the number of columns returned by the
query . If no names are specified the column names are derived from the query .

query
A query producing a result set.

Examples
-- CTE with multiple column aliases
> WITH t(x, y) AS (SELECT 1, 2)
SELECT * FROM t WHERE x = 1 AND y = 2;
1 2

-- CTE in CTE definition


> WITH t AS (
WITH t2 AS (SELECT 1)
SELECT * FROM t2)
SELECT * FROM t;
1

-- CTE in subquery
> SELECT max(c) FROM (
WITH t(c) AS (SELECT 1)
SELECT * FROM t);
1

-- CTE in subquery expression


> SELECT (WITH t AS (SELECT 1)
SELECT * FROM t);
1

-- CTE in CREATE VIEW statement


> CREATE VIEW v AS
WITH t(a, b, c, d) AS (SELECT 1, 2, 3, 4)
SELECT * FROM t;
> SELECT * FROM v;
1 2 3 4

-- CTE names are scoped


> WITH t AS (SELECT 1),
t2 AS (
WITH t AS (SELECT 2)
SELECT * FROM t)
SELECT * FROM t2;
2

Related articles
Query
DISTRIBUTE BY clause
7/21/2022 • 2 minutes to read

Repartitions data based on the input expressions. Unlike the CLUSTER BY clause, does not sort the data within
each partition.

Syntax
DISTRIBUTE BY expression [, ...]

Parameters
expression
An expression of any type.

Examples
> CREATE TEMP VIEW person (name, age)
AS VALUES ('Zen Hui', 25),
('Anil B', 18),
('Shone S', 16),
('Mike A', 25),
('John A', 18),
('Jack N', 16);

-- Reduce the number of shuffle partitions to 2 to illustrate the behavior of `DISTRIBUTE BY`.
-- It's easier to see the clustering and sorting behavior with fewer partitions.
> SET spark.sql.shuffle.partitions = 2;

-- Select the rows with no ordering. Please note that without any sort directive, the result
-- of the query is not deterministic. It's included here just to contrast it with the
-- behavior of `DISTRIBUTE BY`. The query below produces rows where values of the age
-- column are not clustered together.
> SELECT age, name FROM person;
16 Shone S
25 Zen Hui
16 Jack N
25 Mike A
18 John A
18 Anil B

-- Produces rows clustered by age. Persons with same age are clustered together.
-- Unlike `CLUSTER BY` clause, the rows are not sorted within a partition.
> SELECT age, name FROM person DISTRIBUTE BY age;
25 Zen Hui
25 Mike A
18 John A
18 Anil B
16 Shone S
16 Jack N

Related articles
Query
CLUSTER BY
SORT BY
GROUP BY clause
7/21/2022 • 7 minutes to read

The GROUP BY clause is used to group the rows based on a set of specified grouping expressions and compute
aggregations on the group of rows based on one or more specified aggregate functions. Databricks Runtime
also supports advanced aggregations to do multiple aggregations for the same input record set via
GROUPING SETS , CUBE , ROLLUP clauses. The grouping expressions and advanced aggregations can be mixed in
the GROUP BY clause and nested in a GROUPING SETS clause.
See more details in the Mixed/Nested Grouping Analytics section.
When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.

Syntax
GROUP BY group_expression [, ...] [ WITH ROLLUP | WITH CUBE ]

GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } ( grouping_set [, ...] ) } [, ...]

grouping_set
{ expression |
( [ expression [, ...] ] ) }

While aggregate functions are defined as

aggregate_name ( [ DISTINCT ] expression [, ...] ) [ FILTER ( WHERE boolean_expression ) ]

Parameters
group_expression
Specifies the criteria for grouping rows together. The grouping of rows is performed based on result
values of the grouping expressions. A grouping expression may be a column name like GROUP BY a , a
column position like GROUP BY 1 , or an expression like GROUP BY a + b .
grouping_set
A grouping set is specified by zero or more comma-separated expressions in parentheses. When the
grouping set has only one element, parentheses can be omitted. For example, GROUPING SETS ((a), (b)) is
the same as GROUPING SETS (a, b).
GROUPING SETS
Groups the rows for each grouping set specified after GROUPING SETS . For example:
GROUP BY GROUPING SETS ((warehouse), (product)) is semantically equivalent to a union of results of
GROUP BY warehouse and GROUP BY product .
This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs
aggregation of each grouping set specified in the GROUPING SETS clause.
Similarly, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) is semantically equivalent to the
union of results of GROUP BY warehouse, product , GROUP BY product and a global aggregate.

NOTE
For Hive compatibility Databricks Runtime allows GROUP BY ... GROUPING SETS (...) . The GROUP BY expressions are
usually ignored, but if they contain extra expressions in addition to the GROUPING SETS expressions, the extra
expressions will be included in the grouping expressions and their value is always null. For example, in
SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b) , the output of column c is always null.

ROLLUP
Specifies multiple levels of aggregations in a single statement. This clause is used to compute
aggregations based on multiple grouping sets. ROLLUP is a shorthand for GROUPING SETS . For example:
GROUP BY warehouse, product WITH ROLLUP or GROUP BY ROLLUP(warehouse, product) is equivalent to
GROUP BY GROUPING SETS((warehouse, product), (warehouse), ()) .
While GROUP BY ROLLUP(warehouse, product, (warehouse, location))

is equivalent to
GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ()) .
The N elements of a ROLLUP specification result in N+1 GROUPING SETS .
CUBE
The CUBE clause is used to perform aggregations based on a combination of grouping columns specified
in the GROUP BY clause. CUBE is a shorthand for GROUPING SETS . For example:
GROUP BY warehouse, product WITH CUBE or GROUP BY CUBE(warehouse, product) is equivalent to
GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()) .
While GROUP BY CUBE(warehouse, product, (warehouse, location))

is equivalent to
GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location),
(product, warehouse, location), (warehouse), (product), (warehouse, location), ())
.
The N elements of a CUBE specification result in 2^N GROUPING SETS .
aggregate_name
An aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
DISTINCT
Removes duplicates in input rows before they are passed to aggregate functions.
FILTER
Filters the input rows: only rows for which the boolean_expression in the WHERE clause evaluates to true
are passed to the aggregate function; other rows are discarded.

Mixed/Nested Grouping Analytics


A GROUP BY clause can include multiple group_expressions and multiple CUBE , ROLLUP , and GROUPING SETS s.
GROUPING SETS can also have nested CUBE , ROLLUP , or GROUPING SETS clauses. For example:
GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location)), GROUPING SETS(warehouse, GROUPING
SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))

and
CUBE and ROLLUP are just syntactic sugar for GROUPING SETS . Please refer to the sections above for how to translate
CUBE and ROLLUP to GROUPING SETS . group_expression can be treated as a single-group GROUPING SETS in this
context.
For multiple GROUPING SETS in the GROUP BY clause, Databricks Runtime generates a single GROUPING SETS by
doing a cross-product of the original GROUPING SETS .
For nested GROUPING SETS in the GROUPING SETS clause, Databricks Runtime simply takes its grouping sets and
strips them. For example:
GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())
and
GROUP BY warehouse, ROLLUP(product), CUBE(location, size)

are equivalent to
GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse,
product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size),
(warehouse))
.
While GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))

is equivalent to GROUP BY GROUPING SETS((warehouse), (warehouse, product)) .
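For illustration, a minimal sketch of the nesting rule above, using an assumed table sales(warehouse, product, amount); both queries compute the grouping sets (warehouse) and (warehouse, product):

> SELECT warehouse, product, sum(amount)
    FROM sales
   GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)));

> SELECT warehouse, product, sum(amount)
    FROM sales
   GROUP BY GROUPING SETS((warehouse), (warehouse, product));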

Examples
CREATE TEMP VIEW dealer (id, city, car_model, quantity) AS
VALUES (100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);

-- Sum of quantity per dealership. Group by `id`.


> SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
id sum(quantity)
--- -------------
100 32
200 33
300 13

-- Use column position in GROUP by clause.


> SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
id sum(quantity)
--- -------------
100 32
200 33
300 13

-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
> SELECT id, sum(quantity) AS sum, max(quantity) AS max
FROM dealer GROUP BY id ORDER BY id;
id sum max
--- --- ---
100 32 15
200 33 20
300 13 8

-- Count the number of distinct dealers in cities per car_model.


> SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
car_model count
------------ -----
Honda Civic 3
Honda CRV 2
Honda Accord 3

-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
> SELECT id,
sum(quantity) FILTER (WHERE car_model IN ('Honda Civic', 'Honda CRV')) AS `sum(quantity)`
FROM dealer
GROUP BY id ORDER BY id;
id sum(quantity)
--- -------------
100 17
200 23
300 5

-- Aggregations using multiple sets of grouping columns in a single statement.


-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. Empty grouping set. Returns quantities for all city and car models.
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city;
city car_model sum
--------- ------------ ---
null null 78
null Honda Accord 33
null Honda CRV 10
null Honda Civic 35
Dublin null 33
Dublin Honda Accord 10
Dublin Honda CRV 3
Dublin Honda Civic 20
Fremont null 32
Fremont Honda Accord 15
Fremont Honda CRV 7
Fremont Honda Civic 10
San Jose null 13
San Jose Honda Accord 8
San Jose Honda Civic 5

-- Group by processing with `ROLLUP` clause.


-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY city, car_model WITH ROLLUP
ORDER BY city, car_model;
city car_model sum
--------- ------------ ---
null null 78
Dublin null 33
Dublin Honda Accord 10
Dublin Honda CRV 3
Dublin Honda Civic 20
Fremont null 32
Fremont Honda Accord 15
Fremont Honda CRV 7
Fremont Honda Civic 10
San Jose null 13
San Jose Honda Accord 8
San Jose Honda Civic 5

-- Group by processing with `CUBE` clause.


-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY city, car_model WITH CUBE
ORDER BY city, car_model;
city car_model sum
--------- ------------ ---
null null 78
null Honda Accord 33
null Honda CRV 10
null Honda Civic 35
Dublin null 33
Dublin Honda Accord 10
Dublin Honda CRV 3
Dublin Honda Civic 20
Fremont null 32
Fremont Honda Accord 15
Fremont Honda CRV 7
Fremont Honda Civic 10
San Jose null 13
San Jose Honda Accord 8
San Jose Honda Civic 5

--Prepare data for ignore nulls example


> CREATE TEMP VIEW person (id, name, age) AS
VALUES (100, 'Mary', NULL),
(200, 'John', 30),
(300, 'Mike', 80),
(400, 'Dan' , 50);

--Select the first row in column age


> SELECT FIRST(age) FROM person;
first(age, false)
--------------------
NULL

--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
> SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
first(age, true) last(id, false) sum(id)
------------------- ------------------ ----------
30 400 1000

Related articles
QUALIFY
SELECT
HAVING clause
7/21/2022 • 2 minutes to read

Filters the results produced by GROUP BY based on the specified condition. Often used in conjunction with a
GROUP BY clause.

Syntax
HAVING boolean_expression

Parameters
boolean_expression
Any expression that evaluates to a result type BOOLEAN . Two or more expressions may be combined
together using logical operators such as AND or OR .
The expressions specified in the HAVING clause can only refer to:
Constant expressions
Expressions that appear in GROUP BY
Aggregate functions

Examples
> CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
> INSERT INTO dealer VALUES
(100, 'Fremont' , 'Honda Civic' , 10),
(100, 'Fremont' , 'Honda Accord', 15),
(100, 'Fremont' , 'Honda CRV' , 7),
(200, 'Dublin' , 'Honda Civic' , 20),
(200, 'Dublin' , 'Honda Accord', 10),
(200, 'Dublin' , 'Honda CRV' , 3),
(300, 'San Jose', 'Honda Civic' , 5),
(300, 'San Jose', 'Honda Accord', 8);

-- `HAVING` clause referring to column in `GROUP BY`.


> SELECT city, sum(quantity) AS sum FROM dealer GROUP BY city HAVING city = 'Fremont';
Fremont 32

-- `HAVING` clause referring to aggregate function.


> SELECT city, sum(quantity) AS sum FROM dealer GROUP BY city HAVING sum(quantity) > 15;
Dublin 33
Fremont 32

-- `HAVING` clause referring to aggregate function by its alias.


> SELECT city, sum(quantity) AS sum FROM dealer GROUP BY city HAVING sum > 15;
Dublin 33
Fremont 32

-- `HAVING` clause referring to a different aggregate function than what is present in
-- the `SELECT` list.
> SELECT city, sum(quantity) AS sum FROM dealer GROUP BY city HAVING max(quantity) > 15;
Dublin 33

-- `HAVING` clause referring to constant expression.


> SELECT city, sum(quantity) AS sum FROM dealer GROUP BY city HAVING 1 > 0 ORDER BY city;
Dublin 33
Fremont 32
San Jose 13

-- `HAVING` clause without a `GROUP BY` clause.


> SELECT sum(quantity) AS sum FROM dealer HAVING sum(quantity) > 10;
78

Related articles
GROUP BY
QUALIFY
SELECT
Hints
7/21/2022 • 4 minutes to read

Suggest specific approaches to generate an execution plan.

Syntax
/*+ hint [, ...] */

Partitioning hints
Partitioning hints allow you to suggest a partitioning strategy that Databricks Runtime should follow. COALESCE ,
REPARTITION , and REPARTITION_BY_RANGE hints are supported and are equivalent to coalesce , repartition , and
repartitionByRange Dataset APIs, respectively. These hints give you a way to tune performance and control the
number of output files. When multiple partitioning hints are specified, multiple nodes are inserted into the
logical plan, but the leftmost hint is picked by the optimizer.
Partitioning hint types
COALESCE
Reduce the number of partitions to the specified number of partitions. It takes a partition number as a
parameter.
REPARTITION
Repartition to the specified number of partitions using the specified partitioning expressions. It takes a
partition number, column names, or both as parameters.
REPARTITION_BY_RANGE
Repartition to the specified number of partitions using the specified partitioning expressions. It takes
column names and an optional partition number as parameters.
REBALANCE
The REBALANCE hint can be used to rebalance the query result output partitions, so that every partition is
of a reasonable size (not too small and not too big). It can take column names as parameters, and tries its
best to partition the query result by these columns. This is a best-effort operation: if there are skews, Spark
splits the skewed partitions so that they do not become too big. This hint is useful when you need to write the
result of this query to a table and want to avoid files that are too small or too big. This hint is ignored if
adaptive query execution (AQE) is not enabled.
Examples
> SELECT /*+ COALESCE(3) */ * FROM t;

> SELECT /*+ REPARTITION(3) */ * FROM t;

> SELECT /*+ REPARTITION(c) */ * FROM t;

> SELECT /*+ REPARTITION(3, c) */ * FROM t;

> SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;

> SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;

> SELECT /*+ REBALANCE */ * FROM t;

> SELECT /*+ REBALANCE(c) */ * FROM t;

-- multiple partitioning hints


> EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;
== Parsed Logical Plan ==
'UnresolvedHint REPARTITION, [100]
+- 'UnresolvedHint COALESCE, [500]
+- 'UnresolvedHint REPARTITION_BY_RANGE, [3, 'c]
+- 'Project [*]
+- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==


name: string, c: int
Repartition 100, true
+- Repartition 500, false
+- RepartitionByExpression [c#30 ASC NULLS FIRST], 3
+- Project [name#29, c#30]
+- SubqueryAlias spark_catalog.default.t
+- Relation[name#29,c#30] parquet

== Optimized Logical Plan ==


Repartition 100, true
+- Relation[name#29,c#30] parquet

== Physical Plan ==
Exchange RoundRobinPartitioning(100), false, [id=#121]
+- *(1) ColumnarToRow
+- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet,
Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
PushedFilters: [], ReadSchema: struct<name:string>

Join hints
Join hints allow you to suggest the join strategy that Databricks Runtime should use. When different join
strategy hints are specified on both sides of a join, Databricks Runtime prioritizes hints in the following order:
BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL . When both sides are specified with the
BROADCAST hint or the SHUFFLE_HASH hint, Databricks Runtime picks the build side based on the join type and the
sizes of the relations. Since a given strategy may not support all join types, Databricks Runtime is not
guaranteed to use the join strategy suggested by the hint.
Join hint types
BROADCAST
Use broadcast join. The join side with the hint is broadcast regardless of autoBroadcastJoinThreshold . If
both sides of the join have the broadcast hints, the one with the smaller size (based on stats) is broadcast.
The aliases for BROADCAST are BROADCASTJOIN and MAPJOIN .
MERGE
Use shuffle sort merge join. The aliases for MERGE are SHUFFLE_MERGE and MERGEJOIN .
SHUFFLE_HASH
Use shuffle hash join. If both sides have the shuffle hash hints, Databricks Runtime chooses the smaller
side (based on stats) as the build side.
SHUFFLE_REPLICATE_NL
Use shuffle-and-replicate nested loop join.
Examples

-- Join Hints for broadcast join


> SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
> SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;
> SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle sort merge join


> SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
> SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
> SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle hash join


> SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle-and-replicate nested loop join


> SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- When different join strategy hints are specified on both sides of a join, Databricks Runtime
-- prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint
-- over the SHUFFLE_REPLICATE_NL hint.
-- Databricks Runtime will issue Warning in the following example
-- org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge)
-- is overridden by another hint and will not take effect.
SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

Skew hints
(Delta Lake on Azure Databricks) See Skew join optimization for information about the SKEW hint.

Related statements
SELECT
JOIN
7/21/2022 • 4 minutes to read

Combines rows from two relations based on join criteria.

Syntax
relation { [ join_type ] JOIN relation join_criteria |
NATURAL join_type JOIN relation |
CROSS JOIN relation }

relation
{ table_name [ table_alias ] |
view_name [ table_alias ] |
[ LATERAL ] ( query ) [ table_alias ] |
( JOIN clause ) [ table_alias ] |
VALUES clause |
[ LATERAL ] table_valued_function [ table_alias ] }

join_type
{ [ INNER ] |
LEFT [ OUTER ] |
[ LEFT ] SEMI |
RIGHT [ OUTER ] |
FULL [ OUTER ] |
[ LEFT ] ANTI |
CROSS }

join_criteria
{ ON boolean_expression |
USING ( column_name [, ...] ) }

Parameters
relation
The relations to be joined.
table_name
A reference to a table, view, or common table expression (CTE).
view_name
A reference to a view, or common table expression (CTE).
[ LATERAL ] ( query )
Any nested query. A query prefixed by LATERAL (Since: Databricks Runtime 9.0) may reference
columns exposed by preceding from_item s in the same FROM clause. Such a construct is called a
correlated or dependent join. A correlated join cannot be a RIGHT OUTER JOIN or a
FULL OUTER JOIN .

( JOIN clause )
A nested invocation of a JOIN.
VALUES clause
A clause that produces an inline temporary table.
[ LATERAL ] table_valued_function
An invocation of a table function.
join_type
The join type.
[ INNER ]
Returns rows that have matching values in both relations. The default join.
LEFT [ OUTER ]
Returns all values from the left relation and the matched values from the right relation, or appends
NULL if there is no match. It is also referred to as a left outer join.

RIGHT [ OUTER ]
Returns all values from the right relation and the matched values from the left relation, or appends
NULL if there is no match. It is also referred to as a right outer join.

FULL [OUTER]
Returns all values from both relations, appending NULL values on the side that does not have a
match. It is also referred to as a full outer join.
[ LEFT ] SEMI
Returns values from the left relation that have a match in the right relation. It is also referred to
as a left semi join.
[ LEFT ] ANTI
Returns values from the left relation that have no match in the right relation. It is also referred to as a left
anti join.
CROSS JOIN
Returns the Cartesian product of two relations.
NATURAL
Specifies that the rows from the two relations will implicitly be matched on equality for all columns with
matching names.
join_criteria
Specifies how the rows from one relation are combined with the rows of another relation.
ON boolean_expression
An expression with a return type of BOOLEAN which specifies how rows from the two relations are
matched. If the result is true the rows are considered a match.
USING ( column_name [, …] )
Matches rows by comparing equality for list of columns column_name which must exist in both
relations.
USING (c1, c2) is a synonym for ON rel1.c1 = rel2.c1 AND rel1.c2 = rel2.c2 .
table_alias
A temporary name with an optional column identifier list.

Notes
When you specify USING or NATURAL , SELECT * will only show one occurrence for each of the columns used to
match.
If you omit the join_criteria , the semantics of any join_type become that of a CROSS JOIN .
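A minimal sketch of the USING and NATURAL behavior described in these notes, assuming two relations r1(c1, x) and r2(c1, y):

-- With USING, SELECT * returns the join column c1 only once, followed by x and y.
> SELECT * FROM r1 JOIN r2 USING (c1);

-- Equivalent ON form; here SELECT * returns both r1.c1 and r2.c1.
> SELECT * FROM r1 JOIN r2 ON r1.c1 = r2.c1;

-- NATURAL matches implicitly on all columns with the same name (only c1 here).
> SELECT * FROM r1 NATURAL JOIN r2;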

Examples
-- Use employee and department tables to demonstrate different type of joins.
> CREATE TEMP VIEW employee(id, name, deptno) AS
VALUES(105, 'Chloe', 5),
(103, 'Paul' , 3),
(101, 'John' , 1),
(102, 'Lisa' , 2),
(104, 'Evan' , 4),
(106, 'Amy' , 6);

> CREATE TEMP VIEW department(deptno, deptname) AS


VALUES(3, 'Engineering'),
(2, 'Sales' ),
(1, 'Marketing' );

-- Use employee and department tables to demonstrate inner join.


> SELECT id, name, employee.deptno, deptname
FROM employee
INNER JOIN department ON employee.deptno = department.deptno;
103 Paul 3 Engineering
101 John 1 Marketing
102 Lisa 2 Sales

-- Use employee and department tables to demonstrate left join.


> SELECT id, name, employee.deptno, deptname
FROM employee
LEFT JOIN department ON employee.deptno = department.deptno;
105 Chloe 5 NULL
103 Paul 3 Engineering
101 John 1 Marketing
102 Lisa 2 Sales
104 Evan 4 NULL
106 Amy 6 NULL

-- Use employee and department tables to demonstrate right join.


> SELECT id, name, employee.deptno, deptname
FROM employee
RIGHT JOIN department ON employee.deptno = department.deptno;
103 Paul 3 Engineering
101 John 1 Marketing
102 Lisa 2 Sales

-- Use employee and department tables to demonstrate full join.


> SELECT id, name, employee.deptno, deptname
FROM employee
FULL JOIN department ON employee.deptno = department.deptno;
101 John 1 Marketing
106 Amy 6 NULL
103 Paul 3 Engineering
105 Chloe 5 NULL
104 Evan 4 NULL
102 Lisa 2 Sales

-- Use employee and department tables to demonstrate cross join.


> SELECT id, name, employee.deptno, deptname
FROM employee
CROSS JOIN department;
105 Chloe 5 Engineering
105 Chloe 5 Marketing
105 Chloe 5 Sales
103 Paul 3 Engineering
103 Paul 3 Marketing
103 Paul 3 Sales
101 John 1 Engineering
101 John 1 Marketing
101 John 1 Sales
102 Lisa 2 Engineering
102 Lisa 2 Marketing
102 Lisa 2 Sales
104 Evan 4 Engineering
104 Evan 4 Marketing
104 Evan 4 Sales
106 Amy 6 Engineering
106 Amy 6 Marketing
106 Amy 6 Sales

-- Use employee and department tables to demonstrate semi join.


> SELECT *
FROM employee
SEMI JOIN department ON employee.deptno = department.deptno;
103 Paul 3
101 John 1
102 Lisa 2

-- Use employee and department tables to demonstrate anti join.


> SELECT *
FROM employee
ANTI JOIN department ON employee.deptno = department.deptno;
105 Chloe 5
104 Evan 4
106 Amy 6

-- Use employee and department tables to demonstrate lateral inner join.


> SELECT id, name, deptno, deptname
FROM employee
JOIN LATERAL (SELECT deptname
FROM department
WHERE employee.deptno = department.deptno);
103 Paul 3 Engineering
101 John 1 Marketing
102 Lisa 2 Sales

-- Use employee and department tables to demonstrate lateral left join.


> SELECT id, name, deptno, deptname
FROM employee
LEFT JOIN LATERAL (SELECT deptname
FROM department
WHERE employee.deptno = department.deptno);
105 Chloe 5 NULL
103 Paul 3 Engineering
101 John 1 Marketing
102 Lisa 2 Sales
104 Evan 4 NULL
106 Amy 6 NULL

Related articles
SELECT
LATERAL VIEW clause
7/21/2022 • 2 minutes to read

Used in conjunction with generator functions such as EXPLODE , which generates a virtual table containing one
or more rows. LATERAL VIEW applies the rows to each original output row.

Syntax
LATERAL VIEW [ OUTER ] generator_function ( expression [, ...] ) [ table_identifier ] AS column_identifier
[, ...]

Parameters
OUTER
If OUTER is specified, returns null if an input array/map is empty or null.
generator_function
A generator function (EXPLODE, INLINE, etc.).
table_identifier
The alias for generator_function , which is optional.
column_identifier
Lists the column aliases of generator_function , which may be used in output rows. The number of
column identifiers must match the number of columns returned by the generator function.

Examples
> CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
> INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');

> SELECT * FROM person


LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age;
id name age class address c_age d_age
------ ------- ------- -------- ----------- -------- --------
100 John 30 1 Street 1 30 40
100 John 30 1 Street 1 30 80
100 John 30 1 Street 1 60 40
100 John 30 1 Street 1 60 80
200 Mary NULL 1 Street 2 30 40
200 Mary NULL 1 Street 2 30 80
200 Mary NULL 1 Street 2 60 40
200 Mary NULL 1 Street 2 60 80
300 Mike 80 3 Street 3 30 40
300 Mike 80 3 Street 3 30 80
300 Mike 80 3 Street 3 60 40
300 Mike 80 3 Street 3 60 80
400 Dan 50 4 Street 4 30 40
400 Dan 50 4 Street 4 30 80
400 Dan 50 4 Street 4 60 40
400 Dan 50 4 Street 4 60 80

> SELECT c_age, COUNT(1) FROM person


LATERAL VIEW EXPLODE(ARRAY(30, 60)) AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age
GROUP BY c_age;
c_age count(1)
-------- -----------
60 8
30 8

SELECT * FROM person


LATERAL VIEW EXPLODE(ARRAY()) tableName AS c_age;
id name age class address c_age
----- ------- ------ -------- ---------- --------

> SELECT * FROM person


LATERAL VIEW OUTER EXPLODE(ARRAY()) tableName AS c_age;
id name age class address c_age
------ ------- ------- -------- ----------- --------
100 John 30 1 Street 1 NULL
200 Mary NULL 1 Street 2 NULL
300 Mike 80 3 Street 3 NULL
400 Dan 50 4 Street 4 NULL

Related articles
SELECT
LIMIT clause
7/21/2022 • 2 minutes to read

Constrains the number of rows returned by the Query. In general, this clause is used in conjunction with ORDER
BY to ensure that the results are deterministic.

Syntax
LIMIT { ALL | integer_expression }

Parameters
ALL
If specified, the query returns all the rows. In other words, no limit is applied if this option is specified.
integer_expression
A literal expression that returns an integer.

Examples
> CREATE TEMP VIEW person (name, age)
AS VALUES ('Zen Hui', 25),
('Anil B' , 18),
('Shone S', 16),
('Mike A' , 25),
('John A' , 18),
('Jack N' , 16);

-- Select the first two rows.


> SELECT name, age FROM person ORDER BY name LIMIT 2;
Anil B 18
Jack N 16

-- Specifying ALL option on LIMIT returns all the rows.


> SELECT name, age FROM person ORDER BY name LIMIT ALL;
Anil B 18
Jack N 16
John A 18
Mike A 25
Shone S 16
Zen Hui 25

-- A function expression as an input to LIMIT.


> SELECT name, age FROM person ORDER BY name LIMIT length('SPARK');
Anil B 18
Jack N 16
John A 18
Mike A 25
Shone S 16

-- A non-literal expression as an input to LIMIT is not allowed.


SELECT name, age FROM person ORDER BY name LIMIT length(name);
Error: The limit expression must evaluate to a constant value
Related articles
Query
ORDER BY clause
7/21/2022 • 2 minutes to read

Returns the result rows in a sorted manner in the user specified order. Unlike the SORT BY clause, this clause
guarantees a total order in the output.

Syntax
ORDER BY { expression [ sort_direction ] [ nulls_sort_order ] } [, ...]

sort_direction
[ ASC | DESC ]

nulls_sort_order
[ NULLS FIRST | NULLS LAST ]

Parameters
expression
An expression of any type used to establish an order in which results are returned.
If the expression is a literal INT value, it is interpreted as a column position in the select list.
sort_direction
Specifies the sort order for the order by expression.
ASC : The sort direction for this expression is ascending.
DESC : The sort order for this expression is descending.

If sort direction is not explicitly specified, then by default rows are sorted ascending.
nulls_sort_order
Optionally specifies whether NULL values are returned before/after non-NULL values. If nulls_sort_order
is not specified, then NULLs sort first if the sort order is ASC and NULLs sort last if the sort order is DESC .
NULLS FIRST : NULL values are returned first regardless of the sort order.
NULLS LAST : NULL values are returned last regardless of the sort order.

When specifying more than one expression sorting occurs left to right. All rows are sorted by the first
expression. If there are duplicate values for the first expression the second expression is used to resolve order
within the group of duplicates and so on. The resulting order is not deterministic if there are duplicate values
across all order by expressions.

Examples
> CREATE TABLE person (id INT, name STRING, age INT);
> INSERT INTO person VALUES
(100, 'John' , 30),
(200, 'Mary' , NULL),
(300, 'Mike' , 80),
(400, 'Jerry', NULL),
(500, 'Dan' , 50);

-- Sort rows by age. By default rows are sorted in ascending manner with NULLS FIRST.
> SELECT name, age FROM person ORDER BY age;
Jerry NULL
Mary NULL
John 30
Dan 50
Mike 80

-- Sort rows in ascending manner keeping null values to be last.


> SELECT name, age FROM person ORDER BY age NULLS LAST;
John 30
Dan 50
Mike 80
Mary NULL
Jerry NULL

-- Sort rows by age in descending manner, which defaults to NULLS LAST.


> SELECT name, age FROM person ORDER BY age DESC;
Mike 80
Dan 50
John 30
Jerry NULL
Mary NULL

-- Sort rows in ascending manner keeping null values to be first.


> SELECT name, age FROM person ORDER BY age DESC NULLS FIRST;
Jerry NULL
Mary NULL
Mike 80
Dan 50
John 30

-- Sort rows based on more than one column with each column having different
-- sort direction.
> SELECT * FROM person ORDER BY name ASC, age DESC;
500 Dan 50
400 Jerry NULL
100 John 30
200 Mary NULL
300 Mike 80

Related articles
Query
SORT BY
Window functions
PIVOT clause
7/21/2022 • 4 minutes to read

Transforms the intermediate result set of the FROM clause by rotating unique values of a specified column list
into separate columns.

Syntax
PIVOT ( { aggregate_expression [ [ AS ] agg_column_alias ] } [, ...]
FOR column_list IN ( expression_list ) )

column_list
{ column_name |
( column_name [, ...] ) }

expression_list
{ expression [ AS ] [ column_alias ] |
{ ( expression [, ...] ) [ AS ] [ column_alias ] } [, ...] }

Parameters
aggregate_expression
An expression of any type where all column references to the FROM clause are arguments to aggregate
functions.
agg_column_alias
An optional alias for the result of the aggregation. If no alias is specified, PIVOT generates an alias based
on aggregate_expression .
column_list
The set of columns to be rotated.
column_name
A column from the FROM clause.
expression_list
Maps values from column_list to column aliases.
expression
A literal expression with a type that shares a least common type with the respective column_name .
The number of expressions in each tuple must match the number of column_names in
column_list .

column_alias
An optional alias specifying the name of the generated column. If no alias is specified PIVOT
generates an alias based on the expression s.
Result
A temporary table of the following form:
All the columns from the intermediate result set of the FROM clause that have not been specified in any
aggregate_expression or column_list .

These columns are grouping columns.


For each expression tuple and aggregate_expression combination, PIVOT generates one column. The
type is the type of aggregate_expression .
If there is only one aggregate_expression the column is named using column_alias . Otherwise it is
named column_alias_agg_column_alias .
The value in each cell is the result of the aggregation_expression using a
FILTER ( WHERE column_list IN ( expression, ... ) ) .

Examples
-- A very basic PIVOT
-- Given a table with sales by quarter, return a table that returns sales across quarters per year.
> CREATE TEMP VIEW sales(year, quarter, region, sales) AS
VALUES (2018, 1, 'east', 100),
(2018, 2, 'east', 20),
(2018, 3, 'east', 40),
(2018, 4, 'east', 40),
(2019, 1, 'east', 120),
(2019, 2, 'east', 110),
(2019, 3, 'east', 80),
(2019, 4, 'east', 60),
(2018, 1, 'west', 105),
(2018, 2, 'west', 25),
(2018, 3, 'west', 45),
(2018, 4, 'west', 45),
(2019, 1, 'west', 125),
(2019, 2, 'west', 115),
(2019, 3, 'west', 85),
(2019, 4, 'west', 65);

> SELECT year, region, q1, q2, q3, q4


FROM sales
PIVOT (sum(sales) AS sales
FOR quarter
IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));
2018 east 100 20 40 40
2019 east 120 110 80 60
2018 west 105 25 45 45
2019 west 125 115 85 65

-- The same query written without PIVOT


> SELECT year, region,
sum(sales) FILTER(WHERE quarter = 1) AS q1,
sum(sales) FILTER(WHERE quarter = 2) AS q2,
sum(sales) FILTER(WHERE quarter = 3) AS q3,
sum(sales) FILTER(WHERE quarter = 4) AS q4
FROM sales
GROUP BY year, region;
2018 east 100 20 40 40
2019 east 120 110 80 60
2018 west 105 25 45 45
2019 west 125 115 85 65

-- Also PIVOT on region


> SELECT year, q1_east, q1_west, q2_east, q2_west, q3_east, q3_west, q4_east, q4_west
FROM sales
PIVOT (sum(sales) AS sales
FOR (quarter, region)
IN ((1, 'east') AS q1_east, (1, 'west') AS q1_west, (2, 'east') AS q2_east, (2, 'west') AS q2_west,
(3, 'east') AS q3_east, (3, 'west') AS q3_west, (4, 'east') AS q4_east, (4, 'west') AS q4_west));
2018 100 105 20 25 40 45 40 45
2019 120 125 110 115 80 85 60 65

-- The same query written without PIVOT


> SELECT year,
sum(sales) FILTER(WHERE (quarter, region) = (1, 'east')) AS q1_east,
sum(sales) FILTER(WHERE (quarter, region) = (1, 'west')) AS q1_west,
sum(sales) FILTER(WHERE (quarter, region) = (2, 'east')) AS q2_east,
sum(sales) FILTER(WHERE (quarter, region) = (2, 'west')) AS q2_west,
sum(sales) FILTER(WHERE (quarter, region) = (3, 'east')) AS q3_east,
sum(sales) FILTER(WHERE (quarter, region) = (3, 'west')) AS q3_west,
sum(sales) FILTER(WHERE (quarter, region) = (4, 'east')) AS q4_east,
sum(sales) FILTER(WHERE (quarter, region) = (4, 'west')) AS q4_west
FROM sales
GROUP BY year;
2018 100 105 20 25 40 45 40 45
2019 120 125 110 115 80 85 60 65

-- To aggregate across regions the column must be removed from the input.
> SELECT year, q1, q2, q3, q4
FROM (SELECT year, quarter, sales FROM sales) AS s
PIVOT (sum(sales) AS sales
FOR quarter
IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));
2018 205 45 85 85
2019 245 225 165 125

-- The same query without PIVOT


> SELECT year,
sum(sales) FILTER(WHERE quarter = 1) AS q1,
sum(sales) FILTER(WHERE quarter = 2) AS q2,
sum(sales) FILTER(WHERE quarter = 3) AS q3,
sum(sales) FILTER(WHERE quarter = 4) AS q4
FROM sales
GROUP BY year;

-- A PIVOT with multiple aggregations


> SELECT year, q1_total, q1_avg, q2_total, q2_avg, q3_total, q3_avg, q4_total, q4_avg
FROM (SELECT year, quarter, sales FROM sales) AS s
PIVOT (sum(sales) AS total, avg(sales) AS avg
FOR quarter
IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));
2018 205 102.5 45 22.5 85 42.5 85 42.5
2019 245 122.5 225 112.5 165 82.5 125 62.5

-- The same query without PIVOT


> SELECT year,
sum(sales) FILTER(WHERE quarter = 1) AS q1_total,
avg(sales) FILTER(WHERE quarter = 1) AS q1_avg,
sum(sales) FILTER(WHERE quarter = 2) AS q2_total,
avg(sales) FILTER(WHERE quarter = 2) AS q2_avg,
sum(sales) FILTER(WHERE quarter = 3) AS q3_total,
avg(sales) FILTER(WHERE quarter = 3) AS q3_avg,
sum(sales) FILTER(WHERE quarter = 4) AS q4_total,
avg(sales) FILTER(WHERE quarter = 4) AS q4_avg
FROM sales
GROUP BY year;

> CREATE TEMP VIEW person (id, name, age, class, address) AS
VALUES (100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');

Related articles
SELECT
Aggregate functions
QUALIFY clause
7/21/2022 • 2 minutes to read

Filters the results of window functions. To use QUALIFY , at least one window function is required to be present in
the SELECT list or the QUALIFY clause.

NOTE
Available in Databricks Runtime 10.0 and above.

Syntax
QUALIFY boolean_expression

Parameters
boolean_expression
Any expression that evaluates to a result type boolean . Two or more expressions may be combined
together using the logical operators ( AND, OR).
The expressions specified in the QUALIFY clause cannot contain aggregate functions.

Examples
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
(100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);

-- QUALIFY with window functions in the SELECT list.


> SELECT
city,
car_model,
RANK() OVER (PARTITION BY car_model ORDER BY quantity) AS rank
FROM dealer
QUALIFY rank = 1;
city car_model rank
-------- ------------ ----
San Jose Honda Accord 1
Dublin Honda CRV 1
San Jose Honda Civic 1

-- QUALIFY with window functions in the QUALIFY clause.


SELECT city, car_model
FROM dealer
QUALIFY RANK() OVER (PARTITION BY car_model ORDER BY quantity) = 1;
city car_model
-------- ------------
San Jose Honda Accord
Dublin Honda CRV
San Jose Honda Civic

Related statements
SELECT
WHERE clause
GROUP BY clause
ORDER BY clause
SORT BY clause
CLUSTER BY clause
DISTRIBUTE BY clause
LIMIT clause
PIVOT clause
LATERAL VIEW clause
TABLESAMPLE clause
7/21/2022 • 2 minutes to read

The TABLESAMPLE statement is used to sample the relation.

Syntax
TABLESAMPLE ( { percentage PERCENT |
num_rows ROWS |
BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]

Parameters
percentage PERCENT

An INTEGER or DECIMAL constant percentage between 0 and 100 specifying which percentage of the
table’s rows to sample.
num_rows ROWS

A constant positive INTEGER expression num_rows specifying an absolute number of rows out of all rows
to sample.
BUCKET fraction OUT OF total

An INTEGER constant fraction specifying the portion out of the INTEGER constant total to sample.
REPEATABLE ( seed )

Since: Databricks Runtime 11.0


An optional positive INTEGER constant seed , used to always produce the same set of rows. Use this
clause when you want to reissue the query multiple times, and you expect the same set of sampled rows.

NOTE
TABLESAMPLE returns the approximate number of rows or fraction requested.
Always use TABLESAMPLE (percent PERCENT) if randomness is important. TABLESAMPLE (num_rows ROWS) is not a
simple random sample but instead is implemented using LIMIT .

Examples
> CREATE TEMPORARY VIEW test(id, name) AS
VALUES ( 1, 'Lisa'),
( 2, 'Mary'),
( 3, 'Evan'),
( 4, 'Fred'),
( 5, 'Alex'),
( 6, 'Mark'),
( 7, 'Lily'),
( 8, 'Lucy'),
( 9, 'Eric'),
(10, 'Adam');
> SELECT * FROM test;
5 Alex
8 Lucy
2 Mary
4 Fred
1 Lisa
9 Eric
10 Adam
6 Mark
7 Lily
3 Evan

> SELECT * FROM test TABLESAMPLE (30 PERCENT) REPEATABLE (123);


1 Lisa
2 Mary
3 Evan
5 Alex
8 Lucy

> SELECT * FROM test TABLESAMPLE (5 ROWS);


5 Alex
8 Lucy
2 Mary
4 Fred
1 Lisa

> SELECT * FROM test TABLESAMPLE (BUCKET 4 OUT OF 10);


8 Lucy
2 Mary
9 Eric
6 Mark

Related articles
SELECT
Set operators
7/21/2022 • 2 minutes to read

Combines two subqueries into a single one. Databricks Runtime supports three types of set operators:
EXCEPT
INTERSECT
UNION

Syntax
subquery1 { { UNION [ ALL | DISTINCT ] |
INTERSECT [ ALL | DISTINCT ] |
EXCEPT [ ALL | DISTINCT ] } subquery2 } [...]

subquery1, subquery2
Any two subquery clauses as specified in SELECT. Both subqueries must have the same number of
columns and share a least common type for each respective column.
UNION [ALL | DISTINCT]
Returns the result of subquery1 plus the rows of subquery2 .
If ALL is specified duplicate rows are preserved.
If DISTINCT is specified the result does not contain any duplicate rows. This is the default.
INTERSECT [ALL | DISTINCT]
Returns the set of rows which are in both subqueries.
If ALL is specified, a row that appears multiple times in subquery1 as well as in subquery2 will be
returned multiple times.
If DISTINCT is specified the result does not contain duplicate rows. This is the default.
EXCEPT [ALL | DISTINCT ]
Returns the rows in subquery1 which are not in subquery2 .
If ALL is specified, each row in subquery2 will remove exactly one of possibly multiple matches from
subquery1 .

If DISTINCT is specified, duplicate rows are removed from subquery1 before applying the operation, so
all matches are removed and the result will have no duplicate rows (matched or unmatched). This is the
default.
You can specify MINUS as a syntax alternative for EXCEPT .
When chaining set operations INTERSECT has a higher precedence than UNION and EXCEPT .
The type of each result column is the least common type of the respective columns in subquery1 and subquery2 .
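As a minimal sketch of the precedence rule above, using assumed single-column relations t1, t2, and t3: because INTERSECT binds more tightly than UNION, the first query is evaluated like the second.

> SELECT c FROM t1 UNION SELECT c FROM t2 INTERSECT SELECT c FROM t3;

> SELECT c FROM t1 UNION (SELECT c FROM t2 INTERSECT SELECT c FROM t3);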
Examples
-- Use number1 and number2 tables to demonstrate set operators in this page.
> CREATE TEMPORARY VIEW number1(c) AS VALUES (3), (1), (2), (2), (3), (4);

> CREATE TEMPORARY VIEW number2(c) AS VALUES (5), (1), (1), (2);

> SELECT c FROM number1 EXCEPT SELECT c FROM number2;


3
4

> SELECT c FROM number1 MINUS SELECT c FROM number2;


3
4

> SELECT c FROM number1 EXCEPT ALL (SELECT c FROM number2);


3
3
4

> SELECT c FROM number1 MINUS ALL (SELECT c FROM number2);


3
3
4

> (SELECT c FROM number1) INTERSECT (SELECT c FROM number2);


1
2

> (SELECT c FROM number1) INTERSECT DISTINCT (SELECT c FROM number2);


1
2

> (SELECT c FROM number1) INTERSECT ALL (SELECT c FROM number2);


1
2
2

> (SELECT c FROM number1) UNION (SELECT c FROM number2);


1
3
5
4
2

> (SELECT c FROM number1) UNION DISTINCT (SELECT c FROM number2);


1
3
5
4
2

> SELECT c FROM number1 UNION ALL (SELECT c FROM number2);


3
1
2
2
3
4
5
1
2
2

Related articles
SELECT
SORT BY clause
7/21/2022 • 3 minutes to read

Returns the result rows sorted within each partition in the user specified order. When there is more than one
partition, SORT BY may return a result that is only partially ordered. This is different from the ORDER BY clause,
which guarantees a total order of the output.

Syntax
SORT BY { expression [ sort_direction ] [ nulls_sort_order ] } [, ...]

sort_direction
[ ASC | DESC ]

nulls_sort_order
[ NULLS FIRST | NULLS LAST ]

Parameters
expression
An expression of any type used to establish a partition local order in which results are returned.
If the expression is a literal INT value it is interpreted as a column position in the select list.
sort_direction
Specifies the sort order for the sort by expression.
ASC : The sort direction for this expression is ascending.
DESC : The sort order for this expression is descending.
If sort direction is not explicitly specified, then by default rows are sorted ascending.
nulls_sort_order
Optionally specifies whether NULL values are returned before/after non-NULL values. If nulls_sort_order
is not specified, then NULLs sort first if the sort order is ASC and NULLs sort last if the sort order is DESC .
NULLS FIRST : NULL values are returned first regardless of the sort order.
NULLS LAST : NULL values are returned last regardless of the sort order.

When specifying more than one expression sorting occurs left to right. All rows within the partition are sorted
by the first expression. If there are duplicate values for the first expression the second expression is used to
resolve order within the group of duplicates and so on. The resulting order is not deterministic if there are
duplicate values across all order by expressions.

Examples
> CREATE TEMP VIEW person (zip_code, name, age)
AS VALUES (94588, 'Zen Hui', 50),
(94588, 'Dan Li', 18),
(94588, 'Anil K', 27),
(94588, 'John V', NULL),
(94511, 'David K', 42),
(94511, 'Aryan B.', 18),
(94511, 'Lalit B.', NULL);

-- Use the `REPARTITION` hint to partition the data by `zip_code` to
-- examine the `SORT BY` behavior. This is used in the rest of the
-- examples.

-- Sort rows by `name` within each partition in ascending manner


> SELECT /*+ REPARTITION(zip_code) */ name, age, zip_code FROM person
SORT BY name;
Anil K 27 94588
Dan Li 18 94588
John V NULL 94588
Zen Hui 50 94588
Aryan B. 18 94511
David K 42 94511
Lalit B. NULL 94511

-- Sort rows within each partition using column position.


> SELECT /*+ REPARTITION(zip_code) */ name, age, zip_code FROM person
SORT BY 1;
Anil K 27 94588
Dan Li 18 94588
John V null 94588
Zen Hui 50 94588
Aryan B. 18 94511
David K 42 94511
Lalit B. null 94511

-- Sort rows within partition in ascending manner keeping null values to be last.
> SELECT /*+ REPARTITION(zip_code) */ age, name, zip_code FROM person
SORT BY age NULLS LAST;
18 Dan Li 94588
27 Anil K 94588
50 Zen Hui 94588
NULL John V 94588
18 Aryan B. 94511
42 David K 94511
NULL Lalit B. 94511

-- Sort rows by age within each partition in descending manner, which defaults to NULL LAST.
> SELECT /*+ REPARTITION(zip_code) */ age, name, zip_code FROM person
SORT BY age DESC;
50 Zen Hui 94588
27 Anil K 94588
18 Dan Li 94588
NULL John V 94588
42 David K 94511
18 Aryan B. 94511
NULL Lalit B. 94511

-- Sort rows by age within each partition in descending manner keeping null values to be first.
> SELECT /*+ REPARTITION(zip_code) */ age, name, zip_code FROM person
SORT BY age DESC NULLS FIRST;
NULL John V 94588
50 Zen Hui 94588
27 Anil K 94588
18 Dan Li 94588
NULL Lalit B. 94511
42 David K 94511
18 Aryan B. 94511

-- Sort rows within each partition based on more than one column with each column having
-- different sort direction.
> SELECT /*+ REPARTITION(zip_code) */ name, age, zip_code FROM person
SORT BY name ASC, age DESC;
Anil K 27 94588
Dan Li 18 94588
John V null 94588
Zen Hui 50 94588
Aryan B. 18 94511
David K 42 94511
Lalit B. null 94511

Related articles
Query
Table-valued function (TVF)
7/21/2022 • 3 minutes to read

A function that returns a relation or a set of rows. There are two types of TVFs:
Specified in a FROM clause, for example, range .
Specified in SELECT and LATERAL VIEW clauses, for example, explode .

Syntax
function_name ( expression [, ...] ) [ table_alias ]

Parameters
expression
A combination of one or more values, operators, and SQL functions that results in a value.
table_alias
An optional label to reference the function result and its columns.

Supported table-valued functions


TVFs that can be specified in FROM clauses

FUNCTION                                   ARGUMENT TYPE(S)       DESCRIPTION

range (end)                                Long                   Creates a table with a single LongType column named id, containing
                                                                   rows in a range from 0 to end (exclusive) with step value 1.

range (start, end)                         Long, Long             Creates a table with a single LongType column named id, containing
                                                                   rows in a range from start to end (exclusive) with step value 1.

range (start, end, step)                   Long, Long, Long       Creates a table with a single LongType column named id, containing
                                                                   rows in a range from start to end (exclusive) with step value.

range (start, end, step, numPartitions)    Long, Long, Long, Int  Creates a table with a single LongType column named id, containing
                                                                   rows in a range from start to end (exclusive) with step value, with
                                                                   partition number numPartitions specified.

TVFs that can be specified in SELECT and LATERAL VIEW clauses


FUNCTION                                   ARGUMENT TYPE(S)       DESCRIPTION

explode (expr)                             Array/Map              Separates the elements of array expr into multiple rows, or the
                                                                   elements of map expr into multiple rows and columns. Unless
                                                                   specified otherwise, uses the default column name col for elements
                                                                   of the array or key and value for the elements of the map.

explode_outer (expr)                       Array/Map              Separates the elements of array expr into multiple rows, or the
                                                                   elements of map expr into multiple rows and columns. Unless
                                                                   specified otherwise, uses the default column name col for elements
                                                                   of the array or key and value for the elements of the map.

inline (expr)                              Expression             Explodes an array of structs into a table. Uses column names col1,
                                                                   col2, etc. by default unless specified otherwise.

inline_outer (expr)                        Expression             Explodes an array of structs into a table. Uses column names col1,
                                                                   col2, etc. by default unless specified otherwise.

posexplode (expr)                          Array/Map              Separates the elements of array expr into multiple rows with
                                                                   positions, or the elements of map expr into multiple rows and
                                                                   columns with positions. Unless specified otherwise, uses the column
                                                                   name pos for position, col for elements of the array or key and
                                                                   value for elements of the map.

posexplode_outer (expr)                    Array/Map              Separates the elements of array expr into multiple rows with
                                                                   positions, or the elements of map expr into multiple rows and
                                                                   columns with positions. Unless specified otherwise, uses the column
                                                                   name pos for position, col for elements of the array or key and
                                                                   value for elements of the map.

stack (n, expr1, …, exprk)                 Seq[Expression]        Separates expr1, …, exprk into n rows. Uses column names col0,
                                                                   col1, etc. by default unless specified otherwise.

json_tuple (jsonStr, p1, p2, …, pn)        Seq[Expression]        Returns a tuple like the function get_json_object, but it takes
                                                                   multiple names. All the input parameters and output column types
                                                                   are string.

parse_url (url, partToExtract[, key])      Seq[Expression]        Extracts a part from a URL.

Examples
-- range call with end
> SELECT * FROM range(6 + cos(3));
0
1
2
3
4

-- range call with start and end


> SELECT * FROM range(5, 10);
5
6
7
8
9

-- range call with numPartitions


> SELECT * FROM range(0, 10, 2, 200);
0
2
4
6
8

-- range call with a table alias


> SELECT * FROM range(5, 8) AS test;
5
6
7

> SELECT explode(array(10, 20));


10
20

> SELECT inline(array(struct(1, 'a'), struct(2, 'b')));


col1 col2
---- ----
1 a
2 b

> SELECT posexplode(array(10,20));


pos col
--- ---
0 10
1 20

> SELECT stack(2, 1, 2, 3);


col0 col1
---- ----
1 2
3 null

> SELECT json_tuple('{"a":1, "b":2}', 'a', 'b');


c0 c1
--- ---
1 2

> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');


spark.apache.org

-- Use explode in a LATERAL VIEW clause


> CREATE TABLE test (c1 INT);
> INSERT INTO test VALUES (1);
> INSERT INTO test VALUES (2);
> SELECT * FROM test LATERAL VIEW explode (ARRAY(3,4)) AS c2;
c1 c2
-- --
1 3
1 4
2 3
2 4

Related articles
SELECT
WHERE clause

Limits the results of the FROM clause of a query or a subquery based on the specified condition.

Syntax
WHERE boolean_expression

Parameters
boolean_expression
Any expression that evaluates to a result type BOOLEAN . You can combine two or more expressions using
the logical operators such as AND or OR .

Examples
> CREATE TABLE person (id INT, name STRING, age INT);
> INSERT INTO person VALUES
(100, 'John', 30),
(200, 'Mary', NULL),
(300, 'Mike', 80),
(400, 'Dan' , 50);

-- Comparison operator in `WHERE` clause.


> SELECT * FROM person WHERE id > 200 ORDER BY id;
300 Mike 80
400 Dan 50

-- Comparison and logical operators in `WHERE` clause.


> SELECT * FROM person WHERE id = 200 OR id = 300 ORDER BY id;
200 Mary NULL
300 Mike 80

-- IS NULL expression in `WHERE` clause.


> SELECT * FROM person WHERE id > 300 OR age IS NULL ORDER BY id;
200 Mary null
400 Dan 50

-- Function expression in `WHERE` clause.


> SELECT * FROM person WHERE length(name) > 3 ORDER BY id;
100 John 30
200 Mary NULL
300 Mike 80

-- `BETWEEN` expression in `WHERE` clause.


> SELECT * FROM person WHERE id BETWEEN 200 AND 300 ORDER BY id;
200 Mary NULL
300 Mike 80

-- Scalar Subquery in `WHERE` clause.


> SELECT * FROM person WHERE age > (SELECT avg(age) FROM person);
300 Mike 80

-- Correlated Subquery in `WHERE` clause.


> SELECT * FROM person AS parent
WHERE EXISTS (SELECT 1 FROM person AS child
WHERE parent.id = child.id
AND child.age IS NULL);
200 Mary NULL

Related articles
QUALIFY
SELECT
WINDOW clause

The window clause allows you to define and name one or more distinct window specifications once and share
them across many window functions within the same query.

Syntax
WINDOW { window_name AS window_spec } [, ...]

Parameters
window_name
An identifier by which the window specification can be referenced. The identifier must be unique within
the WINDOW clause.
window_spec
A window specification to be shared across one or more window functions.

Examples
> CREATE TABLE employees
(name STRING, dept STRING, salary INT, age INT);
> INSERT INTO employees
VALUES ('Lisa', 'Sales', 10000, 35),
('Evan', 'Sales', 32000, 38),
('Fred', 'Engineering', 21000, 28),
('Alex', 'Sales', 30000, 33),
('Tom', 'Engineering', 23000, 33),
('Jane', 'Marketing', 29000, 28),
('Jeff', 'Marketing', 35000, 38),
('Paul', 'Engineering', 29000, 23),
('Chloe', 'Engineering', 23000, 25);

> SELECT round(avg(age) OVER win, 1) AS avgage,


round(avg(salary) OVER win, 1) AS avgsalary,
min(salary) OVER win AS minsalary,
max(salary) OVER win AS maxsalary,
count(1) OVER win AS numEmps
FROM employees
WINDOW win AS (ORDER BY age
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING);
25.3 27000.0 23000 29000 3
26.0 25500.0 21000 29000 4
27.4 25000.0 21000 29000 5
29.4 25200.0 21000 30000 5
31.4 22600.0 10000 30000 5
33.4 23800.0 10000 35000 5
35.4 26000.0 10000 35000 5
36.0 26750.0 10000 35000 4
37.0 25666.7 10000 35000 3
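
The query above shares a single named window. As a further sketch (not part of the original example; the window
names by_dept_salary and by_dept are arbitrary), more than one window specification can be named in the same
WINDOW clause and reused across different window functions:

> SELECT name,
         dept,
         rank()      OVER by_dept_salary AS salary_rank,
         avg(salary) OVER by_dept        AS dept_avg_salary
    FROM employees
  WINDOW by_dept_salary AS (PARTITION BY dept ORDER BY salary DESC),
         by_dept        AS (PARTITION BY dept);
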
Related articles
Window functions
SELECT
Window frame clause (Azure Databricks)

Specifies a sliding subset of rows within the partition on which the aggregate or analytic window function
operates.

Syntax
{ frame_mode frame_start |
  frame_mode BETWEEN frame_start AND frame_end }

frame_mode
{ RANGE | ROWS }

frame_start
{ UNBOUNDED PRECEDING |
offset_start PRECEDING |
CURRENT ROW |
offset_start FOLLOWING }

frame_end
{ offset_stop PRECEDING |
CURRENT ROW |
offset_stop FOLLOWING |
UNBOUNDED FOLLOWING }

Parameters
frame_mode
ROWS
If specified, the sliding window frame is expressed in terms of rows preceding or following the
current row.
RANGE
If specified, the window function must specify an ORDER BY clause with a single expression
obExpr .

The boundaries of the sliding window are then expressed as an offset from the obExpr for the
current row.
frame_start
The starting position of the sliding window frame relative to the current row.
UNBOUNDED PRECEDING
Specifies that the window frame starts at the beginning of partition.
offset_start PRECEDING
If the mode is ROWS , offset_start is the positive integral literal number defining how many rows
prior to the current row the frame starts.
If the mode is RANGE , offset_start is a positive literal value of a type which can be subtracted
from obExpr . The frame starts at the first row of the partition for which obExpr is greater or
equal to obExpr - offset_start at the current row.
CURRENT ROW
Specifies that the frame starts at the current row.
offset_start FOLLOWING
If the mode is ROWS , offset_start is the positive integral literal number defining how many rows
past the current row the frame starts. If the mode is RANGE , offset_start is a positive literal
value of a type which can be added to obExpr . The frame starts at the first row of the partition for
which obExpr is greater or equal to obExpr + offset_start at the current row.
frame_end
The end of the sliding window frame relative to the current row.
If not specified, the frame stops at the CURRENT ROW. The end of the sliding window must be greater
than the start of the window frame.
offset_stop PRECEDING
If frame_mode is ROWS , offset_stop is the positive integral literal number defining how many
rows prior to the current row the frame stops. If frame_mode is RANGE , offset_stop is a positive
literal value of the same type as offset_start . The frame ends at the last row of the partition for
which obExpr is less than or equal to obExpr - offset_stop at the current row.
CURRENT ROW
Specifies that the frame stops at the current row.
offset_stop FOLLOWING
If frame_mode is ROWS , offset_stop is the positive integral literal number defining how many
rows past the current row the frame ends. If frame_mode is RANGE , offset_stop is a positive
literal value of the same type as offset_start . The frame ends at the last row of the partition for
which obExpr is less than or equal to obExpr + offset_stop at the current row.
UNBOUNDED FOLLOWING
Specifies that the window frame stops at the end of the partition.
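
Examples
The following sketch is illustrative only and uses a hypothetical table named sales; it contrasts a ROWS frame
(counted in physical rows) with a RANGE frame (an offset applied to the ORDER BY expression obExpr).

> CREATE TABLE sales(day INT, amount INT);
> INSERT INTO sales VALUES (1, 10), (2, 20), (3, 30), (4, 40);

-- ROWS frame: sum over the current row and the two preceding rows.
> SELECT day,
         sum(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rows_sum
    FROM sales;
 1 10
 2 30
 3 60
 4 90

-- RANGE frame: sum over all rows whose `day` lies within 1 of the current row's `day`.
> SELECT day,
         sum(amount) OVER (ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS range_sum
    FROM sales;
 1 30
 2 60
 3 90
 4 70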

Related articles
Window functions
ANALYZE TABLE

The ANALYZE TABLE statement collects statistics about a specific table, or about all the tables in a specified
schema, that the query optimizer uses to find a better query execution plan.

IMPORTANT
You can run ANALYZE TABLE on Delta tables only on Databricks Runtime 8.3 and above.

Syntax
ANALYZE TABLE table_name [ PARTITION clause ]
COMPUTE STATISTICS [ NOSCAN | FOR COLUMNS col1 [, ...] | FOR ALL COLUMNS ]

ANALYZE TABLES [ { FROM | IN } schema_name ] COMPUTE STATISTICS [ NOSCAN ]

Parameters
table_name
Identifies the table to be analyzed. The name must not include a temporal specification.
PARTITION clause
Optionally limits the command to a subset of partitions.
[ NOSCAN | FOR COLUMNS col [, …] | FOR ALL COLUMNS ]
If no analyze option is specified, ANALYZE TABLE collects the table’s number of rows and size in bytes.
NOSCAN
Collect only the table’s size in bytes (which does not require scanning the entire table).
FOR COLUMNS col [, …] | FOR ALL COLUMNS
Collect column statistics for each column specified, or alternatively for every column, as well as
table statistics.
{ FROM | IN } schema_name
Specifies the name of the schema to be analyzed. Without a schema name, ANALYZE TABLES collects all
tables in the current schema that the current user has permission to analyze.
NOSCAN
Collects only the table’s size in bytes (which does not require scanning the entire table).
FOR COLUMNS col [ , … ] | FOR ALL COLUMNS
Collects column statistics for each column specified, or alternatively for every column, as well as table
statistics.
If no analyze option is specified, both number of rows and size in bytes are collected.

Examples
> CREATE TABLE students (name STRING, student_id INT) PARTITIONED BY (student_id);
> INSERT INTO students PARTITION (student_id = 111111) VALUES ('Mark');
> INSERT INTO students PARTITION (student_id = 222222) VALUES ('John');

> ANALYZE TABLE students COMPUTE STATISTICS NOSCAN;

> DESC EXTENDED students;


col_name data_type comment
-------------------- -------------------- -------
name string null
student_id int null
... ... ...
Statistics 864 bytes
... ... ...

> ANALYZE TABLE students COMPUTE STATISTICS;

> DESC EXTENDED students;


col_name data_type comment
-------------------- -------------------- -------
name string null
student_id int null
... ... ...
Statistics 864 bytes, 2 rows
... ... ...

> ANALYZE TABLE students PARTITION (student_id = 111111) COMPUTE STATISTICS;

> DESC EXTENDED students PARTITION (student_id = 111111);


col_name data_type comment
-------------------- -------------------- -------
name string null
student_id int null
... ... ...
Partition Statistics 432 bytes, 1 rows
... ... ...
OutputFormat org.apache.hadoop...

> ANALYZE TABLE students COMPUTE STATISTICS FOR COLUMNS name;

> DESC EXTENDED students name;


info_name info_value
-------------- ----------
col_name name
data_type string
comment NULL
min NULL
max NULL
num_nulls 0
distinct_count 2
avg_col_len 4
max_col_len 4
histogram NULL

> ANALYZE TABLES IN school_schema COMPUTE STATISTICS NOSCAN;


> DESC EXTENDED teachers;
col_name data_type comment
-------------------- -------------------- -------
name string null
teacher_id int null
... ... ...
Statistics 1382 bytes
... ... ...
... ... ...

> DESC EXTENDED students;


col_name data_type comment
-------------------- -------------------- -------
name string null
student_id int null
... ... ...
Statistics 864 bytes
... ... ...

> ANALYZE TABLES COMPUTE STATISTICS;


> DESC EXTENDED teachers;
col_name data_type comment
-------------------- -------------------- -------
name string null
teacher_id int null
... ... ...
Statistics 1382 bytes, 2 rows
... ... ...

> DESC EXTENDED students;


col_name data_type comment
-------------------- -------------------- -------
name string null
student_id int null
... ... ...
Statistics 864 bytes, 2 rows
... ... ...

Related articles
PARTITION
CACHE TABLE

Caches contents of a table or output of a query with the given storage level in Apache Spark cache. If a query is
cached, then a temp view is created for this query. This reduces scanning of the original files in future queries.

Syntax
CACHE [ LAZY ] TABLE table_name
[ OPTIONS ( 'storageLevel' [ = ] value ) ] [ [ AS ] query ]

See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.

Parameters
LAZY
Only cache the table when it is first used, instead of immediately.
table_name
Identifies the Delta table or view to cache. The name must not include a temporal specification.
OPTIONS ( ‘storageLevel’ [ = ] value )
OPTIONS clause with storageLevel key and value pair. A warning is issued when a key other than
storageLevel is used. The valid options for storageLevel are:

NONE
DISK_ONLY
DISK_ONLY_2
MEMORY_ONLY
MEMORY_ONLY_2
MEMORY_ONLY_SER
MEMORY_ONLY_SER_2
MEMORY_AND_DISK
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER
MEMORY_AND_DISK_SER_2
OFF_HEAP
An Exception is thrown when an invalid value is set for storageLevel . If storageLevel is not explicitly set
using OPTIONS clause, the default storageLevel is set to MEMORY_AND_DISK .
query
A query that produces the rows to be cached. It can be in one of the following formats:
A SELECT statement
A TABLE statement
A FROM statement
Examples
> CACHE TABLE testCache OPTIONS ('storageLevel' 'DISK_ONLY') SELECT * FROM testData;
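
The additional statements below are a sketch (testCache2 is an arbitrary name) showing the LAZY keyword and an
explicit storage level; they assume the testData relation used in the statement above.

-- Defer caching of an existing table until it is first used; the default storage
-- level (MEMORY_AND_DISK) applies because no OPTIONS clause is given.
> CACHE LAZY TABLE testData;

-- Cache the output of a query eagerly with an explicit storage level.
> CACHE TABLE testCache2 OPTIONS ('storageLevel' 'MEMORY_ONLY') SELECT * FROM testData;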

Related statements
CLEAR CACHE
UNCACHE TABLE
REFRESH TABLE
REFRESH
REFRESH FUNCTION
CLEAR CACHE

Removes the entries and associated data from the in-memory and/or on-disk cache for all cached tables and
views in Apache Spark cache.

Syntax
> CLEAR CACHE

See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.

Examples
> CLEAR CACHE;

Related statements
CACHE TABLE
UNCACHE TABLE
REFRESH TABLE
REFRESH
REFRESH FUNCTION
REFRESH

Invalidates and refreshes all the cached data (and the associated metadata) in Apache Spark cache for all
Datasets that contains the given data source path. Path matching is by prefix, that is, / would invalidate
everything that is cached.

Syntax
REFRESH resource_path

See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.

Parameters
resource_path
The path of the resource that is to be refreshed.

Examples
-- The Path is resolved using the datasource's File Index.
> CREATE TABLE test(ID INT) using parquet;
> INSERT INTO test SELECT 1000;
> CACHE TABLE test;
> INSERT INTO test SELECT 100;
> REFRESH "hdfs://path/to/table";

Related statements
CACHE TABLE
CLEAR CACHE
UNCACHE TABLE
REFRESH TABLE
REFRESH FUNCTION
REFRESH FUNCTION

Invalidates the cached function entry for Apache Spark cache, which includes a class name and resource location
of the given function. The invalidated cache is populated right away. Note that REFRESH FUNCTION only works for
permanent functions. Refreshing native functions or temporary functions will cause an exception.

Syntax
REFRESH FUNCTION function_name

See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.

Parameters
function_name
A function name. If the name is unqualified, the current schema is used.

Examples
-- The cached entry of the function is refreshed
-- The function is resolved from the current schema as the function name is unqualified.
> REFRESH FUNCTION func1;

-- The cached entry of the function is refreshed


-- The function is resolved from the sc1 schema as the function name is qualified.
> REFRESH FUNCTION sc1.func1;

Related statements
CACHE TABLE
CLEAR CACHE
UNCACHE TABLE
REFRESH TABLE
REFRESH
REFRESH TABLE

Invalidates the cached entries for Apache Spark cache, which include data and metadata of the given table or
view. The invalidated cache is populated in a lazy manner when the cached table or the query associated with it is
executed again.

Syntax
REFRESH [TABLE] table_name

See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.

Parameters
table_name
Identifies the Delta table or view to refresh. The name must not include a temporal specification.

Examples
-- The cached entries of the table is refreshed
-- The table is resolved from the current schema as the table name is unqualified.
> REFRESH TABLE tbl1;

-- The cached entries of the view is refreshed or invalidated


-- The view is resolved from tempDB schema, as the view name is qualified.
> REFRESH TABLE tempDB.view1;

Related statements
CACHE TABLE
CLEAR CACHE
UNCACHE TABLE
REFRESH
REFRESH FUNCTION
UNCACHE TABLE

Removes the entries and associated data from the in-memory and/or on-disk cache for a given table or view in
Apache Spark cache. The underlying entries should already have been brought to cache by previous
CACHE TABLE operation. UNCACHE TABLE on a non-existent table throws an exception if IF EXISTS is not
specified.

Syntax
UNCACHE TABLE [ IF EXISTS ] table_name

See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.

Parameters
table_name
Identifies the Delta table or view to uncache. The name must not include a temporal specification.

Examples
> UNCACHE TABLE t1;
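
The statement below is a sketch of the IF EXISTS form; t2 is a hypothetical name, and no exception is thrown
even if that table has never been cached or does not exist.

> UNCACHE TABLE IF EXISTS t2;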

Related statements
CACHE TABLE
CLEAR CACHE
REFRESH TABLE
REFRESH
REFRESH FUNCTION
DESCRIBE CATALOG

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Returns the metadata of an existing catalog. The metadata information includes catalog name, comment, and
owner. If the optional EXTENDED option is specified, it returns the basic metadata information along with the
other catalog properties.
Since: Databricks Runtime 10.3

Syntax
{ DESC | DESCRIBE } CATALOG [ EXTENDED ] catalog_name

Parameters
catalog_name : The name of an existing catalog in the metastore. If the name does not exist, an exception is
thrown.

Examples
> DESCRIBE CATALOG main;
info_name info_value
------------ ------------------------------------
Catalog Name main
Comment Main catalog (auto-created)
Owner metastore-admin-users

> DESCRIBE CATALOG EXTENDED main;


info_name info_value
------------ ------------------------------------
Catalog Name main
Comment This is a reserved catalog in Spark.
Comment Main catalog (auto-created)
Owner metastore-admin-users
Created By
Created At
Updated By
Updated At

Related articles
DESCRIBE DATABASE
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE TABLE
INFORMATION_SCHEMA.CATALOGS
DESCRIBE STORAGE CREDENTIAL

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Returns the metadata of an existing storage credential. The metadata information includes credential name,
comment, owner and other metadata.
You must be an account or metastore admin to execute this command.
Since: Databricks Runtime 10.3

Syntax
DESCRIBE STORAGE CREDENTIAL credential_name

Parameters
credential_name
The name of an existing storage credential in the metastore. If the name does not exist, an exception is
thrown.

Examples
> DESCRIBE STORAGE CREDENTIAL good_cred;
name owner created_at created_by credential
--------- ------ ------------------------ ------------ ---------------------------------------------
good_cred admins 2022-01-01T08:00:00.0000 jane@doe.com AwsIamRole:arn:aws:iam:123456789012:roe/us....

Related articles
ALTER STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
SHOW STORAGE CREDENTIALS
DESCRIBE DATABASE

An alias for DESCRIBE SCHEMA.


While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Related articles
DESCRIBE CATALOG
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE SCHEMA
DESCRIBE TABLE
INFORMATION_SCHEMA.SCHEMATA
DESCRIBE FUNCTION

Returns the basic metadata information of an existing function. The metadata information includes the function
name, implementing class and the usage details. If the optional EXTENDED option is specified, the basic metadata
information is returned along with the extended usage information.

Syntax
{ DESC | DESCRIBE } FUNCTION [ EXTENDED ] function_name

Parameters
function_name
The name of an existing function in the metastore. The function name may be optionally qualified with a
schema name. If function_name is qualified with a schema then the function is resolved from the user
specified schema, otherwise it is resolved from the current schema.

Examples
-- Describe a builtin scalar function.
-- Returns function name, implementing class and usage
> DESCRIBE FUNCTION abs;
Function: abs
Class: org.apache.spark.sql.catalyst.expressions.Abs
Usage: abs(expr) - Returns the absolute value of the numeric value.

-- Describe a builtin scalar function.


-- Returns function name, implementing class and usage and examples.
> DESCRIBE FUNCTION EXTENDED abs;
Function: abs
Class: org.apache.spark.sql.catalyst.expressions.Abs
Usage: abs(expr) - Returns the absolute value of the numeric value.
Extended Usage:
Examples:
> SELECT abs(-1);
1

-- Describe a builtin aggregate function


> DESCRIBE FUNCTION max;
Function: max
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Max
Usage: max(expr) - Returns the maximum value of `expr`.

-- Describe a builtin generator function


-- Returns function name, implementing class and usage and examples.
> DESCRIBE FUNCTION EXTENDED explode;
Function: explode
Class: org.apache.spark.sql.catalyst.expressions.Explode
Usage: explode(expr) - Separates the elements of array `expr`
into multiple rows, or the elements of map `expr` into
multiple rows and columns. Unless specified otherwise, use
the default column name `col` for elements of the array or
`key` and `value` for the elements of the map.
Extended Usage:
Examples:
> SELECT explode(array(10, 20));
10
20

-- Describe a user defined scalar function


> CREATE FUNCTION dice(n INT) RETURNS INT
NOT DETERMINISTIC
COMMENT 'An n-sided dice'
RETURN floor((rand() * n) + 1);

> DESCRIBE FUNCTION EXTENDED dice;


Function: default.dice
Type: SCALAR
Input: n INT
Returns: INT
Comment: An n-sided dice
Deterministic: false
Owner: user
Create Time: Fri Apr 16 10:00:00 PDT 2021
Body: floor((rand() * n) + 1)

Related articles
DESCRIBE SCHEMA
DESCRIBE TABLE
DESCRIBE QUERY
DESCRIBE EXTERNAL LOCATION

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Returns the metadata of an existing external location. The metadata information includes location name, URL,
associated credential, owner, and timestamps of creation and last modification.
Since: Databricks Runtime 10.3

Syntax
DESCRIBE EXTERNAL LOCATION location_name

Parameters
location_name
The name of an existing external location in the metastore. If the name does not exist, an exception is
thrown.

Examples
> DESCRIBE EXTERNAL LOCATION best_loco;
name url credential_name owner created_by created_at
comment
--------- ----------------------------------- --------------- -------------- -------------- --------------
-- ----------
best_loco abfss://us-east-1-dev/best_location good_credential scooby@doo.com scooby@doo.com 2021-11-12
13:51 Nice place

Related articles
ALTER EXTERNAL LOCATION
CREATE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
SHOW EXTERNAL LOCATIONS
DESCRIBE QUERY

Returns the metadata of output of a query.

Syntax
{ DESC | DESCRIBE } [ QUERY ] input_statement

Parameters
QUERY
This clause is optional and may be omitted.
query
The query to be described.

Examples
-- Create table `person`
> CREATE TABLE person (name STRING , age INT COMMENT 'Age column', address STRING);

-- Returns column metadata information for a simple select query


> DESCRIBE QUERY SELECT age, sum(age) FROM person GROUP BY age;
col_name data_type comment
-------- --------- ----------
age int Age column
sum(age) bigint null

-- Returns column metadata information for common table expression (`CTE`).


> DESCRIBE QUERY WITH all_names_cte
AS (SELECT name FROM person) SELECT * FROM all_names_cte;
col_name data_type comment
-------- --------- -------
name string null

-- Returns column metadata information for an inline table.


> DESCRIBE QUERY VALUES(100, 'John', 10000.20D) AS employee(id, name, salary);
col_name data_type comment
-------- --------- -------
id int null
name string null
salary double null

-- Returns column metadata information for `TABLE` statement.


> DESCRIBE QUERY TABLE person;
col_name data_type comment
-------- --------- ----------
name string null
age int Age column
address string null

Related articles
DESCRIBE SCHEMA
DESCRIBE TABLE
DESCRIBE FUNCTION
DESCRIBE RECIPIENT

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Returns the metadata of an existing recipient. The metadata information includes recipient name, and activation
link.
Since: Databricks Runtime 10.3

Syntax
[ DESC | DESCRIBE ] RECIPIENT recipient_name

Parameters
recipient_name
The name of an existing recipient. If the name does not exist, an exception is thrown.

Examples
> CREATE RECIPIENT other_org COMMENT 'other.org';
> DESCRIBE RECIPIENT other_org;
name created_at created_by comment activation_link
active_token_id active_token_expiration_time rotated_token_id
rotated_token_expiration_time
--------- ---------------------------- -------------------------- --------- --------------- --------------
---------------------- ---------------------------- ---------------- -----------------------------
other_org 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com other.org https://.... 0160c81f-5262-
40bb-9b03-3ee12e6d98d7 9999-12-31T23:59:59.999+0000 NULL NULL

Related articles
CREATE RECIPIENT
DROP RECIPIENT
SHOW RECIPIENTS
DESCRIBE SCHEMA

Returns the metadata of an existing schema. The metadata information includes the schema’s name, comment,
and location on the filesystem. If the optional EXTENDED option is specified, schema properties are also returned.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Syntax
{ DESC | DESCRIBE } SCHEMA [ EXTENDED ] schema_name

Parameters
schema_name : The name of an existing schema in the system. If the name does not exist, an
exception is thrown.

Examples
-- Create employees SCHEMA
> CREATE SCHEMA employees COMMENT 'For software companies';

-- Describe employees SCHEMA.


-- Returns Database Name, Description and Root location of the filesystem
-- for the employees SCHEMA.
> DESCRIBE SCHEMA employees;
database_description_item database_description_value
------------------------- -----------------------------
Database Name employees
Description For software companies
Location file:/you/Temp/employees.db

-- Create employees SCHEMA


> CREATE SCHEMA employees COMMENT 'For software companies';

-- Alter employees schema to set DBPROPERTIES


> ALTER SCHEMA employees SET DBPROPERTIES ('Create-by' = 'Kevin', 'Create-date' = '09/01/2019');

-- Describe employees SCHEMA with EXTENDED option to return additional schema properties
> DESCRIBE SCHEMA EXTENDED employees;
database_description_item database_description_value
------------------------- ---------------------------------------------
Database Name employees
Description For software companies
Location file:/you/Temp/employees.db
Properties ((Create-by,kevin), (Create-date,09/01/2019))

-- Create deployment SCHEMA


> CREATE SCHEMA deployment COMMENT 'Deployment environment';

-- Describe deployment.
> DESCRIBE SCHEMA deployment;
database_description_item database_description_value
------------------------- ------------------------------
Database Name deployment
Description Deployment environment
Location file:/you/Temp/deployment.db

Related articles
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE TABLE
INFORMATION_SCHEMA.SCHEMATA
DESCRIBE SHARE

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Returns the metadata of an existing share. The metadata information includes share name, owner, and
timestamps of creation and last modification.
To list the content of a share use SHOW ALL IN SHARE.
Since: Databricks Runtime 10.3

Syntax
[ DESC | DESCRIBE ] SHARE share_name

Parameters
share_name
The name of an existing share. If the name does not exist, an exception is thrown.

Examples
> CREATE SHARE vaccine COMMENT 'vaccine data to publish';
> DESCRIBE SHARE vaccine;
name created_at created_by comment
--------- ---------------------------- -------------------------- -----------------------
vaccine 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com vaccine data to publish

Related articles
ALTER SHARE
CREATE SHARE
DROP SHARE
SHOW ALL IN SHARE
SHOW SHARES
DESCRIBE TABLE

Returns the basic metadata information of a table. The metadata information includes column name, column
type and column comment. Optionally you can specify a partition spec or column name to return the metadata
pertaining to a partition or column respectively.

Syntax
{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] table_name { [ PARTITION clause ] | [ column_name ] }

Parameters
EXTENDED or FORMATTED

If specified, displays detailed information about the specified columns, including the column statistics
collected by the command, and additional metadata information (such as schema qualifier, owner, and
access time).
table_name
Identifies the table to be described. The name may not use a temporal specification.
PARTITION clause
An optional parameter directing Databricks Runtime to return additional metadata for the named
partitions.
column_name
An optional parameter with the column name that needs to be described. Currently nested columns are
not allowed to be specified.
The PARTITION clause and column_name are mutually exclusive and cannot be specified together.

Examples
-- Creates a table `customer`. Assumes current schema is `salesdb`.
> CREATE TABLE customer(
cust_id INT,
state VARCHAR(20),
name STRING COMMENT 'Short name'
)
USING parquet
PARTITIONED BY (state);

> INSERT INTO customer PARTITION (state = 'AR') VALUES (100, 'Mike');

-- Returns basic metadata information for unqualified table `customer`


> DESCRIBE TABLE customer;
col_name data_type comment
----------------------- --------- ----------
cust_id int null
name string Short name
state string null
# Partition Information
# col_name data_type comment
state string null

-- Returns basic metadata information for qualified table `customer`


> DESCRIBE TABLE salesdb.customer;
col_name data_type comment
----------------------- --------- ----------
cust_id int null
name string Short name
state string null
# Partition Information
# col_name data_type comment
state string null

-- Returns additional metadata such as parent schema, owner, access time etc.
> DESCRIBE TABLE EXTENDED customer;
col_name data_type comment
---------------------------- ------------------------------ ----------
cust_id int null
name string Short name
state string null
# Partition Information
# col_name data_type comment
state string null

# Detailed Table Information


Database default
Table customer
Owner <TABLE OWNER>
Created Time Tue Apr 07 22:56:34 JST 2020
Last Access UNKNOWN
Created By <SPARK VERSION>
Type MANAGED
Provider parquet
Location file:/tmp/salesdb.db/custom...
Serde Library org.apache.hadoop.hive.ql.i...
InputFormat org.apache.hadoop.hive.ql.i...
OutputFormat org.apache.hadoop.hive.ql.i...
Partition Provider Catalog

-- Returns partition metadata such as partitioning column name, column type and comment.
> DESCRIBE TABLE EXTENDED customer PARTITION (state = 'AR');
col_name data_type comment
------------------------------ ------------------------------ ----------
cust_id int null
name string Short name
state string null
# Partition Information
# col_name data_type comment
state string null

# Detailed Partition Inform...


Database default
Table customer
Partition Values [state=AR]
Location file:/tmp/salesdb.db/custom...
Serde Library org.apache.hadoop.hive.ql.i...
InputFormat org.apache.hadoop.hive.ql.i...
OutputFormat org.apache.hadoop.hive.ql.i...
Storage Properties [serialization.format=1, pa...
Partition Parameters {transient_lastDdlTime=1586...
Created Time Tue Apr 07 23:05:43 JST 2020
Last Access UNKNOWN
Partition Statistics 659 bytes

# Storage Information
Location file:/tmp/salesdb.db/custom...
Serde Library org.apache.hadoop.hive.ql.i...
Serde Library org.apache.hadoop.hive.ql.i...
InputFormat org.apache.hadoop.hive.ql.i...
OutputFormat org.apache.hadoop.hive.ql.i...

-- Returns the metadata for `name` column.


-- Optional `TABLE` clause is omitted and column is fully qualified.
> DESCRIBE customer salesdb.customer.name;
info_name info_value
--------- ----------
col_name name
data_type string
comment Short name

> CREATE TABLE T(pk1 INTEGER NOT NULL, pk2 INTEGER NOT NULL,
CONSTRAINT t_pk PRIMARY KEY(pk1, pk2));
> CREATE TABLE S(pk INTEGER NOT NULL PRIMARY KEY,
fk1 INTEGER, fk2 INTEGER,
CONSTRAINT s_t_fk FOREIGN KEY(fk1, fk2) REFERENCES T);

> DESCRIBE TABLE EXTENDED T;


col_name data_type comment
--------------------- ---------------------------- ----------
pk1 int
pk2 int

# Detailed Table I...


Database default
Table t
Owner ...
Created Time Mon Nov 15 11:42:07 PST 2021
...
Partition Provider Catalog

# Constraints
t_pk PRIMARY KEY (pk1, pk2)

> DESCRIBE TABLE EXTENDED S;


col_name data_type comment
--------------------- ----------------------------------------------------- ----------
pk int
fk1 int
fk2 int

# Detailed Table Inf...


Database default
Table s
Owner ...
Created Time Mon Nov 15 11:42:35 PST 2021
...
Partition Provider Catalog

# Constraints
s_pk PRIMARY KEY (pk)
s_fk_p FOREIGN KEY (fk1, fk2) REFERENCES default.t (pk1, pk2)

DESCRIBE DETAIL
DESCRIBE DETAIL [schema_name.]table_name

DESCRIBE DETAIL delta.`<path-to-table>`

Returns information about schema, partitioning, table size, and so on. For example, for Delta tables, you can see
the current reader and writer versions of a table. See Detail schema for the detail schema.
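
Examples
The statements below are a sketch; the table name events and the path are hypothetical. The command accepts
either a table name resolved against the current schema or a path to a Delta table.

-- By table name.
> DESCRIBE DETAIL events;

-- By path to a Delta table.
> DESCRIBE DETAIL delta.`/tmp/delta/events`;
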
Related articles
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE SCHEMA
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
PARTITION
LIST

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Lists the objects immediately contained at the URL.


Since: Databricks Runtime 10.3

Syntax
LIST url [ WITH ( CREDENTIAL credential_name ) ] [ LIMIT limit ]

Parameters
url
A STRING literal with the location of the cloud storage described as an absolute URL.
credential_name
An optional named credential used to access this URL. If you supply a credential it must be sufficient to
access the URL. If you do not supply a credential the URL must be contained in an external location to
which you have access.
limit
An optional INTEGER constant between 1 and 1001 used to limit the number of objects returned. The
default limit is 1001 .

Examples
> LIST 'abfss://us-east-1-dev/some_dir' WITH (CREDENTIAL azure_some_dir) LIMIT 2
path name size modification_time is_directory
------------------------------------- ------ ---- ----------------- ------------
abfss://us-east-1-dev/some_dir/table1 table1 0 ... true
abfss://us-east-1-dev/some_dir/table1 table1 0 ... true

Related articles
ALTER EXTERNAL LOCATION
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATIONS
DROP EXTERNAL LOCATION
SHOW ALL IN SHARE

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Lists the content of a share.


Since: Databricks Runtime 10.3

Syntax
SHOW ALL IN SHARE share_name

Parameters
share_name
The name of an existing share. If the name does not exist, an exception is thrown.

Examples
-- Create share `customer_share` only if share with same name doesn't exist, with a comment.
> CREATE SHARE IF NOT EXISTS customer_share COMMENT 'This is customer share';

-- Add 2 tables to the share.


-- Expose my_schema.tab1 under a different name.
-- Expose only two partitions of other_schema.tab2
> ALTER SHARE customer_share ADD TABLE my_schema.tab1 AS their_schema.tab1;
> ALTER SHARE customer_share ADD TABLE other_schema.tab2 PARTITION (c1 = 5), (c1 = 7);

-- List the content of the share


> SHOW ALL IN SHARE customer_share;
name type shared_object added_at added_by
comment partitions
----------------- ----- ---------------------- ---------------------------- -------------------------- ---
---- ------------------
other_schema.tab2 TABLE main.other_schema.tab2 2022-01-01T00:00:01.000+0000 alwaysworks@databricks.com
NULL
their_schema.tab1 TABLE main.myschema.tab2 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com
NULL (c1 = 5), (c1 = 7)

Related articles
ALTER SHARE
CREATE SHARE
DESCRIBE SHARE
DROP SHARE
SHOW SHARES
SHOW COLUMNS

Returns the list of columns in a table. If the table does not exist, an exception is thrown.

Syntax
SHOW COLUMNS { IN | FROM } table_name [ { IN | FROM } schema_name ]

NOTE
Keywords IN and FROM are interchangeable.

Parameters
table_name
Identifies the table. The name must not include a temporal specification.
schema_name
An optional alternative means of qualifying the table_name with a schema name. When this parameter is
specified, the table name should not be qualified with a different schema name.

Examples
-- Create `customer` table in the `salessc` schema;
> USE SCHEMA salessc;
> CREATE TABLE customer(
cust_cd INT,
name VARCHAR(100),
cust_addr STRING);

-- List the columns of `customer` table in current schema.


> SHOW COLUMNS IN customer;
col_name
---------
cust_cd
name
cust_addr

-- List the columns of `customer` table in `salessc` schema.


> SHOW COLUMNS IN salessc.customer;
col_name
---------
cust_cd
name
cust_addr

-- List the columns of `customer` table in `salessc` schema


> SHOW COLUMNS IN customer IN salessc;
col_name
---------
cust_cd
name
cust_addr

Related articles
DESCRIBE TABLE
INFORMATION_SCHEMA.COLUMNS
SHOW TABLE
SHOW CREATE TABLE

Returns the CREATE TABLE statement or CREATE VIEW statement that was used to create a given table or view.
SHOW CREATE TABLE on a non-existent table or a temporary view throws an exception.

Syntax
SHOW CREATE TABLE { table_name | view_name }

Parameters
table_name
Identifies the table. The name must not include a temporal specification.

Examples
> CREATE TABLE test (c INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('prop1' = 'value1', 'prop2' = 'value2');

> SHOW CREATE TABLE test;


createtab_stmt
----------------------------------------------------
CREATE TABLE `default`.`test` (`c` INT)
USING text
TBLPROPERTIES (
'transient_lastDdlTime' = '1586269021',
'prop1' = 'value1',
'prop2' = 'value2')

Related articles
CREATE TABLE
CREATE VIEW
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.VIEWS
SHOW STORAGE CREDENTIALS


IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Lists the storage credentials.


You must be an account or metastore admin to execute this command.
Since: Databricks Runtime 10.3

Syntax
SHOW STORAGE CREDENTIALS

Parameters
The statement takes no parameters.

Examples
> SHOW STORAGE CREDENTIALS
name comment
------------ -----------------
some_creds Used to access s3

Related articles
DESCRIBE STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
SHOW DATABASES

An alias for SHOW SCHEMAS.


While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

Related articles
ALTER SCHEMA
CREATE SCHEMA
DESCRIBE SCHEMA
INFORMATION_SCHEMA.SCHEMATA
SHOW SCHEMAS
SHOW FUNCTIONS

Returns the list of functions after applying an optional regex pattern. Databricks Runtime supports a large
number of functions. You can use SHOW FUNCTIONS in conjunction with DESCRIBE FUNCTION to quickly find a
function and learn how to use it. The LIKE clause is optional, and ensures compatibility with other systems.

Syntax
SHOW [ function_kind ] FUNCTIONS [ { FROM | IN } schema_name ]
[ [ LIKE ] { function_name | regex_pattern } ]

function_kind
{ USER | SYSTEM | ALL }

Parameters
function_kind
The namespace of the functions to be searched. The valid namespaces are:
USER - Looks up the function(s) among the user defined functions.
SYSTEM - Looks up the function(s) among the system defined functions.
ALL - Looks up the function(s) among both user and system defined functions.
schema_name
Since: Databricks Runtime 10.3
Specifies the schema in which functions are to be listed.
function_name
A name of an existing function in the system. If schema_name is not provided the function name may be
qualified with a schema name instead. If function_name is not qualified and schema_name has not been
specified the function is resolved from the current schema.
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.

Examples
-- List a system function `trim` by searching both user defined and system
-- defined functions.
> SHOW FUNCTIONS trim;
trim

-- List a system function `concat` by searching system defined functions.


> SHOW SYSTEM FUNCTIONS concat;
concat

-- List a qualified function `max` from schema `salesdb`.


> SHOW SYSTEM FUNCTIONS IN salesdb max;
max

-- List all functions starting with `t`


> SHOW FUNCTIONS LIKE 't*';
tan
tanh
timestamp
tinyint
to_csv
to_date
to_json
to_timestamp
to_unix_timestamp
to_utc_timestamp
transform
transform_keys
transform_values
translate
trim
trunc
typeof

-- List all functions starting with `yea` or `windo`


> SHOW FUNCTIONS LIKE 'yea*|windo*';
window
year

-- Use normal regex pattern to list function names that has 4 characters
-- with `t` as the starting character.
> SHOW FUNCTIONS LIKE 't[a-z][a-z][a-z]';
tanh
trim
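
As a further sketch (not from the original examples), the USER function kind restricts the listing to user
defined functions; this assumes a user defined function such as dice (see DESCRIBE FUNCTION) exists in the
current schema.

> SHOW USER FUNCTIONS LIKE 'di*';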

Related articles
DESCRIBE FUNCTION
SHOW GROUPS

Since: Databricks Runtime 8.3


Lists the groups that match an optionally supplied regular expression pattern. If you don’t supply a pattern, the
command lists all of the groups in the system. You can optionally supply an identifier to show only the groups a
specific user or group belongs to.
If a principal is provided using WITH {USER | GROUP} , a non-NULL Boolean value in the directGroup column indicates
the principal’s membership.
TRUE : The principal is a direct member of the group.
FALSE : The principal is an indirect member of the group.

If WITH {USER | GROUP} is not used, directGroup will always be NULL .

Syntax
SHOW GROUPS [ WITH USER user_principal |
WITH GROUP group_principal ]
[ [ LIKE ] regex_pattern ]

Parameters
user_principal
Show only groups that contain the specified user.
group_principal
Show only groups that contain the specified group.
regex_pattern
A STRING literal with a limited regular expression pattern used to filter the results of the statement.
* at the start and end of a pattern matches on a substring.
* only at end of a pattern matches the start of a group.
| separates multiple regular expressions, any of which can match.
The pattern match is case-insensitive.

Examples
-- Lists all groups.
> SHOW GROUPS;
name directGroup
------------ -----------
tv_alien NULL
alien NULL
californian NULL
pastafarian NULL

-- Lists groups whose names match the string pattern `*al*`.


> SHOW GROUPS LIKE '*al*';
name directGroup
------------ -----------
tv_alien NULL
alien NULL
californian NULL

-- Lists groups with Alf as a member.


> SHOW GROUPS WITH USER `alf@melmak.et`;
name directGroup
------------ -----------
tv_alien true
alien false

Related articles
SHOW GRANTS
SHOW USERS
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
SHOW EXTERNAL LOCATIONS

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

Lists the external locations that match an optionally supplied regular expression pattern. If no pattern is supplied
then the command lists all the external locations in the metastore.
Since: Databricks Runtime 10.3

Syntax
SHOW EXTERNAL LOCATIONS [ LIKE regex_pattern ]

Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.

Examples
> SHOW EXTERNAL LOCATIONS;
name url comment
------------ ----------------------------------- ----------
best_loco abfss://us-east-1-dev/best_location Nice place
three_tatami abfss://us-west-1-dev/tatami_xs Quite cozy

Related articles
ALTER EXTERNAL LOCATION
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATIONS
DROP EXTERNAL LOCATION
SHOW PARTITIONS

Lists partitions of a table.

Syntax
SHOW PARTITIONS table_name [ PARTITION clause ]

Parameters
table_name
Identifies the table. The name must not include a temporal specification.
PARTITION clause
An optional parameter that specifies a partition. If the specification is only partial, all matching partitions
are returned. If no partition is specified at all, Databricks Runtime returns all partitions.

Examples
-- create a partitioned table and insert a few rows.
> USE salesdb;
> CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (state STRING, city STRING);
> INSERT INTO customer PARTITION (state = 'CA', city = 'Fremont') VALUES (100, 'John');
> INSERT INTO customer PARTITION (state = 'CA', city = 'San Jose') VALUES (200, 'Marry');
> INSERT INTO customer PARTITION (state = 'AZ', city = 'Peoria') VALUES (300, 'Daniel');

-- Lists all partitions for table `customer`


> SHOW PARTITIONS customer;
state=AZ/city=Peoria
state=CA/city=Fremont
state=CA/city=San Jose

-- Lists all partitions for the qualified table `customer`


> SHOW PARTITIONS salesdb.customer;
state=AZ/city=Peoria
state=CA/city=Fremont
state=CA/city=San Jose

-- Specify a full partition spec to list specific partition


> SHOW PARTITIONS customer PARTITION (state = 'CA', city = 'Fremont');
|state=CA/city=Fremont|

-- Specify a partial partition spec to list the specific partitions


> SHOW PARTITIONS customer PARTITION (state = 'CA');
state=CA/city=Fremont
state=CA/city=San Jose

-- Specify a partial spec to list specific partition


> SHOW PARTITIONS customer PARTITION (city = 'San Jose');
state=CA/city=San Jose
Related articles
CREATE TABLE
INSERT INTO
DESCRIBE TABLE
PARTITION
SHOW TABLE
SHOW RECIPIENTS

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Lists the recipients that match an optionally supplied regular expression pattern. If no pattern is supplied then
the command lists all the recipients in the metastore.
Since: Databricks Runtime 10.3

Syntax
SHOW RECIPIENTS [ LIKE regex_pattern ]

Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.

Examples
> CREATE RECIPIENT other_org COMMENT 'other.org';
> CREATE RECIPIENT better_corp COMMENT 'better.com';
> SHOW RECIPIENTS;
name created_at created_by comment
----------- ---------------------------- -------------------------- ----------
other_org 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com other.org
better_corp 2022-01-01T00:00:01.000+0000 alwaysworks@databricks.com better.com

> SHOW RECIPIENTS LIKE 'other_org';


name number_of_activated_token created_at created_by
--------- ------------------------- ---------------------------- --------------------------
other_org 0 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com other.org

Related articles
CREATE RECIPIENT
DESCRIBE RECIPIENT
DROP RECIPIENT
SHOW SCHEMAS

Lists the schemas that match an optionally supplied regular expression pattern. If no pattern is supplied then the
command lists all the schemas in the system.
While usage of SCHEMAS and DATABASES is interchangeable, SCHEMAS is preferred.

Syntax
SHOW SCHEMAS [ LIKE regex_pattern ]

Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.

Examples
-- Create schema. Assumes a schema named `default` already exists in
-- the system.
> CREATE SCHEMA payroll_sc;
> CREATE SCHEMA payments_sc;

-- Lists all the schemas.


> SHOW SCHEMAS;
databaseName
------------
default
payments_sc
payroll_sc

-- Lists schemas with name starting with string pattern `pay`


> SHOW SCHEMAS LIKE 'pay*';
databaseName
------------
payments_sc
payroll_sc

-- Lists all schemas. Keywords SCHEMAS and DATABASES are interchangeable.


> SHOW SCHEMAS;
databaseName
------------
default
payments_sc
payroll_sc
Related articles
ALTER SCHEMA
CREATE SCHEMA
DESCRIBE SCHEMA
INFORMATION_SCHEMA.SCHEMATA
SHOW SHARES

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Lists the shares that match an optionally supplied regular expression pattern. If no pattern is supplied then the
command lists all the shares in the metastore.
To list the content of a share use SHOW ALL IN SHARE.
Since: Databricks Runtime 10.3

Syntax
SHOW SHARES [ LIKE regex_pattern ]

Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.

Examples
> CREATE SHARE vaccine COMMENT 'vaccine data to publish';
> CREATE SHARE meds COMMENT 'meds data to publish';
> SHOW SHARES;
name created_at created_by comment
--------- ---------------------------- -------------------------- -----------------------
vaccine 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com vaccine data to publish
meds 2022-01-01T00:00:01.000+0000 alwaysworks@databricks.com meds data to publish

> SHOW SHARES LIKE 'vaccine';


name created_at created_by comment
--------- ---------------------------- -------------------------- -----------------------
vaccine 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com vaccine data to publish

Related articles
ALTER SHARE
CREATE SHARE
DESCRIBE SHARE
DROP SHARE
SHOW ALL IN SHARE
SHOW TABLE EXTENDED
7/21/2022 • 2 minutes to read

Shows information for all tables matching the given regular expression. Output includes basic table information
and file system information like Last Access , Created By , Type , Provider , Table Properties , Location ,
Serde Library , InputFormat , OutputFormat , Storage Properties , Partition Provider , Partition Columns , and
Schema .

If a partition specification is present, it outputs the given partition’s file-system-specific information such as
Partition Parameters and Partition Statistics . You cannot use a table regex with a partition specification.

Syntax
SHOW TABLE EXTENDED [ { IN | FROM } schema_name ] LIKE regex_pattern
[ PARTITION clause ]

Parameters
schema_name
Specifies the schema name. If not provided, uses the current schema.
regex_pattern
The regular expression pattern used to filter out unwanted tables.
Except for the * and | characters, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
PARTITION clause
An optional parameter that specifies a partition. You cannot use a table regex pattern with a PARTITION clause.

Examples
-- Assumes `employee` table partitioned by column `grade`
> CREATE TABLE employee(name STRING, grade INT) PARTITIONED BY (grade);
> INSERT INTO employee PARTITION (grade = 1) VALUES ('sam');
> INSERT INTO employee PARTITION (grade = 2) VALUES ('suj');

-- Show the details of the table

> SHOW TABLE EXTENDED LIKE 'employee';


database tableName isTemporary information
-------- --------- ----------- --------------------------------------------------------------
default employee false Database: default
Table: employee
Owner: root
Created Time: Fri Aug 30 15:10:21 IST 2019
Last Access: Thu Jan 01 05:30:00 IST 1970
Created By: Spark 3.0.0
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1567158021]
Location: file:/opt/spark1/spark/spark-warehouse/employee
Serde Library: org.apache.hadoop.hive.serde2.lazy
.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io
.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Partition Columns: [`grade`]
Schema: root
-- name: string (nullable = true)
-- grade: integer (nullable = true)

-- show multiple table details with pattern matching


> SHOW TABLE EXTENDED LIKE `employe*`;
database tableName isTemporary information
-------- --------- ----------- --------------------------------------------------------------
default employee false Database: default
Table: employee
Owner: root
Created Time: Fri Aug 30 15:10:21 IST 2019
Last Access: Thu Jan 01 05:30:00 IST 1970
Created By: Spark 3.0.0
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1567158021]
Location: file:/opt/spark1/spark/spark-warehouse/employee
Serde Library: org.apache.hadoop.hive.serde2.lazy
.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io
.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Partition Columns: [`grade`]
Schema: root
-- name: string (nullable = true)
-- grade: integer (nullable = true)

default employee1 false Database: default


Table: employee1
Owner: root
Created Time: Fri Aug 30 15:22:33 IST 2019
Last Access: Thu Jan 01 05:30:00 IST 1970
Created By: Spark 3.0.0
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1567158753]
Location: file:/opt/spark1/spark/spark-warehouse/employee1
Serde Library: org.apache.hadoop.hive.serde2.lazy
.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io
.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
-- name: string (nullable = true)

-- show partition file system details


> SHOW TABLE EXTENDED IN default LIKE `employee` PARTITION (`grade=1`);
database tableName isTemporary information
-------- --------- ----------- --------------------------------------------------------------
default employee false Partition Values: [grade=1]
Location: file:/opt/spark1/spark/spark-warehouse/employee
/grade=1
Serde Library: org.apache.hadoop.hive.serde2.lazy
.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io
.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Parameters: {rawDataSize=-1, numFiles=1,
transient_lastDdlTime=1567158221, totalSize=4,
COLUMN_STATS_ACCURATE=false, numRows=-1}
Created Time: Fri Aug 30 15:13:41 IST 2019
Last Access: Thu Jan 01 05:30:00 IST 1970
Partition Statistics: 4 bytes

-- show partition file system details with regex fail


> SHOW TABLE EXTENDED IN default LIKE `empl*` PARTITION (`grade=1`);
Error: Error running query:
Table or view 'empl*' not found in database 'default'; (state=,code=0)

Related articles
CREATE TABLE
DESCRIBE TABLE
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
PARTITION
SHOW TABLES
7/21/2022 • 2 minutes to read

Returns all the tables for an optionally specified schema. Additionally, the output of this statement may be
filtered by an optional matching pattern. If no schema is specified then the tables are returned from the current
schema.

Syntax
SHOW TABLES [ { FROM | IN } schema_name ] [ LIKE regex_pattern ]

Parameters
schema_name
Specifies the schema name from which tables are to be listed. If not provided, uses the current schema.
regex_pattern
The regular expression pattern that is used to filter out unwanted tables.
Except for the * and | characters, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.

Examples
-- List all tables in default schema
> SHOW TABLES;
database tableName isTemporary
-------- --------- -----------
default sam false
default sam1 false
default suj false

-- List all tables from usersc schema


> SHOW TABLES FROM usersc;
database tableName isTemporary
-------- --------- -----------
usersc user1 false
usersc user2 false

-- List all tables in usersc schema


> SHOW TABLES IN usersc;
database tableName isTemporary
-------- --------- -----------
usersc user1 false
usersc user2 false

-- List all tables from default schema matching the pattern `sam*`
> SHOW TABLES FROM default LIKE 'sam*';
database tableName isTemporary
-------- --------- -----------
default sam false
default sam1 false

-- List all tables matching the pattern `sam*|suj`


> SHOW TABLES LIKE 'sam*|suj';
database tableName isTemporary
-------- --------- -----------
default sam false
default sam1 false
default suj false

Related articles
CREATE SCHEMA
CREATE TABLE
DROP SCHEMA
DROP TABLE
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
SHOW TBLPROPERTIES
7/21/2022 • 2 minutes to read

Returns the value of a table property given an optional value for a property key. If no key is specified then all the
properties and options are returned. Table options are prefixed with option .

Syntax
SHOW TBLPROPERTIES table_name
[ ( [unquoted_property_key | property_key_as_string_literal] ) ]

unquoted_property_key
key_part1 [. ...]

Parameters
table_name
Identifies the table. The name must not include a temporal specification.
unquoted_property_key
The property key in unquoted form. The key can consist of multiple parts separated by a dot.
property_key_as_string_literal
A property key value as a string literal.

NOTE
Property values returned by this statement exclude some properties that are internal to Spark and Hive. The excluded
properties are:
All the properties that start with prefix spark.sql
Property keys such as: EXTERNAL , comment
All the properties generated internally by Hive to store statistics. Some of these properties are: numFiles ,
numPartitions , numRows .

Examples
-- create a table `customer` in schema `salessc`
> USE salessc;
> CREATE TABLE customer(cust_code INT, name VARCHAR(100), cust_addr STRING)
TBLPROPERTIES ('created.by.user' = 'John', 'created.date' = '01-01-2001');

-- show all the user specified properties for table `customer`


> SHOW TBLPROPERTIES customer;
key value
--------------------- ----------
created.by.user John
created.date 01-01-2001
transient_lastDdlTime 1567554931

-- show all the user specified properties for a qualified table `customer`
-- in schema `salessc`
> SHOW TBLPROPERTIES salessc.customer;
key value
--------------------- ----------
created.by.user John
created.date 01-01-2001
transient_lastDdlTime 1567554931

-- show value for unquoted property key `created.by.user`


> SHOW TBLPROPERTIES customer (created.by.user);
value
-----
John

-- show value for property `created.date` specified as string literal


> SHOW TBLPROPERTIES customer ('created.date');
value
----------
01-01-2001

Related articles
ALTER TABLE
CREATE TABLE
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
SHOW TABLES
SHOW TABLE EXTENDED
Table properties and table options
SHOW USERS
7/21/2022 • 2 minutes to read

Since: Databricks Runtime 8.3


Lists the users that match an optionally supplied regular expression pattern. If you don’t supply a pattern, the
command lists all of the users in the system.

Syntax
SHOW USERS [ LIKE pattern_expression ]

Parameters
pattern_expression
A limited pattern expression that is used to filter the results of the statement.
The * character is used at the start and end of a pattern to match on a substring.
The * character is used only at the end of a pattern to match the start of a username.
The | character is used to separate multiple different expressions, any of which can match.
The pattern match is case-insensitive.

Examples
-- Lists all users.
> SHOW USERS;
name
------------------
user1@example.com
user2@example.com
user3@example.com

-- Lists users whose name contains the string pattern `SEr`


> SHOW USERS LIKE '*SEr*';
name
------------------
user1@example.com
user2@example.com
user3@example.com

-- Lists users whose name contains `1` or `3`


> SHOW USERS LIKE '*1*|*3*';
name
------------------
user1@example.com
user3@example.com

Related articles
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
SHOW GRANTS
SHOW GROUPS
SHOW VIEWS
7/21/2022 • 2 minutes to read

Returns all the views for an optionally specified schema. Additionally, the output of this statement may be
filtered by an optional matching pattern. If no schema is specified then the views are returned from the current
schema. If the specified schema is the global temporary view schema, Databricks Runtime lists global temporary
views. Note that the command also lists local temporary views regardless of a given schema.

Syntax
SHOW VIEWS [ { FROM | IN } schema_name ] [ LIKE regex_pattern ]

Parameters
schema_name
The schema name from which views are listed.
regex_pattern
The regular expression pattern that is used to filter out unwanted views.
Except for the * and | characters, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.

Examples
-- Create views in different schemas, also create global/local temp views.
> CREATE VIEW sam AS SELECT id, salary FROM employee WHERE name = 'sam';
> CREATE VIEW sam1 AS SELECT id, salary FROM employee WHERE name = 'sam1';
> CREATE VIEW suj AS SELECT id, salary FROM employee WHERE name = 'suj';
> USE SCHEMA usersc;
> CREATE VIEW user1 AS SELECT id, salary FROM default.employee WHERE name = 'user1';
> CREATE VIEW user2 AS SELECT id, salary FROM default.employee WHERE name = 'user2';
> USE SCHEMA default;
> CREATE TEMP VIEW temp1 AS SELECT 1 AS col1;
> CREATE TEMP VIEW temp2 AS SELECT 1 AS col1;

-- List all views in default schema


> SHOW VIEWS;
namespace viewName isTemporary
------------- ------------ --------------
default sam false
default sam1 false
default suj false
temp2 true

-- List all views from usersc schema


> SHOW VIEWS FROM usersc;
namespace viewName isTemporary
------------- ------------ --------------
usersc user1 false
usersc user2 false
temp2 true

-- List all views from default schema matching the pattern `sam*`
> SHOW VIEWS FROM default LIKE 'sam*';
namespace viewName isTemporary
----------- ------------ --------------
default sam false
default sam1 false

-- List all views from the current schema matching the pattern `sam|suj|temp*`
> SHOW VIEWS LIKE 'sam|suj|temp*';
namespace viewName isTemporary
------------- ------------ --------------
default sam false
default suj false
temp2 true

Related articles
CREATE SCHEMA
CREATE VIEW
DROP SCHEMA
DROP VIEW
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.VIEWS
RESET
7/21/2022 • 2 minutes to read

Resets runtime configurations specific to the current session which were set via the SET command to their
default values.

Syntax
RESET;

Parameters
(none)
Reset any runtime configurations specific to the current session which were set via the SET command to
their default values.

Examples
-- Reset any runtime configurations specific to the current session, which were set
-- via the SET command, to their default values.
RESET;

Related statements
SET
SET
7/21/2022 • 2 minutes to read

Sets a property, returns the value of an existing property or returns all SQLConf properties with value and
meaning.

Syntax
SET
SET [ -v ]
SET property_key[ = property_value ]

Parameters
-v
Outputs the key, value and meaning of existing SQLConf properties.
property_key
Returns the value of the specified property key.
property_key=property_value
Sets the value for a given property key. If an old value exists for a given property key, then it gets
overridden by the new value.

Examples
-- Set a property.
SET spark.sql.variable.substitute=false;

-- List all SQLConf properties with value and meaning.


SET -v;

-- List all SQLConf properties with value for current session.


SET;

-- List the value of specified property key.


SET spark.sql.variable.substitute;
+-----------------------------+-----+
| key|value|
+-----------------------------+-----+
|spark.sql.variable.substitute|false|
+-----------------------------+-----+

Related statements
RESET
SET TIME ZONE
7/21/2022 • 2 minutes to read

Sets the time zone of the current session.

Syntax
SET TIME ZONE { LOCAL | time_zone_value | INTERVAL interval_literal }

Parameters
LOCAL
Set the time zone to the one specified in the java user.timezone property, or to the environment variable
TZ if user.timezone is undefined, or to the system time zone if both of them are undefined.

time_zone_value
A STRING literal. The ID of session local timezone in the format of either region-based zone IDs or zone
offsets. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in
the format ‘ (+|-)HH ’, ‘ (+|-)HH:mm ’ or ‘ (+|-)HH:mm:ss ’, e.g., ‘-08’, ‘+01:00’ or ‘-13:33:33’. Also, ‘UTC’ and
‘Z’ are supported as aliases of ‘+00:00’. Other short names are not recommended because they
can be ambiguous.
interval_literal
The interval literal represents the difference between the session time zone and ‘UTC’. It must be in the
range of [-18, 18] hours with a maximum precision of seconds, e.g. INTERVAL 2 HOURS 30 MINUTES or
INTERVAL '15:40:32' HOUR TO SECOND .

Examples
-- Set time zone to the system default.
> SET TIME ZONE LOCAL;

-- Set time zone to the region-based zone ID.


> SET TIME ZONE 'America/Los_Angeles';

-- Set time zone to the Zone offset.


> SET TIME ZONE '+08:00';

-- Set time zone with intervals.


> SET TIME ZONE INTERVAL 1 HOUR 30 MINUTES;
> SET TIME ZONE INTERVAL '08:30:00' HOUR TO SECOND;

Related articles
SET
ADD ARCHIVE
7/21/2022 • 2 minutes to read

Adds an archive file to the list of resources. The given archive file should be one of .zip, .tar, .tar.gz, .tgz and .jar. To
list the archive files that have been added, use LIST ARCHIVE.
Since: Databricks Runtime 10.0

Syntax
ADD [ARCHIVE | ARCHIVES] file_name [...]

Parameters
file_name
The name of an ARCHIVE file to add. It could be either on a local file system or a distributed file system.

Examples
> ADD ARCHIVE /tmp/test.tar.gz;

> ADD ARCHIVE "/path/to/some.zip";

> ADD ARCHIVE '/some/other.tgz';

> ADD ARCHIVE "/path with space/abc.tar" ADD ARCHIVE "/path with space/def.tar";

Related statements
ADD FILE
ADD JAR
LIST ARCHIVE
LIST FILE
LIST JAR
ADD FILE
7/21/2022 • 2 minutes to read

Adds a single file or directory to the list of resources. The added resource can be listed using LIST FILE.

Syntax
ADD [ FILE | FILES ] resource_name [...]

Parameters
resource_name
The name of a file or directory to be added.

Examples
> ADD FILE /tmp/test;

> ADD FILE "/path/to/file/abc.txt";

> ADD FILE '/another/test.txt';

> ADD FILE "/path with space/abc.txt";

> ADD FILE "/path/to/some/directory" "/path with space/abc.txt";

Related statements
ADD ARCHIVE
ADD JAR
LIST FILE
LIST JAR
LIST ARCHIVE
ADD JAR
7/21/2022 • 2 minutes to read

Adds a JAR file to the list of resources. The added JAR file can be listed using LIST JAR.

Syntax
ADD [JAR | JARS] file_name [...]

Parameters
file_name
The name of a JAR file to be added. It could be either on a local file system or a distributed file system.

Examples
> ADD JAR /tmp/test.jar;

> ADD JAR "/path/to/some.jar";

> ADD JAR '/some/other.jar';

> ADD JARS "/path with space/abc.jar" "/path with space/def.jar";

Related statements
ADD ARCHIVE
ADD FILE
LIST ARCHIVE
LIST FILE
LIST JAR
LIST ARCHIVE
7/21/2022 • 2 minutes to read

Lists the ARCHIVEs added by ADD ARCHIVE.


Since: Databricks Runtime 10.0

Syntax
LIST [ARCHIVE | ARCHIVES] [file_name [...]]

Parameters
file_name
Optionally, the name of an archive to list.

Examples
> ADD ARCHIVES /tmp/test.zip /tmp/test_2.tar.gz;

> LIST ARCHIVE;


file:/tmp/test.zip
file:/tmp/test_2.tar.gz

> LIST ARCHIVE /tmp/test.zip /some/random.tgz /another/random.tar;


file:/tmp/test.zip

Related statements
ADD ARCHIVE
ADD JAR
ADD FILE
LIST FILE
LIST JAR
LIST FILE
7/21/2022 • 2 minutes to read

Lists the resources added by ADD FILE.

Syntax
LIST [ FILE | FILES ] [ resource_name [...]]

Parameters
resource_name
Optionally, the name of a file or directory to list.

Examples
> ADD FILE /tmp/test /tmp/test_2;

> LIST FILE;


file:/private/tmp/test
file:/private/tmp/test_2

> LIST FILE /tmp/test /some/random/file /another/random/file;


file:/private/tmp/test

Related statements
ADD FILE
ADD JAR
LIST JAR
LIST JAR
7/21/2022 • 2 minutes to read

Lists the JARs added by ADD JAR.

Syntax
LIST [JAR | JARS] [file_name [...]]

Parameters
file_name
Optionally, the name of a JAR file to list.

Examples
> ADD JAR /tmp/test.jar /tmp/test_2.jar;

> LIST JAR;


spark://192.168.1.112:62859/jars/test.jar
spark://192.168.1.112:62859/jars/test_2.jar

> LIST JAR /tmp/test.jar /some/random.jar /another/random.jar;


spark://192.168.1.112:62859/jars/test.jar

Related statements
ADD ARCHIVE
ADD FILE
ADD JAR
LIST ARCHIVE
LIST FILE
CACHE SELECT
7/21/2022 • 2 minutes to read

Caches the data accessed by the specified simple SELECT query in the Delta cache. You can choose a subset of
columns to be cached by providing a list of column names and choose a subset of rows by providing a
predicate. This enables subsequent queries to avoid scanning the original files as much as possible. This
construct is applicable only to Parquet tables. Views are also supported, but the expanded queries are restricted
to the simple queries, as described above.

Syntax
CACHE SELECT column_name [, ...] FROM table_name [ WHERE boolean_expression ]

See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.

Parameters
table_name
Identifies an existing table. The name must not include a temporal specification.

Examples
> CACHE SELECT * FROM boxes
> CACHE SELECT width, length FROM boxes WHERE height=3
CREATE TABLE CLONE
7/21/2022 • 2 minutes to read

Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow:
deep clones copy over the data from the source and shallow clones do not.

IMPORTANT
There are important differences between shallow and deep clones that can determine how best to use them. See Clone a
Delta table.

Syntax
CREATE TABLE [IF NOT EXISTS] table_name
[SHALLOW | DEEP] CLONE source_table_name [LOCATION path]

[CREATE OR] REPLACE TABLE table_name
[SHALLOW | DEEP] CLONE source_table_name [LOCATION path]

Parameters
IF NOT EXISTS
If specified, the statement is ignored if table_name already exists.
[CREATE OR] REPLACE
If CREATE OR is specified the table is replaced if it exists and newly created if it does not. Without
CREATE OR the table_name must exist.
table_name
The name of the Delta Lake table to be created. The name must not include a temporal specification. If the
name is not qualified the table is created in the current schema. table_name must not exist already unless
REPLACE or IF NOT EXISTS has been specified.

SHALLOW CLONE or DEEP CLONE
If you specify SHALLOW CLONE Azure Databricks will make a copy of the source table’s definition, but refer
to the source table’s files. When you specify DEEP CLONE (default) Azure Databricks will make a complete,
independent copy of the source table.
source_table_name
The name of the Delta Lake table to be cloned. The name may include a temporal specification.
LOCATION path
Optionally creates an external table, with the provided location as the path where the data is stored. If
table_name is itself a path instead of a table identifier, the operation will fail. path must be a STRING literal.
Examples
You can use CREATE TABLE CLONE for complex operations like data migration, data archiving, machine learning
flow reproduction, short-term experiments, data sharing, and so on. See Clone use cases for a few examples.
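As a quick illustration, a minimal sketch is shown below; the table names sales, sales_backup, sales_dev, and sales_archive and the location path are hypothetical.

-- Create a complete, independent copy of the source table (DEEP is the default).
> CREATE TABLE sales_backup DEEP CLONE sales;

-- Create a zero-copy clone that refers to the source table's files.
> CREATE TABLE IF NOT EXISTS sales_dev SHALLOW CLONE sales;

-- Replace an existing clone and store its data at an explicit external location.
> CREATE OR REPLACE TABLE sales_archive DEEP CLONE sales LOCATION '/mnt/archive/sales';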
CONVERT TO DELTA
7/21/2022 • 2 minutes to read

Converts an existing Parquet table to a Delta table in-place. This command lists all the files in the directory,
creates a Delta Lake transaction log that tracks these files, and automatically infers the data schema by reading
the footers of all Parquet files. The conversion process collects statistics to improve query performance on the
converted Delta table. If you provide a table name, the metastore is also updated to reflect that the table is now
a Delta table.
This command supports converting Iceberg tables whose underlying file format is Parquet. In this case, the
converter generates the Delta Lake transaction log based on the Iceberg table’s native file manifest, schema, and
partitioning information.

Syntax
CONVERT TO DELTA table_name [ NO STATISTICS ] [ PARTITIONED BY clause ]

Parameters
table_name
Either an optionally qualified table identifier or a path to a Parquet file. The name must not include a
temporal specification.
NO STATISTICS
Bypass statistics collection during the conversion process and finish conversion faster. After the table is
converted to Delta Lake, you can use OPTIMIZE ZORDER BY to reorganize the data layout and generate
statistics.
PARTITIONED BY
Partition the created table by the specified columns. Required if the data is partitioned. The conversion
process aborts and throws an exception if the directory structure does not conform to the PARTITIONED BY
specification. If you do not provide the PARTITIONED BY clause, the command assumes that the table is
not partitioned.

Examples
NOTE
You do not need to provide partitioning information for Iceberg tables or tables registered to the metastore.

CONVERT TO DELTA database_name.table_name; -- only for Parquet tables

CONVERT TO DELTA parquet.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`
  PARTITIONED BY (date DATE); -- if the table is partitioned

CONVERT TO DELTA iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`; -- uses Iceberg manifest for metadata
Caveats
Any file not tracked by Delta Lake is invisible and can be deleted when you run VACUUM . You should avoid
updating or appending data files during the conversion process. After the table is converted, make sure all
writes go through Delta Lake.
It is possible that multiple external tables share the same underlying Parquet directory. In this case, if you run
CONVERT on one of the external tables, then you will not be able to access the other external tables because their
underlying directory has been converted from Parquet to Delta Lake. To query or write to these external tables
again, you must run CONVERT on them as well.
CONVERT populates the catalog information, such as schema and table properties, to the Delta Lake transaction
log. If the underlying directory has already been converted to Delta Lake and its metadata is different from the
catalog metadata, a convertMetastoreMetadataMismatchException is thrown. If you want CONVERT to overwrite the
existing metadata in the Delta Lake transaction log, set the SQL configuration
spark.databricks.delta.convert.metadataCheck.enabled to false.

Undo the conversion

If you have performed Delta Lake operations such as DELETE or OPTIMIZE that can change the data files:
1. Run the following command for garbage collection:

> VACUUM delta.`<path-to-table>` RETAIN 0 HOURS

2. Delete the <path-to-table>/_delta_log directory.

Related articles
PARTITIONED BY
VACUUM
COPY INTO
7/21/2022 • 15 minutes to read

Loads data from a file location into a Delta table. This is a retriable and idempotent operation—files in the
source location that have already been loaded are skipped. For examples, see Common data loading patterns
with COPY INTO.

Syntax
COPY INTO target_table
FROM { source |
( SELECT expression_list FROM source ) }
[ WITH (
[ CREDENTIAL { credential_name |
(temporary_credential_options) } ]
[ ENCRYPTION (encryption_options) ])
]
FILEFORMAT = data_source
[ VALIDATE [ ALL | num_rows ROWS ] ]
[ FILES = ( file_name [, ...] ) | PATTERN = regex_pattern ]
[ FORMAT_OPTIONS ( { data_source_reader_option = value } [, ...] ) ]
[ COPY_OPTIONS ( { copy_option = value } [, ...] ) ]

Parameters
target_table
Identifies an existing Delta table. The target_table must not include a temporal specification.
If the table name is provided in the form of a location, such as: delta.`/path/to/table` , Unity Catalog can
govern access to the locations that are being written to. You can write to an external location by:
Defining the location as an external location and having WRITE FILES permissions on that external
location.
Having WRITE FILES permissions on a named storage credential that provide authorization to write to
a location using: COPY INTO delta.`/some/location` WITH (CREDENTIAL <named_credential>)
See Manage external locations and storage credentials for more details.

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

source
The file location to load the data from. Files in this location must have the format specified in FILEFORMAT .
The location is provided in the form of a URI.
Access to the source location can be provided through:
credential_name
Optional name of the credential used to access or write to the storage location. You use this
credential only if the file location isn’t included in an external location.
Inline temporary credentials.
Defining the source location as an external location and having READ FILES permissions on the
external location through Unity Catalog.
Using a named storage credential with READ FILES permissions that provide authorization to read
from a location through Unity Catalog.

IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.

You don’t need to provide inline or named credentials if the path is already defined as an external location
that you have permissions to use. See Manage external locations and storage credentials for more details.

NOTE
If the source file path is a root path, add a slash ( / ) at the end of the file path, for example, s3://my-bucket/ .

Accepted credential options are:


AZURE_SAS_TOKEN for ADLS Gen2 and Azure Blob Storage
AWS_ACCESS_KEY , AWS_SECRET_KEY , and AWS_SESSION_TOKEN for AWS S3
Accepted encryption options are:
TYPE = 'AWS_SSE_C' , and MASTER_KEY for AWS S3

See Use temporary credentials to load data with COPY INTO.


SELECT expression_list
Selects the specified columns or expressions from the source data before copying into the Delta table.
The expressions can be anything you use with SELECT statements, including window operations. You can
use aggregation expressions only for global aggregates–you cannot GROUP BY on columns with this
syntax.
FILEFORMAT = data_source
The format of the source files to load. One of CSV , JSON , AVRO , ORC , PARQUET , TEXT , BINARYFILE .
VALIDATE
The data that is to be loaded into a table is validated but not written to the table. These validations
include:
Whether the data can be parsed.
Whether the schema matches that of the table or if the schema needs to be evolved.
Whether all nullability and check constraints are met.
The default is to validate all of the data that is to be loaded. You can provide a number of rows to be
validated with the ROWS keyword, such as VALIDATE 15 ROWS . The COPY INTO statement returns a preview
of the data of 50 rows or less when a number less than 50 is used with the ROWS keyword.

NOTE
VALIDATE mode is available in Databricks Runtime 10.3 and above.
FILES
A list of file names to load, with length up to 1000. Cannot be specified with PATTERN .
PATTERN
A regex pattern that identifies the files to load from the source directory. Cannot be specified with FILES .
FORMAT_OPTIONS
Options to be passed to the Apache Spark data source reader for the specified format. See Format
options for each file format.
COPY_OPTIONS
Options to control the operation of the COPY INTO command.
force: boolean, default false . If set to true , idempotency is disabled and files are loaded
regardless of whether they’ve been loaded before.
mergeSchema : boolean, default false . If set to true , the schema can be evolved according to the
incoming data. To evolve the schema of a table, you must have OWN permissions on the table.

NOTE
mergeSchema option is available in Databricks Runtime 10.3 and above.
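
As a quick end-to-end illustration, a minimal sketch follows; the table name my_table, the container, storage account, path, and file pattern are hypothetical.

-- Load new CSV files matching a pattern into an existing Delta table.
> COPY INTO my_table
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
  FILEFORMAT = CSV
  PATTERN = 'folder1/file_[a-g].csv'
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');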

Access file metadata


To learn how to access metadata for file-based data sources, see File metadata column.

Format options
Generic options
JSON options
CSV options
PARQUET options
AVRO options
BINARYFILE options
TEXT options
ORC options

Generic options
The following options apply to all file formats.


ignoreCorruptFiles

Type: Boolean

Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents
that have been read will still be returned. Observable as numSkippedCorruptFiles in the
operationMetrics column of the Delta Lake history. Available in Databricks Runtime 11.0 and above.

Default value: false



ignoreMissingFiles

Type: Boolean

Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents
that have been read will still be returned. Available in Databricks Runtime 11.0 and above.

Default value: false ( true for COPY INTO )

modifiedAfter

Type: Timestamp String , for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.

Default value: None

modifiedBefore

Type: Timestamp String , for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp before the provided timestamp.

Default value: None

pathGlobFilter

Type: String

A potential glob pattern to provide for choosing files. Equivalent to PATTERN in COPY INTO .

Default value: None

recursiveFileLookup

Type: Boolean

Whether to load data recursively within the base directory and skip partition inference.

Default value: false
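
To show how these generic options are passed, a minimal sketch using FORMAT_OPTIONS in COPY INTO follows; the table name my_table, the storage path, and the option values are hypothetical.

-- Restrict the load to JSON files modified after a given timestamp.
> COPY INTO my_table
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/events'
  FILEFORMAT = JSON
  FORMAT_OPTIONS ('modifiedAfter' = '2021-01-01 00:00:00.000000 UTC+0', 'pathGlobFilter' = '*.json');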

JSON options


allowBackslashEscapingAnyCharacter

Type: Boolean

Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed
by the JSON specification can be escaped.

Default value: false



allowComments

Type: Boolean

Whether to allow the use of Java, C, and C++ style comments ( '/' , '*' , and '//' varieties) within parsed content or
not.

Default value: false

allowNonNumericNumbers

Type: Boolean

Whether to allow the set of not-a-number ( NaN ) tokens as legal floating number values.

Default value: true

allowNumericLeadingZeros

Type: Boolean

Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001).

Default value: false

allowSingleQuotes

Type: Boolean

Whether to allow use of single quotes (apostrophe, character '\'' ) for quoting strings (names and String values).

Default value: true

allowUnquotedControlChars

Type: Boolean

Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab
and line feed characters) or not.

Default value: false

allowUnquotedFieldNames

Type: Boolean

Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification).

Default value: false

badRecordsPath

Type: String

The path to store files for recording the information about bad JSON records.

Default value: None



columnNameOfCorruptRecord

Type: String

The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.

Default value: _corrupt_record

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd

dropFieldIfAllNull

Type: Boolean

Whether to ignore columns of all null values or empty arrays and structs during schema inference.

Default value: false

encoding or charset

Type: String

The name of the encoding of the JSON files. See java.nio.charset.Charset for list of options. You cannot use UTF-16 and
UTF-32 when multiline is true .

Default value: UTF-8

inferTimestamp

Type: Boolean

Whether to try and infer timestamp strings as a TimestampType . When set to true , schema inference may take noticeably longer.

Default value: false

lineSep

Type: String

A string between two consecutive JSON records.

Default value: None, which covers \r , \r\n , and \n



locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the JSON.

Default value: US

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE' , 'DROPMALFORMED' , or 'FAILFAST' .

Default value: PERMISSIVE

multiLine

Type: Boolean

Whether the JSON records span multiple lines.

Default value: false

prefersDecimal

Type: Boolean

Whether to infer floats and doubles as DecimalType during schema inference.

Default value: false

primitivesAsString

Type: Boolean

Whether to infer primitive types like numbers and booleans as StringType .

Default value: false

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to
a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]



timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None

CSV options


badRecordsPath

Type: String

The path to store files for recording the information about bad CSV records.

Default value: None

charToEscapeQuoteEscaping

Type: Char

The character used to escape the character used for escaping quotes. For example, for the following record: [ " a\\", b ] :

* If the character to escape the '\' is undefined, the record won’t be parsed. The parser will read characters:
[a],[\],["],[,],[ ],[b] and throw an error because it cannot find a closing quote.
* If the character to escape the '\' is defined as '\' , the record will be read with 2 values: [a\] and [b] .

Default value: '\0'

columnNameOfCorruptRecord

Type: String

A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.

Default value: _corrupt_record

comment

Type: Char

Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable
comment skipping.

Default value: '\u0000'

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd



emptyValue

Type: String

String representation of an empty value.

Default value: ""

encoding or charset

Type: String

The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. UTF-16 and UTF-32
cannot be used when multiline is true .

Default value: UTF-8

enforceSchema

Type: Boolean

Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are
ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution.

Default value: true

escape

Type: Char

The escape character to use when parsing the data.

Default value: '\'

header

Type: Boolean

Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema.

Default value: false

ignoreLeadingWhiteSpace

Type: Boolean

Whether to ignore leading whitespaces for each parsed value.

Default value: false



ignoreTrailingWhiteSpace

Type: Boolean

Whether to ignore trailing whitespaces for each parsed value.

Default value: false

inferSchema

Type: Boolean

Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType . Requires an
additional pass over the data if set to true .

Default value: false

lineSep

Type: String

A string between two consecutive CSV records.

Default value: None, which covers \r , \r\n , and \n

locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the CSV.

Default value: US

maxCharsPerColumn

Type: Int

Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1 , which
means unlimited.

Default value: -1

maxColumns

Type: Int

The hard limit of how many columns a record can have.

Default value: 20480



mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader
when inferring the schema.

Default value: false

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE' , 'DROPMALFORMED' , and 'FAILFAST' .

Default value: PERMISSIVE

multiLine

Type: Boolean

Whether the CSV records span multiple lines.

Default value: false

nanValue

Type: String

The string representation of a not-a-number value when parsing FloatType and DoubleType columns.

Default value: "NaN"

negativeInf

Type: String

The string representation of negative infinity when parsing FloatType or DoubleType columns.

Default value: "-Inf"

nullValue

Type: String

String representation of a null value.

Default value: ""



parserCaseSensitive (deprecated)

Type: Boolean

While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default
for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. This option has been
deprecated in favor of readerCaseSensitive .

Default value: false

positiveInf

Type: String

The string representation of positive infinity when parsing FloatType or DoubleType columns.

Default value: "Inf"

quote

Type: Char

The character used for escaping values where the field delimiter is part of the value.

Default value: '"'

readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details refer to Rescued data
column.

Default value: None

sep or delimiter

Type: String

The separator string between columns.

Default value: ","



skipRows

Type: Int

The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If
header is true, the header will be the first unskipped and uncommented row.

Default value: 0

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]

timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None

unescapedQuoteHandling

Type: String

The strategy for handling unescaped quotes. Allowed options:

* STOP_AT_CLOSING_QUOTE : If unescaped quotes are found in the input, accumulate the quote character and proceed parsing
the value as a quoted value, until a closing quote is found.
* BACK_TO_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters of the current parsed value until the delimiter defined by sep is found. If no delimiter is
found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.
* STOP_AT_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters until the delimiter defined by sep , or a line ending is found in the input.
* SKIP_VALUE : If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the
next delimiter is found) and the value set in nullValue will be produced instead.
* RAISE_ERROR : If unescaped quotes are found in the input, a
TextParsingException will be thrown.

Default value: STOP_AT_DELIMITER

PARQUET options

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .

Default value: LEGACY

int96RebaseMode

Type: String

Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false

readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details refer to Rescued data
column.

Default value: None

AVRO options

avroSchema

Type: String

Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is
compatible with, but different from, the actual Avro schema. The deserialization schema will be consistent with the evolved schema.
For example, if you set an evolved schema containing one additional column with a default value, the read result will contain
the new column too.

Default value: None

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.
mergeSchema for Avro does not relax data types.

Default value: false

readerCaseSensitive

Type: Boolean

Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.

Default value: true

rescuedDataColumn

Type: String

Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details refer to Rescued data
column.

Default value: None

BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options

encoding

Type: String

The name of the encoding of the TEXT files. See java.nio.charset.Charset for list of options.

Default value: UTF-8

lineSep

Type: String

A string between two consecutive TEXT records.

Default value: None, which covers \r , \r\n and \n

wholeText

Type: Boolean

Whether to read a file as a single record.

Default value: false

ORC options


mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false

Related articles
Credentials
DELETE
INSERT
MERGE
PARTITION
query
UPDATE
CREATE BLOOM FILTER INDEX (Delta Lake on
Azure Databricks)
7/21/2022 • 2 minutes to read

Creates a Bloom filter index for new or rewritten data; it does not create Bloom filters for existing data. The
command fails if either the table name or one of the columns does not exist. If Bloom filtering is enabled for a
column, existing Bloom filter options are replaced by the new options.

Syntax
CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]

options
OPTIONS ( { key1 [ = ] val1 } [, ...] )

Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command
updates Bloom filters for data that is reorganized. Therefore, you can backfill a Bloom filter by running OPTIMIZE
on a table:
If you have not previously optimized the table.
With a different file size, requiring that the data files be re-written.
With a ZORDER (or a different ZORDER , if one is already present), requiring that the data files be re-written.

You can tune the Bloom filter by defining options at the column level or at the table level:
fpp : False positive probability. The desired false positive rate per written Bloom filter. This influences the
number of bits needed to put a single item in the Bloom filter and influences the size of the Bloom filter. The
value must be larger than 0 and smaller than or equal to 1. The default value is 0.1 which requires 5 bits per
item.
numItems : Number of distinct items the file can contain. This setting is important for the quality of filtering as
it influences the total number of bits used in the Bloom filter (number of items * number of bits per item). If
this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing
queries that must download this file, or it is too full and is less accurate (higher FPP). The value must be
larger than 0. The default is 1 million items.
maxExpectedFpp : The expected FPP threshold for which a Bloom filter is not written to disk. The maximum
expected false positive probability at which a Bloom filter is written. If the expected FPP is larger than this
threshold, the Bloom filter’s selectivity is too low; the time and resources it takes to use the Bloom filter
outweighs its usefulness. The value must be between 0 and 1. The default is 1.0 (disabled).
These options play a role only when writing the data. You can configure these properties at various hierarchical
levels: write operation, table level, and column level. The column level takes precedence over the table and
operation levels, and the table level takes precedence over the operation level.
See Bloom filter indexes.
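
Examples
As a quick illustration, a minimal sketch follows; the table sales, the column sale_id, and the option values are hypothetical.

-- Create a Bloom filter index for data written from now on.
> CREATE BLOOMFILTER INDEX
  ON TABLE sales
  FOR COLUMNS(sale_id OPTIONS (fpp=0.1, numItems=50000000));

-- Backfill the filter for existing data by rewriting the data files, as described above.
> OPTIMIZE sales ZORDER BY (sale_id);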

Related articles
DROP BLOOMFILTER INDEX
DELETE FROM
7/21/2022 • 2 minutes to read

Deletes the rows that match a predicate. When no predicate is provided, deletes all rows.
This statement is only supported for Delta Lake tables.

Syntax
DELETE FROM table_name [table_alias] [WHERE predicate]

Parameters
table_name
Identifies an existing table. The name must not include a temporal specification.
table_alias
Define an alias for the table. The alias must not include a column list.
WHERE
Filter rows by predicate.
The WHERE predicate supports subqueries, including IN , NOT IN , EXISTS , NOT EXISTS , and scalar
subqueries. The following types of subqueries are not supported:
Nested subqueries, that is, a subquery inside another subquery
NOT IN subquery inside an OR , for example, a = 3 OR b NOT IN (SELECT c from t)
In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . We recommend using NOT EXISTS
whenever possible, as DELETE with NOT IN subqueries can be slow.
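
For instance, a sketch of such a rewrite is shown below; the tables events and events2 mirror the examples that follow, and the two forms delete the same rows only when the subquery column contains no NULLs.

-- NOT IN form (can be slow):
> DELETE FROM events WHERE category NOT IN (SELECT category FROM events2);

-- NOT EXISTS form (usually faster):
> DELETE FROM events AS t
  WHERE NOT EXISTS (SELECT 1 FROM events2 AS s WHERE s.category = t.category);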

Examples
> DELETE FROM events WHERE date < '2017-01-01'

> DELETE FROM all_events
  WHERE session_time < (SELECT min(session_time) FROM good_events)

> DELETE FROM orders AS t1
  WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid)

> DELETE FROM events
  WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01')

Related articles
COPY
INSERT
MERGE
PARTITION
query
UPDATE
DESCRIBE HISTORY (Delta Lake on Azure
Databricks)
7/21/2022 • 2 minutes to read

Returns provenance information, including the operation, user, and so on, for each write to a table. Table history
is retained for 30 days.

Syntax
DESCRIBE HISTORY table_name

Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
See Retrieve Delta table history for details.
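
Examples
As a quick illustration, a minimal sketch; the table name events and the path are hypothetical.

> DESCRIBE HISTORY events;

-- A path-based Delta table works as well.
> DESCRIBE HISTORY delta.`/mnt/delta/events`;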
DROP BLOOM FILTER INDEX (Delta Lake on Azure
Databricks)
7/21/2022 • 2 minutes to read

Drops a Bloom filter index.

Syntax
DROP BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS(columnName1 [, ...] ) ]

Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
The command fails if either the table name or one of the columns does not exist. All Bloom filter related
metadata is removed from the specified columns.
When a table does not have any Bloom filters, the underlying index files are cleaned when the table is
vacuumed.
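
Examples
As a quick illustration, a minimal sketch; the table sales and the column sale_id are hypothetical.

> DROP BLOOMFILTER INDEX ON TABLE sales FOR COLUMNS(sale_id);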

Related articles
CREATE BLOOMFILTER INDEX
FSCK REPAIR TABLE
7/21/2022 • 2 minutes to read

Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying
file system. This can happen when these files have been manually deleted.

Syntax
FSCK REPAIR TABLE table_name [DRY RUN]

Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
DRY RUN
Return a list of files to be removed from the transaction log.
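
Examples
As a quick illustration, a minimal sketch; the table name events is hypothetical.

-- Preview which missing file entries would be removed, without changing the transaction log.
> FSCK REPAIR TABLE events DRY RUN;

-- Remove the entries for files that no longer exist in the underlying file system.
> FSCK REPAIR TABLE events;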
MERGE INTO
7/21/2022 • 4 minutes to read

Merges a set of updates, insertions, and deletions based on a source table into a target Delta table.
This statement is supported only for Delta Lake tables.

Syntax
MERGE INTO target_table_name [target_alias]
USING source_table_reference [source_alias]
ON merge_condition
[ WHEN MATCHED [ AND condition ] THEN matched_action ] [...]
[ WHEN NOT MATCHED [ AND condition ] THEN not_matched_action ] [...]

matched_action
{ DELETE |
UPDATE SET * |
UPDATE SET { column1 = value1 } [, ...] }

not_matched_action
{ INSERT * |
INSERT (column1 [, ...] ) VALUES (value1 [, ...]) }

Parameters
target_table_name
A Table name identifying the table being modified. The table referenced must be a Delta table.
target_alias
A Table alias for the target table. The alias must not include a column list.
source_table_reference
A Table name identifying the source table to be merged into the target table.
source_alias
A Table alias for the source table. The alias must not include a column list.
merge_condition
How the rows from one relation are combined with the rows of another relation. An expression with a
return type of BOOLEAN.
condition
A Boolean expression which must be true to satisfy the WHEN MATCHED or WHEN NOT MATCHED clause.
matched_action
There can be any number of WHEN MATCHED and WHEN NOT MATCHED clauses each, but at least one clause is
required. Multiple matches are allowed when matches are unconditionally deleted (since unconditional
delete is not ambiguous even if there are multiple matches).
WHEN MATCHED clauses are executed when a source row matches a target table row based on the
match condition. These clauses have the following semantics.
WHEN MATCHED clauses can have at most one UPDATE and one DELETE action. The UPDATE action in
merge only updates the specified columns of the matched target row. The DELETE action will
delete the matched row.
Each WHEN MATCHED clause can have an optional condition. If this clause condition exists, the
UPDATE or DELETE action is executed for any matching source-target row pair only when the
clause condition is true.
If there are multiple WHEN MATCHED clauses, then they are evaluated in the order they are specified.
All WHEN MATCHED clauses, except the last one, must have conditions.
If none of the WHEN MATCHED conditions evaluate to true for a source and target row pair that
matches the merge condition, then the target row is left unchanged.
To update all the columns of the target Delta table with the corresponding columns of the source
dataset, use UPDATE SET * . This is equivalent to
UPDATE SET col1 = source.col1 [, col2 = source.col2 ...] for all the columns of the target Delta
table. Therefore, this action assumes that the source table has the same columns as those in the
target table, otherwise the query will throw an analysis error.
This behavior changes when automatic schema migration is enabled. See Automatic schema
evolution for details.
WHEN NOT MATCHED clauses are executed when a source row does not match any target row based on the
match condition. These clauses have the following semantics.
WHEN NOT MATCHED clauses can only have the INSERT action. The new row is generated based on
the specified column and corresponding expressions. All the columns in the target table do not
need to be specified. For unspecified target columns, NULL is inserted.
Each WHEN NOT MATCHED clause can have an optional condition. If the clause condition is present, a
source row is inserted only if that condition is true for that row. Otherwise, the source row is
ignored.
If there are multiple WHEN NOT MATCHED clauses, then they are evaluated in the order they are
specified. All WHEN NOT MATCHED clauses, except the last one, must have conditions.
To insert all the columns of the target Delta table with the corresponding columns of the source
dataset, use INSERT * . This is equivalent to
INSERT (col1 [, col2 ...]) VALUES (source.col1 [, source.col2 ...]) for all the columns of the
target Delta table. Therefore, this action assumes that the source table has the same columns as
those in the target table, otherwise the query will throw an analysis error.

NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.
IMPORTANT
A MERGE operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the
target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear
which source row should be used to update the matched target row. You can preprocess the source table to eliminate the
possibility of multiple matches. See the Change data capture example—it preprocesses the change dataset (that is, the
source dataset) to retain only the latest change for each key before applying that change into the target Delta table.

Examples
You can use MERGE INTO for complex operations like deduplicating data, upserting change data, applying SCD
Type 2 operations, etc. See Merge examples for a few examples.
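For illustration, a minimal upsert sketch follows; the events and updates table names and the eventId key are hypothetical and not part of the syntax above.

MERGE INTO events AS target
USING updates AS source
ON target.eventId = source.eventId
WHEN MATCHED THEN
  UPDATE SET *       -- update every column of the matched target row from the source row
WHEN NOT MATCHED THEN
  INSERT *           -- insert unmatched source rows with all of their columns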

Related articles
DELETE
INSERT INTO
UPDATE
OPTIMIZE (Delta Lake on Azure Databricks)

Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you
do not specify colocation, bin-packing optimization is performed.

Syntax
OPTIMIZE table_name [WHERE predicate]
[ZORDER BY (col_name1 [, ...] ) ]

NOTE
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect. It aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of
tuples per file. However, the two measures are most often correlated.
Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not
guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-Ordered,
another Z-Ordering of that partition will not have any effect. It aims to produce evenly-balanced data files with respect
to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there
can be situations when that is not the case, leading to skew in optimize task times.
To control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize . The
default value is 1073741824 , which sets the size to 1 GB. Specifying the value 104857600 sets the file size to 100
MB.
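As a hedged illustration of this setting (the events table name is hypothetical), you could lower the target file size for the current session and then run the command:

-- Write roughly 100 MB files during OPTIMIZE in this session only.
SET spark.databricks.delta.optimize.maxFileSize = 104857600;

OPTIMIZE events ZORDER BY (eventType);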

Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
WHERE

Optimize the subset of rows matching the given partition predicate. Only filters involving partition key
attributes are supported.
ZORDER BY

Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping
algorithms to dramatically reduce the amount of data that needs to be read. You can specify multiple
columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with
each additional column.

Examples
OPTIMIZE events

OPTIMIZE events WHERE date >= '2017-01-01'

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

For more information about the OPTIMIZE command, see Optimize performance with file management.
REORG TABLE

Reorganize a Delta Lake table by rewriting files to purge soft-deleted data, such as the column data dropped by
ALTER TABLE DROP COLUMN.

Syntax
REORG TABLE table_name [WHERE predicate] APPLY (PURGE)

NOTE
REORG TABLE only rewrites files that actually contain soft-deleted data.
REORG TABLE is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect.
After running REORG TABLE, the soft-deleted data may still exist in the old files. You can run VACUUM to physically
delete the old files.

Since: Databricks Runtime 11.0

Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
WHERE predicate
Reorganizes the files that match the given partition predicate. Only filters involving partition key
attributes are supported.
APPLY (PURGE)

Specifies that the purpose of file rewriting is to purge soft-deleted data.

Examples
> REORG TABLE events APPLY (PURGE);

> REORG TABLE events WHERE date >= '2022-01-01' APPLY (PURGE);

> REORG TABLE events
  WHERE date >= current_timestamp() - INTERVAL '1' DAY
  APPLY (PURGE);
RESTORE (Delta Lake on Azure Databricks)

NOTE
Available in Databricks Runtime 7.4 and above.

Restores a Delta table to an earlier state. Restoring to an earlier version number or a timestamp is supported.

Syntax
RESTORE [TABLE] table_name [TO] time_travel_version

Parameters
table_name
Identifies the Delta table to be restored. The table name must not use a temporal specification.

time_travel_version
{ TIMESTAMP AS OF timestamp_expression |
VERSION AS OF version }

where
timestamp_expression can be any one of:
'2018-10-18T22:15:12.013Z' , that is, a string that can be cast to a timestamp
cast('2018-10-18 13:36:32 CEST' as timestamp)
'2018-10-18' , that is, a date string
In Databricks Runtime 6.6 and above:
current_timestamp() - interval 12 hours
date_sub(current_date(), 1)
Any other expression that is or can be cast to a timestamp
version is a long value that can be obtained from the output of DESCRIBE HISTORY table_spec .
Neither timestamp_expression nor version can be subqueries.
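For example, both restore forms might look like the following sketch (the events table name, version number, and timestamp are hypothetical):

RESTORE TABLE events TO VERSION AS OF 12;

RESTORE TABLE events TO TIMESTAMP AS OF '2022-07-01 00:00:00';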
For more information about the RESTORE command, see Restore a Delta table.
UPDATE

Updates the column values for the rows that match a predicate. When no predicate is provided, update the
column values for all rows.
This statement is only supported for Delta Lake tables.

Syntax
UPDATE table_name [table_alias]
SET { { column_name | field_name } = expr } [, ...]
[WHERE clause]

Parameters
table_name
Identifies the table to be updated. The table name must not use a temporal specification.
table_alias
Define an alias for the table. The alias must not include a column list.
column_name
A reference to a column in the table. You may reference each column at most once.
field_name
A reference to field within a column of type STRUCT. You may reference each field at most once.
expr
An arbitrary expression. If you reference table_name columns, they represent the state of the row prior
to the update.
WHERE
Filter rows by predicate. The WHERE clause may include subqueries with the following exceptions:
Nested subqueries, that is, a subquery inside another subquery
A NOT IN subquery inside an OR , for example, a = 3 OR b NOT IN (SELECT c from t)
In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . You should use NOT EXISTS
whenever possible, as UPDATE with NOT IN subqueries can be slow.

Examples
> UPDATE events SET eventType = 'click' WHERE eventType = 'clk'

> UPDATE all_events
  SET session_time = 0, ignored = true
  WHERE session_time < (SELECT min(session_time) FROM good_events)

> UPDATE orders AS t1
  SET order_status = 'returned'
  WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid)

> UPDATE events
  SET category = 'undefined'
  WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01')
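As a hedged sketch of the NOT EXISTS rewrite recommended above (the result can differ from NOT IN when the subquery returns NULL values), the last example could be expressed as:

> UPDATE events
  SET category = 'undefined'
  WHERE NOT EXISTS (SELECT 1 FROM events2
                    WHERE events2.category = events.category
                      AND events2.date > '2001-01-01')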

Related articles
COPY
DELETE
INSERT
MERGE
PARTITION
query
VACUUM

Remove unused files from a table directory.

NOTE
This command works differently depending on whether you’re working on a Delta or Apache Spark table.

Vacuum a Delta table (Delta Lake on Azure Databricks)


Recursively vacuum directories associated with the Delta table. VACUUM removes all files from the table directory
that are not managed by Delta, as well as data files that are no longer in the latest state of the transaction log for
the table and are older than a retention threshold. VACUUM will skip all directories that begin with an underscore
( _ ), which includes the _delta_log . Partitioning your table on a column that begins with an underscore is an
exception to this rule; VACUUM scans all valid partitions included in the target Delta table. Delta table data files
are deleted according to the time they have been logically removed from Delta’s transaction log plus retention
hours, not their modification timestamps on the storage system. The default threshold is 7 days.
On Delta tables, Azure Databricks does not automatically trigger VACUUM operations. See Remove files no longer
referenced by a Delta table.
If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified
data retention period.

WARNING
It is recommended that you set a retention interval to be at least 7 days, because old snapshots and uncommitted files
can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can
fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an
interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag
behind the most recent update to the table.

Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain
that there are no operations being performed on this table that take longer than the retention interval you plan
to specify, you can turn off this safety check by setting the Spark configuration property
spark.databricks.delta.retentionDurationCheck.enabled to false .
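A hedged sketch of disabling the check for the current session only; do this only when you are certain that no concurrent operation exceeds the retention interval you plan to use:

SET spark.databricks.delta.retentionDurationCheck.enabled = false;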

VACUUM table_name [RETAIN num HOURS] [DRY RUN]

Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
RETAIN num HOURS
The retention threshold.
DRY RUN
Return a list of files to be deleted.
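For example (the events table name is hypothetical), you could preview and then run a vacuum with a 7-day (168-hour) threshold:

-- List the files that would be deleted, without deleting them.
VACUUM events RETAIN 168 HOURS DRY RUN;

-- Delete the files.
VACUUM events RETAIN 168 HOURS;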

Vacuum a Spark table (Apache Spark)


Recursively vacuums directories associated with the Spark table and remove uncommitted files older than a
retention threshold. The default threshold is 7 days.
On Spark tables, Azure Databricks automatically triggers VACUUM operations as data is written. See Clean up
uncommitted files.
Syntax

VACUUM table_name [RETAIN num HOURS]

Parameters
table_name
Identifies an existing table by name or path.
RETAIN num HOURS
The retention threshold.
DENY

Denies a privilege on a securable object to a principal. Denying a privilege takes precedence over any explicit or
implicit grant.
Denying a privilege on a schema (for example a SELECT privilege) has the effect of implicitly denying that
privilege on all objects in that schema. Denying a specific privilege on the catalog implicitly denies that privilege
on all schemas in the catalog.
NOTE
This statement applies only to the hive_metastore catalog and its objects.
IMPORTANT
To undo a DENY, REVOKE the same privilege from the principal.

Syntax
DENY privilege_types ON securable_object TO principal

privilege_types
{ ALL PRIVILEGES |
privilege_type [, ...] }

Parameters
privilege_types
This identifies one or more privileges the principal is denied.
ALL PRIVILEGES

Deny all privileges applicable to the securable_object .


privilege_type
A specific privilege to deny to the principal on this securable_object .

securable_object
The object on which the privileges are denied to the principal.
principal
The user or group whose privileges are denied.

Example
-- Deny Alf the right to query `t`.
> DENY SELECT ON TABLE t TO `alf@melmak.et`;

-- Undo the `DENY`.
> REVOKE SELECT ON TABLE t FROM `alf@melmak.et`;

Related
GRANT
REPAIR PRIVILEGES
REVOKE
SHOW GRANTS
ALTER GROUP

Alters a workspace level group by either adding or dropping users and groups as members.

Syntax
ALTER GROUP parent_principal { ADD | DROP }
{ GROUP group_principal [, ...] |
USER user_principal [, ...] } [...]

Parameters
parent_principal
The name of the workspace level group to be altered.
group_principal
A list of workspace level subgroups to ADD or DROP.
user_principal
A list of workspace level users to ADD or DROP.

Examples
-- Creates a group named `aliens` containing a user `alf@melmak.et`.
CREATE GROUP aliens WITH USER `alf@melmak.et`;

-- Alters the group aliens and adds `tv_aliens` as a member.
ALTER GROUP aliens ADD GROUP tv_aliens;

-- Alters the group aliens and drops `alf@melmak.et` as a member.
ALTER GROUP aliens DROP USER `alf@melmak.et`;

-- Alters the group tv_aliens and adds `alf@melmak.et` as a member.
ALTER GROUP tv_aliens ADD USER `alf@melmak.et`;

Related articles
SHOW GROUPS
DROP GROUP
CREATE GROUP
CREATE GROUP

Creates a workspace level group with the specified name, optionally including a list of users and groups.

Syntax
CREATE GROUP group_principal
[ WITH
[ USER user_principal [, ...] ]
[ GROUP subgroup_principal [, ...] ]
]

Parameters
group_principal
The name of the workspace-level group to be created.
user_principal
A workspace level user to include as a member of the group.
subgroup_principal
A workspace level subgroup to include as a member of the group.

Examples
-- Create an empty group.
CREATE GROUP humans;

-- Create tv_aliens with Alf and Thor as members.
CREATE GROUP tv_aliens WITH USER `alf@melmak.et`, `thor@asgaard.et`;

-- Create aliens with Hilo and tv_aliens as members.
CREATE GROUP aliens WITH USER `hilo@jannus.et` GROUP tv_aliens;

Related articles
SHOW GROUPS
ALTER GROUP
DROP GROUP
principals
DROP GROUP

Drops a workspace level group. An exception is thrown if the group does not exist in the system.

Syntax
DROP GROUP group_principal

Parameters
group_principal
The name of the existing workspace level group to drop.

Examples
-- Create `aliens` Group
> CREATE GROUP aliens WITH GROUP tv_aliens;

-- Drop `aliens` group
> DROP GROUP aliens;

Related articles
SHOW GROUPS
ALTER GROUP
CREATE GROUP
principal
GRANT

Grants a privilege on a securable object to a principal.

NOTE
Modifying access to the samples catalog is not supported. This catalog is available to all workspaces, but is read-only.

Use GRANT ON SHARE to grant recipients access to shares.

Syntax
GRANT privilege_types ON securable_object TO principal

privilege_types
{ ALL PRIVILEGES |
privilege_type [, ...] }

Parameters
privilege_types
This identifies one or more privileges to be granted to the principal .
ALL PRIVILEGES

Grant all privileges applicable to the securable_object .


privilege_type
A specific privilege to be granted on the securable_object to the principal .
securable_object
The object on which the privileges are granted to the principal.
principal
A user or group to which the privileges are granted.

Examples
> GRANT CREATE ON SCHEMA <schema-name> TO `alf@melmak.et`;

> GRANT ALL PRIVILEGES ON TABLE forecasts TO finance;

> GRANT SELECT ON TABLE sample_data TO USERS;

Related articles
GRANT ON SHARE
REPAIR PRIVILEGES
REVOKE
SHOW GRANTS
GRANT ON SHARE

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Grants access to a share to a recipient.


Since: Databricks Runtime 10.3

Syntax
GRANT SELECT ON SHARE share_name TO RECIPIENT recipient_name

Parameters
share_name
The name of the share to which the recipient is granted access. If the share does not exist, an error is
raised.
recipient_name
The name of the recipient to which access to the share is granted. If the recipient does not exist, an error is
raised.
Examples

> GRANT SELECT ON SHARE vaccines TO RECIPIENT jab_me_now_corp;

Related articles
REVOKE ON SHARE
REVOKE

Revokes an explicitly granted or denied privilege on a securable object from a principal.

NOTE
Modifying access to the samples catalog is not supported. This catalog is available to all workspaces, but is read-only.

Use REVOKE ON SHARE to revoke access on shares from recipients.

Syntax
REVOKE privilege_types ON securable_object FROM principal

privilege_types
{ ALL PRIVILEGES |
privilege_type [, ...] }

Parameters
privilege_types
This identifies one or more privileges to be revoked from the principal .
ALL PRIVILEGES

Revoke all privileges applicable to the securable_object .


privilege_type
The specific privilege to be revoked on the securable_object from the principal .
securable_object
The object on which the privileges are granted to the principal.
principal
A user or group from which the privileges are revoked.

Examples
> REVOKE ALL PRIVILEGES ON SCHEMA default FROM `alf@melmak.et`;

> REVOKE SELECT ON TABLE t FROM aliens;

Related articles
GRANT
REPAIR PRIVILEGES
REVOKE ON SHARE
SHOW GRANTS
REVOKE ON SHARE

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Revokes access to a share from a recipient.


Since: Databricks Runtime 10.3

Syntax
REVOKE SELECT ON SHARE share_name FROM RECIPIENT recipient_name

Parameters
share_name
The name of the share from which the recipient is revoked access. If the share does not exist an error is
raised.
recipient_name
The name of the recipient from which access to the share is revoked. If the recipient does not exist an
error is raised.
Examples

> REVOKE SELECT ON SHARE vaccines FROM RECIPIENT jab_me_now_corp;

Related articles
GRANT ON SHARE
REVOKE
SHOW GRANTS

Displays all privileges (inherited, denied, and granted) that affect the securable object.
To run this command you must be either:
A workspace administrator or the owner of the object.
The user specified in principal .
Use SHOW GRANTS TO RECIPIENT to list which shares a recipient has access to.

Syntax
SHOW GRANTS [ principal ] ON securable_object

You can also use GRANT as an alternative for GRANTS .

Parameters
principal
An optional user or group for which to show the privileges granted or denied. If not specified SHOW will
return privileges for all principals who have privileges on the object.
securable_object
The object whose privileges to show.

Example
> SHOW GRANTS `alf@melmak.et` ON SCHEMA my_schema;
principal privilege
------------- --------
alf@melmak.et USE

> SHOW GRANTS ON SHARE some_share;


recipient privilege
--------- ---------
A_Corp SELECT
B.com SELECT

Related articles
GRANT
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
REPAIR PRIVILEGES
REVOKE
REVOKE ON SHARE
SHOW GRANTS TO RECIPIENT
SHOW GRANTS ON SHARE
SHOW GROUPS
SHOW USERS
SHOW GRANTS ON SHARE

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Displays all recipients with access to a share.


To run this command you must be an administrator.
Since: Databricks Runtime 10.3

Syntax
SHOW GRANTS ON SHARE share_name

You can also use GRANT as an alternative for GRANTS .

Parameters
share_name
The name of the share whose recipients will be listed.

Example
> SHOW GRANTS ON SHARE shared_date;
recipient privilege
--------- ---------
some_corp SELECT
other_org SELECT

Related articles
GRANT ON SHARE
REVOKE ON SHARE
SHOW GRANTS
SHOW GRANTS TO RECIPIENT
SHOW SHARES
SHOW RECIPIENTS
SHOW GRANTS TO RECIPIENT

IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.

Displays all shares which the recipient can access.


To run this command you must be an administrator.
Since: Databricks Runtime 10.3

Syntax
SHOW GRANTS TO RECIPIENT recipient_name

You can also use GRANT as an alternative for GRANTS .

Parameters
recipient_name
The name of the recipient whose shares will be listed.

Example
> SHOW GRANTS TO RECIPIENT a_corp;
share privilege
----- ---------
data1 SELECT
data2 SELECT

Related articles
GRANT ON SHARE
REVOKE ON SHARE
SHOW GRANTS
SHOW GRANTS ON SHARE
SHOW SHARES
SHOW RECIPIENTS
MSCK REPAIR PRIVILEGES

Removes all the privileges from all the users associated with the object.
You use this statement to clean up residual access control left behind after objects have been dropped from the
Hive metastore outside of Databricks Runtime.
This statement only applies to objects in the hive_metastore catalog.

Syntax
MSCK REPAIR object PRIVILEGES

object
{ [ SCHEMA | DATABASE ] schema_name |
FUNCTION function_name |
TABLE table_name |
VIEW view_name |
ANONYMOUS FUNCTION |
ANY FILE }

Parameters
schema_name
Names the schema from which privileges are removed.
function_name
Names the function from which privileges are removed.
table_name
Names the table from which privileges are removed.
view_name
Names the view from which privileges are removed.
ANY FILE
Revokes ANY FILE privilege from all users.
ANONYMOUS FUNCTION
Revokes ANONYMOUS FUNCTION privilege from all users.

Examples
> MSCK REPAIR SCHEMA gone_from_hive PRIVILEGES;

> MSCK REPAIR ANONYMOUS FUNCTION PRIVILEGES;

> MSCK REPAIR TABLE default.dropped PRIVILEGES;


Related articles
GRANT
MSCK REPAIR TABLE
REVOKE
SHOW GRANTS
Alter Database

ALTER [DATABASE|SCHEMA] db_name SET DBPROPERTIES (key=val, ...)

SET DBPROPERTIES

Specify a property named key for the database and establish the value for the property respectively as val . If
key already exists, the old value is overwritten with val .
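A hedged sketch (the database name and property keys are hypothetical):

ALTER DATABASE customer_db SET DBPROPERTIES ('edited-by' = 'jane', 'edit-date' = '2022-07-21')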

Assign owner
ALTER DATABASE db_name OWNER TO `user_name@user_domain.com`

Assign an owner to the database.


Alter Table or View

Rename table or view


ALTER [TABLE|VIEW] [db_name.]table_name RENAME TO [db_name.]new_table_name

Rename an existing table or view. If the destination table name already exists, an exception is thrown. This
operation does not support moving tables across databases.
For managed tables, renaming a table moves the table location; for unmanaged (external) tables, renaming a
table does not move the table location.
For further information on managed versus unmanaged (external) tables, see Data objects in the Databricks
Lakehouse.

Set table or view properties


ALTER [TABLE|VIEW] table_name SET TBLPROPERTIES (key1=val1, key2=val2, ...)

Set the properties of an existing table or view. If a particular property was already set, this overrides the old
value with the new one.

NOTE
Property names are case sensitive. If you have key1 and then later set Key1 , a new table property is created.
To view table properties, run:

DESCRIBE EXTENDED table_name

Set a table comment


To set a table comment, run:

ALTER TABLE table_name SET TBLPROPERTIES ('comment' = 'A table comment.')

Drop table or view properties


ALTER (TABLE|VIEW) table_name UNSET TBLPROPERTIES
[IF EXISTS] (key1, key2, ...)

Drop one or more properties of an existing table or view. If a specified property does not exist, an exception is
thrown.
IF EXISTS

If a specified property does not exist, nothing will happen.
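For example (the events table name and property keys are hypothetical):

ALTER TABLE events UNSET TBLPROPERTIES IF EXISTS ('comment', 'owner')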


Set SerDe or SerDe properties
ALTER TABLE table_name [PARTITION part_spec] SET SERDE serde
[WITH SERDEPROPERTIES (key1=val1, key2=val2, ...)]

ALTER TABLE table_name [PARTITION part_spec]
  SET SERDEPROPERTIES (key1=val1, key2=val2, ...)

part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)

Set the SerDe or the SerDe properties of a table or partition. If a specified SerDe property was already set, this
overrides the old value with the new one. Setting the SerDe is allowed only for tables created using the Hive
format.
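A hedged sketch for a Hive-format table (the my_table name and the delimiter properties are illustrative):

ALTER TABLE my_table SET SERDEPROPERTIES ('field.delim' = ',', 'serialization.format' = ',')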

Assign owner
ALTER (TABLE|VIEW) object-name OWNER TO `user_name@user_domain.com`

Assign an owner to the table or view.

Delta Lake schema constructs


Delta Lake supports additional constructs for modifying table schema: add, change, and replace columns.
For add, change, and replace column examples, see Explicitly update schema.
Add columns

ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)

ALTER TABLE table_name ADD COLUMNS (col_name.nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name], ...)

Add columns to an existing table. It supports adding nested columns. If a column with the same name already
exists in the table or in the same nested struct, an exception is thrown.
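For example (the boxes table and weight column are hypothetical):

ALTER TABLE boxes ADD COLUMNS (weight INT COMMENT 'Weight in kg' AFTER height)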
Change columns

ALTER TABLE table_name (ALTER|CHANGE) [COLUMN] alterColumnAction


alterColumnAction:
: TYPE dataType
: [COMMENT col_comment]
: [FIRST|AFTER colA_name]
: (SET | DROP) NOT NULL

Change a column definition of an existing table. You can change the data type, comment, or nullability of a
column, or reorder columns.

NOTE
Available in Databricks Runtime 7.0 and above.
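A hedged sketch of two such changes (the boxes table and height column are hypothetical):

ALTER TABLE boxes ALTER COLUMN height COMMENT 'Height in cm'

ALTER TABLE boxes ALTER COLUMN height DROP NOT NULL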
Change columns (Hive syntax)

ALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name]

ALTER TABLE table_name CHANGE [COLUMN] col_name.nested_col_name col_name data_type [COMMENT col_comment]
[FIRST|AFTER colA_name]

Change a column definition of an existing table. You can change the comment of the column and reorder
columns.

NOTE
In Databricks Runtime 7.0 and above you cannot use CHANGE COLUMN :
To change the contents of complex data types such as structs. Instead use ADD COLUMNS to add new columns to
nested fields, or ALTER COLUMN to change the properties of a nested column.
To relax the nullability of a column in a Delta table. Instead use
ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL .

Replace columns

ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)

Replace the column definitions of an existing table. It supports changing the comments of columns, adding
columns, and reordering columns. If specified column definitions are not compatible with the existing
definitions, an exception is thrown.
Alter Table Partition

Add partition
ALTER TABLE table_name ADD [IF NOT EXISTS]
(PARTITION part_spec [LOCATION path], ...)

part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)

Add partitions to the table, optionally with a custom location for each partition added. This is supported only for
tables created using the Hive format. However, beginning with Spark 2.1, Alter Table Partitions is also
supported for tables defined using the datasource API.
IF NOT EXISTS

If the specified partitions already exist, nothing happens.
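For example (the events table, date partition column, and location path are hypothetical):

ALTER TABLE events ADD IF NOT EXISTS PARTITION (date='2022-01-01') LOCATION '/mnt/data/events/2022-01-01'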

Change partition
ALTER TABLE table_name PARTITION part_spec RENAME TO PARTITION part_spec

part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)

Change the partitioning field values of a partition. This operation is allowed only for tables created using the
Hive format.

Drop partition
ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)

Drop a partition from a table or view. This operation is allowed only for tables created using the Hive format.
IF EXISTS

If the specified partition does not exist, nothing happens.

Set partition location


ALTER TABLE table_name PARTITION part_spec SET LOCATION path

part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)

Set the location of the specified partition. Setting the location of individual partitions is allowed only for tables
created using the Hive format.
Analyze Table

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [analyze_option]

Collect statistics about the table that can be used by the query optimizer to find a better plan.

Table statistics
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]

Collect only basic statistics for the table (number of rows, size in bytes).
NOSCAN

Collect only statistics that do not require scanning the whole table (that is, size in bytes).

Column statistics
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]

Collect column statistics for the specified columns in addition to table statistics.

TIP
Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure
to collect statistics for all columns used by the query.
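For example (the events table and its columns are hypothetical):

ANALYZE TABLE events COMPUTE STATISTICS

ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS date, eventType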

See also:
Use Describe Table to inspect the existing statistics
Cost-based optimizer
Cache Select (Delta Lake on Azure Databricks)

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

Cache the data accessed by the specified simple SELECT query in the Delta cache. You can choose a subset of
columns to be cached by providing a list of column names and choose a subset of rows by providing a
predicate. This enables subsequent queries to avoid scanning the original files as much as possible. This
construct is applicable only to Parquet tables. Views are also supported, but the expanded queries are restricted
to the simple queries, as described above.
See Delta and Apache Spark caching for the differences between the RDD cache and the Databricks IO cache.

Examples
CACHE SELECT * FROM boxes
CACHE SELECT width, length FROM boxes WHERE height=3
Cache Table

CACHE [LAZY] TABLE [db_name.]table_name

Cache the contents of the table in memory using the RDD cache. This enables subsequent queries to avoid
scanning the original files as much as possible.
LAZY

Cache the table lazily instead of eagerly scanning the entire table.
See Delta and Apache Spark caching for the differences between the RDD cache and the Databricks IO cache.
Clear Cache

CLEAR CACHE

Clear the RDD cache associated with a SQLContext.


Clone (Delta Lake on Azure Databricks)

IMPORTANT
This feature is in Public Preview.

NOTE
Available in Databricks Runtime 7.2 and above.

Clone a source Delta table to a target destination at a specific version. A clone can be either deep or shallow
referring to whether it copies over the data from the source or not.

IMPORTANT
There are important differences between shallow and deep clones with respect to dependencies between a clone and the
source and other differences. See Clone a Delta table.

CREATE TABLE [IF NOT EXISTS] [db_name.]target_table
  [SHALLOW | DEEP] CLONE [db_name.]source_table [<time_travel_version>]
  [LOCATION 'path']

[CREATE OR] REPLACE TABLE [db_name.]target_table
  [SHALLOW | DEEP] CLONE [db_name.]source_table [<time_travel_version>]
  [LOCATION 'path']

where

<time_travel_version> =
TIMESTAMP AS OF timestamp_expression |
VERSION AS OF version

Specify CREATE TABLE IF NOT EXISTS to avoid creating a table target_table if a table already exists. If a table
already exists at the target, the clone operation is a no-op.
Specify CREATE OR REPLACE to replace the target of a clone operation if there is an existing table
target_table . This updates the metastore with the new table if table name is used.
Specifying SHALLOW or DEEP creates a shallow or deep clone at the target. If neither SHALLOW nor DEEP is
specified then a deep clone is created by default.
Specifying LOCATION creates an external table at the target with the provided location as the path where the
data will be stored. If the target provided is a path instead of a table name, the operation will fail.

Examples
You can use CLONE for complex operations like data migration, data archiving, machine learning flow
reproduction, short-term experiments, data sharing etc. See Clone use cases for a few examples.
Convert To Delta (Delta Lake on Azure Databricks)

CONVERT TO DELTA [ [db_name.]table_name | parquet.`<path-to-table>` ] [NO STATISTICS]
  [PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]

NOTE
CONVERT TO DELTA [db_name.]table_name requires Databricks Runtime 6.6 or above.

Convert an existing Parquet table to a Delta table in-place. This command lists all the files in the directory,
creates a Delta Lake transaction log that tracks these files, and automatically infers the data schema by reading
the footers of all Parquet files. The conversion process collects statistics to improve query performance on the
converted Delta table. If you provide a table name, the metastore is also updated to reflect that the table is now
a Delta table.
NO STATISTICS

Bypass statistics collection during the conversion process and finish conversion faster. After the table is
converted to Delta Lake, you can use OPTIMIZE ZORDER BY to reorganize the data layout and generate statistics.
PARTITIONED BY

Partition the created table by the specified columns. Required if the data is partitioned. The conversion process
aborts and throws an exception if the directory structure does not conform to the PARTITIONED BY specification. If
you do not provide the PARTITIONED BY clause, the command assumes that the table is not partitioned.
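For example (the events table name and path are hypothetical):

CONVERT TO DELTA events

CONVERT TO DELTA parquet.`/mnt/data/events` PARTITIONED BY (date DATE)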

Caveats
Any file not tracked by Delta Lake is invisible and can be deleted when you run VACUUM . You should avoid
updating or appending data files during the conversion process. After the table is converted, make sure all
writes go through Delta Lake.
It is possible that multiple external tables share the same underlying Parquet directory. In this case, if you run
CONVERT on one of the external tables, then you will not be able to access the other external tables because their
underlying directory has been converted from Parquet to Delta Lake. To query or write to these external tables
again, you must run CONVERT on them as well.
CONVERT populates the catalog information, such as schema and table properties, to the Delta Lake transaction
log. If the underlying directory has already been converted to Delta Lake and its metadata is different from the
catalog metadata, a convertMetastoreMetadataMismatchException will be thrown. If you want CONVERT to
overwrite the existing metadata in the Delta Lake transaction log, set the SQL configuration
spark.databricks.delta.convert.metadataCheck.enabled to false.

Undo the conversion


If you have performed Delta Lake operations such as DELETE or OPTIMIZE that can change the data files, first
run the following command for garbage collection:
VACUUM delta.`<path-to-table>` RETAIN 0 HOURS

Then, delete the <path-to-table>/_delta_log directory.


Copy Into (Delta Lake on Azure Databricks)

IMPORTANT
This feature is in Public Preview.

COPY INTO table_identifier
  FROM [ file_location | (SELECT identifier_list FROM file_location) ]
  FILEFORMAT = data_source
  [FILES = [file_name, ...] | PATTERN = 'regex_pattern']
  [FORMAT_OPTIONS ('data_source_reader_option' = 'value', ...)]
  [COPY_OPTIONS 'force' = ('false'|'true')]

Load data from a file location into a Delta table. This is a re-triable and idempotent operation—files in the source
location that have already been loaded are skipped.
table_identifier

The Delta table to copy into.


FROM file_location

The file location to load the data from. Files in this location must have the format specified in FILEFORMAT .
SELECT identifier_list

Selects the specified columns or expressions from the source data before copying into the Delta table.
FILEFORMAT = data_source

The format of the source files to load. One of CSV , JSON , AVRO , ORC , PARQUET .
FILES

A list of file names to load, with length up to 1000. Cannot be specified with PATTERN .
PATTERN

A regex pattern that identifies the files to load from the source directory. Cannot be specified with FILES .
FORMAT_OPTIONS

Options to be passed to the Apache Spark data source reader for the specified format.
COPY_OPTIONS

Options to control the operation of the COPY INTO command. The only option is 'force' ; if set to 'true' ,
idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.

Examples
COPY INTO delta.`target_path`
FROM (SELECT key, index, textData, 'constant_value' FROM 'source_path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
FORMAT_OPTIONS('header' = 'true')
Create Bloom Filter Index (Delta Lake on Azure
Databricks)

CREATE BLOOMFILTER INDEX
  ON [TABLE] table_name
  [FOR COLUMNS(columnName1 [OPTIONS(..)], columnName2, ...)]
  [OPTIONS(..)]

Create a Bloom filter index for new or rewritten data; it does not create Bloom filters for existing data. The
command fails if either the table name or one of the columns does not exist. If Bloom filtering is enabled for a
column, existing Bloom filter options are replaced by the new options.
While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command
updates Bloom filters for data that is reorganized. Therefore, you can backfill a Bloom filter by running OPTIMIZE
on a table:
If you have not previously optimized the table.
With a different file size, requiring that the data files be re-written.
With a ZORDER (or a different ZORDER , if one is already present), requiring that the data files be re-written.
You can tune the Bloom filter by defining options at the column level or at the table level:
fpp : False positive probability. The desired false positive rate per written Bloom filter. This influences the
number of bits needed to put a single item in the Bloom filter and influences the size of the Bloom filter. The
value must be larger than 0 and smaller than or equal to 1. The default value is 0.1 which requires 5 bits per
item.
numItems : Number of distinct items the file can contain. This setting is important for the quality of filtering as
it influences the total number of bits used in the Bloom filter (number of items * number of bits per item). If
this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing
queries that must download this file, or it is too full and is less accurate (higher FPP). The value must be
larger than 0. The default is 1 million items.
maxExpectedFpp : The expected FPP threshold for which a Bloom filter is not written to disk. The maximum
expected false positive probability at which a Bloom filter is written. If the expected FPP is larger than this
threshold, the Bloom filter’s selectivity is too low; the time and resources it takes to use the Bloom filter
outweighs its usefulness. The value must be between 0 and 1. The default is 1.0 (disabled).
These options play a role only when writing the data. You can configure these properties at various hierarchical
levels: write operation, table level, and column level. The column level takes precedence over the table and
operation levels, and the table level takes precedence over the operation level.
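A hedged sketch of a column-level configuration (the events table, eventType column, and option values are hypothetical):

CREATE BLOOMFILTER INDEX
  ON TABLE events
  FOR COLUMNS(eventType OPTIONS (fpp = 0.1, numItems = 50000000))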
See Bloom filter indexes.
Create Database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name
  [COMMENT comment_text]
  [LOCATION path]
  [WITH DBPROPERTIES (key=val, ...)]

Create a database. If a database with the same name already exists, an exception is thrown.
IF NOT EXISTS

If a database with the same name already exists, nothing will happen.
LOCATION

If the specified path does not already exist in the underlying file system, this command tries to create a directory
with the path.
WITH DBPROPERTIES

Specify a property named key for the database and establish the value for the property respectively as val . If
key already exists, the old value is overwritten with val .

Examples
-- Create database `customer_db`. This throws exception if database with name customer_db
-- already exists.
CREATE DATABASE customer_db;

-- Create database `customer_db` only if database with same name doesn't exist.
CREATE DATABASE IF NOT EXISTS customer_db;
Create Function

CREATE [TEMPORARY] FUNCTION [db_name.]function_name AS class_name
  [USING resource [, ...] ]

resource:
: [JAR|FILE|ARCHIVE] file_uri

Create a function. The specified class for the function must extend either UDF or UDAF in
org.apache.hadoop.hive.ql.exec , or one of AbstractGenericUDAFResolver , GenericUDF , or GenericUDTF in
org.apache.hadoop.hive.ql.udf.generic . If a function with the same name already exists in the database, an
exception will be thrown.

NOTE
This command is supported only when Hive support is enabled.

TEMPORARY

The created function is available only in this session and is not persisted to the underlying metastore, if any.
No database name can be specified for temporary functions.
USING resource

The resources that must be loaded to support this function. A list of JAR, file, or archive URIs.
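A hedged sketch (the function name, implementing class, and JAR path are hypothetical):

CREATE FUNCTION simple_udf AS 'com.example.SimpleUdf'
  USING JAR '/tmp/SimpleUdf.jar'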
Create Table

Create Table Using


CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1 [ = ] val1, key2 [ = ] val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ...)]
[AS select_statement]

Create a table using a data source. If a table with the same name already exists in the database, an exception is
thrown.
IF NOT EXISTS

If a table with the same name already exists in the database, nothing will happen.
USING data_source

The file format to use for the table. data_source must be one of TEXT , AVRO , CSV , JSON , JDBC , PARQUET , ORC ,
HIVE , DELTA , or LIBSVM , or a fully-qualified class name of a custom implementation of
org.apache.spark.sql.sources.DataSourceRegister .

HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and row_format
using the OPTIONS clause, which is a case-insensitive string map. The option keys are FILEFORMAT , INPUTFORMAT ,
OUTPUTFORMAT , SERDE , FIELDDELIM , ESCAPEDELIM , MAPKEYDELIM , and LINEDELIM .

OPTIONS

Table options used to optimize the behavior of the table or configure HIVE tables.

NOTE
This clause is not supported by Delta Lake.

PARTITIONED BY (col_name1, col_name2, ...)

Partition the created table by the specified columns. A directory is created for each partition.
CLUSTERED BY (col_name3, col_name4, ...)

Each partition in the created table will be split into a fixed number of buckets by the specified columns. This is
typically used with partitioning to read and shuffle less data.
LOCATION path

The directory to store the table data. This clause automatically implies EXTERNAL .
WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.

AS select_statement

Populate the table with input data from the SELECT statement. This cannot contain a column list.

Create Table Using Delta (Delta Lake on Azure Databricks)


CREATE [OR REPLACE] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [NOT NULL] [COMMENT col_comment1], ...)]
USING DELTA
[LOCATION <path-to-delta-files>]

NOT NULL

Indicate that a column value cannot be NULL . If specified, and an Insert or Update (Delta Lake on Azure
Databricks) statement sets a column value to NULL , a SparkException is thrown. The default is to allow a NULL
value.
LOCATION <path-to-delta-files>

If you specify a LOCATION that already contains data stored in Delta Lake, Delta Lake does the following:
If you specify only the table name and location, for example:

CREATE TABLE events
  USING DELTA
  LOCATION '/mnt/delta/events'

the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the
existing data. This functionality can be used to “import” data into the metastore.
If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the
specification exactly matches the configuration of the existing data.

WARNING
If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception that
describes the discrepancy.

Examples
CREATE TABLE boxes (width INT, length INT, height INT) USING CSV

CREATE TABLE boxes
  (width INT, length INT, height INT)
  USING PARQUET
  OPTIONS ('compression'='snappy')

CREATE TABLE rectangles
  USING PARQUET
  PARTITIONED BY (width)
  CLUSTERED BY (length) INTO 8 buckets
  AS SELECT * FROM boxes

-- CREATE a HIVE SerDe table using the CREATE TABLE USING syntax.
CREATE TABLE my_table (name STRING, age INT, hair_color STRING)
USING HIVE
OPTIONS(
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat',
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat',
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe')
PARTITIONED BY (hair_color)
TBLPROPERTIES ('status'='staging', 'owner'='andrew')

Create Table with Hive format


CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION path]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]

row_format:
: SERDE serde_cls [WITH SERDEPROPERTIES (key1=val1, key2=val2, ...)]
| DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]

file_format:
: TEXTFILE | SEQUENCEFILE | RCFILE | ORC | PARQUET | AVRO
| INPUTFORMAT input_fmt OUTPUTFORMAT output_fmt

Create a table using the Hive format. If a table with the same name already exists in the database, an exception
will be thrown. When the table is dropped later, its data will be deleted from the file system.

NOTE
This command is supported only when Hive support is enabled.

EXTERNAL

The table uses the custom directory specified with LOCATION . Queries on the table access existing data
previously stored in the directory. When an EXTERNAL table is dropped, its data is not deleted from the file
system. This flag is implied if LOCATION is specified.
IF NOT EXISTS

If a table with the same name already exists in the database, nothing will happen.
PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)

Partition the table by the specified columns. This set of columns must be distinct from the set of non-partitioned
columns. You cannot specify partitioned columns with AS select_statement .
ROW FORMAT

Use the SERDE clause to specify a custom SerDe for this table. Otherwise, use the DELIMITED clause to use the
native SerDe and specify the delimiter, escape character, null character, and so on.
STORED AS file_format

Specify the file format for this table. Available formats include TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET ,
and AVRO . Alternatively, you can specify your own input and output formats through INPUTFORMAT and
OUTPUTFORMAT . Only formats TEXTFILE , SEQUENCEFILE , and RCFILE can be used with ROW FORMAT SERDE and only
TEXTFILE can be used with ROW FORMAT DELIMITED .

LOCATION path

The directory to store the table data. This clause automatically implies EXTERNAL .
AS select_statement

Populate the table with input data from the select statement. You cannot specify this with PARTITIONED BY .

Data types
Spark SQL supports the following data types:
Numeric types
ByteType : Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127 .
ShortType : Represents 2-byte signed integer numbers. The range of numbers is from -32768 to
32767 .
IntegerType : Represents 4-byte signed integer numbers. The range of numbers is from -2147483648
to 2147483647 .
LongType : Represents 8-byte signed integer numbers. The range of numbers is from
-9223372036854775808 to 9223372036854775807 .
FloatType : Represents 4-byte single-precision floating point numbers.
DoubleType : Represents 8-byte double-precision floating point numbers.
DecimalType : Represents arbitrary-precision signed decimal numbers. Backed internally by
java.math.BigDecimal . A BigDecimal consists of an arbitrary precision integer unscaled value and a
32-bit integer scale.
String type: StringType : Represents character string values.
Binary type: BinaryType : Represents byte sequence values.
Boolean type: BooleanType : Represents boolean values.
Datetime types
TimestampType : Represents values comprising values of fields year, month, day, hour, minute, and
second, with the session local time zone. The timestamp value represents an absolute point in time.
DateType : Represents values comprising values of fields year, month and day, without a time-zone.
Complex types
ArrayType(elementType, containsNull) : Represents values comprising a sequence of elements with the
type of elementType . containsNull is used to indicate if elements in a ArrayType value can have
null values.
MapType(keyType, valueType, valueContainsNull) : Represents values comprising a set of key-value
pairs. The data type of keys is described by keyType and the data type of values is described by
valueType . For a MapType value, keys are not allowed to have null values. valueContainsNull is
used to indicate if values of a MapType value can have null values.
StructType(fields) : Represents values with the structure described by a sequence of StructField (
fields ).
StructField(name, dataType, nullable) : Represents a field in a StructType . The name of a field is
indicated by name . The data type of a field is indicated by dataType . nullable is used to indicate if
values of these fields can have null values.

The following table shows the type names and aliases for each data type.

DATA T Y P E SQ L N A M E

BooleanType BOOLEAN

ByteType BYTE , TINYINT

ShortType SHORT , SMALLINT

IntegerType INT , INTEGER

LongType LONG , BIGINT

FloatType FLOAT , REAL

DoubleType DOUBLE

DateType DATE

TimestampType TIMESTAMP

StringType STRING

BinaryType BINARY

DecimalType DECIMAL , DEC , NUMERIC

CalendarIntervalType INTERVAL

ArrayType ARRAY<element_type>

StructType STRUCT<field1_name: field1_type, field2_name:


field2_type, ...>

MapType MAP<key_type, value_type>

Examples
CREATE TABLE my_table (name STRING, age INT)

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table is partitioned'
  PARTITIONED BY (hair_color STRING COMMENT 'This is a column comment')
  TBLPROPERTIES ('status'='staging', 'owner'='andrew')

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table specifies a custom SerDe'
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table uses the CSV format'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE

CREATE TABLE your_table
  COMMENT 'This table is created with existing data'
  AS SELECT * FROM my_table

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'spark-warehouse/tables/my_existing_table'

Create Table Like


CREATE TABLE [IF NOT EXISTS] [db_name.]table_name1 LIKE [db_name.]table_name2 [LOCATION path]

Create a managed table using the definition/metadata of an existing table or view. The created table always uses
its own directory in the default warehouse location.

NOTE
Delta Lake does not support CREATE TABLE LIKE . Instead use CREATE TABLE AS . See AS.
Create View

CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [db_name.]view_name
  [(col_name1 [COMMENT col_comment1], ...)]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  AS select_statement

Define a logical view on one or more tables or views.


OR REPLACE

If the view does not exist, CREATE OR REPLACE VIEW is equivalent to CREATE VIEW . If the view does exist,
CREATE OR REPLACE VIEW is equivalent to ALTER VIEW .

[GLOBAL] TEMPORARY

TEMPORARY skips persisting the view definition in the underlying metastore, if any. If GLOBAL is specified, the
view can be accessed by different sessions and kept alive until your application ends; otherwise, the temporary
views are session-scoped and will be automatically dropped if the session terminates. All the global temporary
views are tied to a system preserved temporary database global_temp . The database name is preserved, and
thus, users are not allowed to create/use/drop this database. You must use the qualified name to access the
global temporary view.

NOTE
A temporary view defined in a notebook is not visible in other notebooks. See Notebook isolation.

(col_name1 [COMMENT col_comment1], ...)

A column list that defines the view schema. The column names must be unique with the same number of
columns retrieved by select_statement . When the column list is not given, the view schema is the output
schema of select_statement .
TBLPROPERTIES

Metadata key-value pairs.


AS select_statement

A SELECT statement that defines the view. The statement can select from base tables or the other views.

IMPORTANT
You cannot specify datasource, partition, or clustering options since a view is not materialized like a table.

Examples
-- Create a persistent view view_deptDetails in database1.
-- The view definition is recorded in the underlying metastore.
CREATE VIEW database1.view_deptDetails
AS SELECT * FROM company JOIN dept ON company.dept_id = dept.id;

-- Create or replace a local temporary view from a persistent view with an extra filter
CREATE OR REPLACE TEMPORARY VIEW temp_DeptSFO
AS SELECT * FROM database1.view_deptDetails WHERE loc = 'SFO';

-- Access the base tables through the temporary view
SELECT * FROM temp_DeptSFO;

-- Create a global temp view to share the data through different sessions
CREATE GLOBAL TEMP VIEW global_DeptSJC
AS SELECT * FROM database1.view_deptDetails WHERE loc = 'SJC';

-- Access the global temp views
SELECT * FROM global_temp.global_DeptSJC;

-- Drop the global temp view, temp view, and persistent view.
DROP VIEW global_temp.global_DeptSJC;
DROP VIEW temp_DeptSFO;
DROP VIEW database1.view_deptDetails;
Delete From (Delta Lake on Azure Databricks)

DELETE FROM [db_name.]table_name [AS alias] [WHERE predicate]

Delete the rows that match a predicate. When no predicate is provided, delete all rows.
WHERE

Filter rows by predicate.


The WHERE predicate supports subqueries, including IN , NOT IN , EXISTS , NOT EXISTS , and scalar subqueries.
The following types of subqueries are not supported:
Nested subqueries, that is, a subquery inside another subquery
NOT IN subquery inside an OR , for example, a = 3 OR b NOT IN (SELECT c from t)

In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . We recommend using NOT EXISTS
whenever possible, as DELETE with NOT IN subqueries can be slow.

Example
DELETE FROM events WHERE date < '2017-01-01'

Subquery Examples
DELETE FROM all_events
WHERE session_time < (SELECT min(session_time) FROM good_events)

DELETE FROM orders AS t1
  WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid)

DELETE FROM events
  WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01')
Deny

DENY
privilege_type [, privilege_type ] ...
ON [CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name>
| ANONYMOUS FUNCTION | ANY FILE]
TO principal

privilege_type
: SELECT | CREATE | MODIFY | READ_METADATA | CREATE_NAMED_FUNCTION | ALL PRIVILEGES

principal
: `<user>@<domain-name>` | <group-name>

Deny a privilege on an object to a user or principal. Denying a privilege on a database (for example a SELECT
privilege) has the effect of implicitly denying that privilege on all objects in that database. Denying a specific
privilege on the catalog has the effect of implicitly denying that privilege on all databases in the catalog.
To deny a privilege to all users, specify the keyword users after TO .
DENY can be used to ensure that a user or principal cannot access the specified object, despite any implicit or
explicit GRANTs . When an object is accessed, Databricks first checks if there are any explicit or implicit DENYs on
the object before checking if there are any explicit or implicit GRANTs .
For example, suppose there is a database db with tables t1 and t2 . A user is initially granted SELECT
privileges on db . The user can access t1 and t2 due to the GRANT on the database db .
If the administrator issues a DENY on table t1 , the user will no longer be able to access t1 . If the
administrator issues a DENY on database db , the user will not be able to access any tables in db even if there
is an explicit GRANT on these tables. That is, the DENY always supersedes the GRANT .

Example
DENY SELECT ON <table-name> TO `<user>@<domain-name>`;
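
To illustrate the precedence described above, a sketch of the db / t1 / t2 scenario (the database, table, and user names are placeholders):

-- The user initially reads t1 and t2 via the database-level grant
GRANT SELECT ON DATABASE db TO `user@example.com`;
-- Block access to t1 only; the database-level grant still covers t2
DENY SELECT ON TABLE db.t1 TO `user@example.com`;
-- Block the whole database; the DENY supersedes any table-level GRANTs
DENY SELECT ON DATABASE db TO `user@example.com`;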
Describe Database

DESCRIBE DATABASE [EXTENDED] db_name

Return the metadata of an existing database (name, comment and location). If the database does not exist, an
exception is thrown.
EXTENDED

Display the database properties.
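
For example, the database1 database used in the CREATE VIEW examples earlier could be inspected as follows (illustrative only); the EXTENDED form additionally lists the database properties:

DESCRIBE DATABASE database1;
DESCRIBE DATABASE EXTENDED database1;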


Describe Function

DESCRIBE FUNCTION [EXTENDED] [db_name.]function_name

Return the metadata of an existing function (implementing class and usage). If the function does not exist, an
exception is thrown.
EXTENDED

Show extended usage information.
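
As an illustration, the built-in abs function (documented later in this reference) can be described with and without the extended usage text:

DESCRIBE FUNCTION abs;
DESCRIBE FUNCTION EXTENDED abs;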


Describe History (Delta Lake on Azure Databricks)

DESCRIBE HISTORY [db_name.]table_name

DESCRIBE HISTORY delta.`<path-to-table>`

Return provenance information, including the operation, user, and so on, for each write to a table. Table history is
retained for 30 days.
See Retrieve Delta table history for details.
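
An illustrative call against a hypothetical Delta table named events, referenced by table name and by path (the path is a placeholder):

DESCRIBE HISTORY events;
DESCRIBE HISTORY delta.`/mnt/delta/events`;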
Describe Table

Describe Table (Delta Lake on Azure Databricks)


DESCRIBE [EXTENDED] [db_name.]table_name

DESCRIBE [EXTENDED] delta.`<path-to-table>`

Return the metadata of an existing table (column names, data types, and comments). If the table does not exist,
an exception is thrown.
EXTENDED

Display detailed information about the table, including parent database, table type, storage information, and
properties.

Describe Partition (Delta Lake on Azure Databricks)


DESCRIBE [EXTENDED] [db_name.]table_name PARTITION partition_spec

DESCRIBE [EXTENDED] delta.`<path-to-table>` PARTITION partition_spec

Return the metadata of a specified partition. The partition_spec must provide the values for all the partition
columns.
EXTENDED

Display basic information about the table and the partition-specific storage information.

Describe Columns (Delta Lake on Azure Databricks)


DESCRIBE [EXTENDED] [db_name.]table_name column_name

DESCRIBE [EXTENDED] delta.`<path-to-table>`

Return the metadata of a specified column.


EXTENDED

Display detailed information about the specified columns, including the column statistics collected by the
command ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name [column_name, ...] .

Describe Formatted (Delta Lake on Azure Databricks)


DESCRIBE FORMATTED [db_name.]table_name

DESCRIBE FORMATTED delta.`<path-to-table>`

Return the table format.


Describe Detail (Delta Lake on Azure Databricks)
DESCRIBE DETAIL [db_name.]table_name

DESCRIBE DETAIL delta.`<path-to-table>`

Return information about schema, partitioning, table size, and so on. For example, you can see the current
reader and writer versions of a table.
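
Bringing the variants above together, a few illustrative calls against a hypothetical Delta table events with a category column:

DESCRIBE EXTENDED events;
DESCRIBE EXTENDED events category;
DESCRIBE FORMATTED events;
DESCRIBE DETAIL events;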
Drop Bloom Filter Index (Delta Lake on Azure
Databricks)

DROP BLOOMFILTER INDEX


ON [TABLE] table_name
[FOR COLUMNS(columnName1, columnName2, ...)]

Drop a Bloom filter index.


The command fails if either the table name or one of the columns does not exist. All Bloom filter related
metadata is removed from the specified columns.
When a table does not have any Bloom filters, the underlying index files are cleaned when the table is
vacuumed.
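
For example, assuming a hypothetical table events with a Bloom filter index on its category column, the index could be dropped with:

DROP BLOOMFILTER INDEX ON TABLE events FOR COLUMNS(category);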
Drop Database

DROP [DATABASE | SCHEMA] [IF EXISTS] db_name [RESTRICT | CASCADE]

Drop a database and delete the directory associated with the database from the file system. If the database does
not exist, an exception is thrown.
IF EXISTS

If the database to drop does not exist, nothing happens.


RESTRICT

Dropping a non-empty database triggers an exception. Enabled by default.


CASCADE

Dropping a non-empty database also drops all associated tables and functions.
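
An illustrative example using the database1 database from the earlier examples:

-- RESTRICT is the default, so this fails if database1 still contains tables or functions
DROP DATABASE database1;
-- CASCADE drops the database together with its tables and functions
DROP DATABASE IF EXISTS database1 CASCADE;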
Drop Function

DROP [TEMPORARY] FUNCTION [IF EXISTS] [db_name.]function_name

Drop an existing function. If the function to drop does not exist, an exception is thrown.

NOTE
This command is supported only when Hive support is enabled.

TEMPORARY

Whether the function to drop is a temporary function.


IF EXISTS

If the function to drop does not exist, nothing happens.
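
Illustrative calls with a hypothetical user-defined function named square_udf:

DROP TEMPORARY FUNCTION IF EXISTS square_udf;
DROP FUNCTION IF EXISTS database1.square_udf;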


Drop Table

DROP TABLE [IF EXISTS] [db_name.]table_name

Drop a table and delete the directory associated with the table from the file system if this is not an EXTERNAL
table. If the table to drop does not exist, an exception is thrown.
IF EXISTS

If the table does not exist, nothing happens.
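
For instance, dropping a hypothetical managed table events in database1 (its data directory is removed as well, since the table is not EXTERNAL):

DROP TABLE IF EXISTS database1.events;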


Drop View

DROP VIEW [db_name.]view_name

Drop a logical view on one or more tables.

Examples
-- Drop the global temp view, temp view, and persistent view.
DROP VIEW global_temp.global_DeptSJC;
DROP VIEW temp_DeptSFO;
DROP VIEW database1.view_deptDetails;
Explain

EXPLAIN [EXTENDED | CODEGEN] statement

Provide detailed plan information about statement without actually running it. By default this only outputs
information about the physical plan. Explaining DESCRIBE TABLE is not supported.
EXTENDED

Output information about the logical plan before and after analysis and optimization.
CODEGEN

Output the generated code for the statement, if any.
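
Illustrative usage against the hypothetical events table from the DELETE examples; the EXTENDED form also prints the logical plan before and after analysis and optimization:

EXPLAIN SELECT * FROM events WHERE date < '2017-01-01';
EXPLAIN EXTENDED SELECT * FROM events WHERE date < '2017-01-01';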


Fsck Repair Table (Delta Lake on Azure Databricks)

FSCK REPAIR TABLE [db_name.]table_name [DRY RUN]

Remove the file entries from the transaction log of a Delta table that can no longer be found in the underlying
file system. This can happen when these files have been manually deleted.
DRY RUN

Return a list of files to be removed from the transaction log.
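
A sketch against a hypothetical Delta table events: the DRY RUN form only lists the dangling file entries, while the second statement actually removes them from the transaction log:

FSCK REPAIR TABLE events DRY RUN;
FSCK REPAIR TABLE events;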


Functions (Apache Spark 2.x)

This article lists the built-in functions in Apache Spark SQL.

!
! expr - Logical not.

%
expr1 % expr2 - Returns the remainder after expr1 / expr2 .
Examples:

> SELECT 2 % 1.8;


0.2
> SELECT MOD(2, 1.8);
0.2

&
expr1 & expr2 - Returns the result of bitwise AND of expr1 and expr2 .
Examples:

> SELECT 3 & 5;


1

*
expr1 * expr2 - Returns expr1 * expr2 .
Examples:

> SELECT 2 * 3;
6

+
expr1 + expr2 - Returns expr1 + expr2 .
Examples:

> SELECT 1 + 2;
3

-
expr1 - expr2 - Returns expr1 - expr2 .
Examples:

> SELECT 2 - 1;
1

/
expr1 / expr2 - Returns expr1 / expr2 . It always performs floating point division.
Examples:

> SELECT 3 / 2;
1.5
> SELECT 2L / 2L;
1.0

<
expr1 < expr2 - Returns true if expr1 is less than expr2 .
Arguments:
expr1, expr2 - the two expressions must be the same type, or must be castable to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such as array/struct, the data types of fields must be orderable.
Examples:

> SELECT 1 < 2;


true
> SELECT 1.1 < '1';
false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-07-30 04:17:52');
false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-08-01 04:17:52');
true
> SELECT 1 < NULL;
NULL

<=
expr1 <= expr2 - Returns true if expr1 is less than or equal to expr2 .
Arguments:
expr1, expr2 - the two expressions must be the same type, or must be castable to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such as array/struct, the data types of fields must be orderable.
Examples:
> SELECT 2 <= 2;
true
> SELECT 1.0 <= '1';
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-08-01 04:17:52');
true
> SELECT 1 <= NULL;
NULL

<=>
expr1 <=> expr2 - Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both
are null, and false if one of them is null.
Arguments:
expr1, expr2 - the two expressions must be the same type, or must be castable to a common type, and must be a
type that can be used in equality comparison. Map type is not supported. For complex types such as
array/struct, the data types of fields must be orderable.
Examples:

> SELECT 2 <=> 2;


true
> SELECT 1 <=> '1';
true
> SELECT true <=> NULL;
false
> SELECT NULL <=> NULL;
true

=
expr1 = expr2 - Returns true if expr1 equals expr2 , or false otherwise.
Arguments:
expr1, expr2 - the two expressions must be the same type, or must be castable to a common type, and must be a
type that can be used in equality comparison. Map type is not supported. For complex types such as
array/struct, the data types of fields must be orderable.
Examples:

> SELECT 2 = 2;
true
> SELECT 1 = '1';
true
> SELECT true = NULL;
NULL
> SELECT NULL = NULL;
NULL

==
expr1 == expr2 - Returns true if expr1 equals expr2 , or false otherwise.
Arguments:
expr1, expr2 - the two expressions must be the same type, or must be castable to a common type, and must be a
type that can be used in equality comparison. Map type is not supported. For complex types such as
array/struct, the data types of fields must be orderable.
Examples:

> SELECT 2 == 2;
true
> SELECT 1 == '1';
true
> SELECT true == NULL;
NULL
> SELECT NULL == NULL;
NULL

>
expr1 > expr2 - Returns true if expr1 is greater than expr2 .
Arguments:
expr1, expr2 - the two expressions must be the same type, or must be castable to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such as array/struct, the data types of fields must be orderable.
Examples:

> SELECT 2 > 1;


true
> SELECT 2 > '1.1';
true
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-07-30 04:17:52');
false
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-08-01 04:17:52');
false
> SELECT 1 > NULL;
NULL

>=
expr1 >= expr2 - Returns true if expr1 is greater than or equal to expr2 .
Arguments:
expr1, expr2 - the two expressions must be the same type, or must be castable to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such as array/struct, the data types of fields must be orderable.
Examples:
> SELECT 2 >= 1;
true
> SELECT 2.0 >= '2.1';
false
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-08-01 04:17:52');
false
> SELECT 1 >= NULL;
NULL

^
expr1 ^ expr2 - Returns the result of bitwise exclusive OR of expr1 and expr2 .
Examples:

> SELECT 3 ^ 5;
2

abs
abs(expr) - Returns the absolute value of the numeric value.
Examples:

> SELECT abs(-1);


1

acos
acos(expr) - Returns the inverse cosine (arccosine) of expr , as if computed by java.lang.Math.acos .
Examples:

> SELECT acos(1);


0.0
> SELECT acos(2);
NaN

add_months
add_months(start_date, num_months) - Returns the date that is num_months after start_date .
Examples:

> SELECT add_months('2016-08-31', 1);


2016-09-30

Since: 1.5.0

aggregate
aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array,
and reduces this to a single state. The final state is converted into the final result by applying a finish function.
Examples:

> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x);


6
> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10);
60

Since: 2.4.0

and
expr1 and expr2 - Logical AND.

approx_count_distinct
approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD
defines the maximum estimation error allowed.

approx_percentile
approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column
col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter
(default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory.
Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation. When
percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the
approximate percentile array of column col at the given percentage array.
Examples:

> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);


[10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
10.0

array
array(expr, …) - Returns an array with the given elements.
Examples:

> SELECT array(1, 2, 3);


[1,2,3]

array_contains
array_contains(array, value) - Returns true if the array contains the value.
Examples:

> SELECT array_contains(array(1, 2, 3), 2);


true
array_distinct
array_distinct(array) - Removes duplicate values from the array.
Examples:

> SELECT array_distinct(array(1, 2, 3, null, 3));


[1,2,3,null]

Since: 2.4.0

array_except
array_except(array1, array2) - Returns an array of the elements in array1 but not in array2, without duplicates.
Examples:

> SELECT array_except(array(1, 2, 3), array(1, 3, 5));


[2]

Since: 2.4.0

array_intersect
array_intersect(array1, array2) - Returns an array of the elements in the intersection of array1 and array2,
without duplicates.
Examples:

> SELECT array_intersect(array(1, 2, 3), array(1, 3, 5));


[1,3]

Since: 2.4.0

array_join
array_join(array, delimiter[, nullReplacement]) - Concatenates the elements of the given array using the delimiter
and an optional string to replace nulls. If no value is set for nullReplacement, any null value is filtered.
Examples:

> SELECT array_join(array('hello', 'world'), ' ');


hello world
> SELECT array_join(array('hello', null ,'world'), ' ');
hello world
> SELECT array_join(array('hello', null ,'world'), ' ', ',');
hello , world

Since: 2.4.0

array_max
array_max(array) - Returns the maximum value in the array. NULL elements are skipped.
Examples:
> SELECT array_max(array(1, 20, null, 3));
20

Since: 2.4.0

array_min
array_min(array) - Returns the minimum value in the array. NULL elements are skipped.
Examples:

> SELECT array_min(array(1, 20, null, 3));


1

Since: 2.4.0

array_position
array_position(array, element) - Returns the (1-based) index of the first element of the array as long.
Examples:

> SELECT array_position(array(3, 2, 1), 1);


3

Since: 2.4.0

array_remove
array_remove(array, element) - Remove all elements that equal to element from array.
Examples:

> SELECT array_remove(array(1, 2, 3, null, 3), 3);


[1,2,null]

Since: 2.4.0

array_repeat
array_repeat(element, count) - Returns the array containing element count times.
Examples:

> SELECT array_repeat('123', 2);


["123","123"]

Since: 2.4.0

array_sort
array_sort(array) - Sorts the input array in ascending order. The elements of the input array must be orderable.
Null elements will be placed at the end of the returned array.
Examples:

> SELECT array_sort(array('b', 'd', null, 'c', 'a'));


["a","b","c","d",null]

Since: 2.4.0

array_union
array_union(array1, array2) - Returns an array of the elements in the union of array1 and array2, without
duplicates.
Examples:

> SELECT array_union(array(1, 2, 3), array(1, 3, 5));


[1,2,3,5]

Since: 2.4.0

arrays_overlap
arrays_overlap(a1, a2) - Returns true if a1 contains at least a non-null element present also in a2. If the arrays
have no common element and they are both non-empty and either of them contains a null element null is
returned, false otherwise.
Examples:

> SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5));


true

Since: 2.4.0

arrays_zip
arrays_zip(a1, a2, …) - Returns a merged array of structs in which the N-th struct contains all N-th values of
input arrays.
Examples:

> SELECT arrays_zip(array(1, 2, 3), array(2, 3, 4));

[{"0":1,"1":2},{"0":2,"1":3},{"0":3,"1":4}]

> SELECT arrays_zip(array(1, 2), array(2, 3), array(3, 4));

[{"0":1,"1":2,"2":3},{"0":2,"1":3,"2":4}]

Since: 2.4.0

ascii
ascii(str) - Returns the numeric value of the first character of str .
Examples:

> SELECT ascii('222');


50
> SELECT ascii(2);
50

asin
asin(expr) - Returns the inverse sine (arcsine) of expr , as if computed by java.lang.Math.asin .
Examples:

> SELECT asin(0);


0.0
> SELECT asin(2);
NaN

assert_true
assert_true(expr) - Throws an exception if expr is not true.
Examples:

> SELECT assert_true(0 < 1);


NULL

atan
atan(expr) - Returns the inverse tangent (arctangent) of expr , as if computed by java.lang.Math.atan

Examples:

> SELECT atan(0);


0.0

atan2
atan2(exprY, exprX) - Returns the angle in radians between the positive x-axis of a plane and the point given by
the coordinates ( exprX , exprY ), as if computed by java.lang.Math.atan2 .
Arguments:
exprY - coordinate on y-axis
exprX - coordinate on x-axis
Examples:

> SELECT atan2(0, 0);


0.0
avg
avg(expr) - Returns the mean calculated from values of a group.

base64
base64(bin) - Converts the argument from a binary bin to a base 64 string.
Examples:

> SELECT base64('Spark SQL');


U3BhcmsgU1FM

bigint
bigint(expr) - Casts the value expr to the target data type bigint .

bin
bin(expr) - Returns the string representation of the long value expr represented in binary.
Examples:

> SELECT bin(13);


1101
> SELECT bin(-13);
1111111111111111111111111111111111111111111111111111111111110011
> SELECT bin(13.3);
1101

binary
binary(expr) - Casts the value expr to the target data type binary .

bit_length
bit_length(expr) - Returns the bit length of string data or number of bits of binary data.
Examples:

> SELECT bit_length('Spark SQL');


72

boolean
boolean(expr) - Casts the value expr to the target data type boolean .

bround
bround(expr, d) - Returns expr rounded to d decimal places using HALF_EVEN rounding mode.
Examples:
> SELECT bround(2.5, 0);
2.0

cardinality
cardinality(expr) - Returns the size of an array or a map. The function returns -1 if its input is null and
spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for
null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.
Examples:

> SELECT cardinality(array('b', 'd', 'c', 'a'));


4
> SELECT cardinality(map('a', 1, 'b', 2));
2
> SELECT cardinality(NULL);
-1

cast
cast(expr AS type) - Casts the value expr to the target data type type .
Examples:

> SELECT cast('10' as int);


10

cbrt
cbrt(expr) - Returns the cube root of expr .
Examples:

> SELECT cbrt(27.0);


3.0

ceil
ceil(expr) - Returns the smallest integer not smaller than expr .
Examples:

> SELECT ceil(-0.1);


0
> SELECT ceil(5);
5

ceiling
ceiling(expr) - Returns the smallest integer not smaller than expr .
Examples:
> SELECT ceiling(-0.1);
0
> SELECT ceiling(5);
5

char
char(expr) - Returns the ASCII character having the binary equivalent to expr . If n is larger than 256 the result is
equivalent to chr(n % 256)
Examples:

> SELECT char(65);


A

char_length
char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of
string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:

> SELECT char_length('Spark SQL ');


10
> SELECT CHAR_LENGTH('Spark SQL ');
10
> SELECT CHARACTER_LENGTH('Spark SQL ');
10

character_length
character_length(expr) - Returns the character length of string data or number of bytes of binary data. The
length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:

> SELECT character_length('Spark SQL ');


10
> SELECT CHAR_LENGTH('Spark SQL ');
10
> SELECT CHARACTER_LENGTH('Spark SQL ');
10

chr
chr(expr) - Returns the ASCII character having the binary equivalent to expr . If n is larger than 256 the result is
equivalent to chr(n % 256)
Examples:

> SELECT chr(65);


A
coalesce
coalesce(expr1, expr2, …) - Returns the first non-null argument if exists. Otherwise, null.
Examples:

> SELECT coalesce(NULL, 1, NULL);


1

collect_list
collect_list(expr) - Collects and returns a list of non-unique elements.

collect_set
collect_set(expr) - Collects and returns a set of unique elements.

concat
concat(col1, col2, …, colN) - Returns the concatenation of col1, col2, …, colN.
Examples:

> SELECT concat('Spark', 'SQL');


SparkSQL
> SELECT concat(array(1, 2, 3), array(4, 5), array(6));
[1,2,3,4,5,6]
The concat logic for arrays is available since 2.4.0.

concat_ws
concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by sep .
Examples:

> SELECT concat_ws(' ', 'Spark', 'SQL');


Spark SQL

conv
conv(num, from_base, to_base) - Convert num from from_base to to_base .
Examples:

> SELECT conv('100', 2, 10);


4
> SELECT conv(-10, 16, -10);
-16

corr
corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.
cos
cos(expr) - Returns the cosine of expr , as if computed by java.lang.Math.cos .
Arguments:
expr - angle in radians
Examples:

> SELECT cos(0);


1.0

cosh
cosh(expr) - Returns the hyperbolic cosine of expr , as if computed by java.lang.Math.cosh .
Arguments:
expr - hyperbolic angle
Examples:

> SELECT cosh(0);


1.0

cot
cot(expr) - Returns the cotangent of expr , as if computed by 1/java.lang.Math.tan .
Arguments:
expr - angle in radians
Examples:

> SELECT cot(1);


0.6420926159343306

count
count(*) - Returns the total number of retrieved rows, including rows containing null.
count(expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are all non-null.
count(DISTINCT expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are unique and
non-null.

count_min_sketch
count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps,
confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage.
Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.

covar_pop
covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.

covar_samp
covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.

crc32
crc32(expr) - Returns a cyclic redundancy check value of the expr as a bigint.
Examples:

> SELECT crc32('Spark');


1557323817

cube
cume_dist
cume_dist() - Computes the position of a value relative to all values in the partition.

current_database
current_database() - Returns the current database.
Examples:

> SELECT current_database();


default

current_date
current_date() - Returns the current date at the start of query evaluation.
Since: 1.5.0

current_timestamp
current_timestamp() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0

date
date(expr) - Casts the value expr to the target data type date .

date_add
date_add(start_date, num_days) - Returns the date that is num_days after start_date .
Examples:
> SELECT date_add('2016-07-30', 1);
2016-07-31

Since: 1.5.0

date_format
date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date
format fmt .
Examples:

> SELECT date_format('2016-04-08', 'y');


2016

Since: 1.5.0

date_sub
date_sub(start_date, num_days) - Returns the date that is num_days before start_date .
Examples:

> SELECT date_sub('2016-07-30', 1);


2016-07-29

Since: 1.5.0

date_trunc
date_trunc(fmt, ts) - Returns timestamp ts truncated to the unit specified by the format model fmt . fmt
should be one of [“YEAR”, “YYYY”, “YY”, “MON”, “MONTH”, “MM”, “DAY”, “DD”, “HOUR”, “MINUTE”, “SECOND”,
“WEEK”, “QUARTER”]
Examples:

> SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359');


2015-01-01 00:00:00
> SELECT date_trunc('MM', '2015-03-05T09:32:05.359');
2015-03-01 00:00:00
> SELECT date_trunc('DD', '2015-03-05T09:32:05.359');
2015-03-05 00:00:00
> SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359');
2015-03-05 09:00:00

Since: 2.3.0

datediff
datediff(endDate, startDate) - Returns the number of days from startDate to endDate .
Examples:
> SELECT datediff('2009-07-31', '2009-07-30');
1

> SELECT datediff('2009-07-30', '2009-07-31');


-1

Since: 1.5.0

day
day(date) - Returns the day of month of the date/timestamp.
Examples:

> SELECT day('2009-07-30');


30

Since: 1.5.0

dayofmonth
dayofmonth(date) - Returns the day of month of the date/timestamp.
Examples:

> SELECT dayofmonth('2009-07-30');


30

Since: 1.5.0

dayofweek
dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, …, 7 = Saturday).
Examples:

> SELECT dayofweek('2009-07-30');


5

Since: 2.3.0

dayofyear
dayofyear(date) - Returns the day of year of the date/timestamp.
Examples:

> SELECT dayofyear('2016-04-09');


100

Since: 1.5.0

decimal
decimal(expr) - Casts the value expr to the target data type decimal .

decode
decode(bin, charset) - Decodes the first argument using the second argument character set.
Examples:

> SELECT decode(encode('abc', 'utf-8'), 'utf-8');


abc

degrees
degrees(expr) - Converts radians to degrees.
Arguments:
expr - angle in radians
Examples:

> SELECT degrees(3.141592653589793);


180.0

dense_rank
dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned
rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.

double
double(expr) - Casts the value expr to the target data type double .

e
e() - Returns Euler’s number, e.
Examples:

> SELECT e();


2.718281828459045

element_at
element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements
from the last to the first. Returns NULL if the index exceeds the length of the array.
element_at(map, key) - Returns value for given key, or NULL if the key is not contained in the map
Examples:
> SELECT element_at(array(1, 2, 3), 2);
2
> SELECT element_at(map(1, 'a', 2, 'b'), 2);
b

Since: 2.4.0

elt
elt(n, input1, input2, …) - Returns the n -th input, e.g., returns input2 when n is 2.
Examples:

> SELECT elt(1, 'scala', 'java');


scala

encode
encode(str, charset) - Encodes the first argument using the second argument character set.
Examples:

> SELECT encode('abc', 'utf-8');


abc

exists
exists(expr, pred) - Tests whether a predicate holds for one or more elements in the array.
Examples:

> SELECT exists(array(1, 2, 3), x -> x % 2 == 0);


true

Since: 2.4.0

exp
exp(expr) - Returns e to the power of expr .
Examples:

> SELECT exp(0);


1.0

explode
explode(expr) - Separates the elements of array expr into multiple rows, or the elements of map expr into
multiple rows and columns.
Examples:
> SELECT explode(array(10, 20));
10
20

explode_outer
explode_outer(expr) - Separates the elements of array expr into multiple rows, or the elements of map expr
into multiple rows and columns.
Examples:

> SELECT explode_outer(array(10, 20));


10
20

expm1
expm1(expr) - Returns exp( expr ) - 1.
Examples:

> SELECT expm1(0);


0.0

factorial
factorial(expr) - Returns the factorial of expr . expr is [0..20]. Otherwise, null.
Examples:

> SELECT factorial(5);


120

filter
filter(expr, func) - Filters the input array using the given predicate.
Examples:

> SELECT filter(array(1, 2, 3), x -> x % 2 == 1);


[1,3]

Since: 2.4.0

find_in_set
find_in_set(str, str_array) - Returns the index (1-based) of the given string ( str ) in the comma-delimited list (
str_array ). Returns 0, if the string was not found or if the given string ( str ) contains a comma.

Examples:
> SELECT find_in_set('ab','abc,b,ab,c,def');
3

first
first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns
only non-null values.

first_value
first_value(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true,
returns only non-null values.

flatten
flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
Examples:

> SELECT flatten(array(array(1, 2), array(3, 4)));


[1,2,3,4]

Since: 2.4.0

float
float(expr) - Casts the value expr to the target data type float .

floor
floor(expr) - Returns the largest integer not greater than expr .
Examples:

> SELECT floor(-0.1);


-1
> SELECT floor(5);
5

format_number
format_number(expr1, expr2) - Formats the number expr1 like ‘#,###,###.##’, rounded to expr2 decimal
places. If expr2 is 0, the result has no decimal point or fractional part. expr2 also accepts a user-specified
format. This is supposed to function like MySQL’s FORMAT.
Examples:

> SELECT format_number(12332.123456, 4);


12,332.1235
> SELECT format_number(12332.123456, '##################.###');
12332.123
format_string
format_string(strfmt, obj, …) - Returns a formatted string from printf-style format strings.
Examples:

> SELECT format_string("Hello World %d %s", 100, "days");


Hello World 100 days

from_json
from_json( jsonStr, schema[, options]) - Returns a struct value with the given jsonStr and schema .
Examples:

> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');

{"a":1, "b":0.8}

> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));

{"time":"2015-08-26 00:00:00.0"}

Since: 2.2.0

from_unixtime
from_unixtime(unix_time, format) - Returns unix_time in the specified format .
Examples:

> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');


1970-01-01 00:00:00

Since: 1.5.0

from_utc_timestamp
from_utc_timestamp(timestamp, timezone) - Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a
time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield
‘2017-07-14 03:40:00.0’.
Examples:

> SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul');


2016-08-31 09:00:00

Since: 1.5.0

get_json_object
get_json_object( json_txt, path) - Extracts a json object from path .
Examples:

> SELECT get_json_object('{"a":"b"}', '$.a');


b

greatest
greatest(expr, …) - Returns the greatest value of all parameters, skipping null values.
Examples:

> SELECT greatest(10, 9, 2, 4, 3);


10

grouping
grouping_id
hash
hash(expr1, expr2, …) - Returns a hash value of the arguments.
Examples:

> SELECT hash('Spark', array(123), 2);


-1321691492

hex
hex(expr) - Converts expr to hexadecimal.
Examples:

> SELECT hex(17);


11
> SELECT hex('Spark SQL');
537061726B2053514C

hour
hour(timestamp) - Returns the hour component of the string/timestamp.
Examples:

> SELECT hour('2009-07-30 12:58:59');


12

Since: 1.5.0
hypot
hypot(expr1, expr2) - Returns sqrt( expr1 **2 + expr2 **2).
Examples:

> SELECT hypot(3, 4);


5.0

if
if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2 ; otherwise returns expr3 .
Examples:

> SELECT if(1 < 2, 'a', 'b');


a

ifnull
ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:

> SELECT ifnull(NULL, array('2'));


["2"]

in
expr1 in(expr2, expr3, …) - Returns true if expr1 equals any of expr2, expr3, ….
Arguments:
expr1, expr2, expr3, … - the arguments must be the same type.
Examples:

> SELECT 1 in(1, 2, 3);


true
> SELECT 1 in(2, 3, 4);
false
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 1), named_struct('a', 1, 'b', 3));
false
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 2), named_struct('a', 1, 'b', 3));
true

initcap
initcap(str) - Returns str with the first letter of each word in uppercase. All other letters are in lowercase.
Words are delimited by white space.
Examples:
> SELECT initcap('sPark sql');
Spark Sql

inline
inline(expr) - Explodes an array of structs into a table.
Examples:

> SELECT inline(array(struct(1, 'a'), struct(2, 'b')));


1 a
2 b

inline_outer
inline_outer(expr) - Explodes an array of structs into a table.
Examples:

> SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')));


1 a
2 b

input_file_block_length
input_file_block_length() - Returns the length of the block being read, or -1 if not available.

input_file_block_start
input_file_block_start() - Returns the start offset of the block being read, or -1 if not available.

input_file_name
input_file_name() - Returns the name of the file being read, or empty string if not available.

instr
instr(str, substr) - Returns the (1-based) index of the first occurrence of substr in str .
Examples:

> SELECT instr('SparkSQL', 'SQL');


6

int
int(expr) - Casts the value expr to the target data type int .

isnan
isnan(expr) - Returns true if expr is NaN, or false otherwise.
Examples:

> SELECT isnan(cast('NaN' as double));


true

isnotnull
isnotnull(expr) - Returns true if expr is not null, or false otherwise.
Examples:

> SELECT isnotnull(1);


true

isnull
isnull(expr) - Returns true if expr is null, or false otherwise.
Examples:

> SELECT isnull(1);


false

java_method
java_method(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:

> SELECT java_method('java.util.UUID', 'randomUUID');


c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT java_method('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

json_tuple
json_tuple( jsonStr, p1, p2, …, pn) - Returns a tuple like the function get_json_object, but it takes multiple names.
All the input parameters and output column types are string.
Examples:

> SELECT json_tuple('{"a":1, "b":2}', 'a', 'b');


1 2

kurtosis
kurtosis(expr) - Returns the kurtosis value calculated from values of a group.

lag
lag(input[, offset[, default]]) - Returns the value of input at the offset th row before the current row in the
window. The default value of offset is 1 and the default value of default is null. If the value of input at the
offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the
window does not have any previous row), default is returned.

last
last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns
only non-null values.

last_day
last_day(date) - Returns the last day of the month which the date belongs to.
Examples:

> SELECT last_day('2009-01-12');


2009-01-31

Since: 1.5.0

last_value
last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true,
returns only non-null values.

lcase
lcase(str) - Returns str with all characters changed to lowercase.
Examples:

> SELECT lcase('SparkSql');


sparksql

lead
lead(input[, offset[, default]]) - Returns the value of input at the offset th row after the current row in the
window. The default value of offset is 1 and the default value of default is null. If the value of input at the
offset th row is null, null is returned. If there is no such an offset row (e.g., when the offset is 1, the last row of
the window does not have any subsequent row), default is returned.

least
least(expr, …) - Returns the least value of all parameters, skipping null values.
Examples:

> SELECT least(10, 9, 2, 4, 3);


2

left
left(str, len) - Returns the leftmost len ( len can be string type) characters from the string str . If len is less
than or equal to 0, the result is an empty string.
Examples:

> SELECT left('Spark SQL', 3);


Spa

length
length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string
data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:

> SELECT length('Spark SQL ');


10
> SELECT CHAR_LENGTH('Spark SQL ');
10
> SELECT CHARACTER_LENGTH('Spark SQL ');
10

levenshtein
levenshtein(str1, str2) - Returns the Levenshtein distance between the two given strings.
Examples:

> SELECT levenshtein('kitten', 'sitting');


3

like
str like pattern - Returns true if str matches pattern, null if any arguments are null, false otherwise.
Arguments:
str - a string expression
pattern - a string expression. The pattern is a string which is matched literally, with exception to the
following special symbols:
_ matches any one character in the input (similar to . in posix regular expressions)
% matches zero or more characters in the input (similar to .* in posix regular expressions)
The escape character is ‘’. If an escape character precedes a special symbol or another escape character,
the following character is matched literally. It is invalid to escape any other character.
Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order to match “\abc”, the
pattern should be “\\abc”.
When SQL config ‘spark.sql.parser.escapedStringLiterals’ is enabled, it falls back to Spark 1.6 behavior
regarding string literal parsing. For example, if the config is enabled, the pattern to match “\abc” should be
“\abc”.
Examples:
> SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\\Users%'
true

Note:
Use RLIKE to match with standard regular expressions.

ln
ln(expr) - Returns the natural logarithm (base e) of expr .
Examples:

> SELECT ln(1);


0.0

locate
locate(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos . The
given pos and return value are 1-based.
Examples:

> SELECT locate('bar', 'foobarbar');


4
> SELECT locate('bar', 'foobarbar', 5);
7
> SELECT POSITION('bar' IN 'foobarbar');
4

log
log(base, expr) - Returns the logarithm of expr with base .
Examples:

> SELECT log(10, 100);


2.0

log10
log10(expr) - Returns the logarithm of expr with base 10.
Examples:

> SELECT log10(10);


1.0

log1p
log1p(expr) - Returns log(1 + expr ).
Examples:
> SELECT log1p(0);
0.0

log2
log2(expr) - Returns the logarithm of expr with base 2.
Examples:

> SELECT log2(2);


1.0

lower
lower(str) - Returns str with all characters changed to lowercase.
Examples:

> SELECT lower('SparkSql');


sparksql

lpad
lpad(str, len, pad) - Returns str , left-padded with pad to a length of len . If str is longer than len , the
return value is shortened to len characters.
Examples:

> SELECT lpad('hi', 5, '??');


???hi
> SELECT lpad('hi', 1, '??');
h

ltrim
ltrim(str) - Removes the leading space characters from str .
ltrim(trimStr, str) - Removes the leading string which contains the characters from the trim string
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:

> SELECT ltrim(' SparkSQL ');


SparkSQL
> SELECT ltrim('Sp', 'SSparkSQLS');
arkSQLS

map
map(key0, value0, key1, value1, …) - Creates a map with the given key/value pairs.
Examples:

> SELECT map(1.0, '2', 3.0, '4');

{1.0:"2",3.0:"4"}

map_concat
map_concat(map, …) - Returns the union of all the given maps
Examples:

> SELECT map_concat(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'));

{1:"a",2:"c",3:"d"}

Since: 2.4.0

map_from_arrays
map_from_arrays(keys, values) - Creates a map with a pair of the given key/value arrays. All elements in keys
should not be null
Examples:

> SELECT map_from_arrays(array(1.0, 3.0), array('2', '4'));

{1.0:"2",3.0:"4"}

Since: 2.4.0

map_from_entries
map_from_entries(arrayOfEntries) - Returns a map created from the given array of entries.
Examples:

> SELECT map_from_entries(array(struct(1, 'a'), struct(2, 'b')));

{1:"a",2:"b"}

Since: 2.4.0

map_keys
map_keys(map) - Returns an unordered array containing the keys of the map.
Examples:

> SELECT map_keys(map(1, 'a', 2, 'b'));


[1,2]

map_values
map_values(map) - Returns an unordered array containing the values of the map.
Examples:

> SELECT map_values(map(1, 'a', 2, 'b'));


["a","b"]

max
max(expr) - Returns the maximum value of expr .

md5
md5(expr) - Returns an MD5 128-bit checksum as a hex string of expr .
Examples:

> SELECT md5('Spark');


8cde774d6f7333752ed72cacddb05126

mean
mean(expr) - Returns the mean calculated from values of a group.

min
min(expr) - Returns the minimum value of expr .

minute
minute(timestamp) - Returns the minute component of the string/timestamp.
Examples:

> SELECT minute('2009-07-30 12:58:59');


58

Since: 1.5.0

mod
expr1 mod expr2 - Returns the remainder after expr1 / expr2 .
Examples:
> SELECT 2 mod 1.8;
0.2
> SELECT MOD(2, 1.8);
0.2

monotonically_increasing_id
monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is
guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts
the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition.
The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion
records. The function is non-deterministic because its result depends on partition IDs.

month
month(date) - Returns the month component of the date/timestamp.
Examples:

> SELECT month('2016-07-30');


7

Since: 1.5.0

months_between
months_between(timestamp1, timestamp2[, roundOff]) - If timestamp1 is later than timestamp2 , then the result
is positive. If timestamp1 and timestamp2 are on the same day of month, or both are the last day of month, time
of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8
digits unless roundOff=false.
Examples:

> SELECT months_between('1997-02-28 10:30:00', '1996-10-30');


3.94959677
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false);
3.9495967741935485

Since: 1.5.0

named_struct
named_struct(name1, val1, name2, val2, …) - Creates a struct with the given field names and values.
Examples:

> SELECT named_struct("a", 1, "b", 2, "c", 3);

{"a":1,"b":2,"c":3}

nanvl
nanvl(expr1, expr2) - Returns expr1 if it’s not NaN, or expr2 otherwise.
Examples:

> SELECT nanvl(cast('NaN' as double), 123);


123.0

negative
negative(expr) - Returns the negated value of expr .
Examples:

> SELECT negative(1);


-1

next_day
next_day(start_date, day_of_week) - Returns the first date which is later than start_date and named as
indicated.
Examples:

> SELECT next_day('2015-01-14', 'TU');


2015-01-20

Since: 1.5.0

not
not expr - Logical not.

now
now() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0

ntile
ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n .

nullif
nullif(expr1, expr2) - Returns null if expr1 equals to expr2 , or expr1 otherwise.
Examples:

> SELECT nullif(2, 2);


NULL

nvl
nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:

> SELECT nvl(NULL, array('2'));


["2"]

nvl2
nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.
Examples:

> SELECT nvl2(NULL, 2, 1);


1

octet_length
octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.
Examples:

> SELECT octet_length('Spark SQL');


9

or
expr1 or expr2 - Logical OR.

parse_url
parse_url(url, partToExtract[, key]) - Extracts a part from a URL.
Examples:

> SELECT parse_url('https://spark.apache.org/path?query=1', 'HOST')


spark.apache.org
> SELECT parse_url('https://spark.apache.org/path?query=1', 'QUERY')
query=1
> SELECT parse_url('https://spark.apache.org/path?query=1', 'QUERY', 'query')
1

percent_rank
percent_rank() - Computes the percentage ranking of a value in a group of values.

percentile
percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given
percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive
integral
percentile(col, array(percentage1 [, percentage2]…) [, frequency]) - Returns the exact percentile value array of
numeric column col at the given percentage(s). Each value of the percentage array must be between 0.0 and
1.0. The value of frequency should be positive integral

percentile_approx
percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column
col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter
(default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory.
Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation. When
percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the
approximate percentile array of column col at the given percentage array.
Examples:

> SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);


[10.0,10.0,10.0]
> SELECT percentile_approx(10.0, 0.5, 100);
10.0

pi
pi() - Returns pi.
Examples:

> SELECT pi();


3.141592653589793

pmod
pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2 .
Examples:

> SELECT pmod(10, 3);


1
> SELECT pmod(-10, 3);
2

posexplode
posexplode(expr) - Separates the elements of array expr into multiple rows with positions, or the elements of
map expr into multiple rows and columns with positions.
Examples:

> SELECT posexplode(array(10,20));


0 10
1 20

posexplode_outer
posexplode_outer(expr) - Separates the elements of array expr into multiple rows with positions, or the
elements of map expr into multiple rows and columns with positions.
Examples:

> SELECT posexplode_outer(array(10,20));


0 10
1 20

position
position(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos .
The given pos and return value are 1-based.
Examples:

> SELECT position('bar', 'foobarbar');


4
> SELECT position('bar', 'foobarbar', 5);
7
> SELECT POSITION('bar' IN 'foobarbar');
4

positive
positive(expr) - Returns the value of expr .

pow
pow(expr1, expr2) - Raises expr1 to the power of expr2 .
Examples:

> SELECT pow(2, 3);


8.0

power
power(expr1, expr2) - Raises expr1 to the power of expr2 .
Examples:

> SELECT power(2, 3);


8.0

printf
printf(strfmt, obj, …) - Returns a formatted string from printf-style format strings.
Examples:

> SELECT printf("Hello World %d %s", 100, "days");


Hello World 100 days
quarter
quarter(date) - Returns the quarter of the year for date, in the range 1 to 4.
Examples:

> SELECT quarter('2016-08-31');


3

Since: 1.5.0

radians
radians(expr) - Converts degrees to radians.
Arguments:
expr - angle in degrees
Examples:

> SELECT radians(180);


3.141592653589793

rand
rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed
values in [0, 1).
Examples:

> SELECT rand();


0.9629742951434543
> SELECT rand(0);
0.8446490682263027
> SELECT rand(null);
0.8446490682263027
The function is non-deterministic in general.

randn
randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from
the standard normal distribution.
Examples:

> SELECT randn();


-0.3254147983080288
> SELECT randn(0);
1.1164209726833079
> SELECT randn(null);
1.1164209726833079
The function is non-deterministic in general.

rank
rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding
or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.

reflect
reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:

> SELECT reflect('java.util.UUID', 'randomUUID');


c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

regexp_extract
regexp_extract(str, regexp[, idx]) - Extracts a group that matches regexp .
Examples:

> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);


100

regexp_replace
regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep .
Examples:

> SELECT regexp_replace('100-200', '(\\d+)', 'num');


num-num

repeat
repeat(str, n) - Returns the string which repeats the given string value n times.
Examples:

> SELECT repeat('123', 2);


123123

replace
replace(str, search[, replace]) - Replaces all occurrences of search with replace .
Arguments:
str - a string expression
search - a string expression. If search is not found in str , str is returned unchanged.
replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that
is removed from str .
Examples:
> SELECT replace('ABCabc', 'abc', 'DEF');
ABCDEF

reverse
reverse(array) - Returns a reversed string or an array with reverse order of elements.
Examples:

> SELECT reverse('Spark SQL');


LQS krapS
> SELECT reverse(array(2, 1, 4, 3));
[3,4,1,2]

The reverse logic for arrays is available since 2.4.0.
Since: 1.5.0

right
right(str, len) - Returns the rightmost len ( len can be string type) characters from the string str . If len is less
than or equal to 0, the result is an empty string.
Examples:

> SELECT right('Spark SQL', 3);


SQL

rint
rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical
integer.
Examples:

> SELECT rint(12.3456);


12.0

rlike
str rlike regexp - Returns true if str matches regexp , or false otherwise.
Arguments:
str - a string expression
regexp - a string expression. The pattern string should be a Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to
match “\abc”, a regular expression for regexp can be “^\\abc$”.
There is a SQL config ‘spark.sql.parser.escapedStringLiterals’ that can be used to fall back to the Spark 1.6
behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match
“\abc” is “^\abc$”.
Examples:
When spark.sql.parser.escapedStringLiterals is disabled (default).
> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
true

When spark.sql.parser.escapedStringLiterals is enabled.


> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'
true

Note:
Use LIKE to match with simple string pattern.

rollup
round
round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.
Examples:

> SELECT round(2.5, 0);


3.0

row_number
row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering
of rows within the window partition.

rpad
rpad(str, len, pad) - Returns str , right-padded with pad to a length of len . If str is longer than len , the
return value is shortened to len characters.
Examples:

> SELECT rpad('hi', 5, '??');


hi???
> SELECT rpad('hi', 1, '??');
h

rtrim
rtrim(str) - Removes the trailing space characters from str .
rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the str

Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT rtrim(' SparkSQL ');
SparkSQL
> SELECT rtrim('LQSa', 'SSparkSQLS');
SSpark

schema_of_json
schema_of_json( json[, options]) - Returns schema in the DDL format of JSON string.
Examples:

> SELECT schema_of_json('[{"col":0}]');


array<struct<col:int>>

Since: 2.4.0

second
second(timestamp) - Returns the second component of the string/timestamp.
Examples:

> SELECT second('2009-07-30 12:58:59');


59

Since: 1.5.0

sentences
sentences(str[, lang, country]) - Splits str into an array of arrays of words.
Examples:

> SELECT sentences('Hi there! Good morning.');


[["Hi","there"],["Good","morning"]]

sequence
sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step.
The type of the returned elements is the same as the type of argument expressions.
Supported types are: byte, short, integer, long, date, timestamp.
The start and stop expressions must resolve to the same type. If start and stop expressions resolve to the ‘date’
or ‘timestamp’ type then the step expression must resolve to the ‘interval’ type, otherwise to the same type as
the start and stop expressions.
Arguments:
start - an expression. The start of the range.
stop - an expression. The end of the range (inclusive).
step - an optional expression. The step of the range. By default step is 1 if start is less than or equal to stop,
otherwise -1. For the temporal sequences it’s 1 day and -1 day respectively. If start is greater than stop then
the step must be negative, and vice versa.
Examples:

> SELECT sequence(1, 5);


[1,2,3,4,5]
> SELECT sequence(5, 1);
[5,4,3,2,1]
> SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month);
[2018-01-01,2018-02-01,2018-03-01]

Since: 2.4.0

sha
sha(expr) - Returns a sha1 hash value as a hex string of the expr .
Examples:

> SELECT sha('Spark');


85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c

sha1
sha1(expr) - Returns a sha1 hash value as a hex string of the expr .
Examples:

> SELECT sha1('Spark');


85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c

sha2
sha2(expr, bitLength) - Returns a checksum of SHA-2 family as a hex string of expr . SHA-224, SHA-256, SHA-
384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.
Examples:

> SELECT sha2('Spark', 256);


529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b

shiftleft
shiftleft(base, expr) - Bitwise left shift.
Examples:

> SELECT shiftleft(2, 1);


4

shiftright
shiftright(base, expr) - Bitwise (signed) right shift.
Examples:
> SELECT shiftright(4, 1);
2

shiftrightunsigned
shiftrightunsigned(base, expr) - Bitwise unsigned right shift.
Examples:

> SELECT shiftrightunsigned(4, 1);


2

shuffle
shuffle(array) - Returns a random permutation of the given array.
Examples:

> SELECT shuffle(array(1, 20, 3, 5));


[3,1,5,20]
> SELECT shuffle(array(1, 20, null, 3));
[20,null,3,1]

The function is non-deterministic.
Since: 2.4.0

sign
sign(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.
Examples:

> SELECT sign(40);


1.0

signum
signum(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.
Examples:

> SELECT signum(40);


1.0

sin
sin(expr) - Returns the sine of expr , as if computed by java.lang.Math.sin .
Arguments:
expr - angle in radians
Examples:
> SELECT sin(0);
0.0

sinh
sinh(expr) - Returns hyperbolic sine of expr , as if computed by java.lang.Math.sinh .
Arguments:
expr - hyperbolic angle
Examples:

> SELECT sinh(0);


0.0

size
size(expr) - Returns the size of an array or a map. The function returns -1 if its input is null and
spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for
null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.
Examples:

> SELECT size(array('b', 'd', 'c', 'a'));


4
> SELECT size(map('a', 1, 'b', 2));
2
> SELECT size(NULL);
-1

skewness
skewness(expr) - Returns the skewness value calculated from values of a group.

slice
slice(x, start, length) - Subsets array x starting from index start (or starting from the end if start is negative) with
the specified length.
Examples:

> SELECT slice(array(1, 2, 3, 4), 2, 2);


[2,3]
> SELECT slice(array(1, 2, 3, 4), -2, 2);
[3,4]

Since: 2.4.0

smallint
smallint(expr) - Casts the value expr to the target data type smallint .
sort_array
sort_array(array[, ascendingOrder]) - Sorts the input array in ascending or descending order according to the
natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in
ascending order or at the end of the returned array in descending order.
Examples:

> SELECT sort_array(array('b', 'd', null, 'c', 'a'), true);


[null,"a","b","c","d"]

soundex
soundex(str) - Returns Soundex code of the string.
Examples:

> SELECT soundex('Miller');


M460

space
space(n) - Returns a string consisting of n spaces.
Examples:

> SELECT concat(space(2), '1');


1

spark_partition_id
spark_partition_id() - Returns the current partition id.

split
split(str, regex) - Splits str around occurrences that match regex .
Examples:

> SELECT split('oneAtwoBthreeC', '[ABC]');


["one","two","three",""]

sqrt
sqrt(expr) - Returns the square root of expr .
Examples:

> SELECT sqrt(4);


2.0
stack
stack(n, expr1, …, exprk) - Separates expr1 , …, exprk into n rows.
Examples:

> SELECT stack(2, 1, 2, 3);


1 2
3 NULL

std
std(expr) - Returns the sample standard deviation calculated from values of a group.

stddev
stddev(expr) - Returns the sample standard deviation calculated from values of a group.

stddev_pop
stddev_pop(expr) - Returns the population standard deviation calculated from values of a group.

stddev_samp
stddev_samp(expr) - Returns the sample standard deviation calculated from values of a group.

str_to_map
str_to_map(text[, pairDelim[, keyValueDelim]]) - Creates a map after splitting the text into key/value pairs using
delimiters. Default delimiters are ‘,’ for pairDelim and ‘:’ for keyValueDelim .
Examples:

> SELECT str_to_map('a:1,b:2,c:3', ',', ':');


map("a":"1","b":"2","c":"3")
> SELECT str_to_map('a');
map("a":null)

string
string(expr) - Casts the value expr to the target data type string .

struct
struct(col1, col2, col3, …) - Creates a struct with the given field values.
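A minimal illustration; when the fields are not aliased, Spark auto-names them col1, col2, and so on (the exact rendering of the struct value can vary by client):

> SELECT struct(1, 2, 3);
{"col1":1,"col2":2,"col3":3}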

substr
substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len , or the slice of byte
array that starts at pos and is of length len .
Examples:
> SELECT substr('Spark SQL', 5);
k SQL
> SELECT substr('Spark SQL', -3);
SQL
> SELECT substr('Spark SQL', 5, 1);
k

substring
substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len , or the slice of
byte array that starts at pos and is of length len .
Examples:

> SELECT substring('Spark SQL', 5);


k SQL
> SELECT substring('Spark SQL', -3);
SQL
> SELECT substring('Spark SQL', 5, 1);
k

substring_index
substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter
delim . If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If
count is negative, everything to the right of the final delimiter (counting from the right) is returned. The
function substring_index performs a case-sensitive match when searching for delim .
Examples:

> SELECT substring_index('www.apache.org', '.', 2);


www.apache

sum
sum(expr) - Returns the sum calculated from values of a group.
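A short illustration using an inline VALUES relation:

> SELECT sum(col) FROM VALUES (5), (10), (15) AS tab(col);
30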

tan
tan(expr) - Returns the tangent of expr , as if computed by java.lang.Math.tan .
Arguments:
expr - angle in radians
Examples:

> SELECT tan(0);


0.0

tanh
tanh(expr) - Returns the hyperbolic tangent of expr , as if computed by java.lang.Math.tanh .
Arguments:
expr - hyperbolic angle
Examples:

> SELECT tanh(0);


0.0

timestamp
timestamp(expr) - Casts the value expr to the target data type timestamp .

tinyint
tinyint(expr) - Casts the value expr to the target data type tinyint .

to_date
to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with
invalid input. By default, it follows casting rules to a date if the fmt is omitted.
Examples:

> SELECT to_date('2009-07-30 04:17:52');


2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31

Since: 1.5.0

to_json
to_json(expr[, options]) - Returns a JSON string with a given struct value
Examples:

> SELECT to_json(named_struct('a', 1, 'b', 2));

{"a":1,"b":2}
> SELECT to_json(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
{"time":"26/08/2015"}
> SELECT to_json(array(named_struct('a', 1, 'b', 2)));
[{"a":1,"b":2}]
> SELECT to_json(map('a', named_struct('b', 1)));
{"a":{"b":1}}
> SELECT to_json(map(named_struct('a', 1),named_struct('b', 2)));
{"[1]":{"b":2}}
> SELECT to_json(map('a', 1));
{"a":1}
> SELECT to_json(array((map('a', 1))));
[{"a":1}]

Since: 2.2.0

to_timestamp
to_timestamp(timestamp[, fmt]) - Parses the timestamp expression with the fmt expression to a timestamp.
Returns null with invalid input. By default, it follows casting rules to a timestamp if the fmt is omitted.
Examples:

> SELECT to_timestamp('2016-12-31 00:12:00');


2016-12-31 00:12:00
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00

Since: 2.2.0

to_unix_timestamp
to_unix_timestamp(expr[, pattern]) - Returns the UNIX timestamp of the given time.
Examples:

> SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd');


1460041200

Since: 1.6.0

to_utc_timestamp
to_utc_timestamp(timestamp, timezone) - Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time
in the given time zone, and renders that time as a timestamp in UTC. For example, ‘GMT+1’ would yield ‘2017-
07-14 01:40:00.0’.
Examples:

> SELECT to_utc_timestamp('2016-08-31', 'Asia/Seoul');


2016-08-30 15:00:00
Since: 1.5.0

transform
transform(expr, func) - Transforms elements in an array using the function.
Examples:

> SELECT transform(array(1, 2, 3), x -> x + 1);


[2,3,4]
> SELECT transform(array(1, 2, 3), (x, i) -> x + i);
[1,3,5]

Since: 2.4.0

translate
translate(input, from, to) - Translates the input string by replacing the characters present in the from string
with the corresponding characters in the to string.
Examples:

> SELECT translate('AaBbCc', 'abc', '123');


A1B2C3

trim
trim(str) - Removes the leading and trailing space characters from str .
trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str

trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str

trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str

Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string
Examples:

> SELECT trim(' SparkSQL ');


SparkSQL
> SELECT trim('SL', 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ
trunc
trunc(date, fmt) - Returns date with the time portion of the day truncated to the unit specified by the format
model fmt . fmt should be one of [“year”, “yyyy”, “yy”, “mon”, “month”, “mm”]
Examples:

> SELECT trunc('2009-02-12', 'MM');


2009-02-01
> SELECT trunc('2015-10-27', 'YEAR');
2015-01-01

Since: 1.5.0

ucase
ucase(str) - Returns str with all characters changed to uppercase.
Examples:

> SELECT ucase('SparkSql');


SPARKSQL

unbase64
unbase64(str) - Converts the argument from a base 64 string str to a binary.
Examples:

> SELECT unbase64('U3BhcmsgU1FM');


Spark SQL

unhex
unhex(expr) - Converts hexadecimal expr to binary.
Examples:

> SELECT decode(unhex('537061726B2053514C'), 'UTF-8');


Spark SQL

unix_timestamp
unix_timestamp([expr[, pattern]]) - Returns the UNIX timestamp of current or specified time.
Examples:

> SELECT unix_timestamp();


1476884637
> SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460041200

Since: 1.5.0
upper
upper(str) - Returns str with all characters changed to uppercase.
Examples:

> SELECT upper('SparkSql');


SPARKSQL

uuid
uuid() - Returns a universally unique identifier (UUID) string. The value is returned as a canonical 36-character UUID string.
Examples:

> SELECT uuid();


46707d92-02f4-4817-8116-a4c3b23e6266
The function is non-deterministic.

var_pop
var_pop(expr) - Returns the population variance calculated from values of a group.

var_samp
var_samp(expr) - Returns the sample variance calculated from values of a group.

variance
variance(expr) - Returns the sample variance calculated from values of a group.
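A short illustration; the sample variance of 1, 2, and 3 is 1.0:

> SELECT variance(col) FROM VALUES (1), (2), (3) AS tab(col);
1.0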

weekday
weekday(date) - Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, …, 6 = Sunday).
Examples:

> SELECT weekday('2009-07-30');


3

Since: 2.4.0

weekofyear
weekofyear(date) - Returns the week of the year of the given date. A week is considered to start on a Monday
and week 1 is the first week with >3 days.
Examples:

> SELECT weekofyear('2008-02-20');


8

Since: 1.5.0
when
CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END - When expr1 = true, returns
expr2 ; else when expr3 = true, returns expr4 ; else returns expr5 .

Arguments:
expr1, expr3 - the branch condition expressions should all be boolean type.
expr2, expr4, expr5 - the branch value expressions and else value expression should all be same type or
coercible to a common type.
Examples:

> SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
1
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
2
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;
NULL

window
xpath
xpath(xml, xpath) - Returns a string array of values within the nodes of xml that match the XPath expression.
Examples:

> SELECT xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b/text()');


['b1','b2','b3']

xpath_boolean
xpath_boolean(xml, xpath) - Returns true if the XPath expression evaluates to true, or if a matching node is
found.
Examples:

> SELECT xpath_boolean('<a><b>1</b></a>','a/b');


true

xpath_double
xpath_double(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is
found but the value is non-numeric.
Examples:

> SELECT xpath_double('<a><b>1</b><b>2</b></a>', 'sum(a/b)');


3.0
xpath_float
xpath_float(xml, xpath) - Returns a float value, the value zero if no match is found, or NaN if a match is found but
the value is non-numeric.
Examples:

> SELECT xpath_float('<a><b>1</b><b>2</b></a>', 'sum(a/b)');


3.0

xpath_int
xpath_int(xml, xpath) - Returns an integer value, or the value zero if no match is found, or if a match is found but
the value is non-numeric.
Examples:

> SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)');


3

xpath_long
xpath_long(xml, xpath) - Returns a long integer value, or the value zero if no match is found, or if a match is found
but the value is non-numeric.
Examples:

> SELECT xpath_long('<a><b>1</b><b>2</b></a>', 'sum(a/b)');


3

xpath_number
xpath_number(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is
found but the value is non-numeric.
Examples:

> SELECT xpath_number('<a><b>1</b><b>2</b></a>', 'sum(a/b)');


3.0

xpath_short
xpath_short(xml, xpath) - Returns a short integer value, or the value zero if no match is found, or if a match is
found but the value is non-numeric.
Examples:

> SELECT xpath_short('<a><b>1</b><b>2</b></a>', 'sum(a/b)');


3

xpath_string
xpath_string(xml, xpath) - Returns the text contents of the first xml node that matches the XPath expression.
Examples:

> SELECT xpath_string('<a><b>b</b><c>cc</c></a>','a/c');


cc

year
year(date) - Returns the year component of the date/timestamp.
Examples:

> SELECT year('2016-07-30');


2016

Since: 1.5.0

zip_with
zip_with(left, right, func) - Merges the two given arrays, element-wise, into a single array using function. If one
array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.
Examples:

> SELECT zip_with(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x));

[{"y":"a","x":1},{"y":"b","x":2},{"y":"c","x":3}]

> SELECT zip_with(array(1, 2), array(3, 4), (x, y) -> x + y);

[4,6]

> SELECT zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y));

["ad","be","cf"]

Since: 2.4.0

|
expr1 | expr2 - Returns the result of bitwise OR of expr1 and expr2 .
Examples:

> SELECT 3 | 5;
7
~
~ expr - Returns the result of bitwise NOT of expr .
Examples:

> SELECT ~ 0;
-1
Grant

GRANT
privilege_type [, privilege_type ] ...
ON (CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name>
| ANONYMOUS FUNCTION | ANY FILE)
TO principal

privilege_type
: SELECT | CREATE | MODIFY | READ_METADATA | CREATE_NAMED_FUNCTION | ALL PRIVILEGES

principal
: `<user>@<domain-name>` | <group-name>

Grant a privilege on an object to a user or principal. Granting a privilege on a database (for example a SELECT
privilege) has the effect of implicitly granting that privilege on all objects in that database. Granting a specific
privilege on the catalog has the effect of implicitly granting that privilege on all databases in the catalog.
To grant a privilege to all users, specify the keyword users after TO .

Examples
GRANT SELECT ON DATABASE <database-name> TO `<user>@<domain-name>`
GRANT SELECT ON ANONYMOUS FUNCTION TO `<user>@<domain-name>`
GRANT SELECT ON ANY FILE TO `<user>@<domain-name>`

View-based access control


You can configure fine-grained access control (to rows and columns matching specific conditions, for example)
by granting access to derived views that contain arbitrary queries.
Examples

CREATE OR REPLACE VIEW <view-name> AS SELECT columnA, columnB FROM <table-name> WHERE columnC > 1000;
GRANT SELECT ON VIEW <view-name> TO `<user>@<domain-name>`;

For details on required table ownership, see Frequently asked questions (FAQ).
Insert

Insert from select queries


INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] select_statement

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

part_spec:
: (part_col_name1=val1 [, part_col_name2=val2, ...])

Insert data into a table or a partition from the result table of a select statement. Data is inserted by ordinal
(ordering of columns) and not by names.

NOTE
(Delta Lake on Azure Databricks) If a column has a NOT NULL constraint, and an INSERT INTO statement sets a column
value to NULL , a SparkException is thrown.

OVERWRITE

Overwrite existing data in the table or the partition. Otherwise, new data is appended.
Examples

-- Creates a partitioned native parquet table


CREATE TABLE data_source_tab1 (col1 INT, p1 INT, p2 INT)
USING PARQUET PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)


INSERT INTO data_source_tab1 PARTITION (p1 = 3, p2 = 4)
SELECT id FROM RANGE(1, 3)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows


INSERT OVERWRITE TABLE default.data_source_tab1 PARTITION (p1 = 3, p2 = 4)
SELECT id FROM RANGE(3, 5)

Insert values into tables


INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

values_row:
: (val1 [, val2, ...])

Insert data into a table or a partition from a row value list.


OVERWRITE

Overwrite existing data in the table or the partition. Otherwise, new data is appended.
Examples

-- Creates a partitioned hive serde table (using the HiveQL syntax)


CREATE TABLE hive_serde_tab1 (col1 INT, p1 INT, p2 INT)
USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)


INSERT INTO hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
VALUES (1), (2)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows


INSERT OVERWRITE TABLE hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
VALUES (3), (4)

Dynamic partition inserts


In part_spec , the partition column values are optional. When the partition specification part_spec is not
completely provided, such inserts are called dynamic partition inserts or multi-partition inserts. When the
values are not specified, these columns are referred to as dynamic partition columns; otherwise, they are static
partition columns. For example, the partition spec (p1 = 3, p2, p3) has a static partition column ( p1 ) and two
dynamic partition columns ( p2 and p3 ).
In part_spec , static partition keys must come before the dynamic partition keys. This means all partition
columns having constant values must appear before other partition columns that do not have an assigned
constant value.
The partition values of dynamic partition columns are determined during the execution. The dynamic partition
columns must be specified last in both part_spec and the input result set (of the row value lists or the select
query). They are resolved by position, instead of by names. Thus, the orders must be exactly matched.
The DataFrameWriter APIs do not have an interface to specify partition values. Therefore, the insertInto() API
is always using dynamic partition mode.

IMPORTANT
In the dynamic partition mode, the input result set could result in a large number of dynamic partitions, and thus
generate a large number of partition directories.
OVERWRITE

The semantics are different based on the type of the target table.
Hive SerDe tables: INSERT OVERWRITE doesn’t delete partitions ahead of time; it only overwrites those partitions that have
data written into them at runtime. This matches Apache Hive semantics. For Hive SerDe tables, Spark SQL respects the
Hive-related configuration, including hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode .
Native data source tables: INSERT OVERWRITE first deletes all the partitions that match the partition specification (e.g.,
PARTITION(a=1, b)) and then inserts all the remaining values. The behavior of native data source tables can be
changed to be consistent with Hive SerDe tables by changing the session-specific configuration
spark.sql.sources.partitionOverwriteMode to DYNAMIC . The default mode is STATIC .
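For example, to switch the current session to the Hive-compatible dynamic overwrite behavior described above (a minimal sketch):

SET spark.sql.sources.partitionOverwriteMode = DYNAMIC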

Examples
-- Create a partitioned native Parquet table
CREATE TABLE data_source_tab2 (col1 INT, p1 STRING, p2 STRING)
USING PARQUET PARTITIONED BY (p1, p2)

-- Two partitions ('part1', 'part1') and ('part1', 'part2') are created by this dynamic insert.
-- The dynamic partition column p2 is resolved by the last column `'part' || id`
INSERT INTO data_source_tab2 PARTITION (p1 = 'part1', p2)
SELECT id, 'part' || id FROM RANGE(1, 3)

-- A new partition ('partNew1', 'partNew2') is added by this INSERT OVERWRITE.


INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'partNew1', p2)
VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, the two partitions ('part1', 'part1') and ('part1', 'part2') are dropped,
-- because both partitions are included by (p1 = 'part1', p2).
-- Then, two partitions ('partNew1', 'partNew2'), ('part1', 'part1') exist after this operation.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'part1', p2)
VALUES (5, 'part1')

-- Create and fill a partitioned hive serde table with three partitions:
-- ('part1', 'part1'), ('part1', 'part2') and ('partNew1', 'partNew2')
CREATE TABLE hive_serde_tab2 (col1 INT, p1 STRING, p2 STRING)
USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)
INSERT INTO hive_serde_tab2 PARTITION (p1 = 'part1', p2)
SELECT id, 'part' || id FROM RANGE(1, 3)
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'partNew1', p2)
VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, only the partition ('part1', 'part1') is overwritten by the new value.
-- All the three partitions still exist.
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'part1', p2)
VALUES (5, 'part1')

Insert values into directory


INSERT OVERWRITE [LOCAL] DIRECTORY [directory_path]
USING data_source [OPTIONS (key1=val1, key2=val2, ...)]
[AS] SELECT ... FROM ...

Insert the query results of select_statement into a directory directory_path using Spark native format. If the
specified path exists, it is replaced with the output of the select_statement .
DIRECTORY

The path of the destination directory of the insert. The directory can also be specified in OPTIONS using the key
path . If the specified path exists, it is replaced with the output of the select_statement . If LOCAL is used, the
directory is on the local file system.
USING

The file format to use for the insert. One of TEXT , CSV , JSON , JDBC , PARQUET , ORC , HIVE , and LIBSVM , or a
fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister .
AS

Populate the destination directory with input data from the select statement.
Examples
INSERT OVERWRITE DIRECTORY
USING parquet
OPTIONS ('path' '/tmp/destination/path')
SELECT key, col1, col2 FROM source_table

INSERT OVERWRITE DIRECTORY '/tmp/destination/path'


USING json
SELECT 1 as a, 'c' as b

Insert values into directory with Hive format


INSERT OVERWRITE [LOCAL] DIRECTORY directory_path
[ROW FORMAT row_format] [STORED AS file_format]
[AS] select_statement

Insert the query results of select_statement into a directory directory_path using Hive SerDe. If the specified
path exists, it is replaced with the output of the select_statement .

NOTE
This command is supported only when Hive support is enabled.

DIRECTORY

The path of the destination directory of the insert. If the specified path exists, it will be replaced with the output
of the select_statement . If LOCAL is used, the directory is on the local file system.
ROW FORMAT

Use the SERDE clause to specify a custom SerDe for this insert. Otherwise, use the DELIMITED clause to use the
native SerDe and specify the delimiter, escape character, null character, and so on.
STORED AS

The file format for this insert. One of TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET , and AVRO . Alternatively,
you can specify your own input and output format through INPUTFORMAT and OUTPUTFORMAT . Only TEXTFILE ,
SEQUENCEFILE , and RCFILE can be used with ROW FORMAT SERDE , and only TEXTFILE can be used with
ROW FORMAT DELIMITED .

AS

Populate the destination directory with input data from the select statement.
Examples

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination/path'


STORED AS orc
SELECT * FROM source_table where key < 10
Load Data

LOAD DATA [LOCAL] INPATH path [OVERWRITE] INTO TABLE [db_name.]table_name [PARTITION part_spec]

part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)

Load data from a file into a table or a partition in the table. The target table must not be temporary. A partition
spec must be provided if and only if the target table is partitioned.

NOTE
This is supported only for tables created using the Hive format.

LOCAL

Load the path from the local file system. Otherwise, the default file system is used.
OVERWRITE

Delete existing data in the table. Otherwise, new data is appended to the table.
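Examples
The following sketch loads a local directory into a partition of a Hive-format table, replacing that partition’s existing data; the path, table name, and partition value are hypothetical:

LOAD DATA LOCAL INPATH '/tmp/export/students_grade1' OVERWRITE INTO TABLE students PARTITION (grade=1)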
Merge Into (Delta Lake on Azure Databricks)

Merge a set of updates, insertions, and deletions based on a source table into a target Delta table.

MERGE INTO [db_name.]target_table [AS target_alias]


USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]

where

<matched_action> =
DELETE |
UPDATE SET * |
UPDATE SET column1 = value1 [, column2 = value2 ...]

<not_matched_action> =
INSERT * |
INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

<time_travel_version> =
TIMESTAMP AS OF timestamp_expression |
VERSION AS OF version

In Databricks Runtime 5.5 LTS and 6.x, MERGE can have at most 2 WHEN MATCHED clauses and at most 1
WHEN NOT MATCHED clause.
WHEN MATCHED clauses are executed when a source row matches a target table row based on the match
condition. These clauses have the following semantics.
WHEN MATCHED clauses can have at most one UPDATE and one DELETE action. The UPDATE action in
merge only updates the specified columns of the matched target row. The DELETE action will delete
the matched row.
Each WHEN MATCHED clause can have an optional condition. If this clause condition exists, the UPDATE or
DELETE action is executed for any matching source-target row pair row only when the clause
condition is true.
If there are multiple WHEN MATCHED clauses, then they are evaluated in the order they are specified (that is,
the order of the clauses matters). All WHEN MATCHED clauses, except the last one, must have conditions.
If both WHEN MATCHED clauses have conditions and neither of the conditions are true for a matching
source-target row pair, then the matched target row is left unchanged.
To update all the columns of the target Delta table with the corresponding columns of the source
dataset, use UPDATE SET * . This is equivalent to
UPDATE SET col1 = source.col1 [, col2 = source.col2 ...] for all the columns of the target Delta
table. Therefore, this action assumes that the source table has the same columns as those in the target
table, otherwise the query will throw an analysis error.
This behavior changes when automatic schema migration is enabled. See Automatic schema
evolution for details.
WHEN NOT MATCHED clauses are executed when a source row does not match any target row based on the
match condition. These clauses have the following semantics.
WHEN NOT MATCHED clauses can only have the INSERT action. The new row is generated based on
the specified column and corresponding expressions. All the columns in the target table do not
need to be specified. For unspecified target columns, NULL will be inserted.

NOTE
In Databricks Runtime 6.5 and below, you must provide all columns in the target table for the INSERT
action.

Each WHEN NOT MATCHED clause can have an optional condition. If the clause condition is present, a
source row is inserted only if that condition is true for that row. Otherwise, the source row is ignored.
If there are multiple WHEN NOT MATCHED clauses, then they are evaluated in the order they are specified
(that is, the order of the clauses matters). All WHEN NOT MATCHED clauses, except the last one, must
have conditions.
To insert all the columns of the target Delta table with the corresponding columns of the source
dataset, use INSERT * . This is equivalent to
INSERT (col1 [, col2 ...]) VALUES (source.col1 [, source.col2 ...]) for all the columns of the
target Delta table. Therefore, this action assumes that the source table has the same columns as
those in the target table, otherwise the query will throw an analysis error.

NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.

IMPORTANT
A MERGE operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the
target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear
which source row should be used to update the matched target row. You can preprocess the source table to eliminate the
possibility of multiple matches. See the Change data capture example—it preprocesses the change dataset (that is, the
source dataset) to retain only the latest change for each key before applying that change into the target Delta table.

Examples
You can use MERGE for complex operations such as deduplicating data, upserting change data, and applying SCD
Type 2 operations. See Merge examples for a few examples.
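As a minimal, self-contained sketch (the events and updates tables and their columns are hypothetical), the following upserts rows from a source table into a target Delta table:

MERGE INTO events AS t
USING updates AS s
ON t.eventId = s.eventId
WHEN MATCHED THEN
  UPDATE SET data = s.data
WHEN NOT MATCHED THEN
  INSERT (eventId, data) VALUES (s.eventId, s.data)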
Msck

MSCK TABLE [db_name.]table_name

Remove all the privileges associated with a table.


Optimize (Delta Lake on Azure Databricks)

OPTIMIZE [db_name.]table_name [WHERE predicate]


[ZORDER BY (col_name1, col_name2, ...)]

Optimize the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you
do not specify colocation, bin-packing optimization is performed.

NOTE
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect. It aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of
tuples per file. However, the two measures are most often correlated.
Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not
guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-Ordered,
another Z-Ordering of that partition will not have any effect. It aims to produce evenly-balanced data files with respect
to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there
can be situations when that is not the case, leading to skew in optimize task times.
To control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize . The
default value is 1073741824 , which sets the size to 1 GB. Specifying the value 104857600 sets the file size to 100
MB.

WHERE

Optimize the subset of rows matching the given partition predicate. Only filters involving partition key
attributes are supported.
ZORDER BY

Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms
to dramatically reduce the amount of data that needs to be read. You can specify multiple columns for
ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional
column.

Examples
OPTIMIZE events

OPTIMIZE events WHERE date >= '2017-01-01'

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
Refresh Table

REFRESH TABLE [db_name.]table_name

Refresh all cached entries associated with the table. If the table was previously cached, then it would be cached
lazily the next time it is scanned.
Reset

RESET

Reset all properties set in the current session to their default values. After running RESET, the output of the SET command is empty.
Revoke

REVOKE
privilege_type [, privilege_type ] ...
ON (CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name>
| ANONYMOUS FUNCTION | ANY FILE)
FROM principal

privilege_type
: SELECT | CREATE | MODIFY | READ_METADATA | CREATE_NAMED_FUNCTION | ALL PRIVILEGES

principal
: `<user>@<domain-name>` | <group-name>

Revoke an explicitly granted or denied privilege on an object from a user or principal. A REVOKE is strictly
scoped to the object specified in the command and does not cascade to contained objects.
To revoke a privilege from all users, specify the keyword users after FROM .
For example, suppose there is a database db with tables t1 and t2 . A user is initially granted SELECT
privileges on db and on t1 . The user can access t2 due to the GRANT on the database db .
If the administrator revokes the SELECT privilege on db , the user will no longer be able to access t2 , but will
still be able to access t1 since there is an explicit GRANT on table t1 .
If the administrator instead revokes the SELECT on table t1 but still keeps the SELECT on database db , the
user can still access t1 because the SELECT on the database db implicitly confers privileges on the table t1 .
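That scenario can be sketched as the following sequence; db, t1, t2, and the user are hypothetical names:

GRANT SELECT ON DATABASE db TO `<user>@<domain-name>`
GRANT SELECT ON TABLE t1 TO `<user>@<domain-name>`
-- The user can read both t1 and t2 (t2 through the database-level grant)
REVOKE SELECT ON DATABASE db FROM `<user>@<domain-name>`
-- The user can still read t1, but can no longer read t2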

Examples
REVOKE ALL PRIVILEGES ON DATABASE default FROM `<user>@<domain-name>`
REVOKE SELECT ON <table-name> FROM `<user>@<domain-name>`
Select

SELECT [hints, ...] [ALL|DISTINCT] named_expression[, named_expression, ...]


FROM relation[, relation, ...]
[lateral_view[, lateral_view, ...]]
[WHERE boolean_expression]
[aggregation [HAVING boolean_expression]]
[ORDER BY sort_expressions]
[CLUSTER BY expressions]
[DISTRIBUTE BY expressions]
[SORT BY sort_expressions]
[WINDOW named_window[, WINDOW named_window, ...]]
[LIMIT num_rows]

named_expression:
: expression [AS alias]

relation:
| join_relation
| (table_name|query|relation) [sample] [AS alias]
: VALUES (expressions)[, (expressions), ...]
[AS (column_name[, column_name, ...])]

expressions:
: expression[, expression, ...]

sort_expressions:
: expression [ASC|DESC][, expression [ASC|DESC], ...]

Output data from one or more relations.


A relation refers to any source of input data. It could be the contents of an existing table (or view), the joined
result of two existing tables, or a subquery (the result of another SELECT statement).
ALL

Select all matching rows from the relation. Enabled by default.


DISTINCT

Select all matching rows from the relation then remove duplicate results.
WHERE

Filter rows by predicate.


HAVING

Filter grouped result by predicate.


ORDER BY

Impose total ordering on a set of expressions. Default sort direction is ascending. You cannot use this with
SORT BY , CLUSTER BY , or DISTRIBUTE BY .

DISTRIBUTE BY

Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be
hashed to the same worker. You cannot use this with ORDER BY or CLUSTER BY .
SORT BY

Impose ordering on a set of expressions within each partition. Default sort direction is ascending. You cannot
use this with ORDER BY or CLUSTER BY .
CLUSTER BY

Repartition rows in the relation based on a set of expressions and sort the rows in ascending order based on the
expressions. In other words, this is a shorthand for DISTRIBUTE BY and SORT BY where all expressions are
sorted in ascending order. You cannot use this with ORDER BY , DISTRIBUTE BY , or SORT BY .
WINDOW

Assign an identifier to a window specification. See Window functions.


LIMIT

Limit the number of rows returned.


VALUES

Explicitly specify values instead of reading them from a relation.

Examples
SELECT * FROM boxes
SELECT width, length FROM boxes WHERE height=3
SELECT DISTINCT width, length FROM boxes WHERE height=3 LIMIT 2
SELECT * FROM VALUES (1, 2, 3) AS (width, length, height)
SELECT * FROM VALUES (1, 2, 3), (2, 3, 4) AS (width, length, height)
SELECT * FROM boxes ORDER BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
SELECT * FROM boxes CLUSTER BY length

Delta tables
You can specify a table as delta.<path-to-table> or <table-name> .
You can specify a time travel version after the table identifier using TIMESTAMP AS OF , VERSION AS OF , or @
syntax. See Query an older snapshot of a table (time travel) for details.
Examples

SELECT * FROM delta.`/mnt/delta/events` TIMESTAMP AS OF '2019-10-18T22:15:12.013Z'


SELECT * FROM events VERSION AS OF 5

Table sample
sample:
| TABLESAMPLE ([integer_expression | decimal_expression] PERCENT)
: TABLESAMPLE (integer_expression ROWS)

Sample the input data. Express in terms of either a percentage (must be between 0 and 100) or a fixed number
of input rows.
Examples
SELECT * FROM boxes TABLESAMPLE (3 ROWS)
SELECT * FROM boxes TABLESAMPLE (25 PERCENT)

Join
join_relation:
| relation join_type JOIN relation [ON boolean_expression | USING (column_name, column_name) ]
: relation NATURAL join_type JOIN relation
join_type:
| INNER
| [LEFT | RIGHT] SEMI
| [LEFT | RIGHT | FULL] [OUTER]
: [LEFT] ANTI

INNER JOIN
Select all rows from both relations where there is a match.
OUTER JOIN
Select all rows from both relations, filling with null values on the side that does not have a match.
SEMI JOIN
Select only rows from the side of the SEMI JOIN where there is a match. If one row matches multiple
rows, only the first match is returned.
LEFT ANTI JOIN
Select only rows from the left side that match no rows on the right side.
Examples

SELECT * FROM boxes INNER JOIN rectangles ON boxes.width = rectangles.width


SELECT * FROM boxes FULL OUTER JOIN rectangles USING (width, length)
SELECT * FROM boxes NATURAL JOIN rectangles

Lateral view
lateral_view:
: LATERAL VIEW [OUTER] function_name (expressions)
table_name [AS (column_name[, column_name, ...])]

Generate zero or more output rows for each input row using a table-generating function. The most common
built-in function used with LATERAL VIEW is explode .
LATERAL VIEW OUTER

Generate a row with null values even when the function returned zero rows.
Examples

SELECT * FROM boxes LATERAL VIEW explode(Array(1, 2, 3)) my_view


SELECT name, my_view.grade FROM students LATERAL VIEW OUTER explode(grades) my_view AS grade
Group by
aggregation:
: GROUP BY expressions [WITH ROLLUP | WITH CUBE | GROUPING SETS (expressions)]

Group by a set of expressions using one or more aggregate functions. Common built-in aggregate functions
include count, avg, min, max, and sum.
ROLLUP

Create a grouping set at each hierarchical level of the specified expressions. For instance,
GROUP BY a, b, c WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ()) .
The total number of grouping sets will be N + 1 , where N is the number of group expressions.
CUBE

Create a grouping set for each possible combination of set of the specified expressions. For instance,
GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ()) . The total number of
grouping sets will be 2^N , where N is the number of group expressions.
GROUPING SETS

Perform a group by for each subset of the group expressions specified in the grouping sets. For instance,
GROUP BY x, y GROUPING SETS (x, y) is equivalent to the result of GROUP BY x unioned with that of GROUP BY y .
Examples

SELECT height, COUNT(*) AS num_rows FROM boxes GROUP BY height


SELECT width, AVG(length) AS average_length FROM boxes GROUP BY width
SELECT width, length, height FROM boxes GROUP BY width, length, height WITH ROLLUP
SELECT width, length, avg(height) FROM boxes GROUP BY width, length GROUPING SETS (width, length)

Window functions
window_expression:
: expression OVER window_spec

named_window:
: window_identifier AS window_spec

window_spec:
| window_identifier
: ( [PARTITION | DISTRIBUTE] BY expressions
[[ORDER | SORT] BY sort_expressions] [window_frame])

window_frame:
| [RANGE | ROWS] frame_bound
: [RANGE | ROWS] BETWEEN frame_bound AND frame_bound

frame_bound:
| CURRENT ROW
| UNBOUNDED [PRECEDING | FOLLOWING]
: expression [PRECEDING | FOLLOWING]

Compute a result over a range of input rows. A windowed expression is specified using the OVER keyword,
which is followed by either an identifier to the window (defined using the WINDOW keyword) or the specification
of a window.
PARTITION BY

Specify which rows will be in the same partition, aliased by DISTRIBUTE BY .


ORDER BY

Specify how rows within a window partition are ordered, aliased by SORT BY .
RANGE bound

Express the size of the window in terms of a value range for the expression.
ROWS bound

Express the size of the window in terms of the number of rows before and/or after the current row.
CURRENT ROW

Use the current row as a bound.


UNBOUNDED

Use negative infinity as the lower bound or infinity as the upper bound.
PRECEDING

If used with a RANGE bound, this defines the lower bound of the value range. If used with a ROWS bound, this
determines the number of rows before the current row to keep in the window.
FOLLOWING

If used with a RANGE bound, this defines the upper bound of the value range. If used with a ROWS bound, this
determines the number of rows after the current row to keep in the window.
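As an illustrative sketch using the boxes table from the earlier examples (column names assumed), the first query computes a running total of height within each width partition, and the second defines the same window once with the WINDOW keyword and reuses it by name:

SELECT width, length, height,
  sum(height) OVER (PARTITION BY width ORDER BY length ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_height
FROM boxes

SELECT width, length, sum(height) OVER w, avg(height) OVER w
FROM boxes
WINDOW w AS (PARTITION BY width ORDER BY length)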

Hints
hints:
: /*+ hint[, hint, ...] */
hint:
: hintName [(expression[, expression, ...])]

You use hints to improve the performance of a query. For example, you can hint that a table is small enough to be
broadcast, which would speed up joins.
You add one or more hints to a SELECT statement inside /*+ ... */ comment blocks. You can specify multiple
hints inside the same comment block, in which case the hints are separated by commas, and there can be
multiple such comment blocks. A hint has a name (for example, BROADCAST ) and accepts 0 or more parameters.
Examples

SELECT /*+ BROADCAST(customers) */ * FROM customers, orders WHERE o_custId = c_custId


SELECT /*+ SKEW('orders') */ * FROM customers, orders WHERE o_custId = c_custId
SELECT /*+ SKEW('orders'), BROADCAST(demographic) */ * FROM orders, customers, demographic WHERE o_custId =
c_custId AND c_demoId = d_demoId

Delta Lake on Azure Databricks: See Skew join optimization for more information about the SKEW hint.
Set

SET [-v]
SET property_key[=property_value]

Set a property, return the value of an existing property, or list all existing properties. If a value is provided for an
existing property key, the old value will be overridden.
-v

Output the meaning of the existing properties.


property_key

Set or return the value of an individual property.


Show Columns

SHOW COLUMNS (FROM | IN) [db_name.]table_name

Return the list of columns in a table. If the table does not exist, an exception is thrown.
Show Create Table

SHOW CREATE TABLE [db_name.]table_name

Return the command used to create an existing table. If the table does not exist, an exception is thrown.
Show Databases

SHOW [DATABASES | SCHEMAS] [LIKE 'pattern']

Return all databases. SHOW SCHEMAS is a synonym for SHOW DATABASES .


LIKE 'pattern'

Which database names to match. In pattern , * matches any number of characters.


Show Functions

SHOW [USER | SYSTEM | ALL] FUNCTIONS ([LIKE] regex | [db_name.]function_name)

Show functions matching the given regex or function name. If no regex or name is provided, then all functions
are shown. If USER or SYSTEM is specified, only user-defined Spark SQL functions or system-defined Spark SQL
functions, respectively, are shown.
LIKE

This qualifier is allowed only for compatibility and has no effect.


Show Grants

SHOW GRANTS [user] ON [CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION
<function-name> | ANONYMOUS FUNCTION | ANY FILE]

Display all privileges (including inherited, denied, and granted) that affect the specified object.
Example

SHOW GRANTS `<user>@<domain-name>` ON DATABASE <database-name>


Show Partitions

SHOW PARTITIONS [db_name.]table_name [PARTITION part_spec]

part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)

List the partitions of a table, filtering by given partition values. Listing partitions is supported only for tables
created using the Delta Lake format or the Hive format, when Hive support is enabled.
Show Table Properties

SHOW TBLPROPERTIES [db_name.]table_name [property_key]

Return all properties or the value of a specific property set in a table. If the table does not exist, an exception will
be thrown.
Show Tables

SHOW TABLES [FROM | IN] db_name [LIKE 'pattern']

Return all tables. Shows a table’s database and whether a table is temporary.
FROM | IN

Return all tables in a database.


LIKE 'pattern'

Indicates which table names to match. In pattern , * matches any number of characters.
Truncate Table

TRUNCATE TABLE table_name [PARTITION part_spec]

part_spec:
: (part_col1=value1, part_col2=value2, ...)

Delete all rows from a table or matching partitions in the table. The table must not be an external table or a view.
PARTITION

A partial partition spec to match partitions to be truncated. In Spark 2.0, this is supported only for tables created
using the Hive format. Since Spark 2.1, data source tables are also supported. Not supported for Delta tables.
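For example (the table and its partition column are hypothetical):

TRUNCATE TABLE boxes

TRUNCATE TABLE boxes PARTITION (width=3)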
Uncache Table

UNCACHE TABLE [db_name.]table_name

Drop all cached entries associated with the table from the RDD cache.
Update (Delta Lake on Azure Databricks)

UPDATE [db_name.]table_name [AS alias] SET col1 = value1 [, col2 = value2 ...] [WHERE predicate]

Update the column values for the rows that match a predicate. When no predicate is provided, update the
column values for all rows.

NOTE
(Delta Lake on Azure Databricks) If a column has a NOT NULL constraint, and an INSERT INTO statement sets a column
value to NULL , a SparkException is thrown.

WHERE

Filter rows by predicate.

Example
UPDATE events SET eventType = 'click' WHERE eventType = 'clk'

UPDATE supports subqueries in the WHERE predicate, including IN , NOT IN , EXISTS , NOT EXISTS , and scalar
subqueries.

Subquery Examples
UPDATE all_events
SET session_time = 0, ignored = true
WHERE session_time < (SELECT min(session_time) FROM good_events)

UPDATE orders AS t1
SET order_status = 'returned'
WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid)

UPDATE events
SET category = 'undefined'
WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01')

NOTE
The following types of subqueries are not supported:
Nested subqueries, that is, a subquery inside another subquery
A NOT IN subquery inside an OR , for example, a = 3 OR b NOT IN (SELECT c from t)

In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . We recommend using NOT EXISTS whenever
possible, as UPDATE with NOT IN subqueries can be slow.
Use Database

USE db_name

Set the current database. All subsequent commands that do not explicitly specify a database will use this one. If
the provided database does not exist, an exception is thrown. The default current database is default .
Vacuum

Clean up files associated with a table.


This command works differently depending on whether you’re working on a Delta or Apache Spark table.

Vacuum a Delta table (Delta Lake on Azure Databricks)


VACUUM [ [db_name.]table_name | path] [RETAIN num HOURS] [DRY RUN]

Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the
latest state of the transaction log for the table and are older than a retention threshold. Files are deleted
according to the time they have been logically removed from Delta’s transaction log + retention hours, not their
modification timestamps on the storage system. The default threshold is 7 days.
On Delta tables, Azure Databricks does not automatically trigger VACUUM operations. See Remove files no longer
referenced by a Delta table.
If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified
data retention period.
RETAIN num HOURS

The retention threshold.


DRY RUN

Return a list of files to be deleted.
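For example, the following sketch first previews and then removes files that fall outside a 168-hour (7-day) retention threshold for a hypothetical Delta table named events:

VACUUM events RETAIN 168 HOURS DRY RUN

VACUUM events RETAIN 168 HOURS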

Vacuum a Spark table (Apache Spark)


VACUUM [ [db_name.]table_name | path] [RETAIN num HOURS]

RETAIN num HOURS

The retention threshold.


Recursively vacuum directories associated with the Spark table and remove uncommitted files older than a
retention threshold. The default threshold is 7 days.
On Spark tables, Azure Databricks automatically triggers VACUUM operations as data is written. See Clean up
uncommitted files.
Cost-based optimizer

Spark SQL can use a cost-based optimizer (CBO) to improve query plans. This is especially useful for queries
with multiple joins. For this to work it is critical to collect table and column statistics and keep them up to date.

Collect statistics
To get the full benefit of the CBO it is important to collect both column statistics and table statistics. Statistics can
be collected using the Analyze Table command.

TIP
To keep the statistics up to date, run ANALYZE TABLE after writing to the table.
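For example, the following sketch collects table-level and then column-level statistics; the table and column names are hypothetical:

ANALYZE TABLE store_sales COMPUTE STATISTICS

ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_store_sk, ss_sold_date_sk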

Verify query plans


There are several ways to verify the query plan.
EXPLAIN command
To check if the plan uses statistics, use the SQL commands
Databricks Runtime 7.x and above: EXPLAIN
Databricks Runtime 5.5 LTS and 6.x: Explain
If statistics are missing then the query plan might not be optimal.

== Optimized Logical Plan ==


Aggregate [s_store_sk], [s_store_sk, count(1) AS count(1)L], Statistics(sizeInBytes=20.0 B, rowCount=1,
hints=none)
+- Project [s_store_sk], Statistics(sizeInBytes=18.5 MB, rowCount=1.62E+6, hints=none)
+- Join Inner, (d_date_sk = ss_sold_date_sk), Statistics(sizeInBytes=30.8 MB, rowCount=1.62E+6,
hints=none)
:- Project [ss_sold_date_sk, s_store_sk], Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9,
hints=none)
: +- Join Inner, (s_store_sk = ss_store_sk), Statistics(sizeInBytes=48.9 GB, rowCount=2.63E+9,
hints=none)
: :- Project [ss_store_sk, ss_sold_date_sk], Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9,
hints=none)
: : +- Filter (isnotnull(ss_store_sk) && isnotnull(ss_sold_date_sk)), Statistics(sizeInBytes=39.1
GB, rowCount=2.63E+9, hints=none)
: : +- Relation[ss_store_sk,ss_sold_date_sk] parquet, Statistics(sizeInBytes=134.6 GB,
rowCount=2.88E+9, hints=none)
: +- Project [s_store_sk], Statistics(sizeInBytes=11.7 KB, rowCount=1.00E+3, hints=none)
: +- Filter isnotnull(s_store_sk), Statistics(sizeInBytes=11.7 KB, rowCount=1.00E+3,
hints=none)
: +- Relation[s_store_sk] parquet, Statistics(sizeInBytes=88.0 KB, rowCount=1.00E+3,
hints=none)
+- Project [d_date_sk], Statistics(sizeInBytes=12.0 B, rowCount=1, hints=none)
+- Filter ((((isnotnull(d_year) && isnotnull(d_date)) && (d_year = 2000)) && (d_date = 2000-12-31))
&& isnotnull(d_date_sk)), Statistics(sizeInBytes=38.0 B, rowCount=1, hints=none)
+- Relation[d_date_sk,d_date,d_year] parquet, Statistics(sizeInBytes=1786.7 KB,
rowCount=7.30E+4, hints=none)
IMPORTANT
The rowCount statistic is especially important for queries with multiple joins. If rowCount is missing, it means there is
not enough information to calculate it (that is, some required columns do not have statistics).

Spark SQL UI
Use the Spark SQL UI page to see the executed plan and accuracy of the statistics.

A line such as rows output: 2,451,005 est: N/A means that this operator produces approximately 2M rows and
there were no statistics available.

A line such as rows output: 2,451,005 est: 1616404 (1X) means that this operator produces approx. 2M rows,
while the estimate was approx. 1.6M and the estimation error factor was 1.

A line such as rows output: 2,451,005 est: 2626656323 means that this operator produces approximately 2M
rows while the estimate was 2B rows, so the estimation error factor was 1000.

Disable the Cost-Based Optimizer


The CBO is enabled by default. You disable the CBO by changing the spark.sql.cbo.enabled flag.

spark.conf.set("spark.sql.cbo.enabled", false)
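Equivalently, the flag can be toggled from SQL for the current session (a minimal sketch):

SET spark.sql.cbo.enabled = false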
Data skipping index

IMPORTANT
DATASKIPPING INDEX was removed in Databricks Runtime 7.0. We recommend that you use Delta tables instead, which
offer improved data skipping capabilities.

Description
In addition to partition pruning, Databricks Runtime includes another feature that is meant to avoid scanning
irrelevant data, namely the Data Skipping Index. It uses file-level statistics in order to perform additional
skipping at file granularity. This works with, but does not depend on, Hive-style partitioning.
The effectiveness of data skipping depends on the characteristics of your data and its physical layout. As
skipping is done at file granularity, it is important that your data is horizontally partitioned across multiple files.
This will typically happen as a consequence of having multiple append jobs, (shuffle) partitioning, bucketing,
and/or the use of spark.sql.files.maxRecordsPerFile . It works best on tables with sorted buckets (
df.write.bucketBy(...).sortBy(...).saveAsTable(...) / CREATE TABLE ... CLUSTERED BY ... SORTED BY ... ), or
with columns that are correlated with partition keys (for example, brandName - modelName ,
companyID - stockPrice ), but also when your data just happens to exhibit some sortedness / clusteredness (for
example, orderID , bitcoinValue ).

NOTE
This beta feature has a number of important limitations:
It’s Opt-In: needs to be enabled manually, on a per-table basis.
It’s SQL only: there is no DataFrame API for it.
Once a table is indexed, the effects of subsequent INSERT or ADD PARTITION operations are not guaranteed to be
visible until the index is explicitly REFRESHed.

SQL Syntax
Create Index

CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]table_name

Enables Data Skipping on the given table for the first (that is, left-most) N supported columns, where N is controlled
by spark.databricks.io.skipping.defaultNumIndexedCols (default: 32).
partitionBy columns are always indexed and do not count towards this N.
Create Index For Columns

CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]table_name


FOR COLUMNS (col1, ...)

Enables Data Skipping on the given table for the specified list of columns. Same as above, all partitionBy
columns will always be indexed in addition to the ones specified.
Describe Index

DESCRIBE DATASKIPPING INDEX [EXTENDED] ON [TABLE] [db_name.]table_name

Displays which columns of the given table are indexed, along with the corresponding types of file-level statistic
that are collected.
If EXTENDED is specified, a third column called “effectiveness_score” is displayed that gives an approximate
measure of how beneficial we expect DataSkipping to be for filters on the corresponding columns.
Refresh Full Index

REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name

Rebuilds the whole index; that is, all of the table’s partitions are re-indexed.
Refresh Partitions

REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name


PARTITION (part_col_name1[=val1], part_col_name2[=val2], ...)

Re-indexes the specified partitions only. This operation should generally be faster than full index refresh.
Drop Index

DROP DATASKIPPING INDEX ON [TABLE] [db_name.]table_name

Disables Data Skipping on the given table and deletes all index data.
Transactional writes to cloud storage with DBIO

The Databricks DBIO package provides transactional writes to cloud storage for Apache Spark jobs. This solves a
number of performance and correctness issues that occur when Spark is used in a cloud-native setting (for
example, writing directly to storage services).

IMPORTANT
The commit protocol is not respected when you access data using paths ending in * . For example, reading
dbfs://my/path will only return committed changes, while reading dbfs://my/path/* will return the content of all the
data files in the directory, irrespective of whether their content was committed or not. This is an expected behavior.

With DBIO transactional commit, metadata files starting with _started_<id> and _committed_<id> accompany
data files created by Spark jobs. Generally you shouldn’t alter these files directly. Rather, you should use the
VACUUM command to clean them up.

Clean up uncommitted files


To clean up uncommitted files left over from Spark jobs, use the VACUUM command to remove them. Normally
VACUUM happens automatically after Spark jobs complete, but you can also run it manually if a job is aborted.

For example, VACUUM ... RETAIN 1 HOUR removes uncommitted files older than one hour.

IMPORTANT
Avoid vacuuming with a horizon of less than one hour. It can cause data inconsistency.

Also see Vacuum.


SQL

-- recursively vacuum an output path


VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]

-- vacuum all partitions of a catalog table


VACUUM tableName [RETAIN <N> HOURS]

Scala

// recursively vacuum an output path


spark.sql("VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]")

// vacuum all partitions of a catalog table


spark.sql("VACUUM tableName [RETAIN <N> HOURS]")
Handling bad records and files

When reading data from a file-based data source, Apache Spark SQL faces two typical error cases. First, the files
may not be readable (for instance, they could be missing, inaccessible or corrupted). Second, even if the files are
processable, some records may not be parsable (for example, due to syntax errors and schema mismatch).
Azure Databricks provides a unified interface for handling bad records and files without interrupting Spark jobs.
You can obtain the exception records/files and reasons from the exception logs by setting the data source option
badRecordsPath . badRecordsPath specifies a path to store exception files for recording the information about
bad records for CSV and JSON sources and bad files for all the file-based built-in sources (for example, Parquet).
In addition, transient errors, such as network connection exceptions and IO exceptions, may occur while reading
files. These errors are ignored and also recorded under the badRecordsPath, and Spark continues to run the
tasks.

NOTE
Using the badRecordsPath option in a file-based data source has a few important limitations:
It is non-transactional and can lead to inconsistent results.
Transient errors are treated as failures.

Examples
Unable to find input file

val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.format("parquet").load("/input/parquetFile")

// Delete the input parquet file '/input/parquetFile'


dbutils.fs.rm("/input/parquetFile")

df.show()

In the above example, since df.show() is unable to find the input file, Spark creates an exception file in JSON
format to record the error. For example, /tmp/badRecordsPath/20170724T101153/bad_files/xyz is the path of the
exception file. This file is under the specified badRecordsPath directory, /tmp/badRecordsPath . 20170724T101153 is
the creation time of this DataFrameReader . bad_files is the exception type. xyz is a file that contains a JSON
record, which has the path of the bad file and the exception/reason message.
Input file contains bad record
// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.format("text").save("/tmp/input/jsonFile")

val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.schema("a int, b int")
.format("json")
.load("/tmp/input/jsonFile")

df.show()

In this example, the DataFrame contains only the first parsable record ( {"a": 1, "b": 2} ). The second bad
record ( {bad-record ) is recorded in the exception file, which is a JSON file located in
/tmp/badRecordsPath/20170724T114715/bad_records/xyz . The exception file contains the bad record, the path of the
file containing the record, and the exception/reason message. After you locate the exception files, you can use a
JSON reader to process them.
Handling large queries in interactive workflows
7/21/2022 • 4 minutes to read

A challenge with interactive data workflows is handling large queries. This includes queries that generate too
many output rows, fetch many external partitions, or compute on extremely large data sets. These queries can
be extremely slow, saturate cluster resources, and make it difficult for others to share the same cluster.
Query Watchdog is a process that prevents queries from monopolizing cluster resources by examining the most
common causes of large queries and terminating queries that pass a threshold. This article describes how to
enable and configure Query Watchdog.

IMPORTANT
Query Watchdog is enabled for all all-purpose clusters created using the UI.

Example of a disruptive query


An analyst is performing some ad hoc queries in a just-in-time data warehouse. The analyst uses a shared
autoscaling cluster that makes it easy for multiple users to use a single cluster at the same time. Suppose there
are two tables that each have a million rows.

import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.shuffle.partitions", 10)

spark.range(1000000)
.withColumn("join_key", lit(" "))
.createOrReplaceTempView("table_x")
spark.range(1000000)
.withColumn("join_key", lit(" "))
.createOrReplaceTempView("table_y")

These table sizes are manageable in Apache Spark. However, they each include a join_key column with an
empty string in every row. This can happen if the data is not perfectly clean or if there is significant data skew
where some keys are more prevalent than others. These empty join keys are far more prevalent than any other
value.
In the following code, the analyst is joining these two tables on their keys, which produces output of one trillion
results, and all of these are produced on a single executor (the executor that gets the " " key):

SELECT
id, count(*)
FROM
(SELECT
x.id
FROM
table_x x
JOIN
table_y y
on x.join_key = y.join_key)
GROUP BY id

This query appears to be running. But without knowing about the data, the analyst sees that there’s “only” a
single task left over the course of executing the job. The query never finishes, leaving the analyst frustrated and
confused about why it did not work.
In this case there is only one problematic join key. Other times there may be many more.

Enable and configure Query Watchdog


To prevent a query from creating too many output rows for the number of input rows, you can enable Query
Watchdog and configure the maximum number of output rows as a multiple of the number of input rows. In this
example we use a ratio of 1000 (the default).

spark.conf.set("spark.databricks.queryWatchdog.enabled", true)
spark.conf.set("spark.databricks.queryWatchdog.outputRatioThreshold", 1000L)

The latter configuration declares that any given task should never produce more than 1000 times the number of
input rows.

TIP
The output ratio is completely customizable. We recommend starting lower and seeing what threshold works well for you
and your team. A range of 1,000 to 10,000 is a good starting point.

Not only does Query Watchdog prevent users from monopolizing cluster resources for jobs that will never
complete, it also saves time by fast-failing a query that would have never completed. For example, the following
query will fail after several minutes because it exceeds the ratio.

SELECT
join_key,
sum(x.id),
count(*)
FROM
(SELECT
x.id,
y.join_key
FROM
table_x x
JOIN
table_y y
on x.join_key = y.join_key)
GROUP BY join_key

Here’s what you would see: an error reporting that the query was cancelled because it exceeded the configured
Query Watchdog output/input ratio.

It’s usually enough to enable Query Watchdog and set the output/input threshold ratio, but you also have the
option to set two additional properties: spark.databricks.queryWatchdog.minTimeSecs and
spark.databricks.queryWatchdog.minOutputRows . These properties specify the minimum time a given task in a
query must run before cancelling it and the minimum number of output rows for a task in that query.
For example, you can set minTimeSecs to a higher value if you want to give it a chance to produce a large
number of rows per task. Likewise, you can set spark.databricks.queryWatchdog.minOutputRows to ten million if
you want to stop a query only after a task in that query has produced ten million rows. Anything less and the
query succeeds, even if the output/input ratio was exceeded.

spark.conf.set("spark.databricks.queryWatchdog.minTimeSecs", 10L)
spark.conf.set("spark.databricks.queryWatchdog.minOutputRows", 100000L)

TIP
If you configure Query Watchdog in a notebook, the configuration does not persist across cluster restarts. If you want to
configure Query Watchdog for all users of a cluster, we recommend that you use a cluster configuration.
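
For example, a sketch of the entries you might add to the cluster's Spark config (values are illustrative):

spark.databricks.queryWatchdog.enabled true
spark.databricks.queryWatchdog.outputRatioThreshold 1000
spark.databricks.queryWatchdog.minTimeSecs 10
spark.databricks.queryWatchdog.minOutputRows 100000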

Detect query on extremely large dataset


Another typical large query may scan a large amount of data from big tables/datasets. The scan operation may
last for a long time and saturate cluster resources (even reading metadata of a big Hive table can take a
significant amount of time). You can set maxHivePartitions to prevent fetching too many partitions from a big
Hive table. Similarly, you can also set maxQueryTasks to limit queries on an extremely large dataset.

spark.conf.set("spark.databricks.queryWatchdog.maxHivePartitions", 20000)
spark.conf.set("spark.databricks.queryWatchdog.maxQueryTasks", 20000)

When should you enable Query Watchdog?


Query Watchdog should be enabled for ad hoc analytics clusters where SQL analysts and data scientists are
sharing a given cluster and an administrator needs to make sure that queries “play nicely” with one another.

When should you disable Query Watchdog?


In general we do not advise eagerly cancelling queries used in an ETL scenario because there typically isn’t a
human in the loop to correct the error. We recommend that you disable Query Watchdog for all but ad hoc
analytics clusters.
Adaptive query execution
7/21/2022 • 8 minutes to read

Adaptive query execution (AQE) is query re-optimization that occurs during query execution.
The motivation for runtime re-optimization is that Azure Databricks has the most up-to-date accurate statistics
at the end of a shuffle and broadcast exchange (referred to as a query stage in AQE). As a result, Azure
Databricks can opt for a better physical strategy, pick an optimal post-shuffle partition size and number, or do
optimizations that used to require hints, for example, skew join handling.
This can be very useful when statistics collection is not turned on or when statistics are stale. It is also useful in
places where statically derived statistics are inaccurate, such as in the middle of a complicated query, or after the
occurrence of data skew.

Capabilities
In Databricks Runtime 7.3 LTS and above, AQE is enabled by default. It has 4 major features:
Dynamically changes sort merge join into broadcast hash join.
Dynamically coalesces partitions (combine small partitions into reasonably sized partitions) after shuffle
exchange. Very small tasks have worse I/O throughput and tend to suffer more from scheduling overhead
and task setup overhead. Combining small tasks saves resources and improves cluster throughput.
Dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed)
skewed tasks into roughly evenly sized tasks.
Dynamically detects and propagates empty relations.

Application
AQE applies to all queries that:
Are non-streaming
Contain at least one exchange (usually when there’s a join, aggregate, or window), one sub-query, or both.
Not all AQE-applied queries are necessarily re-optimized. The re-optimization might or might not come up with
a different query plan than the one statically compiled. To determine whether a query’s plan has been changed
by AQE, see the following section, Query plans.

Query plans
This section discusses how you can examine query plans in different ways.
In this section:
Spark UI
DataFrame.explain()
SQL EXPLAIN

Spark UI
AdaptiveSparkPlan node
AQE-applied queries contain one or more AdaptiveSparkPlan nodes, usually as the root node of each main
query or sub-query. Before the query runs or when it is running, the isFinalPlan flag of the corresponding
AdaptiveSparkPlan node shows as false ; after the query execution completes, the isFinalPlan flag changes to
true.

Evolving plan
The query plan diagram evolves as the execution progresses and reflects the most current plan that is being
executed. Nodes that have already been executed (in which metrics are available) will not change, but those that
haven’t can change over time as the result of re-optimizations.
The following is a query plan diagram example:

DataFrame.explain()
AdaptiveSparkPlan node
AQE-applied queries contain one or more AdaptiveSparkPlan nodes, usually as the root node of each main
query or sub-query. Before the query runs or when it is running, the isFinalPlan flag of the corresponding
AdaptiveSparkPlan node shows as false ; after the query execution completes, the isFinalPlan flag changes to
true .

Current and initial plan


Under each AdaptiveSparkPlan node there will be both the initial plan (the plan before applying any AQE
optimizations) and the current or the final plan, depending on whether the execution has completed. The current
plan will evolve as the execution progresses.
Runtime statistics
Each shuffle and broadcast stage contains data statistics.
Before the stage runs or when the stage is running, the statistics are compile-time estimates, and the flag
isRuntime is false , for example: Statistics(sizeInBytes=1024.0 KiB, rowCount=4, isRuntime=false);

After the stage execution completes, the statistics are those collected at runtime, and the flag isRuntime will
become true , for example: Statistics(sizeInBytes=658.1 KiB, rowCount=2.81E+4, isRuntime=true)
The following is a DataFrame.explain example:
Before the execution
During the execution

After the execution
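
For example, the following minimal Python sketch (the query is illustrative) runs a query that contains an exchange and calls explain() before and after execution; the isFinalPlan flag on the AdaptiveSparkPlan node changes from false to true once execution completes.

# A minimal sketch (illustrative query): watch the isFinalPlan flag flip after execution
df = spark.range(0, 1000000).selectExpr("id % 10 AS key").groupBy("key").count()

df.explain()   # AdaptiveSparkPlan ... isFinalPlan=false (initial plan)
df.collect()   # run the query
df.explain()   # AdaptiveSparkPlan ... isFinalPlan=true (final adaptive plan)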

SQL EXPLAIN
AdaptiveSparkPlan node
AQE-applied queries contain one or more AdaptiveSparkPlan nodes, usually as the root node of each main
query or sub-query.
No current plan
As SQL EXPLAIN does not execute the query, the current plan is always the same as the initial plan and does not
reflect what would eventually get executed by AQE.
The following is a SQL explain example:

Effectiveness
The query plan will change if one or more AQE optimizations take effect. The effect of these AQE optimizations
is demonstrated by the difference between the current and final plans and the initial plan and specific plan
nodes in the current and final plans.
Dynamically change sort merge join into broadcast hash join: different physical join nodes between the
current/final plan and the initial plan

Dynamically coalesce partitions: node CustomShuffleReader with property Coalesced

Dynamically handle skew join: node SortMergeJoin with field isSkew as true.
Dynamically detect and propagate empty relations: part of (or entire) the plan is replaced by node
LocalTableScan with the relation field as empty.

Configuration
In this section:
Enable and disable adaptive query execution
Dynamically change sort merge join into broadcast hash join
Dynamically coalesce partitions
Dynamically handle skew join
Dynamically detect and propagate empty relations
Enable and disable adaptive query execution
PROPERTY

spark.databricks.optimizer.adaptive.enabled

Type: Boolean

Whether to enable or disable adaptive query execution.

Default value: true

Dynamically change sort merge join into broadcast hash join


PROPERTY

spark.databricks.adaptive.autoBroadcastJoinThreshold

Type: Byte String

The threshold to trigger switching to broadcast join at runtime.

Default value: 30MB

Dynamically coalesce partitions


PROPERTY

spark.sql.adaptive.coalescePartitions.enabled

Type: Boolean

Whether to enable or disable partition coalescing.

Default value: true

spark.sql.adaptive.advisoryPartitionSizeInBytes

Type: Byte String

The target size after coalescing. The coalesced partition sizes will be close to but no bigger than this target size.

Default value: 64MB

spark.sql.adaptive.coalescePartitions.minPartitionSize

Type: Byte String

The minimum size of partitions after coalescing. The coalesced partition sizes will be no smaller than this size.

Default value: 1MB

spark.sql.adaptive.coalescePartitions.minPartitionNum

Type: Integer

The minimum number of partitions after coalescing. Not recommended, because setting it explicitly overrides
spark.sql.adaptive.coalescePartitions.minPartitionSize .

Default value: 2x no. of cluster cores


Dynamically handle skew join
PROPERTY

spark.sql.adaptive.skewJoin.enabled

Type: Boolean

Whether to enable or disable skew join handling.

Default value: true

spark.sql.adaptive.skewJoin.skewedPartitionFactor

Type: Integer

A factor that, when multiplied by the median partition size, contributes to determining whether a partition is skewed.

Default value: 5

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

Type: Byte String

A threshold that contributes to determining whether a partition is skewed.

Default value: 256MB

A partition is considered skewed when both (partition size > skewedPartitionFactor * median partition size)
and (partition size > skewedPartitionThresholdInBytes) are true .
Dynamically detect and propagate empty relations
PROPERTY

spark.databricks.adaptive.emptyRelationPropagation.enabled

Type: Boolean

Whether to enable or disable dynamic empty relation propagation.

Default value: true
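
For example, a minimal Python sketch that sets these properties for the current session (the values shown match the defaults listed above and are illustrative):

# A minimal sketch: adjust AQE-related settings for the current session (values are illustrative)
spark.conf.set("spark.databricks.optimizer.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.databricks.adaptive.emptyRelationPropagation.enabled", "true")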

Frequently asked questions (FAQ)


In this section:
Why didn’t AQE change the shuffle partition number despite the partition coalescing already being enabled?
Why didn’t AQE broadcast a small join table?
Should I still use a broadcast join strategy hint with AQE enabled?
What is the difference between skew join hint and AQE skew join optimization? Which one should I use?
Why didn’t AQE adjust my join ordering automatically?
Why didn’t AQE detect my data skew?
Why didn’t AQE change the shuffle partition number despite the partition coalescing already being enabled?
AQE does not change the initial partition number. It is recommended that you set a reasonably high value for
the shuffle partition number and let AQE coalesce small partitions based on the output data size at each stage of
the query.
If you see spilling in your jobs, you can try:
Increasing the shuffle partition number config: spark.sql.shuffle.partitions
Enabling auto optimized shuffle by setting spark.databricks.adaptive.autoOptimizeShuffle.enabled to true

Why didn’t AQE broadcast a small join table?


If the size of the relation expected to be broadcast falls under the broadcast threshold ( spark.databricks.adaptive.autoBroadcastJoinThreshold ) but is still not broadcast:
Check the join type. Broadcast is not supported for certain join types, for example, the left relation of a
LEFT OUTER JOIN cannot be broadcast.
It can also be that the relation contains a lot of empty partitions, in which case the majority of the tasks can
finish quickly with sort merge join or it can potentially be optimized with skew join handling. AQE avoids
changing such sort merge joins to broadcast hash joins if the percentage of non-empty partitions is lower
than spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin .
Should I still use a broadcast join strategy hint with AQE enabled?
Yes. A statically planned broadcast join is usually more performant than a dynamically planned one by AQE as
AQE might not switch to broadcast join until after performing shuffle for both sides of the join (by which time
the actual relation sizes are obtained). So using a broadcast hint can still be a good choice if you know your
query well. AQE will respect query hints the same way as static optimization does, but can still apply dynamic
optimizations that are not affected by the hints.
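
For example, a minimal sketch (the DataFrame names are hypothetical) of a statically planned broadcast hint:

# A minimal sketch (hypothetical DataFrames): hint that the smaller side should be broadcast
from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), "id")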
What is the difference between skew join hint and AQE skew join optimization? Which one should I use?
It is recommended to rely on AQE skew join handling rather than use the skew join hint, because AQE skew join
is completely automatic and in general performs better than the hint counterpart.
Why didn’t AQE adjust my join ordering automatically?
Dynamic join reordering is not part of AQE as of Databricks Runtime 7.3 LTS.
Why didn’t AQE detect my data skew?
There are two size conditions that must be satisfied for AQE to detect a partition as a skewed partition:
The partition size is larger than the spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (default
256MB)
The partition size is larger than the median size of all partitions times the skewed partition factor
spark.sql.adaptive.skewJoin.skewedPartitionFactor (default 5)

In addition, skew handling support is limited for certain join types, for example, in LEFT OUTER JOIN , only skew
on the left side can be optimized.

Legacy
The term “Adaptive Execution” has existed since Spark 1.6, but the new AQE in Spark 3.0 is fundamentally
different. In terms of functionality, Spark 1.6 does only the “dynamically coalesce partitions” part. In terms of
technical architecture, the new AQE is a framework of dynamic planning and replanning of queries based on
runtime stats, which supports a variety of optimizations such as the ones we have described in this article and
can be extended to enable more potential optimizations.
Query semi-structured data in SQL
7/21/2022 • 4 minutes to read

NOTE
Available in Databricks Runtime 8.1 and above.

This article describes the Databricks SQL operators you can use to query and transform semi-structured data
stored as JSON.

NOTE
This feature lets you read semi-structured data without flattening the files. However, for optimal read query performance
Databricks recommends that you extract nested columns with the correct data types.

Syntax
You extract a column from fields containing JSON strings using the syntax <column-name>:<extraction-path> ,
where <column-name> is the string column name and <extraction-path> is the path to the field to extract. The
returned results are strings.

Examples
The following examples use the data created with the statement in Example data.
In this section:
Extract a top-level column
Extract nested fields
Extract values from arrays
Cast values
Example data
Extract a top-level column
To extract a column, specify the name of the JSON field in your extraction path.
You can provide column names within brackets. Columns referenced inside brackets are matched case
sensitively. Column names referenced without brackets are matched case insensitively.

SELECT raw:owner, RAW:owner FROM store_data

+-------+-------+
| owner | owner |
+-------+-------+
| amy | amy |
+-------+-------+
-- References are case sensitive when you use brackets
SELECT raw:OWNER case_insensitive, raw:['OWNER'] case_sensitive FROM store_data

+------------------+----------------+
| case_insensitive | case_sensitive |
+------------------+----------------+
| amy | null |
+------------------+----------------+

Use backticks to escape spaces and special characters. The field names are matched case insensitively.

-- Use backticks to escape special characters. References are case insensitive when you use backticks.
-- Use brackets to make them case sensitive.
SELECT raw:`zip code`, raw:`Zip Code`, raw:['fb:testid'] FROM store_data

+----------+----------+-----------+
| zip code | Zip Code | fb:testid |
+----------+----------+-----------+
| 94025 | 94025 | 1234 |
+----------+----------+-----------+

NOTE
If a JSON record contains multiple columns that can match your extraction path due to case insensitive matching, you will
receive an error asking you to use brackets. If you have matches of columns across rows, you will not receive any errors.
The following will throw an error: {"foo":"bar", "Foo":"bar"} , and the following won’t throw an error:

{"foo":"bar"}
{"Foo":"bar"}

Extract nested fields


You specify nested fields through dot notation or using brackets. When you use brackets, columns are matched
case sensitively.

-- Use dot notation


SELECT raw:store.bicycle FROM store_data
-- the column returned is a string

+------------------+
| bicycle |
+------------------+
| { |
| "price":19.95, |
| "color":"red" |
| } |
+------------------+

-- Use brackets
SELECT raw:store['bicycle'], raw:store['BICYCLE'] FROM store_data
+------------------+---------+
| bicycle | BICYCLE |
+------------------+---------+
| { | null |
| "price":19.95, | |
| "color":"red" | |
| } | |
+------------------+---------+

Extract values from arrays


You index elements in arrays with brackets. Indices are 0-based. You can use an asterisk ( * ) followed by dot or
bracket notation to extract subfields from all elements in an array.

-- Index elements
SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data

+------------------+-----------------+
| fruit | fruit |
+------------------+-----------------+
| { | { |
| "weight":8, | "weight":9, |
| "type":"apple" | "type":"pear" |
| } | } |
+------------------+-----------------+

-- Extract subfields from arrays


SELECT raw:store.book[*].isbn FROM store_data

+--------------------+
| isbn |
+--------------------+
| [ |
| null, |
| "0-553-21311-3", |
| "0-395-19395-8" |
| ] |
+--------------------+

-- Access arrays within arrays or structs within arrays


SELECT
raw:store.basket[*],
raw:store.basket[*][0] first_of_baskets,
raw:store.basket[0][*] first_basket,
raw:store.basket[*][*] all_elements_flattened,
raw:store.basket[0][2].b subfield
FROM store_data
+---------------------------------------+------------------+-------------------------+---------------------------------+----------+
| basket                                | first_of_baskets | first_basket            | all_elements_flattened          | subfield |
+---------------------------------------+------------------+-------------------------+---------------------------------+----------+
| [[1,2,{"b":"y","a":"x"}],[3,4],[5,6]] | [1,3,5]          | [1,2,{"b":"y","a":"x"}] | [1,2,{"b":"y","a":"x"},3,4,5,6] | y        |
+---------------------------------------+------------------+-------------------------+---------------------------------+----------+

Cast values
You can use :: to cast values to basic data types. Use the from_json method to cast nested results into more
complex data types, such as arrays or structs.

-- price is returned as a double, not a string


SELECT raw:store.bicycle.price::double FROM store_data

+------------------+
| price |
+------------------+
| 19.95 |
+------------------+

-- use from_json to cast into more complex types


SELECT from_json(raw:store.bicycle, 'price double, color string') bicycle FROM store_data
-- the column returned is a struct containing the columns price and color

+------------------+
| bicycle |
+------------------+
| { |
| "price":19.95, |
| "color":"red" |
| } |
+------------------+

SELECT from_json(raw:store.basket[*], 'array<array<string>>') baskets FROM store_data


-- the column returned is an array of string arrays
+------------------------------------------+
| baskets                                  |
+------------------------------------------+
| [                                        |
|  ["1","2","{\"b\":\"y\",\"a\":\"x\"}"],  |
|  ["3","4"],                              |
|  ["5","6"]                               |
| ]                                        |
+------------------------------------------+

Example data

CREATE TABLE store_data AS SELECT


'{
"store":{
"fruit": [
{"weight":8,"type":"apple"},
{"weight":9,"type":"pear"}
],
"basket":[
[1,2,{"b":"y","a":"x"}],
[3,4],
[5,6]
],
"book":[
{
"author":"Nigel Rees",
"title":"Sayings of the Century",
"category":"reference",
"price":8.95
},
{
"author":"Herman Melville",
"title":"Moby Dick",
"category":"fiction",
"price":8.99,
"isbn":"0-553-21311-3"
},
{
"author":"J. R. R. Tolkien",
"title":"The Lord of the Rings",
"category":"fiction",
"reader":[
{"age":25,"name":"bob"},
{"age":26,"name":"jack"}
],
"price":22.99,
"isbn":"0-395-19395-8"
}
],
"bicycle":{
"price":19.95,
"color":"red"
}
},
"owner":"amy",
"zip code":"94025",
"fb:testid":"1234"
}' as raw

NULL behavior
When a JSON field exists with a null value, you will receive a SQL null value for that column, not a null text
value.

select '{"key":null}':key is null sql_null, '{"key":null}':key == 'null' text_null

+-------------+-----------+
| sql_null | text_null |
+-------------+-----------+
| true | null |
+-------------+-----------+
Optimize conversion between PySpark and pandas
DataFrames
7/21/2022 • 2 minutes to read

Apache Arrow is an in-memory columnar data format used in Apache Spark to efficiently transfer data between
JVM and Python processes. This is beneficial to Python developers that work with pandas and NumPy data.
However, its usage is not automatic and requires some minor changes to configuration or code to take full
advantage and ensure compatibility.

PyArrow versions
PyArrow is installed in Databricks Runtime. For information on the version of PyArrow available in each
Databricks Runtime version, see the Databricks runtime release notes.

Supported SQL types


All Spark SQL data types are supported by Arrow-based conversion except MapType , ArrayType of
TimestampType , and nested StructType . StructType is represented as a pandas.DataFrame instead of
pandas.Series . BinaryType is supported only when PyArrow is equal to or higher than 0.10.0.

Convert PySpark DataFrames to and from pandas DataFrames


Arrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with
toPandas() and when creating a PySpark DataFrame from a pandas DataFrame with
createDataFrame(pandas_df) .

To use Arrow for these methods, set the Spark configuration spark.sql.execution.arrow.pyspark.enabled to
true . This configuration is enabled by default except for High Concurrency clusters as well as user isolation
clusters in workspaces that are Unity Catalog enabled.
In addition, optimizations enabled by spark.sql.execution.arrow.pyspark.enabled could fall back to a non-Arrow
implementation if an error occurs before the computation within Spark. You can control this behavior using the
Spark configuration spark.sql.execution.arrow.pyspark.fallback.enabled .
Example

import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers


spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Generate a pandas DataFrame


pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow


df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow


result_pdf = df.select("*").toPandas()

Using the Arrow optimizations produces the same results as when Arrow is not enabled. Even with Arrow,
toPandas() results in the collection of all records in the DataFrame to the driver program and should be done
on a small subset of the data.
In addition, not all Spark data types are supported and an error can be raised if a column has an unsupported
type. If an error occurs during createDataFrame() , Spark falls back to creating the DataFrame without Arrow.
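
If you prefer that such errors surface instead of silently reverting to the non-Arrow path, a minimal sketch disables the fallback:

# A minimal sketch: disable the non-Arrow fallback so conversion errors are raised
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")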
User-defined scalar functions - Scala
7/21/2022 • 2 minutes to read

This article contains Scala user-defined function (UDF) examples. It shows how to register UDFs, how to invoke
UDFs, and caveats regarding evaluation order of subexpressions in Spark SQL. See User-defined scalar functions
(UDFs) for more details.

Register a function as a UDF


val squared = (s: Long) => {
s * s
}
spark.udf.register("square", squared)

Call the UDF in Spark SQL


spark.range(1, 20).createOrReplaceTempView("test")

%sql select id, square(id) as id_squared from test

Use UDF with DataFrames


import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))

Evaluation order and null checking


Spark SQL (including SQL and the DataFrame and Dataset APIs) does not guarantee the order of evaluation of
subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or
in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-
circuiting” semantics.
Therefore, it is dangerous to rely on the side effects or order of evaluation of Boolean expressions, and the order
of WHERE and HAVING clauses, since such expressions and clauses can be reordered during query optimization
and planning. Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there’s no
guarantee that the null check will happen before invoking the UDF. For example,

spark.udf.register("strlen", (s: String) => s.length)


spark.sql("select s from test1 where s is not null and strlen(s) > 1") // no guarantee

This WHERE clause does not guarantee the strlen UDF to be invoked after filtering out nulls.
To perform proper null checking, we recommend that you do either of the following:
Make the UDF itself null-aware and do null checking inside the UDF itself
Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch

spark.udf.register("strlen_nullsafe", (s: String) => if (s != null) s.length else -1)


spark.sql("select s from test1 where s is not null and strlen_nullsafe(s) > 1") // ok
spark.sql("select s from test1 where if(s is not null, strlen(s), null) > 1") // ok
User-defined scalar functions - Python
7/21/2022 • 2 minutes to read

This article contains Python user-defined function (UDF) examples. It shows how to register UDFs, how to invoke
UDFs, and caveats regarding evaluation order of subexpressions in Spark SQL.

Register a function as a UDF


def squared(s):
return s * s
spark.udf.register("squaredWithPython", squared)

You can optionally set the return type of your UDF. The default return type is StringType .

from pyspark.sql.types import LongType


def squared_typed(s):
return s * s
spark.udf.register("squaredWithPython", squared_typed, LongType())

Call the UDF in Spark SQL


spark.range(1, 20).createOrReplaceTempView("test")

%sql select id, squaredWithPython(id) as id_squared from test

Use UDF with DataFrames


from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
squared_udf = udf(squared, LongType())
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))

Alternatively, you can declare the same UDF using annotation syntax:

from pyspark.sql.functions import udf


@udf("long")
def squared_udf(s):
return s * s
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))

Evaluation order and null checking


Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of
subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or
in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-
circuiting” semantics.
Therefore, it is dangerous to rely on the side effects or order of evaluation of Boolean expressions, and the order
of WHERE and HAVING clauses, since such expressions and clauses can be reordered during query optimization
and planning. Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there’s no
guarantee that the null check will happen before invoking the UDF. For example,

spark.udf.register("strlen", lambda s: len(s), "int")


spark.sql("select s from test1 where s is not null and strlen(s) > 1") # no guarantee

This WHERE clause does not guarantee the strlen UDF to be invoked after filtering out nulls.
To perform proper null checking, we recommend that you do either of the following:
Make the UDF itself null-aware and do null checking inside the UDF itself
Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch

spark.udf.register("strlen_nullsafe", lambda s: len(s) if not s is None else -1, "int")


spark.sql("select s from test1 where s is not null and strlen_nullsafe(s) > 1") // ok
spark.sql("select s from test1 where if(s is not null, strlen(s), null) > 1") // ok
pandas user-defined functions
7/21/2022 • 7 minutes to read

A pandas user-defined function (UDF)—also known as vectorized UDF—is a user-defined function that uses
Apache Arrow to transfer data and pandas to work with the data. pandas UDFs allow vectorized operations that
can increase performance up to 100x compared to row-at-a-time Python UDFs.
For background information, see the blog post New Pandas UDFs and Python Type Hints in the Upcoming
Release of Apache Spark 3.0 and Optimize conversion between PySpark and pandas DataFrames.
You define a pandas UDF using the keyword pandas_udf as a decorator and wrap the function with a Python
type hint. This article describes the different types of pandas UDFs and shows how to use pandas UDFs with
type hints.

Series to Series UDF


You use a Series to Series pandas UDF to vectorize scalar operations. You can use them with APIs such as
select and withColumn .

The Python function should take a pandas Series as an input and return a pandas Series of the same length, and
you should specify these in the Python type hints. Spark runs a pandas UDF by splitting columns into batches,
calling the function for each batch as a subset of the data, then concatenating the results.
The following example shows how to create a pandas UDF that computes the product of 2 columns.

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

# Declare the function and create the UDF


def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
# 0 1
# 1 4
# 2 9
# dtype: int64

# Create a Spark DataFrame, 'spark' is an existing SparkSession


df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute function as a Spark vectorized UDF


df.select(multiply(col("x"), col("x"))).show()
# +-------------------+
# |multiply_func(x, x)|
# +-------------------+
# | 1|
# | 4|
# | 9|
# +-------------------+
Iterator of Series to Iterator of Series UDF
An iterator UDF is the same as a scalar pandas UDF except:
The Python function
Takes an iterator of batches instead of a single input batch as input.
Returns an iterator of output batches instead of a single output batch.
The length of the entire output in the iterator should be the same as the length of the entire input.
The wrapped pandas UDF takes a single Spark column as an input.
You should specify the Python type hint as Iterator[pandas.Series] -> Iterator[pandas.Series] .
This pandas UDF is useful when the UDF execution requires initializing some state, for example, loading a
machine learning model file to apply inference to every input batch.
The following example shows how to create a pandas UDF with iterator support.

import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf, struct

pdf = pd.DataFrame([1, 2, 3], columns=["x"])


df = spark.createDataFrame(pdf)

# When the UDF is called with the column,


# the input to the underlying function is an iterator of pd.Series.
@pandas_udf("long")
def plus_one(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
for x in batch_iter:
yield x + 1

df.select(plus_one(col("x"))).show()
# +-----------+
# |plus_one(x)|
# +-----------+
# | 2|
# | 3|
# | 4|
# +-----------+

# In the UDF, you can initialize some state before processing batches.
# Wrap your code with try/finally or use context managers to ensure
# the release of resources at the end.
y_bc = spark.sparkContext.broadcast(1)

@pandas_udf("long")
def plus_y(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
y = y_bc.value # initialize states
try:
for x in batch_iter:
yield x + y
finally:
pass # release resources here, if any

df.select(plus_y(col("x"))).show()
# +---------+
# |plus_y(x)|
# +---------+
# | 2|
# | 3|
# | 4|
# +---------+
Iterator of multiple Series to Iterator of Series UDF
An Iterator of multiple Series to Iterator of Series UDF has similar characteristics and restrictions as Iterator of
Series to Iterator of Series UDF. The specified function takes an iterator of batches and outputs an iterator of
batches. It is also useful when the UDF execution requires initializing some state.
The differences are:
The underlying Python function takes an iterator of a tuple of pandas Series.
The wrapped pandas UDF takes multiple Spark columns as an input.
You specify the type hints as Iterator[Tuple[pandas.Series, ...]] -> Iterator[pandas.Series] .

from typing import Iterator, Tuple


import pandas as pd

from pyspark.sql.functions import col, pandas_udf, struct

pdf = pd.DataFrame([1, 2, 3], columns=["x"])


df = spark.createDataFrame(pdf)

@pandas_udf("long")
def multiply_two_cols(
iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
for a, b in iterator:
yield a * b

df.select(multiply_two_cols("x", "x")).show()
# +-----------------------+
# |multiply_two_cols(x, x)|
# +-----------------------+
# | 1|
# | 4|
# | 9|
# +-----------------------+

Series to scalar UDF


Series to scalar pandas UDFs are similar to Spark aggregate functions. A Series to scalar pandas UDF defines an
aggregation from one or more pandas Series to a scalar value, where each pandas Series represents a Spark
column. You use a Series to scalar pandas UDF with APIs such as select , withColumn , groupBy.agg , and
pyspark.sql.Window.
You express the type hint as pandas.Series, ... -> Any . The return type should be a primitive data type, and
the returned scalar can be either a Python primitive type, for example, int or float or a NumPy data type
such as numpy.int64 or numpy.float64 . Any should ideally be a specific scalar type.
This type of UDF does not support partial aggregation and all data for each group is loaded into memory.
The following example shows how to use this type of UDF to compute mean with select , groupBy , and
window operations:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql import Window

df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))

# Declare the function and create the UDF


@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
return v.mean()

df.select(mean_udf(df['v'])).show()
# +-----------+
# |mean_udf(v)|
# +-----------+
# | 4.2|
# +-----------+

df.groupby("id").agg(mean_udf(df['v'])).show()
# +---+-----------+
# | id|mean_udf(v)|
# +---+-----------+
# | 1| 1.5|
# | 2| 6.0|
# +---+-----------+

w = Window \
.partitionBy('id') \
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
# +---+----+------+
# | id| v|mean_v|
# +---+----+------+
# | 1| 1.0| 1.5|
# | 1| 2.0| 1.5|
# | 2| 3.0| 6.0|
# | 2| 5.0| 6.0|
# | 2|10.0| 6.0|
# +---+----+------+

For detailed usage, see pyspark.sql.functions.pandas_udf.

Usage
Setting Arrow batch size
Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to high memory
usage in the JVM. To avoid possible out of memory exceptions, you can adjust the size of the Arrow record
batches by setting the spark.sql.execution.arrow.maxRecordsPerBatch configuration to an integer that
determines the maximum number of rows for each batch. The default value is 10,000 records per batch. If the
number of columns is large, the value should be adjusted accordingly. Using this limit, each data partition is
divided into 1 or more record batches for processing.
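
For example, a minimal sketch (the value is illustrative) that lowers the batch size for DataFrames with many columns:

# A minimal sketch: reduce the Arrow record batch size (value is illustrative)
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")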
Timestamp with time zone semantics
Spark internally stores timestamps as UTC values, and timestamp data brought in without a specified time zone
is converted as local time to UTC with microsecond resolution.
When timestamp data is exported or displayed in Spark, the session time zone is used to localize the timestamp
values. The session time zone is set with the spark.sql.session.timeZone configuration and defaults to the JVM
system local time zone. pandas uses a datetime64 type with nanosecond resolution, datetime64[ns] , with
optional time zone on a per-column basis.
When timestamp data is transferred from Spark to pandas it is converted to nanoseconds and each column is
converted to the Spark session time zone then localized to that time zone, which removes the time zone and
displays values as local time. This occurs when calling toPandas() or pandas_udf with timestamp columns.
When timestamp data is transferred from pandas to Spark, it is converted to UTC microseconds. This occurs
when calling createDataFrame with a pandas DataFrame or when returning a timestamp from a pandas UDF.
These conversions are done automatically to ensure Spark has data in the expected format, so it is not necessary
to do any of these conversions yourself. Any nanosecond values are truncated.
A standard UDF loads timestamp data as Python datetime objects, which is different than a pandas timestamp.
To get the best performance, we recommend that you use pandas time series functionality when working with
timestamps in a pandas UDF. For details, see Time Series / Date functionality.
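
For example, a minimal sketch (the zone is illustrative) that pins the session time zone used for this localization:

# A minimal sketch: set the session time zone used to localize timestamp values (zone is illustrative)
spark.conf.set("spark.sql.session.timeZone", "UTC")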

Example notebook
The following notebook illustrates the performance improvements you can achieve with pandas UDFs:
pandas UDFs benchmark notebook
Get notebook
pandas function APIs
7/21/2022 • 4 minutes to read

pandas function APIs enable you to directly apply a Python native function, which takes and outputs pandas
instances, to a PySpark DataFrame. Similar to pandas user-defined functions, function APIs also use Apache
Arrow to transfer data and pandas to work with the data; however, Python type hints are optional in pandas
function APIs.
There are three types of pandas function APIs:
Grouped map
Map
Cogrouped map
pandas function APIs leverage the same internal logic that pandas UDF executions use. Therefore, it shares the
same characteristics with pandas UDFs such as PyArrow, supported SQL types, and the configurations.
For more information, see the blog post New Pandas UDFs and Python Type Hints in the Upcoming Release of
Apache Spark 3.0.

Grouped map
You transform your grouped data via groupBy().applyInPandas() to implement the “split-apply-combine”
pattern. Split-apply-combine consists of three steps:
Split the data into groups by using DataFrame.groupBy .
Apply a function on each group. The input and output of the function are both pandas.DataFrame . The input
data contains all the rows and columns for each group.
Combine the results into a new DataFrame .

To use groupBy().applyInPandas() , you must define the following:


A Python function that defines the computation for each group
A StructType object or a string that defines the schema of the output DataFrame

The column labels of the returned pandas.DataFrame must either match the field names in the defined output
schema if specified as strings, or match the field data types by position if not strings, for example, integer
indices. See pandas.DataFrame for how to label columns when constructing a pandas.DataFrame .
All data for a group is loaded into memory before the function is applied. This can lead to out of memory
exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied on
groups and it is up to you to ensure that the grouped data fits into the available memory.
The following example shows how to use groupby().applyInPandas() to subtract the mean from each value in the group.
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))

def subtract_mean(pdf):
# pdf is a pandas.DataFrame
v = pdf.v
return pdf.assign(v=v - v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()


# +---+----+
# | id| v|
# +---+----+
# | 1|-0.5|
# | 1| 0.5|
# | 2|-3.0|
# | 2|-1.0|
# | 2| 4.0|
# +---+----+

For detailed usage, see pyspark.sql.GroupedData.applyInPandas.

Map
You perform map operations with pandas instances by DataFrame.mapInPandas() in order to transform an
iterator of pandas.DataFrame to another iterator of pandas.DataFrame that represents the current PySpark
DataFrame and returns the result as a PySpark DataFrame.
The underlying function takes and outputs an iterator of pandas.DataFrame . It can return the output of arbitrary
length in contrast to some pandas UDFs such as Series to Series pandas UDF.
The following example shows how to use mapInPandas() :

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def filter_func(iterator):
for pdf in iterator:
yield pdf[pdf.id == 1]

df.mapInPandas(filter_func, schema=df.schema).show()
# +---+---+
# | id|age|
# +---+---+
# | 1| 21|
# +---+---+

For detailed usage, see pyspark.sql.DataFrame.mapInPandas.

Cogrouped map
For cogrouped map operations with pandas instances, use DataFrame.groupby().cogroup().applyInPandas() for
two PySpark DataFrame s to be cogrouped by a common key and then a Python function applied to each
cogroup. It consists of the following steps:
Shuffle the data such that the groups of each DataFrame which share a key are cogrouped together.
Apply a function to each cogroup. The input of the function is two pandas.DataFrame (with an optional tuple
representing the key). The output of the function is a pandas.DataFrame .
Combine the pandas.DataFrame s from all groups into a new PySpark DataFrame .
To use groupBy().cogroup().applyInPandas() , you must define the following:
A Python function that defines the computation for each cogroup.
A StructType object or a string that defines the schema of the output PySpark DataFrame .

The column labels of the returned pandas.DataFrame must either match the field names in the defined output
schema if specified as strings, or match the field data types by position if not strings, for example, integer
indices. See pandas.DataFrame for how to label columns when constructing a pandas.DataFrame .
All data for a cogroup is loaded into memory before the function is applied. This can lead to out of memory
exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied
and it is up to you to ensure that the cogrouped data fits into the available memory.
The following example shows how to use groupby().cogroup().applyInPandas() to perform an asof join
between two datasets.

import pandas as pd

df1 = spark.createDataFrame(
[(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
("time", "id", "v1"))

df2 = spark.createDataFrame(
[(20000101, 1, "x"), (20000101, 2, "y")],
("time", "id", "v2"))

def asof_join(l, r):


return pd.merge_asof(l, r, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
asof_join, schema="time int, id int, v1 double, v2 string").show()
# +--------+---+---+---+
# | time| id| v1| v2|
# +--------+---+---+---+
# |20000101| 1|1.0| x|
# |20000102| 1|3.0| x|
# |20000101| 2|2.0| y|
# |20000102| 2|4.0| y|
# +--------+---+---+---+

For detailed usage, see pyspark.sql.PandasCogroupedOps.applyInPandas.


Apache Hive compatibility
7/21/2022 • 2 minutes to read

Apache Spark SQL in Azure Databricks is designed to be compatible with Apache Hive, including metastore
connectivity, SerDes, and UDFs.

SerDes and UDFs


Hive SerDes and UDFs are based on Hive 1.2.1.

Metastore connectivity
See External Apache Hive metastore for information on how to connect Azure Databricks to an externally hosted
Hive metastore.

Supported Hive features


Spark SQL supports the vast majority of Hive features, such as:
Hive query statements, including:
SELECT
GROUP BY
ORDER BY
CLUSTER BY
SORT BY
All Hive expressions, including:
Relational expressions ( = , <=> , == , <> , < , > , >= , <= , etc)
Arithmetic expressions ( + , - , * , / , % , etc)
Logical expressions (AND, &&, OR, ||, etc)
Complex type constructors
Mathematical expressions (sign, ln, cos, etc)
String expressions (instr, length, printf, etc)
User defined functions (UDF)
User defined aggregation functions (UDAF)
User defined serialization formats (SerDes)
Window functions
Joins
JOIN
{LEFT|RIGHT|FULL} OUTER JOIN
LEFT SEMI JOIN
CROSS JOIN
Unions
Sub-queries
SELECT col FROM ( SELECT a + b AS col from t1) t2
Sampling
Explain
Partitioned tables including dynamic partition insertion
View
Vast majority of DDL statements, including:
CREATE TABLE
CREATE TABLE AS SELECT
ALTER TABLE
Most Hive data types, including:
TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
BINARY
TIMESTAMP
DATE
ARRAY<>
MAP<>
STRUCT<>

Unsupported Hive functionality


The following sections contain a list of Hive features that Spark SQL doesn’t support. Most of these features are
rarely used in Hive deployments.
Major Hive features
Writing to bucketed table created by Hive
ACID fine-grained updates
Esoteric Hive features
Union type
Unique join
Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment
and only supports populating the sizeInBytes field of the Hive metastore
Hive input and output formats
File format for CLI: For results showing back to the CLI, Spark SQL supports only TextOutputFormat
Hadoop archive
Hive optimizations
A handful of Hive optimizations are not included in Spark. Some of these (such as indexes) are less important
due to Spark SQL’s in-memory computational model.
Block level bitmap indexes and virtual columns (used to build indexes).
Automatically determine the number of reducers for joins and groupbys: In Spark SQL, you need to control
the degree of parallelism post-shuffle using SET spark.sql.shuffle.partitions=[num_tasks]; .
Skew data flag: Spark SQL does not follow the skew data flag in Hive.
STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
Merge multiple small files for query results: if the result output contains multiple small files, Hive can
optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Spark SQL
does not support that.
Frequently asked questions about Azure Databricks
7/21/2022 • 6 minutes to read

This article lists the top questions you might have related to Azure Databricks. It also lists some common
problems you might have while using Databricks. For more information, see What is Azure Databricks.

Can I use Azure Key Vault to store keys/secrets to be used in Azure Databricks?


Yes. You can use Azure Key Vault to store keys/secrets for use with Azure Databricks. For more information, see
Azure Key Vault-backed scopes.
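
For example, a minimal sketch (the scope and key names are hypothetical) that reads a secret from a Key Vault-backed secret scope in a notebook:

# A minimal sketch (hypothetical scope and key names): read a secret from a Key Vault-backed scope
db_password = dbutils.secrets.get(scope="my-keyvault-scope", key="db-password")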

Can I use Azure Virtual Networks with Databricks?


Yes. You can use an Azure Virtual Network (VNET) with Azure Databricks. For more information, see Deploying
Azure Databricks in your Azure Virtual Network.

How do I access Azure Data Lake Storage from a notebook?


Follow these steps:
1. In Azure Active Directory (Azure AD), provision a service principal, and record its key.
2. Assign the necessary permissions to the service principal in Data Lake Storage.
3. To access a file in Data Lake Storage, use the service principal credentials in Notebook.
For more information, see Use Azure Data Lake Storage with Azure Databricks.
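
For example, the following minimal sketch shows the general shape of the service principal (OAuth) configuration for reading from Data Lake Storage Gen2; every bracketed value and the secret scope are hypothetical placeholders:

# A minimal sketch; all <bracketed> values and the secret scope are hypothetical placeholders
storage_account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# Read a file once the configuration is in place (path is a placeholder)
df = spark.read.csv(f"abfss://<container>@{storage_account}.dfs.core.windows.net/<path-to-file>")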

Fix common problems


Here are a few problems you might encounter with Databricks.
Issue: This subscription is not registered to use the namespace 'Microsoft.Databricks'
Error message
"This subscription is not registered to use the namespace 'Microsoft.Databricks'. See https://aka.ms/rps-not-
found for how to register subscriptions. (Code: MissingSubscriptionRegistration)"
Solution
1. Go to the Azure portal.
2. Select Subscriptions , the subscription you are using, and then Resource providers .
3. In the list of resource providers, against Microsoft.Databricks , select Register . You must have the
contributor or owner role on the subscription to register the resource provider.
Issue: Your account {email} does not have the owner or contributor role on the Databricks workspace
resource in the Azure portal
Error message
"Your account {email} does not have Owner or Contributor role on the Databricks workspace resource in the
Azure portal. This error can also occur if you are a guest user in the tenant. Ask your administrator to grant you
access or add you as a user directly in the Databricks workspace." (Code: AADSTS90015)
Solution
The following are some solutions to this issue:
If you are an Azure Databricks user without the Owner or Contributor role on the Databricks workspace
resource and you simply want to access the workspace:
You should access it directly using the URL (for example, https://adb-5555555555555555.19.azuredatabricks.net). Do not use the Launch Workspace button in the Azure portal.
If you expected to be recognized as an Owner or Contributor on the workspace resource:
To initialize the tenant, you must be signed in as a regular user of the tenant, not as a guest user. You must
also have the Contributor or Owner role on the Databricks workspace resource. An administrator can
grant a user a role from the Access control (IAM) tab within the Azure Databricks workspace in the
Azure portal.
This error might also occur if your email domain name is assigned to multiple directories in Azure AD. To
work around this issue, create a new user in the directory that contains the subscription with your
Databricks workspace.
a. In the Azure portal, go to Azure AD. Select Users and Groups > Add a user .
b. Add a user with an @<tenant_name>.onmicrosoft.com email instead of @<your_domain> email. You can
find this option in Custom Domains , under Azure AD in the Azure portal.
c. Grant this new user the Contributor role on the Databricks workspace resource.
d. Sign in to the Azure portal with the new user, and find the Databricks workspace.
e. Launch the Databricks workspace as this user.
Issue: Your account {email} has not been registered in Databricks
Solution
If you did not create the workspace, and you are added as a user, contact the person who created the workspace.
Have that person add you by using the Azure Databricks Admin Console. For instructions, see Adding and
managing users. If you created the workspace and still you get this error, try selecting Initialize Workspace
again from the Azure portal.
Issue: Cloud provider launch failure while setting up the cluster (PublicIPCountLimitReached)
Error message
"Cloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster. For more
information, see the Databricks guide. Azure error code: PublicIPCountLimitReached. Azure error message:
Cannot create more than 10 public IP addresses for this subscription in this region."
Background
Databricks clusters use one public IP address per node (including the driver node). Azure subscriptions have
public IP address limits per region. Thus, cluster creation and scale-up operations may fail if they would cause
the number of public IP addresses allocated to that subscription in that region to exceed the limit. This limit also
includes public IP addresses allocated for non-Databricks usage, such as custom user-defined VMs.
In general, clusters only consume public IP addresses while they are active. However, PublicIPCountLimitReached
errors may continue to occur for a short period of time even after other clusters are terminated. This is because
Databricks temporarily caches Azure resources when a cluster is terminated. Resource caching is by design,
since it significantly reduces the latency of cluster startup and autoscaling in many common scenarios.
Solution
If your subscription has already reached its public IP address limit for a given region, then you should do one of
the following:
Create new clusters in a different Databricks workspace. The other workspace must be located in a region in
which you have not reached your subscription's public IP address limit.
Request to increase your public IP address limit. Choose Quota as the Issue Type , and Networking: ARM
as the Quota Type . In Details , request a Public IP Address quota increase. For example, if your limit is
currently 60, and you want to create a 100-node cluster, request a limit increase to 160.
Issue: A second type of cloud provider launch failure while setting up the cluster
(MissingSubscriptionRegistration)
Error message
"Cloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster. For more
information, see the Databricks guide. Azure error code: MissingSubscriptionRegistration Azure error message:
The subscription is not registered to use namespace 'Microsoft.Compute'. See https://aka.ms/rps-not-found for
how to register subscriptions."
Solution
1. Go to the Azure portal.
2. Select Subscriptions , the subscription you are using, and then Resource providers .
3. In the list of resource providers, against Microsoft.Compute , select Register . You must have the
contributor or owner role on the subscription to register the resource provider.
For more detailed instructions, see Resource providers and types.
Issue: Azure Databricks needs permissions to access resources in your organization that only an admin can
grant.
Background
Azure Databricks is integrated with Azure Active Directory. You can set permissions within Azure Databricks (for
example, on notebooks or clusters) by specifying users from Azure AD. For Azure Databricks to be able to list the
names of the users from your Azure AD, it requires read permission to that information and consent to be given.
If the consent is not already available, you see the error.
Solution
Log in as a global administrator to the Azure portal. For Azure Active Directory, go to the User Settings tab and
make sure Users can consent to apps accessing company data on their behalf is set to Yes .
Issue: Azure Databricks does not support creating a workspace in an Azure resource group whose name contains
Chinese characters.
Solution
Validation for this scenario will be added to workspace creation in a later release.

Next steps
Quickstart: Get started with Azure Databricks
What is Azure Databricks?
Welcome to the Knowledge Base for Azure
Databricks
7/21/2022 • 2 minutes to read

This Knowledge Base provides a wide variety of troubleshooting, how-to, and best practices articles to help you
succeed with Azure Databricks, Delta Lake, and Apache Spark. These articles were written mostly by support and
field engineers, in response to typical customer questions and issues.
Azure Databricks administration: tips and troubleshooting
Azure infrastructure: tips and troubleshooting
Business intelligence tools: tips and troubleshooting
Clusters: tips and troubleshooting
Data management: tips and troubleshooting
Data sources: tips and troubleshooting
Databricks File System (DBFS): tips and troubleshooting
Databricks SQL: tips and troubleshooting
Developer tools: tips and troubleshooting
Delta Lake: tips and troubleshooting
Jobs: tips and troubleshooting
Job execution: tips and troubleshooting
Libraries: tips and troubleshooting
Machine learning: tips and troubleshooting
Metastore: tips and troubleshooting
Metrics: tips and troubleshooting
Notebooks: tips and troubleshooting
Security and permissions: tips and troubleshooting
Streaming: tips and troubleshooting
Visualizations: tips and troubleshooting
Python with Apache Spark: tips and troubleshooting
R with Apache Spark: tips and troubleshooting
Scala with Apache Spark: tips and troubleshooting
SQL with Apache Spark: tips and troubleshooting
How to discover who deleted a workspace in Azure
portal
7/21/2022 • 2 minutes to read

If your workspace has disappeared or been deleted, you can identify which user deleted it by checking the
Activity log in the Azure portal.
1. Go to the Activity log in the Azure portal.
2. Expand the timeline to focus on when the workspace was deleted.
3. Filter the log for a record of the specific event.
4. Click on the event to display information about the event, including the user who initiated the event.
The screenshot shows how you can click the Remove Databricks Workspace event in the Operation Name
column, and then view detailed information about the event.

If you are still unable to find who deleted the workspace, create a support case with Microsoft Support. Provide
details such as the workspace id and the time range of the event (including your time zone). Microsoft Support
will review the corresponding backend activity logs.
How to discover who deleted a cluster in Azure
portal
7/21/2022 • 2 minutes to read

If a cluster in your workspace has disappeared or been deleted, you can identify which user deleted it by running
a query in the Log Analytics workspaces service in the Azure portal.

NOTE
If you do not have a Log Analytics workspace set up, you must configure Diagnostic Logging in Azure Databricks before you
continue.

1. Load the Log Analytics workspaces service in the Azure portal.


2. Click the name of your workspace.
3. Click Logs .
4. Look for the following text: Type your query here or click one of the example queries to start .

5. Enter the following query:

DatabricksClusters
| where ActionName == "permanentDelete"
    and Response contains "\"statusCode\":200"
    and RequestParams contains "\"cluster_id\":\"0210-024915-bore731\"" // Add a cluster_id filter if the cluster ID is known.
    and TimeGenerated between(datetime("2020-01-25 00:00:00") .. datetime("2020-01-28 00:00:00")) // Add a timestamp (in UTC) filter to narrow down the result.
| extend id = parse_json(Identity)
| extend requestParams = parse_json(RequestParams)
| project UserEmail = id.email, clusterId = requestParams.cluster_id, SourceIPAddress, EventTime = TimeGenerated

6. Edit the cluster_id as required.


7. Edit the datetime values to filter on a specific time range.
8. Click Run to execute the query.
The results (if any) display below the query box.

If you are still unable to find who deleted the cluster, create a support case with Microsoft Support. Provide
details such as the workspace id and the time range of the event (including your time zone). Microsoft Support
will review the corresponding backend activity logs.
Configure Simba JDBC driver using Azure AD
7/21/2022 • 2 minutes to read

This article describes how to access Azure Databricks with a Simba JDBC driver using Azure AD authentication.
This can be useful if you want to use an Azure AD user account to connect to Azure Databricks.

NOTE
Power BI has native support for Azure AD authentication with Azure Databricks. Review the Power BI documentation for
more information.

Create a service principal


Create a service principal in Azure AD. The service principal obtains an access token for the user.
1. Open the Azure Portal.
2. Open the Azure Active Directory service.
3. Click App registrations in the left menu.
4. Click New registration .
5. Complete the form and click Register .
Your service principal has been successfully created.

Configure service principal permissions


1. Open the service principal you created.
2. Click API permissions in the left menu.
3. Click Add a permission .
4. Click Azure Rights Management Services .
5. Click Delegated permissions .
6. Select user_impersonation .
7. Click Add permissions .
8. The user_impersonation permission is now assigned to your service principal.

NOTE
If Grant admin consent is not enabled, you may encounter an error later on in the process.

Update service principal manifest


1. Click Manifest in the left menu.
2. Look for the line containing the "allowPublicClient" property.
3. Set the value to true .

4. Click Save .

Download and configure the JDBC driver


1. Download the Databricks JDBC Driver.
2. Configure the JDBC driver as detailed in the documentation.

Obtain the Azure AD token


Use the sample code to obtain the Azure AD token for the user.
Replace the variables with values that are appropriate for your account.
from adal import AuthenticationContext

authority_host_url = "https://login.microsoftonline.com/"
# Application ID of Azure Databricks
azure_databricks_resource_id = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

# Required user input


user_parameters = {
    "tenant": "<tenantId>",
    "client_id": "<clientId>",
    "username": "<user@domain.com>",
    "password": "<password>"
}

# configure AuthenticationContext
# authority URL and tenant ID are used
authority_url = authority_host_url + user_parameters['tenant']
context = AuthenticationContext(authority_url)

# API call to get the token


token_response = context.acquire_token_with_username_password(
    azure_databricks_resource_id,
    user_parameters['username'],
    user_parameters['password'],
    user_parameters['client_id']
)

access_token = token_response['accessToken']
refresh_token = token_response['refreshToken']

Pass the Azure AD token to the JDBC driver


Now that you have the user’s Azure AD token, you can pass it to the JDBC driver using Auth_AccessToken in the
JDBC URL as detailed in the Azure Active Directory token authentication documentation.
This sample code demonstrates how to pass the Azure AD token.
# Install jaydebeapi pypi module (used for demo)

import jaydebeapi
import pandas as pd

import os
os.environ["CLASSPATH"] = "<path to downloaded Simba Spark JDBC/ODBC driver>"

# JDBC connection string


url="jdbc:spark://adb-
111111111111xxxxx.xx.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<wor
kspaceId>/<clusterId>;AuthMech=11;Auth_Flow=0;Auth_AccessToken={0}".format(access_token)

cursor = None
try:
    conn = jaydebeapi.connect("com.simba.spark.jdbc.Driver", url)
    cursor = conn.cursor()

    # Execute SQL query
    sql = "select * from <tablename>"
    cursor.execute(sql)
    results = cursor.fetchall()
    column_names = [x[0] for x in cursor.description]
    pdf = pd.DataFrame(results, columns=column_names)
    print(pdf.head())

    # Uncomment the following two lines if this code is running in the Databricks Connect IDE or within a workspace notebook.
    # df = spark.createDataFrame(pdf)
    # df.show()

finally:
    if cursor is not None:
        cursor.close()
Configure Simba ODBC driver with a proxy in
Windows
7/21/2022 • 2 minutes to read

In this article you learn how to configure the Databricks ODBC Driver when your local Windows machine is
behind a proxy server.

Download the Simba driver for Windows


Download and install the latest version of the Databricks ODBC Driver for Windows.

Add proxy settings to the Windows registry


Open the Windows registry and add the proxy settings to the Simba Spark ODBC Driver key.
1. Open the Windows Registr y Editor .
2. Navigate to the HKEY_LOCAL_MACHINE\SOFTWARE\Simba\Simba Spark ODBC Driver\Driver key.
3. Click Edit .
4. Select New .
5. Click String Value .
6. Enter UseProxy as the Name and 1 as the Data value.
7. Repeat this until you have added the following string value pairs:
Name ProxyHost Data <proxy-host-address>
Name ProxyPort Data <proxy-port-number>
Name ProxyUID Data <proxy-username>
Name ProxyPWD Data <proxy-password>
8. Close the registry editor.

Configure settings in ODBC Data Source Administrator


1. Open the ODBC Data Sources application.
2. Click the System DSN tab.
3. Select the Simba Spark ODBC Driver and click Configure .
4. Enter the connection information of your Apache Spark server.
5. Click Advanced Options .
6. Enable the Driver Config Take Precedence check box.

7. Click OK .
8. Click OK .
9. Click OK .
Troubleshooting JDBC and ODBC connections
7/21/2022 • 2 minutes to read

This article provides information to help you troubleshoot the connection between your Databricks JDBC/ODBC
server and BI tools and data sources.

Fetching result set is slow after statement execution


After a query execution, you can fetch result rows by calling the next() method on the returned ResultSet
repeatedly. This method triggers a request to the driver Thrift server to fetch a batch of rows back if the buffered
ones are exhausted. The batch size significantly affects performance. The default value in most JDBC/ODBC
drivers is too conservative; we recommend setting it to at least 100,000.
Contact the BI tool provider if you cannot access this configuration.
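
As an illustration only, the batch size is usually exposed as a driver configuration that can be appended to the JDBC URL. The property name RowsFetchedPerBlock below is an assumption; check the Simba driver documentation for the exact key supported by your driver version.

# Hypothetical sketch: append an (assumed) batch-size property to the JDBC URL so each
# Thrift request returns more rows. Verify the property name against the Simba driver manual.
base_url = (
    "jdbc:spark://<server-hostname>:443/default;transportMode=http;ssl=1;"
    "httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>"
)
url = base_url + ";RowsFetchedPerBlock=100000"
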

Timeout/Exception when creating the connection


Once you have the server hostname, you can run the following tests from a terminal to check for connectivity to
the endpoint.

curl https://<server-hostname>:<port>/sql/protocolv1/o/0/<cluster-id> \
  -H "Authorization: Basic $(echo -n 'token:<personal-access-token>' | base64)"

If the connection times out, check whether your network settings of the connection are correct.

TTransportException
If the response contains a TTransportException like the following (this error is expected), it means that the
gateway is functioning properly and you have passed in valid credentials. If you are not able to connect with the
same credentials, check that the client you are using is properly configured and is using the latest Simba drivers
(version >= 1.2.0):

<h2>HTTP ERROR: 500</h2>


<p>Problem accessing /cliservice. Reason:
<pre> javax.servlet.ServletException: org.apache.thrift.transport.TTransportException</pre></p>

Referencing temporary views


If the response contains the message Table or view not found: SPARK..temp_view it means that a temporary
view is not properly referenced in the client application. Simba has an internal configuration parameter called
UseNativeQuery that decides whether the query is translated or not before being submitted to the Thrift server.
By default, the parameter is set to 0, in which case Simba can modify the query. In particular, Simba creates a
custom #temp schema for temporary views and it expects the client application to reference a temporary view
with this schema. You can avoid using this special alias by setting UseNativeQuery=1 , which prevents Simba from
modifying the query. In this case, Simba sends the query directly to the Thrift server. However, the client needs
to make sure that the queries are written in the dialect that Spark expects, that is, HiveQL.
To sum up, you have the following options to handle temporary views over Simba and Spark:
UseNativeQuery=0 and reference the view by prefixing its name with #temp .
UseNativeQuery=1 and make sure the query is written in the dialect that Spark expects.

Other errors
If you get the error 401 Unauthorized , check the credentials you are using:

<h2>HTTP ERROR: 401</h2>


<p>Problem accessing /sql/protocolv1/o/0/test-cluster. Reason:
<pre> Unauthorized</pre></p>

Verify that the username is token (not your username) and the password is a personal access token (it
should start with dapi ).
Responses such as 404, Not Found usually indicate problems with locating the specified cluster:

<h2>HTTP ERROR: 404</h2>


<p>Problem accessing /sql/protocolv1/o/0/missing-cluster. Reason:
<pre> RESOURCE_DOES_NOT_EXIST: No cluster found matching: missing-cluster</pre></p>

If you see the following errors in your application log4j logs:

log4j:ERROR A "org.apache.log4j.FileAppender" object is not assignable to a


"com.simba.spark.jdbc42.internal.apache.log4j.Appender" variable.

You can ignore these errors. The Simba internal log4j library is shaded to avoid conflicts with the log4j
library in your application. However, Simba may still load the log4j configuration of your application, and
attempt to use some custom log4j appenders. This attempt fails with the shaded library. Relevant
information is still captured in the logs.
Power BI proxy and SSL configuration
7/21/2022 • 3 minutes to read

Driver configurations
You can set driver configurations using the microsoft.sparkodbc.ini file which can be found in the
ODBC Drivers\Simba Spark ODBC Driver directory. The absolute path of the microsoft.sparkodbc.ini directory
depends on whether you are using Power BI Desktop or on-premises Power BI Gateway:
Power BI Desktop:
C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Spark ODBC
Driver\microsoft.sparkodbc.ini
Power BI Gateway: m\ODBC Drivers\Simba Spark ODBC Driver\microsoft.sparkodbc.ini ,
where m is placed inside the gateway installation directory.
Set driver configurations
1. Check if the microsoft.sparkodbc.ini file was already created. If it has, skip to step 3.
2. Open Notepad or File Explorer as Run As Administrator and create a file at ODBC Drivers\Simba
Spark ODBC Driver\microsoft.sparkodbc.ini .
3. Add the new driver configurations to the file below the [Driver] header, using the <key>=<value> syntax.
Configuration keys can be found in the manual provided with the installation of the Databricks ODBC Driver. The
manual is located at
C:\Program Files\Simba Spark ODBC Driver\Simba Apache Spark ODBC Connector Install and Configuration
Guide.html
.
Configuring a proxy
To configure a proxy, add the following configurations to the driver configuration in the
microsoft.sparkodbc.ini file:

[Driver]
UseProxy=1
ProxyHost=<proxy.example.com>
ProxyPort=<port>
ProxyUID=<username>
ProxyPWD=<password>

Depending on the firewall configuration it might also be necessary to add:

[Driver]
CheckCertRevocation=0

Troubleshooting
Error: SSL_connect: certificate verify failed
When SSL issues occur, the ODBC driver returns a generic error SSL_connect: certificate verify failed . You
can get more detailed SSL debugging logs by setting the following two configurations in the
ODBC Drivers\Simba Spark ODBC Driver\microsoft.sparkodbc.ini file:
[Driver]
AllowDetailedSSLErrorMessages=1
EnableCurlDebugLogging=1

Diagnose issues by analyzing CryptoAPI logs


Most issues can be diagnosed by using Windows CryptoAPI logs, which can be found in the Event Viewer. The
following steps describe how to capture these logs.
1. Open Event Viewer and go to Applications and Services Logs > Microsoft > Windows > CAPI2 >
Operational .
2. In Filter Current Log , check the boxes Critical , Error and Warning and click OK .
3. In the Event Viewer , go to Actions > Enable Log to start collecting logs.
4. Connect Power BI to Azure Databricks to reproduce the issue.
5. In the Event Viewer, go to Actions > Disable Log to stop collecting logs.
6. Click Refresh to retrieve the list of collected events.
7. Export logs by clicking Actions > Save Filtered Log File As .
Diagnose Build Chain or Verify Chain Policy event errors
If the collected logs contain an error on the Build Chain or Verify Chain Policy events, this likely points to the
issue. More details can be found by selecting the event and reading the Details section. Two fields of interest
are Result , and RevocationResult .
The revocation status of the certificate or one of the certificates in the certificate chain is unknown.
CAPI2 error : RevocationResult: [80092013] The revocation function was unable to check revocation
because the revocation server was offline.
Cause: The revocation check failed due to an unavailable certificate revocation server.
Resolution: Disable certificate revocation checking.
The certificate chain is not complete.
CAPI2 error : Result: [800B010A] A certificate chain could not be built to a trusted root authority.
Cause: The certificate advertised by the VPN or proxy server is incomplete and does not contain a full
chain to the trusted root authority.
Resolution: The preferred solution is to configure the VPN or proxy server to advertise the full chain.
If this is not possible, a workaround is to obtain the intermediate certificates for the Databricks
workspace, and install these in the Intermediate Certification Authorities store, to enable Windows to
find the unadvertised certificates. See Install intermediate certificates.
If possible, it is recommended to install these certificates for all Power BI users using a group policy in
Windows. This has to be set up by the system administrator.
Certificate configurations
Disable certificate revocation checking
If the ODBC driver is unable to reach the certificate revocation list server, for example because of a firewall
configuration, it will fail to validate the certificate. This can be resolved by disabling this check. To disable
certificate revocation checking, set the configuration CheckCertRevocation=0 in the
microsoft.sparkodbc.ini file.
Install intermediate certificates
1. Open your Azure Databricks workspace URL in Chrome and go to View site information by clicking the
padlock icon in the address bar.
2. Click Certificate > Certificate Path and repeat steps 3 to 6 for every intermediate certificate in the chain.
3. Choose an intermediate certificate and go to Details > Copy to File > Next to export the certificate.
4. Select the location of the certificate and click Finish .
5. Open the exported certificate and click Install Certificate > Next .
6. From the Certificate Import Wizard click Place all certificates in the following store > Browse and
choose Intermediate Certification Authorities .
Unable to mount Azure Data Lake Storage Gen1
account
7/21/2022 • 2 minutes to read

Problem
When you try to mount an Azure Data Lake Storage (ADLS) Gen1 account on Azure Databricks, it fails with the
error:

com.microsoft.azure.datalake.store.ADLException: Error creating directory /


Error fetching access token
Operation null failed with exception java.io.IOException : Server returned HTTP response code: 401 for URL:
https://login.windows.net/18b0b5d6-b6eb-4f5d-964b-c03a6dfdeb22/oauth2/token
Last encountered exception thrown after 5 tries.
[java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException]
[ServerRequestId:null]
at com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1169)
at com.microsoft.azure.datalake.store.ADLStoreClient.createDirectory(ADLStoreClient.java:589)
at com.databricks.adl.AdlFileSystem.mkdirs(AdlFileSystem.java:533)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2$$anonfun$mkdirs$1$$anonfun$apply$mcZ$sp$7$$anonfun$apply$mcZ$sp$8.apply$mcZ$sp(DatabricksFileSystemV2.scala:638)

Cause
This error can occur if the ADLS Gen1 account was previously mounted in the workspace, but not unmounted,
and the credential used for that mount subsequently expired. When you try to mount the same account with a
new credential, there is a conflict between the expired and new credentials.

Solution
You need to unmount all existing mounts, and then create a new mount with a new, unexpired credential.
For more information, see Mount Azure Data Lake Storage Gen1 with DBFS.
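
The following notebook sketch shows the general unmount and remount pattern for ADLS Gen1 with a service principal; the mount point, secret scope, and account values are placeholders that you should replace with your own.

# Unmount the stale mount if it exists, then remount with a fresh, unexpired credential.
mount_point = "/mnt/<mount-name>"

if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

configs = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
    "fs.adl.oauth2.credential": dbutils.secrets.get(scope="<scope-name>", key="<key-name>"),
    "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
}

dbutils.fs.mount(
    source="adl://<storage-account-name>.azuredatalakestore.net/",
    mount_point=mount_point,
    extra_configs=configs
)
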
Network configuration of Azure Data Lake Storage
Gen1 causes
ADLException: Error getting info for file
7/21/2022 • 2 minutes to read

Problem
Access to Azure Data Lake Storage Gen1 (ADLS Gen1) fails with
ADLException: Error getting info for file <filename> when the following network configuration is in place:

Azure Databricks workspace is deployed in your own virtual network (uses VNet injection).
Traffic is allowed via Azure Data Lake Storage credential passthrough.
ADLS Gen1 storage firewall is enabled.
Azure Active Directory (Azure AD) service endpoint is enabled for the Azure Databricks workspace’s virtual
network.

Cause
Azure Databricks uses a control plane located in its own virtual network, and the control plane is responsible for
obtaining a token from Azure AD. ADLS credential passthrough uses the control plane to obtain Azure AD
tokens to authenticate the interactive user with ADLS Gen1.
When you deploy your Databricks workspace in your own virtual network (using VNet injection), Azure
Databricks clusters are created in your own virtual network. For increased security, you can restrict access to the
ADLS Gen 1 account by configuring the ADLS Gen1 firewall to allow only requests from your own virtual
network, by implementing service endpoints to Azure AD.
However, ADLS credential passthrough fails in this case. The reason is that when ADLS Gen1 checks for the
virtual network where the token was created, it finds the network to be the Azure Databricks control plane and
not the customer-provided virtual network where the original passthrough call was made.

Solution
To use ADLS credential passthrough with a service endpoint, storage firewall, and ADLS Gen1, enable Allow
access to Azure services in the firewall settings.
If you have security concerns about enabling this setting in the firewall, you can upgrade to ADLS Gen2. ADLS
Gen2 works with the network configuration described above.
For more information, see:
Deploying Azure Databricks in your Azure Virtual Network
Accessing Azure Data Lake Storage Automatically with your Azure Active Directory Credentials
How to assign a single public IP for VNet-injected
workspaces using Azure Firewall
7/21/2022 • 2 minutes to read

You can use an Azure Firewall to create a VNet-injected workspace in which all clusters have a single IP
outbound address. The single IP address can be used as an additional security layer with other Azure services
and applications that allow access based on specific IP addresses.
1. Set up an Azure Databricks Workspace in your own virtual network.
2. Set up a firewall within the virtual network. See Create an NVA. When you create the firewall, you should:
Note both the private and public IP addresses for the firewall for later use.
Create a network rule for the public subnet to forward all traffic to the internet:
Name: any arbitrary name
Priority: 100
Protocol: Any
Source Addresses: IP range for the public subnet in the virtual network that you created
Destination Addresses: 0.0.0.0/0
Destination Ports: *
3. Create a Custom Route Table and associate it with the public subnet.
a. Add custom routes, also known as user-defined routes (UDR) for the following services. Specify the
Azure Databricks region addresses for your region. For Next hop type , enter Internet , as shown in
creating a route table.
Control Plane NAT VIP
Webapp
Metastore
Artifact Blob Storage
Logs Blob Storage
b. Add a custom route for the firewall with the following values:
Address prefix: 0.0.0.0/0
Next hop type: Virtual appliance
Next hop address: The private IP address for the firewall.
c. Associate the route table with the public subnet.
4. Validate the setup
a. Create a cluster in the Azure Databricks workspace.
b. Query Blob storage using your own paths, or run %fs ls in a cell. You can also check the cluster's
outbound IP address, as shown in the sketch after this list.
c. If the query fails, confirm that the route table has all required UDRs (a service endpoint can be used
instead of the UDR for Blob storage).
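
The following notebook sketch checks the cluster's outbound IP address; ifconfig.me is used here only as an example IP echo service.

import requests

# The address printed here should match the firewall's public IP address.
outbound_ip = requests.get("https://ifconfig.me/ip", timeout=10).text.strip()
print("Cluster outbound IP:", outbound_ip)
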
For more information, see Route Azure Databricks traffic using a virtual appliance or firewall.
How to analyze user interface performance issues
7/21/2022 • 2 minutes to read

Problem
The Azure Databricks user interface seems to be running slowly.

Cause
User interface performance issues typically occur due to network latency or a database query taking more time
than expected.
In order to troubleshoot this type of problem, you need to collect network logs and analyze them to see which
network traffic is affected.
In most cases, you will need the assistance of Databricks Support to identify and resolve issues with Databricks
user interface performance, but you can also analyze the logs yourself with a tool such as G Suite Toolbox HAR
Analyzer. This tool helps you analyze the logs and identify the exact API and the time taken for each request.

Troubleshooting procedure
This is the procedure for Google Chrome. For other browsers, see G Suite Toolbox HAR Analyzer.
1. Open Google Chrome and go to the page where the issue occurs.
2. In the Chrome menu bar, select View > Developer > Developer Tools .
3. In the panel at the bottom of your screen, select the Network tab.
4. Look for a round Record button in the upper left corner of the Network tab, and make sure it is red. If it is
grey, click it once to start recording.
5. Check the box next to Preserve log .
6. Click Clear to clear out any existing logs from the Network tab.
7. Reproduce the issue while the network requests are being recorded.
8. After you reproduce and record the issue, right-click anywhere on the grid of network requests to open a
context menu, select Save all as HAR with Content , and save the file to your computer.
9. Analyze the file using the HAR Analyzer tool. If this analysis does not resolve the problem, open a support
ticket and upload the HAR file or attach it to your email so that Databricks can analyze it.
Example output from HAR Analyzer
Configure custom DNS settings using dnsmasq
7/21/2022 • 2 minutes to read

dnsmasq is a tool for installing and configuring DNS routing rules for cluster nodes. You can use it to set up
routing between your Azure Databricks environment and your on-premises network.

WARNING
If you use your own DNS server and it goes down, you will experience an outage and will not be able to create clusters.

Use the following cluster-scoped init script to configure dnsmasq for a cluster node.
1. Use netcat ( nc ) to test connectivity from the notebook environment to your on-premises network.

nc -vz <on-premise-ip> 53

2. Create the base directory you want to store the init script in if it does not already exist.

dbutils.fs.mkdirs("dbfs:/databricks/<init-script-folder>/")

3. Create the script.

dbutils.fs.put("/databricks/<init-script-folder>/dns-masq-az.sh";,"""
#!/bin/bash
sudo apt-get update -y
sudo apt-get install dnsmasq -y --force-yes

## Add dns entries for internal nameservers


echo server=/databricks.net/<dns-server-ip> | sudo tee --append /etc/dnsmasq.conf

## Find the default DNS settings for the instance and use them as the default DNS route
azvm_dns=cat /etc/resolv.conf | grep "nameserver"; | cut -d' ' -f 2
echo "Old dns in resolv.conf $azvm_dns"
echo "server=$azvm_dns" | sudo tee --append /etc/dnsmasq.conf

## configure resolv.conf to point to dnsmasq service instead of static resolv.conf file


mv /etc/resolv.conf /etc/resolv.conf.orig
echo nameserver 127.0.0.1 | sudo tee --append /etc/resolv.conf
sudo systemctl disable --now systemd-resolved
sudo systemctl enable --now dnsmasq
""", true)

4. Check that the script exists.

display(dbutils.fs.ls("dbfs:/databricks/<init-script-folder>/dns-masq-az.sh"))

5. Install the init script that you just created as a cluster-scoped init script.
You will need the full path to the location of the script (
dbfs:/databricks/<init-script-folder>/dns-masq-az.sh ).
6. Launch a zero-node cluster to confirm that you can create clusters.
Jobs are not progressing in the workspace
7/21/2022 • 2 minutes to read

Problem
Jobs fail to run on any cluster in the workspace.

Cause
This can happen if you have changed the VNet of an existing workspace. Changing the VNet of an existing Azure
Databricks workspace is not supported.
Review Deploy Azure Databricks in your VNet for more details.

Solution
1. Open the cluster driver logs in the Azure Databricks UI.
2. Search for the following WARN messages:

19/11/19 16:50:29 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have sufficient resources
19/11/19 16:50:44 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have sufficient resources
19/11/19 16:50:59 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have sufficient resources

If this error is present, it is likely that the VNet of the Azure Databricks workspace was changed.
3. Revert the change to restore the original VNet configuration that was used when the Azure Databricks
workspace was created.
4. Restart the running cluster.
5. Resubmit your jobs.
6. Verify the jobs are getting resources.
SAS requires current ABFS client
7/21/2022 • 2 minutes to read

Problem
While using SAS token authentication, you encounter an IllegalArgumentException error.

IllegalArgumentException: No enum constant


shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AuthType.SAS

Cause
SAS requires the current ABFS client. Previous ABFS clients do not support SAS.

Solution
You must use the current ABFS client (
shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem ) to use SAS.
This ABFS client is available by default in Databricks Runtime 7.3 LTS and above.
If you are using an old ABFS client, you should update your code so it references the current ABFS client.
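
As a minimal sketch, the following notebook code shows SAS token authentication with the current ABFS client on Databricks Runtime 7.3 LTS and above. The storage account, container, and secret names are placeholders; confirm the exact configuration keys against the Azure Databricks documentation for accessing storage with SAS tokens.

storage_account = "<storage-account-name>"  # placeholder

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "SAS")
spark.conf.set(
    f"fs.azure.sas.token.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider"
)
spark.conf.set(
    f"fs.azure.sas.fixed.token.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<sas-token-key>")
)

# Verify access through the current ABFS client.
display(dbutils.fs.ls(f"abfss://<container-name>@{storage_account}.dfs.core.windows.net/"))
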
Auto termination is disabled when starting a job
cluster
7/21/2022 • 2 minutes to read

Problem
You are trying to start a job cluster, but the job creation fails with an error message.

Error creating job


Cluster autotermination is currently disabled.

Cause
Job clusters auto terminate once the job is completed. As a result, they do not support explicit autotermination
policies.
If you include autotermination_minutes in your cluster policy JSON, you get the error on job creation.

{
"autotermination_minutes": {
"type": "fixed",
"value": 30,
"hidden": true
}
}

Solution
Do not define autotermination_minutes in the cluster policy for job clusters.
Auto termination should only be used for all-purpose clusters.
How to calculate the number of cores in a cluster
7/21/2022 • 2 minutes to read

If your organization has installed a metrics service on your cluster nodes, you can view the number of cores in
an Azure Databricks cluster in the Workspace UI using the Metrics tab on the cluster details page.
If the driver and executors are of the same node type, you can also determine the number of cores available in a
cluster programmatically, using Scala utility code:
1. Use sc.statusTracker.getExecutorInfos.length to get the total number of nodes. The result includes the
driver node, so subtract 1.
2. Use java.lang.Runtime.getRuntime.availableProcessors to get the number of cores per node.
3. Multiply both results (subtracting 1 from the total number of nodes) to get the total number of cores
available:

java.lang.Runtime.getRuntime.availableProcessors * (sc.statusTracker.getExecutorInfos.length -1)
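
A rough PySpark alternative, assuming the driver and workers use the same node type: sc.defaultParallelism normally reports the total number of cores available across the executors.

# Total executor cores reported by the running Spark context (a notebook sketch).
print(spark.sparkContext.defaultParallelism)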


Cluster Apache Spark configuration not applied
7/21/2022 • 2 minutes to read

Problem
Your cluster’s Spark configuration values are not applied.

Cause
This happens when the Spark config values are declared in the cluster configuration as well as in an init script.
When Spark config values are located in more than one place, the configuration in the init script takes
precedence and the cluster ignores the configuration settings in the UI.

Solution
You should define your Spark configuration values in one place.
Choose to define the Spark configuration in the cluster configuration or include the Spark configuration in an
init script.
Do not do both.
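After the cluster starts, you can read a setting back from the running Spark session to confirm which value actually took effect. The key below is only an example.

# Print the effective value of an example setting to confirm which configuration won.
print(spark.conf.get("spark.sql.shuffle.partitions"))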
Cluster failed to launch
7/21/2022 • 3 minutes to read

This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for
each scenario based on error messages found in logs.

Cluster timeout
Error messages:

Driver failed to start in time

INTERNAL_ERROR: The Spark driver failed to start within 300 seconds

Cluster failed to be healthy within 200 seconds

Cause
The cluster can fail to launch if it has a connection to an external Hive metastore and it tries to download all the
Hive metastore libraries from a maven repo. A cluster downloads almost 200 JAR files, including dependencies.
If the Azure Databricks cluster manager cannot confirm that the driver is ready within 5 minutes, then cluster
launch fails. This can occur because JAR downloading is taking too much time.
Solution
Store the Hive libraries in DBFS and access them locally from the DBFS location. See Spark Options.

Global or cluster-specific init scripts


Error message:

The cluster could not be started in 50 minutes. Cause: Timed out with exception after <xxx> attempts

Cause
Init scripts that run during the cluster spin-up stage send an RPC (remote procedure call) to each worker
machine to run the scripts locally. All RPCs must return their status before the process continues. If any RPC hits
an issue and doesn’t respond back (due to a transient networking issue, for example), then the 1-hour timeout
can be hit, causing the cluster setup job to fail.
Solution
Use a cluster-scoped init script instead of global or cluster-named init scripts. With cluster-scoped init scripts,
Azure Databricks does not use synchronous blocking of RPCs to fetch init script execution status.

Too many libraries installed in cluster UI


Error message:

Library installation timed out after 1800 seconds. Libraries that are not yet installed:

Cause
This is usually an intermittent problem due to network problems.
Solution
Usually you can fix this problem by re-running the job or restarting the cluster.
The library installer is configured to time out after 3 minutes. While fetching and installing jars, a timeout can
occur due to network problems. To mitigate this issue, you can download the libraries from Maven to a DBFS
location and install them from there.
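
The following notebook sketch stages a JAR in DBFS so the cluster can install it from a local path instead of fetching it from Maven at startup; the Maven coordinates and file names are placeholders.

import urllib.request

# /dbfs is the local FUSE mount of DBFS, so the file lands at dbfs:/FileStore/jars/.
jar_url = "https://repo1.maven.org/maven2/<group-path>/<artifact>/<version>/<artifact>-<version>.jar"
urllib.request.urlretrieve(jar_url, "/dbfs/FileStore/jars/<artifact>-<version>.jar")

You can then install the JAR as a cluster library from the dbfs:/FileStore/jars/ path.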

Cloud provider limit


Error message:

Cluster terminated. Reason: Cloud Provider Limit

Cause
This error is usually returned by the cloud provider.
Solution
See the cloud provider error information in cluster unexpected termination.

Cloud provider shutdown


Error message:

Cluster terminated. Reason: Cloud Provider Shutdown

Cause
This error is usually returned by the cloud provider.
Solution
See the cloud provider error information in cluster unexpected termination.

Instances unreachable
Error message:

Cluster terminated. Reason: Instances Unreachable

An unexpected error was encountered while setting up the cluster. Please retry and contact Azure Databricks
if the problem persists. Internal error message: Timeout while placing node

Cause
This error is usually returned by the cloud provider. Typically, it occurs when you have an Azure Databricks
workspace deployed to your own virtual network (VNet) (as opposed to the default VNet created when you
launch a new Azure Databricks workspace). If the virtual network where the workspace is deployed is already
peered or has an ExpressRoute connection to on-premises resources, the virtual network cannot make an ssh
connection to the cluster node when Azure Databricks is attempting to create a cluster.
Solution
Add a user-defined route (UDR) to give the Azure Databricks control plane ssh access to the cluster instances,
Blob Storage instances, and artifact resources. This custom UDR allows outbound connections and does not
interfere with cluster creation. For detailed UDR instructions, see Step 3: Create user-defined routes and
associate them with your Azure Databricks virtual network subnets. For more VNet-related troubleshooting
information, see Troubleshooting.
Cluster fails to start with dummy does not exist
error
7/21/2022 • 2 minutes to read

Problem
You try to start a cluster, but it fails to start. You get an Apache Spark error message.

Internal error message: Spark error: Driver down

You review the cluster driver and worker logs and see an error message containing
java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist .

21/07/14 21:44:06 ERROR DriverDaemon$: XXX Fatal uncaught exception. Terminating driver.
java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1668)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1632)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:511)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:511)
at scala.collection.immutable.List.foreach(List.scala:392)

Cause
You have spark.files dummy set in your Spark Config, but no such file exists.
Spark interprets the dummy configuration value as a valid file path and tries to find it on the local file system. If
the file does not exist, it generates the error message.

java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist

Solution
Option 1: Delete spark.files dummy from your Spark Config if you are not passing actual files to Spark.
Option 2: Create a dummy file and place it on the cluster. You can do this with an init script.
1. Create the init script.

dbutils.fs.put("dbfs:/databricks/<init-script-folder>/create_dummy_file.sh",
"""
#!/bin/bash
touch /databricks/driver/dummy""", True)

2. Install the init script that you just created as a cluster-scoped init script.
You will need the full path to the location of the script (
dbfs:/databricks/<init-script-folder>/create_dummy_file.sh ).
3. Restart the cluster
Restart your cluster after you have installed the init script.
Job fails due to cluster manager core instance
request limit
7/21/2022 • 2 minutes to read

Problem
An Azure Databricks Notebook or Job API returns the following error:

Unexpected failure while creating the cluster for the job. Cause REQUEST_LIMIT_EXCEEDED: Your request was
rejected due to API rate limit. Please retry your request later, or choose a larger node type instead.

Cause
The error indicates the Cluster Manager Service core instance request limit was exceeded.
A Cluster Manager core instance can support a maximum of 1000 requests.

Solution
Contact Azure Databricks Support to increase the limit set in the core instance.
Azure Databricks can increase the job limit maxBurstyUpsizePerOrg up to 2000, and upsizeTokenRefillRatePerMin
up to 120. Current running jobs are affected when the limit is increased.
Increasing these values can stop the throttling issue, but can also cause high CPU utilization.
The best solution for this issue is to replace the Cluster Manager core instance with a larger instance that can
support maximum data transmission rates.
Azure Databricks Support can change the current Cluster Manager instance type to a larger one.
Cannot apply updated cluster policy
7/21/2022 • 2 minutes to read

Problem
You are attempting to update an existing cluster policy, however the update does not apply to the cluster
associated with the policy. If you attempt to edit a cluster that is managed by a policy, the changes are not
applied or saved.

Cause
This is a known issue that is being addressed.

Solution
You can use a workaround until a permanent fix is available.
1. Edit the cluster policy.
2. Re-attribute the policy to Free form .
3. Add the edited policy back to the cluster.
If you want to edit a cluster that is associated with a policy:
1. Terminate the cluster.
2. Associate a different policy to the cluster.
3. Edit the cluster.
4. Re-associate the original policy to the cluster.
Custom Docker image requires root
7/21/2022 • 2 minutes to read

Problem
You are trying to launch an Azure Databricks cluster with a custom Docker container, but cluster creation fails
with an error.

{
"reason": {
"code": "CONTAINER_LAUNCH_FAILURE",
"type": "SERVICE_FAULT",
"parameters": {
"instance_id": "i-xxxxxxx",
"databricks_error_message": "Failed to launch spark container on instance i-xxxx. Exception: Could not add
container for xxxx with address xxxx. Could not mkdir in container"
}
}
}

Cause
Azure Databricks clusters require a root user and sudo .
Custom container images that are configured to start as a non-root user are not supported.
For more information, review the custom container documentation.

Solution
You must configure your Docker container to start as the root user.
Example
This container configuration starts as the standard user ubuntu. It fails to launch.

FROM databricksruntime/standard:8.x
RUN apt-get update -y && apt-get install -y git && \
ln -s /databricks/conda/envs/dcs-minimal/bin/pip /usr/local/bin/pip && \
ln -s /databricks/conda/envs/dcs-minimal/bin/python /usr/local/bin/python
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt .
RUN chown -R ubuntu /app
USER ubuntu

This container configuration starts as the root user. It launches successfully.


FROM databricksruntime/standard:8.x
RUN apt-get update -y && apt-get install -y git && \
ln -s /databricks/conda/envs/dcs-minimal/bin/pip /usr/local/bin/pip && \
ln -s /databricks/conda/envs/dcs-minimal/bin/python /usr/local/bin/python
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt .
Custom garbage collection prevents cluster launch
7/21/2022 • 2 minutes to read

Problem
You are trying to use a custom Apache Spark garbage collection algorithm (other than the default, parallel
garbage collection) on clusters running Databricks Runtime 10.0 and above. When you try to start the cluster, it
fails to start. If the configuration is set on an executor, the executor is immediately terminated.
For example, if you set either of the following custom garbage collection algorithms in your Spark config, the
cluster creation fails.
Spark driver

spark.driver.extraJavaOptions -XX:+UseG1GC

Spark executor

spark.executor.extraJavaOptions -XX:+UseG1GC

Cause
A new Java virtual machine (JVM) flag was introduced to set the garbage collection algorithm to parallel
garbage collection. If you do not change the default garbage collection algorithm, this new flag has no impact.
If you change the garbage collection algorithm by setting spark.executor.extraJavaOptions or
spark.driver.extraJavaOptions in your Spark config , the value conflicts with the new flag. As a result, the JVM
crashes and prevents the cluster from starting.

Solution
To work around this issue, you must explicitly remove the parallel garbage collection flag in your Spark config .
This must be done at the cluster level.

spark.driver.extraJavaOptions -XX:-UseParallelGC -XX:+UseG1GC

spark.executor.extraJavaOptions -XX:-UseParallelGC -XX:+UseG1GC


Enable OpenJSSE and TLS 1.3
7/21/2022 • 2 minutes to read

Queries and transformations are encrypted before being sent to your clusters. By default, the data exchanged
between worker nodes in a cluster is not encrypted.
If you require that data is encrypted at all times, you can encrypt traffic between cluster worker nodes using AES
128 over a TLS 1.2 connection.
In some cases, you may want to use TLS 1.3 instead of TLS 1.2 because it allows for stronger ciphers.
To use TLS 1.3 on your clusters, you must enable OpenJSSE in the cluster’s Apache Spark configuration.
1. Add spark.driver.extraJavaOptions -XX:+UseOpenJSSE to your Spark Config .
2. Restart your cluster.
OpenJSSE and TLS 1.3 are now enabled on your cluster and can be used in notebooks.
Enable retries in init script
7/21/2022 • 2 minutes to read

Init scripts are commonly used to configure Azure Databricks clusters.


There are some scenarios where you may want to implement retries in an init script.

Example init script


This sample init script shows you how to implement a retry for a basic copy operation.
You can use this sample code as a base for implementing retries in your own init script.

dbutils.fs.put("dbfs:/databricks/<path-to-init-script>/retry-example-init.sh", """#!/bin/bash

echo "starting script at `date`"

function fail {
echo $1 >&2
exit 1
}

function retry {
local n=1
local max=5
local delay=5
while true; do
"$@" && break || {
if [[ $n -lt $max ]]; then
((n++))
echo "Command failed. Attempt $n/$max: `date`"
sleep $delay;
else
echo "Collecting additional info for debugging.."
ps aux > /tmp/ps_info.txt
debug_log_file=debug_logs_${HOSTNAME}_$(date +"%Y-%m-%d--%H-%M").zip
zip -r /tmp/${debug_log_file} /var/log/ /tmp/ps_info.txt /databricks/data/logs/
cp /tmp/${debug_log_file} /dbfs/tmp/
fail "The command has failed after $n attempts. `date`"
fi
}
done
}

sleep 15s
echo "starting Copying at `date`"
retry cp -rv /dbfs/libraries/xyz.jar /databricks/jars/

echo "Finished script at `date`"


""", true)
Admin user cannot restart cluster to run job
7/21/2022 • 2 minutes to read

Problem
When a user who has permission to start a cluster, such as an Azure Databricks admin user, submits a job that is
owned by a different user, the job fails with the following message:

Message: Run executed on existing cluster ID <cluster id> failed because of insufficient permissions. The
error received from the cluster manager was: 'You are not authorized to restart this cluster. Please contact
your administrator or the cluster creator.'

Cause
This error can occur when the job owner’s privilege to start the cluster is revoked. In this scenario, the job will
fail even if it is submitted by an Admin user.

Solution
Re-grant the privilege to start the cluster (known as Can Manage ) to the job owner, or change the job owner to a
user or group that has the cluster start privilege. You can change the owner by navigating to your job page in
Jobs , then going to Advanced > Permissions > Edit .
Adding a configuration setting overwrites all default
spark.executor.extraJavaOptions settings
7/21/2022 • 2 minutes to read

Problem
When you add a configuration setting by entering it in the Apache Spark Config text area, the new setting
replaces existing settings instead of being appended.

Version
Databricks Runtime 5.1 and below.

Cause
When the cluster restarts, the cluster reads settings from a configuration file that is created in the Clusters UI,
and overwrites the default settings.
For example, when you add the following extraJavaOptions to the Spark Config text area:

spark.executor.extraJavaOptions -javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.yml

Then, in Spark UI > Environment > Spark Properties under spark.executor.extraJavaOptions , only the
newly added configuration setting shows:

-javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.yml

Any existing settings are removed.


For reference, the default settings are:

-Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing
-Ddatabricks.serviceName=spark-executor-1
-Djava.security.properties=/databricks/spark/dbconf/java/extra.security
-XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Xss4m
-Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
-Djavax.xml.validation.SchemaFactory=https://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory
-Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser
-Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl

Solution
To add a new configuration setting to spark.executor.extraJavaOptions without losing the default settings:
1. In Spark UI > Environment > Spark Properties , select and copy all of the properties set by default for
spark.executor.extraJavaOptions .
2. Click Edit .
3. In the Spark Config text area (Clusters > cluster-name > Advanced Options > Spark ), paste the
default settings.
4. Append the new configuration setting below the default settings.
5. Click outside the text area, then click Confirm .
6. Restart the cluster.
For example, let’s say you paste the following settings into the Spark Config text area. The new configuration
setting is appended to the default settings.

spark.executor.extraJavaOptions = -Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing
-Ddatabricks.serviceName=spark-executor-1
-Djava.security.properties=/databricks/spark/dbconf/java/extra.security
-XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Xss4m
-Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
-Djavax.xml.validation.SchemaFactory:https://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory
-Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser
-Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl
-javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.yml

After you restart the cluster, the default settings and newly added configuration setting appear in Spark UI >
Environment > Spark Properties .
Enable GCM cipher suites
7/21/2022 • 2 minutes to read

Azure Databricks clusters do not have GCM (Galois/Counter Mode) cipher suites enabled by default.
You must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher
suites.

Verify required cipher suites


Use the nmap utility to verify which cipher suites are required by the external server.

nmap --script ssl-enum-ciphers -p <port> <hostname>

NOTE
If nmap is not installed, run sudo apt-get install -y nmap to install it on your cluster.

Create an init script to enable GCM cipher suites


Use the example code to create an init script that enables GCM cipher suites on your cluster.

dbutils.fs.put("/<path-to-init-script>/enable-gcm.sh", """#!/bin/bash
sed -i 's/, GCM//g' /databricks/spark/dbconf/java/extra.security
""",True)

dbutils.fs.put("/<path-to-init-script>/enable-gcm.sh", """#!/bin/bash
sed -i 's/, GCM//g' /databricks/spark/dbconf/java/extra.security
""",true)

Remember the path to the init script. You will need it when configuring your cluster.

Configure cluster with init script


Follow the documentation to configure a cluster-scoped init script.
You must specify the path to the init script.
After configuring the init script, restart the cluster.

Verify that GCM cipher suites are enabled


This example code queries the cluster for all supported cipher suites and then prints the output.
import java.util.Map;
import java.util.TreeMap;
import javax.net.ssl.SSLServerSocketFactory
import javax.net.ssl._
SSLContext.getDefault.getDefaultSSLParameters.getProtocols.foreach(println)
SSLContext.getDefault.getDefaultSSLParameters.getCipherSuites.foreach(println)

If the GCM cipher suites are enabled, you will see the following AES-GCM ciphers listed in the output.

TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
TLS_RSA_WITH_AES_256_GCM_SHA384
TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384
TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384
TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
TLS_DHE_DSS_WITH_AES_256_GCM_SHA384
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256
TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256
TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
TLS_DHE_DSS_WITH_AES_128_GCM_SHA256

Connect to the external server


Once you have verified that GCM cipher suites are installed on your cluster, make a connection to the external
server.
Failed to create cluster with invalid tag value
7/21/2022 • 2 minutes to read

Problem
You are trying to create a cluster, but it is failing with an invalid tag value error message.

System.Exception: Content={"error_code":"INVALID_PARAMETER_VALUE","message":"\nInvalid tag value (<<<<TAG-VALUE>>>>) - the length cannot exceed 256\nUnicode characters in UTF-8.\n "}

Cause
Limitations on tag Key and Value are set by Azure.
Azure tag keys must:
Contain 1-512 characters
Contain letters, numbers, spaces (except < > * % & : \ ? / + )
Not start with azure , microsoft , or windows
Not duplicate an existing key
Azure tag values must:
Contain 1-256 characters
Contain letters, numbers, spaces (except < > * % & : \ ? / + )
Not start with azure , microsoft , or windows
For more information, please refer to the Azure tag resource limitations documentation.

Solution
Requests to update any limits on tagging must be made directly with the Azure support team.
Install a private PyPI repo
7/21/2022 • 2 minutes to read

Certain use cases may require you to install libraries from private PyPI repositories.
If you are installing from a public repository, you should review the library documentation.
This article shows you how to configure an example init script that authenticates and downloads a PyPI library
from a private repository.

Create init script


1. Create (or verify) a directory to store the init script.
<init-script-folder> is the name of the folder where you store your init scripts.

dbutils.fs.mkdirs("dbfs:/databricks/<init-script-folder>/")

2. Create the init script.

dbutils.fs.put("/databricks/<init-script-folder>/private-pypi-install.sh","""
#!/bin/bash
/databricks/python/bin/pip install --index-url=https://${<repo-username>}:${<repo-password>}@<private-pypi-repo-domain-name> private-package==<version>
""", True)

3. Verify that your init script exists.

display(dbutils.fs.ls("dbfs:/databricks/<init-script-folder>/private-pypi-install.sh"))

Install as a cluster-scoped init script


Install the init script that you just created as a cluster-scoped init script.
You will need the full path to the location of the script (
dbfs:/databricks/<init-script-folder>/private-pypi-install.sh ).

Restart the cluster


Restart your cluster after you have installed the init script.
Once the cluster starts up, verify that it successfully installed the custom library from the private PyPI repository.
If the custom library is not installed, double check the username and password that you set for the private PyPI
repository in the init script.

Use the init script with a job cluster


Once you have created the init script and verified that it works, you can include it in a create-job.json file
when using the Jobs API to start a job cluster.
{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "<spark-version>",
  "node_type_id": "<node-type>",
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [
    {
      "dbfs": {
        "destination": "dbfs:/databricks/<init-script-folder>/private-pypi-install.sh"
      }
    }
  ]
}
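As one way to send that file, the following Python sketch posts it to the Jobs API with the requests library (assumptions: create-job.json has been expanded into a complete job definition accepted by your Jobs API version, and the hypothetical WORKSPACE_URL and DATABRICKS_TOKEN environment variables hold your workspace URL and a personal access token):

import json
import os

import requests

# Load the job definition that references the init script above.
with open("create-job.json") as f:
    job_spec = json.load(f)

response = requests.post(
    f"{os.environ['WORKSPACE_URL']}/api/2.1/jobs/create",  # adjust the API version to match your workspace
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},  # hypothetical environment variables
    json=job_spec,
)
response.raise_for_status()
print(response.json())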
IP access list update returns INVALID_STATE
7/21/2022 • 2 minutes to read

Problem
You are trying to update an IP access list and you get an INVALID_STATE error message.

{"error_code":"INVALID_STATE","message":"Your current IP 3.3.3.3 will not be allowed to access the workspace


under current configuration"}

Cause
The IP access list update that you are trying to commit does not include your current public IP address. If your
current IP address is not included in the access list, you are blocked from the environment.
If you assume that your current IP is 3.3.3.3, this example API call results in an INVALID_STATE error message.

curl -X POST -n \
  https://<databricks-instance>/api/2.0/ip-access-lists \
  -d '{
    "label": "office",
    "list_type": "ALLOW",
    "ip_addresses": [
      "1.1.1.1",
      "2.2.2.2/21"
    ]
  }'
Solution
You must always include your current public IP address in the JSON file that is used to update the IP access list.
If you assume that your current IP is 3.3.3.3, this example API call results in a successful IP access list update.

curl -X POST -n \
  https://<databricks-instance>/api/2.0/ip-access-lists \
  -d '{
    "label": "office",
    "list_type": "ALLOW",
    "ip_addresses": [
      "1.1.1.1",
      "2.2.2.2/21",
      "3.3.3.3"
    ]
  }'
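To avoid locking yourself out, you can look up your current public IP address programmatically and include it in the request. A minimal Python sketch (assumptions: an external service such as https://ifconfig.me is reachable, and the hypothetical DATABRICKS_INSTANCE and DATABRICKS_TOKEN environment variables hold your workspace host and a personal access token):

import os

import requests

# Discover the public IP this machine is using (assumes ifconfig.me is reachable).
my_ip = requests.get("https://ifconfig.me/ip").text.strip()

payload = {
    "label": "office",
    "list_type": "ALLOW",
    "ip_addresses": ["1.1.1.1", "2.2.2.2/21", my_ip],  # always include your own IP
}

response = requests.post(
    f"https://{os.environ['DATABRICKS_INSTANCE']}/api/2.0/ip-access-lists",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},  # hypothetical environment variables
    json=payload,
)
response.raise_for_status()
print(response.json())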
How to overwrite log4j configurations on Azure
Databricks clusters
7/21/2022 • 2 minutes to read

IMPORTANT
This article describes steps related to customer use of Log4j 1.x within an Azure Databricks cluster. Log4j 1.x is no longer
maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one
of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilities. You
should not enable either of these classes in your cluster.

There is no standard way to overwrite log4j configurations on clusters with custom configurations. You must
overwrite the configuration files using init scripts.
The current configurations are stored in two log4j.properties files:
On the driver:

%sh
cat /home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties

On the worker:

%sh
cat /home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties

To set class-specific logging on the driver or on workers, use the following script:

#!/bin/bash
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
echo "log4j.<custom-prop>=<value>" >> ${LOG4J_PATH}

Replace <custom-prop> with the property name, and <value> with the property value.
Upload the script to DBFS and attach it to a cluster using the cluster configuration UI.
You can also set log4j.properties for the driver in the same way.
See Cluster Node Initialization Scripts for more information.
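If you prefer to create the script from a notebook, you can write it to DBFS with dbutils.fs.put (a sketch; the file name set-log4j.sh and the folder are hypothetical, and the body is the script shown above):

dbutils.fs.put("dbfs:/databricks/<init-script-folder>/set-log4j.sh", """
#!/bin/bash
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
echo "log4j.<custom-prop>=<value>" >> ${LOG4J_PATH}
""", True)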
Persist Apache Spark CSV metrics to a DBFS
location
7/21/2022 • 2 minutes to read

Spark has a configurable metrics system that supports a number of sinks, including CSV files.
In this article, we are going to show you how to configure an Azure Databricks cluster to use a CSV sink and
persist those metrics to a DBFS location.

Create an init script


All of the configuration is done in an init script.
The init script does the following three things:
1. Configures the cluster to generate CSV metrics on both the driver and the worker.
2. Writes the CSV metrics to a temporary, local folder.
3. Uploads the CSV metrics from the temporary, local folder to the chosen DBFS location.

NOTE
The CSV metrics are saved locally before being uploaded to the DBFS location because DBFS is not designed for a large
number of random writes.

Customize the sample code and then run it in a notebook to create an init script on your cluster.
Sample code to create an init script:
dbutils.fs.put("/<init-path>/metrics.sh","""
#!/bin/bash
mkdir /tmp/csv
sudo bash -c "cat <<EOF >> /databricks/spark/dbconf/log4j/master-worker/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
master.source.jvm.class org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"

sudo bash -c "cat <<EOF >> /databricks/spark/conf/metrics.properties


*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
driver.source.jvm.class org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"

cat <<'EOF' >> /tmp/asynccode.sh


#!/bin/bash
DB_CLUSTER_ID=$(echo $HOSTNAME | awk -F '-' '{print$1"-"$2"-"$3}')
MYIP=$(hostname -I)
if [[ ! -d /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP} ]] ; then
sudo mkdir -p /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP}
fi
while true; do
if [ -d "/tmp/csv" ]; then
sudo cp -r /tmp/csv/* /dbfs/<metrics-path>/$DB_CLUSTER_ID/metrics-$MYIP
fi
sleep 5
done
EOF
chmod a+x /tmp/asynccode.sh
/tmp/asynccode.sh & disown
""", True)

Replace <init-path> with the DBFS location you want to use to save the init script.
Replace <metrics-path> with the DBFS location you want to use to save the CSV metrics.

Cluster-scoped init script


Once you have created the init script on your cluster, you must configure it as a cluster-scoped init script.

Verify that CSV metrics are correctly written


Restart your cluster and run a sample job.
Check the DBFS location that you configured for CSV metrics and verify that they were correctly written.
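For example, from a notebook (a sketch; <metrics-path> is the placeholder DBFS location used in the init script above):

display(dbutils.fs.ls("dbfs:/<metrics-path>/"))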
Replay Apache Spark events in a cluster
7/21/2022 • 2 minutes to read

The Spark UI is commonly used as a debugging tool for Spark jobs.


If the Spark UI is inaccessible, you can load the event logs in another cluster and use the Event Log Replay
notebook to replay the Spark events.

IMPORTANT
Cluster log delivery is not enabled by default. You must enable cluster log delivery before starting your cluster, otherwise
there will be no logs to replay.

Enable cluster log delivery


Follow the documentation to configure Cluster log delivery on your cluster.
The location of the cluster logs depends on the Cluster Log Path that you set during cluster configuration.
For example, if the log path is dbfs:/cluster-logs, the log files for a specific cluster are stored in
dbfs:/cluster-logs/<cluster-name> and the individual event logs are stored in
dbfs:/cluster-logs/<cluster-name>/eventlog/<cluster-name-cluster-ip>/<log-id>/.

Confirm cluster logs exist


Review the cluster log path and verify that logs are being written for your chosen cluster. Log files are written
every five minutes.

Launch a single node cluster


Launch a single node cluster. You will replay the logs on this cluster.
Select the instance type based on the size of the event logs that you want to replay.

Run the Event Log Replay notebook


Attach the Event Log Replay notebook to the single node cluster.
Enter the path to your chosen cluster event logs in the event_log_path field in the notebook.
Run the notebook.
Event Log Replay notebook
Get notebook

Prevent items getting dropped from the UI


If you have a long-running cluster, it is possible for some jobs and/or stages to get dropped from the Spark UI.
This happens due to default UI limits that are intended to prevent the UI from using up too much memory and
causing an out-of-memory error on the cluster.
If you are using a single node cluster to replay the event logs, you can increase the default UI limits and devote
more memory to the Spark UI. This prevents items from getting dropped.
You can adjust these values during cluster creation by editing the Spark Config.
This example contains the default values for these properties.

spark.ui.retainedJobs 1000
spark.ui.retainedStages 1000
spark.ui.retainedTasks 100000
spark.sql.ui.retainedExecutions 1000
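To confirm which values a running cluster is using, you can read them back from a notebook (a sketch; the second argument to spark.conf.get is a fallback returned when a key was never explicitly set):

for key, default in [
    ("spark.ui.retainedJobs", "1000"),
    ("spark.ui.retainedStages", "1000"),
    ("spark.ui.retainedTasks", "100000"),
    ("spark.sql.ui.retainedExecutions", "1000"),
]:
    print(key, spark.conf.get(key, default))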
Set Apache Hadoop core-site.xml properties
7/21/2022 • 2 minutes to read

You have a scenario that requires Apache Hadoop properties to be set.


You would normally do this in the core-site.xml file.
In this article, we explain how you can set core-site.xml in a cluster.

Create the core-site.xml file in DBFS


You need to create a core-site.xml file and save it to DBFS on your cluster.
An easy way to create this file is via a bash script in a notebook.
This example code creates a hadoop-configs folder on your cluster and then writes a single property
core-site.xml file to that folder.

%sh
mkdir -p /dbfs/hadoop-configs/
cat << 'EOF' > /dbfs/hadoop-configs/core-site.xml
<property>
  <name><property-name-here></name>
  <value><property-value-here></value>
</property>
EOF

You can add multiple properties to the file by adding additional name/value pairs to the script.
You can also create this file locally, and then upload it to your cluster.

Create an init script that loads core-site.xml


This example code creates an init script called set-core-site-configs.sh that uses the core-site.xml file you
just created.
If you manually uploaded a core-site.xml file and stored it elsewhere, you should update the config_xml value
in the example code.
dbutils.fs.put("/databricks/scripts/set-core-site-configs.sh", """
#!/bin/bash

echo "Setting core-site.xml configs at `date`"

START_DRIVER_SCRIPT=/databricks/spark/scripts/start_driver.sh
START_WORKER_SCRIPT=/databricks/spark/scripts/start_spark_slave.sh

TMP_DRIVER_SCRIPT=/tmp/start_driver_temp.sh
TMP_WORKER_SCRIPT=/tmp/start_spark_slave_temp.sh

TMP_SCRIPT=/tmp/set_core-site_configs.sh

config_xml="/dbfs/hadoop-configs/core-site.xml"

cat >"$TMP_SCRIPT" <<EOL


#!/bin/bash
## Setting core-site.xml configs

sed -i '/<\/configuration>/{
r $config_xml
a \</configuration>
d
}' /databricks/spark/dbconf/hadoop/core-site.xml

EOL
cat "$TMP_SCRIPT" > "$TMP_DRIVER_SCRIPT"
cat "$TMP_SCRIPT" > "$TMP_WORKER_SCRIPT"

cat "$START_DRIVER_SCRIPT" >> "$TMP_DRIVER_SCRIPT"


mv "$TMP_DRIVER_SCRIPT" "$START_DRIVER_SCRIPT"

cat "$START_WORKER_SCRIPT" >> "$TMP_WORKER_SCRIPT"


mv "$TMP_WORKER_SCRIPT" "$START_WORKER_SCRIPT"

echo "Completed core-site.xml config changes `date`"

""", True)

Attach the init script to your cluster


You need to configure the newly created init script as a cluster-scoped init script.
If you used the example code, your Destination is DBFS and the Init Script Path is
dbfs:/databricks/scripts/set-core-site-configs.sh .
If you customized the example code, ensure that you enter the correct path and name of the init script when you
attach it to the cluster.
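After the cluster restarts, a quick way to confirm that your properties were injected is to print the file from a notebook (a sketch; this reads the driver node's copy of the file):

with open("/databricks/spark/dbconf/hadoop/core-site.xml") as f:
    print(f.read())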
Set executor log level
7/21/2022 • 2 minutes to read

IMPORTANT
This article describes steps related to customer use of Log4j 1.x within an Azure Databricks cluster. Log4j 1.x is no longer
maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one
of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilities.

To set the log level on all executors, you must set it inside the JVM on each worker.
For example:

sc.parallelize(Seq("")).foreachPartition(x => {
import org.apache.log4j.{LogManager, Level}
import org.apache.commons.logging.LogFactory

LogManager.getRootLogger().setLevel(Level.DEBUG)
val log = LogFactory.getLog("EXECUTOR-LOG:")
log.debug("START EXECUTOR DEBUG LOG LEVEL")
})

To verify that the level is set, navigate to the Spark UI, select the Executors tab, and open the stderr log for any
executor.
Unexpected cluster termination
7/21/2022 • 2 minutes to read

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured
automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Azure
Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for
remediation.

Azure Databricks initiated request limit exceeded


To defend against API abuses, ensure quality of service, and prevent you from accidentally creating too many
large clusters, Azure Databricks throttles all cluster up-sizing requests, including cluster creation, starting, and
resizing. The throttling uses the token bucket algorithm to limit the total number of nodes that anyone can
launch over a defined interval across your Databricks deployment, while allowing burst requests of certain sizes.
Requests coming from both the web UI and the APIs are subject to rate limiting. When cluster requests exceed
rate limits, the limit-exceeding request fails with a REQUEST_LIMIT_EXCEEDED error.
Solution
If you hit the limit for your legitimate workflow, Databricks recommends that you do the following:
Retry your request a few minutes later.
Spread out your recurring workflow evenly in the planned time frame. For example, instead of scheduling all
of your jobs to run at an hourly boundary, try distributing them at different intervals within the hour.
Consider using clusters with a larger node type and smaller number of nodes.
Use autoscaling clusters.
If these options don’t work for you, contact Azure Databricks Support to request a limit increase for the core
instance.
For other Azure Databricks initiated termination reasons, see Termination Code.

Cloud provider initiated terminations


This article lists common cloud provider related termination reasons and remediation steps.
Launch failure
This termination reason occurs when Azure Databricks fails to acquire virtual machines. The error code and
message from the API are propagated to help you troubleshoot the issue.
OperationNotAllowed
You have reached a quota limit, usually number of cores, that your subscription can launch. Request a limit
increase in Azure portal. See Azure subscription and service limits, quotas, and constraints.
PublicIPCountLimitReached
You have reached the limit of the public IPs that you can have running. Request a limit increase in Azure Portal.
SkuNotAvailable
The resource SKU you have selected (such as VM size) is not available for the location you have selected. To
resolve, see Resolve errors for SKU not available.
ReadOnlyDisabledSubscription
Your subscription was disabled. Follow the steps in Why is my Azure subscription disabled and how do I
reactivate it? to reactivate your subscription.
ResourceGroupBeingDeleted
Can occur if someone cancels your Azure Databricks workspace in the Azure portal and you try to create a
cluster at the same time. The cluster fails because the resource group is being deleted.
SubscriptionRequestsThrottled
Your subscription is hitting the Azure Resource Manager request limit (see Throttling Resource Manager
requests). A typical cause is another system outside Azure Databricks making a large number of API calls to
Azure. Contact Azure support to identify this system and then reduce the number of API calls.
Communication lost
Azure Databricks was able to launch the cluster, but lost the connection to the instance hosting the
Spark driver.
Caused by the driver virtual machine going down or a networking issue.
UnknownHostException on cluster launch
7/21/2022 • 2 minutes to read

Problem
When you launch an Azure Databricks cluster, you get an UnknownHostException error.
You may also get one of the following error messages:
Error: There was an error in the network configuration. databricks_error_message: Could not access worker
artifacts.
Error: Temporary failure in name resolution.
Internal error message: Failed to launch spark container on instance XXX. Exception: Could not add
container for XXX with address X.X.X.X.mysql.database.azure.com: Temporary failure in name resolution.

Cause
These errors indicate an issue with DNS settings.
Primary DNS could be down or unresponsive.
Artifacts are not being resolved, which results in the cluster launch failure.
You may have a host record listing the artifact public IP as static, but it has changed.

Solution
Identify a working DNS server and update the DNS entry on the cluster.
1. Start a standalone Azure VM and verify that the artifacts blob storage account is reachable from the
instance.

telnet dbartifactsprodeastus.blob.core.windows.net 443

2. Verify that you can reach your primary DNS server from a notebook by running a ping command.
3. If your DNS server is not responding, try to reach your secondary DNS server from a notebook by
running a ping command.
4. Launch a Web Terminal from the cluster workspace.
5. Edit the /etc/resolv.conf file on the cluster.
6. Update the nameserver value with your working DNS server.
7. Save the changes to the file.
8. Restart systemd-resolved .

$ sudo systemctl restart systemd-resolved.service


NOTE
This is a temporary change to the DNS and will be lost on cluster restart. After verifying that the custom DNS settings are
correct, you can configure custom DNS settings using dnsmasq to make the change permanent.
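You can also check name resolution directly from a notebook with a few lines of Python (a sketch; the hostname is the example artifact storage account from step 1, and the lookup uses the DNS configuration of the driver node):

import socket

try:
    print(socket.gethostbyname("dbartifactsprodeastus.blob.core.windows.net"))
except socket.gaierror as err:
    print(f"DNS resolution failed: {err}")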

Further troubleshooting
If you are still having DNS issues, you should try the following steps:
Verify that port 43 (used for whois) and port 53 (used for DNS) are open in your firewall.
Add the Azure recursive resolver (168.63.129.16) to the default DNS forwarder. Review the VMs and role
instances documentation for more information.
Verify that nslookup results are identical between your laptop and the default DNS. If there is a mismatch,
your DNS server may have an incorrect host record.
Verify that everything works with a default Azure DNS server. If it works with Azure DNS, but fails with your
custom DNS, your DNS admin should review your DNS server settings.
Apache Spark executor memory allocation
7/21/2022 • 2 minutes to read

By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM)
memory heap. This is controlled by the spark.executor.memory property.

However, some unexpected behaviors were observed on instances with a large amount of memory allocated. As
JVMs scale up in memory size, issues with the garbage collector become apparent. These issues can be resolved
by limiting the amount of memory under garbage collector management.
Selected Azure Databricks cluster types enable the off-heap mode, which limits the amount of memory under
garbage collector management. This is why certain Spark clusters have the spark.executor.memory value set to a
fraction of the overall cluster memory.

The off-heap mode is controlled by the properties spark.memory.offHeap.enabled and
spark.memory.offHeap.size, which are available in Spark 1.6.0 and above.

The following Azure Databricks cluster types enable the off-heap memory policy:
Standard_L8s_v2
Standard_L16s_v2
Standard_L32s_v2
Standard_L80s_v2
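You can confirm how a given cluster is configured from a notebook (a sketch; the second argument to spark.conf.get is a fallback returned when a key is not explicitly set):

print(spark.conf.get("spark.memory.offHeap.enabled", "false"))
print(spark.conf.get("spark.memory.offHeap.size", "0"))
print(spark.conf.get("spark.executor.memory", "<not set>"))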
Apache Spark UI shows less than total node
memory
7/21/2022 • 2 minutes to read

Problem
The Executors tab in the Spark UI shows less memory than is actually available on the node:
An F8s instance (16 GB, 4 cores) used for the driver node shows 4.5 GB of memory on the Executors tab.
An F4s instance (8 GB, 4 cores) used for the driver node shows about 710 MB of memory on the Executors tab.

Cause
The total amount of memory shown is less than the memory on the cluster because some memory is occupied
by kernel and node-level services.

Solution
To calculate the available amount of memory, you can use the formula used for executor memory allocation
(all_memory_size * 0.97 - 4800MB) * 0.8, where:
0.97 accounts for kernel overhead.
4800 MB accounts for internal node-level services (node daemon, log daemon, and so on).
0.8 is a heuristic to ensure the LXC container running the Spark process doesn't crash due to out-of-memory
errors.
Total available memory for storage on an F4s instance is (8192MB * 0.97 - 4800MB) * 0.8 - 1024 = 1.2 GB.
Because the parameter spark.memory.fraction is by default 0.6, approximately (1.2 * 0.6) = ~710 MB is
available for storage.
You can change the spark.memory.fraction Spark configuration to adjust this parameter. Calculate the available
memory for a new parameter as follows:
1. If you use an F4s instance, which has 8192 MB memory, it has available memory 1.2 GB.
2. If you specify a spark.memory.fraction of 0.8, the Executors tab in the Spark UI should show: (1.2 * 0.8)
GB = ~960 MB.
CPU core limit prevents cluster creation
7/21/2022 • 2 minutes to read

Problem
Cluster creation fails with a message about a cloud provider error when you hover over cluster state.

Cloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster.

When you view the cluster event log to get more details, you see a message about core quota limits.

Operation results in exceeding quota limits of Core. Maximum allowed: 350, Current in use: 350, Additional
requested: 4.

Cause
Azure subscriptions have a CPU core quota limit which restricts the number of CPU cores you can use. This is a
hard limit. If you try to start a cluster that would result in your account exceeding the CPU core quota, the
cluster launch fails.

Solution
You can either free up resources or request a quota increase for your account.
Stop inactive clusters to free up CPU cores for use.
Open an Azure support case with a request to increase the CPU core quota limit for your subscription.
IP address limit prevents cluster creation
7/21/2022 • 2 minutes to read

Problem
Cluster creation fails with a message about a cloud provider error when you hover over cluster state.

Cloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster.

When you view the cluster event log to get more details, you see a message about publicIPAddresses limits.

ResourceQuotaExceeded Azure error message: Creating the resource of type
'Microsoft.Network/publicIPAddresses' would exceed the quota of '800' resources of type
'Microsoft.Network/publicIPAddresses' per resource group. The current resource count is '800', please delete
some resources of this type before creating a new one.

Cause
Azure subscriptions have a public IP address limit which restricts the number of public IP addresses you can use.
This is a hard limit. If you try to start a cluster that would result in your account exceeding the public IP
address quota, the cluster launch fails.

Solution
You can either free up resources or request a quota increase for your account.
Stop inactive clusters to free up public IP addresses for use.
Open an Azure support case with a request to increase the public IP address quota limit for your
subscription.
Slow cluster launch and missing nodes
7/21/2022 • 2 minutes to read

Problem
A cluster takes a long time to launch and displays an error message similar to the following:

Cluster is running but X nodes could not be acquired

Cause
Provisioning an Azure VM typically takes 2-4 minutes, but if all the VMs in a cluster cannot be provisioned at the
same time, cluster creation can be delayed. This is due to Azure Databricks having to reissue VM creation
requests over a period of time.

Solution
If a cluster launches without all of the nodes, Azure Databricks automatically tries to acquire the additional
nodes and updates the cluster once they are available.
To work around slow launches, configure the cluster with a larger instance type and a smaller number of nodes.
Cluster slowdown due to Ganglia metrics filling root
partition
7/21/2022 • 2 minutes to read

NOTE
This article applies to Databricks Runtime 7.3 LTS and below.

Problem
Clusters start slowing down and may show a combination of the following symptoms:
1. Unhealthy cluster events are reported:
Request timed out. Driver is temporarily unavailable.
Metastore is down.
DBFS is down.
2. You do not see any high GC events or memory utilization associated with the driver process.
3. When you use top on the driver node you see an intermittent high average load.
4. The Ganglia related gmetad process shows intermittent high CPU utilization.
5. The root disk shows high disk usage with df -h / . Specifically, /var/lib/ganglia/rrds shows high disk
usage.
6. The Ganglia UI is unable to show the load distribution.
You can verify the issue by looking for files with local in the prefix in /var/lib/ganglia/rrds . Generally, this
directory should only have files prefixed with application-<applicationId> .
For example:

%sh ls -ltrhR /var/lib/ganglia/rrds/ | grep -i local

-rw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453624916.driver.Databricks.directoryCommit.markerReadErrors.count.rrd
-rw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453614595.driver.Databricks.directoryCommit.deletedFilesFiltered.count.rrd
-rw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453614595.driver.Databricks.directoryCommit.autoVacuumCount.count.rrd
-rw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453605184.driver.CodeGenerator.generatedMethodSize.min.rrd

Cause
Ganglia metrics typically use less than 10GB of disk space. However, under certain circumstances, a “data
explosion” can occur, which causes the root partition to fill with Ganglia metrics. Data explosions also create a
dirty cache. When this happens, the Ganglia metrics can consume more than 100GB of disk space on root.
This “data explosion” can happen if you define the spark session variable as global in your Python file and then
call functions defined in the same file to perform Apache Spark transformation on data. When this happens, the
Spark session logic can be serialized, along with the required function definition, resulting in a Spark session
being created on the worker node.
For example, take the following Spark session definition:
from pyspark.sql import SparkSession

def get_spark():
    """Returns a spark session."""
    return SparkSession.builder.getOrCreate()

if "spark" not in globals():
    spark = get_spark()

def generator(partition):
    print(globals()['spark'])
    for row in partition:
        yield [word.lower() for word in row["value"]]

If you use the following example commands, local prefixed files are created:

from repro import ganglia_test

df = spark.createDataFrame([(["Hello"], ), (["Spark"], )], ["value"])
df.rdd.mapPartitions(ganglia_test.generator).toDF(["value"]).show()

The print(globals()['spark']) statement in the generator() function doesn’t result in an error, because it is
available as a global variable in the worker nodes. It may fail with an invalid key error in some cases, as that
value is not available as a global variable. Streaming jobs that execute on short batch intervals are susceptible to
this issue.

Solution
Ensure that you are not using SparkSession.builder.getOrCreate() to define a Spark session as a global variable.
When you troubleshoot, you can use the timestamps on files with the local prefix to help determine when a
problematic change was first introduced.
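As a sketch of the safer pattern, keep the session on the driver and let the worker-side function operate only on its partition data (this mirrors the generator() function and column name used above; in a Databricks notebook the spark variable already exists and is shown here only for completeness):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # driver only; never referenced inside generator()

def generator(partition):
    # No globals()['spark'] lookup here, so nothing drags a Spark session onto the workers.
    for row in partition:
        yield [word.lower() for word in row["value"]]

df = spark.createDataFrame([(["Hello"],), (["Spark"],)], ["value"])
df.rdd.mapPartitions(generator).toDF(["value"]).show()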
Configure a cluster to use a custom NTP server
7/21/2022 • 2 minutes to read

By default, Azure Databricks clusters use public NTP servers. This is sufficient for most use cases; however, you
can configure a cluster to use a custom NTP server. This does not have to be a public NTP server. It can be a
private NTP server under your control. A common use case is to minimize the amount of Internet traffic from
your cluster.

Update the NTP configuration on a cluster


1. Create a ntp.conf file with the following information:

# NTP configuration
server <ntp-server-hostname> iburst

where <ntp-server-hostname> is an NTP server hostname or IP address.


If you have multiple NTP servers to list, add them all to the file. Each server should be listed on its own
line.
2. Upload the ntp.conf file to /dbfs/databricks/init_scripts/ on your cluster.
3. Create the script ntp.sh on your cluster:

dbutils.fs.put("/databricks/init_scripts/ntp.sh","""
#!/bin/bash
echo "<ntp-server-ip> <ntp-server-hostname>" >> /etc/hosts
cp /dbfs/databricks/init_scripts/ntp.conf /etc/
sudo service ntp restart""",True)

4. Confirm that the script exists:

display(dbutils.fs.ls("dbfs:/databricks/init_scripts/ntp.sh"))

5. Click Clusters, click your cluster name, click Edit, click Advanced Options, then click Init Scripts.
6. Select DBFS under Destination.
7. Enter the full path to ntp.sh and click Add.
8. Click Confirm and Restart. A confirmation dialog box appears. Click Confirm and wait for the cluster
to restart.

Verify the cluster is using the updated NTP configuration


1. Run the following code in a notebook:

%sh ntpq -p

2. The output displays the NTP servers that are in use.


SSH to the cluster driver node
7/21/2022 • 2 minutes to read

This article explains how to use SSH to connect to an Apache Spark driver node for advanced troubleshooting
and installing custom software.

IMPORTANT
You can only use SSH if your workspace is deployed in an Azure Virtual Network (VNet) under your control. If your
workspace is NOT VNet injected, the SSH option will not appear.

Configure an Azure network security group


The network security group associated with your VNet must allow SSH traffic. The default port for SSH is 2200.
If you are using a custom port, you should make note of it before proceeding. You also have to identify a traffic
source. This can be a single IP address, or it can be an IP range that represents your entire office.
1. In the Azure portal, find the network security group. The network security group name can be found in
the public subnet.
2. Edit the inbound security rules to allow connections to the SSH port. In this example, we are using the
default port.
NOTE
Make sure that your computer and office firewall rules allow you to send TCP traffic on the port you are using for SSH. If
the SSH port is blocked at your computer or office firewall, you cannot connect to the Azure VNet via SSH.

Generate SSH key pair


1. Open a local terminal.
2. Create an SSH key pair by running this command:
ssh-keygen -t rsa -b 4096 -C "<comment-or-email>"

NOTE
You must provide the path to the directory where you want to save the public and private key. The public key is saved
with the extension .pub.

Configure a new cluster with your public key


1. Copy the ENTIRE contents of the public key file.
2. Open the cluster configuration page.
3. Click Advanced Options .
4. Click the SSH tab.
5. Paste the ENTIRE contents of the public key into the Public key field.

6. Continue with cluster configuration as normal.

Configure an existing cluster with your public key


If you have an existing cluster and did not provide the public key during cluster creation, you can inject the
public key from a notebook.
1. Open any notebook that is attached to the cluster.
2. Copy the following code into the notebook, updating it with your public key as noted:

val publicKey = "<put your public key here>"

def addAuthorizedPublicKey(key: String): Unit = {
  val fw = new java.io.FileWriter("/home/ubuntu/.ssh/authorized_keys", /* append */ true)
  fw.write("\n" + key)
  fw.close()
}

addAuthorizedPublicKey(publicKey)

3. Run the code block to inject the public key.

SSH into the Spark driver


1. Open the cluster configuration page.
2. Click Advanced Options .
3. Click the SSH tab.
4. Note the Driver Hostname .
5. Open a local terminal.
6. Run the following command, replacing the hostname and private key file path:
ssh ubuntu@<hostname> -p 2200 -i <private-key-file-path>
Validate environment variable behavior
7/21/2022 • 2 minutes to read

In November 2021, the way environment variables are interpreted when creating, editing, or updating clusters
was changed in some workspaces.
This change will be reverted on December 3, 2021 from 01:00-03:00 UTC.
After the change is reverted, environment variables will behave as they did before the change.
This article explains how to validate the environment variable behavior on your cluster.

Behavior change examples


Use case: Escape special characters ($, `, ", )
"New" input: var="
Original input: var=\"
Expected value: "

Use case: Use $ to access other vars
"New" input: Not supported
Original input: otherVar=1 var=$otherVar2
Expected value: 12

Use case: Use ' or escape '
"New" input: var=te'\''st or var=te'"'"'st
Original input: var=te'st
Expected value: te'st

NOTE
The “new” input behavior will no longer work after the change has been reverted.

Check the environment
